Thanks to:
Dr. Tom Blinn
Alan Davis
for responding.
It turned out that, despite the apparent similarity between the two
crash dump files, the cause was not the same. Analysis of the contents
of the EI_STAT register showed the difference:
Nov. '97 case:
EI_STAT = FFFFFFF014FFFFFF decodes as follows:
EI_STAT<27:24> = 4 CHIP_ID (Alpha 21264 Pass 4)
EI_STAT<28:28> = 1 Bcache Tag Parity Error
Our case (analysis by a colleague of Tom Blinn):
EI_STAT = FFFFFFF005FFFFFF decodes as follows:
EI_STAT<27:24> = 5 CHIP_ID (Alpha 21264A Pass 2)
In this case, no errors are reported. The contents of this register
simply indicate a different revision of the processor. The
conclusion that the problem is the same in both cases is wrong.
The first case clearly points to a problem with the Bcache; the
second case shows no problems whatsoever with the external
interface. EI_STAT<35:28> would have to be non-zero to indicate
a problem with the external interface.
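
For anyone who wants to check their own crash-data file, the decoding
above can be reproduced with a few lines of Python. This is only a
sketch: the bit positions (CHIP_ID in <27:24>, error bits in <35:28>,
Bcache tag parity error in <28>) are the ones quoted in this message,
and the function and key names are made up for illustration.

```python
# Sketch: extract the EI_STAT fields discussed in this message.
# Bit positions come from the analysis quoted above; the helper and
# key names are hypothetical, chosen only for this example.

def field(value, hi, lo):
    """Return bits <hi:lo> of value as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_ei_stat(value):
    """Split a 64-bit EI_STAT value into the fields of interest."""
    return {
        "chip_id": field(value, 27, 24),      # processor revision code
        "error_bits": field(value, 35, 28),   # non-zero means an error
        "bcache_tag_parity_error": bool(field(value, 28, 28)),
    }

# The two register values as they appear in the crash-data files.
for label, raw in [("Nov. '97 case", "fffffff014ffffff"),
                   ("Our case", "fffffff005ffffff")]:
    print(label, decode_ei_stat(int(raw, 16)))
```

For the Nov. '97 value this gives chip_id 4 with error_bits 1 (the
Bcache tag parity bit set); for our value it gives chip_id 5 with
error_bits 0, which matches the analysis above.
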
The crash data for the second case also shows that the machine check
code was 0x98, processor hard error. This machine check code (0x98) is
a "catch-all" for miscellaneous processor-detected errors. In other
words, some event internal to the processor caused a trap to the
machine check entry point in PALcode. The PALcode handler then parsed
all the internal IPRs in the CPU that contain error state and did not
find any specific error bits set. As a result, it loads machine check
code 0x98 to indicate that an unknown processor hard error was
detected, and the system crashes, since these types of errors are
always fatal.
Since the error was detected internal to the CPU, I would try to swap
out the CPU on this system and see if the problem goes away.
And that is what we will do.
Thanks,
Kevin Tyle <kevin_at_meso.com>
MESO, Inc.
Troy, NY USA
--------------------------------------------------------------------------------------
Original Message:
> Hi Managers,
>
> Over the last two weeks, our Digital AlphaPC 164LX 533 MHz machine
> has crashed three times. After analyzing the binary error log and
> the crash-data files, it looks like the cause is the same in
> each case. In fact, the symptoms look virtually identical to a case
> posted in the archives on Nov. 14, 1997.
>
> In the poster's case, the diagnosis was:
>
> FINAL CONCLUSION: The most probable cause is over heating of the
> Bcache Simms.
>
> To put it simply, in each case I see the following--exactly the same as
> the referenced case, except for one slight(?) difference:
>
> SOME INFO FROM THE CRASH DATA FILES: What you may see if you have this
> problem:
> 1) a line in the crash data file that has:
>    EI_STAT reg = fffffff014ffffff
>
>    (in our case it appears as: EI_STAT reg = fffffff005ffffff)
>
> 2) a panic string of: "Processor Machine Check"
> 3) lines in succession with the following:
> Machine Check Processor Fatal Abort
> Machine Check Code = 98
> Machine Check Code = 98
> Processor detected hard error
>
> My question: can I therefore confidently conclude that the problem is
> the same as what was diagnosed in the previously posted case? Or might
> there be other possibilities, and if so, how might I diagnose them? The
> overheating problem is possible, though the overall room environment
> has not changed in the past year. Perhaps a case fan might not be
> working properly, or could the Bcache SIMMs simply be wearing out (the
> machine is 2.5 years old and does a lot of memory-intensive number
> crunching)?
>
> Thanks,
>
> Kevin Tyle <kevin_at_meso.com>
> MESO, Inc.
> Troy, NY USA
>
>
Received on Mon Feb 07 2000 - 15:10:51 NZDT