SUMMARY: Analyzing hw error on 21164LX from Kevin Tyle on 2000-02-08 (tru64-unix-managers)

From: Kevin Tyle <kevin_at_meso.com>
Date: Mon, 07 Feb 2000 10:09:50 -0500 (EST)

Thanks to:

        Dr. Tom Blinn
        Alan Davis

for responding.

It turned out that, despite the seeming similarity between the two
crash dump files, the cause was not the same. Analysis of the
contents of the EI_STAT register showed the difference:

Nov. '97 case:

        EI_STAT = FFFFFFF014FFFFFF decodes as follows:

        EI_STAT<27:24> = 4 CHIP_ID (Alpha 21164 Pass 4)
        EI_STAT<28:28> = 1 Bcache Tag Parity Error

Our case (analysis by a colleague of Tom Blinn):

        EI_STAT = FFFFFFF005FFFFFF decodes as follows:

        EI_STAT<27:24> = 5 CHIP_ID (Alpha 21164A Pass 2)

        In this case, no errors are reported. The contents of this register
        simply indicate a different revision of the processor. The
        conclusion that the problem is the same in both cases is wrong.
        The first case clearly points to a problem with the Bcache; the
        second case shows no problems whatsoever with the external
        interface. EI_STAT<35:28> would have to be non-zero to indicate
        a problem with the external interface.
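
The decode above can be reproduced with a few shifts and masks. The
following is only a minimal sketch in Python, not part of the original
analysis: the function name decode_ei_stat and the dictionary keys are
hypothetical, and the field layout is taken solely from the decode
quoted above (CHIP_ID in bits <27:24>, error bits in <35:28>, with bit
28 being the Bcache tag parity error). The meaning of the other error
bits is not stated here, so the sketch does not name them.

```python
# Field positions assumed from the decode quoted in this message only.
CHIP_ID_SHIFT = 24      # EI_STAT<27:24>
CHIP_ID_MASK = 0xF
ERROR_SHIFT = 28        # EI_STAT<35:28>
ERROR_MASK = 0xFF
BC_TAG_PARITY_BIT = 0x1  # bit 28, i.e. bit 0 of the error field


def decode_ei_stat(value):
    """Split a 64-bit EI_STAT value into the fields discussed above."""
    chip_id = (value >> CHIP_ID_SHIFT) & CHIP_ID_MASK
    error_bits = (value >> ERROR_SHIFT) & ERROR_MASK
    return {
        "chip_id": chip_id,
        "error_bits": error_bits,
        "bcache_tag_parity_error": bool(error_bits & BC_TAG_PARITY_BIT),
        # Any non-zero value in <35:28> points at the external interface.
        "external_interface_error": error_bits != 0,
    }


if __name__ == "__main__":
    cases = [
        ("Nov. '97 case", 0xFFFFFFF014FFFFFF),
        ("Our case", 0xFFFFFFF005FFFFFF),
    ]
    for label, raw in cases:
        print(label, decode_ei_stat(raw))
```

Run against the two values above, the first reports a Bcache tag parity
error and the second reports no external interface error at all, which
matches the hand decode.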

        The second case also states that the machine check code was 0x98,
        processor hard error. This machine check code (0x98) is a "catch-all"
        for miscellaneous processor-detected errors. In other words, some
        event internal to the processor caused a trap to the machine check
        entry point in PALcode. The PALcode handler then parsed all the
        internal IPRs in the CPU that contain error state and did not find any
        specific error bits set. As a result, it loads machine check code
        0x98 to indicate that an unknown processor hard error was detected,
        and the system crashes, since these types of errors are always fatal.

        Since the error was detected internal to the CPU, I would try to swap
        out the CPU on this system and see if the problem goes away.

And that is what we will do.

Thanks,

Kevin Tyle <kevin_at_meso.com>
MESO, Inc.
Troy, NY USA

--------------------------------------------------------------------------------------

Original Message:

> Hi Managers,
>
> Over the last two weeks, our Digital AlphaPC 164LX 533 MHz machine
> has crashed three times. After analyzing the binary error log and
> the crash-data files, it looks like the cause is the same in
> each case. In fact, the symptoms look virtually identical to a case
> posted in the archives on Nov. 14, 1997.
>
> In the poster's case, the diagnosis was:
>
> FINAL CONCLUSION: The most probable cause is over heating of the
> Bcache Simms.
>
> To put it simply, in each case I see the following--exactly the same as
> the referenced case, except for one slight(?) difference:
>
> SOME INFO FROM THE CRASH DATA FILES: What you may see if you have this
> problem:
> 1) a line in the crash data file that has: EI_STAT reg = fffffff014ffffff
>
> (in our case it appears as: EI_STAT reg = fffffff005ffffff)
>
> 2) a panic string of: "Processor Machine Check"
> 3) lines in succession with the following:
> Machine Check Processor Fatal Abort
> Machine Check Code = 98
> Machine Check Code = 98
> Processor detected hard error
>
> My question: can I therefore confidently conclude that the problem is the
> same as what was diagnosed in the previously posted case? Or might there
> be other possibilities, and if so, how might I diagnose it? The
> overheating problem is possible, though the overall room environment has
> not changed in the past year. Perhaps a case fan might not be working
> properly, or could the Bcache simms simply be wearing out (the machine is
> 2.5 years old and does a lot of memory-intensive number crunching).
>
> Thanks,
>
> Kevin Tyle <kevin_at_meso.com>
> MESO, Inc.
> Troy, NY USA
>
>
Received on Mon Feb 07 2000 - 15:10:51 NZDT
