SUMMARY: Alphastation 500 5/266 CPU exceptions.

From: Tony McElhill <tony.mcelhill_at_uk.airsysatm.thomson-csf.com>
Date: Fri, 24 Nov 2000 10:06:16 +0000

Hi,

Thanks to:

Alan_at_nabeth.cxo.dec.com
Olle Eriksson
Tom Blinn

First off, this was the wrong forum to raise this kind of issue, which
was basically a hardware error that I should have gone to Compaq with. I
also managed to send the original message out in plain text and html!
Sorry on both counts.

The reason that I didn't go straight to Compaq on this one is that I
knew this box was out of warranty and not under a support contract, and
could be pricey to get sorted on the basis of a callout. However, Compaq
have been very helpful, and the result is that they think the system
board needs to be replaced, but as the total cost will be about 66% of
the price of an XP900, I will just keep the Alphastation 500 for spares.
They will only charge on the basis of a visit, plus transit costs etc.
etc.


...................................................................................................................

You have a problem with memory or cache memory. The messages indicate
that the errors have been corrected by ECC but that there are a lot of
errors.

Olle
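
To see how many corrected errors there are, and where they come from,
the console messages can be tallied by source. The following is only a
minimal sketch. It assumes the console output has been captured as
lines of text; the function name tally and the sample list are
illustrative, and the message wording is taken from the console lines
quoted in the original question below.

```python
import re
from collections import Counter

# Matches lines such as:
#   Processor Detected - BCache single bit ECC error.
#   Processor Detected - Memory single bit ECC error.
PATTERN = re.compile(r"Processor Detected - (\w+) single bit ECC error")


def tally(lines):
    """Count corrected single-bit ECC reports by source (e.g. BCache, Memory)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample built from lines in the original question.
sample = [
    "Processor Detected - BCache single bit ECC error.",
    "*** Unexpected interrupt through vector 0000067",
    "Processor Detected - Memory single bit ECC error.",
]

for source, count in tally(sample).items():
    print(source, count)
```

A count that keeps climbing during a single boot would support the
reading that the part is failing rather than suffering an occasional
soft error.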
.............................

.... you either have failing memory or a failing memory controller.
If it's failing memory, moving it around isn't going to fix it.

If you don't know how to interpret a binary error log to figure out
what component is failing, then call up a trained service engineer.
Very few of the people reading the mailing list have documentation
on decoding the error log entries.

If you prefer to try self maintenance, then go purchase some known to
be good ECC memory (most commodity PC memory is NOT adequate) and try
moving it around until you isolate the failing parts. Just moving bad
memory around within the system is a losing approach.

Tom
.............................

The Bcache sits somewhere between the CPU and memory, but
        in front of the on-CPU cache. It uses Error Correction
        Codes (ECC) to detect and fix single-bit errors and detect
        double bit errors. It sounds like you're getting enough
        of them that the logging of the errors is impacting the
        overall system performance. I think the cache is built
        into the system board, so fixing would be a matter of replacing
        the board. I don't have an AlphaStation handy to look
        closely, but a diagram of the system board makes it look
        like the cache chips are soldered.

        If the system has a support contract, a service call
        is warranted.

Alan
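
The "fix single-bit errors, detect double-bit errors" behaviour
described above can be illustrated with a toy SECDED code. This is a
sketch only: it protects 4 data bits with an extended Hamming(8,4)
code, whereas real cache and memory ECC works on much wider words, and
it is not the encoding used by the AlphaStation hardware. The function
names encode and decode are illustrative.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions (1, 2, 4 are check bits; 3, 5, 6, 7 are data bits).
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); nibble is None if the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions of set bits
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Exactly one bit flipped: the syndrome names its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
print(decode(word))        # ('ok', 11)

word[5] ^= 1               # one flipped bit is repaired
print(decode(word))        # ('corrected', 11)

word[6] ^= 1               # a second flipped bit can only be detected
print(decode(word))        # ('uncorrectable', None)
```

The "corrected" case corresponds to the single-bit errors being logged
here: the data still arrives intact, but each event costs time to
handle and report.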
...............................................................................................................

Q:

Hi,

I have a test "rig" with various Alphastations, and have recently
upgraded some of them from 4.0D to 4.0F. Whilst in the process of doing
this, one particular box gave me problems, and would not boot off the
(firmware) CDROM. I tried creating a boot-floppy, but that didn't work
either - message was "Can not read from device dva0" (/dka? for CD) or
similar. I went ahead and did an installupdate to 4.0F, and pk3 but I'm
not too optimistic about the future of this box!

At the console level on power-up, you see:

Processor Detected - BCache single bit ECC error.
*** Unexpected interrupt through vector 0000067
IPRs:
EXC_ADDR: 000....12BA3C EXC_SUM: .........
ALCOR Error CSRs (CPU 0)
CIA_ERR:0000....0 ERR_STAT:....
warning - HWRPB is invalid.
I/O CSRs:
MEMORY BASE ADDRESS CSRs
MBA: 0008011
.....................
Processor Detected - Memory single bit ECC error.
IPRs:
EXC_ADDR: 000....12BA3C EXC_SUM: .........
etc. etc.

Processor correctable error through vector 00000063
EI_STAT: FFFFFFF484FFFFFF EI_ADDR: FFFFFF00000C5EAF
FILL_SYN: 00..........3100 ISR: 0000000100..0MCES4
Error on fill data from Main mem
data bit 14 530 bank 0
bad page in console mem cluster [0]

This latter error continues with some variation during the boot, and
surprisingly the system does go to multi-user, and has been usable,
though today one of the developers told me they weren't able to use this
box at all.
I assume that this is what it says it is - and that it's either a CPU
and/or motherboard problem, but as it's probably going to be a fairly
expensive fix, I thought I'd run it past you lot anyway. I've tried
removing all the memory modules and trying them in different slots, but
get the same error.

Any suggestions?

BTW - it's a PB540-A9 system at firmware 6.8-2.


TIF,

Tony.

uerf messages followed ........

--
  ---------------------oooOOOooo---------------------
  Tony McElhill
  Development Support Engineer
  Airsys ATM
  Oakcroft Road, Chessington, Surrey KT9 1QZ England.
  Tel: 020-8391-6438
  Fax: 020-8391-6137
  e-mail: tony.mcelhill_at_uk.airsysatm.thomson-csf.com
  ---------------------oooOOOooo---------------------
Received on Fri Nov 24 2000 - 10:12:09 NZDT
