SUMMARY (2) Processor corrected errors

From: Jeff Higgins <HIGGINS_at_aces.k12.ct.us>
Date: Mon, 19 Aug 1996 13:41:11 -0500 (EST)

Since posting my summary on the message, "WARNING: too many processor corrected
errors detected on cpu0", I have heard from more people, and I want to share
what they told me.

The gist of my first summary was alan_at_nabeth.cxo.dec.com's reply:

> It is UNIX reporting that there are lots of correctable
> errors on CPU0. The machine check is a hardware error
> as well, but it was corrected. Correctable errors are
> things like:

> o Cache read failures that can be refilled from memory.
> o Cache write failures that can be written directly to
>   memory.
> o Data paths protected by ECC memory for which errors
>   were detected and corrected.

Based on this reply and the fact that the message didn't return after
rebooting, I relaxed. After summarizing, however, I heard from Dave C. Boyle
(dboyle_at_liquidaccess.net):

>I didn't see your original post, but here is my experience of corrected
>errors. Our machine is a 1000 4/166 and we started having that error a few
>months ago. Rebooting the box (cause at times it would crash because of the
>error) corrected the problems for a while so I didn't think anything of it.
>However, after a while rebooting would only work about 50% of the time (and I
>had to reboot until it came up). Finally it wouldn't come up at all (the
>errors would be reported during bootup and the machine would halt without
>booting, about 1/3 through the boot up). It turned out that replacing the
>motherboard solved the problem. DEC did the switch and they said it was
>either bad simms (incorrect) or bad simm "seats" which were corrected by
>the motherboard switch.

>Just thought you should know. Cause I didn't think anything of it till the
>machine just wouldn't come up.

I finally caught up with Digital Support, who examined my
/var/adm/binary.errlog and /var/adm/messages. Had there been a crash, DEC would
have examined /var/adm/crash/crash-data.date-of-crash. In my case, there was
not enough register data to make a determination, so we will simply keep an
eye on it for now. I did learn that DEC was checking whether the CPU exception
was memory-related or module (cache) related, and that there are instances when
the CPU card has to be replaced.
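
For anyone who wants to "keep an eye on it" without reading the log by hand,
here is a minimal sketch in Python that counts how often the warning shows up
per CPU. The message text is the one quoted at the top of this summary; the
function names are made up for illustration, and the layout of the rest of
each log line (timestamp, host name and so on) is an assumption, so the
pattern only matches the message text itself.

```python
import re
from collections import Counter

# Message text as quoted in this summary. Whatever precedes it on the
# log line (timestamp, host name, etc.) is assumed and simply ignored.
WARNING_PATTERN = re.compile(
    r"too many processor corrected errors detected on (cpu\d+)"
)


def count_corrected_error_warnings(lines):
    """Return a Counter mapping CPU name (e.g. 'cpu0') to warning count."""
    counts = Counter()
    for line in lines:
        match = WARNING_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def scan_log(path="/var/adm/messages"):
    """Count warnings in a log file (hypothetical helper).

    The default path is the messages file mentioned in this summary.
    """
    with open(path, errors="replace") as log:
        return count_corrected_error_warnings(log)
```

Calling scan_log() on the affected machine and comparing the counts from one
week to the next would show whether the warnings are becoming more frequent,
which, going by Dave's experience above, seems to be the thing to watch for.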

I hope this info is worthwhile to someone! At least it's a start in the event
of such a message coming up.

Jeff Higgins
Received on Mon Aug 19 1996 - 20:16:10 NZST
