SUMM: Zillions of CPU exceptions

From: Guy Dallaire <dallaire_at_total.net>
Date: Tue, 27 Aug 1996 09:29:46 -0400 (EDT)

Thanks to the many who responded !

First, my original post:

-------------------------------BEGIN----------------------------------------

I'm back from vacation and when I did a routine check on my system with
uerf, I got a zillion CPU EXCEPTIONS. My /var/adm/binary.errlog file is now
70 MB!

Here is an excerpt from the uerf -R command:

********************************* ENTRY 1. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 59573.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Fri Aug 23 01:26:43 1996
OCCURRED ON SYSTEM dgeux2
SYSTEM ID x00050009 CPU TYPE: DEC 2100
SYSTYPE x00000000
PROCESSOR COUNT 2.
PROCESSOR WHO LOGGED x00000000

----- UNIT INFORMATION -----

UNIT CLASS CPU

********************************* ENTRY 2. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 59572.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Fri Aug 23 01:26:43 1996
OCCURRED ON SYSTEM dgeux2
SYSTEM ID x00050009 CPU TYPE: DEC 2100
SYSTYPE x00000000
PROCESSOR COUNT 2.
PROCESSOR WHO LOGGED x00000001

----- UNIT INFORMATION -----

UNIT CLASS CPU

********************************* ENTRY 3. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 59571.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Sun Aug 18 10:37:40 1996
OCCURRED ON SYSTEM dgeux2
SYSTEM ID x00050009 CPU TYPE: DEC 2100
SYSTYPE x00000000
PROCESSOR COUNT 2.
PROCESSOR WHO LOGGED x00000000

...
...

Most of these errors occurred on August 18 between 9:11 am and 10:37 am
(there are a ton of them!), and others occurred on August 5 and 23. I have
another Alpha 2100 5/250RM server, identical except that it has 1 CPU instead
of 2, and the log file on that server is normal (341 KB with no CPU
exceptions). Both systems run the same OS: DU 3.2D-1

What is this ? Any help would be appreciated.

How can I reduce the size of my binary.errlog file ?

---------------------------------END------------------------------------------

Between my post and my call to DEC tech support (2 Days), the binary.errlog
file grew to 197 MegaBytes!

Most of you suggested that it was a 'soft' memory error, which is
correctable. For some reason, uerf lists that kind of memory error as a
CPU EXCEPTION instead of a MEMORY ERROR. I think this is a bug in uerf.

A DEC technician examined my system over a remote connection, confirmed that
it was a soft memory error, and suggested that I reboot the machine to
correct it. All the errors occurred at the same memory location and were
probably caused by the system trying to _read_ that location from time to
time; at least that's what the DEC technician told me. He also told me that
rebooting and power cycling the machine would rewrite that memory location
and fix the problem. I think it's a weird theory, but I tried it anyway.

Since the reboot, no memory errors have occurred. I hope it stays that way.
But some of you who had similar problems told me that it did not take long
for them to resurface...

Also, someone told me that a CPU EXCEPTION (AKA memory error) can occur from
time to time and is not harmful (as long as it doesn't happen 80000 times in
5 minutes).

As for truncating the binary.errlog file, the solution is to:

1) Stop the binary logger with /sbin/init.d/syslog stop
2) Save the /var/adm/binary.errlog file to a convenient place
3) Zap it with cat /dev/null > /var/adm/binary.errlog
4) Restart the binary logger with /sbin/init.d/syslog start
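
Steps 2 and 3 above can be wrapped in a small function so the copy and the
truncation always happen together. This is a minimal sketch, not a tested
procedure for Digital UNIX: the function name save_and_truncate_log and the
/var/tmp destination in the usage comment are assumptions chosen for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy LOG into BACKUP_DIR under a timestamped name,
# then truncate LOG in place (same effect as cat /dev/null > LOG).
save_and_truncate_log() {
    log=$1
    backup_dir=$2
    stamp=$(date +%Y%m%d%H%M%S)

    # Keep a copy first so the old records can still be examined later.
    cp -p "$log" "$backup_dir/$(basename "$log").$stamp" || return 1

    # Empty the file without removing it, so ownership and mode are kept.
    : > "$log"
}

# Usage, with the logger stopped and restarted as in steps 1 and 4:
#   /sbin/init.d/syslog stop
#   save_and_truncate_log /var/adm/binary.errlog /var/tmp
#   /sbin/init.d/syslog start
```

The function returns without truncating if the copy fails, so a full
destination filesystem does not cost you the log.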

I hope this summary will be useful. A lot of people have experienced this
kind of memory problem, and you can also search the archives. BTW, DEC
doesn't seem to like replacing memory boards; they will do ANYTHING to
avoid replacing one. They once told a friend of mine (who had a dead VAX)
that the problem was due to some 'cosmic' particle coming from the sun that
passed 'through' the main board and destroyed the CPU. _Some_ DEC engineers
should start writing plots for B movies; the others are generally good.

                                Have fun !
Received on Tue Dec 17 1996 - 00:55:49 NZDT
