weird, regularly-occurring CPU exceptions on AlphaServer 2100 4/200

From: Mark Bartelt <sysmark_at_cita.utoronto.ca>
Date: Thu, 2 May 1996 08:54:11 -0400

We have a pair of AlphaServer 2100 4/200 systems, nearly identical (one has
256 MB and a DAT drive, the other has 1024 MB and more disks than the small
system). This morning I noticed that the binary error log on the smaller of
the two systems had grown to 12 MB! So I used uerf to look through it, and
found that there were thousands of entries of the following sort:

  EVENT CLASS                           ERROR EVENT
  OS EVENT TYPE                  100.   CPU EXCEPTION
  SEQUENCE NUMBER              10599.
  OPERATING SYSTEM                      DEC OSF/1
  OCCURRED/LOGGED ON                    Fri Apr 26 14:00:17 1996
  OCCURRED ON SYSTEM                    seal
  SYSTEM ID                x00020009    CPU TYPE: DEC 2100
  SYSTYPE                  x00000000
  PROCESSOR COUNT                  4.
  PROCESSOR WHO LOGGED     x00000003

These things come in clusters of four (one from each CPU) with identical time
stamps, at a couple seconds past the hour, followed by a second cluster about
fifteen seconds later. Then nothing until the next hour. For example, here
are some of the "OCCURRED/LOGGED ON" entries ...

  OCCURRED/LOGGED ON Fri Apr 26 04:00:03 1996
  OCCURRED/LOGGED ON Fri Apr 26 04:00:19 1996
  OCCURRED/LOGGED ON Fri Apr 26 05:00:02 1996
  OCCURRED/LOGGED ON Fri Apr 26 05:00:17 1996
  OCCURRED/LOGGED ON Fri Apr 26 06:00:01 1996
  OCCURRED/LOGGED ON Fri Apr 26 06:00:16 1996
  OCCURRED/LOGGED ON Fri Apr 26 07:00:02 1996
  OCCURRED/LOGGED ON Fri Apr 26 07:00:16 1996
  OCCURRED/LOGGED ON Fri Apr 26 08:00:03 1996
  OCCURRED/LOGGED ON Fri Apr 26 08:00:20 1996
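
One way to check this pattern across the whole log, rather than by eye, is to pull every "OCCURRED/LOGGED ON" line out of the uerf text output and group the timestamps by hour. The following is only a rough Python sketch under some assumptions: the uerf output has been saved as plain text, the timestamp field looks exactly like the samples above, and day and month names are in English (strptime's %a and %b depend on the locale). The names STAMP, parse_stamps, offsets_by_hour and SAMPLE are made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed field layout, taken from the excerpts above: label, whitespace, timestamp.
STAMP = re.compile(
    r"OCCURRED/LOGGED ON\s+(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4})"
)

def parse_stamps(text):
    """Return a sorted list of datetimes, one per OCCURRED/LOGGED ON line."""
    return sorted(
        # Collapse runs of spaces (e.g. "Apr  6") before parsing.
        datetime.strptime(" ".join(m.group(1).split()), "%a %b %d %H:%M:%S %Y")
        for m in STAMP.finditer(text)
    )

def offsets_by_hour(stamps):
    """Map (date, hour) to the sorted, de-duplicated seconds past that hour."""
    buckets = defaultdict(set)
    for t in stamps:
        buckets[(t.date(), t.hour)].add(t.minute * 60 + t.second)
    return {key: sorted(vals) for key, vals in buckets.items()}

# The ten sample timestamps quoted above.
SAMPLE = """\
  OCCURRED/LOGGED ON Fri Apr 26 04:00:03 1996
  OCCURRED/LOGGED ON Fri Apr 26 04:00:19 1996
  OCCURRED/LOGGED ON Fri Apr 26 05:00:02 1996
  OCCURRED/LOGGED ON Fri Apr 26 05:00:17 1996
  OCCURRED/LOGGED ON Fri Apr 26 06:00:01 1996
  OCCURRED/LOGGED ON Fri Apr 26 06:00:16 1996
  OCCURRED/LOGGED ON Fri Apr 26 07:00:02 1996
  OCCURRED/LOGGED ON Fri Apr 26 07:00:16 1996
  OCCURRED/LOGGED ON Fri Apr 26 08:00:03 1996
  OCCURRED/LOGGED ON Fri Apr 26 08:00:20 1996
"""

if __name__ == "__main__":
    grouped = offsets_by_hour(parse_stamps(SAMPLE))
    for (day, hour), secs in sorted(grouped.items()):
        gap = secs[-1] - secs[0]
        print(f"{day} {hour:02d}:00  offsets(s)={secs}  gap(s)={gap}")
```

For the ten samples above, the gap between the first and second cluster in each hour comes out between 14 and 17 seconds. Run over the full log, any hour whose offsets do not fit that shape would stand out, as would the exact first and last hours of the episode.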

Going back through the log, I see that this started on March 4, then stopped
just as mysteriously on April 26. We do have something that gets started up
by cron every hour (a shell script which monitors disk use), but the crontab
entry has been there since long before this weirdness started.

I'm completely baffled by this, and haven't a clue about how to figure out why
this was happening. (Especially now that it's stopped!) And why would it be
happening on one of our systems, but not on the other?

It seems that the "brief" output from uerf doesn't really tell me *what* the
exception was, and the "full" output is so verbose I haven't a clue as to how
to interpret it. Any suggestions?

Mark Bartelt                       416/978-5619
Canadian Institute for             mark_at_cita.toronto.edu
Theoretical Astrophysics           mark_at_cita.utoronto.ca

"Sheep not busy being shorn are busy frying" -- Dylan, at a NZ lamb barbecue
             [ singing "It's all right, ma (I'm only bleating)" ]
Received on Thu May 02 1996 - 15:29:06 NZST
