SUMMARY#2:CPU exception Event - NO crash

From: Ronald D. Bowman <rdbowma_at_tsi.clemson.edu>
Date: Mon, 02 Mar 1998 14:36:00 -0500

Okay, this is the last message from me today.
This is the information that Kurt Carlson sent me concerning
the CPU Exception. I wanted to get his Okay for the posting
before sending it out to everyone. This will probably
help those who are new to the job of sys admin, or it could
serve as a refresher for those of you who have been doing
this for some time.

Kurt's reply:

>Feb 23 09:39:47 tsi vmunix: Machine Check error corrected by processor
>Feb 23 09:39:47 tsi vmunix: Physical address of error ffffff00045560ef Corrected
> ECC Error in B-Cache during D-Cache fill

This is a single-bit correctable error in the processor cache. Not a problem
unless it's a recurrent event. The more common 'CPU EXCEPTION' is
a single-bit recoverable memory error.

20 of them in 6 hours is enough to be modestly concerned; I'd suggest
you contact Digital if you have a support contract. If they have stopped, you'll
probably not get any further action than reseating boards.
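
To judge whether the corrected errors are recurring, it helps to count them.
A minimal sketch, assuming the kernel messages land in a syslog file whose
lines look like the ones quoted above. The function name is made up for
illustration and is not part of any Digital tool.

```shell
# count_corrected_mchecks FILE [DATE_PREFIX]
# Count "Machine Check error corrected by processor" lines in a syslog file,
# optionally limited to lines that start with DATE_PREFIX (e.g. "Feb 23").
# Hypothetical helper; DATE_PREFIX is used as a plain grep pattern.
count_corrected_mchecks() {
    file=$1
    prefix=${2:-}
    grep "^$prefix" "$file" |
        grep -c 'Machine Check error corrected by processor'
}
```

For example, `count_corrected_mchecks /var/adm/messages "Feb 23"` would print
the count for that day. The path is an assumption; use whichever file your
syslog.conf sends kernel messages to. A per-day count is coarser than the
6-hour window mentioned above, but it is enough to see whether the events
have stopped.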

You should probably switch to dia (DECevent) instead of uerf. uerf just doesn't
know how to interpret the binary error reports. In general you should
have a regular analysis plan for the binary error log. We have a daily
job emailing a summary report from all our systems. We pipe the uerf
output to a filtering program which produces a one-line-per-event summary
of significant events. Yes, I said uerf, not dia: dia didn't exist
when I wrote it, and it works well enough that I haven't redone it. I did
have to add a sed script (yuck) to filter dia's cpu exception and
disk bbr reports down to something humanly readable.
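
The daily job described above could look roughly like the following. This is
only a sketch under stated assumptions, not the actual job: the uerf options
and the ua_date / ua_uerf invocations are copied from the aliases quoted
further down in this message, while mailx, the subject line, the function
name and the UA_UERF_FILTER override are placeholders.

```shell
#!/bin/sh
# Sketch of a daily error-log summary mailed to an administrator.
# Assumptions: ua_date and ua_uerf are the helpers from the uakpacct kit;
# mailx and the recipient are placeholders. UA_UERF_FILTER is a
# hypothetical override for the filter's install path.
UA_UERF_FILTER=${UA_UERF_FILTER:-/usr/local/sbin/ua_uerf}

daily_uerf_mail() {
    recipient=$1
    since=$(ua_date -uerf -1) || return 1    # start time, one day back
    uerf -c err,oper -o full -t "s:$since" |
        "$UA_UERF_FILTER" -a |
        mailx -s "error log summary: $(hostname)" "$recipient"
}
```

A cron entry would then source this file and call `daily_uerf_mail root`
(or whatever address you prefer) once a day.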

If you want the program & scripts as a starter, you can find them within:

        ftp://raven.alaska.edu/pub/sois/README.uakpacct
  kit: ftp://raven.alaska.edu/pub/sois/uakpacct-v1.8.tar.Z

You'd be looking for ua_uerf (there are a couple of other things in that kit).

Besides the daily (and weekly, and monthly management) reports from
cron jobs, I keep a couple of aliases for ad hoc checks:

snkac_at_glacier: alias | grep uerf
UA7UERF='uerf -c err,oper -o full -t s:`ua_date -uerf -7` | /usr/local/sbin/ua_uerf -a'
UAXUERF='uerf -c err,oper -o full -t s:`ua_date -uerf -30` | /usr/local/sbin/ua_uerf -a'
UA_UERF='uerf -c err,oper -o full -t s:`ua_date -uerf -1` | /usr/local/sbin/ua_uerf -a'
snkac_at_glacier: UA7UERF
#glacier Mon Feb 23 1998
#glacier Mon Feb 23 1998 12:26:01 1 300 SYSTEM STARTUP
>12:26:30 2 199 Bus:01 lu:12.0 R=cam_disk_unit_atten:::Event - Unit Attention
>12:26:49 3 199 Bus:03 lu:27.1 R=cam_disk_unit_atten:::Event - Unit Attention
#glacier Sun Mar 1 1998
#glacier Sun Mar 1 1998 09:32:28 1 300 SYSTEM STARTUP
>09:33:12 2 199 Bus:01 lu:12.1 R=cam_disk_unit_atten:::Event - Unit Attention

Error reading syserr file.
#glacier Sun Mar 1 1998 10:25:36 3 301 SYSTEM SHUTDOWN |halted by sxan: Disk stuff

Summary:
     Total 2 300 SYSTEM STARTUP
     Total 1 199 Bus:01 lu:12.0 R=cam_disk_unit_atten:::Event - Unit Attention
     Total 1 199 Bus:03 lu:27.1 R=cam_disk_unit_atten:::Event - Unit Attention
     Total 1 199 Bus:01 lu:12.1 R=cam_disk_unit_atten:::Event - Unit Attention
     Total 1 301 SYSTEM SHUTDOWN

There's a daily, weekly, and monthly alias; I just executed the weekly. Kurt
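
The three aliases differ only in how many days they look back, so they could
be folded into one shell function. A sketch, assuming ua_date and ua_uerf
from the uakpacct kit are installed as shown in the aliases; the function
name and the UA_UERF_FILTER override are hypothetical.

```shell
# ua_uerf_since [DAYS]
# Summarize error-log events from the last DAYS days (default 1).
# Stands in for the 1-, 7- and 30-day aliases; hypothetical name.
ua_uerf_since() {
    days=${1:-1}
    case $days in
        *[!0-9]*) echo "usage: ua_uerf_since [DAYS]" >&2; return 2 ;;
    esac
    # Same pipeline as the aliases, with the look-back made a parameter.
    uerf -c err,oper -o full -t s:"$(ua_date -uerf "-$days")" |
        "${UA_UERF_FILTER:-/usr/local/sbin/ua_uerf}" -a
}
```

`ua_uerf_since 7` then corresponds to UA7UERF and `ua_uerf_since 30` to
UAXUERF.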


Sincerely,
Ron Bowman
Techno-Sciences, Inc.
rdbowma_at_tsi.clemson.edu
864-646-4028

Alpha EB 21164, 333MHz, 1 CPU
DU 4.0 (564) Patch #6 installed
Received on Mon Mar 02 1998 - 20:36:11 NZDT
