[SUMMARY]: machine check error

From: Sung Moon Kang <sung.kang.bk.94_at_aya.yale.edu>
Date: Mon, 19 Nov 2001 17:24:13 -0800

I'd like to thank the following folks for their helpful suggestions:

alan_at_xxx.xxx.xxx
"Vlack, Jay" <Jay.Vlack_at_xxx.xxx.xxx>
Paul <tru64_at_xxxx.xxx.xxx>

However, I'd like to specifically single out Jay Vlack for taking the
time to actually look through my binary error log and for guesstimating
the problem to be a correctable memory error.

Basically, I should look out for more of these errors (I haven't
gotten any since last Friday morning at 5:20 am). If I start getting
them more regularly, that's when I should worry, and by then I will
have more error entries to help determine exactly which component to
replace.
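
For anyone who wants to automate the "watch for more of these" advice,
here is a minimal Python sketch. It assumes the machine check timestamps
have already been collected by hand from the event log. The function
name too_frequent and the thresholds (more than 3 events in any 7 days)
are hypothetical placeholders, not values recommended by Compaq.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the EVM event text, e.g. "16-Nov-2001 05:20:35".
FMT = "%d-%b-%Y %H:%M:%S"

def too_frequent(timestamps, max_events=3, window=timedelta(days=7)):
    """Return True if more than max_events fall inside any one window.

    max_events and window are illustrative guesses only.
    """
    times = sorted(timestamps)
    start = 0
    for end, current in enumerate(times):
        # Slide the window start forward until it is within range.
        while current - times[start] > window:
            start += 1
        if end - start + 1 > max_events:
            return True
    return False

# The single machine check reported in this thread.
events = [datetime.strptime("16-Nov-2001 05:20:35", FMT)]
print(too_frequent(events))  # False
```

With only the one event logged so far the check stays quiet; it would
only trip if several machine checks landed close together.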

Jay Vlack's message:

Date: Sat, 17 Nov 2001 10:32:14 -0700
From: "Vlack, Jay" <Jay.Vlack_at_xxx.xxx.xxx>
Subject: RE: machine check error

Sung,

I looked at your MCHK error this morning. I found out that DECevent
isn't supported on the 433au, which is why you only get a raw hex dump.
However, I managed to decode the registers in your 433au machine check
by hand. The MCHK code is 86 (offset 0x48 into the hex dump), which
indicates a CPU-detected correctable ECC error (memory). The faulting
FRU could be the MLB (main logic board), Bcache, or a DIMM. The error
was in the low quadword, and the Fill Syndrome indicates that the error
involved more than one bit (fill_syn <7:0> = 00000100). Unfortunately I
can't narrow it down further than that with the data that's in this hex
dump. (Note: I've labeled the key registers in your hex dump below, if
you're interested).

If this is the only error of this type that you've seen on the system,
it's probably not a big deal. These Alpha systems are designed to
correct memory errors when possible. However, if you see a lot of these
errors in a short amount of time, or if they are being logged regularly
over a period of days or weeks, you should probably replace the offending
FRU. With more machine check entries to analyze, it might be possible to
call out a specific FRU (Field Replaceable Unit). I hope this helps.

Jay Vlack
Technical Account Manager
Compaq Services
Tru64 UNIX Platinum Support
jay.vlack_at_xxx.xxx.xxxx


**** V3.3 **************** ENTRY 26 *****************************

Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 2363.
Timestamp of occurrence 16-NOV-2001 06:20:35
Host name britomart

System type register x0000001E Systype 30. (Miata)
Number of CPUs (mpnum) x00000001
CPU logging event (mperr) x00000000

Event validity 1. O/S claims event is valid
Event severity 1. Severe Priority
Entry type 100. Machine Check Error - (major class)
                                   4. - (minor class)
========================
Raw Event Data Dump
========================

Entry# (record in file) 26.

Entry Body Size: x000000A0
Entry body:
        15--<-12 11--<-08 07--<-04 03--<-00 :Byte Order
  0000: 3BF512A3 00060101 0007001E 093B00A0 *..;............;*
  0010: 00000001 00000074 72616D6F 74697262 *britomart.......*
  0020: 00000000 1A040064 00000000 00000001 *........d.......*
  0030: 80000000 00000068 00000000 00000000 *........h.......*
  0040: 00000000 00000086 00000038 00000018 *....8...........*
        ^^^^^^^^^^^^^^^^^
        MCHK Code
  0050: 00000000 00000100 FFFFFF00 09AA83EF *................*
        ^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^
        fill_synd         ei_addr
  0060: 00000001 00000000 FFFFFFF0 C5FFFFFF *................*
        ^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^
        isr               ei_stat
  0070: 00000000 00000000 00000000 00000000 *................*
  0080: 00000000 00000000 00000000 00000000 *................*
  0090: 003C7E25 00000000 00000000 00000000 *............%~<^*
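
The hand decoding above can be reproduced with a short Python sketch.
It assumes the layout stated in the "Byte Order" header: each row holds
16 bytes printed as four longwords, highest addresses first. The helper
names dump_to_bytes and quadword are made up for illustration, and the
ASCII column has been left out of the pasted rows. The 0x58 offset for
the fill syndrome is inferred from where the label sits in the dump.

```python
# Rows copied from the raw event dump, without the ASCII column.
DUMP = """\
0000: 3BF512A3 00060101 0007001E 093B00A0
0010: 00000001 00000074 72616D6F 74697262
0020: 00000000 1A040064 00000000 00000001
0030: 80000000 00000068 00000000 00000000
0040: 00000000 00000086 00000038 00000018
0050: 00000000 00000100 FFFFFF00 09AA83EF
0060: 00000001 00000000 FFFFFFF0 C5FFFFFF
0070: 00000000 00000000 00000000 00000000
0080: 00000000 00000000 00000000 00000000
0090: 003C7E25 00000000 00000000 00000000
"""

def dump_to_bytes(text):
    """Rebuild the entry body as bytes in ascending address order."""
    image = bytearray()
    for line in text.splitlines():
        if not line.strip():
            continue
        offset_text, _, rest = line.strip().partition(":")
        if int(offset_text, 16) != len(image):
            raise ValueError("row offset does not match bytes read so far")
        # Longwords are printed 15..12 first, so reverse them to get
        # bytes 3..0 first, then store each one little-endian.
        for word in reversed(rest.split()[:4]):
            image += int(word, 16).to_bytes(4, "little")
    return bytes(image)

def quadword(image, offset):
    """Read a little-endian 64-bit value at the given byte offset."""
    return int.from_bytes(image[offset:offset + 8], "little")

image = dump_to_bytes(DUMP)
print(len(image))                        # 160, i.e. x000000A0
print(hex(quadword(image, 0x48)))        # 0x86, the MCHK code
print(hex(quadword(image, 0x58)))        # 0x100, labeled fill_synd
print(image[0x10:0x19].decode("ascii"))  # britomart
```

The 160-byte length matches the "Entry Body Size" field, and the host
name falling out of row 0010 is a quick sanity check that the byte
order was read correctly.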

At 12:28 PM -0800 16/11/01, I originally wrote:
>I got a rather ominous message from our old Digital Personal
>Workstation 433au. After searching the archives, it looks like
>the problem has been dealt with before by various folks on the
>list, but alas, I cannot seem to dig up a summary, only the question.
>
>So, I'm afraid I have to ask again.
>
>The message I got:
>
>======================= Binary Error Log event =======================
>EVM event name: sys.unix.binlog.hw.machine_check
>
> Binary error log events are posted through the binlogd daemon, and
> stored in the binary error log file, /var/adm/binary.errlog. This
> event type reports a serious system error.
>
> Action : Contact your service provider.
>
>======================================================================
>
>Formatted Message:
> CPU machine check/exception - CPU 0
>
>Event Data Items:
> Event Name : sys.unix.binlog.hw.machine_check
> Priority : 700
> PID : 722
> PPID : 1
> Event Id : 4706
> Timestamp : 16-Nov-2001 05:20:35
> Host IP address : <ip address deleted>
> Host Name : britomart
> User Name : root
> Format : CPU machine check/exception - CPU $subid_num
> Reference : cat:evmexp.cat:300
>
>Variable Items:
> subid_class (INT32) = 100
> subid_num (INT32) = 0
> subid_unit_num (INT32) = 0
> subid_type (INT32) = 4
> binlog_event (OPAQUE) = [OPAQUE VALUE: 160 bytes]
>
>============================ Translation =============================
>binlogshow: DECevent handshake protocol error
>======================================================================
>
>DECevent (which I understand is no longer supported but I can't seem
>to find anyone at Compaq that can help me install WEBES properly) spits
>out the following:
>
>**** V3.3 ********************* ENTRY 26 ********************************
>
>Logging OS 2. Digital UNIX
>System Architecture 2. Alpha
>Event sequence number 2363.
>Timestamp of occurrence 16-NOV-2001 05:20:35
>Host name britomart
>
>System type register x0000001E Systype 30. (Miata)
>Number of CPUs (mpnum) x00000001
>CPU logging event (mperr) x00000000
>
>Event validity 1. O/S claims event is valid
>Event severity 1. Severe Priority
>Entry type 100. Machine Check Error - (major class)
> 4. - (minor class)
>========================
>Raw Event Data Dump
>========================
>
>Entry# (record in file) 26.
>
>Entry Body Size: x000000A0
>Entry body:
>
> 15--<-12 11--<-08 07--<-04 03--<-00 :Byte Order
> 0000: 3BF512A3 00060101 0007001E 093B00A0 *..;............;*
> 0010: 00000001 00000074 72616D6F 74697262 *britomart.......*
> 0020: 00000000 1A040064 00000000 00000001 *........d.......*
> 0030: 80000000 00000068 00000000 00000000 *........h.......*
> 0040: 00000000 00000086 00000038 00000018 *....8...........*
> 0050: 00000000 00000100 FFFFFF00 09AA83EF *................*
> 0060: 00000001 00000000 FFFFFFF0 C5FFFFFF *................*
> 0070: 00000000 00000000 00000000 00000000 *................*
> 0080: 00000000 00000000 00000000 00000000 *................*
> 0090: 003C7E25 00000000 00000000 00000000 *............%~<^*
>
>From what I could find, it apparently is a relatively benign error
>(something to do with memory), but I just wanted to see what everyone
>thought before I have to deal with the h*ll that is Compaq
>support. I never seem to be able to get a quick answer from them....
>
>Thanx.
Received on Tue Nov 20 2001 - 01:25:43 NZDT
