I am having some trouble with my AS2100A. I have rebuilt this server,
upgrading the 2 processors from EV4/275's to EV5/375's, and adding a
KZPSC-XE swxcr RAID controller card. I had DEC install the hardware and
upgrade me to v5.1 firmware.
Shortly after the upgrade I began seeing b-cache errors logged by cpu1.
Assuming the CPU was bad, I had DEC replace it. Several days later,
cpu0 is logging the same errors!
I've considered all sorts of bizarre possibilities: an Oracle 7.3.4
bug, something wrong with the firmware upgrade, a CPU module that needs
reseating. Oddly, I was unable to upgrade to DecEvent 2.8. I received an
error during bit-to-text translation (didn't capture it), so I rolled
back to 2.6. I wonder if this is somehow related.
Before I have DEC replace the other CPU, I want to know if anyone has
seen this issue before. I found notes in the archives about b-cache
errors, but in that instance they were causing kernel panics, and mine
are not. As far as I can tell, the system is ignoring the errors.
Someone even suggested to me that this is an error caused by code
compiled for 32-bit operating systems that is "out-of-sync" with my
64-bit architecture. That sounded a little off to me, but I am willing
to consider anything at this point.
My output from DecEvent:
******************************** ENTRY 1 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 39.
Timestamp of occurrence 18-JUN-1998 15:28:33
Host name robin
System type register x00000018 AlphaServer 2000A or 2100A
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 1. Severe Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 3. Bcache error (630 entry)
Entry Body Size: x00000078
Entry body:
15--<-12 11--<-08 07--<-04 03--<-00 :Byte Order
0000: 80000000 00000060 00000060 00000023 *#...`...`.......*
0010: 00000000 00000086 00000038 00000018 *....8...........*
0020: 00000000 000000F4 FFFFFF00 1E654AAF *.Je.............*
0030: 00000001 00000000 FFFFFFF0 81FFFFFF *................*
0040: 00000000 00000000 480013F2 48001002 *...H...H........*
0050: 00000000 00000000 000000E1 00000061 *a...............*
0060: 00000000 00000000 B800000A B800000A *................*
0070: 5E3C7E25 00000000 * ....%~<^*
*****************************************************************************************
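To see which CPU each entry comes from, the mperr field in a report like the one above can be tallied. This is only a rough sketch in Python. It assumes the DecEvent output has been saved as text, that every entry has a "CPU logging event (mperr)" line in the format shown, and that the value is the number of the logging CPU. The function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the field label exactly as it appears in the report text above.
MPERR_RE = re.compile(r"CPU logging event \(mperr\)\s+x([0-9A-Fa-f]+)")

def count_entries_by_cpu(report_text):
    """Return a Counter mapping CPU number -> number of entries it logged.

    Assumption: the hex value after 'x' is the logging CPU's number.
    """
    counts = Counter()
    for match in MPERR_RE.finditer(report_text):
        counts[int(match.group(1), 16)] += 1
    return counts

if __name__ == "__main__":
    sample = (
        "Number of CPUs (mpnum) x00000002\n"
        "CPU logging event (mperr) x00000000\n"
    )
    for cpu, n in sorted(count_entries_by_cpu(sample).items()):
        print(f"cpu{cpu}: {n} entries")
```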
/var/adm/messages says:
Jun 18 15:14:06 robin vmunix: Machine Check error corrected by processor
Jun 18 15:19:18 robin vmunix: Machine Check error corrected by processor
Jun 18 15:28:33 robin vmunix: Machine Check error corrected by processor
Jun 18 15:28:33 robin vmunix: Machine Check error corrected by processor
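To judge whether the corrected errors are getting more frequent, the messages above can be counted per hour. This is a minimal sketch in Python that assumes the syslog line format shown here. The function name and the hourly grouping are my own choices for illustration.

```python
import re
from collections import Counter

# Matches lines like:
# "Jun 18 15:14:06 robin vmunix: Machine Check error corrected by processor"
LINE_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+vmunix:\s+"
    r"Machine Check error corrected by processor"
)

def corrected_errors_per_hour(lines):
    """Return a Counter keyed by (month, day, hour) of corrected machine checks."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            month, day, hour = match.groups()
            counts[(month, int(day), int(hour))] += 1
    return counts

if __name__ == "__main__":
    # Path taken from the post; adjust if your log lives elsewhere.
    with open("/var/adm/messages") as log:
        for key, n in sorted(corrected_errors_per_hour(log).items()):
            print(key, n)
```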
TIA
susrod_at_hbsi.com
Received on Fri Jun 19 1998 - 00:46:42 NZST