SUMMARY: CPU PANIC Crashes on Alphaserver 1000 from Blue Moon Network Administrator on 1997-04-12 (tru64-unix-managers)

From: Blue Moon Network Administrator <root_at_net.bluemoon.net>
Date: Fri, 11 Apr 1997 16:28:26 -0400 (EDT)

Thanks to all who responded; there was some quite useful information on
debugging this type of crash.

I have included responses from:

jgmicon_at_sass165.sandia.gov
Kurt Carlson <sxkac_at_java.sois.alaska.edu>
Dave Cherkus <cherkus_at_homerun.unimaster.com>
Olle Eriksson <olle_at_cb.uu.se>
Knut Hellebø <Knut.Hellebo_at_nho.hydro.com>
TetraPakDA_at_t-online.de (Tetra Pak APS GmbH)
"Dr. Tom Blinn, 603-881-0646" <tpb_at_zk3.dec.com>

I apologize if I have left anyone out!

I haven't done anything except charge up my portable air tank in our shop and
give the AS1000 a heavy-duty blowing out, and the hardware errors have not
returned. I think we'll climate control this equipment room soon to help get
rid of some of the dust and particulate buildup.

During our next extended maintenance window I plan to take the box apart and do
the reseating game on the boards, SIMMs and chips, checking for clean contacts,
as well as a full "sweeping out" of anything which doesn't belong in there :)

I'll make a point of adding that to a regular maintenance schedule.

J. Henry Priebe Jr. President & Network Administrator
root_at_net.bluemoon.net Blue Moon Internet Services
sysop_at_bbs.bluemoon.net Blue Moon Online System
http://www.bluemoon.net Try MoonMUD! "telnet mud.bluemoon.net 4000"

Here are the collected responses I received:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Sender: jgmicon_at_sass165.sandia.gov

> Could errant software cause this or should I be
> looking for loose chips, simms or cards?

Believe it or not, our 2100 was hanging a lot and it
was due to excess dust! Pull out the CPU and memory,
blow off with dry, compressed air, and reseat. Works
like a champ!

Jeff

-- 
  __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ 
     Jeffrey G. Micono                            505.844.6767
     Ktech Corporation                            505.268.3379
  __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/

From: Kurt Carlson <sxkac_at_java.sois.alaska.edu>

>Apr  8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
>detected on cpu 0. Reporting suspended.
>Apr  8 15:28:21 net vmunix: Machine Check error corrected by processor
>Apr  8 15:32:44 net vmunix: fffd4
>Apr  8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error

You have a hardware problem, probably memory or CPU.
Check your error log: with uerf you'll see CPU exceptions; with
DECevent you may be able to tell exactly what is failing.

>Could errant software cause this 

No.

>or should I be looking for loose chips, simms or cards? 

Reseating cards might help, might not.
You need to find out what's having the problem: check the
error logs.
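
As a first pass before running uerf or DECevent, the syslog text itself can
be tallied. The following is a minimal Python sketch, not a replacement for
the binary error log tools: it only counts the three vmunix message types
quoted in this thread. The names summarize and SAMPLE are illustrative
assumptions, and the patterns are copied from the messages above, so they may
not match the wording on other OS versions.

```python
import re

# Patterns copied from the vmunix messages quoted in this thread.
CORRECTED = re.compile(r"Machine Check error corrected by processor")
SUSPENDED = re.compile(r"too many Processor corrected errors")
PANIC = re.compile(r"panic \(cpu \d+\): Machine check")
# Classic syslog timestamp, optionally preceded by quote marks such as ">".
TIMESTAMP = re.compile(r"^\W*([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2})")


def summarize(lines):
    """Count corrected-error, suspension, and panic messages in syslog text.

    Returns (counts, panic_times), where panic_times holds the timestamp
    string of every panic line that carries one.
    """
    counts = {"corrected": 0, "suspended": 0, "panic": 0}
    panic_times = []
    for line in lines:
        if CORRECTED.search(line):
            counts["corrected"] += 1
        if SUSPENDED.search(line):
            counts["suspended"] += 1
        if PANIC.search(line):
            counts["panic"] += 1
            stamp = TIMESTAMP.match(line)
            if stamp:
                panic_times.append(stamp.group(1))
    return counts, panic_times


# Hypothetical input: the log excerpt from the original question.
SAMPLE = """\
Apr  8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
detected on cpu 0. Reporting suspended.
Apr  8 15:28:21 net vmunix: Machine Check error corrected by processor
Apr  8 15:32:44 net vmunix: fffd4
Apr  8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error
""".splitlines()

if __name__ == "__main__":
    counts, panic_times = summarize(SAMPLE)
    print(counts)
    print(panic_times)
```

A "suspended" count above zero followed by a panic is the pattern described
in this thread; the detailed cause still has to come from the binary error
log.
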
_____________________________________________________________________
Kurt Carlson,      University of Alaska SOIS/TS,        (907)474-6266
sxkac_at_alaska.edu   910 Yukon Drive #105.63, Fairbanks,  AK 99775-6200

From: Dave Cherkus <cherkus_at_homerun.unimaster.com>

Blue Moon Network Administrator writes:
|> 
|> This has happened to me twice now.
|> 
|> the machine has been running with no problems for months. Solid as a rock.
|> 
|> Now I get this twice in a row within 60 minutes of each other:
|> 
|> Apr  8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
|> detected on cpu 0. Reporting suspended.
|> Apr  8 15:28:21 net vmunix: Machine Check error corrected by processor
|> Apr  8 15:32:44 net vmunix: fffd4
|> Apr  8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error
|> 
|> Could errant software cause this or should I be looking for loose chips, simms
|> or cards? The last thing we did was hack wuftpd to not allow logins for users
|> with the shell /bin/nologin which is for a POP only type of account.

This is definitely a hardware problem.

|> The machine and cables haven't been moved at all in a while and nothing major
|> has changed since it was stable.
|> 
|> I do have the messages file with all the addresses, but our service contract
|> with those thieves at DEC has expired and the dump info is just so much
|> gibberish.

Well, those thieves might also suggest you install a program called
DECevent that at least tries to decode the gibberish for you.

|> Do I have a definite hardware fault here or can software elicit such a crash?
|> Dust in the case of the 1000? Loose simms? Dirty card edges? Power is
|> conditioned through a UPS.

Hardware.  The CPU can correct errors in the cache chips or the memory
SIMMs.  Try reseating the card that the CPU is on and the memory SIMMs.
Don't try reseating the CPU itself - you will almost certainly bend a
pin and ruin the CPU.

I hate to say it, but I have had problems with this pattern and did have
to have the CPU card replaced.
-- 
Dave Cherkus ------- UniMaster, Inc. ------ Contract Software Development
Specialties: UNIX Internals/Kernel TCP/IP Alpha Clusters Performance ISDN
Email: cherkus_at_UniMaster.COM  When the music's over, turn out the lights!

From: Olle Eriksson <olle_at_cb.uu.se>

You have a bad memory card or a bad cache memory.

From: Knut Hellebø <Knut.Hellebo_at_nho.hydro.com>

Regards,
Try shutting down to PROM mode and do 'set d_group field;memory' to test
the memory. If anything fails do 'showit' to get the status from the
memory test (and hopefully the failing SIMM(s)). To interrupt the test
you have to hard reset. Good Luck ;-)
-- 
      ******************************************************************
      *         Knut Hellebø                     | DAMN GOOD COFFEE !! *
      *         Norsk Hydro a.s                  | (and hot too)       *
      * Phone: +47 55 996870, Fax: +47 55 996342 |                     *
      * Cellular Phone: +47 93092402             |                     *
      * E-mail: Knut.Hellebo_at_nho.hydro.com       | Dale Cooper, FBI    *
      ******************************************************************

From: TetraPakDA_at_t-online.de (Tetra Pak APS GmbH)

Hi,
I had a similar problem on my Alpha Server.
Changing the SIMMs solved it (we had DEC and third-party SIMMs).
In console mode, init the system; if it says something
about a correctable error, try exchanging the SIMMs.
Best regards
Claudia

From: "Dr. Tom Blinn, 603-881-0646" <tpb_at_zk3.dec.com>

> Now I get this twice in a row within 60 minutes of each other:
> 
> Apr  8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
> detected on cpu 0. Reporting suspended.
> Apr  8 15:28:21 net vmunix: Machine Check error corrected by processor
> Apr  8 15:32:44 net vmunix: fffd4
> Apr  8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error
> 
> Could errant software cause this or should I be looking for loose chips, simms
> or cards? The last thing we did was hack wuftpd to not allow logins for users
> with the shell /bin/nologin which is for a POP only type of account.

I passed your message along to one of the engineers who works with the
platform you are seeing the problem on, and he pointed out some things
you might not be aware of.

One class of processor correctable errors is single-bit memory errors.
These are corrected by the ECC logic, but an event is logged through
the "Machine Check" logic (an interface between the PALcode, which runs
in interrupt mode to deal with hardware problems, and the operating
system).  When there are large numbers of Processor corrected errors
in a short period, you get the message about reporting being turned
off (so the error logs don't grow without bound, but you'll know we
stopped logging the errors).

A double bit error would NOT be corrected, and would panic the system,
perhaps with a "Machine check - Hardware error".
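
To illustrate why a single-bit error is survivable and a double-bit error is
not, here is a toy Python sketch of a single-error-correct, double-error-detect
(SECDED) code. It uses a textbook extended Hamming code over 4 data bits. The
thread does not say which ECC code the AlphaServer memory system actually
uses, so the bit layout and the function names encode and decode are
assumptions for illustration only.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # codeword positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)    # Hamming parity bits; position 0 is overall parity


def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    bits = [0] * 8
    for k, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> k) & 1
    for p in PARITY_POSITIONS:
        # Each parity bit covers every position whose index has bit p set.
        bits[p] = sum(bits[i] for i in range(1, 8) if i & p and i != p) % 2
    bits[0] = sum(bits[1:]) % 2  # makes the parity of the whole word even
    return bits


def decode(bits):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for i in range(1, 8):
        if bits[i]:
            syndrome ^= i            # XOR of the indices of all set bits
    overall_odd = sum(bits) % 2 == 1
    if overall_odd:
        # Odd overall parity: exactly one bit flipped. The syndrome names it
        # (syndrome 0 means the overall parity bit itself was hit).
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even overall parity but a nonzero syndrome: two bits flipped.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = sum(bits[pos] << k for k, pos in enumerate(DATA_POSITIONS))
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                     # one flipped bit: recoverable
    print(decode(word))
    word[6] ^= 1                     # a second flipped bit: detected only
    print(decode(word))
```

In this sketch the "corrected" case corresponds to the logged-but-harmless
event, and "uncorrectable" to the case where the system has no safe option
left and panics.
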

You need to run UERF or DECevent (see the reference pages) against the
binary error log (which gets updated after the panic during the reboot
with the data from the hardware logout frame) and get the detailed log
of what made the system fail.  (Unfortunately, on your system, there
is no UERF support, so you have to use DECevent.)

Once you have that information, it's possible to tell exactly what went
wrong.

If you have a hardware support contract, I'd recommend you call this in
and ask that your system be repaired, because something's broken in the
hardware.  If you are self-maintenance, then you need to do the analysis
of the error log yourself and decide what components to replace.

Tom
 
 Dr. Thomas P. Blinn, UNIX Software Group, Digital Equipment Corporation
  110 Spit Brook Road, MS ZKO3-2/U20   Nashua, New Hampshire 03062-2698
   Technology Partnership Engineering           Phone:  (603) 881-0646
    Internet: tpb_at_zk3.dec.com           Digital's Easynet: alpha::tpb
     ACM Member: tpblinn_at_acm.org         PC_at_Home: tom_at_felines.mv.net
  Worry kills more people than work because more people worry than work.
      Keep your stick on the ice.        -- Steve Smith ("Red Green")
     My favorite palindrome is: Satan, oscillate my metallic sonatas.
                                         -- Phil Agre, pagre_at_ucsd.edu
  Opinions expressed herein are my own, and do not necessarily represent
  those of my employer or anyone else, living or dead, real or imagined.
 
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Received on Fri Apr 11 1997 - 22:50:55 NZST
