Hi,
Thanks to the following people for the quick replies:
- Pat O'Brien
- Dr. Thomas Blinn
- Selden E Ball Jr
- Peter Stern
Original question:
-----Original Message-----
Hello all,
We've had a couple of sudden panics on our ES40s, and can't figure out
exactly what caused them.
The machines went down at different times, probably within 12 hours of each
other.
There's nothing in the crash files, sys_check hasn't found anything, and
DEC Event (i.e. the dia command) has no events/information listed.
The /var/adm/messages file shows no indication of anything going wrong; it
just suddenly shows the entries for the system coming up again.
The console messages appearing were as follows:
halt code = 6
double error halt
PC = fffffc00004ced40
P00>>>
halted CPU 1
and then on the other machine ...
halt code = 5
HALT instruction executed
PC = fffffc00004d0040
This happened SUDDENLY, no warning, just BANG! and systems halted (at
different times).
There's no clustering, and the systems run independently of each other.
The ONLY guess we've had is that the main air conditioning unit failed
recently. One of the guys reported that the boxes still up (and running)
were HOT to touch. Not burning hot, but hot.
Could this just be a CPU overheating and shutting down? As I mentioned
above, there's no real help at the O.S. level, as the crash seems to have
happened so suddenly that the O.S. didn't pick anything up. We're running
Tru64 5.1, just the basic vanilla O.S.; no extra packages have been added
(apart from DEC Event), and the kernel contains nothing fancy.
I'm thinking it looks like a hardware-related type of panic, but exactly
what, I'm not sure. My only real guess would be a CPU overheating and
shutting down to protect itself?
---------
SUMMARY
---------
Well, it's good, as the problem hasn't happened again, but bad, because it
means we really don't know what caused it!
However, as explained to me, a double error halt basically means that
while the system was handling the first hardware error, a second hardware
error came along, and the system then crashed back to the console.
There are no logs or anything similar, because the OS simply doesn't get a
chance to do anything.
As for overheating, we're not really sure. The air-con was down, but the
faults happened more than 12 hours apart, so the air-con was probably OK
during one of the system crashes.
To check the temperature, you can issue the command:
sysconfig -q envmon
which reports the current temperature and the threshold temperature.
We're now monitoring this, and also using MRTG to record graphs of the
temperatures of the machines. This way we can at least confirm or
eliminate temperature as the cause if it happens again.
Oh, you need to have the Environmental Monitoring Subsystem configured for
the above to work, and the system has to also support temperature
monitoring.
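
For anyone wanting to script the check, here is a minimal Python sketch that
parses the output of the command above and formats it for MRTG. It assumes
the output consists of "name = value" lines and that the attributes are
called env_current_temp and env_high_temp_thresh; check both against what
your own system prints. The helper names and the target name "es40" are made
up for illustration. The four-line layout (two values, uptime, name) is what
MRTG expects from an external script.

```python
import re
import subprocess

# Assumed attribute names; verify against your own `sysconfig -q envmon` output.
TEMP_ATTR = "env_current_temp"
THRESH_ATTR = "env_high_temp_thresh"


def parse_sysconfig(text):
    """Collect 'name = value' lines into a dict; other lines are ignored."""
    attrs = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+)\s*=\s*(.*?)\s*$", line)
        if match:
            attrs[match.group(1)] = match.group(2)
    return attrs


def read_envmon():
    """Run the command from the summary and return its parsed attributes."""
    result = subprocess.run(
        ["sysconfig", "-q", "envmon"],
        capture_output=True, text=True, check=True,
    )
    return parse_sysconfig(result.stdout)


def headroom(attrs):
    """Degrees left between the current temperature and the threshold."""
    return int(attrs[THRESH_ATTR]) - int(attrs[TEMP_ATTR])


def mrtg_lines(attrs, uptime="unknown", name="es40"):
    """Format the two temperatures the way MRTG reads an external script:
    first value, second value, uptime string, target name."""
    return "\n".join([attrs[TEMP_ATTR], attrs[THRESH_ATTR], uptime, name])
```

A small wrapper that does print(mrtg_lines(read_envmon())) can then be used
as an MRTG target, and headroom() is handy for a simple cron-driven alert
when the margin gets small.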
Another handy suggestion, if the problem were to occur again, was to attach
a dumb serial terminal (plus a printer if possible).
This is a great way to capture any important information that may otherwise
fly off the screen during a system panic/dump.
Luckily we haven't had to do this (yet), but it is a great way to get more
information about what caused a crash.
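
If a spare machine is easier to find than a terminal and printer, the same
idea can be approximated in software. This is only a sketch of a substitute,
not what was suggested to us: it copies every line read from a text stream
to a log with a UTC timestamp, on the assumption that the stream is a serial
port wired to the console and already set to the right speed. The function
name is made up for illustration.

```python
import time


def log_console(stream, log, clock=time.time):
    """Copy each line from stream to log, prefixed with a UTC timestamp.

    Returns the number of lines copied once the stream ends.
    """
    count = 0
    for line in stream:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(clock()))
        log.write(f"{stamp} {line.rstrip()}\n")
        log.flush()  # flush each line so nothing is lost if the logger dies
        count += 1
    return count
```

In use, stream would be the serial device opened for reading (the device
path depends on the logging host) and log a file opened in append mode.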
Basically, crashes like these can be very hard to track down. If the
problem did recur frequently, and we could be 100% sure it was
air-conditioning related, then it would be a good bet that some hardware
component had cooked itself. However, finding out which particular hardware
component had burnt out would be the tricky bit.
It's also a good idea to check your air-con on a regular basis, and make
sure that the climate is monitored, that there are alerts in place, and that
they ACTUALLY *WORK*. Air-con seems to be one of those things that people
forget totally about until it fails. Redundant cooling units are a good
idea, and if that's too much $$$, then at least some sort of alarms
(software and/or hardware). I guess the lucky side is that it's warned us
that if the air-con does actually fail, it could cause SERIOUS hardware
damage and major system downtime.
As for the actual errors, in summary, they're hard to track down as there's
no real logging info. If the problem persists, the serial terminal would be
the best idea; that way you at least have a better chance of trapping the
error messages.
Thanks again to the above-mentioned people for their help.
Regards,
Dirk.
Received on Mon Feb 04 2002 - 16:06:24 NZDT