SUMMARY: getting info after a crash

From: Peyton Bland <bland_at_umich.edu>
Date: Tue, 31 Oct 2000 12:15:49 -0500

Hi,

Here is a belated summary. First the original question...

=====================

We are running 4.0F on a DS20 and have had no strange problems for the
first 6 months of operation. But within the last 2 weeks, the system has
twice stopped dead -- no response at the console or via the network. The
console screen is black, nothing diagnostic is written to
/usr/adm/binary.errorlog or messages or to any of the syslog logfiles, and
the front panel display shows only "AlphaServer DS20". Is there anywhere
else I can look for a clue as to what happened? (If not, I guess I'll have
to learn how to read a crash dump!)

But why did it crash? If we were using the deferred mode of swap
allocation, could that cause a crash of this sort? We are actually using
immediate mode on this system (i.e., a symlink exists from /sbin/swapdefault
to /dev/rz8b), but I've always been curious about the ramifications of
using deferred mode (which we use on other systems). By the way, we have
2 GB of memory and over 7 GB of swap on this system.

=====================
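
The swap-mode test mentioned in the question (immediate mode when
/sbin/swapdefault is a symbolic link, deferred mode when it is absent) can
be sketched as below. This is only an illustration: the function name is
hypothetical, and Python is used for readability rather than because it
ships with the OS.

```python
import os


def swap_allocation_mode(link="/sbin/swapdefault"):
    """Return ("immediate", target) if the symlink exists, else ("deferred", None).

    Assumption: as described in the question, the presence of the
    /sbin/swapdefault symlink is what selects immediate mode.
    """
    if os.path.islink(link):
        # readlink gives the swap device the link points at, e.g. /dev/rz8b
        return "immediate", os.readlink(link)
    return "deferred", None


if __name__ == "__main__":
    mode, target = swap_allocation_mode()
    print("swap allocation mode:", mode, "->", target)
```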

I hope the following summary isn't too long! I learned A LOT about Tru64
from the replies I received, so perhaps this information will be useful to
others. Here goes...

---------------

Tom Blinn ("Dr. Tom Blinn, 603-884-0646" <tpb_at_doctor.zk3.dec.com>) and Nick
Hill ("Hill, NM (Nick) " <N.M.Hill_at_rl.ac.uk>) pointed out that it was
probably not a crash but a hang. A crash would produce a "crash-data.<n>"
file in the /var/adm/crash directory, with a date and time stamp that
matches when the system was rebooted; I found no such files. In the case
of a hang, Tom writes...

"The system has a "halt" button, I believe, that should allow you to get
it into the console firmware; read the hardware documentation. If you
get into this hung state, and can force the system into the SRM console
firmware using the "halt" button, then you can issue a "crash" command
in the firmware, and that should force a crash, which should get you a
crash-data file (and a memory image and a copy of the running kernel) in
your /var/adm/crash directory. Then someone who knows how to interpret
that data might be able to figure out why the system is hanging."

Nick's advice was similar, as was Rubén Cortegoso's (Rubén Cortegoso
<rcortegoso_at_uolmail.com.ar>). Alan Davis sent a lengthy and very thorough
reply ("Davis, Alan" <Davis_at_tessco.com>); rather than reproduce it here, I
will make it available on request. Thanks also to Emmanuel Bove
(EBOVE_at_bouyguestelecom.fr).
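
The crash-versus-hang check described above (look in /var/adm/crash for
"crash-data.<n>" files and compare their time stamps with the reboot time)
could be sketched roughly as follows. The function name is hypothetical,
and the comparison against the reboot time is left to the reader.

```python
import glob
import os
import time


def list_crash_data(crash_dir="/var/adm/crash"):
    """Return (path, timestamp) pairs for crash-data.<n> files, newest first."""
    pattern = os.path.join(crash_dir, "crash-data.*")
    entries = []
    for path in glob.glob(pattern):
        entries.append((os.path.getmtime(path), path))
    entries.sort(reverse=True)  # newest modification time first
    return [
        (path, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime)))
        for mtime, path in entries
    ]


if __name__ == "__main__":
    found = list_crash_data()
    if not found:
        # No crash-data files at all suggests a hang rather than a crash.
        print("no crash-data files found")
    for path, stamp in found:
        print(stamp, path)
```

An empty result is what I saw on our system, which is what pointed toward a
hang.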

-----------------

David Rabjohns ("Rabjohns, David" <David.Rabjohns_at_logistics.nhs.uk>) passed
along the following:

"We had a very similar problem with a DS20 which, after lengthy investigation
turned out to be a fault with one of the CPUs. If you are getting no syslog
messages or crash dumps then this may very well be the case. One main
symptom of this kind of fault is that you will be unable to "halt" the
system when it hangs - if this is happening then I suggest you get an
engineer to look at the CPU(s)."

-------------------

Narendra Ravi ("Narendra Ra[a]vi" <narendra_at_spiff.hr.att.com>) wrote:

Most software typically leaves a footprint. We have been plagued with such
unexplainable crashes for our ES40s and 99% of the time we had to re-seat
or replace some piece of hardware (CPU boards, memory boards, etc).

To eliminate software/OS issues we have done the following:

1. Hard halt the system (hit the HALT button, not the power/reset buttons)
2. At the hardware prompt, run the following and record the results
   >>> e pc
   >>> e ps
   >>> e sp
   >>> e r26
3. Force a crash
   >>> crash
4. Reboot the system
   >>> boot
5. Run the sys_check utility after booting
   # sys_check -escalate

See if you can get Compaq support to analyze the results. Send them:
a. the escalate.tar file that is generated,
b. the vmunix,
c. the vmzcore and
d. the results of the register information gathered in step 2.
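
Before calling support, it may help to confirm that all four items in the
list above are actually on hand. The sketch below is only an illustration:
the function name is hypothetical, and every path in the example is a
placeholder (an assumption) to be replaced with the real locations on your
system.

```python
import os


def missing_support_items(items):
    """Return the labels of items whose file does not exist.

    `items` maps a descriptive label to the path where that file should be.
    """
    return [label for label, path in items.items() if not os.path.isfile(path)]


if __name__ == "__main__":
    # All paths here are placeholders, not documented locations.
    items = {
        "escalate.tar from sys_check": "/var/tmp/escalate.tar",
        "vmunix": "/var/adm/crash/vmunix.0",
        "vmzcore": "/var/adm/crash/vmzcore.0",
        "register notes from step 2": "/var/tmp/registers.txt",
    }
    for label in missing_support_items(items):
        print("still missing:", label)
```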

--------------------

Others had advice pertinent to a "real" crash and other things to check. I
will pass it along here:

1. Check kern.log
2. Use sys_check
3. Call Compaq and let them analyze the problem
4. At the SRM prompt, check the power/power supply information ("show power"
and "more el")
5. Try the dia utility

For these replies, thanks to:

Stan Horwitz <stan_at_astro.ocis.temple.edu>
Jim.R.Jones_at_Cummins.com
"Colin Walters" <walters_at_zk3.dec.com>
"Calvin Coghlan" <ccoghlan_at_ascensionhealth.org>
Nikola Milutinovic <Nikola.Milutinovic_at_ev.co.yu>

-----------------

I haven't had a chance to try many of these suggestions, as things are
going well now. In any case, my _sincere_ thanks to all who replied.

Peyton Bland


University of Michigan, Radiology
voice: 734-647-0849
FAX: 734-764-8541
e-mail: bland_at_umich.edu
URL: http://www.med.umich.edu/dipl/
Received on Tue Oct 31 2000 - 17:16:48 NZDT
