I sent mail about a Dec 2000/300 crashing (message below).
The 3 responses I got suggested replacing the disk. Unfortunately
I forgot to mention in my initial message that I had replaced
the system disk.
I will now try to follow Dr. Tom Blinn's suggestion of connecting a
logging terminal as the console so I can get the actual error
when the machine crashes. As it is set up now, the error scrolls off
the console before I can read it.
Thanks for the replies - the complete reply from Dr. Tom Blinn
is included below.
---------- Forwarded message ----------
Date: Fri, 26 Mar 99 08:40:06 -0500
From: "Dr. Tom Blinn, 603-884-0646" <tpb_at_doctor.zk3.dec.com>
To: Valerie Caro <valerie_at_cs.umass.edu>
Cc: tru64-unix-managers_at_ornl.gov
Subject: Re: Dec 2000/300 hanging and crashing
> I have a Dec 2000 model 300 running Digital Unix 4.0D + patchset 3.
> It has been giving me problems for the past few months. At
> first it would just lock up. We would need to turn it off, and
> back on to reboot it. There would be no errors anywhere.
>
> Now it keeps crashing. The past few times, I get an error on the
> screen:
>
> Machine Check - Hardware error
>
> Dump.prom: I/O error in SCSI 1 6 0 0 0 0 JANS-IO, block 262144
> Dump : Failed to write initial header to primary swap 0x800001, err 5
>
> It does not reboot by itself, and usually will not reboot on the
> first try.
>
> I get no errors in any of the log files - and no crash dump.
>
> We have replaced the motherboard, the video board, the disk controller,
> the ethernet board, the power supply, the mouse. We have tested the memory
> and moved it around.
>
> Does anyone have any ideas on what the problem might be?
> Why is it getting an error doing the crash dump?
>
> Any assistance or suggestions would be greatly appreciated.
>
> ---
> Valerie Caro Computer Science Computing Facility,
> valerie_at_cs.umass.edu LGRC Room A313,
> University of Massachusetts
> Amherst, MA 01003
From the information you provide, it is impossible to tell what's really
going wrong. When you say you saw the message "Machine Check - Hardware
error" "on the screen", I assume you are running the system with console set
to "graphics". I have to strongly recommend you get a logging terminal (a
hardcopy terminal or a PC with a logging terminal emulator that can talk to
its serial port, or a VT series terminal with an attached printer), and you
start capturing ALL the messages that come out on the console.
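For anyone using a PC as the logging terminal, here is a minimal sketch of the capture step in Python. It is only an illustration, not a tested console logger: the function name log_console, the device path and the log file name are all hypothetical, and it assumes the serial line speed and parity have already been set outside the script (for example with stty) to match the console.

```python
import time


def default_stamp():
    """Return the current local time as a sortable string."""
    return time.strftime("%Y-%m-%d %H:%M:%S")


def log_console(stream, log, stamp=default_stamp):
    """Copy console output from a binary stream into a text log.

    Each line gets a timestamp prefix and is flushed immediately, so the
    last messages before a crash are already on disk. Returns the number
    of lines written.
    """
    count = 0
    # readline() returns b"" at end of input, which ends the loop.
    for raw in iter(stream.readline, b""):
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        log.write(f"{stamp()} {text}\n")
        log.flush()
        count += 1
    return count


# Hypothetical usage (device path and file name are assumptions):
#
#     with open("/dev/ttyS0", "rb", buffering=0) as port, \
#          open("console.log", "a") as log:
#         log_console(port, log)
```

Flushing after every line is deliberate: the point of the logging terminal is to keep whatever the console printed last, even if nothing else survives the crash.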
The "Machine Check - Hardware error" message means you're getting a hardware
error -- that's a no-brainer.
When the kernel tries to panic, it's going to try to call back into console
firmware to write the crash dump into the swap space, because the console is
supposed to be able to do I/O to all of the disks, and the kernel assumes it
may have trashed its own I/O subsystem including its disk drivers.
The message "Dump.prom: I/O error in SCSI 1 6 0 0 0 0 JANS-IO, block 262144"
is coming out of the console firmware. If you look at your running system,
and use for example "swapon -s" or look for the symbolic link in /sbin that
points to the primary swap area (ls -l /sbin/swapdefault) you should be able
to determine where the primary swap is located. It's usually partition "b"
on the boot disk.
The message "Dump : Failed to write initial header to primary swap 0x800001,
err 5" is probably coming out of the kernel, from the routine that's called
down into the firmware, because the firmware will have returned an error to
the kernel.
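Once the console output is captured, the two dump messages can be picked out of the log mechanically. The following Python sketch is hedged accordingly: the regular expressions are modelled only on the two lines quoted in this message, the name summarize is hypothetical, the 512-byte block size is an assumption used to turn the block number into a byte offset, and the error number is looked up in the errno table of whatever machine runs the script (on common UNIX systems 5 is EIO, a generic I/O error).

```python
import errno
import re

# Patterns modelled on the two messages quoted in this thread.
PROM_RE = re.compile(
    r"Dump\.prom: I/O error in (?P<device>.+?), block (?P<block>\d+)"
)
DUMP_RE = re.compile(
    r"Dump\s*: Failed to write initial header to primary swap "
    r"(?P<dev>0x[0-9a-fA-F]+), err (?P<err>\d+)"
)


def summarize(lines, block_size=512):
    """Extract dump-failure details from captured console lines.

    block_size is an assumption (512-byte blocks); adjust if the
    firmware counts blocks differently.
    """
    findings = []
    for line in lines:
        m = PROM_RE.search(line)
        if m:
            block = int(m.group("block"))
            # Firmware message: device string, block, assumed byte offset.
            findings.append(
                ("firmware", m.group("device"), block, block * block_size)
            )
            continue
        m = DUMP_RE.search(line)
        if m:
            err = int(m.group("err"))
            # Kernel message: swap device number, errno, symbolic name.
            findings.append(
                ("kernel", m.group("dev"), err,
                 errno.errorcode.get(err, "unknown"))
            )
    return findings
```

Under the 512-byte assumption, block 262144 corresponds to a byte offset of 134217728 (128 MB) on the device the firmware names. The script does not try to interpret the "1 6 0 0 0 0" fields or the 0x800001 device number; it only reports them as printed.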
In the default file system layout, the /var hierarchy (where your log files
are located on disk) is in the /usr hierarchy, and /usr is in partition "g"
on the boot/root disk.
If you can't do I/O to that disk, because of a hardware failure (such as a bad
disk controller, or a bad cable, or a bad disk, which you don't say that you
have replaced), then the system will try to panic, but it won't be able to do
a crash dump, it won't be able to update the on-disk message log, it won't be
able to write a disk I/O error into the binary error log, it will just fail,
and it might not restart until it's been power cycled if the disk error won't
clear on a bus reset.
So, an EDUCATED GUESS (absent more information) is you've got a bad disk as
your root/boot/system disk, and you should move the system software off the
current disk and see if the problems go away.
Tom
Dr. Thomas P. Blinn + UNIX Software Group + Compaq Computer Corporation
110 Spit Brook Road, MS ZKO3-2/U20 Nashua, New Hampshire 03062-2698
Technology Partnership Engineering Phone: (603) 884-0646
Internet: tpb_at_zk3.dec.com Digital's Easynet: alpha::tpb
ACM Member: tpblinn_at_acm.org PC_at_Home: tom_at_felines.mv.net
Worry kills more people than work because more people worry than work.
Keep your stick on the ice. -- Steve Smith ("Red Green")
My favorite palindrome is: Satan, oscillate my metallic sonatas.
-- Phil Agre, pagre_at_ucsd.edu
Yesterday it worked / Today it is not working / UNIX is like that
-- apologies to Margaret Segall
Opinions expressed herein are my own, and do not necessarily represent
those of my employer or anyone else, living or dead, real or imagined.
Received on Fri Mar 26 1999 - 13:59:51 NZST