Digital UNIX abrupt halt under heavy I/O load

From: Norman Wilson <norman_at_hprc.utoronto.ca>
Date: Sun, 12 Nov 95 14:32:32 -0500

I have a new Aspen Alpine Alpha system which occasionally halts
abruptly under heavy I/O load. It's not clear yet whether the
trouble is in the hardware or the software; if you've had a similar
experience, or have thoughts about what might be going awry, please
let me know.

The system is an Alpine 275XS, with a 275MHz EV45 (21064A), 2MB of
secondary cache, and 64MB of main memory. Digital's software (both
UNIX and the console) calls it a PCI Evaluation Board; I don't know
whether that means it's DEC's board, or just DEC's console ROM.
The system has an on-board SCSI host adapter (logically connected
via PCI), and three PCI slots (occupied by an SMC Ethernet card and
a Qlogic SCSI adapter). The only SCSI peripherals at the moment are
an RRD42 and a Seagate Hawk 1GB disk.

`Abrupt halt' means
        halt code = 7
        machine check while in PAL mode
        PC = 14a14
        >>>

(I don't know enough about Alphas or Digital UNIX yet to know where
in memory to poke around to learn more about the machine check, or
if there's some way I can force a crash dump. Suggestions welcome.)

`Heavy I/O load' means that the problem shows up
- while installing Digital UNIX from the RRD42 to the Seagate disk.
The system sometimes halts while setld is installing software (more
often during the `Base System' subsets, sometimes in later subsets);
once it got all the way through, but halted during the automatic kernel
build. Reasonably often--at least half the time--the installation
completes; this is an intermittent problem.
- under synthetic heavy I/O load: while UNIX is running, if one does
        # while :; do dd </dev/rrz0c >/dev/null bs=100k; done &
        # while :; do (cd /usr/bin; cat * >/dev/null); done &
        # while :; do (cd /usr/sys/BINARY; cat * >/dev/null); done &
        # while :; do du -a / >/dev/null; done &
the system will halt after a few hours. (I don't know how much of
the parallel I/O activity is really needed; there is so much of it
because I want to be sure the I/O comes from the disk, not from 64MB
of buffer cache).
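
The four loops above could also be wrapped in a small script, for
instance to bound the number of passes while experimenting. What
follows is only a sketch: RAW_DISK, PASSES, and the function names
are illustrative assumptions, not part of the test as it was run.

```shell
#!/bin/sh
# Sketch only: RAW_DISK, PASSES and all function names are illustrative.
RAW_DISK=${RAW_DISK:-/dev/rrz0c}   # raw disk partition used in the loops above
PASSES=${PASSES:-0}                # 0 means keep going until interrupted

# Run a command repeatedly: forever if PASSES is 0, otherwise PASSES times.
repeat() {
    n=0
    while [ "$PASSES" -eq 0 ] || [ "$n" -lt "$PASSES" ]; do
        "$@"
        n=`expr $n + 1`
    done
}

# One pass of each kind of I/O; all output is discarded.
read_raw()  { dd <"$RAW_DISK" >/dev/null bs=100k 2>/dev/null; }
read_dir()  { (cd "$1" && cat * >/dev/null 2>&1); }
walk_tree() { du -a "$1" >/dev/null 2>&1; }

# Start the same four streams in parallel and wait for them.
start_load() {
    repeat read_raw &
    repeat read_dir /usr/bin &
    repeat read_dir /usr/sys/BINARY &
    repeat walk_tree / &
    wait
}

# Only start the load when explicitly asked to.
if [ "${1:-}" = run ]; then
    start_load
fi
```

Invoked with the argument run and PASSES left at 0, this should behave
like the four loops above; a small PASSES value gives a bounded trial.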

`Occasionally' means it doesn't always happen, but it happens often
enough that I am worried about putting the system to real use before
I understand the problem better.

I have tried Digital UNIX 3.2B and 3.2C, and tried plugging the SCSI
peripherals into the motherboard host adapter and the Qlogic board
(and, in the former case, have run the system with the unused Qlogic
board both present and absent from the system). I don't have enough
data to make real statements, but it feels like the halts are less
likely when the system is running 3.2C, and when the peripherals are
plugged into the motherboard SCSI adapter, and perhaps when the
Qlogic board isn't in the system at all.

To confuse matters further, we recently upgraded from SRM console
X4.1 to X4.4, which definitely made the Qlogic SCSI adapter work
better. Under X4.1, the system exhibited a number of odd symptoms
when the Qlogic board was in the system: the console couldn't see
any SCSI devices for a minute after power-up or init, and there were
complaints of read timeouts while booting. Under X4.4, those symptoms have
vanished, leaving only the mysterious halts.

Norman Wilson
High Performance Research Computing
University of Toronto
norman_at_hprc.utoronto.ca
Received on Sun Nov 12 1995 - 21:03:23 NZDT