The issue was eventually resolved with the awesome
assistance of Alex on third shift software support
at Compaq. The symptoms boiled down to
a failure to recognize the swap devices as valid
devices. The hypothesis was that the device
database had become corrupted. How this could have
happened as a result of a runaway Samba daemon is
open to conjecture and I'd welcome any further
ideas anyone has.
The final fix involved booting from the Emergency
Repair disk and restoring the cluster_root#root
file system from the day before. On a personal
note of defeat, I didn't save the corrupted files
for further forensic analysis.
We did not attempt a network boot since it would
have required reconfiguring one of the happy
servers and we don't actually own these machines
yet.
Thanks to Paul Henderson and Larry Clegg who
offered advice.
Julie L.C. Thomas
UNIX Systems Administrator
ERCOT
Taylor, TX
512.248.3117
---Original post---
Hi there,
1) We have two ES40's in a cluster, Tru64 v5.1,
TruCluster 5.1. This morning when we arrived there
were about 2000 instances of the SAMBA daemon
running. They couldn't be killed since the other
node respawned them. All system resources were
consumed such that we couldn't log in, so we
attempted a hard reboot on each machine
individually. The boot hangs after a SCSI bus
reset and after the quorum disk is added to the
cluster.
Any ideas as to what has happened?
2) It has been suggested that I set up a boot
across the network since I have two identical
redundant machines at my site (the problem
children are at our backup site). Has anybody done
this? We're going to try to boot from CD-ROM
first, but I'd like to have the network boot as a
last resort and potential source of a "good"
kernel file. I'll be RTFMing, but any and all
pointers would be most appreciated. I'll summarize
when my blood pressure returns to normal.
Received on Tue Apr 17 2001 - 22:49:35 NZST