Hello managers,
Today our AlphaServer hung, failing to service any NFS requests. It
still showed a minimal sign of life. The pointer moved on the local X
server (though it couldn't do anything). I did not have time to look
for much else.
This machine does not have any "user" accounts. The only local activity that
was going on was the completion of a Networker (single save and restore)
backup that had been delayed a few days due to lack of media. I inserted
used media, mounted and labeled it, and Networker proceeded to catch up (using
interleaved session mode) with the backlog. I watched the Networker display
for a minute or two; it indicated a throughput of about 700 KB/sec (DLT!).
Not more than a minute after that, the whole plant noticed "NFS server xxxx
not responding"! I returned to the console and looked: the disk drive LEDs
were quiet, the Networker backup window indicated that EOF had been written,
but I could do nothing. Pressing the halt button was the only way to recover.
I looked at all the logs I know of, messages, syslog.dated and uerf but
no errors were logged. No core files either (I forgot to "crash" it!).
During the 4 day interval of waiting for new DLT media to come in, I had
mounted some new filesystems (and renamed one AdvFS). I also added another
600 MB of new data and removed an equivalent amount from another partition.
My (somewhat lame) theory is that Networker somehow caused this because it
was surprised by the changes in configuration (particularly the filesystems).
Any theories, suggestions, opinions or similar stories would be greatly
appreciated. I will summarize. Details follow:
I have an AlphaServer 2100 4/200 with one CPU running DEC OSF/1 V3.2A
(Rev. 17), firmware revision 3.9 and PALcode OSF version 1.35. It has
128 MB of memory and a SWXCR RAID controller. The / and /usr filesystems
are AdvFS on RAID level 0 (two 1 GB DEC disks), with 128 MB of swap. It is
an NFS server for about 60 workstations and 60 X Terminals.
On the internal/external non-RAID SCSI bus we have 5 devices (probably 2
too many): an RD43 CD-ROM, a TZ87 DLT tape drive and three 9 GB disks. On a
number of occasions during a reboot the CAM subsystem has reported an unreadable
block on one of the disk drives; unfortunately, this is inconsistent! In the past,
after cycling power and rebooting a few times we have been able to get past
this, get the disk back on line and fsck it.
We have 17 partitions, a mixture of UFS and AdvFS with roughly 18 GB of
data.
The Networker 3.0A product is the single save/restore version that came with 3.2.
I seem to recall adding some kernel patches before installing this.
--
---------------------------------------------------------------------------
Ted Asocks tra_at_ucolick.org
Systems Administrator VOICE: (408)459-4020
UCO/Lick Observatory FAX: (408)454-9863
Received on Thu Oct 19 1995 - 01:53:26 NZDT