The problem originally appeared to be hardware, but was posted here
because it had gone on for months and DEC was scratching their heads.
It now appears to have been a processor card. There were at least two
bad spares in DEC Denver, which led to much trial-and-error debugging and 50
or more crashes.
Allan Rollow pointed out that machine checks were almost always hardware.
I know, given the size of the news spool and number of files on it,
that fsck on DU is fast and UFS is hardy. (Given the tenuous nature of
this post, I should say "appears to be.")
This problem was a continuation of something which started up last summer.
This is what may have happened:
1. The machine crashed hard - it hung so badly that even the halt button
would not bring up >>>.
The interval between crashes ranged from a few hours to a few weeks. It
started after setting up a 7-disk LSM volume. What a tempting red
herring that was - just blame LSM.
A console printer was set up, because it was the only way of capturing
any information on the crashes.
2. DEC decided the problem was processor or memory and did a lot of card
swapping. They introduced a bad processor card.
3. The machine began crashing every few hours and by elimination it didn't
appear to be processor or memory, so DEC replaced the i/o module, CPU
backplane, and tried each of the two power supplies. That fixed the
original problem of the infrequent crash, but the machine was still crashing
due to one of the new but faulty processor cards.
4. DEC went back to believing there was a processor card problem and
ordered one from out-of-state.
5. The machine appears to work but it will take a couple of months to
really know.
Suggestions which I'll keep for future crashes are:
power (George Gallen)
SMP locking (Karl Marble)
ping (Kurt Carlson)
audit daemon with -d (Jon Buchanan)
John Nebel
The original post was:
A 2100 with DU 3.2c crashes frequently (only several hours of uptime) with
two processors and runs longer, days or weeks, with one processor. DEC has
tried replacing processors, memory, i/o module, cpu backplane and power
supplies to no avail.
Machine check info is printed on the console printer and the system just
hangs at that point, so it is impossible to get a crash dump - a hard
reset is necessary to get the thing going again.
The application is innd (7xrz29 LSM news spool). There are a few netscape
clients (output to x-terms) running from time to time and DECnet is running
but not terribly active except when backing up / and /usr to VMS. I use
tkined for network monitoring so occasionally that is running too. The 3
PCI slots have an FDDI controller and 2 plain disk controllers. There are no
external terminators on the PCI disk controllers' rear bulkhead connectors,
but they are strapped for active termination. The disks are internal. The
standard i/o module does have an external rear bulkhead terminator.
DEC has the console logs and has been asked several times whether they
think the problem is software and the answer is no.
Any ideas other than moving the applications to another machine?
Thanks.
John Nebel
Received on Sat Dec 14 1996 - 19:42:57 NZDT