Here's one for you to mull over. I've gone through three shifts of 
software support guys (18 hours and counting) and one field service 
engineer and we're no closer to finding a solution than when we started.
Yesterday just after noon (and again at 5), seven of my eight Tru64 boxes died. 
Six of them are 5.1A machines that make up three 2-node clusters. The seventh is 
an old Alphaserver 3000/900 running 4.0d. Some of the nodes hung and 
others crashed. The box that didn't crash is a Decstation 500 running 
5.1. The 5.1A boxes had no patch kits on them.
The boxes that crashed had the following error:
vmunix: panic (cpu 0): ics_unable_to_make_progress: input thread stalled
There is a patch in Patchkit 2 to deal with that. The machines that hung 
were still running (more or less) but interactive sessions got hung up. 
From the console I can open new decterms, but issuing 'ps' or 'w' would 
lock up that session. I have since decided that maybe it's access to 
/proc that's actually hanging it. 'who' usually works but 'w', which 
shows the processes each id is using, doesn't.
The big problem is that these issues have become permanent. I have, 
after some magical incantations I guess, gotten one of the clusters 
running again. A second cluster will only run one node at a time. The 
second node hangs at boot, usually after the line:
CNX QDISK: Successfully claimed quorum disk, adding 1 vote.
After some period of time, the running node will start to hang again and 
we have to shut it down. "shutdown" usually doesn't work as the node 
hangs on the way down.
The third cluster and the standalone machine just won't run at all. 
They'll both come up but immediately hang.
Another mystery. After the incident all of the nodes sporadically report:
vmunix: malloc_wait:1: no space in map
I never saw this error before and neither have most of the people at 
Compaq apparently. The '1' is a counter and I've seen it over 75000 on 
one of the nodes.
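A rough way to track that counter per node, assuming the messages land in a syslog-style file in exactly the form quoted above. The function name and the sample lines are made up for illustration; they are not real log output.

```python
import re

# Pattern assumed from the single message quoted above:
#   vmunix: malloc_wait:<counter>: no space in map
MALLOC_WAIT = re.compile(r"vmunix: malloc_wait:(\d+): no space in map")


def highest_malloc_wait(lines):
    """Return the largest malloc_wait counter seen, or None if no line matches."""
    counters = []
    for line in lines:
        match = MALLOC_WAIT.search(line)
        if match:
            counters.append(int(match.group(1)))
    return max(counters) if counters else None


# Hypothetical sample lines, for illustration only.
sample = [
    "vmunix: malloc_wait:1: no space in map",
    "vmunix: malloc_wait:75012: no space in map",
    "vmunix: CNX QDISK: Successfully claimed quorum disk, adding 1 vote.",
]
print(highest_malloc_wait(sample))
```

Running it against each node's message log would at least show which nodes are climbing fastest.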
Since patches are always the solution they gave us early access to patch 
kit 2. We installed it on one of the nodes; the other one failed the 
upgrade. So we deleted the cluster member and re-added it, and now that 
node won't boot at all. We're supposed to boot genvmunix but it hangs 
after the 'claimed quorum' line. The machine we put the patchkit on no 
longer panics with the ics thread problem but it's still suffering from 
the hanging and 'no space in map' problem.
It seems pretty likely that we got hit with a network event of some 
kind, though our intrusion detection system didn't pick up anything.
Does anyone have any ideas as to more things we can try? We're getting 
pretty desperate here.
-- 
_______________________________________________________________________
   Rick Beebe                                            (203) 785-6416
   Manager, Systems & Network Engineering           FAX: (203) 785-3481
   ITS-Med Production Systems                    Richard.Beebe_at_yale.edu
   Yale University School of Medicine
   Suite 124, 100 Church Street South           http://its.med.yale.edu
   New Haven, CT 06519
_______________________________________________________________________
Received on Fri May 24 2002 - 10:49:44 NZST