Thanks for the many great suggestions. It took us about 26 hours to
solve this one--I haven't stayed up that long since college! In hindsight
(of course) there were some obvious clues, but it took a couple of fresh
eyes to get me back on track. But the whole story is very, um,
convoluted.
As expected it did turn out to be a network issue. Our network is very
closely integrated with the Yale-New Haven Hospital network. In fact,
they VLAN some of our address space onto their wires for University folk
who have offices in hospital space. They were having major network
problems yesterday (none of which appeared to affect us). We found out,
however, that they had power-cycled some Ethernet-to-ATM converters at
noon and at 5: the exact times our boxes crashed. See, things get
weirder. The details are sketchy, but their ultimate problem turned out
to be a Sun with two Ethernet cards acting as a router between multiple
segments on their network. But how could cycling those things crash our
machines? See, weird.
Anyway, they had a Cisco engineer on site today and he offered to come
help with our problem, focusing on the network.
The key turned out to be this error:
  vmunix: malloc_wait:1: no space in map
I discovered that the standalone box would boot more-or-less normally
with the network disconnected. We could bang on it and it wouldn't hang.
But within 90 seconds of plugging the cable back in, it would spit out
the above error, and then any attempt to access process information--by
running 'ps' or 'w'--would cause the machine to hang. We now had a
reasonable target for sniffing--a mere 90 seconds' worth to one machine.
After we caught a trace we put the Sniffer and the Tru64 box on a
private switch and replayed the traffic from the Sniffer. Tru64 crashed
reliably. Then it was just a matter of eliminating packet types until we
isolated the killer. Ready....?
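For anyone who wants to repeat the elimination step, here is a minimal Python sketch of the idea. It is not what we actually ran on the Sniffer: the packet records, the "kind" field, and the crashes() callback (which stands in for "replay this traffic at the test box and see whether it hangs") are all hypothetical.

```python
from typing import Callable

Packet = dict  # hypothetical record, e.g. {"kind": "multicast-nfs"}


def isolate_killer(packets: list[Packet],
                   crashes: Callable[[list[Packet]], bool]) -> list[str]:
    """Return the packet kinds whose removal stops the replay from crashing.

    `crashes` is an assumed callback: it replays the given packets at the
    test machine and returns True if the machine hangs or panics.
    """
    kinds = sorted({p["kind"] for p in packets})
    culprits = []
    for kind in kinds:
        # Replay everything except one packet kind.
        remaining = [p for p in packets if p["kind"] != kind]
        if not crashes(remaining):
            # The machine survived without this kind, so it is a suspect.
            culprits.append(kind)
    return culprits
```

Removing one kind at a time is slow when there are many kinds, but with a 90-second trace each replay is short, so the simple approach is good enough.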
We found an Apple Airport that was sending 8000+ multicast packets per
minute to the NFS port, probably sucking up resources until the machine
died. As best we can figure out, the Airport was somehow driven insane
by the events on the Hospital network, though we're still trying to
figure that one out. The tight time correlation seems to point to that
link, though. Once we pulled the Airport off the network, all the
machines came back up normally.
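If you want to watch for this kind of flood in your own traces, a rough Python sketch follows. It assumes the capture has already been exported as (timestamp in seconds, source IP, destination IP, destination port) tuples, which is a made-up format for illustration. Port 2049 is the standard NFS port, and the 8000-per-minute threshold simply mirrors the rate we saw.

```python
import ipaddress
from collections import Counter

NFS_PORT = 2049  # standard NFS port; an assumption about which port was hit


def noisy_multicast_sources(records, threshold=8000, port=NFS_PORT):
    """Return sources sending >= threshold multicast packets to `port` in any minute.

    `records` is an iterable of (timestamp_seconds, src_ip, dst_ip, dst_port)
    tuples; this layout is hypothetical, not a real Sniffer export format.
    """
    per_minute = Counter()
    for ts, src, dst, dport in records:
        if dport == port and ipaddress.ip_address(dst).is_multicast:
            # Bucket by source and by whole minute of the capture.
            per_minute[(src, int(ts // 60))] += 1
    return sorted({src for (src, _), n in per_minute.items() if n >= threshold})
```

Counting per minute rather than over the whole trace keeps a long but quiet capture from hiding a short burst.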
Compaq is naturally quite interested in this. It's somewhat embarrassing
that of the 85 machines of various OSs in our data center, only the
Tru64 boxes collapsed from this.
So that's the saga. Thanks again for the help (there were some really
good ideas there) and hope you all have a nice Memorial Day weekend away
from the office. Me, I'm going to sleep through it :-) And watch out for
those Airports.
--Rick
> -----Original Message-----
>
> Here's one for you to mull over. I've gone through three shifts of
> software support guys (18 hours and counting) and one field service
> engineer and we're no closer to finding a solution than when we started.
>
> Yesterday just after noon (and again at 5) 7 of my 8 Tru64 boxes died.
> Six of them are 5.1A machines that make up three 2-node clusters. The 7th is
> an old Alphaserver 3000/900 running 4.0d. Some of the nodes hung and
> others crashed. The box that didn't crash is a Decstation 500 running
> 5.1. The 5.1A boxes had no patch kits on them.
>
> The boxes that crashed had the following error:
>
> vmunix: panic (cpu 0): ics_unable_to_make_progress: input thread stalled
>
> There is a patch in Patchkit 2 to deal with that. The machines that hung
> were still running (more or less) but interactive sessions got hung up.
> From the console I can open new decterms but issuing 'ps' or 'w' would
> lock up that session. I have since decided that maybe it's access to
> /proc that's actually hanging it. 'who' usually works but 'w', which
> shows the processes each id is using, doesn't.
>
> The big problem is that these issues have become permanent. I have,
> after some magical incantations I guess, gotten one of the clusters
> running again. A second cluster will only run one node at a time. The
> second node hangs at boot, usually after the line:
>
> CNX QDISK: Successfully claimed quorum disk, adding 1 vote.
>
> After some period of time, the running node will start to hang again and
> we have to shut it down. "shutdown" usually doesn't work as the node
> hangs on the way down.
>
> The third cluster and the standalone machine just won't run at all.
> They'll both come up but immediately hang.
>
> Another mystery. After the incident all of the nodes sporadically report
>
> vmunix: malloc_wait:1: no space in map
>
> I never saw this error before and neither have most of the people at
> Compaq apparently. The '1' is a counter and I've seen it over 75000 on
> one of the nodes.
>
> Since patches are always the solution, they gave us early access to patch
> kit 2. We installed it on one of the nodes; the other one failed the
> upgrade. So we deleted the cluster member and re-added it and now that
> node won't boot at all. We're supposed to boot genvmunix but it hangs
> after the 'claimed quorum' line. The machine we put the patchkit on no
> longer panics with the ics thread problem but it's still suffering from
> the hanging and 'no space in map' problem.
>
> It seems pretty likely that we got hit with a network event of some
> kind, though our intrusion detection system didn't pick up anything.
>
> Does anyone have any ideas as to more things we can try? We're getting
> pretty desperate here.
--
_______________________________________________________________________
Rick Beebe                                        (203) 785-6416
Manager, Systems & Network Engineering       FAX: (203) 785-3978
ITS-Med Production Services
Richard.Beebe_at_yale.edu
Yale University School of Medicine
Suite 214, 100 Church Street South, New Haven, CT 06519
_______________________________________________________________________
Received on Sat May 25 2002 - 02:54:59 NZST