Hello,
First, let me give you some background; the Tru64 relevance comes
near the end.
I have a Sun Ultra 10 running Solaris 7 that is an NFS server to 47
computers, about half being Alphas running Tru64 4.0D through 5.0 and
the other half being Red Hat Linux 6.2 boxes, with one other Solaris 7
thrown in for good measure. This collection of computers is used by
physicists who do a good deal of computational work with some heavy
I/O. I try to get them to keep their computational I/O local to the
machine, but this doesn't always work. At any rate, this is a busy
network that works very well the vast majority of the time.
However, on two occasions, six months apart, I have run up against a
situation where the NFS server (the Sun box) is being hammered, with
the kernel running at 90-99% CPU. Needless to say, during this time
the whole network dance comes to a grinding halt. I discovered that if
I stop NFS on the Sun, the Sun becomes happy. If I start it again, even
with only one file system exported to one machine, the kernel usage once
again goes through the roof (all the other clients still expect to be
served by the Sun).
During the first episode (6/00), I assumed that the problem was with the
Sun and worked with Sun Microsystems, which resulted in the installation
of a patch that relates to Solaris autonegotiating through a Cisco router.
We had just upgraded our Cisco, and the patch appeared to have fixed the
problem. In hindsight, I now realize that I also rebooted the Alphas at
the same time (see below).
But lo and behold, the problem with the Sun kernel running at 95+% reared
its head once again this week. This time I noticed that the switch that
eleven of the Alphas are connected to was being hammered (it is a new
switch, with pretty lights, installed since the first episode). I started
poking around on the Alphas and saw that the NFS client call counts
(nfsstat -c) were astronomical, in the millions. I'd issue an nfsstat -z
to clear the counters and then immediately do an nfsstat -c, and the
client calls would be in the thousands, if not tens of thousands, within
a 15-second time span! No wonder the poor Sun was swamped.
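For anyone who wants to compare clients on the same footing, here is a
minimal sketch of turning two nfsstat -c readings into a calls-per-second
figure. It assumes you read the total client call count off the nfsstat
output by hand (the output layout differs between Tru64, Solaris, and
Linux, so no parsing is attempted). The function names and the 500
calls/s threshold are hypothetical, chosen only for illustration.

```python
def calls_per_second(calls_before: int, calls_after: int, interval_s: float) -> float:
    """Average NFS client call rate between two counter readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if calls_after < calls_before:
        raise ValueError("counter went backwards; was it reset between readings?")
    return (calls_after - calls_before) / interval_s


def is_suspicious(rate: float, threshold: float = 500.0) -> bool:
    """Flag a client whose rate exceeds an assumed, site-specific threshold."""
    return rate > threshold


if __name__ == "__main__":
    # Illustrative numbers only: counters zeroed with `nfsstat -z`, then
    # 30,000 client calls read from `nfsstat -c` 15 seconds later.
    rate = calls_per_second(0, 30_000, 15)
    print(f"{rate:.0f} calls/s, suspicious: {is_suspicious(rate)}")
```

Running the same measurement on a healthy client gives a baseline, which
is a more sensible basis for a threshold than the placeholder used here.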
I tried stopping NFS client services on the Alphas and then restarting
them, but this did not fix the problem. I ended up having to reboot each
Alpha. For each one that was rebooted, the kernel usage on the Sun would
drop about 10%. Now that all the Alphas are rebooted, everything is back
to normal and running smoothly. So this really is looking like a problem
with Tru64 to me.
Now the question: has anyone seen this sort of behavior? Is there a patch
out there that everyone but me knows about? Any advice will be greatly
appreciated!
--
Tom Combs E-mail: combs_at_magnet.fsu.edu
National High Magnetic Field Laboratory Phone: (850) 644-1657
1800 E. Paul Dirac Drive Tallahassee, FL 32310
Received on Wed Jan 10 2001 - 16:19:00 NZDT