New Year's Eve probably isn't the best time to be looking for
answers, but twice in the last couple of days our main NFS server has
hung. It's a DS20 (2/500) with 2.5 GB of memory, running Tru64 v5.1A PK4.
When I log in to the server, things appear more or less normal, but all
clients report "NFS server xxx not responding". We do see this message
occasionally in normal operation, but in this case the clients never recover.
Running /sbin/init.d/nfs stop and then start does not recover it. syslog shows:
Dec 31 15:03:55 spartha nfsd:[111457]: Can't bind UDP addr: Address already in use
This is probably because the old nfsd is failing to exit: in the output
of "ps" I see "nfsd" in state "U" (uninterruptible wait). Unfortunately
I don't know what state it was in before I tried stopping it...
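
One way to confirm that the NFS UDP port is still held before restarting the daemons is simply to try binding it. The following is a minimal sketch in Python, not something that was run on this server; the function name udp_port_in_use is made up for illustration, and it assumes NFS is on the conventional port 2049.

```python
import errno
import socket


def udp_port_in_use(port, host=""):
    """Return True if binding a UDP socket to (host, port) fails with
    EADDRINUSE, i.e. some other socket still holds the address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        # Anything else (e.g. a permissions problem) is a different failure.
        raise
    finally:
        # Always release our own probe socket again.
        sock.close()
    return False
```

Calling udp_port_in_use(2049) after the "stop" step would return True as long as the old nfsd still owns the port, which would match the "Can't bind UDP addr" message above.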
Finally, halting the system also fails - it hangs (no messages visible
- blue screen after X shuts down).
I probably should also have looked at the output of "ps axml" to see
the state of the kernel threads, but I only looked at this part of the
man page after restarting, so will have to wait for next time...
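
For next time, a saved copy of the "ps" output could be filtered for anything stuck in state "U". This is only a sketch: the helper name stuck_processes is hypothetical, the column headers ("S" for state, "COMMAND" last) are assumptions about the listing format, and the sample text in the usage note is made up rather than captured from the server.

```python
def stuck_processes(ps_text, state_header="S", state="U"):
    """Return the rows of a ps-style listing whose state starts with `state`.

    Assumes the first non-blank line is a header row, that the command is
    the last column, and that no other column contains spaces.
    """
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    if state_header not in header:
        raise ValueError("no %r column in header" % state_header)
    rows = []
    for line in lines[1:]:
        # Split into at most len(header) fields so the command keeps its arguments.
        fields = line.split(None, len(header) - 1)
        if len(fields) != len(header):
            continue  # skip lines that do not match the header layout
        row = dict(zip(header, fields))
        if row[state_header].startswith(state):
            rows.append(row)
    return rows
```

Given a made-up listing with the header "PID S COMMAND" and a line "111457 U nfsd", the function returns a single row for that nfsd process; rows in other states are dropped.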
The server does have a lot of NFS clients. It was running with 32 nfsd
threads each for TCP and UDP, though as most of the clients use UDP, I
may reduce the TCP thread count and raise the UDP count.
Some local software (in /usr/local) was updated over the past few days
- things like perl, openssl, stunnel, and so on - but it's hard for me
to imagine how that could be related.
Any ideas on a possible cause (or solution)?
G.
--
-------------------------------------------------------------------------
Graham Allan - I.T. Manager - gta_at_umn.edu - (612) 624-5040
School of Physics and Astronomy - University of Minnesota
-------------------------------------------------------------------------
Received on Wed Dec 31 2003 - 21:33:22 NZDT