We have a mission-critical general login machine that is suffering from
poor NFS client performance. What (if anything) can we do to improve
response times?
Details: The general login machine is an AlphaServer 4100 with 2
processors and 1GB of RAM running Digital UNIX v3.2G. It has three
10-megabit Ethernet interfaces: one to our general net, one to a net
dedicated to our primary NFS fileserver (serving home directories), and a
third full-duplex link connected directly to our mailserver.
Our mailserver, another AlphaServer 4100 (2 processors, 512MB RAM, DU
3.2F) is presently NFS-serving our /mailshare/spool directory to all of
our UNIX hosts. It also has three network interfaces; two to general nets,
and one the other half of the full-duplex link to our main login server.
Mail is being served on a 16-GB RAID 5 array. Access to the /mailshare
filesystem from this server (local - not NFS) is quick even during times
of high client load.
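For context, the export itself is nothing exotic; it boils down to a single
/etc/exports entry on the mailserver along the lines of the following (the
host list is only a placeholder for our client machines, and I've left out
any exports(4) options):

/mailshare/spool    fas <other-unix-hosts>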
Our main problem occurs when 350 or more users log in to our general login
machine. About 90% of our users run PINE, so most sessions make extensive
use of our mailserver. Once we have
about 350-400 users logged in, performance (delays while PINE checks for
new mail) becomes poor. After 450 users log in, performance is horrible;
logins take forever while tcsh checks for new mail, and all mail-related
activities are unusable. Performance on all of our other hosts (with loads
varying from 5 users to 150 users) is quite good. It is ONLY our general
login server that grinds to a halt.
It is this last fact that leads me to believe the problem is a client-side
issue. I'm more than willing to be proven wrong, however.
[Even as I type this message, my screen is being disrupted with messages
like the following:
Oct 1 00:29:08 fas vmunix: NFS3 RFS3_LOOKUP failed for server husc-33: RPC: Timed out
Oct 1 00:29:55 fas vmunix: NFS3 RFS3_CREATE failed for server husc-33: RPC: Timed out]
What we've tried:
-- upping the number of nfsd's on the server (a rough sketch of the current
settings follows this list) - it helped when we went from 24 to 32; it's
now at 56, and performance is about the same as it was at 32.
-- raising and lowering the number of NFSIOD's - we're currently at 20. I
tried setting it to 64 to see what would happen, and nothing really
noticeable occurred. I also tried 7; again, it didn't make a big
difference.
-- adding "timeo=300" to our mount - the mount is a _soft_ mount; this
option seems to help mask the symptom - the NFS3 timeout messages. I
don't think it is a solution, however.
-- adding "retrans=3" - this may be helping a _little_. Some of our
current nfsstat server and client side stats are below my sig.
-- NFSv2 vs. NFSv3 - we tried NFSv2, and performance was about the same,
if not a bit worse.
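For the curious, here is roughly what the relevant knobs look like right
now. The rc.config variable names are the ones I remember nfssetup using,
and the mount point path is a stand-in, so don't take either as gospel:

# daemon counts (set via rcmgr, read from /etc/rc.config at boot):
rcmgr set NUM_NFSD 56        # on the mailserver
rcmgr set NUM_NFSIOD 20      # on the login server

# the client-side mount, roughly:
mount -t nfs -o soft,timeo=300,retrans=3 husc-33:/mailshare/spool /mailshare/spool

Since timeo is in tenths of a second, timeo=300 is a 30-second initial
timeout; on a soft mount, once the retrans limit is exhausted the RPC gives
up and "RPC: Timed out" is handed back to whatever PINE or tcsh was doing.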
I apologize for the length of this post; my colleagues and I are extremely
frustrated with this problem. The only response I've received (thus far)
from Digital support was to increase our timeo value. :-(
If moving to Digital UNIX v4.0a (and therefore, to NFSv3 over TCP) will
help, we'll consider that. I don't particularly wish to pull more
all-nighters, but I'll do whatever is necessary to improve our situation.
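What I have in mind there is remounting with something along these lines
once we're on 4.0a (the vers/proto option names are my guess from other
vendors' mount_nfs - I haven't checked them against the 4.0a manpages yet):

mount -t nfs -o vers=3,proto=tcp,soft,timeo=300,retrans=3 \
    husc-33:/mailshare/spool /mailshare/spool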
Any and all suggestions will be most warmly received!!!
---
Todd V. Minnella
UNIX Systems Analyst, UNIX Systems Group
Harvard University Faculty of Arts and Sciences Computer Services
---
fas:~ # nfsstat -cri10 [on our general login server]
----------------------------------------
Client rpc:
calls      badcalls   retrans    badxid     timeout    wait       newcred    badverfs   timers
5816737    2382       24462      6731       24872      0          0          0          119579
----------------------------------------
Client rpc:
calls      badcalls   retrans    badxid     timeout    wait       newcred    badverfs   timers
1312       0          2          0          2          0          0          0          13
----------------------------------------
husc.harvard.edu:~ % nfsstat -sri10 [on our mailserver]
----------------------------------------
Server rpc:
calls badcalls nullrecv badlen xdrcall
5395094 0 0 0 0
----------------------------------------
Server rpc:
calls badcalls nullrecv badlen xdrcall
595 0 0 0 0
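[For reference, the cumulative client figures above work out to roughly:

retrans/calls  = 24462 / 5816737  ~ 0.4%
timeout/calls  = 24872 / 5816737  ~ 0.4%
badxid/timeout =  6731 / 24872    ~ 27%

If I understand the usual nfsstat rule of thumb correctly, badxid close to
timeout would point at a slow server sending late replies, while badxid
well below timeout - as here - points more toward requests or replies
getting dropped between client and server.]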