Here is a long-standing problem with NFS. Any comments or suggestions
gratefully accepted.
cheers
-George
Server: Alpha 800 5/333, RAID striped AdvFS. All NFSv3 (now) with
attribute caching. rpc.lockd and rpc.statd both running, and
rpc.lockd seen to be a bit of a hog/dog process from time to time.
Server load consistently under 1. usually 0.4 or less.
% system time variable. baseline <10% but frequent peaks over 40%.
32 nfsd threads. ps axml shows heaps in "I" and "S" state.
quotas enabled.
NO OTHER LOAD ie we wish we'd bought a NetApp NFS toaster and
we run the NFS server as if it were one.
60+ clients. mix of 10/100 speeds. All homedirs, all binary paths
for DU clients, all work dirs for compile/edit cycles, email delivery.
Clients: Mix of Alpha/DU4 and Solaris 2.6/2.5 including the new U-10
which are IDE bus Suns, with a 100mbit ethernet card. These latter
do see other problems with network speed.
Network: Cat5000/1900/2900 fully switched. All servers and most clients
set to 100mbit full duplex.
Network is not congested. drops are less than 0.1% of traffic on
most hosts.
Behaviour:
Take a dir of around 1500 files. Make it be 'quiescent' ie not
looked at on any host.
Take any pair of clients A,B.
on B, cd ~path/dir; /bin/time ls -F > /dev/null
first fetch time varies, around 20-40sec.
repeat fetch time is around 0.2sec on DU with attr
cache, 2-4sec stable on any other host.
on A, cd ~path/dir; /bin/time ls -F > /dev/null
time is around 2-4sec initially, then 0.2sec
in client side cache if DU with attr cache.
on server, same dir is 1-2sec to ls -F > /dev/null
ignoring host B, behaviour is same on A ie for a quiescent
filesystem, initial dirscan is 10x slower than subsequent
scans if not in client side cache, and not in (presumed)
server side cache.
time to go quiescent (ie initial slow load time) varies
but can be as little as 40min.
A (considerably more lightly used) Sun server has nothing like
this initial delay. Ever.
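For anyone wanting to reproduce the first-scan vs repeat-scan numbers above without eyeballing /bin/time, here is a minimal sketch. It assumes that listing the directory and calling lstat on every entry is a fair stand-in for `ls -F`; the function names `timed_scan` and `compare_scans` are made up for illustration and are not part of any tool mentioned in this post.

```python
import os
import time


def timed_scan(path):
    """Approximate `ls -F`: list the directory and lstat every entry.

    Returns (elapsed_seconds, entry_count).
    """
    start = time.perf_counter()
    names = os.listdir(path)
    for name in names:
        try:
            os.lstat(os.path.join(path, name))
        except OSError:
            # Entry vanished or is unreadable; keep timing the rest.
            pass
    return time.perf_counter() - start, len(names)


def compare_scans(path, repeats=3):
    """Time one first scan plus `repeats` follow-up scans of the same directory."""
    first, count = timed_scan(path)
    later = [timed_scan(path)[0] for _ in range(repeats)]
    return {"entries": count, "first": first, "repeat_min": min(later)}
```

Run `compare_scans("/path/to/dir")` on client B against a quiescent directory, then on client A, and compare the "first" and "repeat_min" values between the two hosts.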
Issues:
The users really really really dislike that initial 20-40sec delay.
This happens on ALL clients. DU4, Solaris, HP-UX. It happens for
any large directory. It happens for all path-completion requests
which are scanning across the $PATH elements, and for random in
cwd path completions. It slows login down to around a 1 min delay
from time to time. It's immensely user-visible.
DEC have a call open on this. I'm not saying they aren't trying
but this problem has been extant for around 6 months, and I believe
has been mentioned on this list for some considerable time across
DU 4.0B-D release and before (ie 3.2c)
Theory:
Because a host B can affect a host A's time to load, there is a
very strong suspicion that this is some nfsd server-side issue
with a cache of open dir/files.
* is this a side-effect of AdvFS?
* is this a side-effect of RAID?
* is this a side-effect of NFSv3 (we don't think so. the
difference between v2 and v3 is small)
* is this a tuning thing? (we've applied DEC's recommended
tuning from a system scan -escalate. It made no
difference)
* is this something that more memory on server fixes
(even if we never swap, and always have 2-8Mb free)
* is this a load thing? (how come we don't see high CPU load)
* why does rpc.lockd take so much time, and why does system
time rise above 30% when we have these problems?
Other problems:
Suns can wedge rpc.lockd trivially.
When Suns do top or dmesg, truss shows calls to kvm_open()
with a RDONLY lock call. That's when the process hangs, and
suspiciously, rpc.lockd on DEC server goes active.
Why the $#%! does a Sun ask a network lock daemon
if it can open its own memory ?????
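A rough way to check whether a plain read-lock request from a client gets stuck in rpc.lockd is sketched below. It is only a probe under assumptions: it assumes the hang is in an fcntl-style lock request on a file living on the NFS mount, which is a guess and not what truss actually showed for kvm_open(). The names `probe_read_lock` and `LockTimeout` are hypothetical. On a hard mount the process may sit in uninterruptible sleep, in which case the alarm will not fire until the call returns.

```python
import fcntl
import signal


class LockTimeout(Exception):
    """Raised when the lock request does not return within the time limit."""


def _on_alarm(signum, frame):
    raise LockTimeout()


def probe_read_lock(path, timeout_s=10):
    """Try to take and release a shared (read) lock on `path`.

    Returns True if the lock was granted, False if it was refused,
    and raises LockTimeout if the request did not return in time.
    Must be called from the main thread (it uses SIGALRM).
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_s)
    try:
        with open(path, "rb") as f:
            try:
                # Non-blocking shared lock; over NFS this still goes via lockd.
                fcntl.lockf(f, fcntl.LOCK_SH | fcntl.LOCK_NB)
            except OSError:
                return False
            fcntl.lockf(f, fcntl.LOCK_UN)
            return True
    finally:
        # Always cancel the alarm and restore the previous handler.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)
```

Calling `probe_read_lock` on a file in an NFS-mounted homedir while watching rpc.lockd on the server would show whether ordinary lock traffic is enough to trigger the stall.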
Kernel can hang serving NFS.
the informal not-yet-released malloc_mem_alloc patch to
network layer code... we think it may not entirely work.
we still get malloc_mem_alloc hangs.
Userland client calls to glob can fail.
some client-side usercode fails. gnumalloc calling glob
functions fails 2 times in 4 to scan a dir/*/file glob
request. (this is impossible to recreate in shell, but
definitely exists. Two sequential calls of the make command
succeed the second time. Bizarre!)
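One way to look for the intermittent glob failure outside of make is to run the same dir/*/file style pattern several times in a row and compare the results. The sketch below does that with Python's own glob module, which is an assumption worth stating loudly: it does not go through the C library glob or gnumalloc path that the failing program uses, so a clean run here does not clear that code. `repeat_glob` and `inconsistent_runs` are illustrative names only.

```python
import glob


def repeat_glob(pattern, attempts=4):
    """Run the same glob several times and return the sorted result of each run."""
    return [sorted(glob.glob(pattern)) for _ in range(attempts)]


def inconsistent_runs(results):
    """Return indexes of runs whose result differs from the largest result seen."""
    reference = max(results, key=len)
    return [i for i, r in enumerate(results) if r != reference]
```

If `inconsistent_runs(repeat_glob("dir/*/file"))` ever comes back non-empty on a quiescent directory, the short runs point at directory reads returning incomplete listings rather than at the calling program.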
Received on Wed Sep 02 1998 - 00:14:43 NZST