Not many responses to this one, I'm afraid. Thanks go to Harvey Rarback (SLAC)
who reported the same symptoms, and to Ric Werme (ZK3) who kindly looked
at some of my packet captures.
1. Systems known to be affected:
Server: V4.0F PK8 (BL22), V4.0G PK4 (BL22)
Client: Solaris 7 and 8 with current patches
2. Symptoms: NFS over TCP hangs with packets accumulating in the receive
queue on the server. Kernel NFS thread is in S state, indicating recent
activity (so it did wake up). Among the accumulated packets are some
calls for RPC 100227 (nfs_acl), which is surprising since the Tru64
portmapper doesn't advertise this program number.
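For anyone wanting to watch for the same symptom, here is a minimal Python sketch that picks out NFS TCP connections with a backlog in the receive queue from saved "netstat -an" output. It assumes the usual column order (Proto, Recv-Q, Send-Q, local address, foreign address, state) and the standard NFS port 2049. The function name and the sample text are made up for illustration; they are not captured from the affected systems.

```python
def stuck_nfs_connections(netstat_text, port="2049", threshold=0):
    """Return (local, remote, recv_q) for TCP sockets on `port`
    whose Recv-Q exceeds `threshold` bytes."""
    stuck = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed layout: proto recv-q send-q local foreign [state]
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue
        try:
            recv_q = int(fields[1])
        except ValueError:
            continue  # header or otherwise unparseable line
        local = fields[3]
        # BSD-style netstat writes addr.port; some others write addr:port.
        local_port = local.replace(":", ".").rsplit(".", 1)[-1]
        if local_port == port and recv_q > threshold:
            stuck.append((local, fields[4], recv_q))
    return stuck


# Illustrative sample only (hypothetical addresses and byte counts).
SAMPLE_NETSTAT = """\
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp    34752      0  10.0.0.1.2049          10.0.0.2.1023          ESTABLISHED
tcp        0      0  10.0.0.1.22            10.0.0.3.40000         ESTABLISHED
tcp        0      0  *.2049                 *.*                    LISTEN
"""

if __name__ == "__main__":
    for local, remote, recv_q in stuck_nfs_connections(SAMPLE_NETSTAT):
        print(local, remote, recv_q)
```

A Recv-Q that stays non-zero across repeated samples is the interesting case; a single non-zero reading may just be normal traffic in flight.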
3. Recovery: /sbin/init.d/nfs {stop,start} doesn't help; the client
promptly retransmits and restores the status quo. Rebooting the client
sometimes clears the problem, but not always on the first attempt.
(As luck would have it, the problem went away just as I was trying to
get some clean packet captures from the initial TCP handshake onwards.)
4. Workaround: use NFS over UDP. It's solid.
5. Conjectures as to the cause:
Both Harvey and I are running AFS, whose module might conceivably
interfere with the rest of the kernel, but that seems a
less likely culprit than Sun's NFS ACL support. Clearly Sun's
implementation is less than perfect: it ought to check with the
portmapper before attempting the nfs_acl RPC calls, and Solaris'
"nfsstat -m" reports ACL support for all mounts, even those from
servers (e.g., Linux) that definitely don't support NFS ACLs.
There doesn't seem to be a mount option to disable ACL support either.
In Ric Werme's words: "we won't recognize [the ACL protocol],
there's a decent chance the server is getting confused and discards
everything else that comes along. There were bugs in that area,
it may be that porting the fix from V5.1x didn't work."
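The portmapper check mentioned above is easy to do by hand. As a rough Python sketch, the following parses saved "rpcinfo -p" output (columns: program, vers, proto, port, service) and reports whether program 100227 is registered. The function names and the sample listing are hypothetical, chosen for illustration only; the listing is not real output from the servers in question.

```python
NFS_ACL_PROG = 100227  # RPC program number for nfs_acl
NFS_PROG = 100003      # RPC program number for nfs


def advertised_programs(rpcinfo_text):
    """Parse `rpcinfo -p` text into a set of (program, version, proto)."""
    programs = set()
    for line in rpcinfo_text.splitlines():
        fields = line.split()
        # Skip the header line and anything without numeric program/version.
        if len(fields) < 4:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        programs.add((int(fields[0]), int(fields[1]), fields[2]))
    return programs


def advertises(rpcinfo_text, program, proto=None):
    """True if `program` is registered, optionally only for `proto`."""
    return any(
        prog == program and (proto is None or p == proto)
        for prog, _vers, p in advertised_programs(rpcinfo_text)
    )


# Illustrative sample only: a server that offers NFS but not nfs_acl.
SAMPLE_RPCINFO = """\
   program vers proto   port  service
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    3   tcp   2049  nfs
    100005    3   udp   1027  mountd
"""

if __name__ == "__main__":
    print("nfs over tcp:", advertises(SAMPLE_RPCINFO, NFS_PROG, "tcp"))
    print("nfs_acl:", advertises(SAMPLE_RPCINFO, NFS_ACL_PROG))
```

A client that did this lookup first would see no registration for 100227 and could skip the nfs_acl calls altogether.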
Received on Thu Sep 18 2003 - 22:10:09 NZST