I've recently run into serious NFS problems here, and Compaq support has
been less than helpful in resolving them:
We're running a DEC 3000/500 and an AlphaServer 2100 4/200, until recently
with Digital UNIX V4.0B patch kit 8. Every once in a while, the nfs_tcp
kernel threads vanish without a trace, as verified by ps axlm, while
nfs_udp threads still exist and old Ultrix V4.3 clients can still access
the servers via NFSv2/UDP. There are no kernel messages about this, and
we've not yet been able to identify the root cause of this. It doesn't
seem to be a simple load problem, but may sometimes be caused by serious VM
exhaustion. This happened twice last week; afterwards we installed patch
kit 10 and haven't observed this failure since (although that patch kit
broke mountd: mount requests from some multihomed clients are rejected;
reinstalling the old mountd didn't help; only mountd from V4.0D patch
640.00 works ;-). The bug may be fixed, but the patch kit release notes
aren't conclusive in this respect.
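In case someone wants to watch for the vanishing threads, here is a rough
watchdog sketch (Python, and purely illustrative: it simply greps the ps axlm
output for the literal thread name, and the polling interval and syslog
priority are arbitrary choices of mine):

#!/usr/bin/env python3
# Hypothetical watchdog: poll `ps axlm` and complain once no nfs_tcp
# kernel thread shows up in the listing any more.
import subprocess
import syslog
import time

CHECK_INTERVAL = 60          # seconds between polls (arbitrary)

def nfs_tcp_threads_present():
    out = subprocess.run(["ps", "axlm"], capture_output=True, text=True).stdout
    return "nfs_tcp" in out

if __name__ == "__main__":
    syslog.openlog("nfs_tcp_watch")
    while True:
        if not nfs_tcp_threads_present():
            syslog.syslog(syslog.LOG_ALERT, "no nfs_tcp kernel threads found")
        time.sleep(CHECK_INTERVAL)
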
The nfsd cannot be killed and restarted in this situation, so we need to
reboot the affected servers. And now the other part of the problem
(unrelated but considerably worsening the situation) sets in: while the
server is still shutting down, several of its clients (running a mixture
of V4.0B patch kits 8 and 9 and V4.0D patch kits 2 and 4) start bombarding the
server's nfs/tcp port with connection requests at a rate of several
thousand per second. The same happened to a Sun Enterprise 150 running
Solaris 2.5.1 upon a scheduled reboot, so this isn't DU specific. Even a
single such client saturates a 10 Mbit/s Ethernet link, and about half of
our 20 clients show this problem. The affected servers take more than 30
minutes to boot under this load, if they even manage to boot at all. The
clients are completely unresponsive and need to be turned off.
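To put a number on that rate, one can simply count the incoming SYNs per
second. The sketch below is only an illustration of the idea, not something
that would run unchanged on the Solaris or Digital UNIX boxes involved (it
uses a Linux AF_PACKET raw socket); port 2049 and the one-second buckets are
the only fixed parameters:

#!/usr/bin/env python3
# Count TCP SYNs (without ACK) aimed at the NFS port, per second.
import socket
import struct
import time

NFS_PORT = 2049
ETH_P_IP = 0x0800

s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_IP))

count, window_start = 0, time.time()
while True:
    frame = s.recv(65535)
    ip_off = 14                          # Ethernet header is 14 bytes
    ihl = (frame[ip_off] & 0x0F) * 4     # IP header length in bytes
    if frame[ip_off + 9] != 6:           # protocol field: 6 = TCP
        continue
    tcp = frame[ip_off + ihl:]
    src, dst = struct.unpack("!HH", tcp[:4])
    flags = tcp[13]
    if dst == NFS_PORT and flags & 0x02 and not flags & 0x10:   # SYN, no ACK
        count += 1
    now = time.time()
    if now - window_start >= 1.0:
        print("%d SYNs to port %d in the last second" % (count, NFS_PORT))
        count, window_start = 0, now
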
Exactly the same behavior was reported to alpha-osf-managers back in
April 1997, but triggered no response:
http://www.ornl.gov/its/archives/mailing-lists/alpha-osf-managers/1997/04/msg00142.html
The last time the problem occurred, I was able to capture the start of
this from a Sun SPARCstation 20/50 running Solaris 2.5.1. The following
extracts from a snoop trace illustrate what happens:
aiserv: NFS server, DEC 3000/500, Digital UNIX V4.0B patch kit 8
sequoia: NFS server, Sun Enterprise 3000, Solaris 2.5.1
liszt, mozart: NFS clients, DIGITAL Personal Workstation 500, DIGITAL UNIX
V4.0D patch kit 2
At about 18:10, the shutdown of aiserv (the DEC 3000/500) started. The
machine closes some open NFSv3/TCP connections, but already rejects DNS
requests (ICMP port unreachable):
491 18:10:46.17663 sequoia -> aiserv TCP D=1023 S=2049 Fin Ack=784488958 Seq=1850738396 Len=0 Win=8760
492 18:10:46.17791 aiserv -> sequoia TCP D=2049 S=1023 Ack=1850738397 Seq=784488958 Len=0 Win=33580
493 18:10:46.32634 liszt -> aiserv DNS C port=1239
494 18:10:46.37727 aiserv -> liszt ICMP Destination unreachable (Bad port)
but still seems to respond to some NFS/TCP requests:
505 18:10:46.50245 mozart -> aiserv TCP D=2049 S=1017 Ack=1156967478 Seq=1978534677 Len=1 Win=33580
506 18:10:46.50288 aiserv -> mozart TCP D=1017 S=2049 Ack=1978534677 Seq=1156967478 Len=0 Win=0
For about 60 seconds, the trace shows no response from aiserv to any
request, but many NFS retransmits by clients.
Afterwards, the SYN-flooding client closes an NFS/TCP connection and
immediately starts bombarding aiserv's nfs/tcp port with TCP segments
that have the SYN flag set and increasing sequence numbers, but from a
different source port:
750 18:11:52.77453 liszt -> aiserv TCP D=2049 S=1021 Rst Ack=914064411 Seq=439703764 Len=0 Win=33580
751 18:11:52.77462 liszt -> aiserv TCP D=2049 S=1023 Syn Seq=1599542320 Len=0 Win=32768
752 18:11:52.77469 liszt -> aiserv TCP D=2049 S=1023 Syn Seq=1599606320 Len=0 Win=32768
753 18:11:52.77474 liszt -> aiserv TCP D=2049 S=1023 Syn Seq=1599670320 Len=0 Win=32768
Those segments arrive at about 50 to 60 microsecond intervals!
This is the snoop -v output for the first of those SYN packets:
ETHER: ----- Ether Header -----
ETHER:
ETHER: Packet 156 arrived at 18:11:52.77
ETHER: Packet size = 62 bytes
ETHER: Destination = 8:0:2b:34:61:18, DEC
ETHER: Source = 0:0:f8:76:3d:ae,
ETHER: Ethertype = 0800 (IP)
ETHER:
IP: ----- IP Header -----
IP:
IP: Version = 4
IP: Header length = 20 bytes
IP: Type of service = 0x00
IP: xxx. .... = 0 (precedence)
IP: ...0 .... = normal delay
IP: .... 0... = normal throughput
IP: .... .0.. = normal reliability
IP: Total length = 48 bytes
IP: Identification = 31253
IP: Flags = 0x4
IP: .1.. .... = do not fragment
IP: ..0. .... = last fragment
IP: Fragment offset = 0 bytes
IP: Time to live = 60 seconds/hops
IP: Protocol = 6 (TCP)
IP: Header checksum = c1ab
IP: Source address = 129.70.128.23, liszt
IP: Destination address = 129.70.128.99, aiserv
IP: No options
IP:
TCP: ----- TCP Header -----
TCP:
TCP: Source port = 1023
TCP: Destination port = 2049
TCP: Sequence number = 1599542320
TCP: Acknowledgement number = 0
TCP: Data offset = 28 bytes
TCP: Flags = 0x02
TCP: ..0. .... = No urgent pointer
TCP: ...0 .... = No acknowledgement
TCP: .... 0... = No push
TCP: .... .0.. = No reset
TCP: .... ..1. = Syn
TCP: .... ...0 = No Fin
TCP: Window = 32768
TCP: Checksum = 0x8190
TCP: Urgent pointer = 0
TCP: Options: (8 bytes)
TCP: - Maximum segment size = 1460 bytes
TCP: - No operation
TCP: - Window scale = 0
TCP:
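A quick back-of-the-envelope calculation (my own arithmetic, assuming the
usual Ethernet framing overhead of an 8-byte preamble, 4-byte FCS and 12-byte
inter-frame gap on top of the 62-byte frame shown above) suggests that this
packet rate alone is more than a 10 Mbit/s segment can carry:

frame_bits = (62 + 4 + 8 + 12) * 8     # frame + FCS + preamble + inter-frame gap
for interval_us in (50, 60):
    pps = 1e6 / interval_us
    print("%2d us spacing: %5.0f pkt/s, %4.1f Mbit/s offered load"
          % (interval_us, pps, pps * frame_bits / 1e6))

That works out to roughly 14-17 Mbit/s of offered load, so even at the slower
spacing a single client flooding like this is enough to saturate the wire.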
I really have no idea how to fix this; maybe switching to NFS/UDP could help
as a last resort, although I fear this could lead to serious congestion on
the 10 Mbit/s link to the DEC 3000/500.
Any suggestions?
Rainer
-----------------------------------------------------------------------------
Rainer Orth, Faculty of Technology, Bielefeld University
Email: ro_at_TechFak.Uni-Bielefeld.DE