Sorry for the extended delay in this summary. I now know more than
I ever cared to know about the underbelly of NFS. Too bad though,
because my problem turned out to be my network. It took 3 weeks and
finally one good DEC engineer to sniff it out.
One of my hubs was generating late collisions, a rare but fatal
network disease! It wasn't generating enough for any machines to
care about until 3.2C came along with a highly optimized
ethernet interface driver. My other two 3000/400's were virtually
unaffected by the malady. The DEC engineer didn't suspect the
network until I told him that my 250 4/266 was having the same
problems. The problem became clear: really fast interfaces get
tripped up easily by even a few late collisions. A simple
netstat -i (which I should have done weeks ago!) turned up high I/O
errors on the port. I replaced the hub, and while none of the other
machines have gotten any faster, the two troubled beasts are
troubled no longer and faster than ever.
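For the record, the check that finally gave it away (interface names
vary from box to box -- ln0, tu0, and so on):
    netstat -i
Ierrs and Oerrs should sit at or near zero on a healthy interface; a
steadily climbing error count on one port is the smoking gun.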
Thanks to the folks who sent suggestions, almost all of which
I tried!
rusb_at_redac.co.uk (Russ Bevan)
"Craig I. Hagan" <hagan_at_ttgi.com>
benites_at_cs.unca.edu
brock_at_cs.unca.edu
John Stoffel <john_at_WPI.EDU>
"Chandra R. Chegireddy" <chandra_at_phys.ufl.edu>
Doug Gould <dgj_at_omega.rtpnc.epa.gov>
Andrew Gallatin <gallatin_at_isds.Duke.EDU>
Doug Apel <103153.264_at_compuserve.com>
***** Responses Follow *******
rusb_at_redac.co.uk (Russ Bevan)
We encountered the same problem, and it's because the default NFS
version on OSF/1 3.2 is Version 3. If you are mounting from an NFS Version
2 machine (which includes Solaris 2.[1234]), you need to specify NFS V2 in
the mount command.
Solaris 2.5 supports NFS V3, so the problem then goes away.
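For example, a V2 mount from the command line would look something like
this (server and path names here are just placeholders):
    mount -t nfs -o rw,nfsv2 server:/export/users /server/users
The same nfsv2 option can go in /etc/fstab for permanent mounts.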
Russ.
*********************
"Craig I. Hagan" <hagan_at_ttgi.com>
betcha that the alpha running 3.2D is running (or trying to run) nfs v3.
what you will want to do is waddle through the man pages and/or
kernel conf (or the FAQ) and disable this.
-- craig
*********************
benites_at_cs.unca.edu
You might want to change your /etc/fstab to use NFS Version 2. We just
changed our clients from the default Version 3 to 2 and the difference
is astounding. What was horrible performance has returned to what was
"normal" before 3.2C.
A sample line from /etc/fstab:
/users_at_server.edu /server/users nfs rw,bg,nfsv2 0 0
^^^^^^
-- bb
*********************
John Stoffel <john_at_WPI.EDU>
I don't have any magic suggestions for your NFS problems, but I do
have some thoughts that might help. But I'd still need to know more
information, such as what options you are using on the clients (DU) to
mount the directories on the server (SUN).
Have you tried changing the 'timeo' and 'retrans' options to mount?
See the mount(8) man page for what you can change. But this is what I
would work on changing.
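For example (the values here are only an illustrative starting point):
    mount -t nfs -o rw,bg,hard,intr,timeo=40,retrans=5 server:/users /users
A larger timeo and more retransmissions make the client more patient with
a slow or lossy network; the same options can go in /etc/fstab.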
What does nfsstat say on both the Suns and the Alphas? This might
give a hint as to where the problem is.
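For example:
    nfsstat -c     # client-side RPC counts -- high retrans/badxid/timeout numbers point at the network or server
    nfsstat -s     # server-side counts -- look for badcalls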
John
*********************
"Chandra R. Chegireddy" <chandra_at_phys.ufl.edu>
I am currently using Digital Unix 3.2C as an NFS client and server to
a SPARCserver 690 running a heavily patched Solaris 2.4. The
performance seems to be decent in both directions. I am planning to do the
upgrade to Digital Unix 3.2D and Solaris 2.5 soon. There were several
NFS-related patches to Solaris 2.4 that we put in. You may want to look into
that. Also please post a summary of responses you get to this query as I
am interested in finding out the experiences others are having with this
combination.
*********************
brock_at_cs.unca.edu
One thing I did yesterday was to change my Digital Unix client
to request NFS version 2 instead of version 3. It now seems faster.
To do this, you just add an "nfsv2" option in /etc/fstab.
The line will look something like
/users_at_tryon /tryon/users nfs rw,bg,intr,nfsv2 0 0
*********************
Doug Gould <dgj_at_omega.rtpnc.epa.gov>
In your shoes, I'd look at my 1.2 system tunables (/sys/conf/*)
and see if there's anything omitted from your OS config file which
should be included, and also check for parameters that are significantly
different and understand why. Next I'd be tempted to grab a
packet analyzer and see if there are bad packets on the network
which for whatever reason are un-detected/corrected by your host.
I saw poor performance on a system here when some idiot used a
category 1 cable instead of a category 5 cable to connect a
system to the hub. Packets were being dropped on the floor for no
apparent reason, until we questioned the network.
I'd guess that it's either a network problem or a timing issue
(like TTL) that is configuration-related. There was a message
on this list in the past couple of days addressing the method
and reason to change TTL from default.
*********************
Andrew Gallatin <gallatin_at_isds.Duke.EDU>
Here are some suggestions:
- make sure your route & network parameters are properly set in
/etc/rc.config. If these are incorrect, then other network
operations will see large slowdowns as well. How is ftp/rcp throughput?
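A quick sanity check, with host and file names as placeholders:
    netstat -rn                           # is the default route what you expect?
    rcp /var/tmp/bigfile server:/var/tmp  # crude throughput test that bypasses NFS
If plain rcp or ftp is also slow, the problem isn't NFS-specific.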
- *which* nfs server isn't responding? You may be looking at the
wrong problem. If you have something nfs-mounted directly off the
root directory, and that server isn't responding, you'll see 'nfs2
server not responding' messages for anything you do. If it's soft
mounted, then that search will time out eventually, and everything you
do will be incredibly slow. This didn't use to happen in 1.x.
- make sure that the mounts are of the right type. One of the major
changes between 1.x and 3.x is the introduction of nfs version 3.
Solaris doesn't support this. On alphas running 3.x, mounts default
to v3 unless the other machine only supports v2. I'm thinking that
perhaps the Solaris machine (which only does v2) is claiming to do v3,
and there is some sort of mismatch between the protocols. This is very
much a longshot, but you can rule it out by simply typing mount at
the prompt - if the mount is of the correct type, it should have a v2
(rather than v3) as the first item in parentheses.
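For example, the output for an NFS-mounted filesystem will look roughly
like this (server and path names here are placeholders):
    roentgen:/patient1 on /patient1 type nfs (v2, rw, udp, hard, intr)
A v3 in that first position means the mount negotiated version 3.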
- one of the causes for very slow logins is having to wait for a dead
nameserver to respond, or having an improperly configured /etc/resolv.conf.
The format of /etc/resolv.conf has changed between 1.x and 3.x. We used
to say 'domain isds.duke.edu' and it would first search for hosts in
isds.duke.edu, then hosts in duke.edu. To get this behavior back
after upgrading to 3.x, one needs to specify the search order using
the search keyword:
search isds.duke.edu duke.edu
If I can help you in any other way - give me a ring at 684-5419.
We've got 13 alphas over here, and I've been handling OSF/1 through every
release since 1.x.
Drew
This is what I meant (but didn't say very well the first time around)
-- nfs-mounts directly underneath the root directory are a bad thing
under DU 3.x. I have no idea why, but I think that each time you do
nearly any I/O, it's stat'ing the root directory and all of its
subdirectories, i.e., stat'ing each one of these nfs-mounted
directories. The first thing I'd try is moving all the nfs-mounts out
of the root directory & into a separate subdirectory, i.e.,
from:
/patient1_at_roentgen /patient1 nfs rw,bg,hard,intr,nfsv2,timeo=40 0 0
to:
/patient1_at_roentgen /rotgen/patient1 nfs rw,bg,hard,intr,nfsv2,timeo=40 0 0
> How do I figure out where the bad calls are coming from and going to?
I'd run a packet filter & see what's going on. tcpdump is pretty easy
to use & will give a description of each & every packet you
send/receive.
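For example, to watch just the NFS traffic between the two machines
(the host name and interface here are placeholders -- use whatever
netstat -i shows for your ethernet port):
    tcpdump -i ln0 host roentgen and port 2049
Port 2049 is where the NFS traffic lives.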
> Any Ideas?
One off the wall idea -- is there any FDDI between the alpha & the
sparc? I've heard of weird nfs problems relating to MTU size when
FDDI is involved.
*********************
Doug Apel <103153.264_at_compuserve.com>
Sorry to hear of your plight -- Digital told me I had to "upgrade" from 3.2 to
3.2C due to NFS problems back in October, when that really wasn't the problem,
so I completely sympathize with you...
One thing to check is your rc.config file. Look for the line NFS_LOCKING and
make sure it is a zero -- it evidently defaults to on (I sure don't remember
flipping it on!). This caused all sorts of grief with clients having
ridiculously slow mounts and then server-side timeouts and "path not found"
errors on my clients.
Also, how many NFS threads are you running server side? If your output from
netstat -s lists a bunch of overflow sockets in the UDP section, bump them up
(value can also be set in rc.config). The 3.2C default is either 4 or 8 (I forget),
but I am running theoretical max (128) and only incurring a .02 load increase on
the CPU, so I figure what the heck. I have 50 PCs and 4 hp-9000's doing a total
of about 325 NFS mounts off of my two alphas using the above setup.
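Roughly, the check and the knob (the NUM_NFSD variable name here is from
memory -- grep your rc.config to confirm what your release actually uses):
    netstat -s | more       # look in the udp section for socket-buffer overflows
    rcmgr get NUM_NFSD
    rcmgr set NUM_NFSD 128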
Happy Hunting!
doug apel
*********************
***** ORIGINAL MESSAGE FOLLOWS *******
> Subject: Abysmal NFS client performance under 3.2[CD]
> Date: Thu, 18 Jan 1996 02:43:22 -0500
> From: Phil Antoine <antoine_at_RadOnc.Duke.EDU>
>
>
> Gees, I knew I should have left these boxes at OSF 1.3. They were
> working just fine...
>
> My problem now... Having just finished the installation and
> configuration of 3.2D on a DEC 3000/400, I am now faced with a
> ridiculously slow NFS client. This same box just two weeks ago had
> decent NFS performance as a client to a Solaris 2.3 server. It
> mounts 8 different FS's from that server. All FS's on all the
> systems are UFS for portability reasons.
>
> There's nobody else running anything on either the client or server,
> as it is 2am and I just rebooted them both to make sure. For
> crying out loud, my Sparc 2 and an identical Alpha running OSF 1.2
> (that's not a typo. If it ain't broke, we generally don't fix it)
> are currently running NFS circles around this crippled beast using
> the same mounts from the server. I'm getting a bunch of "NFS2
> server not responding" messages from this box at the same time a
> Sparc 2 is getting reliable light to moderate service from that same
> host and mount. Overload on the server just isn't feasible because
> during the regular production day, about 15 clients are beating on
> these same FS's with no problem at all. I've had no problems until
> DU 3.2[CD]. I mention 3.2C also because another AlphaStation 250
> 4/266 is experiencing problems similar to this. Logins there take
> about 2 minutes to complete. It's sort of languishing in a corner
> until I can work on it.
>
> I've checked the sanity and integrity of the network configuration
> and the NFS configuration (without tuning any parameters from the
> defaults). I've tried it with and without NFS locking on the client
> side. I've currently got 7 nfsiod's running, which should be plenty
> with quiescent systems and attempted heavy access on only one remote
> mount.
>
> Printing, mail, NTP, DNS, NIS, and LMF all seem to be working fine on
> this box. It's late and I'm beyond frustration with this particular
> incarnation of UNIX. What might I be missing here?
>
> Thanks in advance,
>
> Phil Antoine (antoine_at_RadOnc.Duke.EDU)
> Duke University Medical Center
> Radiation Oncology Physics
> Durham, North Carolina USA
> http://www.RadOnc.Duke.EDU/~antoine