SUMMARY: NFS3 server not responding - what did i miss? from Bernt Christandl on 1997-10-15 (tru64-unix-managers)

From: Bernt Christandl <beb_at_mpe.mpg.de>
Date: Tue, 14 Oct 1997 14:12:29 +0200

Hello managers,

I asked about possible reasons for the error messages
     NFS3 server x00 not responding still trying
     NFS3 server x00 ok
from my server x00, which is connected via 10BaseT (100Mbit) to a switch.
(The original question is at the end of this mail.)

The situation now is that I copied about 2 GB this morning between
two AlphaServers (via NFS) without any of these messages (at what I feel
is a reasonably good speed), and also about 300 MB via NFS from my old
Ultrix box to the new 1000A (again with no such messages), so I think I have
a solution, even if I do not *really* understand why...

My thanks for their answers go to:
O. Boebion <boebion_at_bbrother.phys.univ-tours.fr>
Richard Eisenman <eisenman_at_tricity.wsu.edu>
Olle Eriksson <olle_at_cb.uu.se>
Kai Grunau <kgrunau_at_ifm.uni-kiel.de>
Robert L. McMillin <rlm_at_syseca-us.com>
David Warren <warren_at_atmos.washington.edu>

They all pointed to different possibilities that may cause these messages,
and the most surprising thing to me was that there are evidently problems
between a switch and Digital UNIX under NFS...

David Warren said:
> Are the 1000A and the switch set to full duplex by any chance? If so, you are
> dropping packets when the switch buffer gets full, and then you wait for a
> timeout to request a resend. This can actually happen even if you don't have it
> running full duplex. It is just less likely.

And yes, we had it on full duplex and 100Mbit. I *think* this was the cause of
my actual problem, since I have drastically fewer (nearly none!) of these NFS
messages after changing the connection to half duplex and 100Mbit...
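
As a rough illustration of why even a small rate of dropped frames can stall
NFS, here is a minimal Python sketch. It is not from any of the replies. It
assumes NFS over UDP with an 8 KB transfer size, about 200 bytes of RPC/NFS
header, a 1500-byte Ethernet MTU, and independent frame losses; all of these
numbers are assumptions, and the function names are hypothetical. Under those
assumptions each request is split into several IP fragments, and losing any
one fragment loses the whole request, which then waits for the NFS timeout.

```python
import math

def fragments_per_datagram(payload_bytes, mtu=1500, ip_header=20, udp_header=8):
    """Number of IP fragments needed to carry one UDP datagram."""
    per_fragment = mtu - ip_header  # bytes of IP payload per Ethernet frame
    return math.ceil((payload_bytes + udp_header) / per_fragment)

def datagram_loss(frame_loss, fragments):
    """Probability that at least one fragment is lost (independent losses assumed)."""
    return 1.0 - (1.0 - frame_loss) ** fragments

# Assumed: 8 KB of data plus roughly 200 bytes of RPC/NFS header.
frags = fragments_per_datagram(8192 + 200)
for p in (0.001, 0.01, 0.05):
    print(f"frame loss {p:.1%}: {frags} fragments, "
          f"request loss {datagram_loss(p, frags):.1%}")
```

The point of the sketch is only that the per-request loss rate is several
times the per-frame loss rate, so a switch that drops a few frames can
produce many NFS timeouts.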

Richard Eisenman said:
> There are apparent problems with NFS over switched media when dealing with
> DU. Both our site and others have been working this problem with Cisco
> and DEC, but without too much success, however, there are now reports that
> the spanning tree algorithm may be responsible. We're doing some
> experiments here with modified spanning tree parameters, as well as
> disabled spanning tree operation, on a Cat 5000 to see if we can come up
> with a working solution. I'll post the results as soon as we have them.

Our switch is an NBase Megaswitch NH-2012R and we have spanning tree
completely disabled, but nevertheless we are *very* interested in these
results! When we started with the switch we had big problems with "hanging"
ports, and one attempt to solve these was to disable spanning tree...

Olivier Boebion said:
> I had this sort of message with a DU 3.2c system. It appeared when I
> changed the network media: 10base2 ---> 10BaseT. I never found the solution.

(He thought that automounter software (amd) could have caused this...)
This sounds to me as if he has a switch too. (I forgot to ask him.)

Olle Eriksson
   was the first to answer and gave the hint that the problem
may lie in the different NFS versions (2/3) between Ultrix and DU.
(But I had the problem between two DU machines as well...)

Kai Grunau said:
> Watch the output of the command "netstat -a -I <interface> <interval>"
> during the transfer.

(His opinion was that the Ethernet could be the bottleneck. After that I
sent my follow-up, telling you all about the switch connection...)

Robert McMillin said:
> Are you dropping packets? Try a ping flood (ping -f, as root):
>
> - from the NFS server to the client
> - from the NFS server to any other random machine on the network
> - from the NFS client to any other random machine on the network
>
> I'm wondering if you maybe don't have some loose connections.

"dropping packets" with the full duplex connection... It seems so.
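
To compare the three ping runs suggested above, the loss rate can be read
from ping's summary line. The following is a minimal Python sketch, not part
of Robert's reply. It assumes the common summary wording "N packets
transmitted, M packets received"; the exact wording varies between ping
implementations, and the function name and sample text are hypothetical.

```python
import re

# Assumed summary format; some ping versions omit the second "packets".
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")

def packet_loss(ping_output):
    """Return the fraction of lost packets from ping's summary line, or None."""
    match = SUMMARY.search(ping_output)
    if match is None:
        return None  # no recognizable summary line
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        return None  # nothing was sent, so no loss rate can be computed
    return (sent - received) / sent

# Made-up sample text, for illustration only.
sample = "1000 packets transmitted, 970 packets received, 3% packet loss"
print(packet_loss(sample))
```

A clearly higher loss rate on the server-to-client path than on the other
two paths would point at that particular link or switch port.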

My follow-up and original question were:
> Hello managers,
>
> your first answers to my question have come in, and it turned out
> that I forgot one point:
>
> our 1000A sits alone on a 10baseT port of a switch, and "mrtg"
> does not show any significant traffic over that port. This is why we
> think that the net is not the bottleneck...
>
>
> My original question was:
> to have a good server machine for the (near) future, we bought an
> AlphaServer 1000A with DU-4.0B, 192 MB and 5 PCI-SCSI busses; we call it x00.
>
> One of these busses is a wide SCSI, where we have around 150
> user-home-directories, which are accessed ONLY via nfs
> from several other AlphaServers, where those users have their local
> diskspace.
>
> When I now cp around 200 MB to another NFS-exported disk on another
> (narrow) SCSI bus (of x00), working "locally" on an old Ultrix machine
> where this second disk is mounted (copying from a disk attached to that
> Ultrix machine), I receive the message pair several times
>
> NFS3 server x00 not responding still trying
> NFS3 server x00 ok
>
> until the transfer finished.
>
> I don't understand this. I thought that such a slow transfer should not
> demand more than the 1000A can handle. (In the near future we want
> x00 to handle far more disks!) We have 16 NFS daemons running...
>
> What can I do to improve the situation???

Thank you all!

Bernt Christandl

----------------------------------------------------------------------
- Bernt Christandl / Max Planck Institut - Extraterrestrische Physik -
- D-85740 Garching / Phone: +49/89/3299-3342 / Fax: +49/89/3299-3569 -
- email: beb_at_mpe.mpg.de -
----------------------------------------------------------------------
Received on Wed Oct 15 1997 - 09:30:00 NZDT
