SUMMARY: Alpha Hanging upon NFS server loss

From: Neil Smith <neils_at_csrp.tamu.edu>
Date: Fri, 18 Aug 1995 11:33:36 -0600

Excellent responses. Many thanks.

------------------ Original query ----------------
Our Alphas stop providing certain functionality whenever an NFS
file system server goes down. On occasion, it will lock up the machine
completely. This feature has endured across two versions of OSF/1, 2.0 and
3.2, so my first guess is that it's either OSF/1 itself, or an NFS option that
needs to be set. I am using the usual mount parameters, e.g. from fstab:

/data0@csrp /data0 nfs rw,bg 0 0

The Alphas are model 3000/600s running OSF/1 v3.2. The file system servers are
these Alphas and IBM RS/6000s running AIX 3.2.5. The AIX machines do not
have this problem. On both platforms, NFS dutifully announces that a
server is not responding, etc. But the Alphas, in addition, hang on
most commands, which resume as soon as the server is brought back
online and is servicing client requests.

What is happening here? What should be done to correct it?
---------------------------------------------------

The prevailing wisdom, summarised by Bernhard Schneck:

          Thou Shalt Not Mounteth into the root directory.

i.e.,
     1. put your NFS mount points in a subdirectory of the root (/)
        directory, not directly in the root directory.

        e.g., from my fstab example above:

        instead of
                /data0@csrp /data0 nfs rw,bg 0 0
        use
                /data0@csrp /nfs/data0 nfs rw,bg 0 0

        where the mount point is /nfs/data0 instead of /data0 in the root directory.

     2. include the intr mount option, so that a process blocked on an
        unresponsive server can be interrupted instead of retrying the
        RPC indefinitely.

     3. alternatively, soft mounts can be used, but they are generally
        discouraged.

The cause of the problem is explained in more detail in the selected
responses below:

=====
From: Bernhard.Schneck_at_Physik.TU-Muenchen.DE
The problem is with how getwd/getcwd do their job ... they repeatedly
go to the parent directory and stat(2) its entries to find the one that
matches the current directory, until they reach the root, where
stat(".") == stat("..").

If one of the entries it needs to stat is on an NFS-mounted volume whose
server is down, the stat will hang.

getwd/getcwd are called from an incredible number of programs ... *sigh*.

======
From: Paul David Fardy <pdf_at_morgan.ucs.mun.ca>
This is a common problem with NFS. Many programs (the best example
being /bin/pwd) search up the directory tree to find the full pathname
of the current directory. On the way up, a program can encounter
the root directory of a remote filesystem and hang if the host is unavailable.
The process often gets into an uninterruptible state.

Our solution was to place each mount point in a separate directory.
So for your directory I'd do:

        umount /data0
        mkdir -p /nfs/data0/nfs
        chmod -R a+rx /nfs
        rmdir /data0
        ln -s /nfs/data0/nfs /data0

and use the following line in /etc/fstab:

        /data0@csrp /nfs/data0/nfs nfs rw,bg,intr 0 0

(We also use the hard option for mounting.)
The symbolic link preserves the pathname.

=======
From: Jon Reeves <reeves_at_zk3.dec.com>
You probably want to add "intr" to your options; you might want to add
"soft", but with "rw" that's living dangerously. Our NFS gurus have
referred to a soft-mounted file system as a "corrupt" file system.

More important, you should reconsider your mount point; by having a
hard-mounted NFS file system at the root level, you ensure that every
getcwd, most opens, command searches, ... will run across that mount point
and will "stat" it and then hang. If at all possible, you should create
a directory specific to the NFS mounts, then mount under that directory
(e.g., /server-name/data0). You could use a soft-link to point there.
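This advice can be checked with a small diagnostic. The following C sketch lists the entries of a directory that have another file system mounted on them, by comparing st_dev values; run on "/", it shows which mounts a getcwd() walk through the root directory may end up stat()ing. The name list_mount_points() is hypothetical, not a system utility. Because lstat() crosses into the mounted file system, the check can itself hang on a dead hard NFS mount, so run it while the servers are up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

#ifndef PATH_MAX
#define PATH_MAX 4096
#endif

/* Hypothetical helper: print every entry of `dir` that is a mount point,
 * i.e. a directory whose st_dev differs from that of `dir` itself.
 * Returns the number found, or -1 if `dir` cannot be read.
 * Caution: lstat() on a mount point crosses into the mounted file system,
 * so this check can itself hang on a hard NFS mount whose server is down. */
int list_mount_points(const char *dir, FILE *out)
{
    struct stat ds;
    assert(dir != NULL && out != NULL);
    if (lstat(dir, &ds) != 0)
        return -1;

    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    /* Avoid a doubled slash when dir is "/". */
    const char *sep = (dir[strlen(dir) - 1] == '/') ? "" : "/";
    int count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        char path[PATH_MAX];
        struct stat es;
        if (snprintf(path, sizeof path, "%s%s%s", dir, sep, e->d_name) >= (int)sizeof path)
            continue;
        if (lstat(path, &es) != 0)
            continue;
        /* lstat() does not follow symbolic links, so a link such as
         * /data0 -> /nfs/data0 is not reported; only directories with a
         * different file system mounted on them are. */
        if (S_ISDIR(es.st_mode) && es.st_dev != ds.st_dev) {
            fprintf(out, "%s\n", path);
            count++;
        }
    }
    closedir(d);
    return count;
}
```

After moving the mount from /data0 to /nfs/data0, list_mount_points("/", stdout) should no longer print /data0, while list_mount_points("/nfs", stdout) would print /nfs/data0.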

It's partly a matter of luck, depending on the actual order of directory
entries, whether this will cause you a problem. This might explain the
behavior on the other machines, or you might be using a different mount
point, or perhaps they default to "intr,soft" mounts.

Incidentally, "bg" only affects the action at the time of the initial mount.

=======
Many thanks to those who responded:

        Mike Iglesias <iglesias_at_draco.acs.uci.edu>
        John Stoffel <john_at_WPI.EDU>
        Jon Reeves <reeves_at_zk3.dec.com>
        Bernhard.Schneck_at_Physik.TU-Muenchen.DE
        David R Courtade <drc_at_amherst.com>
        Paul David Fardy <pdf_at_morgan.ucs.mun.ca>
        Khalid Paden <khalid_at_FNAL.FNAL.GOV>
        Jason Yanowitz <yanowitz_at_eternity.cs.umass.edu>
        "Richard L Jackson Jr" <rjackson_at_portal.gmu.edu>

-----------------------------------------------------------------------
Neil R. Smith, Research Assoc./Comp.Sys.Mngr. neils_at_csrp.tamu.edu
Climate System Research Program 409/862-4342
Dept. of Meteorology, Texas A&M Univ., USA 409/862-4132 FAX
-----------------------------------------------------------------------
Received on Fri Aug 18 1995 - 18:46:26 NZST
