[SUMMARY] Unavailable NFS disk causes machine to seize up.

From: Anne Henderson <a.henderson_at_dem.csiro.au>
Date: Thu, 26 Oct 95 14:28:31 EST

My original question:
---------------------

I am having a problem following my OSF/1 upgrade from 2.X to 3.2A. I
configured the system files exactly as previously. The entry in the
/etc/fstab file for the NFS disk is like this:

/home@tvlsun /export/tvlsun/home nfs rw,bg 0 0


If tvlsun goes down, my system can't seem to handle it. Everything clags
up. Commands such as ls, ps, pwd, and more (regardless of the current
directory - even local) are sure-fire ways to cause the problem. They all
result in an error like:

NFS waiting for server to respond.


It doesn't make sense to me unless something that interrogates the NFS disk
is wrapping those programs.

Other commands can be run without apparent ill-effect: su, cd, shutdown, init.

I know that on our previous system, OSF/1 (version 2.1), if you ran the df
or umount(!) command, you would get this response, but at least that made
some sense.

Can anyone suggest anything? Automount? Different mount parameters? I don't
like having to reboot every time the other machine goes down.

Any help would be most appreciated.


Answer:
-------

I tried a couple of suggestions from people without any luck.

In my case, I tracked the problem down, NOT to the NFS mount options (I have
set intr and soft with no difference in behaviour), the path definition, or
a command-wrapping program (per se), but to an environment variable
definition (!), specifically LD_LIBRARY_PATH. One of the directories
listed in this variable was an NFS directory.

If I take the potentially-offline NFS directory out of the LD_LIBRARY_PATH
variable definition everything works fine. I don't quite understand what is
happening here, but presumably the directories listed in LD_LIBRARY_PATH are
searched for shared libraries each time a dynamically linked command starts.
If one of those directories can't be reached, the command seems to hang.
(I am using tcsh, BTW.)

I don't believe the definition of LD_LIBRARY_PATH is any different from what
it was when we ran OSF/1 V2.1, where I encountered no problems, but OSF/1 V3.2
doesn't appear to handle non-existent or unreachable directories as
gracefully.

As it turns out, the software in question apparently runs fine without
LD_LIBRARY_PATH defined. So I found a quick and dirty solution.
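
For anyone who cannot simply drop LD_LIBRARY_PATH, here is a minimal sketch of
the same idea: filter the NFS directories out of a colon-separated search path.
Python is used purely for illustration. The function name strip_nfs_dirs and
the example library directories are hypothetical; only the mount point
/export/tvlsun/home comes from the fstab entry above. The filter works on
strings alone, because touching the file system on a dead mount is exactly
what hangs.

```python
import posixpath


def strip_nfs_dirs(search_path, nfs_mounts):
    """Return search_path without the entries that lie on an NFS mount point.

    search_path: colon-separated list, e.g. the value of LD_LIBRARY_PATH or PATH.
    nfs_mounts:  mount points known to be NFS (supplied by the caller).
    Only string handling is used; nothing here stats the file system.
    """
    mounts = [posixpath.normpath(m) for m in nfs_mounts]

    def on_nfs(entry):
        entry = posixpath.normpath(entry)
        # Match the mount point itself or anything below it, but not a
        # sibling that merely shares the prefix (e.g. /export/tvlsun/homework).
        return any(entry == m or entry.startswith(m.rstrip("/") + "/")
                   for m in mounts)

    # Empty entries are dropped as well as the NFS ones.
    kept = [e for e in search_path.split(":") if e and not on_nfs(e)]
    return ":".join(kept)


# Hypothetical example value; the library directories are made up.
example = "/usr/shlib:/export/tvlsun/home/lib:/usr/local/lib"
print(strip_nfs_dirs(example, ["/export/tvlsun/home"]))
```

The same function can be pointed at PATH, which several of the replies below
suspected.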

Thank you to the following for their responses and advice:

Anthony Baxter, Alan Rollow, John Kinsella, Hellebo Knut, Dave Wolinski,
Brad Daniels, Craig Hagan, Kurt Watkins, and Kristian Koehnto
 

I have appended their responses below for those who are interested:

----------------------------------------------------------------------------
something must be accessing the hung mount point. Does the PATH contain
a dir on the hung mount point?

Anthony

----------------------------------------------------------------------------
The "intr" option will allow interrupting the wait. That should take
care of the problem of having to reboot. I don't know what changed
that would cause a wider range of programs to depend on an NFS server
responding. One thing that might help is to put all the NFS mount
points under an extra directory under the root.

This sort of problem can often be caused by getwd() or a similar
function searching upward to find the root and then back down
again to get the current path. If an NFS mount point appears
in the search path before the current directory, it has the
chance to hang. For example, /mnt will typically be searched
before /usr. If /mnt is an NFS mount point and the server for
it is down, touching /mnt will hang all the processes that
go through it. If you put the NFS mount point under /mnt
instead of AT /mnt, /mnt can be searched quickly.

alan_at_nabeth.cxo.dec.com (Alan Rollow - Dr. File System's Home for Wayward
Inodes.)
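
The scan described in the reply above can be pictured with a toy model. This is
only a sketch, not the OSF/1 getwd() source: it assumes the routine examines
the entries of a parent directory one by one, in directory order, until it
finds the one it came from. The function names and the directory listing are
hypothetical.

```python
def entries_touched(parent_entries, target):
    """List the entries a getwd()-style scan examines before finding target.

    parent_entries: names in the parent directory, in the order the scan
    would read them (an assumption; real ordering depends on the file system).
    """
    touched = []
    for name in parent_entries:
        touched.append(name)  # a real routine would stat() this entry
        if name == target:
            break
    return touched


def scan_hangs(parent_entries, target, dead_mounts):
    """True if the scan examines a dead NFS mount point on its way to target."""
    return any(name in dead_mounts
               for name in entries_touched(parent_entries, target))


# Hypothetical root directory with the NFS mount AT /mnt.
root = ["bin", "etc", "mnt", "usr"]
print(scan_hangs(root, "usr", {"mnt"}))   # /mnt is examined before /usr
print(scan_hangs(root, "usr", set()))     # mount moved below /mnt instead
```

In the second call /mnt is an ordinary local directory, so examining it
returns at once; the dead mount point lives one level down and is never
reached by a scan of the root.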
----------------------------------------------------------------------------
> /home@tvlsun /export/tvlsun/home nfs rw,bg 0 0

Hey. I had this same problem at my last job with a bunch of HP/UX
machines...never did find a proper fix, and HP was still looking into
it when I left for greener pastures. I think your best bet might be to
try the Berkeley automounter (amd). I believe you can get source at
ftp.cs.berkeley.edu, and I believe I had it running on a DEC 3000/400
at one point in time. Hope this helps somewhat!

John
_____________________________________________________________________________
John Kinsella, Sys Admin | UC Davis Math Department | Hardware: (n) The
Voice: (916) 752-8801 | 585 Kerr Hall | part of a computer
johnk_at_ucdmath.ucdavis.edu | Davis, CA 95616 | that can be kicked.

----------------------------------------------------------------------------
Regards,

Try mounting the disk one level below /, e.g. at /nfs/home, and see what happens.

-- 
      ******************************************************************
      *         Knut Helleboe                    | DAMN GOOD COFFEE !! *
      *         Norsk Hydro a.s                  | (and hot too)       *
      * Phone: +47 55 996870, Fax: +47 55 996342 |                     *
      * Pager: +47 96 500718                     |                     *
      * E-mail: Knut.Hellebo_at_nho.hydro.com       | Dale Cooper, FBI    *
      ******************************************************************
----------------------------------------------------------------------------
Anne,
	We have the same NFS hanging problem on our OSF v3.0 machines.  When the
server goes down, none of the clients can run any commands, even local
commands, as you described.  I'm sorry that I can't offer an answer, but I'd
be very interested to read your summary when you get some answers.
Good luck.
Dave Wolinski
wolinski_at_umich.edu
----------------------------------------------------------------------------
Are you sure there isn't a directory on that machine in your path?  That
would cause the behavior you see.  Try doing "/bin/ls /bin" while the other
machine is down, and see if that freezes up.  If so, I have no clue what to
do.  I strongly suspect it won't cause a problem, however.  We ended up
setting up a system where shared executables and scripts get automatically
shadowed to local disks on all the machines that need them.  (We put
something in cron to copy the files from the NFS directories to directories
under /usr/local.shared.)  We then set all the paths to get it from the
shared directory instead of the master.  We occasionally have to do a manual
push when we need a file quickly, but down time is no longer a problem for
anyone who doesn't need resources unique to the machine that goes down.
- Brad
---------------------------------------------------------------------------
+ Brad Daniels                  | Before you walk a mile in another man's +
+ Biles and Associates          | shoes, be sure to spray them with that  +
+ These are my views, not B&A's | disinfectant they use at bowling alleys.+
---------------------------------------------------------------------------
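
The shadowing scheme described in the reply above might look roughly like the
sketch below, which could be run from cron. It is not Brad's actual script:
the function name shadow_copy and the master path in the usage note are
hypothetical, and only /usr/local.shared comes from his message. It copies
new or changed regular files from an NFS master directory into a local
directory.

```python
import filecmp
import os
import shutil


def shadow_copy(master_dir, shadow_dir):
    """Copy new or changed regular files from master_dir into shadow_dir.

    Returns the names of the files that were copied.
    """
    os.makedirs(shadow_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(master_dir)):
        src = os.path.join(master_dir, name)
        dst = os.path.join(shadow_dir, name)
        if not os.path.isfile(src):
            continue  # this sketch ignores subdirectories
        if os.path.exists(dst) and filecmp.cmp(src, dst, shallow=False):
            continue  # local copy already matches the master
        shutil.copy2(src, dst)  # copy2 also keeps permission bits and times
        copied.append(name)
    return copied
```

A cron job could call, say, shadow_copy("/nfs/master/bin",
"/usr/local.shared/bin"), with paths pointing at the local copy only. The copy
job itself still depends on the server, but interactive shells no longer do.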
----------------------------------------------------------------------------
Add "soft,intr" to your NFS mount options.
-- craig
----------------------------------------------------------------------------
Hi Anne.
Is there any chance that an NFS mounted directory is in your $PATH? (The
commands you mention working look like root level commands, but the
non-working commands are user level commands one might use from a personal
account.) This would introduce the interrogation behaviour you describe. 
And the default mount is a hard mount, yes?  Lost soft mounts can
sometimes behave this way. 
G'Luck
K.
____________________________________________________________________
   Kurt Watkins                           Watkins_at_howie.swmed.EDU
   Howard Hughes Medical Institute          Phone: (214) 648-5034
   UT Southwestern Medical Center           Fax:   (214) 648-5066
   5323 Harry Hines Blvd. Y4.106         
   Dallas, TX 75235-9050 
----------------------------------------------------------------------------
The obvious: Is anything in your path referencing the missing disk?
Kristian
----------------------------------------------------------------------------
Anne E. Henderson
Tropical Remote Sensing Unit
CSIRO division of Exploration and Mining
PMB Aitkenvale PO, Townsville, 4814, Australia
e-mail a.henderson_at_dem.csiro.au
ph:+61 77538544  fax:+61 77538600
Received on Thu Oct 26 1995 - 06:00:30 NZDT