Hello!
The problem (see the original symptom below) was cured by
changing the mount options to -o tcp,hard instead of udp,soft
(udp,hard is the default).
Gerhard Niklasch from Sun support pointed out that mounting a
filesystem with -o soft hands error handling over to the programs
working on that filesystem. The NFS errors therefore caused those
programs to fault immediately. Changing to -o hard hands the
responsibility back to the NFS/IP layer. TCP is a more reliable
protocol when the transport layer is congested. It also turned out
to be a bit faster, mainly because the buffers are 60 k instead of
9 k on the UDP layer (but that could be changed with
dxkerneltuner...).
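As a rough illustration of the change, the sketch below rewrites the old
option string into the new one and prints (does not run) a mount command.
The helper name fix_nfs_opts, the server path "server:/export" and the
mount point "/mnt/data" are placeholders made up for this example, and
the exact mount syntax on your system may differ, so check mount(8)
before using it.

```shell
#!/bin/sh
# Hypothetical helper: swap udp -> tcp and soft -> hard in a
# comma-separated NFS option string, leaving other options untouched.
fix_nfs_opts() {
    printf '%s\n' "$1" | tr ',' '\n' \
        | sed -e 's/^udp$/tcp/' -e 's/^soft$/hard/' \
        | paste -s -d, -
}

# Option string taken from the original symptom report.
old_opts="v3,rw,nosuid,udp,soft,intr,actimeo=3"
new_opts=$(fix_nfs_opts "$old_opts")

# Dry run only: print the command instead of executing it.
# "server:/export" and "/mnt/data" are placeholders.
echo "mount -t nfs -o $new_opts server:/export /mnt/data"
```

Printing the command first makes it easy to review the options before
putting them into the real mount configuration.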
It seems that the problem is ultimately caused by a bug in Tru64,
but that should have been fixed with the new patchkits.
Special thanks to Bernard Van Renterghem, who kindly provided the
patch for this situation. Comparing the symbol dates, however, I
hesitated to install it because it was much older than our patched (P2)
5.1 Tru64 (that patch should already be in one of the new patchkits).
Thanks to the following people for answering so quickly (in order of
appearance):
-------------------------------------------------------------------
From: Horst Dieter Lenk <lenk_at_mpi-muelheim.mpg.de>
We had errors in NFS-mounted directories caused by data overruns
between 1 Gbit Ethernet and 100 Mbit Ethernet.
Our solution was to mount the directories using the tcp protocol
instead of udp.
-------------------------------------------------------------------
From: Michael A. Crowley <mcrowley_at_MtHolyoke.edu>
For the little it is worth, I have not seen this with a group
of machines running 4.0b, 5.0a, 5.1 in various combinations.
However, we are running: (v3, rw, tcp, hard, intr)
so there are several things to try differently.
-------------------------------------------------------------------
From: Ken Kleiner <ken_at_cs.uml.edu>
I had this problem when accessing nfs filesystems that were
served off of a 4.x machine, but mounted from a 5.x machine.
The quick fix is to mount them as nfsv2, but that only allows < 2GB files.
The long-term fix is to get the patch from Compaq - they sent me one
for this. I think it may be in the latest 5.0a jumbo patch.
-------------------------------------------------------------------
From: Bernhard Van Renterghem <vanrent_at_pcpm.ucl.ac.be>
On our 8 ES40, we get the same error and our Compaq support gave us a patch
which seems to fix that problem:
Engineering provided an extra patch to correct the
"NFS3 RFS3_WRITE failed" problem.
Attached below is the rpc.mod kernel module that should be
installed as follows:
As superuser:
    cp -p /sys/BINARY/rpc.mod /sys/BINARY/rpc.mod_orig
    cp ./rpc.mod /sys/BINARY/rpc.mod
    chmod 0644 /sys/BINARY/rpc.mod
    chown bin:bin /sys/BINARY/rpc.mod
A kernel build and reboot is required.
The rpc.mod.gz (186843 bytes) can be downloaded from
http://big.pcpm.ucl.ac.be/rpc.mod.gz
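Before copying the module into /sys/BINARY, it may be worth checking
the download against the size quoted above. The sketch below does only
that and then unpacks the file. The helper name check_size is made up
for this example, and a matching size shows only that the download was
probably not truncated, not that the module is right for your patch
level.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE ($1) is exactly
# EXPECTED ($2) bytes long.
check_size() {
    actual=$(wc -c < "$1" | tr -d ' ')
    [ "$actual" -eq "$2" ]
}

# Usage sketch: unpack only when the size matches the 186843 bytes
# quoted for rpc.mod.gz; do nothing if the file is not present.
if [ -f ./rpc.mod.gz ]; then
    if check_size ./rpc.mod.gz 186843; then
        gunzip ./rpc.mod.gz    # leaves ./rpc.mod in the current directory
    else
        echo "rpc.mod.gz: unexpected size, not unpacking" >&2
    fi
fi
```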
-------------------------------------------------------------------
>====================================
> ORIGINAL SYMPTOM:
>On Tru Client:
>NFS3 write error 5 on host xxx
>NFS3 RFS3_WRITE failed for server xxx: RPC: Server can't decode arguments
>
>On Sun NFS Server:
>Mar 6 15:45:06 xxx unix: xdrmblk_getmblk failed
>Mar 6 15:45:07 xxx unix: NOTICE: nfs_server: bad getargs for 3/7
>
>Programs that try to write are often faulting.
>
>The directories are mounted with options v3,rw,nosuid,udp,soft,
>intr,actimeo=3.
>====================================
--
Dr. Udo Grabowski email: udo.grabowski_at_imk.fzk.de
Institut f. Meteorologie und Klimaforschung II, Forschungszentrum Karlsruhe
Postfach 3640, D-76021 Karlsruhe, Germany Tel: (+49) 7247 82-6026
http://www.fzk.de/imk/imk2/ame/grabowski/ Fax: " -6141
Received on Mon Mar 12 2001 - 08:37:21 NZDT