I run a mail server (using exim) which has the mail spool NFS mounted by a small number of machines.
In most cases this works perfectly, but occasionally I'll notice a backlog of mail messages for one person building up in the queue.
The logs for these messages contain the same error repeated many times:
local_delivery transport deferred: failed to lock mailbox /var/spool/mail/jet35 (fcntl)
Exim itself uses a lock file as well as fcntl, so if I hold a mailbox on my mail spool with exim_lock and try to deliver a message to it, I get:
local_delivery transport deferred: failed to lock mailbox /var/spool/mail/jet35 (lock file)
(in other words, the "(fcntl)" failure is a real fcntl lock held by something other than Exim, so it's not Exim's fault).
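For reference, the test looked roughly like this; exim_lock is Exim's own mailbox-locking utility, and the message id is just a placeholder:

    # hold the mailbox the way Exim would (lock file + fcntl);
    # with no command given, exim_lock drops into an interactive
    # shell and keeps the locks until that shell exits
    exim_lock -v /var/spool/mail/jet35
    # then, from another window, force a delivery attempt on one
    # of the queued messages
    exim -M <message-id>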
Using lslk I find that the file is locked by rpc.lockd for a remote process:
SRC                       PID   DEV        INUM SZ     TY M ST WH END                 LEN NAME
tcm17.phy.cam.ac.uk       887   2653,37667 108  90895  w  0 0  0  9223372036854775807 0   /var/spool (spool_domain#spool)
tcm28.phy.cam.ac.uk       4526  2653,37667 191  0      w  0 0  0  9223372036854775807 0   /var/spool (spool_domain#spool)
alpha5.poco.phy.cam.ac.uk 9411  2653,37667 186  0      w  0 0  0  9223372036854775807 0   /var/spool (spool_domain#spool)
tcm21.phy.cam.ac.uk       29101 2653,37667 263  481440 w  0 0  0  9223372036854775807 0   /var/spool (spool_domain#spool)
tcm11.phy.cam.ac.uk       32407 2653,37667 443  52287  w  0 0  0  9223372036854775807 0   /var/spool (spool_domain#spool)
Unfortunately no process with that PID exists on any of those machines (alpha5.poco has actually been rebooted and still retains its
lock).
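I checked with something along these lines for each SRC host and PID in the listing (the first entry shown here; exact ps flags may vary between DU versions):

    # -p selects by process id; nothing comes back for PID 887
    rsh tcm17.phy.cam.ac.uk ps -p 887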
I have tried stopping and restarting rpc.lockd and rpc.statd, but it makes no difference.
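That is, something like the following on the server; the /usr/sbin paths are the usual ones but may differ on your system, and the PIDs are placeholders:

    # find the daemons' PIDs...
    ps -e | egrep 'rpc\.(lockd|statd)'
    # ...kill them, then restart (statd, then lockd)
    kill <statd-pid> <lockd-pid>
    /usr/sbin/rpc.statd
    /usr/sbin/rpc.lockd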
Also, to free things up, I copy the user's spool file to a temporary file and then move it back (basically changing its inode), but
this doesn't free up the lock.
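That is, roughly:

    # copy the mailbox aside, then move the copy over the original,
    # so the same pathname ends up pointing at a fresh inode
    cp /var/spool/mail/jet35 /var/spool/mail/jet35.tmp
    mv /var/spool/mail/jet35.tmp /var/spool/mail/jet35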
The lslk listing above is still current, yet neither those processes nor files with those inodes exist any more.
Both server and clients are running rpc.lockd and rpc.statd.
The server is on DU4.0D, alpha5.poco is DU3.2A and the tcm machines are all on DU4.0something.
We have a number of Linux clients too; none of them have had any problems.
This is a very intermittent problem: it occurs for one user every few days, and about half a dozen users have been
affected.
Has anyone seen this sort of thing before, and does anyone have any ideas for a solution?
Is there any way to get rpc.lockd to reset its list of locks?
J.T.
----------------------------------------------------------------------
James Talbut James.Talbut_at_phy.cam.ac.uk
Cavendish Laboratory, Madingley Road, Cambridge, CB3 0HE
Tel: 01223 337457 Fax: 01223 337457