We've recently run across a strange file corruption problem, and are
wondering if anyone else has seen anything similar.
We have a number of Alphas with filesystems crossmounted via NFS. In
particular, we have a 2100 with /etc mounted on a number of other
Alphas.
We run a locally built passwd program that updates the passwd files on
multiple machines, with appropriate (and successful) locking. This
program updates the password file on the machine it is called on,
_and_ on the 2100 in question.
Intermittently, and with no apparent pattern, the password file
"loses" a block of records from the middle. The missing block always
starts at a 512 byte boundary, but its length does not always seem to
be an integral number of 512 byte blocks. The
only system showing this problem is the 2100, where updates are being
made through NFS. The system where the passwd program is being run --
another Alpha -- has _its_ passwd file (which is local to itself)
updated correctly.
[I hope that made sense.]
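
For anyone wanting to check a damaged copy against a good one, here is a
rough Python sketch of the comparison described above. It assumes the
damage is a single contiguous deletion. The function names and the
default 512 byte block size are illustrative choices, not part of our
passwd program. Because repeated bytes can make the exact start
ambiguous, it reports a range of possible start offsets.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b share."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def find_missing_block(good: bytes, bad: bytes, block_size: int = 512):
    """Describe a single contiguous deletion turning `good` into `bad`.

    Returns None if `bad` is not shorter than `good`, or if the
    difference cannot be explained by one deleted run of bytes.
    """
    missing = len(good) - len(bad)
    if missing <= 0:
        return None
    prefix = common_prefix_len(good, bad)
    suffix = common_prefix_len(good[::-1], bad[::-1])
    if prefix + suffix < len(bad):
        return None  # more than one region differs
    # The deletion may have started anywhere in [earliest, latest].
    earliest = max(0, len(bad) - suffix)
    latest = min(prefix, len(bad))
    # Smallest multiple of block_size that is >= earliest.
    first_aligned = -(-earliest // block_size) * block_size
    return {
        "earliest_start": earliest,
        "latest_start": latest,
        "length": missing,
        "aligned_start_possible": first_aligned <= latest,
        "whole_blocks": missing % block_size == 0,
    }
```

Reading both copies with open(path, "rb").read() and passing the bytes
to find_missing_block() gives the offset range and length of the hole,
plus whether a block-aligned start is consistent with the data.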
The passwd program happens to be written in Perl, although that
doesn't seem to be particularly relevant.
Another user has, just today, reported a similar situation (large
block of data missing from the middle of a file) on another system.
This was just a plain text file -- no passwd file involved here. He
had a (Perl) program that writes records to two different files -- one
local to his machine, the other NFS mounted. Again, it was the NFS
copy that showed the corruption.
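
One thing that may be worth ruling out in a program like this is a write
error that is never noticed. On NFS, a failed write can be reported only
when the file is flushed or closed. The Python sketch below is a
hypothetical helper, not the user's program: it writes the data, forces
it out with fsync, lets any error from fsync or close propagate, and
then re-reads the file to compare checksums. The re-read may be
satisfied from the client's cache, so a match does not prove that the
server's copy is intact.

```python
import hashlib
import os


def write_and_verify(path: str, data: bytes) -> bool:
    """Write data to path, flush it, and compare a re-read checksum."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # Deferred write errors can surface here as OSError.
        os.fsync(fd)
    finally:
        # close can also raise OSError; it is not caught here.
        os.close(fd)
    with open(path, "rb") as fh:
        on_disk = fh.read()
    return hashlib.sha256(on_disk).digest() == hashlib.sha256(data).digest()
```

Calling this once for the local file and once for the NFS mounted file
would at least show whether the writing side ever sees an error.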
This problem is intermittent. We can run for weeks without being
"hit". On occasion, both of the machines that "hosted" the corrupted
NFS file can become extremely busy, although we cannot confirm that
any such busy peaks coincided with the corruption.
We are less willing to suspect Perl in this case, now that a second
instance has shown up, along with a third that does not involve Perl
at all. We are currently leaning toward either a deep-seated IO bug
(which we tend to consider unlikely), or an NFS problem of some sort.
I apologize for the speculative nature of this message, but we're
kind of grasping here. Any and all information, suggestions,
hints, etc. will be welcome.
--
John G Dobnick
Information & Media Technologies
University of Wisconsin - Milwaukee
jgd_at_csd.uwm.edu   ATTnet: (414) 229-5727

"Knowing how things work is the basis for appreciation, and is thus a
source of civilized delight." -- William Safire
Received on Tue Oct 27 1998 - 22:57:58 NZDT