Dear Tru64/Digital Unix/OSF1 Managers,
We recently acquired a 4x400MHz 4100 with 512MB of memory (Dataram, new),
running 4.0E + patch kit 1, and firmware from the 5.3 CD that comes with
4.0E. The system is on a 4GB DEC RZ1CB-CS disk hanging off a KZPBA-CA PCI
Ultra Wide SCSI card. The machine also has 8x9GB user disks hanging off a
2nd KZPBA-CA and a KZPDA-AA PCI Wide SCSI II adapter. All file systems
are UFS, not AdvFS. The disks are NFS mounted by a set of AlphaStation 200
4/233s also running the identical 4.0E + patch kit 1.
Now for the problem. We get random, non-reproducible errors on the 4100
that always appear to be related in some way to disk access (on *all*
disks, it seems). For example, doconfig will often (50% of the time) not
be able to build a new kernel and thinks needed files or directories are
missing (they are not). User programs will occasionally write corrupt
files to disk containing just a few extra or missing characters. Adobe
Acrobat will think fonts are missing. Reruns immediately after getting an
error will occasionally succeed. And so on. The only common thing is
that all errors do seem to involve I/O. The problem occurs under the
vmunix kernel that we can occasionally build, and under genvmunix, and
occurs both before and after installing 4.0E patch kit 1.
The odd thing is that we have no problems at all with the AS200's that
access the same disks via NFS. No corrupted files; nothing. All programs
run successfully to completion, Acrobat works, etc. Our first thought was
that it might be some bizarre "timing" problem between memory and the I/O
subsystem.
We have rotated the PCI SCSI cards with no effect. We replaced one of the
KZPDA-AA's with a new KZPBA-CA to no effect. We got Dataram to swap out
the 512MB memory cards for a new set, with no effect. We have run system
exercise routines without error on the 4100. We will probably next try
booting with only a single CPU present, but this seems like a long shot.
We could also try building 4.0A, which we have around, on a different disk
but this is also a long shot (?).
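The intermittent corruption described above could perhaps be made more
reproducible with a write-then-verify loop run locally on the 4100. What
follows is only a minimal sketch in Python, assuming an interpreter is
available on the machine; the function name write_verify, the probe file
name, and the sizes are hypothetical and chosen purely for illustration.
Note that a read-back done right after a write may be satisfied from the
buffer cache rather than the disk, so a clean run does not rule out a
problem.

```python
import hashlib
import os
import tempfile


def write_verify(path, size=1 << 20, rounds=10, seed=0):
    """Write known data to `path`, read it back, and compare digests.

    Returns the list of round numbers whose read-back did not match.
    All parameters are illustrative defaults, not tuned values.
    """
    failures = []
    for i in range(rounds):
        # Build a deterministic, round-specific payload of about `size` bytes.
        block = hashlib.sha256(f"{seed}-{i}".encode()).digest()  # 32 bytes
        data = block * max(1, size // len(block))
        expected = hashlib.sha256(data).hexdigest()

        # Write the payload and ask the OS to push it to the device.
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())

        # Read the file back and compare against the expected digest.
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()

        if actual != expected:
            failures.append(i)
    return failures


if __name__ == "__main__":
    # Hypothetical usage: probe a scratch directory; point `d` at a
    # directory on the disk under suspicion to test that disk instead.
    with tempfile.TemporaryDirectory() as d:
        bad = write_verify(os.path.join(d, "probe.dat"))
        print("failed rounds:", bad)
```

Running the same loop from one of the AS200s against the NFS-mounted
disks would give a point of comparison with the local runs.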
Any ideas as to even where the problem might lie? I realize this could be
slightly off-topic since it sounds like a "hardware" instead of a "software"
problem; but you never know. Any help appreciated.
Thanks, Jay P.
--------
Jay W. Ponder                          Phone: (314) 362-4195
Biochemistry, Box 8231                 Fax:   (314) 362-7183
Washington University Medical School
660 South Euclid Avenue                Email: ponder_at_dasher.wustl.edu
St. Louis, Missouri 63110 USA          WWW:   http://dasher.wustl.edu/
Received on Thu May 06 1999 - 21:42:35 NZST