Bizarre problem (?) on 4100 running 4.0E

From: Jay Ponder <ponder_at_dasher.wustl.edu>
Date: Thu, 06 May 1999 16:43:35 -0500

 Dear Tru64/Digital Unix/OSF1 Managers,

 We recently acquired a 4x400MHz 4100 with 512Mb memory (Dataram, new),
 running 4.0E + patch kit 1, and firmware from the 5.3 CD that comes with
 4.0E. The system is on a 4Gb DEC RZ1CB-CS disk hanging off a KZPBA-CA PCI
 Ultra Wide SCSI card. The machine also has 8x9Gb user disk hanging off a
 2nd KZPBA-CA and a KZPDA-AA PCI Wide SCSI II adapter. All file systems
 are UFS, not AdvFS. The disks are NFS mounted by a set of AlphaStation 200
 4/233's also running the identical 4.0E + patch kit 1.

 Now for the problem. We get random, non-reproducible errors on the 4100
 that appear to always be related in some way to disk access (on *all*
 disks it seems). For example, doconfig will often (50% of the time) not
 be able to build a new kernel and thinks needed files or directories are
 missing (they are not). User programs will occasionally write corrupt
 files to disk containing just a few extra or missing characters. Adobe
 Acrobat will think fonts are missing. Reruns immediately after getting an
 error will occasionally succeed. And so on.... The only common thing is
 that all errors do seem to involve I/O. The problem occurs under the
 vmunix kernel that we can occasionally build, and under genvmunix, and
 occurs both before and after installing 4.0E patch kit 1.

 The odd thing is that we have no problems at all with the AS200's that
 access the same disks via NFS. No corrupted files; nothing. All programs
 run successfully to completion, Acrobat works, etc. Our first thought was
 that it might be some bizarre "timing" problem between memory and the I/O
 subsystem.

 We have rotated the PCI SCSI cards with no effect. We replaced one of the
 KZPDA-AA's with a new KZPBA-CA to no effect. We got Dataram to swap out
 the 512Mb memory cards for a new set, with no effect. We have run system
 exercise routines without error on the 4100. We will probably next try
 booting with only a single CPU present, but this seems like a long shot.
 We could also try building 4.0A, which we have around, on a different disk
 but this is also a long shot (?).
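
 Since the corruption is random, a repeatable write-and-read-back check
 might help turn it into something countable per disk and per adapter.
 What follows is only a minimal sketch, not a tool that was used here: the
 function name write_and_verify, the "iocheck-" file prefix, and the
 default sizes are all hypothetical. Note also that a read issued right
 after a write is likely to be served from the buffer cache, so a clean
 run does not prove the data reached the disk intact.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, size_bytes=1 << 20, rounds=10, seed=0):
    """Write a known pattern to `directory`, read it back, and compare.

    Returns a list of (round, bytes_read, sha256) tuples for every round
    whose read-back data did not match what was written.
    """
    failures = []
    for i in range(rounds):
        # Deterministic payload, so a failing round can be regenerated later.
        block = hashlib.sha256(f"{seed}-{i}".encode()).digest()
        payload = (block * (size_bytes // len(block) + 1))[:size_bytes]
        expected = hashlib.sha256(payload).hexdigest()

        fd, path = tempfile.mkstemp(dir=directory, prefix="iocheck-")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # ask the OS to push the data to disk
            with open(path, "rb") as f:
                data = f.read()
        finally:
            os.remove(path)

        actual = hashlib.sha256(data).hexdigest()
        if len(data) != size_bytes or actual != expected:
            failures.append((i, len(data), actual))
    return failures
```

 Running it against a directory on each disk in turn, both locally on the
 4100 and from one of the AS200's over NFS, would show whether the failure
 rate follows a particular disk, a particular adapter, or only the local
 access path.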

 Any ideas as to even where the problem might lie? I realize this could be
 slightly off-topic since it sounds like a "hardware" instead of a "software"
 problem; but you never know. Any help appreciated.

                              Thanks, Jay P.

--------
Jay W. Ponder Phone: (314) 362-4195
Biochemistry, Box 8231 Fax: (314) 362-7183
Washington University Medical School
660 South Euclid Avenue Email: ponder_at_dasher.wustl.edu
St. Louis, Missouri 63110 USA WWW: http://dasher.wustl.edu/
Received on Thu May 06 1999 - 21:42:35 NZST
