I was seeing problems while compiling code onto an NFS server that has
a Prestoserve accelerator in it. The symptoms suggested that Prestoserve
was somehow responsible. Alan Rollow told me that no Prestoserve patches
address any problem sounding like this. However, Ian Stewart and Marco
Luchini report seeing the same problem I did. They suggested turning off
Prestoserve (I have), installing patches for 2.0, or upgrading to 3.2.
I'll put this off until 3.2 is installed (hopefully soon!) and then
check it. Full text is appended.
Thanks to:
Alan Rollow <alan_at_nabeth.cxo.dec.com>
Keith Chiles <kchiles_at_hccsf.com>
Tien LH Mai <tienm_at_amath.washington.edu>
Ian Stewart <Ian.Stewart_at_ranplc.co.uk>
Marco Luchini <luchini_at_siberia.ups-tlse.fr>
Jim Wright Keck Center for Integrative Neuroscience
jwright_at_phy.ucsf.edu Department of Physiology, Box 0444
voice 415-502-4874 513 Parnassus Ave, Room HSE-811
fax 415-502-4848 UCSF, San Francisco, CA 94143-0444
---------------------------------------------------------------------------
Date: Fri, 21 Apr 1995 15:39:55 -0700 (PDT)
From: Jim Wright <jwright_at_phy.ucsf.edu>
I have a PrestoServe NFS accelerator installed in a DEC 3000/600
running OSF/1 2.0. I believe it is corrupting files during compilation
of C code. The symptom is that an executable generated by an NFS client
dies with "illegal instruction". The same code works fine when compiled
to the client's local disk, and also when compiled by the server on its
local disk. Once built, the application works fine for either local or
NFS clients. This is repeatable with a wide range of code, from very
simple to very large.
Everything reports as being fine with the presto and dxpresto commands.
I can find no log files which indicate any problem.
So, does anyone else use the turbochannel PrestoServe board? Successfully?
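One way to narrow down a symptom like this is to compare the executable written over NFS against a known-good copy, block by block, to see where the two differ. The following is only a minimal sketch in Python: the function name and the 8192-byte comparison granularity are assumptions for illustration, not anything taken from this thread. Two separate compiles of the same source may also differ legitimately (for example in embedded timestamps), so comparing a copied file against its original is the cleaner test.

```python
BLOCK_SIZE = 8192  # assumed comparison granularity, not a value from the thread


def differing_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Return the byte offsets of the blocks that differ between two files."""
    offsets = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        offset = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                # Both files are exhausted.
                break
            if a != b:
                # Also catches the case where one file is shorter.
                offsets.append(offset)
            offset += block_size
    return offsets
```

Calling `differing_blocks("good_binary", "nfs_binary")` (both paths hypothetical) returns an empty list when the files are identical. A handful of offsets clustered near the end of the file would suggest a problem with the final partial block rather than with the whole write.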
---------------------------------------------------------------------------
Date: Fri, 21 Apr 95 17:34:48 -0600
From: alan_at_nabeth.cxo.dec.com
Well, the quick check to see if Prestoserve is at fault
is to turn it off and try the remote build. If that
still fails, it isn't Prestoserve.
It is worth noting that OSF/1 will let you overwrite a
running executable. It doesn't take long for the VM code
to notice it has lost the original executable and stop the
running process with an "illegal instruction" signal.
---------------------------------------------------------------------------
Date: Fri, 21 Apr 1995 20:00:12 -0700 (PDT)
From: Jim Wright <jwright_at_phy.ucsf.edu>
Thanks for the response. Yup, the problems go away when presto is
turned off. Also, the problem doesn't involve overwriting running
executables. Everything (I can figure out) points to prestoserve.
Jim
---------------------------------------------------------------------------
Date: Sat, 22 Apr 1995 00:03:44 -0600
From: alan_at_nabeth.cxo.dec.com (Alan Rollow - Dr. File System's Home for Wayward Inodes.)
There are only two known patches for Prestoserve, one
related to the Advanced File System and the other for
an LSM shutdown problem. There appear to be a variety
of patches for assorted UFS and FDDI corruption problems,
but none related to Prestoserve.
The CSC Web server has a list of the patches available.
I think it is www.service.digital.com. If you have a
contract you can get the patches. If not, you can pay
a per-call charge to get them.
---------------------------------------------------------------------------
Date: Mon, 24 Apr 95 08:20:46
From: "keith chiles" <kchiles_at_hccsf.com>
Jim,
I have a problem with thinking that the Prestoserve is causing your
compile problems. I have no experience with presto on an Alpha box,
but I did have it on a DecServer running Ultrix. Prestoserve was just
a battery-backed disk cache that allowed write-behind caching
without the fear of losing data. My software development team was
compiling "C" code across the net and had no compatibility problems.
If the cache were causing a problem, then I would expect the problem
to show up on code that is compiled locally on the server.
I would try taking the Prestoserve off-line and running your tests
again. I suspect that NFS might be the problem, or at least the
NFS-Presto link is where the problem is located.
Good luck, Keith
---------------------------------------------------------------------------
Date: Mon, 24 Apr 1995 09:07:58 -0700 (PDT)
From: Tien LH Mai <tienm_at_amath.washington.edu>
what you need is:
/usr/sys/BINARY/ufs_bmap.o (HPAQ4140D)
CHECKSUM: 42137 110
/usr/sys/BINARY.rt/ufs_bmap.o
CHECKSUM: 62478 117
-----------------------------
Patch Id: OSFV20-028-1
check w/your DEC support.
I believe a similar problem exists on v3.2. I'm still trying to verify
w/DEC.
--Tien
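The "CHECKSUM" pairs in patch listings like this one look like a checksum followed by a block count, the layout printed by sum(1), but the thread does not say which tool or algorithm produced them. The sketch below assumes the classic BSD 16-bit rotating checksum with 1024-byte blocks; both are assumptions, and the function names are made up for illustration. If the numbers do not match, that may only mean the assumption is wrong, not that the patch is missing.

```python
def bsd_sum(data, block_size=1024):
    """Return (checksum, blocks) using the classic BSD 16-bit rotating sum.

    The 1024-byte block size is an assumption; some systems count
    512-byte blocks instead.
    """
    checksum = 0
    for byte in data:
        # Rotate the 16-bit checksum right by one bit, then add the byte.
        checksum = (checksum >> 1) + ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    blocks = (len(data) + block_size - 1) // block_size
    return checksum, blocks


def file_matches(path, expected_checksum, expected_blocks):
    """Check a file against a published (checksum, blocks) pair."""
    with open(path, "rb") as f:
        return bsd_sum(f.read()) == (expected_checksum, expected_blocks)
```

Under these assumptions, `file_matches("/usr/sys/BINARY/ufs_bmap.o", 42137, 110)` would report whether the installed object file matches the patched one listed above.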
---------------------------------------------------------------------------
Date: Mon, 24 Apr 1995 15:34:08 -0700 (PDT)
From: Jim Wright <jwright_at_phy.ucsf.edu>
Here's an overview of my experience so far
                      destination disk
                 |  A   | A+presto |  B
            -----+------+----------+------
   cpu        A  |  ok  |    ok    |  ok
   for      -----+------+----------+------
   compiler   B  |  ok  | corrupt  |  ok
The "A" and "A+presto" disks are the same location, first with prestoserve
disabled and then with it enabled. I've just pinpointed this recently,
but I can't be sure how long this behavior has been present. My impression
is that it just started recently.
> If the cache were causing a problem, then I would expect the problem
> to show up on code that is compiled locally on the server.
I don't quite understand this. I thought the NFS accelerator had no
effect when accessing disk locally. And my tests so far reinforce that.
Thanks for your answer,
---------------------------------------------------------------------------
Date: Mon, 24 Apr 1995 16:08:46 -0700 (PDT)
From: Jim Wright <jwright_at_phy.ucsf.edu>
Alan, could I impose upon you for your opinion of this, regarding
Prestoserve and NFS write corruptions?
> Date: Mon, 24 Apr 1995 09:07:58 -0700 (PDT)
> From: Tien LH Mai <tienm_at_amath.washington.edu>
>
> what you need is:
> /usr/sys/BINARY/ufs_bmap.o (HPAQ4140D)
> CHECKSUM: 42137 110
> /usr/sys/BINARY.rt/ufs_bmap.o
> CHECKSUM: 62478 117
> -----------------------------
>
> Patch Id: OSFV20-028-1
>
> check w/your DEC support.
>
> I believe similar problem exists on v3.2. I'm still trying to verify
> w/DEC.
>
> --Tien
I should have included this in my first posting to clarify a bit further.
All machines are OSF/1 v2.0 and all filesystems locally are UFS.
                      destination disk
                 |  A   | A+presto |  B
            -----+------+----------+------
   cpu        A  |  ok  |    ok    |  ok
   for      -----+------+----------+------
   compiler   B  |  ok  | corrupt  |  ok
The "A" and "A+presto" disks are the same location, first with prestoserve
disabled and then with it enabled. I've just pinpointed this recently,
but I can't be sure how long this behavior has been present. My impression
is that it just started recently.
Thanks for your trouble,
Jim
---------------------------------------------------------------------------
Date: Mon, 24 Apr 95 18:03:46 -0600
From: alan_at_nabeth.cxo.dec.com
While the ufs_bmap patch is certainly for V2.0 and
could cause data corruption, the text of the patch
doesn't make it appear to have anything to do with
Prestoserve.
The only Prestoserve patches appear to be one that replaces
the presto(8) command to include some feature for the
Advanced File System and a replacement pr.o to solve
a panic; see below. So, I see two possibilities left:
1. An undiscovered bug in V2.0 related to Prestoserve.
2. A problem with Prestoserve NVRAM.
You won't get #1 fixed because V2.0 is two versions out
of date and long unsupported. A CSC will recommend an
upgrade. #2 is a hardware problem.
/sys/BINARY/pr.o
CHECKSUM: 30850 186
/sys/BINARY.rt/pr.o
CHECKSUM: 23495 188
----------------------
Problem 1: (QAR 21070)
*********
Patch ID: OSFV20-015-2 (included in V2.1)
This is to fix a panic which appears with the following panic string:
"vrele: bad ref count"
The signature of this panic is that the stack trace goes through
the nfs server's write gathering code as follows:
(dbx) t
> 0 boot(reason = 0, howto = 0) ["../../../../src/kernel/arch/alpha/machdep.c"
1 panic(s = 0xfffffc00004d3d00 = "vrele: bad ref count") ["../../../../src/k
2 vrele(vp = 0xfffffc0000283300) ["../../../../src/kernel/vfs/vfs_subr.c":10
3 rfs_writeg(vp = 0xffffffff8917ef80, wa = 0xffffffff89388500, ns = 0xffffff
4 rfs_write(wa = 0xffffffff8917ef80, ns = 0xffffffff8917f200, nreq = 0xfffff
5 rfs_dispatch(req = 0xffffffff991cbaa8, xprt = 0xffffffff89388500) ["../../
6 svc_getreq(xprt = 0xffffffff8938c800) ["../../../../src/kernel/rpc/svc.c":
7 svc_run(xprt = 0xfffffc0000290c10) ["../../../../src/kernel/rpc/svc.c":502
8 nfs_svc(p = 0xffffffff9885d7e8, args = (nil), retval = 0xffffffff991cbe10)
9 nfssvc(p = 0xffffffff9885d7e8, args = 0xffffffff991cbe20, retval = 0xfffff
10 syscall(ep = 0xffffffff991cbef8, code = 158) ["../../../../src/kernel/arch
11 _Xsyscall() ["../../../../src/kernel/arch/alpha/locore.s":860, 0xfffffc000
Problem 2: (QAR 22865)
*********
Patch Id: OSFV20-071-2 (included in V2.1)
Logical Storage Manager(LSM) V1.0 runs on DEC OSF/1 V2.0. In LSM V1.0, when
LSM volumes have been enabled for presto and the system is brought down
abnormally (power failure, system panic etc.), the system will panic while
trying to flush dirty NVRAM buffers on a subsequent system reboot.
This correction requires a kernel rebuild.
---------------------------------------------------------------------------
Date: Mon, 24 Apr 95 17:01:31
From: "keith chiles" <kchiles_at_hccsf.com>
Jim,
Thanks for your response. I am not sure about Prestoserve now, but when I was
running it on my DECsystem 5900, I was seeing cache hits in the 80% range before
I ever put an NFS mount on it. It was my understanding that it was a write cache
that improved all disk performance by allowing a rapid response to all file
updates, which let the kernel send a steady stream to the disk. On small writes
like inodes, it really improved performance and reduced write wait states. Your
chart does, indeed, suggest that there is a problem between NFS and Prestoserve
as you suspected. I stand corrected.
Cheers, Keith
---------------------------------------------------------------------------
From: Ian Stewart <Ian.Stewart_at_ranplc.co.uk>
Date: Tue, 25 Apr 1995 17:53:46 +0100
We had this same problem some time back; there is a patch
available from DEC. The problem also seems to go away if you
upgrade to OSF/1 3.0 or later, which is what we did in the
end.
Ian Stewart
---------------------------------------------------------------------------
From: luchini_at_siberia.ups-tlse.fr (Marco Luchini)
To: Jim Wright <jwright_at_phy.ucsf.edu>
Hi Jim,
> I have a PrestoServe NFS accelerator installed in a DEC 3000/600
> running OSF/1 2.0. I believe it is corrupting files during compilation
> of C code. The symptom is that an executable generated by an NFS client
Yes indeed it does. We had exactly the same problem. Turning off
Presto stopped the errors. Eventually we installed a number of patches
to 2.0 which solved the problem. I believe the most relevant one is:
OSFV20-028-1 which states in its README:
Data corruption was being caused by fragments of files being incorrectly
written to disk.
But we also installed OSFV20-015-1 and a few NFS related patches as well
- there were quite a lot in 2.0 and I was glad to upgrade from it. If I
were you I'd upgrade to OSF3.2 and all problems should go away.
> So, does anyone else use the turbochannel PrestoServe board? Successfully?
Actually, in the end, I don't think it's the presto's fault. The errors
happen on non-presto platforms as well according to DEC, just much more
rarely. So we would only detect them with presto turned on.
Check out the COMET search gateway with the key "osf" for a full list of
patches:
http://www.service.digital.com:8031/
Ciao, Marco
-------------------------------------------------------------------------
Marco Luchini Internet : m.luchini_at_ic.ac.uk
Laboratoire de Physique Quantique Telephone: +33 61.55.60.39
Universite' Paul Sabatier Fax: +33 61.55.60.65
31062 Toulouse, FRANCE
-------------------------------------------------------------------------
Received on Mon May 01 1995 - 17:44:24 NZST