SUMMARY: PrestoServe corrupting compiles

From: Jim Wright <jwright_at_phy.ucsf.edu>
Date: Mon, 1 May 1995 14:42:13 -0700 (PDT)

I was seeing problems while compiling code onto an NFS server
which has a Prestoserve accelerator in it. The symptoms I saw suggested
that the Prestoserve was somehow responsible for the problems. Alan Rollow
told me that there were no patches for Prestoserve which addressed any
problems sounding like this. However, Ian Stewart and Marco Luchini
report seeing the same problem as I did. They suggested turning off
Prestoserve (I have), installing patches for 2.0, or upgrading to 3.2.
I'll put this off until 3.2 is installed (hopefully soon!) and then
check it. Full text is appended.

Thanks to:
        Alan Rollow <alan_at_nabeth.cxo.dec.com>
        Keith Chiles <kchiles_at_hccsf.com>
        Tien LH Mai <tienm_at_amath.washington.edu>
        Ian Stewart <Ian.Stewart_at_ranplc.co.uk>
        Marco Luchini <luchini_at_siberia.ups-tlse.fr>


Jim Wright Keck Center for Integrative Neuroscience
jwright_at_phy.ucsf.edu Department of Physiology, Box 0444
voice 415-502-4874 513 Parnassus Ave, Room HSE-811
fax 415-502-4848 UCSF, San Francisco, CA 94143-0444

---------------------------------------------------------------------------

Date: Fri, 21 Apr 1995 15:39:55 -0700 (PDT)
From: Jim Wright <jwright_at_phy.ucsf.edu>

I have a PrestoServe NFS accelerator installed in a DEC 3000/600
running OSF/1 2.0. I believe it is corrupting files during compilation
of C code. The symptom is that an executable generated by an NFS client
dies with "illegal instruction". The same code when compiled to the
client's local disk works fine; it also works fine when compiled by the
server on its local disk. Once built, the application works fine for
either local or NFS clients. This is repeatable with a wide range of
code, from very simple to very large.

Everything reports as being fine with the presto and dxpresto commands.
I can find no log files which indicate any problem.

So, does anyone else use the turbochannel PrestoServe board? Successfully?
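
One way to narrow down a symptom like this is to compare the binary built
over NFS against a known-good build of the same source, and see which parts
of the file differ. The following is a minimal Python sketch, not part of the
original thread. The function names are hypothetical, and the 1024-byte
fragment size is an assumption chosen for illustration, not a value taken
from the server's filesystem.

```python
def differing_fragments(data_a, data_b, frag_size=1024):
    """Return indices of frag_size-byte fragments where the inputs differ.

    frag_size=1024 is an assumed value for illustration only.
    """
    longest = max(len(data_a), len(data_b))
    n_frags = (longest + frag_size - 1) // frag_size  # round up
    bad = []
    for i in range(n_frags):
        start = i * frag_size
        end = start + frag_size
        # Slicing past the end yields a shorter (or empty) chunk, so a
        # length mismatch shows up as a difference in the last fragment.
        if data_a[start:end] != data_b[start:end]:
            bad.append(i)
    return bad


def compare_builds(good_path, suspect_path, frag_size=1024):
    """Compare two files on disk, e.g. a local build and an NFS build."""
    with open(good_path, "rb") as good, open(suspect_path, "rb") as suspect:
        return differing_fragments(good.read(), suspect.read(), frag_size)
```

An empty result means the two builds are byte-identical. Note that some
compilers embed timestamps or paths, so a small difference is not by itself
proof of corruption.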


---------------------------------------------------------------------------

Date: Fri, 21 Apr 95 17:34:48 -0600
From: alan_at_nabeth.cxo.dec.com


        Well, the quick check to see if Prestoserve is at fault
        is to turn it off and try the remote build. If that
        fails it isn't Prestoserve.

        It is worth noting that OSF/1 will let you overwrite a
        running executable. It doesn't take long for the VM code
        to notice it has lost the original executable and stop the
        running program with an "illegal instruction" signal.

---------------------------------------------------------------------------

Date: Fri, 21 Apr 1995 20:00:12 -0700 (PDT)
From: Jim Wright <jwright_at_phy.ucsf.edu>

Thanks for the response. Yup, the problems go away when presto is
turned off. Also, the problem doesn't involve overwriting running
executables. Everything (I can figure out) points to prestoserve.

Jim

---------------------------------------------------------------------------

Date: Sat, 22 Apr 1995 00:03:44 -0600
From: alan_at_nabeth.cxo.dec.com (Alan Rollow - Dr. File System's Home for Wayward Inodes.)

        There are only two known patches for Prestoserve, one
        related to the Advanced File System and the other for
        an LSM shutdown problem. There appear to be a variety
        of patches for assorted UFS and FDDI corruption problems,
        but none related to Prestoserve.

        The CSC Web server has a list of the patches available.
        I think it is www.service.digital.com. If you have a
        contract you can get the patches. If not, you can pay
        a per-call charge to get them.

---------------------------------------------------------------------------

Date: Mon, 24 Apr 95 08:20:46
From: "keith chiles" <kchiles_at_hccsf.com>


     Jim,
     
     I have a problem with thinking that the Prestoserve is causing your
     compile problems. I have no experience with presto on an alpha box,
     but I did have it on a DecServer running Ultrix. Prestoserve was just
     a battery backed up disk cache that allowed write behind caching
     without the fear of losing data. My software development team was
     compiling "C" code across the net and had no compatibility problems.
     If the cache were causing a problem, then I would expect the problem
     to show up on code that is compiled locally on the server.
     
     I would try taking the Prestoserve off-line and running your tests
     again. I suspect that NFS might be the problem, or at least the
     NFS-Presto link is where the problem is located.
     
     Good luck, Keith

---------------------------------------------------------------------------

Date: Mon, 24 Apr 1995 09:07:58 -0700 (PDT)
From: Tien LH Mai <tienm_at_amath.washington.edu>


what you need is:
/usr/sys/BINARY/ufs_bmap.o (HPAQ4140D)
CHECKSUM: 42137 110
/usr/sys/BINARY.rt/ufs_bmap.o
CHECKSUM: 62478 117
-----------------------------

Patch Id: OSFV20-028-1

check w/your DEC support.

I believe similar problem exists on v3.2. I'm still trying to verify
w/DEC.

--Tien
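
The two-number CHECKSUM lines above look like the output of sum(1): a
checksum followed by a block count. To check an installed object file against
such a value without the original tool, the classic BSD 16-bit rotating
checksum can be sketched in Python as below. This is not part of the original
thread. That the patch notes used this particular algorithm, and the
1024-byte block size, are both assumptions; verify against sum(1) on the
system itself before trusting a mismatch.

```python
def bsd_sum(data, block_size=1024):
    """Classic BSD-style 16-bit rotating checksum plus block count.

    ASSUMPTION: the patch checksums were produced by this algorithm with
    this block size. Neither is confirmed by the thread.
    """
    checksum = 0
    for byte in data:
        # Rotate the 16-bit checksum right by one bit, then add the byte.
        checksum = (checksum >> 1) + ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    blocks = (len(data) + block_size - 1) // block_size  # round up
    return checksum, blocks


def bsd_sum_file(path, block_size=1024):
    """Checksum a file on disk, e.g. an installed kernel object file."""
    with open(path, "rb") as handle:
        return bsd_sum(handle.read(), block_size)
```

The returned pair can be compared with the two numbers on the CHECKSUM line
for the corresponding file.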

---------------------------------------------------------------------------

Date: Mon, 24 Apr 1995 15:34:08 -0700 (PDT)
From: Jim Wright <jwright_at_phy.ucsf.edu>

Here's an overview of my experience so far

                   destination disk

              |  A   | A+presto |  B
           ---+------+----------+------
cpu         A |  ok  |    ok    |  ok
for        ---+------+----------+------
compiler    B |  ok  | corrupt  |  ok

The "A" and "A+presto" disks are the same location, first with prestoserve
disabled and then with it enabled. I've just pinpointed this recently,
but I can't be sure how long this behavior has been present. My impression
is that it just started recently.
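
The test matrix can also be written down as data and checked against a
hypothesis. The Python sketch below is not part of the original thread. It
encodes the six results reported here and tests the hypothesis "corruption
occurs exactly when the compile writes over NFS to a disk with Prestoserve
enabled". The names and the "A+presto" labelling convention are illustrative
assumptions.

```python
# (compiling cpu, destination disk) -> observed result, as reported.
RESULTS = {
    ("A", "A"): "ok",
    ("A", "A+presto"): "ok",
    ("A", "B"): "ok",
    ("B", "A"): "ok",
    ("B", "A+presto"): "corrupt",
    ("B", "B"): "ok",
}


def is_nfs_write(cpu, disk):
    """A write goes over NFS when the disk belongs to another host."""
    disk_host = disk.split("+")[0]
    return cpu != disk_host


def presto_enabled(disk):
    """By the labelling convention used here, '+presto' marks an enabled board."""
    return disk.endswith("+presto")


def failing_cases(results):
    """List the (cpu, disk) combinations that did not come out 'ok'."""
    return sorted(case for case, outcome in results.items() if outcome != "ok")


def hypothesis_matches(results):
    """True if 'NFS write AND presto enabled' predicts every observed result."""
    for (cpu, disk), outcome in results.items():
        predicted_bad = is_nfs_write(cpu, disk) and presto_enabled(disk)
        if predicted_bad != (outcome != "ok"):
            return False
    return True
```

With the results above, the hypothesis fits all six cells. That is consistent
with a problem in the NFS write path combined with Prestoserve, but it does
not by itself identify the faulty component.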

> If the cache were causing a problem, then I would expect the problem
> to show up on code that is compiled locally on the server.

I don't quite understand this. I thought the NFS accelerator had no
effect when accessing disk locally. And my tests so far reinforce that.

Thanks for your answer,


---------------------------------------------------------------------------

Date: Mon, 24 Apr 1995 16:08:46 -0700 (PDT)
From: Jim Wright <jwright_at_phy.ucsf.edu>

Alan, could I impose upon you for your opinion of this, regarding
Prestoserve and NFS write corruptions?

> Date: Mon, 24 Apr 1995 09:07:58 -0700 (PDT)
> From: Tien LH Mai <tienm_at_amath.washington.edu>
>
> what you need is:
> /usr/sys/BINARY/ufs_bmap.o (HPAQ4140D)
> CHECKSUM: 42137 110
> /usr/sys/BINARY.rt/ufs_bmap.o
> CHECKSUM: 62478 117
> -----------------------------
>
> Patch Id: OSFV20-028-1
>
> check w/your DEC support.
>
> I believe similar problem exists on v3.2. I'm still trying to verify
> w/DEC.
>
> --Tien

I should have included this in my first posting to clarify a bit further.
All machines are OSF/1 v2.0 and all filesystems locally are UFS.

                   destination disk

              |  A   | A+presto |  B
           ---+------+----------+------
cpu         A |  ok  |    ok    |  ok
for        ---+------+----------+------
compiler    B |  ok  | corrupt  |  ok

The "A" and "A+presto" disks are the same location, first with prestoserve
disabled and then with it enabled. I've just pinpointed this recently,
but I can't be sure how long this behavior has been present. My impression
is that it just started recently.

Thanks for your trouble,
Jim

---------------------------------------------------------------------------

Date: Mon, 24 Apr 95 18:03:46 -0600
From: alan_at_nabeth.cxo.dec.com


        While the ufs_bmap patch is certainly for V2.0 and
        could cause data corruption, the text of the patch
        doesn't make it appear to have anything to do with
        Prestoserve.

        The only Prestoserve patches appear to be one that replaces
        the presto(8) command to include some feature for the
        Advanced File System and a replacement pr.o to solve
        a panic; see below. So, I see two possibilities left:

        1. An undiscovered bug in V2.0 related to Prestoserve.

        2. A problem with Prestoserve NVRAM.

        You won't get #1 fixed because V2.0 is two versions out
        of date and long unsupported. The CSC will recommend an
        upgrade. #2 is a hardware problem.

/sys/BINARY/pr.o
CHECKSUM: 30850 186
/sys/BINARY.rt/pr.o
CHECKSUM: 23495 188
----------------------

Problem 1: (QAR 21070)
*********

Patch ID: OSFV20-015-2 (included in V2.1)

This is to fix a panic which appears with the following panic string:

        "vrele: bad ref count"

The signature of this panic is that the stack trace goes through
the nfs server's write gathering code as follows:

(dbx) t
> 0 boot(reason = 0, howto = 0) ["../../../../src/kernel/arch/alpha/machdep.c"
   1 panic(s = 0xfffffc00004d3d00 = "vrele: bad ref count") ["../../../../src/k
   2 vrele(vp = 0xfffffc0000283300) ["../../../../src/kernel/vfs/vfs_subr.c":10
   3 rfs_writeg(vp = 0xffffffff8917ef80, wa = 0xffffffff89388500, ns = 0xffffff
   4 rfs_write(wa = 0xffffffff8917ef80, ns = 0xffffffff8917f200, nreq = 0xfffff
   5 rfs_dispatch(req = 0xffffffff991cbaa8, xprt = 0xffffffff89388500) ["../../
   6 svc_getreq(xprt = 0xffffffff8938c800) ["../../../../src/kernel/rpc/svc.c":
   7 svc_run(xprt = 0xfffffc0000290c10) ["../../../../src/kernel/rpc/svc.c":502
   8 nfs_svc(p = 0xffffffff9885d7e8, args = (nil), retval = 0xffffffff991cbe10)
   9 nfssvc(p = 0xffffffff9885d7e8, args = 0xffffffff991cbe20, retval = 0xfffff
  10 syscall(ep = 0xffffffff991cbef8, code = 158) ["../../../../src/kernel/arch
  11 _Xsyscall() ["../../../../src/kernel/arch/alpha/locore.s":860, 0xfffffc000

Problem 2: (QAR 22865)
*********

Patch Id: OSFV20-071-2 (included in V2.1)

Logical Storage Manager(LSM) V1.0 runs on DEC OSF/1 V2.0. In LSM V1.0, when
LSM volumes have been enabled for presto and the system is brought down
abnormally (power failure, system panic etc.), the system will panic while
trying to flush dirty NVRAM buffers on a subsequent system reboot.

This correction requires a kernel rebuild.

---------------------------------------------------------------------------

Date: Mon, 24 Apr 95 17:01:31
From: "keith chiles" <kchiles_at_hccsf.com>
     
Jim,

Thanks for your response. I am not sure about Prestoserve now, but when I was
running it on my DecSystem 5900, I was seeing cache hits in the 80% range before I ever
put an nfs mount on it. It was my understanding that it was a write cache that
improved all disk performance by allowing a rapid response to all file updates
that allowed the kernel to send a steady stream to the disk. On small writes
like inodes, it really improved performance and reduced write wait states. Your
chart does, indeed, suggest that there is a problem between NFS and Prestoserve
as you suspected. I stand corrected.
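
The write-behind behaviour described above can be illustrated with a toy
model in Python. This is not part of the original thread and is not how the
Prestoserve driver is implemented. The class name, the oldest-first flush
policy, and the capacity measured in blocks are all assumptions made for
illustration.

```python
class WriteBehindCache:
    """Toy model of a non-volatile write-behind cache in front of a disk.

    ASSUMPTIONS: oldest-first flushing and a fixed block capacity. The
    real board's policy is not described in the thread.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.nvram = {}  # block number -> data; dicts keep insertion order
        self.disk = {}   # block number -> data that has reached the disk

    def write(self, block, data):
        """Accept a write. It is acknowledged once it is held in NVRAM."""
        if block not in self.nvram and len(self.nvram) >= self.capacity:
            self.flush_one()  # make room by pushing the oldest block out
        self.nvram[block] = data
        return "ack"

    def flush_one(self):
        """Move the oldest cached block to disk."""
        block = next(iter(self.nvram))
        self.disk[block] = self.nvram.pop(block)

    def flush_all(self):
        """Drain the cache, as after an orderly shutdown."""
        while self.nvram:
            self.flush_one()

    def read(self, block):
        """Reads must see cached data first, since it is newer than disk."""
        if block in self.nvram:
            return self.nvram[block]
        return self.disk.get(block)
```

In this model a caller sees "ack" immediately for small writes, while the
disk receives blocks later and in batches, which is the performance effect
described above.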

Cheers, Keith

---------------------------------------------------------------------------

From: Ian Stewart <Ian.Stewart_at_ranplc.co.uk>
Date: Tue, 25 Apr 1995 17:53:46 +0100


We had this same problem some time back; there is a patch
available from DEC. The problem also seems to go away if you
upgrade to OSF/1 3.0 or later, which is what we did in the
end.

Ian Stewart

---------------------------------------------------------------------------

From: luchini_at_siberia.ups-tlse.fr (Marco Luchini)
To: Jim Wright <jwright_at_phy.ucsf.edu>

Hi Jim,

> I have a PrestoServe NFS accelerator installed in a DEC 3000/600
> running OSF/1 2.0. I believe it is corrupting files during compilation
> of C code. The symptom is that an executable generated by an NFS client

Yes indeed it does. We had exactly the same problem. Turning off
Presto stopped the errors. Eventually we installed a number of patches
to 2.0 which solved the problem. I believe the most relevant one is:

OSFV20-028-1 which states in its README:

Data corruption was being caused by fragments of files being incorrectly
written to disk.

But we also installed OSFV20-015-1 and a few NFS related patches as well
- there were quite a lot in 2.0 and I was glad to upgrade from it. If I
were you I'd upgrade to OSF3.2 and all problems should go away.

> So, does anyone else use the turbochannel PrestoServe board? Successfully?

Actually, in the end, I don't think it's the presto's fault. The errors
happen on non-presto platforms as well according to DEC, just much more
rarely. So we would only detect them with presto turned on.

Check out the COMET search gateway with the key "osf" for a full list of
patches:

http://www.service.digital.com:8031/


Ciao, Marco

-------------------------------------------------------------------------
Marco Luchini Internet : m.luchini_at_ic.ac.uk
Laboratoire de Physique Quantique Telephone: +33 61.55.60.39
Universite' Paul Sabatier Fax: +33 61.55.60.65
31062 Toulouse, FRANCE
-------------------------------------------------------------------------
Received on Mon May 01 1995 - 17:44:24 NZST
