[FOLLOWUP] AS1000A boot problem

From: Simon Greaves <Simon.Greaves_at_usp.ac.fj>
Date: Mon, 13 Mar 2000 14:22:32 +1200

Thanks for all the suggestions. The problem seemed to be caused by
some massive corruption of the filesystem. We restored from tape, which
seemed to fix things - at least as far as booting. Compaq replaced the
memory with some different SIMMS from stock until we got the real
replacements. I ran the memory exerciser (/usr/field/memx) and the
machine crashed a few hours later. In kern.log there were loads of
messages like:

Mar 10 01:18:38 xxxx vmunix: Machine Check error corrected by processor
Mar 10 01:18:40 xxxx vmunix: Physical address of error
  ffffff000c72095f Corrected ECC Error in Memory during D-Cache fill
Mar 10 01:18:44 xxxx vmunix: Fill Syndrome = 00000000000000e9
Mar 10 01:18:44 xxxx vmunix: Single Bit error in Quadword 0 at bit<27> in a Data bit
Mar 10 01:18:49 xxxx vmunix: EI Address = ffffff000c72095f
Mar 10 01:18:52 xxxx vmunix: EI Status = fffffff0c5ffffff
Mar 10 01:18:53 xxxx vmunix: Interrupt Status Reg = 0000000100000000
Mar 10 01:18:54 xxxx vmunix: ECC Syndrome = 0000000000000000
Mar 10 01:18:54 xxxx vmunix: Memory Port 0 Status Reg = 0000000000000000
Mar 10 01:18:55 xxxx vmunix: Memory Port 1 Status Reg = 0000000000000000
Mar 10 01:18:56 xxxx vmunix: CIA Error Status = 0000000000000000
Mar 10 01:18:56 xxxx vmunix: CIA Error Reg = 0000000000000000

one every couple of minutes, and in the memx logs, lots of entries like:

Thu Mar 9 23:22:27 2000
memxr3: Data error in memory:
VIRTUAL BYTE = 4068c953 GOOD = ffffffc0 BAD = ffffffc8

Thu Mar 9 23:31:06 2000
memxr6: Data error in memory:
VIRTUAL BYTE = 40ca4953 GOOD = 15 BAD = 1d

Thu Mar 9 23:33:55 2000
memxr15: Data error in memory:
VIRTUAL BYTE = 405fc953 GOOD = 55 BAD = 5d

Though curiously the times of the messages do not correlate with each other.
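
A quick check of the memx entries above, offered as a sketch rather
than a diagnosis: XORing each GOOD/BAD pair shows which bit differs.
The snippet is Python (not something from the original setup). It
assumes 8KB pages, little-endian byte order within an aligned
quadword, and that only the low byte of GOOD/BAD is significant (the
ffffff.. values look sign-extended). The names ENTRIES, flipped_bits,
bit_in_quadword and summarise are made up for illustration.

```python
# memx entries copied from the log above: (virtual byte address, GOOD, BAD)
ENTRIES = [
    (0x4068c953, 0xffffffc0, 0xffffffc8),
    (0x40ca4953, 0x15, 0x1d),
    (0x405fc953, 0x55, 0x5d),
]

PAGE_MASK = 0x1fff  # assumption: 8KB pages, so the low 13 bits are the page offset


def flipped_bits(good, bad):
    """Bit positions (0 = least significant) that differ in the low byte."""
    diff = (good ^ bad) & 0xff
    return [bit for bit in range(8) if diff & (1 << bit)]


def bit_in_quadword(addr, bit_in_byte):
    """Bit position within the aligned 8-byte quadword holding addr.

    Assumption: little-endian layout, so byte n holds bits 8n..8n+7.
    """
    return 8 * (addr & 0x7) + bit_in_byte


def summarise(entries):
    """Return (page offset, bit in byte, bit in quadword) per flipped bit."""
    rows = []
    for addr, good, bad in entries:
        for bit in flipped_bits(good, bad):
            rows.append((addr & PAGE_MASK, bit, bit_in_quadword(addr, bit)))
    return rows


for offset, bit, qbit in summarise(ENTRIES):
    print(f"page offset 0x{offset:04x}  byte bit {bit}  quadword bit {qbit}")
```

For these three entries every pair differs only in bit 3 of the byte,
at the same page offset (0x0953). Under the little-endian assumption
that is bit 27 of the enclosing quadword, the same bit position the
kern.log message reports. That would be consistent with one stuck bit
rather than scattered corruption, though three samples are not proof.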

Hmmm... looks like a memory problem still(?). By then, Friday (5 days later)
Compaq had got the real replacement SIMMS and fitted them. The machine
booted ok so I just left it running. It has crashed every day since,
at 01:06, 02:18 and 01:05. Only thing that runs around then is
'/usr/sbin/defragcron -p' and that seems to run ok if I force it to
run now. The machine isn't used for interactive access and has very
few accounts.

Using dia I found the following error:

Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 2.
Timestamp of occurrence 13-MAR-2000 01:03:19
Host name xxxx

System type register x0000001B AlphaServer 800 or 1000A
Number of CPUs (mpnum) x00000001
CPU logging event (mperr) x00000000

Event validity 1. O/S claims event is valid
Event severity 1. Severe Priority
Entry type 302. ASCII Panic Message Type

SWI Minor class 9. ASCII Message
SWI Minor sub class 1. Panic

ASCII Message panic (cpu 0): rbf_pin_record: end offset
                                     beyond page
                                      N1 = 23148

Which looks to me like a problem with paging(?), which still sounds
like a memory problem to me. I haven't tried memx again.

There are some crash dumps too, but I haven't tried to do anything with
them yet - not something I've delved into before.

So, it's looking like another call to Compaq - great! We've got an
AS2000 here that's entering its third week of downtime because the
local Compaq guys don't know what's wrong with it and it's taking them
about a week at a time to get boards from Australia (there are flights
here every day). That's a fun problem - console says the memory board
has failed but it will boot from a CD. SIMMS and memory board have
been replaced but still no dice - but Hey, it's a VMS box :-)

To round off for now, here is my initial query and the responses I
received:

8<---- original -----

We have an AS1000A which was shut down cleanly before a mains power outage.
When it was booted, early on in the 'blue screen' initialization stuff
(ie before it had started to boot) it started scrolling error messages up
the screen pointing out a memory problem. Compaq came out and have
temporarily removed the SIMMS in the duff bank, leaving 128MB in slot0.

Now the machine starts to boot, configures the SWXCR, fsck's the root
partition, starts LSM, reports it has initialised lsm0 and lsm1 then the
screen flashes and a message like:

DUMP:

followed by a bunch of numbers appears, followed by another couple of
lines that are difficult to catch before it starts to reboot.

Compaq are going to see if they have memory in stock, if not it could take
a while to get it shipped from Australia. In the meantime I'm trying to
determine if the inability to boot with 128MB is still hardware or if it's
software (ie needs restored) - Should I expect an AS1000A with a SWXCR and
LSM to boot with 128MB RAM? Any tests I can do? I don't really want to
start restoring if it isn't going to work anyway due to memory issues.

Any comments etc welcome,

8<---- responses -----

From: Knut Hellebø <Knut.Hellebo_at_nho.hydro.com>

Can you boot single user? If so, try editing /etc/fstab and
commenting out all file systems but root/usr/var/tmp, unless it is clear
that the errors are not coming from some kind of filesystem corruption.
Then you could try mounting the other filesystems one by one until you
possibly encounter the boot error situation again. Good Luck ;-)

-> Not really, doesn't mount the file system properly etc
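
The fstab edit suggested above could be sketched like this. It is a
hypothetical helper in Python, not a tool from the original thread:
the function name comment_out_others, the KEEP set and the sample
entries are all assumptions, and the sample uses the common
"device mountpoint type options ..." field layout. It only transforms
text; writing the result back (after taking a copy of the original
file) is left to the admin. Note that it would also comment out any
swap entries, which may or may not be wanted.

```python
# Mount points to leave active, per the suggestion (root/usr/var/tmp).
KEEP = {"/", "/usr", "/var", "/tmp"}

# Hypothetical sample entries, for illustration only.
SAMPLE = (
    "/dev/rz0a / ufs rw 1 1\n"
    "/dev/rz0g /usr ufs rw 1 2\n"
    "/dev/rz1c /data ufs rw 1 2\n"
    "# local note\n"
)


def comment_out_others(fstab_text, keep=KEEP):
    """Prefix '#' to every entry whose mount point is not in keep."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_blank_or_comment = not fields or fields[0].startswith("#")
        is_kept = len(fields) > 1 and fields[1] in keep
        if is_blank_or_comment or is_kept:
            out.append(line)        # leave untouched
        else:
            out.append("#" + line)  # disable this filesystem
    return "\n".join(out) + "\n"


print(comment_out_others(SAMPLE), end="")
```

Re-enabling the filesystems one at a time is then just a matter of
removing the leading '#' from one entry per boot attempt.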

8<----

From: "Dr. Tom Blinn, 603-884-0646" <tpb_at_doctor.zk3.dec.com>

128MB should be plenty.

Do you have a "dumb terminal" (e.g., a PC with terminal emulation s/w or
a second UNIX system that is set up to do "tip" through a serial port)
that you can connect to the "COM1" serial interface on this system? If
so, switch the SRM console to "serial" (>>>set console serial), init the
firmware (so it will stick), then try to boot so you can capture all of
the messages that are coming out. Then perhaps someone will be able to
see what's really happening. Also, set the auto_action to halt so it
will not try to reboot on a panic. If it's just panic-ing over and over
again, you've got something else going wrong, and the last thing you want
is to have rogue hardware trash a disk, or worse.

-> Couldn't get a terminal at short notice, I think the last part may
-> well be what happened when the box came up after a big power fail
-> here.

8<----

From: alan_at_nabeth.cxo.dec.com

        You don't seem to say what version you're running, but
        typically the minimum memory configuration was 32 to
        64 MB. One of the V3 releases had a minimum memory
        configuration of 24 MB.

        If you have the CDROM distribution for whatever version
        you're running you can try to boot it and see if there
        is a problem with the root and other file systems. If
        the CDROM won't boot, then it is probably a hardware
        problem that only looked like a memory problem initially. If
        it will boot you can check and mount the root and see
        if anything is broken.

        If you haven't removed it, you might also try to boot genvmunix.

-> OS is 4.0D

8<----

From: John Losey <JOHLOS_at_HBSI.COM>

We're running an AS1000A with 128MB of RAM with a 60GB SWXCR array and some
LSM here under 4.0d. We haven't had any problems. I'd probably agree with
your service tech and swap out the RAM first, then probably the mother
board.

8<---- end of responses ---------


Thanks again to everyone for taking the time to reply, I'll post to
the list again once we get this problem resolved.

Simon
-- 
Simon Greaves				voice: (+679) 212114
Computer Centre				fax:   (+679) 304089
The University of the South Pacific	email: Simon.Greaves_at_usp.ac.fj
Suva, Fiji
Received on Mon Mar 13 2000 - 02:23:56 NZDT
