Crash dump in genvmunix after system update...

From: Guy Dallaire <dallaire_at_total.net>
Date: Sun, 8 Jun 1997 22:49:18 -0400

Hello,

This weekend, we updated both of our AlphaServer 2100 servers. Here is what we
did:

a) Install a KZPAA in each server

b) Install a BA356 (split in two, with an 8-bit I/O module) in the rack and
put a TZ87 and a TZ88 in it. (Those drives used to be behind the HSZ40s, but
we needed some disk space and decided to connect the tapes directly to the
servers, hence the new BA356 and the KZPAAs.)

c) Upgrade the HSOF in our HSZ40 from V3.0 patch level 1 to V3.1 patch
level zero.

d) Upgrade the disk firmware of all our RZ29Bs and RZ28Ms (via the KZPAA and
with the diskupd program from the ARC menu)

e) Upgrade the system firmware with firmware CD 3.8 (was 3.4)

f) Run ECU 2.0 (was 1.9)

g) Install one additional internal drive in each server.

We are running Digital UNIX 3.2D-1. We have DECsafe installed, but we are
not yet running any service under DECsafe.

1st problem: Before the upgrade, we got an annoying message at boot
telling us that the system was "unable to disable nonexistent interrupt
-1". This message was related to the ATI Mach 64 ISA board in the server.
So we ran ECU and discovered that our system was configured with a
"Standard VGA card" in EISA slot 1 instead of an ATI Mach 64. We said to
ourselves, let's fix it, and changed the setting to the ATI card. When
we rebooted, the message was gone but the system kept crashing, so we
reverted to the standard VGA setting, and that seemed to fix it. This is a
minor problem: other than the annoying message, the system was running
just fine.

2nd problem: We turned off the power to the whole rack (after shutting
everything down) and restarted everything, just to make sure the upgrade
had gone well. The two blower fault LEDs on one of the BA356s behind the
HSZ40 lit up, yet the system kept booting and there was no error on the
HSZ40 console, nor at the OS level... Strange. We could not verify that the
blowers were actually faulty. We shut everything down again and replugged
the SCSI cable, because we thought it might have been plugged in
incorrectly when we played with that BA during the disk firmware upgrade.
When we restarted, the lights were off (it was probably a cable
problem...).

3rd problem: We had to rebuild the kernel in order to add the new KZPAAs,
disks, etc. to the configuration and to add LSM support. So we rebooted
with genvmunix. One of the systems booted fine; the other would not boot
at all. When it reached the point of mounting the file systems on the
HSZ40 (after / and /usr), it crashed. I have a crash dump that I will
analyze on Monday. What is strange is that with the current kernel,
everything is fine.

Regarding that 3rd problem, does anyone have a clue what could cause this?
Here is my list of potential culprits:

a) Could the genvmunix that came on the CD have the AdvFS bug that is
fixed by the AdvFS consolidated patch? (We did not regenerate our
genvmunix kernel after installing the AdvFS consolidated patch.) If so, why
does it happen on only one server?

b) uerf logged a disk error on the system disk a couple of weeks ago. Could
there be a corrupted block in genvmunix that makes it crash? (I doubt it,
because that file was not being accessed when uerf logged the error, unless
the system can detect disk problems in blocks that no file is currently
accessing.)

I'm really puzzled by this one. We really don't know what to do when
genvmunix does not even boot. Isn't it supposed to be the last-resort
kernel you use when nothing else works?

One last question: As some of you may know, the system does not "see"
devices behind an HSZ40 unless they are at LUN 0. To work around this, you
have to add lines to the system configuration file listing the HSZ40
devices. For example, if you have an rza9 and an rzc9 and you run
iostat rz9, you won't see all the disks unless you add them to the config.
I was wondering whether it is dangerous to put disks that do not exist in
the config file. For example, suppose I have /dev/*rza9* and /dev/*rzc9*:
should I put only rza9 and rzc9 in sys/conf/SYSTEMNAME, or can I list every
rz*9 variant?
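To make "every rz*9 variant" concrete, here is a small hypothetical Python sketch that enumerates and decodes rz disk names. It assumes the usual Digital UNIX naming convention: no letter after "rz" means LUN 0, the letters a through g mean LUNs 1 through 7, and the number is 8 x bus + target ID (so unit 9 is bus 1, target 1). The helper names are made up for illustration; this does not generate config file syntax.

```python
# Hypothetical helpers illustrating the (assumed) Digital UNIX rz naming scheme:
#   rz<LUN letter><unit>, where no letter = LUN 0, 'a'..'g' = LUN 1..7,
#   and unit = 8 * bus + target ID.
from typing import Tuple

LUN_LETTERS = ["", "a", "b", "c", "d", "e", "f", "g"]


def unit_number(bus: int, target: int) -> int:
    """Return the rz unit number for a SCSI bus and target ID (assumed 8 * bus + target)."""
    if not 0 <= target <= 7:
        raise ValueError("narrow SCSI target IDs range from 0 to 7")
    return 8 * bus + target


def rz_name(bus: int, target: int, lun: int) -> str:
    """Build a base device name, e.g. bus 1, target 1, LUN 1 -> 'rza9'."""
    if not 0 <= lun <= 7:
        raise ValueError("LUN must be between 0 and 7")
    return f"rz{LUN_LETTERS[lun]}{unit_number(bus, target)}"


def decode_rz(name: str) -> Tuple[int, int, int]:
    """Split a base name such as 'rzc9' into (bus, target, lun)."""
    if not name.startswith("rz"):
        raise ValueError(f"not an rz device name: {name!r}")
    rest = name[2:]
    lun = 0
    if rest and rest[0].isalpha():
        lun = LUN_LETTERS.index(rest[0])  # raises ValueError for letters past 'g'
        rest = rest[1:]
    unit = int(rest)
    return unit // 8, unit % 8, lun


if __name__ == "__main__":
    # All LUN variants of unit 9 (bus 1, target 1): the "rz*9 variety".
    print([rz_name(1, 1, lun) for lun in range(8)])
    print(decode_rz("rza9"), decode_rz("rzc9"))
```

Under these assumptions, "all the rz*9 variety" means up to eight names (rz9 and rza9 through rzg9), of which only rza9 and rzc9 exist in the example above.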

                                                Thanks!


Guy E Dallaire, Unix SysAdmin, DBA
Le Directeur General des Elections du Quebec, QC, Canada
Phone: (418) 646-8618
Fax: (418) 644-9624

    ** "Honni soit qui disco pense"
Received on Mon Jun 09 1997 - 04:59:33 NZST
