Thanks to Tom Blinn, Pat O'Brien and Jason Ohrendorf. Tom's
comments immediately pointed me to the correct explanation.
The machine I was adding had seen the unit before it was
deleted and re-added to an HSG80 set (while that machine was
down). So the console wwid setting was correct, but the logical
hose/slot/bus number, which translates to the major/minor disk
numbers recorded in the clu_base section as
cluster_seqdisk_major and cluster_seqdisk_minor, was a
different one at boot time than the one assigned by the
server that added the member.
Since booting the machine to find out the current setting
was impossible (the partition in question was the
boot_partition), I decided to delete the unit from the
hardware databases and the kernel records with hwmgr and
dsfmgr and run 'hwmgr -scan comp -cat scsi' again. That
procedure gave me a new dskxx disk, which could then be
propagated to the new member as a boot_partition. After
the cluster was up, I could move the disk back to the
desired number with 'dsfmgr -m dskX dskY'; only the
/etc/fdmns links, the sysconfigtab, and the
/cluster/admin/.memberx.cfg entries had to be adapted.
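For future reference, the recovery steps above might be scripted roughly
as follows. This is only a sketch in dry-run form (it echoes the commands
instead of executing them); the hardware ID and disk names are hypothetical
placeholders, and the hwmgr/dsfmgr invocations should be checked against
the Tru64 5.1A reference pages before use on a real cluster.

```shell
#!/bin/sh
# Dry-run sketch of the recovery procedure described above.
# HWID and the dsk names are hypothetical placeholders.
HWID=64          # hardware ID of the stale unit, e.g. found via 'hwmgr -view devices'
DSK_NEW=dsk21    # name the rescan is expected to assign to the unit
DSK_WANTED=dsk9  # the member boot disk name we actually want in the end

run() {
    # Echo instead of execute, so the sketch can be reviewed safely.
    echo "$@"
}

# 1. Remove the stale unit from the hardware database and kernel records.
run hwmgr -delete component -id "$HWID"

# 2. Rescan the SCSI buses so the unit comes back under a fresh name.
run hwmgr -scan comp -cat scsi

# 3. Once the new member is up, move the disk to the desired name.
run dsfmgr -m "$DSK_NEW" "$DSK_WANTED"
```

Afterwards the /etc/fdmns links, the sysconfigtab and the
/cluster/admin/.memberx.cfg entries still have to be adapted by
hand, as described above.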
Here are the comments, for future reference:
------
Jason Ohrendorf wrote:
Check the storage settings. Are the devices used for
both the member boot disk and the cluster_{root,usr,var}
all in the same zone? Or do they all have the same SSP
settings to allow all eight nodes (and especially the one
you're adding) access to the same devices?
------
Pat O'Brien wrote:
I think you need to look at wwidmgr at the console level.
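Pat is referring to the wwidmgr utility at the SRM console. As a rough
illustration (the UDID value 148 is taken from the dga148 device in the
original question; whether it matches the unit's actual UDID is an
assumption, and on some AlphaServer models the console must first be put
into diagnostic mode), a session might look like:

>>> set mode diag                # required on some platforms before wwidmgr runs
>>> wwidmgr -show wwid           # list the world-wide IDs the console can see
>>> wwidmgr -quickset -udid 148  # make the unit with UDID 148 reachable as dga148.*
>>> init                         # reinitialize so the new wwid/device settings take effect
>>> show device                  # verify that the dga148 device now appears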
------
Dr. Tom Blinn wrote:
I don't know exactly what's failing, but I have some guesses.
The console firmware passes a relatively messy string into the
kernel that reflects the hardware topology. The string has a
number of fields, and if you can match the fields correctly to
the existing busses and devices in the system, then you can
find the boot device. A typical string for a simple system
might look like this:
'SCSI 0 3001 0 1 100 0 0'
which on this particular system (running V4.0F) happens to be
the console's device DKB100 which is rz9 in this system. Of
course, in V5.x, the device name doesn't map in any simple way
to the device address based on bus number, adapter on the bus,
and device on the adapter.
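As an illustration of the matching Tom describes, here is a small Python
sketch that splits such a console string into fields and looks it up in a
table of devices the kernel knows about. The field interpretation and the
table contents are purely hypothetical; the real Tru64 parser handles more
cases and its field semantics differ between V4.x and V5.x.

```python
# Hypothetical sketch of translating a console boot-device string by
# matching it against a table of probed devices. The address-field
# interpretation (hex integers) is an assumption for illustration only.

def parse_boot_string(s):
    """Split the console string into a protocol and an address tuple."""
    fields = s.split()
    return fields[0], tuple(int(f, 16) for f in fields[1:])

# Hypothetical table: (protocol, address tuple) -> device name on this system.
KNOWN_DEVICES = {
    ("SCSI", (0x0, 0x3001, 0x0, 0x1, 0x100, 0x0, 0x0)): "rz9",
}

def translate(boot_string):
    """Return the device name, or None if translation fails
    (the situation that leads to the init_rootdev panic)."""
    proto, addr = parse_boot_string(boot_string)
    return KNOWN_DEVICES.get((proto, addr))

print(translate("SCSI 0 3001 0 1 100 0 0"))  # matches the table -> rz9
print(translate("SCSI 0 3001 0 2 100 0 0"))  # no match -> None
```

When the console-supplied string and the table disagree even in one
field, the lookup fails, which mirrors the failed boot device
translation seen in the panic.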
The possibilities for what exactly is wrong are endless. I'm
looking at this stuff because it got broken in the V4.0x patch
kits by a well-intentioned but totally bogus code change that
broke network booted kernels. I haven't had to look at the
code in a V5.x system recently. But it's convoluted and not
obvious.
If the console firmware's understanding about what information
applies to the HSG80 RAID set that's the boot device for the
new member doesn't exactly match what the member that set up
the boot device thought the device should be called, you'd see
something like this (because the data provided to the kernel
by the console when you booted couldn't be matched up to the
data that is being provided in the hardware persistency data
that's accessed from the existing members via the cluster
file system early in the boot process). That's what I'd be
looking at. Unfortunately, when the system panics, it does
NOT usually give you enough information to figure out what
was in the device string it was trying to parse, which is a
real nuisance, since more often than not you don't get a
crash dump in these panics (it's too early to have figured
out where the swap is, there's no place to write a dump, and
you can't boot the system from a different disk, so you
can't get the in-core dump saved, etc.).
========================
Original question:
>
> After creating a new member 8 (Tru64 5.1A, PK1, member boot on
> HSG80 ACS 8.6F-PK5), booting with 'boot -file genvmunix dga148'
> fails with
> 'panic: init_rootdev: boot device translation failed'
> I tried deleting and adding again, even from a different member,
> but no success. The member has been a member of a 5.1 cluster
> without any flaws. 5.1A was set up as a new installation; the
> old member boots have been disabled by zeroing the disklabels.
> I noticed that after creation of the new member (which showed
> no error message at all), the /cluster/members/member8/boot_partition
> directory was entirely empty. I then checked the root8_domain#root
> filesets etc/sysconfigtab, and the major and minor numbers correctly
> point to the cnx partition of the member boot disk. The cnx partition
> ends exactly at the disk's end, and no overlapping partitions
> exist on the disk (which is the 8th partition on an HSG80 mirrorset).
> I tried it both with the persistent reservation set and with the flag cleared.
> All member boot disks have exactly the same size, and the a partitions
> are all 1G, swap about 8G, cnx exactly 2048 blocks = 1MB.
> clu_bdmgr -d dsk9 (the boot disk) reports the correct pointer to
> the cluster file disk.
> I have no idea what's going on here, as the addition of 6 members
> to the first member before succeeded without any error.
--
Dr. Udo Grabowski email: udo.grabowski_at_imk.fzk.de
Institut f. Meteorologie und Klimaforschung II, Forschungszentrum Karlsruhe
Postfach 3640, D-76021 Karlsruhe, Germany Tel: (+49) 7247 82-6026
http://www.fzk.de/imk/imk2/ame/grabowski/ Fax: " -6141
Received on Tue Feb 26 2002 - 08:29:22 NZDT