TruCluster 5.1 LSM mirrors do not survive loss of array

From: M Luchini <luchini_at_talk21.com>
Date: Thu, 15 Mar 2001 10:00:30 +0000

Hi,

I have a cluster of ES40s under 5.1 (PK2). They are connected via dual KGPSAs to Fibre Channel switches. The switches are connected to two physical arrays (dual HSG80 controllers) full of disks. I am using LSM to mirror all the non-cluster-root disks between one array and the other.
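
For reference, each mirrored volume was set up roughly like this (the names and sizes below are just placeholders, and the real volumes are striped as well, but it shows the layout):

    # create a volume on a disk in array A, then add a mirror plex on array B
    volassist -g rootdg make vol_data 50g dsk1
    volassist -g rootdg mirror vol_data dsk11

    # put an AdvFS domain and fileset on the LSM volume and mount it
    mkfdmn /dev/vol/rootdg/vol_data data_dmn
    mkfset data_dmn data_fs
    mount data_dmn#data_fs /data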

This configuration should survive the failure of the array where the cluster root is not located (LSM on the cluster root is not supported).

However, when I turn off the power to that array, all the mirrored volumes become disabled and the AdvFS domains panic. For the mirrored volumes, LSM shows the volume as disabled, the plex that should be dead as disabled, and the plex that should be live as active (correct). I can recover by unmounting all the domains, restarting the volumes and remounting, so that proves I have my plexes in the right place.
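
For illustration, the recovery sequence looks roughly like this (the domain and mount point names are placeholders):

    # confirm the surviving plex is ACTIVE and the other one is detached
    volprint -g rootdg -ht stripe1

    # unmount the filesets in the affected domain, restart the volume, remount
    umount /data
    volume -g rootdg start stripe1      # or: volrecover -g rootdg -sb stripe1
    mount data_dmn#data_fs /data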

If I check the hardware state with hwmgr, I note that some of the disks that should be dead are still shown, and the adapter that should be dead still has disks attached to it. On reboot, however, the correct hwmgr output appears (i.e. only the live disks are shown).
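
For what it's worth, I am checking the hardware state with something like:

    # list every device the kernel still believes in
    hwmgr -view devices

    # show the SCSI database, including the paths for each device
    hwmgr -show scsi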

The message log makes this clearer (the first message is from me taking the other member down so that I have a clean situation, i.e. one member owning everything):

Mar 14 17:07:57 nors07 vmunix: clua: reconfiguring for member 2 down
Mar 14 17:07:57 nors07 vmunix: CLSM Rebuild: initiated
Mar 14 17:07:57 nors07 vmunix: CLSM Rebuild: completed
Mar 14 17:07:57 nors07 vmunix: CLSM Rebuild: done.
Mar 14 17:10:32 nors07 vmunix: lsm:volio: error on Plex dsk1-02p while writing of volume stripe1 offset 42643616 length 16
Mar 14 17:10:32 nors07 vmunix: lsm:volio: Plex dsk1-02p detached from volume stripe1
Mar 14 17:10:32 nors07 vmunix: lsm:volio: dsk1-02 Subdisk failed in plex dsk1-02p in vol stripe1
Mar 14 17:10:48 nors07 vmunix: lsm:volio: Kernel log update failed: Volume stripe1 detached
Mar 14 17:11:18 nors07 vmunix: AdvFS I/O error:
Mar 14 17:11:18 nors07 vmunix: Volume: /dev/vol/rootdg/stripe1
Mar 14 17:11:18 nors07 vmunix: Tag: 0xfffffff7.0000
Mar 14 17:11:18 nors07 vmunix: Page: 124


So I guess the question is: why does the kernel log update fail? The volume has another plex, so it should survive just fine.

I would like to know whether anybody has actually done the above. I'm sure it will work fine if I just have a single disk failure, but the point of sticking fibre between two buildings is to guard against the possibility of the whole building going up in smoke.

My feeling is that this is a problem with the device-recognition part of the kernel rather than with LSM: hwmgr is still seeing the disks, so the kernel is telling LSM that they are OK. When LSM then tries to use them, it gets really upset. It shouldn't be failing the volume as a whole, though.
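
To make the comparison concrete, the LSM side I am looking at is roughly this (rootdg and stripe1 as in the log above):

    # LSM's view of the disks and of the disk group
    voldisk list
    voldg list rootdg

    # state of the volume, its plexes and its subdisks
    volprint -g rootdg -ht stripe1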


Thanks,

Marco
Received on Thu Mar 15 2001 - 10:02:04 NZDT
