SUMMARY(partial): V4 cluster crash when idle node rebooted.

From: Blom, Wayne <Wayne.Blom_at_au.faulding.com>
Date: Wed, 29 Nov 2000 14:17:10 +1030

Thanks to Hoai TRAN, Dr Tom Blinn, Stephen L Tyce, John Seel, Mohd Nayle,
Jim Lola;

Also Compaq support. (Original question at the end)

So far the general consensus is that it shouldn't be happening but that
under certain circumstances it does.

Most suggest shutting down to single-user mode from the console, using
either shutdown now or shutdown +1, and then halting the system from single
user. This apparently stops ASE from panicking and resetting or rebooting.

I have since tested the same set of circumstances on another cluster and got
the same result. Testing other variations gave the following.

1. User logged on and reading a file on the ASE service. User logged in on
the cluster IP address.
2. Idle machine rebooted using a variety of commands:
    a: shutdown -h now
    b: shutdown -r now
    c: shutdown now
    d: shutdown +1, halt
    e: init 0
    f: halt
 
In each of the above scenarios the active users were kicked off by an ASE
reset. This is not what we had been led to expect. Getting kicked off by a
failover, yes, but not when the idle machine goes down. Only options a and b
caused the reboot; the others just caused the reset.
 
PS. We didn't try init s, halt.

Compaq assure us this is not the way it should be. We have given them as
much information as possible and await their response.

It is difficult to test this situation as the machines are active 18 hours
out of every day. As soon as I know more or can test the resolution I will
deliver the final summary.

Wayne
============================================================================
ORIGINAL QUESTION

> Gidday fellow 64 errs,
>
> We run a pair of DS10s in a 4.0f cluster running 1.6 ASE.
> Today we needed to shutdown one of the nodes (the idle node) to replace
> the tape unit.
>
> The "shutdown -h now" command was issued on the idle node. The active node
> immediately rebooted. Below is the daemon.log from the active node. Has
> anyone seen this before? Any help would be greatly appreciated.
>
> daemon.log:
> Nov 10 11:47:56 u-whmade2 ASE: u-whmade1-alt Director Warning: Director
> exiting...
> Nov 10 11:47:57 u-whmade2 ASE: u-whmade1-alt Agent Warning: aseagent
> exiting on request...
> Nov 10 11:48:02 u-whmade2 ASE: local HSM Warning: Can't ping u-whmade1-alt
> over the SCSI bus
> Nov 10 11:48:02 u-whmade2 ASE: local HSM ***ALERT: network ping to host
> u-whmade1-alt is working but SCSI ping is not
> Nov 10 11:48:15 u-whmade2 ASE: local HSM Warning: Network interface tu1
> 192.168.4.41 DOWN
> Nov 10 11:48:15 u-whmade2 ASE: local HSM ***ALERT:
> HSM_NI_STATUS:192.168.4.41:DOWN
> Nov 10 11:48:16 u-whmade2 ASE: local Simulator Notice: snd: exiting...
> Nov 10 11:48:17 u-whmade2 ASE: local HSM Warning: Can't ping u-whmade1-alt
> over the network
> Nov 10 11:48:17 u-whmade2 ASE: local HSM ***ALERT:
> HSM_PATH_STATUS:192.168.4.42:DOWN
> Nov 10 11:48:17 u-whmade2 ASE: local HSM Warning: member u-whmade1-alt is
> DOWN
> Nov 10 11:48:24 u-whmade2 ASE: local HSM Notice: Network interface tu1
> 192.168.4.41 UP
> Nov 10 11:48:25 u-whmade2 ASE: local HSM ***ALERT:
> HSM_NI_STATUS:192.168.4.41:UP
> Nov 10 11:48:25 u-whmade2 ASE: local Simulator Notice: snd: exiting...
> Nov 10 11:48:32 u-whmade2 ASE: u-whmade2-alt Agent Notice: agent on
> u-whmade1-alt.faulding.com.au should start director, but isn't in RUN
> state
> Nov 10 11:48:32 u-whmade2 ASE: u-whmade2-alt Agent Notice: starting a new
> director
> Nov 10 11:48:54 u-whmade2 ASE: local Director Error: timeout connecting to
> Agent on u-whmade2-alt.faulding.com.au
> Nov 10 11:48:54 u-whmade2 ASE: local Director Error: connect to Agent on
> u-whmade2-alt.faulding.com.au failed
> Nov 10 11:48:54 u-whmade2 ASE: local Director Error: can't open channel to
> my agent
> Nov 10 11:48:54 u-whmade2 ASE: local Director Error: can't enable director
> Nov 10 11:48:54 u-whmade2 ASE: local Director Error: Initialization failed
> Nov 10 11:48:54 u-whmade2 ASE: local Director Warning: Director exiting...
> Nov 10 11:49:12 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/ase_mount_action: /whm/sb: Device busy
> Nov 10 11:49:12 u-whmade2 last message repeated 9 times
> Nov 10 11:49:12 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/ase_mount_action: Unable to umount /whm/sb
> Nov 10 11:49:43 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/ase_mount_action: /whm: Device busy
> Nov 10 11:49:43 u-whmade2 last message repeated 9 times
> Nov 10 11:49:43 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/ase_mount_action: Unable to umount /whm
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg: Disk group whm: Some volumes in the
> disk group are in use
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg deport of disk group whm failed
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: Disk group whm busy, attemping another deport
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg: Disk group whm: Some volumes in the
> disk group are in use
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg deport of disk group whm failed
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: Disk group whm busy, attemping another deport
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg: Disk group whm: Some volumes in the
> disk group are in use
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg deport of disk group whm failed
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: Disk group whm busy, attemping another deport
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg: Disk group whm: Some volumes in the
> disk group are in use
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg deport of disk group whm failed
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: Disk group whm busy, attemping another deport
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg: Disk group whm: Some volumes in the
> disk group are in use
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg deport of disk group whm failed
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: Disk group whm busy, attemping another deport
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg: Disk group whm: Some volumes in the
> disk group are in use
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error:
> /var/ase/sbin/lsm_dg_action: voldg deport of disk group whm failed
> Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error: cut off from all
> monitored networks and can't stop services. reboot!
>
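
For anyone who wants to pull the sequence of alerts and errors out of a
daemon.log like the one above, here is a minimal Python sketch. It assumes
each entry sits on one unwrapped line in the layout seen in the excerpt
(timestamp, host, "ASE:", source, component, severity, message). The
function names parse_ase_line, summarize and urgent are my own and are not
part of ASE; the sample lines are taken from the log above with the mail
wrapping removed.

```python
import re
from collections import Counter

# Assumed layout of one unwrapped daemon.log entry (inferred from the excerpt):
#   <Mon> <day> <HH:MM:SS> <host> ASE: <source> <component> <severity>: <message>
ASE_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+ASE:\s+"
    r"(?P<source>\S+)\s+(?P<component>\S+)\s+"
    r"(?P<severity>\*{3}ALERT|Warning|Error|Notice):\s*(?P<message>.*)$"
)


def parse_ase_line(line):
    """Return the fields of one ASE entry as a dict, or None if it does not match."""
    match = ASE_LINE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count ASE entries per severity, skipping lines that do not parse."""
    entries = [e for e in map(parse_ase_line, lines) if e]
    return Counter(e["severity"] for e in entries)


def urgent(lines):
    """Yield (timestamp, component, message) for ***ALERT and Error entries."""
    for entry in map(parse_ase_line, lines):
        if entry and entry["severity"] in ("***ALERT", "Error"):
            yield entry["stamp"], entry["component"], entry["message"]


# Sample entries copied from the log above, joined back onto single lines.
SAMPLE = [
    "Nov 10 11:47:56 u-whmade2 ASE: u-whmade1-alt Director Warning: Director exiting...",
    "Nov 10 11:48:02 u-whmade2 ASE: local HSM ***ALERT: network ping to host "
    "u-whmade1-alt is working but SCSI ping is not",
    "Nov 10 11:49:12 u-whmade2 last message repeated 9 times",
    "Nov 10 11:49:49 u-whmade2 ASE: u-whmade2-alt Agent Error: cut off from all "
    "monitored networks and can't stop services. reboot!",
]

if __name__ == "__main__":
    print(dict(summarize(SAMPLE)))
    for stamp, component, message in urgent(SAMPLE):
        print(stamp, component, message)
```

Lines that are not ASE entries, such as "last message repeated 9 times", are
skipped rather than guessed at. To run it against a real file, pass
open("daemon.log") to summarize or urgent in place of SAMPLE.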

Thanks
Wayne Blom
System Specialist
Technical Development Healthcare
F H Faulding & Co Limited
Ground Floor
1 Station St
Hindmarsh SA 5007
Ph: +61 8 8241 8334
FAX: +61 8 8241 8357
Mobile: +61 0419 808 496
Email: wayne.blom_at_au.faulding.com
Received on Wed Nov 29 2000 - 03:49:29 NZDT
