HSZ70 controller hangs from Stuart Hartley on 1999-09-30 (tru64-unix-managers)

From: Stuart Hartley <stuart.hartley_at_jewson.co.uk>
Date: Thu, 30 Sep 1999 11:33:33 +0100

Hello everyone,

This is my first posting to this group - so be kind to me.

I am running two 4100s (DU 4.0D pk3) and TCR 1.5. I have four 10000 cabs and
one 7000 cab, each run by a pair of HSZ70s in a dual-redundant
configuration (firmware V70), running off KZPBAs. Most of the disks in
these cabinets are part of a disk_service, and are all RAID 0+1 sets.

In an attempt to cut down the time "wasted" (not my phrase) doing backups, we
have introduced a new way of backing up:
1) mid-afternoon we run a script which introduces an extra disk into each of
the mirrors on each controller, then monitors the job until the disk has
sync'd in successfully
2) at a clean cut-off point when all users are off, we run another script
which removes this "extra" disk from each of the mirrors, creates a new unit
from this disk, creates an AdvFS filesystem from this new unit, and mounts
it on the second cluster member
3) we then start our overnight processing on the first member, and run a
backup on the second member
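
For illustration only, stage 1 above (add an extra disk to every mirror, then
monitor until each one has sync'd in) could be sketched in Python roughly as
follows. This is not the script actually in use. The names sync_extra_disks,
add_member and is_synced are hypothetical, and the two callables stand in for
whatever controller commands really add a mirror member and report its copy
state; no HSZ70 CLI syntax is assumed here.

```python
import time
from typing import Callable, List, Tuple

# (mirror name, extra disk name); both are placeholder names for illustration.
Pair = Tuple[str, str]


def sync_extra_disks(
    pairs: List[Pair],
    add_member: Callable[[str, str], None],
    is_synced: Callable[[str, str], bool],
    poll_seconds: float = 60.0,
    max_polls: int = 240,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Pair]:
    """Add one extra disk to each mirror, then poll until every copy completes.

    Returns the pairs still not synced when polling gave up; an empty list
    means every extra disk synced in.
    """
    # Stage 1a: introduce the extra disk into every mirror.
    for mirror, disk in pairs:
        add_member(mirror, disk)

    # Stage 1b: monitor until all extra disks report as synced.
    pending = list(pairs)
    for attempt in range(max_polls):
        pending = [(m, d) for (m, d) in pending if not is_synced(m, d)]
        if not pending:
            return []
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return pending
```

Bounding the number of polls means a mirror that never finishes copying is
reported back to the caller instead of leaving the job waiting forever, so
stage 2 can refuse to split a mirror that is still in the returned list.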

All fine in theory, and it was working perfectly. However, for the past 3
weekends the first member has crashed whilst stage 1 was running. The
sequence of events is :
1) member A reports that it cannot unreserve some of the devices on the
HSZ's. The exact device number varies, and they are not always on the same
controller.
2) member A then decides that it cannot see these devices, and decides that
it must relocate the disk_service to member B in order to maintain
availability.
3) member A then tries to stop the disk_service, but cannot, as it cannot
unreserve the devices or unmount some of the filesystems.
4) member A then reboots itself so that member B can pick up the service (I
have no problem with this bit, it is what it is meant to do)
5) member B then tries to start the service but cannot do so because it
cannot reserve the appropriate devices. The service vanishes for all intents
and purposes
6) meanwhile.....back at the ranch member A is attempting to reboot...when
it starts to probe the controllers all goes well until it reaches one of
the controllers which has a "dodgy" device on it. The boot sequence then
loops with "ss_perform_timeouts" until it reaches the max timeouts, then it
reinitialises the card "isp_reinit : Adapter/Card reinitialization" and
continues in this sequence until I get fed up and turn it off

If I try to connect to any of the HSZ's that have reported a problem then
EITHER
(a) I cannot get to the CLI as the controller is hung
OR
(b) the controller looks fine, but if I "restart this" or "restart other"
then it hangs

The only way to get the cluster back into a workable state is to halt/power
down both of the Alphas, reset all of the controllers, then bring both
machines up again.

I cannot find any errors in the system logs, other than the ones I would
expect from the above sequence of events (symptoms, not the cause).

The scripts work by using "hszterm -f <device>" and sending it a string....I
think that this is causing the problem, but I cannot see how. Our local
DEC....cough.....Compaq reps have reviewed the scripts and cannot see
anything obviously fatal in them.
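
One way to narrow down which command (if any) is upsetting a controller would
be to wrap every CLI call in a timeout and a timestamped log line. The sketch
below is an assumption-laden illustration, not the scripts described above:
run_cli is a hypothetical helper, and it simply runs whatever argument list
it is given.

```python
import subprocess
import sys
import time
from typing import Dict, List


def run_cli(argv: List[str], timeout_s: float = 30.0) -> Dict[str, object]:
    """Run one command with a hard timeout and log what happened."""
    started = time.time()
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
        result = {"rc": proc.returncode, "out": proc.stdout, "timed_out": False}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        result = {"rc": None, "out": "", "timed_out": True}
    result["elapsed_s"] = round(time.time() - started, 2)
    # Timestamped record on stderr, so the last command sent before a hang
    # can be identified afterwards.
    print(time.strftime("%Y-%m-%d %H:%M:%S"), " ".join(argv), result,
          file=sys.stderr)
    return result
```

A call might look like run_cli(["hszterm", "-f", device, command]), assuming
the command string is passed as a final argument; only "hszterm -f <device>"
is known from the scripts, so that form is a guess. If a call comes back with
timed_out set, the script can stop sending further commands to that controller
rather than carrying on to the next mirror.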

The other weird thing about all of this is that it only falls over at the
weekend, and not during the week. We are doing nothing differently at these
times, no extra cron jobs etc etc.

Has anyone else had a similar problem? Any ideas?

TIA

Stuart Hartley
Jewson IT
stuart.hartley_at_jewson.co.uk
Received on Thu Sep 30 1999 - 10:34:14 NZST
