SUMMARY (Partial): KZPSC Problems from Tim W. Janes on 1996-09-16 (tru64-unix-managers)

From: Tim W. Janes <janes_at_signal.dra.hmg.gb>
Date: Mon, 16 Sep 1996 11:25:59 +0100 (BST)

Very Many Thanks to the following for replying.

Edward C. Bailey <bailey_at_niehs.nih.gov>
Dave Golden <golden_at_falcon.invincible.com>
John Hascall <john_at_iastate.edu>

I quote John Hascall's response as I feel it should have a wide
circulation. It was not our problem as we are using two Wide SCSI
external boxes - the internal rack is empty.

   I don't know if this is your problem or not, but we
   had ALL KINDS of terrible flakeyness with our AS1000's
   and AS1000a's with the KZPSA when it had some channels
   connected to the internal storageworks bus in the
   AS1000(a) and some channels connected to external
   storageworks cabinets. We finally found a DEC
   storageworks expert who said that was a `widely known'
   problem -- won't work (problem is it wasn't widely
   known to anyone else apparently). Going to all
   external boxes solved our problems.

Apologies for a major inaccuracy in my post. Wherever I typed RAID 0 I
really meant RAID 1 - it was late when I posted.

I recovered the situation using the standalone utility. We now have
the following setup.

We have an AlphaServer 1000 acting as a heavy NFS server (over FDDI)
with V3.2C and a KZPSA 3-channel RAID controller (all rev levels are
OK), configured with the following RAID groups:
1) RAID 5 - 7 x RZ29VW (Channel 0) Heavy load
2) RAID 5 - 3 x RZ29VW (Channel 1) Heavy load
3) RAID 1 - 2 x RZ28VW (Channel 1) (system disk) Very light load

If I fail one disk of the RAID 1 set I am unable to rebuild it with our
system running. The rebuild either does not start or fails within a
minute. In either case an error is logged as below and the disk is
left in WOL state.

I can rebuild the disks OK with the standalone utility, or with the
system up if I stop all NFS activity, even while hammering the disks
locally - but not if there is any NFS activity.

I can also reliably crash the system by:
1) Marking a disk as failed.
2) Removing the disk from the rack.
3) Firing up the Online Utility - or, if it's running, hitting refresh.

DEC don't seem to have any useful suggestions, except that the load
must be too high to cope with, or that there is possibly a hardware
problem with the controller.

So the problem continues.

Tim.

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 198. ASTRO CONTROLLER
SEQUENCE NUMBER 13.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Thu Sep 12 22:52:13 1996
OCCURRED ON SYSTEM joyce
SYSTEM ID x00060011
SYSTYPE x00000000

----- UNIT INFORMATION -----

CLASS x0000 DISK
SUBSYSTEM x0000 DISK
BUS # x0000

----- CAM STRING -----

ROUTINE NAME xcr_cmd_timeout

----- CAM STRING -----

                                        Controller has stopped responding

----- CAM STRING -----

ERROR TYPE Hard Error Detected

----- CAM STRING -----

                                        Controller Softc at time of error

----- ENT_XCR_SOFTC -----

*SC_BUS_NAME xFFFFFC0000587580
SC_CNTRL_NUM x0000000000000000
SC_CNTRL_TYPE x0057CE1000000000
*SC_CTRL xFFFFFC000057CE10
SC_IOHANDLE x0000000400010100
SC_FLAGS x00000002
SC_REG_OFF x00000000
SC_MAX_ACT x0000003C
SC_SPEC_ACT x00000004
SC_CMDS_ACT x00000001
*SC_ACT_FLINK xFFFFFC0003CD87D0
*SC_ACT_BLINK xFFFFFC0003CD87D0
SC_CMDS_PENDING x00000015
*SC_PEND_FLINK xFFFFFC0003CD6050
*SC_PEND_BLINK xFFFFFC0003D702D0
*SC_FREE_FLINK xFFFFFC0003CD82A8
*SC_FREE_BLINK xFFFFFC0003CD84D8
SC_FREE_CMD_SLOTS x0000003F
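
The counters in the softc dump above are hexadecimal. As a reading aid
only (it is not part of the original report), here is a minimal Python
sketch that converts the logged values to decimal. The function name
parse_softc is hypothetical, and the treatment of a leading '*' as a
marker for pointer fields is an assumption based on how the log looks,
not on any driver documentation.

```python
# Decode 'NAME xHEX' lines copied from the ENT_XCR_SOFTC dump above.
# Only the values are taken from the log; what each field means is
# inferred from its name and is not confirmed by any documentation.
SOFTC_DUMP = """\
SC_FLAGS x00000002
SC_MAX_ACT x0000003C
SC_SPEC_ACT x00000004
SC_CMDS_ACT x00000001
SC_CMDS_PENDING x00000015
SC_FREE_CMD_SLOTS x0000003F
"""


def parse_softc(text):
    """Return {field_name: int} for every 'NAME xHEX' line in text."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].startswith("x"):
            continue  # skip headings, blank lines and free text
        name = parts[0].lstrip("*")  # assumption: '*' flags a pointer field
        fields[name] = int(parts[1][1:], 16)  # drop the 'x', parse as hex
    return fields


if __name__ == "__main__":
    for name, value in parse_softc(SOFTC_DUMP).items():
        print(f"{name:20s} {value}")
```

For the record logged above this gives SC_MAX_ACT = 60, SC_CMDS_ACT = 1,
SC_CMDS_PENDING = 21 and SC_FREE_CMD_SLOTS = 63. If the names mean what
they suggest, one command was active and 21 were queued behind it when
the timeout was logged.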

Original Post

>
> Hello Alpha Managers,
>
> What have I done wrong??
>
> We have a AlphaServer 1000 with V3.2C and KZPSA 3 channel Raid Controller.
> configured with the following RAID groups
> 1) RAID 5 - 7 x RZ29VW
> 2) RAID 5 - 3 x RZ29VW
> 3) RAID 0 - 2 x RZ28VA
>
> The RAID 0 set is the system disk. All filesystems are UFS and Write Through.
>
> We are about to have a disk re-organisation and I have 2 RZ28VW that I
> thought I would use to replace the RZ28VA disks. I tried to carry out
> a replacement with the system up as follows ( I should add that there
> was heavy I/O on the 2 RAID 5 systems and the systems other 2 SCSI
> busses throughout this but little activity on the RAID 0. )
>
> Added a new RZ28VW to the storageworks shelf
>
> Used the online GUI utility to define it as a hot spare
>
> Marked one of the RZ28VA as FAILED
>
> Removed the "Failed" drive from the shelf expecting it to be rebuilt
> onto the RZ28VW
>
> However got a mail that disk had failed and was NOT being rebuilt on Hot spare -
> no reason given ( Fault management is enabled )
>
> Decided not to risk things further - would put things back as they were
> until I could shut the machine down.
>
> Tried to unmark the RZ28VW as the hot spare - entire system froze.
> At the console ( A VT220) saw a stream of XCR I/O error messages and
> NO disk activity.
>
> Tried reset button - caused system to panic - but unable to write dump
> to anywhere. On booting hung Waiting for dra0.0.0.13.0 to poll.
> Power cycled system - ditto
> Power cycled system and disks - booted OK.
>
> Stupidly again tried to unmark Hot Spare - went round the above loop again!!
>
> Oh well leave this hot spare alone just put the other RZ28VA back in
> and let it rebuild. So put disk in but no rebuild started. ( possibly
> did not wait long enough?) fired up online utility and manually
> started rebuild. Disk activity showed rebuild was now going.
>
> 5 mins later system froze in same way as before. However both system
> disks now had their failed lights on and reboot hung after
> CPU 0 booting
> Cycling power did not help until I cycled the disks power while the
> server was hung in this state Urggh :-(
>
> Now able to get to >>> prompt
>
> Fired up standalone SXCXMGR from VT220 - but found that although
> cursor keys worked in the main menu they did not on the small
> confirmation YES NO menu so could not use utility.
>
> Tried SRLMGR but was the same.
>
> Went and found a keyboard and monitor and used system console to run SWCXMGR
> and marked the good system disk as OPTIMAL and was able to reboot the system.
>
> So what went wrong? How can I avoid it again? How do I safely rebuild
> my still "Failed" disk?
>
> Is it really only safe to do rebuild on an idle system or one that is
> down with the standalone utility?
>
> Any Help Welcome. I will open a call with DEC tomorrow.
>
> Thanks
>
> Tim.
>
> Tim Janes | e-mail : janes_at_signal.dra.hmg.gb
> Defence Research Agency | tel : +44 1684 894100
> Malvern Worcs | fax : +44 1684 895103
> Gt Britain | #include <std/disclaim.h>
>
Received on Mon Sep 16 1996 - 13:03:46 NZST
