PRE-SUMMARY: mysterious HSZ 40/50/70 problems

From: <Peter.Braack_at_degussa.de>
Date: Tue, 05 May 1998 10:47:22 +0200

Many thanks to
 Paul Thompson
 Richard Jackson
 Nick Batchelor

who shared their HS* troubles with me.
However, there is no solution so far. We have changed nothing yet, and last
weekend a couple of HSZ70s crashed again.
Digital is still working on it; I'll summarize if they find anything.

Paul mentioned:
I have had problems with HSJ-40 controllers where the console terminal (or
DECserver, if you're using reverse LAT) hangs the CLI process after its output
buffer gets full.
Not exactly the same problem, but try disconnecting some of them from the
DECserver and running a terminal on them instead, and see if it helps.

Richard Jackson:
We have had similar hard luck stories with our dual-redundant HSZ40s.
There were two main causes. The first was that the writeback cache
batteries that were supposed to have a 5-year life actually had only 1.
Unfortunately, the HSZ40 prior to 3.1-2 did not handle the situation
well. Now the new writeback cache batteries are improved and will be
replaced proactively every two years. I noted that under bare HSOF
3.1, the once-a-day battery test would trigger an HSZ40 failover
that failed, and the result was corrupted AdvFS filesystems. Keep in
mind if you have AdvFS filesystems, some of your problems may have
nothing to do with the HSZs. I recommend running all of the AdvFS
tools you have at hand to scan for problems (e.g., under DU 4.0x try
verify and under DU 3.2x try vchkdir and msfsck). It may also be
prudent to back up, rebuild the filesystems, then restore under DU 4.0x
if you have old 3.2x AdvFS filesystems that may have AdvFS metadata
corruptions.

Nick:
Not really a solution but we had similar problems with an HSZ50
attached to an 8400 at one point. The symptoms were that a whole
bunch of disks would suddenly report SCSI errors in a burst and then
one or more of them would be flagged as faulty by the HSZ and spun
down.
We also had a problem where the HSZ hung when we deleted a unit from it
while replacing one of the so-called faulty disks. Like you, we
eventually replaced the HSZ and have had no problems since. We also
received no explanation from DEC.

my original problem:

>This is just a rather brief description, but I try not to waste bandwidth if
>nobody has any ideas:
>
>We have 14 HSZ40/50/70 connected to various Alphas (1000, 2100, 4100) with or
>without TruCluster.
>We use DU 3.2d and 4.0b with all patches. HSOF is V3.1-Z4, V5.1-Z4 and V7.0.
>All HSZ40/50 are single, the HSZ70 are dual redundant.
>
>During the last 3 months we had five crashes due to inaccessible units
>(some RAID, some single disks) on HSZs. The CLI was dead and hszterm
>didn't work either. Only one or more resets brought the HSZ back to life.
>I can provide some logs, but mostly it happens without a trace.

>Each HSZ was replaced by Digital and the failure never occurred again.
>So far Digital has no clue what was happening, but it seems that they
>are suspicious about our console manager setup. Every server and HSZ
>console port is connected to a DECserver 700 (more than one, and the failures
>occurred on nodes connected to different DECservers) and is managed from
>a single host with the Polycenter Console Manager V1.7.


Thanks,
Peter

Peter Braack
Unix System Administrator
Degussa AG, Frankfurt, Germany
Peter.Braack_at_degussa.de
Received on Tue May 05 1998 - 10:50:07 NZST
