SUMMARY: 3-port KZPSC (SWXCR) RAID controller failing from Dejan Muhamedagic on 1998-10-17 (tru64-unix-managers)

From: Dejan Muhamedagic <dejan_at_yunix.co.yu>
Date: Fri, 16 Oct 1998 13:03:04 +0200

Hello,

Thanks go to
   "Dr. Tom Blinn, 603-884-0646" <tpb_at_doctor.zk3.dec.com>
   Yvon Lauriault <yvon.lauriault_at_nlc-bnc.ca>
   Bruce Taube <Bruce.Taube_at_InfoAve.Net>
   "Tim Tahaney" <Tim.Tahaney_at_nnr.dowjones.com>
for their advice.

Bruce said:
   We had similar problems with exactly the same configuration. After
   several attempts to resolve them from DEC support, we ended up
   replacing the 3 channel controller with single channel controllers.
   Haven't had a problem since.

I hope that this won't be necessary as this is not an option for
me.

Tim said:
   I have a single port kzpac on 4 systems with 4.0d + patch 1 and
   haven't had any problems, also defrag them each night.

   However I did have some crashes, panics which went away after
   turning off fault management on the kzpac thru the console
   maint program ...

However, this contradicts the SWXCR firmware documentation,
and it disables one of the controller's most important functions.

As Yvon suggested, I upgraded the KZPSC firmware to v2.49,
though the v5.2 firmware documentation for AS2100 mentions only
v2.36 as the required firmware revision. Yvon also noted that DU4.0D
requires v2.49, but the SPD doesn't mention anything about
the required KZPSC firmware revision (perhaps it should).

Dr. Tom Blinn thoughtfully said that I shouldn't try hard to break the
AdvFS/KZPSC combination with the defragment utility, and I changed
the crontab entry so that no more than one filesystem
is defragmented per run.
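
One way to keep each run to a single filesystem is to rotate through
the AdvFS domains by day. The following is only a sketch under stated
assumptions, not the crontab entry actually used here: the domain
names and the pick_domain helper are hypothetical, and the script
prints the defragment command instead of running it.

```shell
#!/bin/sh
# Sketch only: choose ONE AdvFS domain per nightly run, rotating by day.
# The domain names and the rotation scheme are assumptions for illustration.

# pick_domain DAY DOMAIN...
# Prints the domain at position (DAY mod number-of-domains).
pick_domain() {
    # Strip leading zeros so values like "089" from `date +%j`
    # are not parsed as octal by shell arithmetic.
    day=$(printf '%s\n' "$1" | sed 's/^0*\([0-9]\)/\1/')
    shift
    [ "$#" -gt 0 ] || return 1
    shift $(( day % $# ))
    printf '%s\n' "$1"
}

# Hypothetical domain list; substitute the domains on your own system.
DOMAINS="root_domain usr_domain raid1_domain"

today=$(date +%j)
# Word splitting of $DOMAINS is intended here.
target=$(pick_domain "$today" $DOMAINS)

# Dry run: print the command instead of executing it.
echo "would run: defragment $target"
```

With three domains, each one is visited every third night, so the
controller never sees more than one defragment pass at a time.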

Whether this will make the problem go away is still uncertain.

There was also a minor/major incident with the /usr AdvFS
filesystem. The good news is that the corrupted entry was not
important; the bad news is that salvage was not able to deal with
it. One directory (/usr/dt/appconfig/icons/C) contained an
invalid entry and there was no way to remove it. It was probably
created by defragment, because no sane application should try to
write anything to that directory. The salvage command failed
with the following message:

salvage: Internal data inconsistency in "get_pathname"

Please call Digital technical support (or fire brigades)

This message was repeated a few thousand times before salvage
exited. The corrupted entry looked something like
"Fpprnt.t))) col curs )". Perhaps somebody from DEC can
say something about this. The only thing I could do
was vdump/rmfset/mkfset/vrestore.
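
For readers unfamiliar with that sequence, here is a rough sketch of
the order of operations. It only prints the commands; it does not run
them. The domain, fileset, mount point and dump file names are
hypothetical, the umount/mount steps are my assumption about what has
to happen around rmfset/mkfset, and the exact options should be
checked against the reference pages. The dump file must of course
live outside the fileset being recreated.

```shell
#!/bin/sh
# Sketch only: print (not run) a dump / recreate / restore sequence.
# Every name passed in below is hypothetical; adjust before real use.

# plan_rebuild DOMAIN FILESET MOUNTPOINT DUMPFILE
plan_rebuild() {
    domain=$1 fileset=$2 mnt=$3 dump=$4
    cat <<EOF
vdump -0 -f $dump $mnt
umount $mnt
rmfset $domain $fileset
mkfset $domain $fileset
mount -t advfs $domain#$fileset $mnt
vrestore -x -f $dump -D $mnt
EOF
}

# Example with assumed names for the /usr fileset.
plan_rebuild usr_domain usr /usr /backup/usr.vdump
```

For /usr itself this would have to be done from single-user mode,
with the needed tools available outside /usr.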

Sorry for this long summary, but this seems like one of those
problems one can make neither head nor tail of.

Cheers!

-- 
Dejan Muhamedagic   dejan_at_yunix.co.yu

The original post:
> 
> Hello,
> 
> I'm having trouble with a PCI SWXCR (3-port KZPSC) controller.
> The server is a 2-CPU AS2100 5/300 running du4.0d (patch set 1
> installed).  The firmware is v5.1 and on SWXCR it is v2.36.
> There are 3 groups each with 3 RZ28M-VW in a RAID 5 configuration
> and one group of 2 RZ28M-VW mirrored.
> 
> Everything started 15 days after I upgraded from du4.0b to 4.0d
> and applied patch set 1.  Sometimes, the SWXCR controller stops
> responding but it doesn't happen too often and (so far) doesn't
> have catastrophic consequences--filesystems on RAID become
> unavailable and the only remedy is to reboot.  Since I moved
> the system disk to the SWXCR, if the filesystem rendered
> inaccessible is that one, then the machine panics and reboots
> (not surprising).  The binary.errlog file contains
> records of this, and a typical excerpt will be attached.
> 
> Most of this happens around 4am, which looked pretty odd
> to me, but this is what I've found in root's crontab:
> ----------------------------------------
> 1 4 * * * test -x /usr/sbin/defragcron && /usr/sbin/defragcron -p >>/usr/adm/defragcron.log 2>&1
> ----------------------------------------
> This says to defragment all mounted AdvFS in parallel, so there
> has indeed been a lot of activity early in the morning.  I changed this
> so that no more than two filesystems are defragmented.  However,
> that didn't make the problem go away.
> 
> Recently I moved boot from rz0 to RAID and the same thing
> happened during the following (run from single-user mode):
> # vdump -0 -f - /usr | vrestore -x -f - -D /mnt/usr
> (dump from internal SCSI to a RAID group).
> 
> It looks like the KZPSC cannot stand a lot of activity from a
> couple of 5/300 alpha CPUs.
> 
> Has anybody seen/resolved this?  Anybody out there having a
> stable (and pretty fast and I/O demanding) alpha with this kind of
> RAID controller?  I read a couple of good summaries from the
> archive, but it seems that nobody came to firm conclusions about it.
> 
> Sorry for such a long message.  However, there will be yet another
> posting which may have something to do with this affair.
> 
> Thanks for your time.
> 
> Sincerely,
> 
> Dejan Muhamedagic   dejan_at_yunix.co.yu

Received on Fri Oct 16 1998 - 11:04:16 NZDT
