3-port KZPSC (SWXCR) RAID controller failing

From: Dejan Muhamedagic <dejan_at_yunix.co.yu>
Date: Tue, 13 Oct 1998 14:44:10 +0200

Hello,

I'm having trouble with a PCI SWXCR (3-port KZPSC) controller.
The server is a 2-CPU AS2100 5/300 running du4.0d (patch set 1
installed). The server firmware is v5.1 and the SWXCR firmware is v2.36.
There are 3 groups, each with 3 RZ28M-VW in a RAID 5 configuration,
and one group of 2 mirrored RZ28M-VW.

Everything started 15 days after I upgraded from du4.0b to 4.0d
and applied patch set 1. Sometimes the SWXCR controller stops
responding. It doesn't happen too often and (so far) hasn't had
catastrophic consequences: filesystems on the RAID become
unavailable and the only remedy is to reboot. Since I moved
the system disk to the SWXCR, if the filesystem rendered
inaccessible is the system one, then the machine panics and reboots
(not surprising). The binary.errlog file contains
records of this, and a typical excerpt is attached below.
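
A minimal sketch for summarizing such a report, assuming the formatted
output has been saved to a text file laid out like the excerpt attached
below (the file name errlog.txt and the function name summarize_errlog
are hypothetical). It prints the hour of occurrence and the entry type
for each entry, so clusters at a particular hour stand out:

```shell
#!/bin/sh
# Print one line per error-log entry: "<hour>:00 <entry type text>".
# Assumes each entry has a "Timestamp of occurrence" line followed by
# an "Entry type" line, as in the excerpt attached to this message.
summarize_errlog() {
    awk '
        # e.g. "Timestamp of occurrence 27-SEP-1998 04:04:20"
        /Timestamp of occurrence/ {
            split($NF, t, ":")      # last field is hh:mm:ss
            hour = t[1]
        }
        # e.g. "Entry type 198. SWXCR RAID Controller Event"
        /^[ \t]*Entry type/ {
            desc = $0
            sub(/^[ \t]*Entry type[ \t]+[0-9]+\.[ \t]*/, "", desc)
            printf "%s:00 %s\n", hour, desc
        }
    ' "$@"
}
```

For example, running summarize_errlog errlog.txt | sort | uniq -c
would give a count per hour and entry type.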

Most of this happens around 4am, which looked pretty odd
to me, but this is what I found in root's crontab:
----------------------------------------
1 4 * * * test -x /usr/sbin/defragcron && /usr/sbin/defragcron -p >>/usr/adm/defragcron.log 2>&1
----------------------------------------
This says to defragment all mounted AdvFS filesystems in parallel, so
there has indeed been a lot of activity early in the morning. I changed
this so that no more than two filesystems are defragmented. However,
that didn't make the problem go away.
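
To double-check the schedule against the 04:04 timestamps in the log,
the entry can be split into its five time fields and the command. This
is only a sketch; the function name describe_cron_entry is
hypothetical, and the redirection part of the entry is left out of the
example call for brevity:

```shell
#!/bin/sh
# Split a crontab entry into its five schedule fields and the command.
describe_cron_entry() {
    # Turn off globbing so the "*" fields are not expanded to file names.
    set -f
    set -- $1
    set +f
    minute=$1
    hour=$2
    mday=$3
    month=$4
    wday=$5
    shift 5
    printf 'minute=%s hour=%s day-of-month=%s month=%s weekday=%s\n' \
        "$minute" "$hour" "$mday" "$month" "$wday"
    printf 'command=%s\n' "$*"
}

describe_cron_entry '1 4 * * * test -x /usr/sbin/defragcron && /usr/sbin/defragcron -p'
```

With minute 1 and hour 4, and the remaining fields left as "*", the job
starts at 04:01 every day, a few minutes before the logged errors.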

Recently I moved the boot disk from rz0 to the RAID, and the
same thing happened while running this from single-user mode:
# vdump -0 -f - /usr | vrestore -x -f - -D /mnt/usr
(dump from internal SCSI to a RAID group).

It looks like the KZPSC cannot stand heavy activity from a
couple of 5/300 Alpha CPUs.

Has anybody seen/resolved this? Is anybody out there running a
stable (and pretty fast and I/O demanding) Alpha with this kind of
RAID controller? I read a couple of good summaries in the
archive, but it seems that nobody came to firm conclusions about it.

Sorry for such a long message. However, there will be yet another
posting which may have something to do with this affair.

Thanks for your time.

Sincerely,

Dejan Muhamedagic dejan_at_yunix.co.yu


******************************** ENTRY 12 ********************************


Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 56.
Timestamp of occurrence 27-SEP-1998 04:04:20
Host name panda

System type register x00000009 AlphaServer 2x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000000

Event validity 1. O/S claims event is valid
Event severity 1. Severe Priority
Entry type 198. SWXCR RAID Controller Event


------ Device Data ------
Class x00 RAID Disk
Subsystem x20 SWXCR Mport/RAID Controller
Number of Packets 5.

------ Packet Type ------ 258. Module Name String
Routine Name re_flush
------ Packet Type ------ 256. Generic String
                                     Cmd rejected by port
------ Packet Type ------ 259. Software Error String
Error Type Possible Software Problem - Impossible
                                     Cond Detected
------ Packet Type ------ 256. Generic String
                                     Active XCR_COM at time of error
------ Packet Type ------ 0. SWXCR Communication Block (XCR_COM)
   Packet Revision 1.

Controller Number x00000000
Unit Number on Controller x00000000
Function Status Codes x00000003 Command has Timed Out.
Adapters Status Code x0000 Normal Completion. Configuration
                                     transferred.
SWXCR Flags x00000000
Received by Callback x00000000
Data Xfer Length 0.
Number of Scatter Entries 0.
Command Data Length 0.
Block Number x00000000
Xfer Residual Length 0.
Timeout Value in Seconds 120.
XCR Command x0000000A Clear Cache of Dirty Blocks (Type 1).


******************************** ENTRY 13 ********************************


Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 55.
Timestamp of occurrence 27-SEP-1998 04:04:19
Host name panda

System type register x00000009 AlphaServer 2x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000000

Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 206. Advanced File System (AdvFS) Domain Panic

SWI Minor class 9. ASCII Message
SWI Minor sub class 4. Informational

ASCII Message
    AdvFS Domain Panic; Domain usre_domain Id 0x35fcf50d.00095960
    An AdvFS domain panic has occurred due to either a metadata write error or
    an internal inconsistency. This domain is being rendered inaccessible.
    Please refer to guidelines in AdvFS Guide to File System Administration
    regarding what steps to take to recover this domain.
      


******************************** ENTRY 14 ********************************


Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 54.
Timestamp of occurrence 27-SEP-1998 04:04:19
Host name panda

System type register x00000009 AlphaServer 2x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000000

Event validity 1. O/S claims event is valid
Event severity 3. High Priority
Entry type 198. SWXCR RAID Controller Event


------ Device Data ------
Class x00 RAID Disk
Subsystem x20 SWXCR Mport/RAID Controller
Number of Packets 7.

------ Packet Type ------ 258. Module Name String
Routine Name re_complete
------ Packet Type ------ 256. Generic String
                                     I/O failed
------ Packet Type ------ 260. Hardware Error String
Error Type Hard Error Detected
------ Packet Type ------ 256. Generic String
                                     Active XCR_COM at time of error
------ Packet Type ------ 0. SWXCR Communication Block (XCR_COM)
   Packet Revision 1.

Controller Number x00000000
Unit Number on Controller x00000003
Function Status Codes x00000003 Command has Timed Out.
Adapters Status Code x0000 Normal Completion.
SWXCR Flags x00000010 BP Points to Buffer.
Received by Callback x00000001
Data Xfer Length 8192.
Number of Scatter Entries 0.
Command Data Length 0.
Block Number x00215D70
Xfer Residual Length 0.
Timeout Value in Seconds 60.
XCR Command x00000003 Write (Type 1).

------ Packet Type ------ 256. Generic String
                                     Active Controller Working Set at time of
                                     error
------ Packet Type ------ 1. Controller/HBA Working Set(CNTRL_WS)
   Packet Revision 1.

General Flags x00000000
Command Retry Count 0.
160. Bytes Scatter/Gather ** Not Printed **
Mask Register x00000FFF

     - Registers 0->F -
Opcode x03 Write (Type 1).
Command ID x00
Count of Blocks 16.
Start Block Number x00215D70
Logical Drive x03
Pointer x00000000
Scatter-Gather Type x00 Unused.
Command ID x00 Unused.
Adapters Status Code x0000 Normal Completion.
Received on Tue Oct 13 1998 - 12:45:21 NZDT
