SUMMARY: DECevent (dia) help - long

From: Dawn Lovell <dawn.lovell_at_centurytel.com>
Date: Mon, 24 May 1999 17:07:09 -0500

My summary is late, but the responses were timely! Thank you to the
following people for your assistance...

    alan_at_nabeth.cxo.dec.com (Alan Rollow - Dr. File System's Home for
      Wayward Inodes.)
    Chad Price <cprice_at_molbio.unmc.edu>
    "Macfarlane, Fraser" <Fraser.Macfarlane_at_compaq.com>

The consensus seems to be that the controller is likely failing, although
it was mentioned that the mysterious crash was similar to a problem that
occurred with loose memory boards. We'll be checking the latter when
we bring the machine down again, just to be on the safe side.

As it turns out, our problem appears to have been the firmware version
on our RAID controller (KZESC). We were at version 1.99, which Compaq
says is incompatible with Digital UNIX 4.0d. We couldn't find a controller
firmware requirement higher than 1.9 in the 4.0 installation manual or
the 4.0d release notes, but we're missing some of the associated docs
(shipped from another department) and could easily have overlooked it.

We've upgraded the KZESC firmware to 2.16 and will watch for any further
errors/problems. Thanks again!

Dawn Lovell
dawn.lovell_at_centurytel.com

--- Submitted Question ---
>One of our AlphaServer 1000s (DU 4.0d, patchkit 3) crashed yesterday morning
>with no trace of an error, nothing in any logs, no crash data, etc.
>Using uerf, we did see a controller error from last week. We're not
>certain if it is related or even exactly what it means. Looking back
>through the dia output, we've found a few instances of these errors
>since July of last year.
>
>Compaq support had us install DECevent to get more detailed information;
>the output from it for the latest error is included below. Would someone
>please take pity on me and explain what it means? It looks (to the
>uninformed, that being me :-) like an error with the controller itself,
>since it doesn't appear to mention a drive on the controller.
>
>Compaq thinks that the controller is going bad, although they're now
>reviewing the dia output to be sure. If this is enough information to
>tell, does that appear to be the case? Also, to what does the "Needs to
>be Restarted" flag refer?
>
>Thank you for your time and assistance.
>
>Dawn Lovell
>dawn.lovell_at_centurytel.com
>
>--- dia output ---
>Logging OS                  2. Digital UNIX
>System Architecture         2. Alpha
>Event sequence number       12.
>Timestamp of occurrence     11-MAY-1999 01:46:50
>Host name                   vs2
>
>System type register        x00000011 AlphaServer 1000
>Number of CPUs (mpnum)      x00000001
>CPU logging event (mperr)   x00000000
>
>Event validity              1. O/S claims event is valid
>Event severity              3. High Priority
>Entry type                  198. SWXCR RAID Controller Event
>
>
>------ Device Data ------
>Class                       x00 RAID Disk
>Subsystem                   x20 SWXCR Mport/RAID Controller
>Number of Packets           5.
>------ Packet Type ------   258. Module Name String
>Routine Name                xcr_cmd_timeout
>------ Packet Type ------   256. Generic String
> Controller has stopped responding
>------ Packet Type ------   260. Hardware Error String
>Error Type                  Hard Error Detected
>------ Packet Type ------   256. Generic String
> Controller Softc at time of error
>------ Packet Type ------   512. SWXCR Softc(XCR_SOFTC)
> Packet Revision            2.
>
>Controller Number           x00000000
>Controller Version          x00000000
>Flags                       x00000002 Needs to be Restarted.
>Normal Commands Active      60.
>Special Commands Active     4.
>Command Slots Active        0.
>Commands on Pending List    0.
>Command Slots Available     61.
>2560. Bytes Cmd Que Data ** Not Printed **
>--- End of dia output ---
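
For anyone who wants to look back through a saved dia report for earlier
occurrences of this event, here is a rough Python sketch that pulls a few
fields out of text laid out like the report quoted above. It is not part of
DECevent and uses no DECevent interface; the function names, the LABELS list,
and the assumption that each field sits on one line starting with its label
are mine, based only on this one report. It does not try to interpret the
Flags word beyond the hex value that dia printed.

```python
# Labels copied from the dia report quoted above; the list is illustrative,
# not a complete description of what DECevent can print.
LABELS = (
    "Timestamp of occurrence",
    "Host name",
    "Entry type",
    "Routine Name",
    "Flags",
    "Normal Commands Active",
    "Special Commands Active",
)


def parse_dia_event(text):
    """Return {label: value} for the known labels found in one dia event."""
    fields = {}
    for raw in text.splitlines():
        # Drop mail quote markers and leading blanks, then trailing blanks.
        line = raw.lstrip("> ").rstrip()
        for label in LABELS:
            if line.startswith(label + " "):
                fields[label] = line[len(label):].strip()
                break
    return fields


def flag_value(fields):
    """Return the hex word printed on the Flags line as an int (0 if absent)."""
    flags = fields.get("Flags", "")
    return int(flags.split()[0].lstrip("x"), 16) if flags else 0


# A few lines taken from the report above, used as sample input.
SAMPLE = """\
>Timestamp of occurrence 11-MAY-1999 01:46:50
>Host name vs2
>Entry type 198. SWXCR RAID Controller Event
>Routine Name xcr_cmd_timeout
>Flags x00000002 Needs to be Restarted.
>Normal Commands Active 60.
>Special Commands Active 4.
"""

if __name__ == "__main__":
    event = parse_dia_event(SAMPLE)
    print(event["Timestamp of occurrence"],
          event["Routine Name"],
          hex(flag_value(event)))
```

Because the match is on the label at the start of the line, it works whether
the values are separated by one space or padded out into columns.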
Received on Mon May 24 1999 - 22:07:53 NZST
