SUMMARY / FOLLOWUP : error "Impossible Cond Detected"

From: Aldridge, Robert E. <REAldridge_at_mcdermott.com>
Date: Thu, 21 Jun 2001 13:30:44 -0500

Tru64 Managers:

The Compaq services engineer recommended replacing the KGPSA-CA firmware.
(The KGPSA-CA is the fibre channel host bus adapter [HBA] in the ES40
systems.)

Apparently there is a field service "blitz" notification that firmware
version 3.03X2 should be replaced.
The current version of the firmware for the KGPSA is 3.81A4 -- we'll
upgrade to it as soon as possible.

One follow-up question: What information might I lose when the firmware is
upgraded? (For instance, will I need to reset the WWIDs, or will the disk
numbering change when the upgrade occurs?)
  The actual storage set information should be stored with the HSG60
controllers, so I don't think I need to worry about losing that type of
info.

Thanks once again,

Rob Aldridge
AT&T Solutions
Alliance, Ohio


-----Original Message-----
From: Aldridge, Robert E. [mailto:REAldridge_at_mcdermott.com]
Sent: Wednesday, June 20, 2001 9:32 AM
To: 'tru64-unix-managers_at_ornl.gov'
Subject: SUMMARY: Help understanding error: "Impossible Cond Detected"
Importance: Low


Tru64 Managers:

I'll summarize, though I don't have any "real" answers and am still awaiting
Compaq support's response.

Some of the suggestions from the list participants include:

- Run "dia -R |more" [I don't seem to have the 'dia' command on this system]

- Possible conflict caused by snmd and advfsd trying to query the disks

- Understand which disk the driver thinks is "target 5" [see error message].

- Could a device be disconnecting abruptly? [If it is disconnecting, it's
not a mechanical change -- this is occurring in the middle of the night when
the system is unattended.]

- 'SCSI CAM ERROR PACKET' indicates that the problem could be even with the
SCSI connecting cables ...
[I don't understand the interaction between SCSI and Fibre Channel. The
disks are all Fibre Channel via HSG60. The only SCSI devices are the tape
library; and the cd-rom on each ES40 system.]



Thank you for the suggestions. If I find a more definitive answer I'll
re-summarize.



Rob Aldridge
AT&T Solutions
Alliance, Ohio

-----Original Message-----

Tru64 Managers:

On our ES40 cluster (Tru64 5.1 patch 3), we frequently get disk-related
errors. The errors seem to be generated from the HSG60 disk connection.

I would appreciate some help interpreting the error message (below). The
error mentions a particular SCSI target (b=2 t=5 l=0), but that target doesn't
exist on the ES40. Could the error message refer to the SCSI bus of the
HSG60 (MA6000 array)?

ALSO -- one ES40 in the two-node cluster crashes every few days. The
crashes do NOT occur at the same time as these error messages.


Here's a look at our hardware (hwmgr):


# hwmgr -view dev
 HWID: Device Name          Mfg      Model            Location
 ------------------------------------------------------------------------------
    4: /dev/kevm
   51: /dev/disk/floppy0c            3.5in floppy     fdi0-unit-0
   56: /dev/disk/dsk0c      COMPAQ   BF01863644       bus-0-targ-0-lun-0
   57: /dev/disk/dsk1c      COMPAQ   BF01863644       bus-0-targ-1-lun-0
   58: /dev/disk/dsk2c      DEC      HSG60            IDENTIFIER=110
   59: /dev/disk/dsk3c      DEC      HSG60            IDENTIFIER=120
   60: /dev/disk/dsk4c      DEC      HSG60            IDENTIFIER=10
   61: /dev/disk/dsk5c      DEC      HSG60            IDENTIFIER=20
   62: /dev/disk/dsk6c      DEC      HSG60            IDENTIFIER=30
   63: /dev/disk/dsk7c      DEC      HSG60            IDENTIFIER=40
   64: /dev/disk/dsk8c      DEC      HSG60            IDENTIFIER=50
   65: /dev/disk/cdrom0c    COMPAQ   CRD-8402B        bus-3-targ-0-lun-0
   66: /dev/cport/scp0               HSG60CCL         bus-2-targ-0-lun-0
  132: /dev/changer/mc0              TL800    (C) DEC bus-1-targ-0-lun-0
  133: /dev/ntape/tape0     DEC      TZ89     (C) DEC bus-1-targ-4-lun-0
  134: /dev/ntape/tape1     DEC      TZ89     (C) DEC bus-1-targ-5-lun-0
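
One way to double-check whether anything is listed at the address named in the
error (bus 2, target 5, LUN 0) is to pull the bus/target/LUN locations out of
the hwmgr listing. This is only a minimal Python sketch: it assumes the
"hwmgr -view dev" output has been saved as text, and the names parse_locations,
LOCATION_RE and SAMPLE are made up for illustration. The sample lines are
copied from the listing above.

```python
import re

# Location strings such as "bus-2-targ-0-lun-0" as printed by `hwmgr -view dev`.
LOCATION_RE = re.compile(r"bus-(\d+)-targ-(\d+)-lun-(\d+)")


def parse_locations(hwmgr_text):
    """Map (bus, target, lun) -> device name for each line with a bus location."""
    found = {}
    for line in hwmgr_text.splitlines():
        match = LOCATION_RE.search(line)
        if not match:
            continue  # e.g. HSG60 units, which are listed by IDENTIFIER instead
        fields = line.split()
        # fields[0] is the HWID ("56:"), fields[1] is the device special file.
        address = tuple(int(group) for group in match.groups())
        found[address] = fields[1]
    return found


# A few lines taken from the listing above (not the full output).
SAMPLE = """\
   56: /dev/disk/dsk0c      COMPAQ   BF01863644       bus-0-targ-0-lun-0
   58: /dev/disk/dsk2c      DEC      HSG60            IDENTIFIER=110
   66: /dev/cport/scp0               HSG60CCL         bus-2-targ-0-lun-0
  134: /dev/ntape/tape1     DEC      TZ89     (C) DEC bus-1-targ-5-lun-0
"""

locations = parse_locations(SAMPLE)
print(locations.get((2, 5, 0), "no device listed at bus 2, target 5, lun 0"))
```

Note that the HSG60 units are shown by IDENTIFIER rather than by bus/target/LUN,
so a lookup like this cannot say which HSG60 unit (if any) the driver means.
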
And here is the error we receive ---
From root_at_saturn.xyz.xyz.com Mon Jun 18 21:40:42 2001
Date: Mon, 18 Jun 2001 21:40:42 -0400 (EDT)
From: system PRIVILEGED account <root_at_saturn.xyz.xyz.com>
Subject: EVM ALERT [700]: SCSI event
Content-Length: 1864
======================= Binary Error Log event =======================
EVM event name: sys.unix.binlog.hw.scsi
    Binary error log events are posted through the binlogd daemon, and
    stored in the binary error log file, /var/adm/binary.errlog.  This
    event is used to report all SCSI device errors, including disk,
    tape, HSZ raid events, and adapter errors.
======================================================================
Formatted Message:
    SCSI event
Event Data Items:
    Event Name        : sys.unix.binlog.hw.scsi
    Priority          : 700
    PID               : 524693
    PPID              : 524289
    Event Id          : 1456
    Member Id         : 1
    Timestamp         : 18-Jun-2001 21:40:42
    Host IP address   : 131.184.3.49
    Cluster IP address: 131.184.3.52
    Host Name         : mars
    Cluster Name      : saturn
    User Name         : root
    Format            : SCSI event
    Reference         : cat:evmexp.cat:300
Variable Items:
    subid_class (INT32) = 199
    subid_num (INT32) = 2
    subid_unit_num (INT32) = 168
    subid_type (INT32) = 0
    binlog_event (OPAQUE) = [OPAQUE VALUE: 856 bytes]
============================ Translation =============================
Sequence number of error: 531235328
Time of error entry: 18-Jun-2001 21:40:42
Host name: mars
SCSI CAM ERROR PACKET
SCSI device class: DISK
Bus Number: 2
Target number: 5
Lun Number: 0
Name of routine that logged the event: cdisk_complete
Event information: Status = CMP but resid not NULL
Software detected event: Possible Software Problem - Impossible Cond Detected
Device Name: DEC     HSG60           V85L
Event information: Active CCB at time of error
Event information: CCB request completed w/out error
                ############### Entry End ###############
======================================================================
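
For anyone collecting several of these alerts, the bus/target/LUN can be pulled
out of the "Translation" section mechanically. The following is a rough Python
sketch, not a Tru64 or EVM tool: it assumes the mail text has been saved to a
string, and the names scsi_address, FIELD_RE and SAMPLE are hypothetical. The
sample lines are copied from the event above.

```python
import re

# Matches the three address lines exactly as they appear in the translation.
FIELD_RE = re.compile(
    r"^[ \t]*(Bus Number|Target number|Lun Number):[ \t]*(\d+)[ \t]*$",
    re.MULTILINE,
)


def scsi_address(translation_text):
    """Return (bus, target, lun) from a translated SCSI CAM error packet."""
    values = {name: int(number) for name, number in FIELD_RE.findall(translation_text)}
    return (values["Bus Number"], values["Target number"], values["Lun Number"])


# Excerpt of the translated event shown above.
SAMPLE = """\
SCSI CAM ERROR PACKET
SCSI device class: DISK
Bus Number: 2
Target number: 5
Lun Number: 0
Name of routine that logged the event: cdisk_complete
"""

print(scsi_address(SAMPLE))
```

In the hwmgr listing, the only entry with a bus-2 location is the HSG60CCL
(scp0) at target 0, so the address from the event does not match any device
that is listed by bus/target/LUN.
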
Thanks for any assistance you can provide.  I also have a call open with
Compaq support.
Rob Aldridge
AT&T Solutions
Alliance, Ohio
Received on Thu Jun 21 2001 - 18:31:36 NZST