SUMMARY: Bad SCSI controller or disk

From: Bill Sadvary <sadvary_at_dickinson.edu>
Date: Thu, 11 Feb 1999 10:07:49 -0500 (EST)

Thanks for all the replies. The majority of the folks suspected the drive,
especially since the SCSI controller is on the motherboard.

I reseated the memory modules and some other scsi connectors and also
added a second 1 GB drive. I disconnected the original disk from the
bus and the system seems to be running fine so far. Time will tell.

I'll include some of the replies below the original msg.

-Bill Sadvary
 Dickinson College

Original msg:
-------------
I have an AS 200 4/166 that keeps crashing. No syslog entries, no errors in
the "messages" file and nothing in uerf. It just dumps core to the screen
and *tries* to reboot. (See below)

It appears to me to be a hardware problem but, unfortunately, we don't
have a hardware contract so I'm on my own.


Dec 8 14:19:03 ns1 vmunix: Alpha boot: available memory from 0x7f2000 to 0x3ffe000
Dec 8 14:19:03 ns1 vmunix: Digital UNIX V4.0D (Rev. 878); Tue Dec 8 13:48:32 EST 1998
Dec 8 14:19:03 ns1 vmunix: physical memory = 64.00 megabytes.
Dec 8 14:19:03 ns1 vmunix: available memory = 56.06 megabytes.
Dec 8 14:19:03 ns1 vmunix: using 238 buffers containing 1.85 megabytes of memory
Dec 8 14:19:03 ns1 vmunix: AlphaStation 200 4/166 system
Dec 8 14:19:03 ns1 vmunix: DECchip 21071
Dec 8 14:19:04 ns1 vmunix: 82378IB (SIO) PCI/ISA Bridge
Dec 8 14:19:04 ns1 vmunix: Firmware revision: 6.3
Dec 8 14:19:04 ns1 vmunix: PALcode: Digital UNIX version 1.46
Dec 8 14:19:04 ns1 vmunix: pci0 at nexus
Dec 8 14:19:04 ns1 vmunix: psiop0 at pci0 slot 6
Dec 8 14:19:04 ns1 vmunix: Loading SIOP: script 801d00, reg 82040000, data 80dc10
--
Everything is normal up to this point, then..
--
CAM_LOGGER: cam_error packet
CAM_LOGGER: bus 0 target 0 lun 0
ss_perform timeout
timeout on disconnected request
Active CCB at time of error
---
and the system hangs
---
What should have happened next was...
---
Dec  8 14:19:04 ns1 vmunix: scsi0 at psiop0 slot 0
Dec  8 14:19:04 ns1 vmunix: rz0 at scsi0 target 0 lun 0 (LID=0) (DEC RZ26F (C) DEC 630J)
Dec  8 14:19:04 ns1 vmunix: rz4 at scsi0 target 4 lun 0 (LID=1) (DEC RRD45 (C) DEC  0436)
Dec  8 14:19:04 ns1 vmunix: isa0 at pci0
Dec  8 14:19:04 ns1 vmunix: gpc0 at isa0
Dec  8 14:19:04 ns1 vmunix: ace0 at isa0
etc.

So it seems to be dying during the "scsi0 at psiop0 slot 0" phase, which
leads me to think it might be a flaky SCSI controller.  Or maybe the
disk, since the CAM error mentions "bus 0 target 0 lun 0"?
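
If the console output can be captured (for example over a serial console), one rough way to see which device the CAM errors keep naming is to tally the "bus/target/lun" lines. The following is only a sketch in Python: the function name and the sample text are illustrative assumptions, and the pattern is based solely on the error lines quoted above, not on any documented CAM log format.

```python
import re
from collections import Counter

# Pattern assumed from the console lines quoted in this message,
# e.g. "CAM_LOGGER: bus 0 target 0 lun 0".
CAM_DEVICE = re.compile(
    r"CAM_LOGGER:\s+bus\s+(\d+)\s+target\s+(\d+)\s+lun\s+(\d+)"
)


def count_cam_errors(console_text):
    """Count CAM error packets per (bus, target, lun) in captured console text."""
    counts = Counter()
    for match in CAM_DEVICE.finditer(console_text):
        bus, target, lun = (int(group) for group in match.groups())
        counts[(bus, target, lun)] += 1
    return counts


# Sample input copied from the error shown above (hypothetical capture).
sample = """\
CAM_LOGGER: cam_error packet
CAM_LOGGER: bus 0 target 0 lun 0
ss_perform timeout
timeout on disconnected request
"""

if __name__ == "__main__":
    for device, n in count_cam_errors(sample).most_common():
        print("bus %d target %d lun %d: %d error(s)" % (device + (n,)))
```

As a rule of thumb only: errors that always name the same target hint at that device, while errors spread across several targets hint more at something shared, such as cabling, termination or the controller.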

BUT! The system will fully boot if I cycle the power.  Once it's up for a
while, it then crashes at random times.

If someone could help me determine which is more likely at fault (the disk
or controller or ??) in this situation, I would appreciate it.

At this point, the system is totally hosed.  In desperation, I installed
V4.0E (thinking a re-format of the disk could be a cheap way out) and, of
course, it crashed in the middle of loading the subsets.

Not a good day.  ;-)

Some various replies
--------------------
        The RZ26F may be the problem.  I believe the driver message
        indicates that a command didn't finish within the expected
        time frame, so it performed a bus reset to get the bus and
        device back into an expected state.
alan_at_nabeth.cxo.dec.com
--------------------
I'd start with the disk drive -- on-board drive controllers fail far more
often than the (single-chip) motherboard SCSI controllers do.  There are
simply far fewer parts to fail.

Swap out the drive with another, or hook up a different SCSI drive to the
external SCSI connector.  If the problem disappears, your drive is sick.
Otherwise, you need a new motherboard....

John Francini <francini_at_nashua.progress.com>
--------------------
You are on the right track in suspecting either the SCSI controller or the
disk.  The SIOP is the NCR 53C810 chip; on that system, I believe it's on
the motherboard, not an add-in card.  In any case, open up the system box
and make sure ALL of the cables are well-seated; I'd pull them off and
reseat them, as over time you can get a bit of corrosion (oxidation) on
the connectors that can cause electrical connectivity problems.

If you've got a spare disk (as the disk was probably hosed anyway when you
partially installed V4.0E), you might swap in a spare hard disk, or if
there's room in the box for two disks, hook it up as unit 1 and try the
install there.  (1GB SCSI disks are getting to be a commodity item these
days; even Western Digital is selling them.)

If the SCSI controller is a plug-in card, remove and reseat the card, as
well as the cable reseats I noted.

And make SURE the SCSI bus is terminated at the CDROM end.  No termination
can lead to SCSI errors.

"Dr. Tom Blinn, 603-884-0646" <tpb_at_doctor.zk3.dec.com>
--------------------
Similar CAM error messages were reported in the release notes for one of
the 4.0 versions.  I can't lay my hands on them right now, but the solution
was to run, at the console,

        isp1020_edit -sd

to change the code used by the Qlogic 1020 card.  The 4.0D release notes,
section 3.1.5, mention similar problems, which they suggest using eeromcfg
to fix.  And there are various Qlogic problem/code update/fixes described
in recent firmware release notes, so you could try upgrading firmware if
isp1020_edit -sd doesn't work for you.

Oisin McGuinness <oisin_at_sbcm.com>
--------------------
I have had a couple of problems exactly like this.  In both cases, it
was a bad hard drive.  Could be a cable, too.  Definitely doesn't look
like a controller.
Ian Watkins
--------------------
Received on Thu Feb 11 1999 - 15:08:46 NZDT
