To cheer up Alistair and others :), here is a summary of a question i posed
on SCSI problems and a few other less important answers to older questions.
It does take time to write summaries, and some questions do not seem to
have readily apparent answers. It also takes time to diagnose, fix and move
onto the next disaster :) But here goes...
Original subject: DAT drive fails with SCSI errors
Original subject: DEC3000 300X meets Barracuda IV and disapproves
Original subject: VRT21 monitor - patchy illuminance
Original subject: HUBwatch 3.1 - slowest program in the world?
The second question was posted to comp.periphs.scsi - a very intense mix
of problems and discussions of Adaptec controllers...
BTW, before i start, the SCSI faq makes interesting reading, well worth a
look if you are a novice. It does lack some of the real nitty gritty but
it's a good introduction. My local faq archive...
<A HREF="src.doc.ic.ac.uk/usenet/news-faqs/news.answers/scsi-faq/part1">SCSI Part 1</A>
<A HREF="src.doc.ic.ac.uk/usenet/news-faqs/news.answers/scsi-faq/part2">SCSI Part 2</A>
>
> I've just noticed that our DDS-1 DAT drive is not playing ball anymore. It
> used to work, but now does not. This seems to coincide with the addition of
> a new drive to the SCSI chain, so i've removed this and it still hasn't
> returned to normality.
>
> The drive is getting old(ish) so it might be a hardware problem, although
> there's no indication on the two leds of any problem. I did notice the
> drive was set-up as target 7, so edited the kernel config file, changed
> the scsi_b prom setting from 6 to 7, changed the tape from 7 to 6,
> rebooted, removed the old rmt0h devices and MAKEDEV tz14. Didn't do
> any good.
Ok, the problem here is actually very obvious. This turned out to be the
standard problem of the SCSI bus being too long. In my original planning
of disk drives to machines, i wondered what type of SCSI controller the
DEC3000 500 had so i found some technical specs on www.digital.com which
claimed it was 'SCSI-2' - it made no mention of Fast SCSI-2, so i assumed
it was a slow SCSI and thought i had 6m to play with. The addition of the
BA350 to the bus added 1 metre of cable and 1 metre of internal bus on the
BA350. Since the external bus was already at approximately 2.5m, that made
about 4.5m in total and i thought i would get away with it. Fast SCSI-2's
3 metre maximum caught me out. For reasons i am not prepared to ponder, this
only affected the tape drive at the end of the chain, which gave regular
timeouts every hour, etc.
Please skip uerf output if not interesting.
> # uerf -c err -o full
> [...bits deleted...]
>
>
> ********************************* ENTRY 132. *********************************
>
> ----- EVENT INFORMATION -----
>
> EVENT CLASS ERROR EVENT
> OS EVENT TYPE 199. CAM SCSI
> SEQUENCE NUMBER 35.
> OPERATING SYSTEM DEC OSF/1
> OCCURRED/LOGGED ON Thu May 11 12:30:50 1995
> OCCURRED ON SYSTEM neptune
> SYSTEM ID x0004000F CPU TYPE: DEC
> CPU SUBTYPE: KN15AA
>
> ----- UNIT INFORMATION -----
>
> CLASS x0001 TAPE
> SUBSYSTEM x0000 DISK
> BUS # x0001
> x0078 LUN x0
> TARGET x7
>
> ----- CAM STRING -----
>
> ROUTINE NAME ctape_move_tape
>
> ----- CAM STRING -----
>
> Unexpected CCB status
>
> ----- CAM STRING -----
>
> ERROR TYPE Hard Error Detected
>
> ----- CAM STRING -----
>
> DEVICE NAME UNKNOWN
>
> ----- CAM STRING -----
>
> Active CCB at time of error
>
> ----- CAM STRING -----
>
> CCB request aborted by the host
> ERROR - os_std, os_type = 11, std_type = 10
>
>
> ----- ENT_CCB_SCSIIO -----
>
> *MY ADDR x861DAF28
> CCB LENGTH x00C0
> FUNC CODE x01
> CAM_STATUS x0042 CAM_REQ_ABORTED
> SIM QFRZN
> PATH ID 1.
> TARGET ID 7.
> TARGET LUN 0.
> CAM FLAGS x000000C0
> CAM_DIR_NONE
> *PDRV_PTR x861DAC28
> *NEXT_CCB x00000000
> *REQ_MAP x00000000
> VOID (*CAM_CBFCNP)() x003C4610
> *DATA_PTR x00000000
> DXFER_LEN x00000000
> *SENSE_PTR x861DAC50
> SENSE_LEN x40
> CDB_LEN x06
> SGLIST_CNT x0000
> CAM_SCSI_STATUS x0000 SCSI_STAT_GOOD
> SENSE_RESID x00
> RESID x00000000
> CAM_CDB_IO x000000000000000100000011
> CAM_TIMEOUT x00000E10
> MSGB_LEN x0000
> VU_FLAGS x0000
> TAG_ACTION x00
>
> ********************************* ENTRY 133. *********************************
>
> ----- EVENT INFORMATION -----
>
> EVENT CLASS ERROR EVENT
> OS EVENT TYPE 199. CAM SCSI
> SEQUENCE NUMBER 36.
> OPERATING SYSTEM DEC OSF/1
> OCCURRED/LOGGED ON Thu May 11 13:29:16 1995
> OCCURRED ON SYSTEM neptune
> SYSTEM ID x0004000F CPU TYPE: DEC
> CPU SUBTYPE: KN15AA
>
> ----- UNIT INFORMATION -----
>
> CLASS x0022 DEC SIM
> SUBSYSTEM x0000 DISK
> BUS # x0001
> x0078 LUN x0
> TARGET x7
>
> ----- CAM STRING -----
>
> ROUTINE NAME ss_abort_done
>
> ----- CAM STRING -----
>
> SCSI abort has been performed
>
> ********************************* ENTRY 134. *********************************
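One detail in ENTRY 132 may be worth decoding: the CAM_TIMEOUT field. The following is a minimal Python sketch, assuming the field is a hexadecimal count of seconds (that is how the CAM specification defines CCB timeouts, but it is not confirmed here for uerf's output). If the assumption holds, the value is consistent with the roughly hourly timeouts described above.

```python
# Value copied from the CAM_TIMEOUT line of ENTRY 132 (x00000E10).
cam_timeout_hex = "00000E10"

# Assumption: the field counts seconds.
timeout_seconds = int(cam_timeout_hex, 16)
timeout_minutes = timeout_seconds // 60

print(timeout_seconds)  # 3600
print(timeout_minutes)  # 60
```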
According to my VAR, the DEC3000 500 has SCSI-2 buses, which is what allows
them to use the 'tighter' timings and run as Fast SCSI-2 buses, the SCSI-2
being the important bit. He claimed that keeping up with the firmware
updates would have enabled the 'Fast' factor.
So, bottom line is, keep the SCSI bus as short as possible and as far
below the following maximum lengths as possible:
  Fast SCSI-2 -  3m - 10megs/s sync
  SCSI-1      -  6m -  5megs/s sync
  Diff SCSI   - 25m - 10megs/s sync
ALSO, always remember the internal cable lengths on the host
(*not relevant in this case, as it is a purely external bus i was using*)
and on any enclosures which may be in the SCSI chain.
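The length budget can be checked with a few lines of Python. This is only a sketch: the segment labels and the function names are made up for illustration, and the figures are the approximate ones from the DAT drive story above (about 2.5m of existing external bus, 1m of cable to the BA350 and 1m of bus inside it).

```python
# Maximum single-ended bus lengths in metres, as listed above.
MAX_LENGTH_M = {"fast_scsi2": 3.0, "scsi1": 6.0}

def bus_length(segments):
    """Total length in metres for a list of (label, metres) pairs."""
    return sum(metres for _, metres in segments)

def headroom(segments, mode):
    """Metres left before the limit for `mode`; negative means too long."""
    return MAX_LENGTH_M[mode] - bus_length(segments)

# Hypothetical labels; lengths are the approximate figures from the post.
external_bus = [
    ("existing external chain", 2.5),
    ("cable to BA350", 1.0),
    ("BA350 internal bus", 1.0),
]

print(bus_length(external_bus))              # 4.5
print(headroom(external_bus, "scsi1"))       # 1.5
print(headroom(external_bus, "fast_scsi2"))  # -1.5
```

Remember to add a segment for any internal cabling in the host or enclosures, since that counts against the same limit.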
ALSO, watch out for controller id's. I had assumed that the controller id
on our 500 was set to target 7. The two buses (internal and external)
were both set to target 6. I re-arranged this via the boot prom and the
>>> set SCSI_A 7
>>> set SCSI_B 7
commands.
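Before moving ids around it is worth writing down who claims which id on a bus, host adapter included, and checking for duplicates. A small Python sketch of that check follows; the function name is invented for illustration and the device ids in the example are hypothetical.

```python
from collections import Counter

def find_id_conflicts(host_id, device_ids):
    """Return the sorted SCSI ids claimed more than once on one bus."""
    counts = Counter([host_id, *device_ids])
    return sorted(scsi_id for scsi_id, n in counts.items() if n > 1)

# Hypothetical bus with devices at ids 0, 2, 4 and 6.
print(find_id_conflicts(6, [0, 2, 4, 6]))  # [6]  controller clashes with a device
print(find_id_conflicts(7, [0, 2, 4, 6]))  # []   no clash
```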
Allegedly, the fast/slow SCSI-2 modes can be altered with the following
prom variables. I can't remember if i checked these, and i've just
forgotten to check when i rebooted. If i had reconfigured the external
bus to slow, i *might* have gotten away with the longer bus.
>>> show FAST_SCSI_A
>>> show FAST_SCSI_B
ALSO, we were using a passive terminator. Fast SCSI-2 requires active
terminators and i (think i) know why. Passive terminators do not
do a good job due to the specification of the terminators not matching
the typical impedance of SCSI cables. The 5meg/s SCSI gets away with
this, but tightening up the timings to get Fast SCSI-2 performance
requires a higher quality transmission line, so out come the active
terminators to match the impedance more closely. Allegedly, cheap SCSI
cables do not help this due to low impedances and maybe higher resistive
losses.
ALSO, the BA350 can be re-configured from one bus into two half buses.
I converted from the default single bus to two half buses to shorten
my overall SCSI bus chain length. I had to put the Barracuda and CD-ROM
into even slots to get them on the correct half of the bus, but it has
saved me 0.5m (!).
ALSO, another piece of trivia, a Sparcstation 10 has 0.8m of internal
SCSI cabling. So using the onboard Fast SCSI-2 controller, you only get
2.2m MAX to play with. Grim, eh?
OK, ONTO THE NEXT QUESTION AND ANSWER
> I have a DEC3000 300X with an external BA353 which until recently
> only had an RRD43 CD-Rom in it. I've added a Barracuda IV to the bus,
> configured it as device number 1 and now i have problems.
>
> When the machine boots and it gets to fsck'ing the disks a lot of error
> messages along the lines of (copied from my handwriting)...
>
> cam_logger: CAM_ERROR packet
> cam_logger: bus 0 target 1 lun 0
> dme_tcds_resume()
> Invalid DME DAT element.
> cam_logger: CAM_ERROR packet
> cam_logger: bus 0 target 1 lun 0
> sim_err_sm
> Target went to message in phase
> <...REPEAT...>
Well, this turns out to be not entirely explainable. It works some of
the time, and not at others. It seems to work better if the Barracuda is
fsck'ed and mounted later on in the boot process. The drive was initially
set via jumpers to spin up at target id multiplied by 10 seconds. I
changed this to spin up on demand. This made no difference. BTW, once
the disk got going there were no problems.
A fellow sufferer had discovered that enabling termination power to the
bus on his Hawk 4 made everything ok. I tried this and it appears to have
solved the boot time problems - i booted three times afterwards from
power down, halted, and a shutdown -r and it came up first time, every
time.
I also had this problem on a BA350 with another (slightly older) Barracuda
IV disk. Again, enabling termination power to the bus gives me a machine
which boots first time, every time, touch wood.
It appears that most (all? laptops?) SCSI controllers supply termination
power to the bus so theoretically it is not required anywhere else.
Realistically there may be some problem with it appearing as a clean d.c.
+5v (?) signal at the other end of the bus. It is widely agreed that
multiple devices supplying termination power is not a problem.
ObRumour:
My VAR came up with a story about older Hawk disks having a problem on
spin-up where the drive would get confused if asked to 'do anything'
during initialisation. Apologies as to the vagueness of this story, i
made no notes, but he claimed that supplying termination power helped
solve this problem since the termination power somehow got dragged down
by the drive. Solved in later firmware editions, he claims.
ObVagueFact:
A BA353 has 0.9m of internal cabling and is not really designed
for Barracuda IV's cooling requirements. I only use 2/3 bays in
a cool machine room, and it's a usenet news disk - there's
my excuses.
Here's uerf output from a DEC3000 500 with a Barracuda IV being naughty...
> ********************************* ENTRY 128. *********************************
>
> ----- EVENT INFORMATION -----
>
> EVENT CLASS OPERATIONAL EVENT
> OS EVENT TYPE 300. SYSTEM STARTUP
> SEQUENCE NUMBER 0.
> OPERATING SYSTEM DEC OSF/1
> OCCURRED/LOGGED ON Sun Jun 4 19:30:39 1995
> OCCURRED ON SYSTEM neptune
> SYSTEM ID x0004000F CPU TYPE: DEC
> CPU SUBTYPE: KN15AA
> MESSAGE Alpha boot: available memory from
> _0x736000 to 0x6000000
> DEC OSF/1 V2.1 (Rev. 250); Tue May 16
> _11:52:17 BST 1995
> physical memory = 94.00 megabytes.
> available memory = 83.62 megabytes.
> using 360 buffers containing 2.81
> _megabytes of memory
> tc0 at nexus
> scc0 at tc0 slot 7
> tcds0 at tc0 slot 6
> asc0 at tcds0 slot 0
> rz1 at asc0 bus 0 target 1 lun 0 (DEC
> _ RZ25 (C) DEC 0700)
> rz2 at asc0 bus 0 target 2 lun 0 (DEC
> _ RZ25 (C) DEC 0700)
> rz3 at asc0 bus 0 target 3 lun 0 (DEC
> _ RZ26 (C) DEC T384)
> rz4 at asc0 bus 0 target 4 lun 0 (DEC
> _ RRD42 (C) DEC 4.5d)
> cam_logger: CAM_ERROR packet
> cam_logger: bus 1 target 2 lun 0
> ss_abort_done
> SCSI abort has been performed
> asc1 at tcds0 slot 1
> rz8 at asc1 bus 1 target 0 lun 0
> _(SEAGATE ST43400N 0116)
> rz10 at asc1 bus 1 target 2 lun 0
> _(SEAGATE ST15150N 0017)
> rz12 at asc1 bus 1 target 4 lun 0
> _(TOSHIBA CD-ROM XM-4101TA 0064)
> tz14 at asc1 bus 1 target 6 lun 0
> _(ARCHIVE Python 28388-XXX 4.28)
> fb0 at tc0 slot 8
> 1280X1024
> bba0 at tc0 slot 7
> ln0: DEC LANCE Module Name:
> ln0 at tc0 slot 7
> ln0: DEC LANCE Ethernet Interface,
> _hardware address: 08-00-2b-30-93-75
> DEC3000 - M500 system
> Firmware revision: 5.1
> PALcode: OSF version 1.35
> lvm0: configured.
> lvm1: configured.
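The rz/tz unit numbers in the boot log above appear to follow a simple rule, unit = 8 * bus + target, which is also why the tape at bus 1 target 6 is tz14. Here is a short Python sketch that checks this against a few lines copied from the log (with the wrapped device descriptions dropped). The formula is inferred from this log rather than quoted from DEC documentation, and the function names are made up for illustration.

```python
import re

def unit_number(bus, target):
    """Assumed convention, inferred from the log: unit = 8 * bus + target."""
    return 8 * bus + target

# Device lines copied from the boot log, without the descriptions.
LOG_LINES = [
    "rz1 at asc0 bus 0 target 1 lun 0",
    "rz8 at asc1 bus 1 target 0 lun 0",
    "rz10 at asc1 bus 1 target 2 lun 0",
    "rz12 at asc1 bus 1 target 4 lun 0",
    "tz14 at asc1 bus 1 target 6 lun 0",
]

PATTERN = re.compile(r"^(?:rz|tz)(\d+) at asc\d+ bus (\d+) target (\d+) lun \d+")

def matches_convention(line):
    """True if the unit number in a log line equals 8 * bus + target."""
    unit, bus, target = (int(g) for g in PATTERN.match(line).groups())
    return unit == unit_number(bus, target)

print(all(matches_convention(line) for line in LOG_LINES))  # True
```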
Some words of wisdom from OSF/1 supremo, Selden...
> Other than a hardware failure, the obvious things that I can think of are
>
> 1. check to make sure that all of the internal terminators
> have been removed from all of the devices on the SCSI bus
> (too often people have sworn to me that this was the case and then
> an internal terminator was found to still be installed on a disk drive)
>
> 2. the last device on the SCSI bus must be configured to supply termination
> power to the bus. It is actually best if ALL of the devices are jumpered
> to supply power to the bus. This is usually not the default.
>
> 3. check to make sure that the SCSI cables are all firmly seated.
> Archive Pythons are badly designed in this regard. The back plate is
> too thick. Connectors with (thin) metal ears do *not* seat properly and
> usually pop loose slightly. Cables with (thick) plastic ears are more
> firmly held by the metal bails. We usually have to modify the backplates
> (cutting the connector holes larger) so that the cables can seat fully
> into the connectors.
>
> I hope this helps a little.
>
> Selden
To quickly sum-up on HubWatch. I had one reply mentioning that it does
have to wait on the network a fair bit when polling via SNMP for info
from the network adapter - potentially a lot of requests and a lot
of info flying around the network. I still have not tried v4.0.
And VRT-21's - well i had it swapped out by Digital and it improved a bit.
I think it's a mixture of two effects - there is slight EMI/RFI interference
next to the monitor which affects it a little bit. I am surprised that
this can affect the intensity of areas of the screen, but i'm not a
physicist. It also appears that all monitors suffer to some degree from
uneven phosphor illuminance - it just depends how close you look and how
picky you are.
Incidentally, to anyone in Digital, can we have plain Sony monitors, please?
Re-badged if they really have to be, and not re-designed enclosures. I
assume the VRT-21 case is Digital's fault and i now refer to this
type of monitor as a Value Subtracted product. Stop wasting my Centre's
money on pointless re-designing of monitor cases.
And thanks to...
Steve Kanefsky <kanefsky_at_datamagic.com> (comp.periphs.scsi)
Wilko Bulte <wilko_at_yedi.iaf.nl> (comp.periphs.scsi)
Larry Furst <furst1_at_ix.netcom.com> (comp.periphs.scsi)
Mike <system_at_edu.nmsu.pslaxp> (alpha-osf-managers)
Selden E. Ball, Jr. <SEB_at_LNS62.LNS.CORNELL.EDU> (alpha-osf-managers)
Magali <magali_at_lirmm.fr> (alpha-osf-managers)
Apologies if i have forgotten anyone.
regards
|<evin
--
Kevin J. Walters London Para||el Applications Centre
k.j.walters_at_lpac.ac.uk Queen Mary and Westfield College
+44 (0)171 775 3247 Mile End Road, London E1 4NS
Received on Mon Jun 05 1995 - 23:11:40 NZST