Hi Managers,
My original posting follows below. My thanks to:
Hank Lee <hank_at_employees.org>
Pat O'Brien <pobrien_at_mitidata.com>
Alan <alan_at_nabeth.cxo.dec.com>
Dr. Thomas Blinn <tpb_at_doctor.zk3.dec.com>
Hank mentioned a friend had seen similar problems with faulty memory channel.
I checked the memory channel, checked the CQ logs, and could find no problems
there.
Pat suggested a problem with persistent SCSI reservations. I couldn't find
any evidence of this, but tried using 'cleanPR' just in case.
Alan suggested using 'scu' to check whether I could see the device, which I could!??!
Tom suggested it might be a bug in the kernel's device recognition subsystem.
Since the problem goes away after a reboot, a fault in that subsystem seems
likely.
We logged a call with CQ-AU and were advised that Patch Kit 0003 has a lot
of fixes for the hardware management subsystems, and that we should apply that
patch kit and see if the problem persists.
At this stage we plan to test the patch kit on our Test Cluster and then
apply it to our Production cluster, where it will hopefully fix the problem.
I'll post a followup if the problem resurfaces and I have any more news.
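
In the meantime, here is a rough sketch for spotting an affected node without
waiting for disklabel to fail. It relies only on the two 'file' outputs shown
in my original posting: the healthy node prints a hardware description after
the major/minor pair, and the affected node prints nothing there. The function
name classify_file_output is made up for illustration, and I am assuming a
POSIX-style shell (ksh rather than the old Bourne sh). Treat it as a heuristic
based on two samples, not as documented behaviour of 'file'.

```shell
#!/bin/sh
# classify_file_output (hypothetical helper): classify one line of `file`
# output for a raw disk special file.
# Assumption, based only on the two samples in the original posting:
#   healthy:  ...: character special (19/1109) SCSI #1 "HSG80" disk ...
#   affected: ...: character special (19/1109)
# Prints "bound", "unbound", or "unknown" (line is not a character special).
classify_file_output() {
    line=$1
    case "$line" in
        *"character special ("*")"*) ;;
        *) echo unknown; return 0 ;;
    esac
    # Drop everything up to and including the "(major/minor)" pair.
    rest=${line#*"character special ("}
    rest=${rest#*")"}
    # Any letter or digit left over means a hardware description is present.
    case "$rest" in
        *[A-Za-z0-9]*) echo bound ;;
        *) echo unbound ;;
    esac
}
```

On each node it would be called along the lines of
classify_file_output "$(file /dev/rdisk/dsk57c)"; a node reporting "unbound"
for a disk that another node reports as "bound" matches what I saw on node 2.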
Thanks to all ..
gunther
My original posting:
>Hi Managers,
>
>I've been observing "strange" things happening with device files
>lately under v5.1.
>
>Our setup: t64 v5.1 + TruCluster v5.1 + Patch Kit 0002
>
>Sometimes (intermittently) we find that from one node or another, we
>no longer have access to a device. If we reboot - everything is fine.
>
>e.g. If you try to do a disklabel to a disk, you get the message:
>
>disklabel: dsk57: No such device or address
>
>I've been experiencing this with disks (from an HSG80) or SCSI tapes
>(from a DLT892).
>
>I experienced this again today when I added two new disks. The
>procedure I followed:
>
>- Created stripe sets on HSG80
>
>- Initialised stripes
>
>- Added units
>
>- On node 2, did a 'hwmgr -scan scsi'
> The disks were assigned as dsk57 and dsk58
>
>- On node 1, did a 'hwmgr -scan scsi'
> I could see the LUNs, but the device names were 'unknown'
>
>- On node 1, did a 'dsfmgr -K'
> I could now see the devices as dsk57 dsk58
>
>- On node 2 did a 'disklabel -wr dsk57'
> Received the "No such device or address" message
>
>- On node 2 did a 'disklabel -wr dsk58'
> Worked fine.
>
>- On node 1 did a 'disklabel -wr dsk57'
> Worked fine.
>
>- On node 1 did a 'disklabel -wr dsk58'
> Worked fine.
>
>- On node 1 and 2, then tried a 'disklabel -r dsk57'
> Worked fine from 1, but got the "No such device or address" message
> on 2.
>
>- On node 1 and 2, then tried a 'disklabel -r dsk58'
> Worked fine from both.
>
>From node 2, I tried doing a truss on the 'disklabel -r dsk57' command,
>and the "interesting" part of the output came up with:
>
># truss disklabel -r dsk57
>[ Output truncated ... ]
>stat("dsk57c", 0x0000000140009DE0) = 0
>open("dsk57c", O_RDONLY, 043777733540) Err#6 No such device or address
>disklabewrite(2, " d i s k l a b e", 8) = 8
>l: write(2, " l : ", 3) = 3
>getuid() = 0 [ 0 ]
>getuid() = 0 [ 0 ]
>getgid() = 1 [ 1 ]
>getgroups(32, 0x000000011FFF9140) = 6
>open("/usr/lib/nls/msg/C/libc.cat", O_RDONLY, 00) Err#2 No such file or directory
>getuid() = 0 [ 0 ]
>open("/usr/share/.msg_conv-C", O_RDONLY, 01777777777760002723350) Err#2 No such file or directory
>dsk57write(2, " d s k 5 7", 5) = 5
>: write(2, " : ", 2) = 2
>No such device or addresswrite(2, " N o s u c h d e v i".., 25) = 25
>
>write(2, "\n", 1) = 1
>sigprocmask(SIG_BLOCK, 0xFFFFF137, 0x00000000) = 0
>_exit(4)
>
>Now, the files are there, checking from node 1:
>
># file /dev/rdisk/dsk57c
>/dev/rdisk/dsk57c: character special (19/1109) SCSI #1 "HSG80" disk #2 (SCSI ID #0) (SCSI LUN #33)
>
>but from node 2:
>
># file /dev/rdisk/dsk57c
>/dev/rdisk/dsk57c: character special (19/1109)
>
>All I'm left with is a reboot, but my downtime window isn't for a couple
>of days.
>
>Has anyone else experienced this, or have any ideas why it is occurring?
>
>Thanks in advance,
>gunther
Received on Thu Jun 14 2001 - 01:00:22 NZST