SUMMARY: mkfset hangs on HSG80

From: Udo Grabowski <udo.grabowski_at_imk.fzk.de>
Date: Mon, 06 Aug 2001 17:09:10 +0200

Damn ! (sorry...)

This one makes me really angry, as it was apparently
flaky Compaq software that cost me hours of work !
A kind Frenchman (LHERCAUD_at_bouygestelecom.fr) sent me
general advice to stop the advfsd daemon:
======================================================
>> I advise STOPPING the advfsd deamon which is of
>> very very very little usage and it's full of bugs.
======================================================
I took this as a general hint and decided to follow it later.
After 2 more hours and several tries I just killed the mkfset
command and restarted the advfsd daemon, but issuing mkfset from the
command line also hung.
Having no alternatives, I took a shot in the dark and killed the
advfsd daemons on all members of the cluster, and then IT WORKED!!!
I could create any number of filesets on that domain in seconds !!
It seems that advfsd cannot handle more than 2 filesets in a domain
and has extreme problems synchronizing across the cluster, blocking
access even for other commands.
So I would strongly advise anyone who sees bad performance on such
operations to kill this flaky bit of software and use the command-line
tools !

That kind guy also gave a useful hint for cleaning up after an
AdvFS panic:

>> also check and remove the /etc/fdmns/.*lock* that may concern your domain
>> (may be left there by previous session)

in addition to the usual fdmns directory entries.
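
To review which lock files the hint above refers to before removing
anything, a small sketch like the following could list them. It only
assumes the /etc/fdmns directory and the `.*lock*` pattern quoted above;
the function name is made up for illustration, and nothing is deleted.

```python
import glob
import os


def find_fdmns_lock_files(fdmns_dir="/etc/fdmns"):
    """Return the sorted paths in fdmns_dir that match the hint's `.*lock*` pattern."""
    # The pattern starts with a literal dot, so glob also matches hidden entries.
    pattern = os.path.join(fdmns_dir, ".*lock*")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Only list the candidates; check that they concern your domain first.
    for path in find_fdmns_lock_files():
        print(path)
```

Listing first and removing by hand keeps lock files of healthy domains
out of harm's way.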

THANKS LHERCAUD !!!
--------------------
Original posting:

Hello !

I have a problem creating more than one fileset on an
HSG80 device in Tru64 5.1/TruCluster 5.1 PK3 (7 members). Using
dtadvfs, I created a domain (named tru64) on an HSG80 60 GB
disk unit using the full dsk<n>c disk (as recommended in the
manuals). The disk unit is part of a mirrorset of two 72 GB disks
and occupies the space left on that set by the two other
units where the cluster root and usr/var are located.
Adding the first fileset took about 2 minutes, and after
that it could be mounted flawlessly. The second fileset
on that domain already took 5 minutes, but also showed up
successfully and could be mounted.
But the third fileset still does not appear, although the
creation process has been running for about 16 hours now. I had this
problem 2 days ago when my first attempt failed similarly,
and then I couldn't do anything with the advfs commands:
all commands hung and could not be killed (not even with -9!),
including the advfsd daemon. A reboot even made the machine
inaccessible, though it was running as a cluster member
(probably because an AdvFS panic occurred on the failed domain).
After that I had to clean the disklabel and the fdmns directory,
relabel the disk, and initialize and reboot the machine
as well as the HSG80, which shows no error.
The advfs log gives no clue. The last entries show that the
mkfset command issued by dtadvfs was accepted by both the
dispatcher and the runjob modules, but the 'Exit' entry seen
for the other two filesets is still missing. There's one
ERROR entry after unmounting the root fileset, regarding some
cache inconsistency that I've also seen a couple of times on
other machines, but that disappears after restarting the advfsd
process.
The mkfset command is still running on the machine, but has accumulated
only 5 sec of cpu time in the last half day. On the machine we
see periodic high activity of the advfsd process and a bunch
of ~60 icssvr_daemon_pe/_fr processes in WAIT state (what are
these guys good for, and why are there so many of them ?).
No revealing entries in the other logs.
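
The missing 'Exit' entry described above can also be searched for
mechanically. The following is a minimal sketch, not a tool shipped with
AdvFS: it assumes the log layout shown in the appendix (a "Dispatcher:
rcvd msg" line carrying job_num, the job arguments on the next line, and
an "Exit status of job is:" line once the job finishes) and that jobs run
one at a time, since the exit lines carry no job number. The function
name is made up for illustration.

```python
import re

DISPATCH_RE = re.compile(r"Dispatcher: rcvd msg:.*job_num:\s*(\d+)")


def unfinished_jobs(log_lines):
    """Return (job_num, args) for dispatcher jobs with no later 'Exit status' line."""
    unfinished = []
    pending = None      # job seen by the dispatcher, exit not yet logged
    want_args = False   # True while the next line carries the job arguments
    for raw in log_lines:
        line = raw.strip()
        match = DISPATCH_RE.search(line)
        if match:
            if pending is not None:
                unfinished.append(pending)
            pending = (int(match.group(1)), "")
            want_args = True
        elif want_args:
            # Arguments sit on the line after the dispatcher entry, before " | File".
            pending = (pending[0], line.split(" | File")[0].strip())
            want_args = False
        elif line.startswith("Exit status of job is:"):
            # Exit lines have no job number; pair them with the latest job.
            pending = None
    if pending is not None:
        unfinished.append(pending)
    return unfinished
```

Fed the lines of the log in the appendix, this should report only job 9
with the arguments "4 tru64 var", which is the hanging mkfset.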

A detailed ps shows that the mkfset process is idle and waits
for an event:

F        S   UID PID    PPID   %CPU PRI NI RSS  WCHAN    STARTED  TIME    COMMAND
80808001 I < 0   526690 525882 0.0  43  -1 320K 17912cc0 19:53:05 0:00.05 mkfset tru64 var

Is it possible to find out to which module the event
address belongs ?

What's going on there ? How can I get out of this situation
without hanging the machine again ? Is something wrong with
my setup ? Any recommendations on how to organize AdvFS domains and
filesets in a large cluster on large disk units (up to 500 GB)
with HSG80s ?

Thanks for any help !

Appendices:
showfdmn, showfsets (current state), advfs log, and disklabel
-------
# showfdmn tru64

               Id  Date Created             LogPgs  Version  Domain Name
3b6d839a.0105d85f  Sun Aug 5 19:34:18 2001     512        4  tru64

Vol   512-Blks       Free  % Used  Cmode  Rblks  Wblks  Vol Name
 1L  122316720  122306144      0%     on    256    256  /dev/disk/dsk4c#
-------
# showfsets tru64
root
         Id : 3b6d839a.0105d85f.1.8001
         Files : 2, SLim= 0, HLim= 0
         Blocks (512) : 32, SLim= 0, HLim= 0
         Quota Status : user=off group=off

usr
         Id : 3b6d839a.0105d85f.2.8001
         Files : 2, SLim= 0, HLim= 0
         Blocks (512) : 32, SLim= 0, HLim= 0
         Quota Status : user=off group=off
-------
log:
SNMP Research SNMP Agent Resident Module Version 12.2.0.0
Copyright 1989, 1990, 1991, 1992, 1993, 1994 SNMP Research, Inc.
Sun Aug 5 19:34:18 2001 | INFO | Dispatcher: rcvd msg: fd: 17 type: 3 job_num: 1
         1 tru64 /dev/disk/dsk4c 512 0 128 | File dispatcher.c | Line 118
Sun Aug 5 19:34:18 2001 | INFO | Runjob: received message:
         1 tru64 /dev/disk/dsk4c 512 0 128 | File runjob.c | Line 1283
Exit status of job is: 0.
Sun Aug 5 19:34:18 2001 | INFO | Runjob: received message:
         (null) | File runjob.c | Line 1283
Sun Aug 5 19:37:24 2001 | INFO | Dispatcher: rcvd msg: fd: 17 type: 3 job_num: 2
         4 tru64 root | File dispatcher.c | Line 118
Sun Aug 5 19:37:24 2001 | INFO | Runjob: received message:
         4 tru64 root | File runjob.c | Line 1283
Exit status of job is: 0.
Sun Aug 5 19:37:27 2001 | INFO | Runjob: received message:
         (null) | File runjob.c | Line 1283
Sun Aug 5 19:39:03 2001 | INFO | Dispatcher: rcvd msg: fd: 17 type: 3 job_num: 3
         7 tru64 root /mnt 0 0 0 rw | File dispatcher.c | Line 118
Sun Aug 5 19:39:03 2001 | INFO | Runjob: received message:
         7 tru64 root /mnt 0 0 0 rw | File runjob.c | Line 1283
Exit status of job is: 0.
Sun Aug 5 19:39:04 2001 | INFO | Runjob: received message:
         (null) | File runjob.c | Line 1283
Sun Aug 5 19:40:44 2001 | INFO | Dispatcher: rcvd msg: fd: 17 type: 8 job_num: 4
         :5:3b6d839a:0105d85f:1:8001:0:0:1:119860241:0:1:: | File dispatcher.c | Line 118
Sun Aug 5 19:40:44 2001 | INFO | Runjob: received message:
         :5:3b6d839a:0105d85f:1:8001:0:0:1:119860241:0:1:: | File runjob.c | Line 1283
Exit status of job is: 0.
Sun Aug 5 19:40:44 2001 | INFO | Runjob: received message:
         (null) | File runjob.c | Line 1283
Sun Aug 5 19:40:49 2001 | ERROR | The cache does not contain the requested data item | File ../AgentCache.cc | Line 186
Sun Aug 5 19:42:22 2001 | INFO | Dispatcher: rcvd msg: fd: 17 type: 3 job_num: 5
         8 tru64 root 0 0 0 0 | File dispatcher.c | Line 118
Sun Aug 5 19:42:22 2001 | INFO | Runjob: received message:
         8 tru64 root 0 0 0 0 | File runjob.c | Line 1283
Exit status of job is: 0.
Sun Aug 5 19:42:23 2001 | INFO | Runjob: received message:
         (null) | File runjob.c | Line 1283
Sun Aug 5 19:45:21 2001 | INFO | Dispatcher: rcvd msg: fd: 17 type: 3 job_num: 6
         4 tru64 usr | File dispatcher.c | Line 118
Sun Aug 5 19:45:21 2001 | INFO | Runjob: received message:
         4 tru64 usr | File runjob.c | Line 1283
Exit status of job is: 0.
Sun Aug 5 19:47:51 2001 | INFO | Runjob: received message:
         (null) | File runjob.c | Line 1283
Sun Aug 5 19:49:45 2001 | INFO | Dispatcher: rcvd msg: fd: 17 type: 3 job_num: 7
         7 tru64 usr /mnt 0 0 0 rw | File dispatcher.c | Line 118
Sun Aug 5 19:49:45 2001 | INFO | Runjob: received message:
         7 tru64 usr /mnt 0 0 0 rw | File runjob.c | Line 1283
Exit status of job is: 0.
Sun Aug 5 19:49:56 2001 | INFO | Runjob: received message:
         (null) | File runjob.c | Line 1283
Sun Aug 5 19:52:30 2001 | INFO | Dispatcher: rcvd msg: fd: 17 type: 3 job_num: 8
         8 tru64 usr 0 0 0 0 | File dispatcher.c | Line 118
Sun Aug 5 19:52:30 2001 | INFO | Runjob: received message:
         8 tru64 usr 0 0 0 0 | File runjob.c | Line 1283
Exit status of job is: 0.
Sun Aug 5 19:52:32 2001 | INFO | Runjob: received message:
         (null) | File runjob.c | Line 1283
Sun Aug 5 19:53:05 2001 | INFO | Dispatcher: rcvd msg: fd: 17 type: 3 job_num: 9
         4 tru64 var | File dispatcher.c | Line 118
Sun Aug 5 19:53:05 2001 | INFO | Runjob: received message:
         4 tru64 var | File runjob.c | Line 1283
-----
disklabel -r in the current state:
/dev/rdisk/dsk4c:
type: SCSI
disk: HSG80
label:
flags:
bytes/sector: 512
sectors/track: 254
tracks/cylinder: 20
sectors/cylinder: 5080
cylinders: 24079
sectors/unit: 122316723
rpm: 3600
interleave: 1
trackskew: 7
cylinderskew: 26
headswitch: 0 # milliseconds
track-to-track seek: 0 # milliseconds
drivedata: 0

8 partitions:
#         size     offset  fstype  [fsize bsize cpg]  # NOTE: values not exact
  a: 121314675          0  unused       0     0       # (Cyl. 0 - 23880*)
  b:   1000000  121314675  unused       0     0       # (Cyl. 23880*- 24077*)
  c: 122316723          0  AdvFS                      # (Cyl. 0 - 24078*)
  d:         0          0  unused       0     0       # (Cyl. 0 - -1)
  e:         0          0  unused       0     0       # (Cyl. 0 - -1)
  f:         0          0  unused       0     0       # (Cyl. 0 - -1)
  g:         0          0  unused       0     0       # (Cyl. 0 - -1)
  h:      2048  122314675  unused       0     0       # (Cyl. 24077*- 24078*)
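
As a quick sanity check on the label, the arithmetic below relates the
partition sizes to each other and to the "60 GB" unit mentioned in the
posting. It is only a sketch: the numbers are copied from the label above,
and the variable and function names are made up for illustration.

```python
SECTOR_BYTES = 512  # "bytes/sector" from the label

# (size, offset) in sectors, copied from the partition table above
partitions = {
    "a": (121314675, 0),
    "b": (1000000, 121314675),
    "c": (122316723, 0),
    "h": (2048, 122314675),
}


def end_sector(name):
    """Return the first sector past the end of the named partition."""
    size, offset = partitions[name]
    return offset + size


# a, b and h are laid out back to back and together cover the same range as c
contiguous = (end_sector("a") == partitions["b"][1]
              and end_sector("b") == partitions["h"][1]
              and end_sector("h") == end_sector("c"))

unit_bytes = partitions["c"][0] * SECTOR_BYTES
print(contiguous, unit_bytes, round(unit_bytes / 1e9, 1))
# prints: True 62626162176 62.6
```

So the c partition used for the domain spans the whole unit, about
62.6 GB in decimal units.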

-- 
Dr. Udo Grabowski                           email: udo.grabowski_at_imk.fzk.de
Institut f. Meteorologie und Klimaforschung II, Forschungszentrum Karlsruhe
Postfach 3640, D-76021 Karlsruhe, Germany           Tel: (+49) 7247 82-6026
http://www.fzk.de/imk/imk2/ame/grabowski/           Fax:         "    -6141
Received on Mon Aug 06 2001 - 15:09:59 NZST
