T64v5.1 - slow disk access, waiting?

From: Simon Greaves <Simon.Greaves_at_usp.ac.fj>
Date: Mon, 03 Sep 2001 15:48:14 +1200

Hi,

I have an AlphaServer 800 4/400 with a Mylex DAC960 KZPSC RAID
controller (SWXCR) and four disks: one JBOD and three in a RAID 5 set.
The machine has LSM installed and most of the filesystems are AdvFS on
LSM, except for root, which is UFS (not LSM encapsulated). This config
seemed OK under DU 4.0D.

I upgraded to T64 V5.1 and applied the T64V51AS0003-20010521 Aggregate
Patch Kit, rebuilt the kernel, and all looked OK. I also needed to make
some changes to the LSM config, and I think I may have made a mistake
there: I certainly get an error at boot time, although it doesn't show
up with any of the vol* commands I've tried, and the system seems to
work OK _EXCEPT_ that disk accesses are very slow. During a normal
boot, I see the error:

  lsm:volio: Cannot open disk dsk0f: kernel error 16

then it carries on as normal. Similarly, if I boot single-user and
mount the /usr and /var partitions (as instructed in the patch install
docs), I get:

  lsm: volio: Illegal vminor encountered
  Error: /dev/vol/rootdg/vol_var is an invalid device
         or cannot be opened

and again, running /sbin/bcheckrc gives:

  starting LSM
  lsm: volio: Cannot open disk re0f: kernel error 16

This partition is marked as LSMsimp in the disklabel and contains the
/usr and /var volumes, one plex each. Using the vol* commands seems to
give the results I would expect, but I'm not really very knowledgeable
about LSM. Certainly vold is running and there are two voliod threads,
and the other checks in the LSM manual's troubleshooting section all
seem OK. The only other weirdness I can see is that the
/dev/vol/rootdg/vol_var device has an unusual timestamp, i.e.:

brw-r--r-- 1 root daemon 40, 7 Jul 12 21:43 vol_misc
brw-r--r-- 1 root daemon 40, 5 Jan 1 1996 vol_var
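
In case it helps, these are roughly the checks I have been running (I
can post the full output of any of them on request, and I'm not sure
which options are the most useful; I think kernel error 16 is just
EBUSY, i.e. the partition is already held open by something):

  volprint -ht                            # volume/plex/subdisk layout
  voldisk list                            # disks/partitions under LSM control
  disklabel -r dsk0                       # partition fstypes, incl. the LSMsimp one
  ls -l /dev/vol/rootdg /dev/rvol/rootdg  # device nodes and their minor numbers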

This may not be related, but using the (unsupported) Compaq monitor
tool I see the system spending what seems like a lot of cycles in
'wait', and the queue on the disks often looks quite long.
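
I can watch the same thing with the standard tools, e.g. at 5-second
intervals (these only give a coarse picture, but it matches what the
monitor tool shows):

  vmstat 5        # paging activity and CPU summary
  iostat 5        # per-device bps/tps, plus CPU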

This can be a real problem for disk-intensive operations. For example,
I was using a simple application that uses Sleepycat DB to create an
on-disk database of about 1.8GB. I ran the process niced down, but the
whole system ground to a near halt. I took the same code to another
AlphaServer and it ran with minimal impact on the system, and the same
on my own Linux PC, so it looks to me like there is a problem on the
main machine.
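
If anyone wants a simpler reproducer than the DB load, I was planning
to time a plain sequential write on both machines, something along
these lines (the path and sizes are just examples):

  # write a 512MB file onto the AdvFS-on-LSM filesystem, then remove it
  time dd if=/dev/zero of=/usr/tmp/ddtest bs=64k count=8192
  rm /usr/tmp/ddtest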

I ran swxcrmgr and checked the disks for errors. All but one had zero
errors; the last one (part of the RAID 5 set) had 127 miscellaneous
errors. Should I get Compaq to swap it out? I did notice that the RAID
set sometimes seems to spend more time accessing one disk (the same
one, I think) than the others, but then again, a VMS system next to it
does the same.
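
I haven't yet tried to pull the details for that disk out of the
binary error log; I'm assuming something like this would show them, if
DECevent is installed (the exact invocation may well differ):

  # should translate /var/adm/binary.errlog by default
  dia | more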

Any suggestions for how to diagnose this further would be much
appreciated; much more info is available on request :-)

Thanks,

Simon
-- 
Simon Greaves            Voice: +679 212114
Systems & Networks       Fax:   +679 304089
ITS, USP, Suva           Email: Simon.Greaves_at_usp.ac.fj
Fiji