TCR 5.1 and LSM problem

From: Kevin Jones <Kevin.Jones_at_compelsolve.co.uk>
Date: Thu, 09 Aug 2001 22:14:09 +0100

We appear to have found a possible bug/feature in a TruCluster/LSM
configuration.

The setups we have noticed this on are a six-node and an eight-node
cluster, both running V5.1 and Patch Kit 3.
Both clusters have a shared fibre channel array for their cluster_root,
cluster_usr, cluster_var and quorum disk. Due to the nature of the HPTC
application to be run on each node, they also have a local LSM stripe set
with an AdvFS filesystem to provide a temporary scratch area; these are
mounted on a CDSL to provide a node-specific /scratch directory. Each of
these LSM stripesets is a separate diskgroup.
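
For context, the per-node scratch areas were built roughly along the
following lines (the disk, diskgroup, volume and domain names, and the
sizes, are illustrative rather than copied from our build notes, so
treat this as a sketch and check the relevant reference pages):

    # initialise the local disks for LSM and put them in a
    # node-specific diskgroup
    voldisksetup -i dsk10
    voldisksetup -i dsk11
    voldg init 32dg 32dg01=dsk10 32dg02=dsk11

    # make a striped volume across the local disks
    volassist -g 32dg make scratchvol 50g layout=stripe nstripe=2

    # put an AdvFS domain/fileset on the volume and mount it on the
    # node-specific /scratch CDSL (check mkcdsl(8) for the exact options)
    mkfdmn /dev/vol/32dg/scratchvol scratch_dmn
    mkfset scratch_dmn scratch
    mkcdsl /scratch
    mount -t advfs scratch_dmn#scratch /scratch
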
As we are in the test/setup phase of the project, the systems are being
shut down and rebooted fairly regularly. On reboot we started to notice
errors against some of these diskgroups:

starting LSM
lsm:vold:ERROR:Disk group 36dg: Reimport of disk group failed:
        Disk group has no valid configuration copies
lsm:vold:ERROR:Disk group 35dg: Reimport of disk group failed:
        Disk group has no valid configuration copies
lsm:vold:ERROR:Disk group 32dg: Reimport of disk group failed:
        Disk group has no valid configuration copies

We would also get the occasional crash on boot...

Further investigation revealed that the voldg list command showed these
diskgroups as 'disabled'. The only way round this was to deport the
diskgroups and then re-create them, which brought them back to the
enabled state. Interestingly, though, the data on the volumes within
these diskgroups was always accessible.
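
For reference, the recovery went roughly like this each time (36dg and
the disk names are just examples; check voldg(8) before relying on it):

    voldg list              # the affected groups show up as 'disabled'
    voldg deport 36dg       # deport the disabled group
    # ...then re-create the group from the same disks (voldg init
    # 36dg 36dg01=dsk20 and so on), after which 'voldg list' shows it
    # as enabled again and the data on the volume is still intact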

Our question, then, was why this was happening.

Our theory goes like this:

When a node is shut down, its local disks effectively become invisible to
the rest of the cluster, but CFS continues to serve the filesystem on those
disks from another node. This is the expected behaviour and is documented
in the cluster admin guide. If you attempt to access this filesystem you
get an I/O error until the down node returns and the filesystem is
manually relocated to be served by the returning node.
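
For anyone who wants to watch this happen, cfsmgr shows which member is
serving a given filesystem; something along these lines (the member name
is an example, and the relocation syntax is from memory rather than our
notes, so check cfsmgr(8)):

    cfsmgr /scratch                     # shows the current CFS server
                                        # and its status for /scratch
    cfsmgr -a server=member3 /scratch   # once the node is back, manually
                                        # relocate serving to that member
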
Unfortunately, it transpires from reading the voldg man page that an I/O
error can also cause a diskgroup to become disabled, hence the errors above.

The way we have got round this is to disable advfsd, as that is the most
likely thing (other than users) to try to look at the offline filesystem
and report an I/O error. In addition, we have created a shutdown script
that unmounts the scratch area from the cluster, which prevents CFS from
trying to serve the offline filesystem. This is a bit of a belt-and-braces
approach. We are now monitoring the clusters to see whether the changes
remove the errors for good.
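
The shutdown script itself is only a few lines; something like the
following sketch (the rc0.d link name is made up, and the real script
is specific to our setup):

    #!/sbin/sh
    #
    # /sbin/rc0.d/K00scratch - run when this member shuts down.
    # Unmount the member-local scratch area so that CFS is not left
    # trying to serve a filesystem whose disks disappear with the node.

    umount /scratch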


Does anyone have any comments or suggestions on the above? I'm not sure
whether this should be described as a bug or just an oversight, and I'm
not sure how many people will be running clusters with local LSM
stripesets.

Regards

Kevin Jones



