A couple of weeks ago I asked for help with a 5-member TruCluster in
which all the members were coming up with "Amnesiac" for their hostname.
Several people responded. In particular, John Becker, Alan Davis, Shawn
Keller, Jason Orendorf, and Thierry Faidherbe all suggested that there
was some kind of problem with /var (corrupted wtmp, lost link, etc.). Tom
Blinn suggested looking at /var/adm/lmf (because my license PAKs weren't
being registered) and identified the string in getty which inserts
"Amnesiac" if the hostname can't be determined. I've included John
Becker's complete "Reasons for Amnesiac" at the end of this message.
Unfortunately, his #5 was the case here!
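
For anyone who hasn't run into this before, the fallback Tom described can be pictured with a small shell sketch. This is only an illustration of the behaviour, not getty's actual code; the function name banner_name is made up for the example.

```shell
#!/bin/sh
# Illustrative only: mimics the fallback described above, not getty's source.
# banner_name NAME prints NAME, or "Amnesiac" when NAME is empty.
# (banner_name is a hypothetical helper, not a Tru64 command.)
banner_name() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "Amnesiac"
    fi
}

# Typical use: feed it whatever hostname(1) reports on this machine.
banner_name "$(hostname 2>/dev/null)"
```

So an "Amnesiac" prompt only says that the hostname lookup came back empty at login time; it says nothing about why, which is what the list at the end of this message tries to cover.
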
People were right about a problem with /var. Looking at it more
carefully, though, it turned out that everything was fine until member2
mounted /var; then everything went bad. So I could have 4 out of 5
cluster members running fine, but if member2 was then booted *OR* I just
did an ls in /var/cluster/members/member2, there would be an
AdvFS file panic and /var would become inaccessible to all cluster
members.
So, as long as nobody touched anything in /var/cluster/members/member2
we were OK, and we limped along on 4 members until I got this sorted out
with help from Compaq support. (Because of some other unexplained
issues, nobody was quite sure at first whether this was a software or
hardware problem.)
Tried running "verify" with the cluster down except for one member in
single-user mode. That just made things worse because *nobody* could
mount /var after that ("Found bad xor in sbm_total_free_space!
Corrupted SBM metadata file!").
After that, I ran "salvage" on the cluster_var domain, removed the
domain ("rmfdmn cluster_var"), recreated it ("mkfdmn /dev/disk/dsk14c
cluster_var", "mkfset cluster_var var"), mounted it ("mount
cluster_var#var /var"), and copied all the salvaged files (except for
quota.user, quota.group, and .tags) to the new /var.
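
The copy step above could be sketched roughly as follows. This is a minimal sketch in POSIX sh, not the exact commands used at the time: the function name copy_salvaged and both directory arguments are hypothetical, and it assumes the three excluded names sit at the top level of the salvaged fileset. How cp -Rp treats symbolic links depends on the cp implementation, so check that on your system first.

```shell
#!/bin/sh
# Copy every top-level entry from a salvage directory into a new fileset,
# skipping the AdvFS quota files and the .tags directory.
# Usage: copy_salvaged SRC_DIR DEST_DIR   (both paths are hypothetical)
copy_salvaged() {
    src=$1
    dst=$2
    # The three patterns together match ordinary names, dotfiles, and
    # names starting with two dots, without matching "." or "..".
    for entry in "$src"/* "$src"/.[!.]* "$src"/..?*; do
        # Skip patterns that matched nothing (they stay as literal text).
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        name=${entry##*/}
        case $name in
            quota.user|quota.group|.tags) continue ;;
        esac
        # -R recurses into directories; -p preserves mode, owner, and times.
        cp -Rp "$entry" "$dst"/ || return 1
    done
}
```

Something like "copy_salvaged /some/salvage/dir /var" would then be run once the new fileset is mounted. The quota files and .tags are left out because the new fileset has its own.
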
After this, I brought up all the cluster members (except for member2), one
at a time. Then I ran clu_add_member to add member2 back to the cluster (I
had partially deleted it earlier - clu_delete_member worked until it
tried to delete /var/cluster/members/member2, at which point the file
system panicked). A few grumbles and minor complaints along the way, but
everything seems fine now.
Wish I knew what triggered all this but that's probably out of the
question. *Immediately* before this problem, somebody was testing
directio on *another* member; that member and then the whole cluster hung
completely and never came back properly. Moreover, the flash BIOS had to
be reinstalled on that other member.
Thanks for the advice and suggestions!
Chris
Chris Loken wrote:
>
> I've seen one message about a similar problem but there was no answer...
>
> Had a lot of trouble last night with a 5-member TruCluster which has
> been running well for a year. Now I can boot every member and they all
> successfully join the cluster but the console login for each member is
> "Amnesiac" rather than the hostname. Can only login as root and then the
> machine complains that it can't find an OSF-BASE license PAK and
> immediately dumps me into the configuration setup (sysman?).
>
> I _really_ hope I don't have to enter all the license PAKs again (48
> CPUs!). Any tips on what to do next or what to read? At least some of the
> configuration information is really there (like in rc.config).
>
> Thanks,
>
> Chris
> cloken_at_cita.utoronto.ca
"Becker, John" wrote:
>
> Chris:
>
> Reasons for Amnesiac (most common is 2):
> |
> | 1. /etc/wtmp and /etc/utmp files may be corrupt.
> |
> |    Recreated using /usr/sbin/acct/nulladm
> |
> | 2. Verify that /var and /usr/var are set up correctly.
> |
> |    If var is on its own partition, it may look as follows:
> |
> |    root-:/var/adm> ls -ld /var
> |    drwxr-xr-x  27 root  system  8192 Jul 26 14:39 /var
> |    root-:/var/adm> ls -ld /usr/var
> |    lrwxr-xr-x   1 root  system     6 Jul 22 22:01 /usr/var -> ../var
> |
> | 3. Verify that the system startup files in /etc/rc.config are
> |    valid. For example, rc.config may contain a bogus entry
> |    which will result in the system coming up as Amnesiac.
> |
> |    # diff /etc/rc.config /etc/rc.config_prev
> |    141a142,143
> |    > IFCONFIG=166.11.15.55="netmask 255.255.255.224 speed 200"
> |    > export IFCONFIG=166.11.15.55
> |
> |    This particular IFCONFIG environment variable is not valid.
> |
> | 4. Check that links on files in rc*.d are valid.
> |
> |    ls -l /sbin/rc*.d
> |
> | 5. It may not be possible to identify the root cause of the
> |    problem in which case a restore of / (root) and /var is
> |    required.
> |
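
For item 4 in John's list, reading through the ls -l output by eye gets tedious. A small sketch along these lines can list only the broken links. The function name dangling_links is made up for the example, and it only looks at the directory it is given, without recursing.

```shell
#!/bin/sh
# Print every symbolic link in directory $1 whose target does not exist.
# (dangling_links is a hypothetical helper name, not a system command.)
dangling_links() {
    for link in "$1"/*; do
        # Only symbolic links are of interest here.
        [ -L "$link" ] || continue
        # -e follows the link, so it fails when the target is missing.
        [ -e "$link" ] || printf '%s\n' "$link"
    done
}

# Example use, one call per run-level directory:
#   for d in /sbin/rc*.d; do dangling_links "$d"; done
```

No output means every link in that directory resolves; it does not check that the target is the right script.
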
----------------------------------------------------
Chris Loken Phone: 416-978-5619
Computing Facility Manager Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics
Received on Mon Oct 15 2001 - 13:34:53 NZDT