AdvFS error after firmware update/error with a MSA1000 - more Info

From: Antonio Gonzalez <antonio.gonzalez_at_terra.es>
Date: Mon, 28 Mar 2005 19:27:47 +0200

Regarding the MSA1000, I've just found out that, at the time of the
issue, after the reboot, the MSA1000 had also lost the Host
Identification on ALL units. So both the connection profile and the
host ID were lost after the firmware update reboot.

I've tested the reboot again, but this time all the variables remain
the same after the power cycle of the MSA1000.
Of course, fixing the host ID doesn't help at all with my AdvFS
corruption.

Thanks for any help you can provide in trying to understand what
happened.
Regards
antonio

=== original report ======= >>>>>>>>>>>

We have an ES40 + MSA1000 (fabric). The Tru64 OS (rev. 5.1A-PK6) is on
the MSA1000, along with many Oracle domains. The MSA1000 f/w rev. is
4.32. There is a single controller in the MSA1000 and a single HBA in
the host. I was testing the MSA1000 firmware update procedure:
1.- update MSA1000 f/w to the version already installed (4.32, which is
the latest one)
2.- shut down Tru64
3.- power cycle MSA1000
>>> at startup, one of the drives failed.
Everything else was OK.
4.- replace the drive & start rebuild
5.- add spare (there was no spare before)
6.- reboot Tru64
>>> several AdvFS domain panics

After some investigation I found that the firmware update had CHANGED
the connection PROFILE from Tru64 to Default !!

I restored the right value back to Tru64.
In the MSA1000 all units seem to be OK, some of them are still
rebuilding but the status is OK.

Now I have lost /usr and 3 more oracle domains.
Rebooting the server doesn't change anything.
fixfdmn can't help.

The problems are in some of the units that live in the Array that uses
the failed disk, but all units were RAID5 protected.

===== boot messages ====
Mounting / (root)
user_cfg_pt: reconfigured
root_mounted_rw: reconfigured
user_cfg_pt: reconfigured
root_mounted_rw: reconfigured
user_cfg_pt: reconfigured
dsfmgr: NOTE: updating kernel basenames for system at /
    scp kevm tty00 tty01 lp0 dmapi dsk0 dsk1 scp0 floppy0 cdrom0 dsk2
    dsk3 dsk4 dsk5 dsk7 dsk8 dsk9 dsk10 dsk11 tape0
Mounting local filesystems
exec: /sbin/mount_advfs -F 0x14000 root_domain#root /
root_domain#root on / type advfs (rw)
/proc on /proc type procfs (rw)
exec: /sbin/mount_advfs -F 0x4000 usr_domain#usr /usr
live_dump: BMT page has the wrong page number: Expected 221, read 0.
unable to live_dump: directory /var/adm/crash not found

BMT page has the wrong page number: Expected 221, read 0.
AdvFS Domain Panic; Domain usr_domain Id 0x41f8c58d.0004bea0
An AdvFS domain panic has occurred due to either a metadata write error
or an internal inconsistency. This domain is being rendered
inaccessible. Please refer to guidelines in AdvFS Guide to File System
Administration regarding what steps to take to recover this domain.
live_dump: BMT page has the wrong page number: Expected 237, read 0.
unable to live_dump: directory /var/adm/crash not found
usr_domain#usr on /usr: I/O error

=======

The same happens for /var and 3 more Oracle domains. The /usr problem
prevents a multiuser boot.
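
To get a quick list of the affected domains out of a saved console log,
something like the following could be used. It is only a minimal sketch:
it assumes the panic line has exactly the form quoted in this post, and
the function name and the SAMPLE text (built from the two panic lines
quoted here) are made up for illustration.

```python
import re

# Matches the panic line as it appears in the console output of this post.
PANIC_RE = re.compile(
    r"AdvFS Domain Panic; Domain (\S+) Id (0x[0-9a-f]+\.[0-9a-f]+)",
    re.IGNORECASE,
)


def panicked_domains(console_text):
    """Return {domain_name: domain_id} for every panic line found."""
    return {m.group(1): m.group(2) for m in PANIC_RE.finditer(console_text)}


# Hypothetical sample: the two panic lines quoted in this post.
SAMPLE = """\
AdvFS Domain Panic; Domain usr_domain Id 0x41f8c58d.0004bea0
usr_domain#usr on /usr: I/O error
AdvFS Domain Panic; Domain ora_his3 Id 0x42010772.0009cca0
"""

if __name__ == "__main__":
    for name, dom_id in panicked_domains(SAMPLE).items():
        print(name, dom_id)
```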

Example: on ora_his3 I get this error:
exec: /sbin/mount_advfs -F 0x4000 ora_his3#histo3 /u08

Found bad xor in sbm_total_free_space! Corrupted SBM metadata file!
AdvFS Domain Panic; Domain ora_his3 Id 0x42010772.0009cca0
An AdvFS domain panic has occurred due to either a metadata write error
or an internal inconsistency. This domain is being rendered
inaccessible. Please refer to guidelines in AdvFS Guide to File System
Administration regarding what steps to take to recover this domain.
AdvFS I/O error:
    A read failure occurred - the AdvFS domain is inaccessible (paniced)
    Volume: /dev/disk/dsk10c
    Tag: 0xfffffff9.0000
    Page: 338
    Block: 354103696
    Block count: 256
    Type of operation: Read
    Error: 5 (see /usr/include/errno.h)
    EEI: 0x300
    AdvFS initiated retries: 0
    Seconds from first I/O attempt to this failure: 0
    Total AdvFS retries on this volume: 0
AdvFS I/O error:
    A read failure occurred - the AdvFS domain is inaccessible (paniced)
    Volume: /dev/disk/dsk10c
    Tag: 0xfffffff9.0000
    Page: 354
    Block: 354103952
    Block count: 128
    Type of operation: Read
    Error: 5 (see /usr/include/errno.h)
    EEI: 0x300
    AdvFS initiated retries: 0
    Seconds from first I/O attempt to this failure: 0
    Total AdvFS retries on this volume: 0
AdvFS I/O error:
    A read failure occurred - the AdvFS domain is inaccessible (paniced)
    Volume: /dev/disk/dsk10c
    Tag: 0xfffffff9.0000
    Page: 362
    Block: 354104080
    Block count: 144
    Type of operation: Read
    Error: 5 (see /usr/include/errno.h)
    EEI: 0x300
    AdvFS initiated retries: 0
    Seconds from first I/O attempt to this failure: 0
    Total AdvFS retries on this volume: 0
ora_his3#histo3 on /u08: I/O error
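
One thing worth checking in the three I/O error records above is
whether the failed reads are scattered or form one damaged region. The
sketch below parses the Page, Block and Block count fields and tests
whether each read starts where the previous one ended. The function
names are invented for this example, and SAMPLE is just the relevant
fields copied from the records above.

```python
import re

# Only the fields needed for the arithmetic are extracted.
FIELD_RE = re.compile(
    r"^\s*(Volume|Page|Block|Block count):\s*(\S+)\s*$", re.MULTILINE
)


def parse_io_errors(text):
    """Split the log on the record header and pull out the numeric fields."""
    records = []
    for chunk in text.split("AdvFS I/O error:")[1:]:
        fields = dict(FIELD_RE.findall(chunk))
        records.append({
            "volume": fields["Volume"],
            "page": int(fields["Page"]),
            "block": int(fields["Block"]),
            "count": int(fields["Block count"]),
        })
    return records


def is_contiguous(records):
    """True if every read begins exactly where the previous one ended."""
    return all(a["block"] + a["count"] == b["block"]
               for a, b in zip(records, records[1:]))


# Fields copied from the three records quoted above.
SAMPLE = """\
AdvFS I/O error:
    Volume: /dev/disk/dsk10c
    Page: 338
    Block: 354103696
    Block count: 256
AdvFS I/O error:
    Volume: /dev/disk/dsk10c
    Page: 354
    Block: 354103952
    Block count: 128
AdvFS I/O error:
    Volume: /dev/disk/dsk10c
    Page: 362
    Block: 354104080
    Block count: 144
"""

if __name__ == "__main__":
    recs = parse_io_errors(SAMPLE)
    first, last = recs[0], recs[-1]
    print("contiguous:", is_contiguous(recs))
    print("range: blocks %d to %d" % (first["block"],
                                      last["block"] + last["count"]))
    print("total blocks:", sum(r["count"] for r in recs))
```

For these three records the reads are back to back: 354103696 + 256 =
354103952 and 354103952 + 128 = 354104080, so the failure covers one
run of 528 blocks on dsk10c rather than three unrelated spots. The page
numbers agree with that (338 to 354 is 16 pages for 256 blocks, 354 to
362 is 8 pages for 128 blocks, i.e. 16 blocks per page).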

How can a filesystem get corrupted before it is even used, just by
trying to mount it after boot, without any application I/O?

With fixfdmn I could fix /var, but none of the others:
# /sbin/advfs/fixfdmn usr_domain
fixfdmn: Checking the RBMT.
fixfdmn: Clearing the log on volume /dev/disk/dsk0g.
fixfdmn: Checking the BMT mcell data.
fixfdmn: Checking the deferred delete list.
fixfdmn: Checking the root tag file.
fixfdmn: Checking the tag file(s).
fixfdmn: Checking the mcell nodes.
fixfdmn: Checking the BMT chains.
fixfdmn: Checking the frag file group headers.
fixfdmn: Checking for frag overlaps.
fixfdmn: Checking for BMT mcell orphans.
fixfdmn: Checking for file overlaps.
fixfdmn: Checking the directories.
fixfdmn: Can't add page because there are no free mcells.
fixfdmn: No free pages in this domain.
         Tag 3168.8002 in fileset usr remains inaccessible.
fixfdmn: Can't add page because there are no free mcells.
fixfdmn: No free pages in this domain.
         Tag 4354.8001 in fileset usr remains inaccessible.
fixfdmn: Can't add page because there are no free mcells.
fixfdmn: No free pages in this domain.
         Tag 4359.8001 in fileset usr remains inaccessible.
...... Many, many lines like these
fixfdmn: No free pages in this domain.
         Tag 25437.8001 in fileset usr remains inaccessible.
fixfdmn: Can't add page because there are no free mcells.
fixfdmn: No free pages in this domain.
         Tag 29656.8001 in fileset usr remains inaccessible.
fixfdmn: Checking the frag file(s).
fixfdmn: Checking the quota files.
fixfdmn: Checking the SBM.
fixfdmn: Completed.
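
Since the real output has many more of these lines, a small script can
count how many tags fixfdmn left inaccessible in each fileset. This is
only a sketch: it assumes the message format shown above, keeps the tag
and sequence numbers as plain strings (their numeric base is not
assumed), and the function name is invented. SAMPLE holds the five tag
lines quoted in this post, not the full output.

```python
import re
from collections import Counter

# Matches e.g. "Tag 3168.8002 in fileset usr remains inaccessible."
TAG_RE = re.compile(
    r"Tag ([0-9a-fA-F]+)\.([0-9a-fA-F]+) in fileset (\S+) remains inaccessible"
)


def inaccessible_tags(output):
    """Return a list of (tag, sequence, fileset) string tuples."""
    return TAG_RE.findall(output)


# The five tag lines quoted in this post.
SAMPLE = """\
fixfdmn: No free pages in this domain.
         Tag 3168.8002 in fileset usr remains inaccessible.
         Tag 4354.8001 in fileset usr remains inaccessible.
         Tag 4359.8001 in fileset usr remains inaccessible.
         Tag 25437.8001 in fileset usr remains inaccessible.
         Tag 29656.8001 in fileset usr remains inaccessible.
fixfdmn: Checking the frag file(s).
"""

if __name__ == "__main__":
    tags = inaccessible_tags(SAMPLE)
    per_fileset = Counter(fileset for _, _, fileset in tags)
    for fileset, n in per_fileset.items():
        print("%s: %d inaccessible tags" % (fileset, n))
```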

Then I tried to mount /usr ...

# mount /usr
ADVFS EXCEPTION
Module = ../../../../src/kernel/msfs/bs/bs_extents.c, Line = 3012
load_inmem_xtnt_map: bad extent map type
panic (cpu 0): load_inmem_xtnt_map: bad extent map type
syncing disks... done

DUMP: blocks available: 15000000
DUMP: blocks wanted: 374082 (partial compressed dump) [OKAY]
DUMP: Device Disk Blocks Available
DUMP: ------ ---------------------
DUMP: 0x1300023 11254095 - 14999997 (of 14999998) [primary swap]
DUMP.prom: Open: dev 0x5100081, block 2000000: SCSI3 1 4 0 1 0 0 0 @wwid0
DUMP: Writing header... [1024 bytes at dev 0x1300023, block 14999998]
DUMP: Writing data......................... [25MB]
DUMP: Writing header... [1024 bytes at dev 0x1300023, block 14999998]
DUMP: crash dump complete.

Any ideas to share?
Meanwhile, please be careful with this box!
antonio

Antonio González
e-Mail antonio.gonzalez _at_ terra.es
Received on Mon Mar 28 2005 - 17:30:30 NZST
