I asked:
I have an AS2000 running 4.0G PK3, with 512 MB of memory and two 300 MHz CPUs.
In short,
I had a system that would not boot, and the / filesystem was corrupt.
Thanks for the fine answers from:
Dr. Tom
Ian Baker
Allan Rollow
Dr. Tom's answer
It sounds like the initial tape drive problem (was it "rmt0"?) led
to a sequence of mis-steps. I doubt you've been hacked. You need to
just work through getting the system stable again. Any time you make
almost ANY change in a production environment, you have the risk of
having something down-stream "break" because of a dependency on how
things were working before that wasn't fully understood.
There is probably a relatively simple explanation for each of the
symptoms you've hit. For instance, it's possible to have "osf_boot"
be missing because it never got restored from a backup. Or you can
hit other problems (like your bad /etc/fstab which probably happened
as you were re-building your boot disk from the prior problems). It
is just a messy process and you just have to keep finding things and
fixing them until things stabilize again.
Ian said to recreate the disklabel. I plan on doing that this weekend
but there is more involved.
Allan's comments deserve reading also. Very good.
Regarding the SCSI adapter that it was a wonder worked
at all... Actually, it looks like it wasn't working,
or at least was working only well enough to cause
problems with the devices it was presenting.
Regarding reformatting the page/swap space... Page/swap
space has no format of its own, other than the low
level format of the underlying disk that makes the
disk usable. Page/swap space is just blocks, with no
file system. Don't bother making one since it will
just get overwritten as it is used.
The absence of osf_boot is usually the result of the file
not being there, or something having happened to the
boot blocks of the disk. Someone in the last week or
so was changing partition tables. My vague recollection
(the list gets lots of questions) is that it might have
been you. If so, the disk may not have a boot block.
If the disk is failing or the SCSI adapter to which it
is connected is going insane, then that could cause the
content of the boot blocks to be overwritten or quietly
fail to read.
Unless a special device file has become corrupted, or its
major and/or minor number has changed, recreating it will
have no effect on whether the underlying device works. The special
file merely encodes the major/minor device numbers and
provides access control.
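
To illustrate what a special file actually holds, here is a minimal Python sketch that reads the type, major/minor numbers and permission bits from a device node. It assumes a Unix-like host with Python 3 available; the function name describe_special_file and the use of /dev/null as the example path are illustrative choices, not anything from Allan's message.

```python
import os
import stat


def describe_special_file(path):
    """Return the type, device numbers and permissions stored in a device node."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "character"
    elif stat.S_ISBLK(st.st_mode):
        kind = "block"
    else:
        raise ValueError(f"{path} is not a special device file")
    return {
        "kind": kind,
        # The major number selects the driver, the minor number the unit.
        "major": os.major(st.st_rdev),
        "minor": os.minor(st.st_rdev),
        # Ownership and mode are the access control part of the node.
        "mode": stat.filemode(st.st_mode),
        "owner_uid": st.st_uid,
    }


if __name__ == "__main__":
    # /dev/null is used only because it exists on most Unix-like systems.
    print(describe_special_file("/dev/null"))
```

If the numbers and mode reported for a node match what the system expects, recreating the node gains nothing; the fault lies elsewhere.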
I would track down a CDROM distribution of V4.0G and
boot it. To the extent possible, non-destructively
exercise all the devices on the system to verify they
seem to work. For disks with unused or page/swap
partitions, a read/write test is safe, if you can
manage not to touch other partitions. Check the
partition tables before doing anything that writes
to ensure they address the parts of the disk they're
supposed to.
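
The partition table check can be sketched roughly as follows, assuming you have copied the offset and size (in sectors) of each partition out of the disklabel by hand. The function name check_partitions and all the numbers in the example are made up for illustration. Pass in only the partitions that are actually in use, since a disklabel often defines overlapping partitions on purpose (for instance one that spans the whole disk).

```python
def check_partitions(disk_sectors, partitions):
    """Return a list of problems found in a partition table.

    partitions maps a partition letter to (offset, size) in sectors.
    """
    problems = []
    # Each partition must lie entirely within the disk.
    for name, (offset, size) in sorted(partitions.items()):
        if offset < 0 or size <= 0:
            problems.append(f"{name}: invalid offset or size")
        elif offset + size > disk_sectors:
            problems.append(f"{name}: extends past end of disk")
    # No two in-use partitions may share any sector.
    names = sorted(partitions)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            a_off, a_size = partitions[first]
            b_off, b_size = partitions[second]
            if a_off < b_off + b_size and b_off < a_off + a_size:
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    # Hypothetical layout: "a" runs into "b", and "g" runs off the disk.
    in_use = {"a": (0, 250), "b": (200, 300), "g": (500, 600)}
    for problem in check_partitions(1000, in_use):
        print(problem)
```

An empty result only says the numbers are self-consistent. It does not prove the table matches what is really on the disk.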
For devices with removable media (tapes), do write/read
testing on those to ensure they're working correctly.
Fix any hardware problems you encounter before going
further.
Mount the root file system with the standalone system
and verify it looks intact. Compare the top of the
root with that of the CDROM and a file listing of the
backup. For minor damage see if you can copy the missing
files from the CDROM. For anything else, restore from the
last known good backup.
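
The comparison of the top of the root against a reference can be sketched like this in Python. It is only an illustration: the function name missing_entries and the example paths are assumptions, and it checks names only, not file contents.

```python
import os


def missing_entries(reference_dir, target_dir):
    """Return names at the top of reference_dir that are absent from target_dir."""
    reference = set(os.listdir(reference_dir))
    target = set(os.listdir(target_dir))
    return sorted(reference - target)


if __name__ == "__main__":
    # Hypothetical paths: the standalone system's root as the reference,
    # and the damaged root file system mounted under /mnt.
    for name in missing_entries("/", "/mnt"):
        print("missing from damaged root:", name)
```

A name that shows up as missing (osf_boot, for example) is a candidate for copying from the CDROM. A long list suggests restoring from backup instead.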
If you have removable disks, you might also consider
a clean installation on a spare disk. Use that to help
check the rest of the system.
Be methodical.
So, to summarize the whole thing:
Basically, the SCSI controller failed / came loose from the box, causing
corruption on the / file system.
Everything else that happened was on me (not building the disklabel
correctly, moving osf_boot off of the main partition, etc.).
--
Ron Bramblett
Sys Admin
Fuller Brush Company
Received on Mon Nov 10 2003 - 16:17:28 NZDT