Hi,
Our configuration is two AlphaServer 4100 5/300 systems running DU3.2G,
installed in a rack. Each is configured with 2 KZPSAs, 1 KZPSC,
1 DE500, 1 video card, and 1 DNSES, plus 384 MB of memory. The firmware
was upgraded with the 3.9 Firmware Upgrade CDROM plus the 3.8-6 SRM
console upgrade.
The O.S. disk (root, /usr, swap) on both nodes is a mirror made up of
two RZ28D-VW disks on the KZPSC RAID controller.
They have been working fine for the past 4 months.
Three weeks ago, we needed to install an additional card in one of the
AS4100s (the upper one). We carried out the installation with extreme
care, but the other server (which was running DU3.2G with 0 users at
that time) suddenly halted, showing the blue screen and the P00>>
prompt. Once we had partly recovered from the shock, we rebooted the
server. The event log says that everything is OK, except that there is
no shutdown event before the last startup event.
The problem remained a mystery (we assumed it was due to human error or
something like that) until one week ago, when we needed to add a card
to the second AS4100 (the lower one).
In the middle of the process (very, very, very meticulous), exactly the
same thing happened as three weeks before, except that the first thing
we noticed was the error LED starting to flash on one of the O.S.
mirror members (upper node).
With the node still up (and the other one still open), we tried to use
the swxcrmgr software for DU, but it generated a core file. We had
tried to run the program 4 or 5 more times when the machine went down
to the blue screen and the P00>> prompt. The event log says the same as
three weeks before: everything is fine. We replaced the "bad disk",
rebuilt the mirror, and so far everything is OK.
As far as I know, there is no reason for a node to go down when one and
only one member of the mirror fails. What happened? No idea.
Now we have two AS4100s that crashed strangely without an apparent
cause. The PCM boards on both machines don't report any problems.
These crashes never gave us a single clue. Has anyone had a similar
experience? Is there an FCO? Perhaps a tip for performing maintenance
on rack-mounted AS4100s? Is it recommended to power off both Alphas
for maintenance?
Thanks for reading this far. I will certainly summarize the responses,
if any.
Regards,
UNIX Admin
Received on Thu Jun 26 1997 - 01:30:56 NZST