OT: strange hardware problem with Alpha OEM-board-based server

From: Horst Reiterer <reiterer_at_bit-pilots.at>
Date: Tue, 15 Feb 2000 18:59:23 +0100

Hi,

I'm running Linux/Alpha (this is not a software-related question) on an
LX164 board. The problem is that after a random amount of uptime the system
hangs. The machine is still powered on and the monitor still detects a VGA
signal; however, the screen is black and the machine no longer responds to
anything.
This problem started about a year ago. At the time I thought it was a
one-off hang and forgot about it, but about three months later the same
thing happened again. Since then it has recurred again and again, and the
uptimes have kept decreasing; currently the system can't stay up for more
than a day...

I'm quite sure it's not a software-related problem, because it also
happened when the server was idle and no daemon was running. Moreover, I
reinstalled the system multiple times, with different kernels...
So I came to the conclusion that it's hardware-related.

The initial hardware configuration was as follows:

    LX164 Alpha motherboard, 21164A 600 MHz
    512 MB ECC RAM (4 DIMMs)
    Adaptec 2940 Ultra2 SCSI controller
    IBM 4.5 GB Ultra2 LVD SCSI HD
    3Com network adapter

I also want to point out that I'm (at least 99%) sure it's not a
temperature/overheating problem; I've checked the cooling fans and monitored
the temperature, and it's fine...
So far I have replaced the power supply and the RAM DIMMs, replaced the
motherboard with an SX164 board (21164PC 533 MHz), the SCSI controller with
an Intraserver, and the 3Com adapter with a DEC Tulip.

Unfortunately, nothing helped! Interestingly, after I replaced the Adaptec
with the Intraserver controller, the uptime decreased dramatically.
However, I don't know for sure whether the Intraserver controller caused
this or not...

The thing is that when it 'hangs' (no action possible, not reachable via
the network) there's still a VGA signal. If the CPU had stopped due to
overheating, or if the power supply were causing this, there would be no
VGA signal...
So I still think it has to do with the storage hardware. I'm pretty sure
the Intraserver is OK, and the HD should be too. After a reboot everything
works fine again...

Could the cause be a defective or bad SCSI cable? That's the one thing I
haven't changed yet; it may also be the SCSI terminator at the end of the
device chain. Is it possible that one of those items causes this?

I'd greatly appreciate any help on this subject. Thank you all in advance!
Will summarize!


cheers,

    Horst Reiterer
Received on Tue Feb 15 2000 - 17:59:45 NZDT
