This has happened to me twice now.
The machine has been running with no problems for months. Solid as a rock.
Now I get this twice in a row within 60 minutes of each other:
Apr 8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
detected on cpu 0. Reporting suspended.
Apr 8 15:28:21 net vmunix: Machine Check error corrected by processor
Apr 8 15:32:44 net vmunix: fffd4
Apr 8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error
Could errant software cause this, or should I be looking for loose chips, SIMMs,
or cards? The last thing we did was hack wuftpd to not allow logins for users
with the shell /bin/nologin, which is for POP-only accounts.
The machine and cables haven't been moved at all in a while and nothing major
has changed since it was stable.
I do have the messages file with all the addresses, but our service contract
with those thieves at DEC has expired and the dump info is just so much
gibberish.
Do I have a definite hardware fault here, or can software elicit such a crash?
Dust in the case of the 1000? Loose SIMMs? Dirty card edges? Power is
conditioned through a UPS.
I also have the following at the end of the messages file, but we haven't
crashed yet:
Apr 8 16:58:27 net vmunix: WARNING: too many Processor corrected errors
detected on cpu 0. Reporting suspended.
Apr 8 16:58:27 net vmunix: Machine Check error corrected by processor
As shown above, we crashed within 5 minutes of the errors.
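For anyone who wants to check the same thing in their own messages file, here is a rough sketch of how the gap between the "corrected errors" warning and the following panic could be measured. The function names are my own invention, and the year is an assumption passed in by hand because syslog does not record it. It only matches the two message strings shown above, so treat it as a starting point rather than a real log analyzer.

```python
from datetime import datetime

def parse_stamp(line, year=1997):
    # syslog lines begin with "Mon D HH:MM:SS"; the year is not logged,
    # so it is supplied by the caller (an assumption).
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")

def warning_to_panic_gaps(lines, year=1997):
    """Seconds from each 'corrected errors' warning to the next machine-check panic."""
    gaps = []
    last_warning = None
    for line in lines:
        if "too many Processor corrected errors" in line:
            last_warning = parse_stamp(line, year)
        elif "panic" in line and "Machine check" in line and last_warning is not None:
            gaps.append((parse_stamp(line, year) - last_warning).total_seconds())
            last_warning = None  # a warning is paired with at most one panic
    return gaps

sample = [
    "Apr 8 15:28:21 net vmunix: WARNING: too many Processor corrected errors",
    "detected on cpu 0. Reporting suspended.",
    "Apr 8 15:28:21 net vmunix: Machine Check error corrected by processor",
    "Apr 8 15:32:44 net vmunix: fffd4",
    "Apr 8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error",
]
print(warning_to_panic_gaps(sample))  # [263.0], i.e. 4 minutes 23 seconds
```

A warning with no panic after it, like the 16:58:27 one, simply produces no entry in the result.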
I really need a tip on what direction to take to debug or eliminate this
problem without going broke doing it. We can't have this machine down for long.
Any help will be much appreciated and if I get any useful info on this event
I'll summarize for the list.
Thanks in advance,
J. Henry Priebe Jr. President & Network Administrator
root_at_net.bluemoon.net Blue Moon Internet Services
sysop_at_bbs.bluemoon.net Blue Moon Online System
http://www.bluemoon.net Try MoonMUD! "telnet mud.bluemoon.net 4000"
Received on Tue Apr 08 1997 - 23:51:35 NZST