--
__/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/
Jeffrey G. Micono      505.844.6767
Ktech Corporation      505.268.3379
__/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/

From: Kurt Carlson <sxkac_at_java.sois.alaska.edu>

>Apr 8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
>detected on cpu 0. Reporting suspended.
>Apr 8 15:28:21 net vmunix: Machine Check error corrected by processor
>Apr 8 15:32:44 net vmunix: fffd4
>Apr 8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error

You have a hardware problem, probably memory or cpu. Check your
errorlog... with uerf you'll see CPU exceptions, with decevent you
may be able to tell exactly what.

>Could errant software cause this

no

>or should I be looking for loose chips, simms or cards?

reseating cards might help, might not. you need to find out what's
having the problem... check the error logs.

_____________________________________________________________________
Kurt Carlson, University of Alaska SOIS/TS, (907)474-6266
sxkac_at_alaska.edu 910 Yukon Drive #105.63, Fairbanks, AK 99775-6200

From: Dave Cherkus <cherkus_at_homerun.unimaster.com>

Blue Moon Network Administrator writes:
|>
|> This has happened to me twice now.
|>
|> the machine has been running with no problems for months. Solid as a rock.
|>
|> Now I get this twice in a row within 60 minutes of each other:
|>
|> Apr 8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
|> detected on cpu 0. Reporting suspended.
|> Apr 8 15:28:21 net vmunix: Machine Check error corrected by processor
|> Apr 8 15:32:44 net vmunix: fffd4
|> Apr 8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error
|>
|> Could errant software cause this or should I be looking for loose chips, simms
|> or cards? The last thing we did was hack wuftpd to not allow logins for users
|> with the shell /bin/nologin which is for a POP only type of account.

This is definitely a hardware problem.
|> The machine and cables haven't been moved at all in a while and nothing major
|> has changed since it was stable.
|>
|> I do have the messages file with all the addresses, but our service contract
|> with those thieves at DEC has expired and the dump info is just so much
|> gibberish.

Well, those thieves might also suggest you install a program called
DECevent that at least tries to decode the gibberish for you.

|> Do I have a definite hardware fault here or can software elicit such a crash?
|> Dust in the case of the 1000? Loose simms? Dirty card edges? Power is
|> conditioned through a UPS.

Hardware. The cpu can correct errors in the cache chips or the memory
simms. Try reseating the card that the CPU is on and the memory chips.
Don't try reseating the CPU itself - you will almost certainly bend a
pin and ruin the CPU. I hate to say it, but I have had problems with
this pattern and did have to have the CPU card replaced.

--
Dave Cherkus ------- UniMaster, Inc. ------ Contract Software Development
Specialties: UNIX Internals/Kernel TCP/IP Alpha Clusters Performance ISDN
Email: cherkus_at_UniMaster.COM When the music's over, turn out the lights!

From: Knut Hellebø <Knut.Hellebo_at_nho.hydro.com>

Regards,

Try shutting down to PROM mode and do 'set d_group field;memory' to test
the memory. If anything fails do 'showit' to get the status from the
memory test (and hopefully the failing SIMM(s)). To interrupt the test
you have to hard reset.

Good Luck ;-)

--
******************************************************************
* Knut Hellebø                             | DAMN GOOD COFFEE !! *
* Norsk Hydro a.s                          | (and hot too)       *
* Phone: +47 55 996870, Fax: +47 55 996342 |                     *
* Cellular Phone: +47 93092402             |                     *
* E-mail: Knut.Hellebo_at_nho.hydro.com    | Dale Cooper, FBI    *
******************************************************************

From: TetraPakDA_at_t-online.de (Tetra Pak APS GmbH)

Hi,

I had a similar problem on my Alpha Server. Changing the simms solved
it (we had DEC and third party simms). In console mode, init the system;
if it says something about a correctable error, try exchanging the simms.

Best regards
Claudia

From: "Dr. Tom Blinn, 603-881-0646" <tpb_at_zk3.dec.com>

> Now I get this twice in a row within 60 minutes of each other:
>
> Apr 8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
> detected on cpu 0. Reporting suspended.
> Apr 8 15:28:21 net vmunix: Machine Check error corrected by processor
> Apr 8 15:32:44 net vmunix: fffd4
> Apr 8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error
>
> Could errant software cause this or should I be looking for loose chips, simms
> or cards? The last thing we did was hack wuftpd to not allow logins for users
> with the shell /bin/nologin which is for a POP only type of account.

I passed your message along to one of the engineers who works with the
platform you are seeing the problem on, and he pointed out some things
you might not be aware of.

One class of processor correctable errors is single-bit memory errors.
These are corrected by the ECC logic, but an event is logged through the
"Machine Check" logic (an interface between the PALcode, which runs in
interrupt mode to deal with hardware problems, and the operating system).
When there are large numbers of Processor corrected errors in a short
period, you get the message about reporting being turned off (so the
error logs don't grow without bound, but you'll know we stopped logging
the errors).

A double bit error would NOT be corrected, and would panic the system,
perhaps with a "Machine check - Hardware error".
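The single-bit versus double-bit distinction above can be illustrated with
a small SECDED (single error correct, double error detect) code. This is
only a sketch, assuming a textbook extended Hamming code over 4 data bits;
it is not the ECC scheme the Alpha memory system actually uses, which
protects much wider words. The names encode and decode are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    word = [0] * 8                 # index 0 holds the overall parity bit
    word[3], word[5], word[6], word[7] = d     # data bits at positions 3, 5, 6, 7
    word[1] = word[3] ^ word[5] ^ word[7]      # parity over positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]      # parity over positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]      # parity over positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2                # makes the total number of 1s even
    return word


def decode(word):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos        # XOR of the positions of all set bits
    odd_parity = sum(word) % 2     # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Treated as a single-bit error: the syndrome names the flipped
        # position (0 means the overall parity bit itself was hit).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble
```

In this sketch a "corrected" result corresponds to the logged, non-fatal
event, and "uncorrectable" to the case where the hardware can only report
the failure and the system panics.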
You need to run UERF or DECevent (see the reference pages) against the
binary error log (which gets updated after the panic during the reboot
with the data from the hardware logout frame) and get the detailed log
of what made the system fail. (Unfortunately, on your system, there is
no UERF support, so you have to use DECevent.) Once you have that
information, it's possible to tell exactly what went wrong.

If you have a hardware support contract, I'd recommend you call this in
and ask that your system be repaired, because something's broken in the
hardware. If you are self-maintenance, then you need to do the analysis
of the error log yourself and decide what components to replace.

Tom

Dr. Thomas P. Blinn, UNIX Software Group, Digital Equipment Corporation
110 Spit Brook Road, MS ZKO3-2/U20 Nashua, New Hampshire 03062-2698
Technology Partnership Engineering Phone: (603) 881-0646
Internet: tpb_at_zk3.dec.com Digital's Easynet: alpha::tpb
ACM Member: tpblinn_at_acm.org PC_at_Home: tom_at_felines.mv.net

Worry kills more people than work because more people worry than work.

Keep your stick on the ice. -- Steve Smith ("Red Green")

My favorite palindrome is: Satan, oscillate my metallic sonatas.
-- Phil Agre, pagre_at_ucsd.edu

Opinions expressed herein are my own, and do not necessarily represent
those of my employer or anyone else, living or dead, real or imagined.

Received on Fri Apr 11 1997 - 22:50:55 NZST
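Before turning to DECevent, it can help to see how often each of the
messages quoted in this thread shows up in the syslog messages file. The
following is a rough sketch, not a substitute for decoding the binary
error log: it only counts text lines, the patterns are taken from the
log excerpt quoted above, and the function name classify is hypothetical.

```python
import re
from collections import Counter

# Patterns copied from the vmunix messages quoted in this thread.
PATTERNS = {
    "corrected": re.compile(r"Machine Check error corrected by processor"),
    "suspended": re.compile(r"too many Processor corrected errors"),
    "panic": re.compile(r"panic \(cpu \d+\): Machine check - Hardware error"),
}


def classify(lines):
    """Count how many lines match each machine-check message type."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


sample = [
    "Apr 8 15:28:21 net vmunix: WARNING: too many Processor corrected errors "
    "detected on cpu 0. Reporting suspended.",
    "Apr 8 15:28:21 net vmunix: Machine Check error corrected by processor",
    "Apr 8 15:32:44 net vmunix: fffd4",
    "Apr 8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error",
]
summary = classify(sample)
```

To run it against a real file, pass an open file object instead of the
sample list; the location of the messages file depends on the local syslog
configuration. A rising count of corrected errors followed by a panic
matches the pattern described in the replies, but only the decoded error
log can say which component is failing.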