This summary is a long time coming, but that's because it took a long time
to isolate the problem.
Thanks to:
Ronny Eliahu
Kevin Partin
Stan Horwitz
Denis Sylvester
John P Speno
Brad Bell
alan_at_nabeth.cxo.dec.com
Phil Baldwin
Yogesh Bhanu
Bryan Lavelle
Udo de Boer
Michael A Crowley
Almost all agreed that it wasn't a memory problem. One suggested it could be
a known bug caused by mixing different types of memory boards. Others
suggested CPU board problems. The overwhelming majority said this was
NOT a hardware problem at all but a software bug, probably in the OS. I was
reluctant to accept this because it just popped up all of a sudden and I
could boot a 5.1 kernel, a 4.0G kernel, or a 4.0F kernel and see the
same error.
Once again, the award goes to Dr. Tom Blinn.
His first impression was "software" too but after taking a gander at a
couple of my crash-data files he started leaning towards "hardware" as well.
He said it appeared as if one of my hardware cards was overwriting my
kernel data structures. (And his explanations showed me quite a bit about
reading those darn files that fill up my crash filesystem. They are good for
something after all!)
We don't have a support contract on "this" machine, and being a government
agency it'll probably take forever to get it straightened out, and I
couldn't wait that long. So I grabbed a static strap and a screwdriver
and dove in head first.
I pulled out all the cards on the right side of the machine, a SCSI Card,
two GB Ethernet cards, a FC Adapter and an MC Card. (I was actually
suspicious of the MC Card.) I booted the machine and it ran fine, no
problems. Hardware problem confirmed, but which one?
I decided to put half of them back: the SCSI Card and the two GB Ethernet
cards. I booted and it ran okay, but not for long... "kernel memory
fault"! After review of the crash-data file it appeared to be the same
random type error Dr. Blinn pointed out in the earlier files. I really
didn't think it was the SCSI Card so I pulled one of the GB Ethernet cards
and booted again. It ran fine.
I put the FC Adapter and the MC Card back in and booted back up again. It
ran fine, just like a normal healthy system. Out of curiosity I put that
suspect GB Ethernet card back in, and guess what happened? In less than 3
hours "kernel memory fault"! A quick review of the crash-data file confirmed
it. We definitely have a bad GB Ethernet Card! Now we need to determine
whether it is the one we recently purchased, and so still under warranty, or
the older one, in which case the warranty has run out and we'll have to buy
a replacement.
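For anyone who wants to make this kind of card swapping systematic, here is
a minimal sketch in Python of the same idea as a halving search. It is not
the exact sequence I followed above, and it rests on two assumptions: that
exactly one card is bad, and that the machine can be judged stable or not
after a long enough soak. The names find_faulty_card and is_stable are
hypothetical; in real life is_stable is a human fitting the listed cards,
booting, and waiting several hours for a "kernel memory fault".

```python
def find_faulty_card(cards, is_stable):
    """Return the one card whose presence makes the system unstable.

    Assumes exactly one faulty card. is_stable(installed) must return True
    when the machine survives a soak test with only `installed` fitted.
    """
    suspects = list(cards)
    known_good = []
    while len(suspects) > 1:
        middle = len(suspects) // 2
        half, rest = suspects[:middle], suspects[middle:]
        # Fit the cards already cleared plus half of the remaining suspects.
        if is_stable(known_good + half):
            known_good += half   # that half is clean; the fault is in the rest
            suspects = rest
        else:
            suspects = half      # the fault is somewhere in the half just fitted
    return suspects[0]


# Simulated run with the five cards from this machine (labels are made up).
cards = ["SCSI", "GbE card A", "GbE card B", "FC adapter", "MC card"]
bad_card = "GbE card B"


def simulated_soak(installed):
    # Stand-in for booting and waiting: stable unless the bad card is fitted.
    return bad_card not in installed


print(find_faulty_card(cards, simulated_soak))
```

With five cards this needs at most three soak tests, which matters when each
test means hours of waiting for a crash that may or may not come.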
Bottom line: *MY* "kernel memory fault" was due to a faulty GB Ethernet card
overwriting my kernel data structures in memory. Thanks again for all the
responses, but especially to Dr. Blinn, for his patience, keeping an open
mind, and teaching me a thing or two about crash-data files. (I never would
have learned THAT by calling support and letting them evaluate them.)
Jim
jpfitz_at_fnal.gov
"Take your WinXP! Use it. Strike at LINUX with all of your hatred and your
journey towards the dark side will be complete!" - Bill Gates
----- Original Message -----
> Hello,
>
> I've found lots of entries in the archives, this seems like a common
> problem, but not much in the way of summaries.
>
> I have a 4100 (2-CPU's 1GB Memory), running Tru64 V5.1 PK3. A few weeks
> after the upgrade to 5.1, several days after updating to PK3, I started
> getting "panic (cpu 0): kernel memory fault" at random intervals. The
> machine won't stay up for more than a day, usually failing within a few
> hours. We replaced the memory, but the same errors continue with no change.
> Prior to the upgrade we added a couple of new cards, a second DEGPA Gigabit
> Ethernet card, and a Memory Channel card, but I don't see how they could
> possibly be related to these errors.
>
> To make the problem even more confusing, we have an identical 4100
> (2-CPU's 1GB Memory), running Tru64 V5.1 PK3 and it works just fine! The
> ONLY difference between these two machines is, the failing one has 466MHz
> CPU's and the other has 600MHz CPU's, all other cards and adapters are the
> same.
>
> It's getting a little frustrating, can anybody help? Here's the
> information from "dia".
>
> --------------------------------------------------------------------------
> Logging OS                        2. Digital UNIX
> System Architecture               2. Alpha
> Event sequence number            35.
> Timestamp of occurrence              12-JUL-2001 15:30:56
> Host name                            XXXXX
>
> System type register      x00000016  Alpha 4000/1200 Series
> Number of CPUs (mpnum)    x00000002
> CPU logging event (mperr) x00000000
>
> Event validity                    1. O/S claims event is valid
> Event severity                    1. Severe Priority
> Entry type                      302. ASCII Panic Message Type
>
> SWI Minor class                   9. ASCII Message
> SWI Minor sub class               1. Panic
>
> ASCII Message                        panic (cpu 0): kernel memory fault
> --------------------------------------------------------------------------
>
> It looks as though we still have a memory problem, but with brand new
> memory...? Maybe it's something else that just looks like a memory problem,
> but really isn't?
>
> Any help would be appreciated.
>
> Thanks,
>
> Jim Fitzmaurice
> jpfitz_at_fnal.gov
>
> UNIX is very user friendly, It's just very particular about who it makes
> friends with.
Received on Mon Aug 06 2001 - 15:04:38 NZST