Interim Summary: Crash!

From: Cyndi Smith <cyn_at_odin.mdacc.tmc.edu>
Date: Thu, 09 Mar 2000 10:07:45 -0600 (CST)

I just wanted to bring you all up to date.
I still have not tracked down the problem. So far, the system seems to be
functioning OK since the reboot, but I am keeping an eye on it.

Thanks to
      John J. Francini
      Sean O'Connell
      Rodrigo Poblete
      Robert Carsey

for sending their advice. The overall impression seems to be that a
"simple_lock: time limit" panic is usually due to an operating system
bug. Unfortunately, our service level doesn't allow me to call Compaq
myself and open a ticket; I have to go through our main campus, which
is in Austin (200 miles away). From experience, this is usually not
worth it...
The last advice I got (from Robert Carsey) was a bit different --
he indicated that it sounds like we ran out of memory and swap space.
If that is so, it is due to a bug or other problem, since we have
4GB RAM and 9GB swap and our usage has never strained them before.

If anyone has any other ideas -- or ideas on how I might follow up
on Robert's idea -- I'd love to hear from you!

Attached below are my original posts.

Cyndi
-- 
-Cyndi Smith			     Programmer Analyst III, Biomathematics
-cyn_at_odin.mdacc.tmc.edu		M.D. Anderson Cancer Center, Houston, Texas
-phone: (713) 794-4938					fax: (713) 792-4262
			<http://odin.mdacc.tmc.edu/~cyn>
------------------begin attachment--------------------
Our 4100 5/400 crashed this afternoon.  All day (up to the crash), it was
intermittently slow -- to the point where it might take 30 seconds to
echo your keystrokes.
Since all access to this machine is generally remote, I ran netstat -id 
and all looked normal.  I also looked at the load and it was, if anything, 
low.
So, I chalked it up to our intermittent network glitches and didn't worry
too much.
Then came the crash....
From what I can tell from the crash logs (and I am far from an expert on
these things), the crash was caused by
      _panic_string:  0xfffffc00005b5e50 = "simple_lock: time limit exceeded"
The current PID is shown as:
      l3 address 0xffffffffffffffd8 not mapped, pte 0x0
which is frequently repeated in other places in the log as well...
About a week ago, we had a crash that seemed to be caused by
      _panic_string:  0xfffffc0000590898 = "pmap_begin_shared_region timeout" 
and I was able to track down the PID and find the program that caused it.
Any clues or ideas as to how I might proceed?
The only change I can think of in the system recently is that I installed ssh2
yesterday.
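
For comparing the two crashes, the panic strings can be pulled out of the
crash logs mechanically. What follows is only a minimal sketch, assuming the
log is plain text and contains lines in the same `_panic_string: 0x... = "..."`
form quoted above. The function name extract_panics and the sample text are
made up for illustration; they are not part of any Tru64 tool.

```python
import re

# Matches lines of the form quoted in this post, e.g.
#   _panic_string:  0xfffffc00005b5e50 = "simple_lock: time limit exceeded"
# The exact log layout is an assumption based on those two quoted lines.
PANIC_RE = re.compile(r'_panic_string:\s+(0x[0-9a-fA-F]+)\s*=\s*"([^"]*)"')


def extract_panics(log_text):
    """Return a list of (address, message) pairs found in log_text."""
    return [(m.group(1), m.group(2)) for m in PANIC_RE.finditer(log_text)]


# Sample input built from the two panic lines quoted in this post.
sample = '''
_panic_string:  0xfffffc00005b5e50 = "simple_lock: time limit exceeded"
_panic_string:  0xfffffc0000590898 = "pmap_begin_shared_region timeout"
'''

for addr, msg in extract_panics(sample):
    print(addr, msg)
```

Run against the text of each saved crash log, this would give a quick list
of which panic messages have occurred and how often they repeat.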
[followup post]
Sorry folks, I forgot to include vital info such as:
Tru64 4.0F, PatchKit 2 + kern_mod patch 
setld -i shows:
      OSFPAT00007600440 installed  Patch: Fix for simple lock panic (Kernel Patches)
So that is not the source of the problem unless the patch does not REALLY 
fix it <grin>.
Thanks for any leads!
Received on Thu Mar 09 2000 - 16:08:43 NZDT
