SUMMARY: Crash/It is happening again

From: Cyndi Smith <cyn_at_odin.mdacc.tmc.edu>
Date: Fri, 14 Apr 2000 12:51:11 -0500 (CDT)

I am so sorry this took so long, but I finally got the problems
resolved -- pretty much.

To refresh your memory:

Date: Wed, 8 Mar 2000 16:08:37 -0600 (CST)
Subject: Crash!

Our 4100 5/400 crashed this afternoon. All day (up to the crash), it was
intermittently slow -- to the point where it might be 30 seconds for it
to echo your keystrokes.
Since all access to this machine is generally remote, I ran netstat -id
and all looked normal. I also looked at the load and it was, if anything, low.
So, I chalked it up to our intermittent network glitches and didn't worry
too much.
Then came the crash....
From what I can tell from the crash logs (and I am far from an expert
on these things), the crash was caused by
      _panic_string: 0xfffffc00005b5e50 = "simple_lock: time limit exceeded"
The current PID is shown as:
      l3 address 0xffffffffffffffd8 not mapped, pte 0x0
which is frequently repeated in other places in the log as well...
Tru64 4.0F, PatchKit 2 + kern_mod patch
setld -i shows:
 OSFPAT00007600440 installed Patch: Fix for simple lock panic (Kernel Patches)

Date: Thu, 9 Mar 2000 10:07:45 -0600 (CST)
Subject: Interim Summary: Crash!

I still have not tracked down the problem. So far, the system seems to be
functioning OK since the reboot, but I am keeping an eye on it.
Thanks to
      John J. Francini, Sean O'Connell, Rodrigo Poblete, Robert Carsey
for sending their advice. The overall impression seems to be that a
"simple_lock: time limit" panic was usually due to an operating system
bug -- unfortunately our service level doesn't allow me to call Compaq
myself and open a ticket -- I have to go through our main campus, which
is in Austin (200 miles away). From experience, this is usually not
worth it...

Date: Thu, 9 Mar 2000 14:54:48 -0600 (CST)
Subject: It is happening again...

I apologize for asking again, but I don't know where else to turn and I
appreciate all of your advice after yesterday's crash (special thanks go
to Dr. Tom Blinn and Whitney Latta for their exhaustive explanations today).
Even with the wonderful help, I have been unable to diagnose the
cause of the crash, but the slow-downs are happening again.
Intermittently, the system seems to buffer all keystrokes. I usually have
several windows open, today one as root and several as a normal user. When
the slow-down occurs, it is in ALL windows -- even root's. The reason I say
the system is buffering the keystrokes is that I keep typing even though the
keystrokes aren't showing up in the window, & after a bit, they all show up.
dia reports nothing, no new messages, uerf reports nothing ...
# vmstat 5 5
Virtual Memory Statistics: (pagesize = 8192)
   procs      memory                  pages                  intr      cpu
  r   w  u  act free wire fault  cow zero react  pin pout  in sy  cs us sy id
  6 415 32 312K 153K  47K   16M   1M   3M   10K  10M  434 130 4K 771  8  3 90
  8 408 32 312K 153K  48K   14K 1341 2959     0 9506    0 531 9K  2K 34 11 56
  7 413 32 312K 153K  48K  5663  420  938     0 4088    0 376 8K  2K 28  4 68
  6 417 32 312K 152K  48K  3987  292  630     0 2887    0 227 8K  2K 26  3 71
  7 420 32 313K 152K  48K  3926  187  705     0 2951    0 452 9K  2K 31  8 61
# iostat 5 5
   tty      re0      re1      re2      re3        cpu
 tin tout  bps tps  bps tps  bps tps  bps tps  us ni sy id
   3  327  869  16   76   3  115   3  156   7   5  3  3 90
   5  255  222   5    0   0    2   0    0   0  26  0  1 73
   5  273   80   2   40   4    0   0   11   0  29  0  6 65
   4  324   43   1    0   0    0   0    8   1  28  0  4 68
   7 1170  829  23    0   0   45   4   75   6  30  0  8 63
# top
Load averages: 1.29, 1.19, 1.17 14:50:58
357 processes: 2 running, 1 waiting, 68 sleeping, 284 idle, 2 zombie
CPU states: 29.0% user, 0.0% nice, 7.2% system, 63.7% idle
Memory: Real: 2188M/4016M act/tot Virtual: 7M/8678M use/tot Free: 1240M
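
A rough way to read the vmstat samples above is to average the CPU columns
over the interval rows. This is only a sketch: the numbers are copied from
the output above, the helper name mean_cpu is made up for illustration, and
the first row is skipped on the assumption that it reports averages since
boot rather than a 5-second interval.

```python
# (us, sy, id) columns from the "vmstat 5 5" output above.
# Assumption: the first row is a since-boot average, so it is left out.
samples = [(34, 11, 56), (28, 4, 68), (26, 3, 71), (31, 8, 61)]


def mean_cpu(rows):
    """Return the mean (user, system, idle) percentages over the given rows."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


user, system, idle = mean_cpu(samples)
print(f"user {user:.1f}%  system {system:.1f}%  idle {idle:.1f}%")
```

With the CPU idle well over half the time across these samples, the
keystroke delays do not look like simple CPU starvation, which fits the
low load average reported by top.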

Date: Thu, 16 Mar 2000 14:44:06 -0600 (CST)
Subject: UPDATE: Crash/It is happening again

After applying the latest patch kit for 4.0F onto our 4100 5/400 (PK3,
uploaded to the dec patches site on Friday), I hoped that our problems were
over. To remind you, we were having intermittent "buffering" of keystrokes
-- it seemed to occur in all windows, with all logins, at the same time(s).
After a few seconds, everything a user had typed in a given window showed
up correctly in that window...
Despite my best efforts and the wonderful help (I learned a lot!) of this
group, I was unable to figure out exactly what happened, but the behavior
improved (fewer instances of the "buffering") and then the new patch kit
came out -- with several patches mentioning simple_lock...
With High Hopes, I installed the kit on Saturday. Unfortunately, the
behavior is still there -- not very often, but still.

-------

OK, on to the solution(?).

Since we never crashed again, despite the slowdowns worsening and all
the kernel tuning that I did trying to diagnose or fix the problem, I
concluded that the crash was simply coincidental with the slowdowns.

After all that, it turns out that the primary fault did not lie
with the AlphaServer -- it was just the place that showed the most
obvious symptoms. Either our subnet had simply outgrown our old concentrator
or the concentrator was intermittently failing. Either way, we replaced
it with a new switch and replaced some old Cat 3 wiring with Cat 5, and
voilà! Things are much better! We still have 1 or 2 second glitches,
but we have always had that. Netstat has reported no new errors since
the switch install (error rate had been 1-10 per minute) and the collision
rate has dropped from 25-30% to 1-15%. User perception is that things
are better than they were before the problems started!
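
For anyone wanting to track the same figures, here is a minimal sketch of
the collision-rate arithmetic. It assumes the rate means collisions as a
percentage of output packets (the usual reading of the Opkts and Coll
columns of netstat -i) and that it is measured between two snapshots of the
counters. The function names and the counter values are hypothetical, not
taken from our system.

```python
def collision_rate(opkts, collisions):
    """Collisions as a percentage of output packets; 0.0 if nothing was sent."""
    if opkts <= 0:
        return 0.0
    return 100.0 * collisions / opkts


def delta_rate(before, after):
    """Rate over an interval, from two (opkts, collisions) counter snapshots."""
    return collision_rate(after[0] - before[0], after[1] - before[1])


# Hypothetical snapshots for illustration only.
before = (1_000_000, 270_000)
after = (1_200_000, 280_000)
print(f"{delta_rate(before, after):.1f}% collisions over the interval")
```

Working from the difference between two snapshots matters because the
counters are cumulative: a rate computed from the raw totals would still be
dominated by traffic from before the switch was installed.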

Yay!

I want to thank all of you (too many to list here) for your advice and
help in this process. I was able to justify asking our Network Services
for help and eventually for the new switch by virtue of the diagnosis
and tuning I had done with your assistance.

Cyndi
-- 
-Cyndi Smith			     Programmer Analyst III, Biomathematics
-cyn_at_odin.mdacc.tmc.edu		M.D. Anderson Cancer Center, Houston, Texas
-phone: (713) 794-4938					fax: (713) 792-4262
			<http://odin.mdacc.tmc.edu/~cyn>
Received on Fri Apr 14 2000 - 17:52:10 NZST
