I am so sorry this took so long, but I finally got the problems
resolved -- pretty much.
To refresh your memory:
Date: Wed, 8 Mar 2000 16:08:37 -0600 (CST)
Subject: Crash!
Our 4100 5/400 crashed this afternoon.  All day (up to the crash), it was 
intermittently slow -- to the point where it might take 30 seconds for it 
to echo your keystrokes.  
Since all access to this machine is generally remote, I ran netstat -id 
and all looked normal. I also looked at the load and it was, if anything, low.
So, I chalked it up to our intermittent network glitches and didn't worry 
too much.
Then came the crash....
From what I can tell from the crash logs (and I am far from an expert 
on these things), the crash was caused by
      _panic_string:  0xfffffc00005b5e50 = "simple_lock: time limit exceeded"
The current PID is shown as:
      l3 address 0xffffffffffffffd8 not mapped, pte 0x0
which is frequently repeated in other places in the log as well...
Tru64 4.0F, PatchKit 2 + kern_mod patch 
setld -i shows:
 OSFPAT00007600440 installed Patch: Fix for simple lock panic (Kernel Patches)
Date: Thu, 9 Mar 2000 10:07:45 -0600 (CST)
Subject: Interim Summary: Crash!
I still have not tracked down the problem. So far, the system seems to be
functioning OK since the reboot, but I am keeping an eye on it.
Thanks to
      John J. Francini, Sean O'Connell, Rodrigo Poblete, Robert Carsey
for sending their advice.  The overall impression seems to be that a 
"simple_lock: time limit" panic was usually due to an operating system
bug -- unfortunately our service level doesn't allow me to call Compaq
myself and open a ticket -- I have to go through our main campus, which
is in Austin (200 miles away).  From experience, this is usually not 
worth it...
Date: Thu, 9 Mar 2000 14:54:48 -0600 (CST)
Subject: It is happening again...
I apologize for asking again, but I don't know where else to turn and I
appreciate all of your advice after yesterday's crash (special thanks go 
to Dr. Tom Blinn and Whitney Latta for their exhaustive explanations today).
Even with the wonderful help, I have been unable to diagnose the
cause of the crash, but the slow-downs are happening again.
Intermittently, the system seems to buffer all keystrokes. I usually have 
several windows open, today one as root and several as a normal user.  When 
the slow-down occurs, it is in ALL windows -- even root's. The reason I say
the system is buffering the keystrokes is that I keep typing even though the
keystrokes aren't showing up in the window, & after a bit, they all show up.
dia reports nothing, no new messages, uerf reports nothing ...
# vmstat 5 5
Virtual Memory Statistics: (pagesize = 8192)
 procs    memory         pages                          intr        cpu      
 r  w  u  act  free wire fault cow zero react pin pout  in  sy  cs  us  sy  id
 6 415 32  312K 153K  47K  16M   1M   3M  10K  10M  434 130  4K 771   8   3  90
 8 408 32  312K 153K  48K  14K 1341 2959    0 9506    0 531  9K  2K  34  11  56
 7 413 32  312K 153K  48K 5663  420  938    0 4088    0 376  8K  2K  28   4  68
 6 417 32  312K 152K  48K 3987  292  630    0 2887    0 227  8K  2K  26   3  71
 7 420 32  313K 152K  48K 3926  187  705    0 2951    0 452  9K  2K  31   8  61
# iostat 5 5
      tty     re0      re1      re2      re3     cpu
 tin tout bps tps  bps tps  bps tps  bps tps  us ni sy id
   3  327 869  16   76   3  115   3  156   7   5  3  3 90
   5  255 222   5    0   0    2   0    0   0  26  0  1 73
   5  273  80   2   40   4    0   0   11   0  29  0  6 65
   4  324  43   1    0   0    0   0    8   1  28  0  4 68
   7 1170 829  23    0   0   45   4   75   6  30  0  8 63
# top
Load averages:  1.29,  1.19,  1.17                               14:50:58
357 processes: 2 running, 1 waiting, 68 sleeping, 284 idle, 2 zombie
CPU states: 29.0% user,  0.0% nice,  7.2% system, 63.7% idle
Memory: Real: 2188M/4016M act/tot  Virtual: 7M/8678M use/tot  Free: 1240M
Date: Thu, 16 Mar 2000 14:44:06 -0600 (CST)
Subject: UPDATE: Crash/It is happening again
After applying the latest patch kit for 4.0F onto our 4100 5/400 (PK3, 
uploaded to the dec patches site on Friday), I hoped that our problems were
over. To remind you, we were having intermittent "buffering" of keystrokes 
-- it seems to occur in all windows, with all logins, at the same time(s).  
After a few seconds, everything a user had typed in a given window showed 
up correctly in that window... 
Despite my best efforts and the wonderful help (I learned a lot!) of this 
group, I was unable to figure out exactly what happened, but the behavior 
improved (fewer instances of the "buffering") and then the new patch kit 
came out -- with several patches mentioning simple_lock...
With High Hopes, I installed the kit on Saturday.  Unfortunately, the
behavior is still there -- not very often, but still.
-------
OK, on to the solution(?).
Since we never crashed again, despite the slowdowns worsening and all
the kernel tuning that I did trying to diagnose or fix the problem, I
concluded that the crash was simply coincidental to the slowdowns.
After all that, it turns out that the primary fault did not lie
with the AlphaServer -- it was just the place that showed the most
obvious symptoms.  Our subnet had simply outgrown our old concentrator
or the concentrator was intermittently failing.  Anyway, we replaced
it with a new switch and replaced some old cat 3 wiring with cat 5 and
voilà!  Things are much better!  We still have 1 or 2 second glitches,
but we have always had that.  Netstat has reported no new errors since
the switch install (error rate had been 1-10 per minute) and the collision
rate has dropped from 25-30% to 1-15%.  User perception is that things
are better than they were before the problems started!
Yay!
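For anyone who wants to track the same two numbers, here is a minimal
sketch of the arithmetic. It assumes the usual convention that the
collision rate is collisions divided by output packets over the same
interval, with the counters read by hand from two netstat -i samples.
The function names and the sample figures are hypothetical and are not
taken from our logs.

```python
def collision_rate(opkts_delta, coll_delta):
    """Collisions as a percentage of packets sent in the same interval.

    Assumes both arguments are differences between two counter samples
    (output packets and collisions) for a single interface.
    """
    if opkts_delta <= 0:
        return 0.0  # nothing was sent, so no meaningful rate
    return 100.0 * coll_delta / opkts_delta


def errors_per_minute(errs_before, errs_after, seconds):
    """Average interface errors per minute between two samples."""
    return (errs_after - errs_before) * 60.0 / seconds


# Hypothetical sample: 1000 packets sent and 250 collisions in the interval.
print(collision_rate(1000, 250))
# Hypothetical sample: error counter went from 10 to 15 over 60 seconds.
print(errors_per_minute(10, 15, 60))
```

Taking differences between two samples matters because the counters
are cumulative since boot, so a single reading averages over the whole
uptime and hides a recent change.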
I want to thank all of you (too many to list here) for your advice and
help in this process.  I was able to justify asking our Network Services
for help and eventually for the new switch by virtue of the diagnosis
and tuning I had done with your assistance.
Cyndi
-- 
-Cyndi Smith			     Programmer Analyst III, Biomathematics
-cyn_at_odin.mdacc.tmc.edu		M.D. Anderson Cancer Center, Houston, Texas
-phone: (713) 794-4938					fax: (713) 792-4262
			<http://odin.mdacc.tmc.edu/~cyn>
Received on Fri Apr 14 2000 - 17:52:10 NZST