While I haven't been able to pin anything down yet, I did get some
comments from a few folks that start to shed some light. Thanks to:
alan_at_nabeth.cxo.dec.com
Craig.T.Biggerstaff_at_USAHQ.UnitedSpaceAlliance.com
robert_at_digi-data.com
for taking the time to respond. The general synopsis was that we have an
application (CICS) that is starting up way too many threads at one time,
which ultimately have nothing to do. This response from Craig Biggerstaff
was particularly interesting, if only for its thorough analysis given so
little data:
>Hmmm.  You have a system that is grinding away until 300-odd new processes
>are created, causing CPU usage to go up, page faults to multiply, and page
>copying from the parent process (cow) to momentarily increase, until the
>system pages out enough other stuff to handle the load, then the new
>processes mysteriously disappear, but CPU usage stays high and is mostly
>"system"-related.
>
>I'd say you have a multi-threaded application that is forking many many more
>threads than is necessary to handle the workload.  The majority of threads
>have nothing to do, and terminate themselves quickly, but not before causing
>massive paging activity.  The remaining threads stay around to do the
>required work, but make intensive use of system calls to synchronize
>themselves and don't really do that much independent processing that would
>consume "user" CPU time.
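
The mechanism Craig describes is easy to demonstrate. When a process that
has dirtied a large writable region forks, the child shares those pages
copy-on-write; the child's first write to each shared page takes a cow
fault and a page copy. The sketch below is mine, not from the thread, and
REGION and CHILDREN are illustrative values only; it shows how a burst of
short-lived children can generate heavy fault/cow activity while doing no
useful work:

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define REGION   (64 * 1024 * 1024)  /* 64 MB of writable parent memory */
    #define CHILDREN 300                 /* roughly the process count observed */

    int main(void)
    {
        char *buf = malloc(REGION);
        int i;

        if (buf == NULL)
            return 1;
        memset(buf, 1, REGION);          /* parent dirties every page */

        for (i = 0; i < CHILDREN; i++) {
            pid_t pid = fork();          /* child shares pages copy-on-write */
            if (pid == 0) {
                /* One write per shared page forces a cow fault and a page
                   copy, even though the child does no useful work at all. */
                memset(buf, 2, REGION);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;                            /* reap the short-lived children */
        return 0;
    }
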
My original query:
> We have a 4100 with 512MB running Digital's CICS as the front end of our
> application. Every once in a while the system starts to report what I
> find to be incredibly high page faults. Here is an excerpt from my
> always-running vmstat at 15-minute intervals:
> 
>   procs       memory               pages                                   intr           cpu
>    r    w  u   act  free  wire   fault  cow  zero  react   pin  pout   in   sy  cs   us  sy  id
>    2  632 28   33K   13K   15K    125K  26K   39K      2   40K     0  354  826  1K    4   5  91
>    3  766 28   37K   10K   15K    216K  59K   43K    807  106K     0  330   1K  1K   10   6  83
>    3  956 28   39K  7680   15K    202K  59K   41K    757  106K     0  407   1K  1K   10   7  83
>    4 1284 28   43K  2918   16K     47M  70K   42K      5  124K     0  434   1K  1K   14  30  57
>    3 1102 28   39K  7076   16K    147M  94K   49K   8331  177K  2701  469   1K  1K   21  79   0
>    3  708 28   37K   12K   12K    171M  46K   44K    460   79K     0  444   1K  1K   11  89   0
>    3  925 28   37K  9190   16K    168M  49K   42K   1355   86K     0  461   1K  1K   11  89   0
>    3  700 28   35K   12K   14K    168M  44K   42K    905   75K     0  462   1K  1K   11  89   0
>  
> Notice the numbers in the fault column, 47M, 147M, 171M, 168M...
> 
> At 15-minute intervals this works out to about 100,000 faults/sec! And
> sure enough, I can measure that in real time also. I've seen upwards of
> 500,000 faults/second. Oddly enough, when this is happening there is no
> other noticeable impact on performance, CPU or disk.
> 
> Can anyone shed any light on what is going on here?
> 
> Thanks
> John Hergert
> 
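
For scale (my arithmetic, not part of the original query): each vmstat
sample covers 15 minutes, i.e. 900 seconds, so the fault counts above
translate to per-second rates of roughly

     47M / 900 s  ~=  52,000 faults/sec
    147M / 900 s  ~= 163,000 faults/sec
    171M / 900 s  ~= 190,000 faults/sec

which is in the same range as the real-time measurements mentioned above.
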
Received on Tue Aug 18 1998 - 23:41:52 NZST