While I haven't been able to pin anything down yet, I did get some
comments from a few folks that start to shed some light. Thanks to:
alan_at_nabeth.cxo.dec.com
Craig.T.Biggerstaff_at_USAHQ.UnitedSpaceAlliance.com
robert_at_digi-data.com
for taking the time to respond. The consensus was that we have an
application (CICS) that is starting far too many threads at one time,
most of which ultimately have nothing to do. This response by Craig
Biggerstaff was particularly interesting, if only for its thorough
analysis given so little data:
>Hmmm. You have a system that is grinding away until 300-odd new processes
>are created, causing CPU usage to go up, page faults to multiply, and page
>copying from the parent process (cow) to momentarily increase, until the
>system pages out enough other stuff to handle the load, then the new
>processes mysteriously disappear, but CPU usage stays high and is mostly
>"system"-related.
>
>I'd say you have a multi-threaded application that is forking many many more
>threads than is necessary to handle the workload. The majority of threads
>have nothing to do, and terminate themselves quickly, but not before causing
>massive paging activity. The remaining threads stay around to do the
>required work, but make intensive use of system calls to synchronize
>themselves and don't really do that much independent processing that would
>consume "user" CPU time.
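One rough way to check Craig's "mostly system" observation against the
numbers is to look at the us/sy/id columns of the vmstat excerpt in my
original query (quoted below). The sketch here is only an illustration:
the helper names and both thresholds are hypothetical choices of mine,
not anything vmstat or CICS provides.

```python
# (us, sy, id) percentages copied from the vmstat excerpt in the query.
samples = [
    (4, 5, 91),
    (10, 6, 83),
    (10, 7, 83),
    (14, 30, 57),
    (21, 79, 0),
    (11, 89, 0),
    (11, 89, 0),
    (11, 89, 0),
]


def system_share(us, sy):
    """Fraction of non-idle CPU time that was spent in system mode."""
    busy = us + sy
    return sy / busy if busy else 0.0


def is_system_bound(us, sy, busy_threshold=50, share_threshold=0.75):
    """True when the CPU is mostly busy and mostly in system mode.

    Both thresholds are arbitrary illustrative values (assumptions).
    """
    return (us + sy) >= busy_threshold and system_share(us, sy) >= share_threshold


# Indexes of the 15-minute samples that look system-bound.
flagged = [i for i, (us, sy, _idle) in enumerate(samples)
           if is_system_bound(us, sy)]

for i in flagged:
    us, sy, _idle = samples[i]
    print(f"sample {i}: us={us}% sy={sy}% system share={system_share(us, sy):.0%}")
```

With these thresholds the last four samples are flagged, which are the
same intervals where the idle figure drops to 0.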
My original query:
> We have a 4100 with 512MB running Digital's CICS as the front end of our
> application. Every once in a while the system starts to report what I
> find to be incredibly high page faults. Here is an excerpt from my
> always-running vmstat at 15-minute intervals:
>
>   procs       memory                  pages                  intr       cpu
>   r w   u  act  free wire fault  cow zero react  pin pout  in  sy  cs us sy id
>  2632  28  33K   13K  15K  125K  26K  39K     2  40K    0 354 826  1K  4  5 91
>  3766  28  37K   10K  15K  216K  59K  43K   807 106K    0 330  1K  1K 10  6 83
>  3956  28  39K  7680  15K  202K  59K  41K   757 106K    0 407  1K  1K 10  7 83
> 41284  28  43K  2918  16K   47M  70K  42K     5 124K    0 434  1K  1K 14 30 57
> 31102  28  39K  7076  16K  147M  94K  49K  8331 177K 2701 469  1K  1K 21 79  0
>  3708  28  37K   12K  12K  171M  46K  44K   460  79K    0 444  1K  1K 11 89  0
>  3925  28  37K  9190  16K  168M  49K  42K  1355  86K    0 461  1K  1K 11 89  0
>  3700  28  35K   12K  14K  168M  44K  42K   905  75K    0 462  1K  1K 11 89  0
>
> Notice the numbers in the fault column, 47M, 147M, 171M, 168M...
>
> At 15-minute intervals this works out to about 100,000 faults/sec! And
> sure enough, I can measure that in real time also. I've seen upwards of
> 500,000 faults/second. Oddly enough, when this is happening, there is no
> other noticeable impact on performance, CPU or disk.
>
> Can anyone shed any light on what is going on here?
>
> Thanks
> John Hergert
>
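For anyone wanting to redo the faults-per-second arithmetic from the
query above, here is a minimal sketch. It assumes the fault column is a
count for the whole 15-minute interval (as the query does) and that the
K and M suffixes are decimal (1,000 and 1,000,000); both are assumptions
on my part, and the function names are made up for the example.

```python
# Assumed decimal meaning of vmstat's K/M suffixes (could be binary instead).
SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_count(text):
    """Turn a vmstat-style value such as '125K', '47M' or '7680' into an int."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in SUFFIXES:
        return int(float(text[:-1]) * SUFFIXES[suffix])
    return int(text)


def per_second(text, interval_seconds=15 * 60):
    """Average rate over one sampling interval (15 minutes by default)."""
    return parse_count(text) / interval_seconds


# Fault-column values taken from the excerpt: one normal, four high.
for sample in ("125K", "47M", "147M", "171M", "168M"):
    print(f"{sample:>5} per interval = {per_second(sample):,.0f} faults/sec")
```

Under those assumptions 47M comes to roughly 52,000 faults/sec and 171M
to 190,000 faults/sec, so the high intervals average somewhere between
50,000 and 190,000 faults/sec, against fewer than 250 faults/sec in the
normal intervals.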
Received on Tue Aug 18 1998 - 23:41:52 NZST