I'm waiting on my user to re-run their job before
we'll know for sure, but it looks like it has to do
with a per-process memory limit, as suggested by Rob
and explained in some detail by Alan Nabeth.
The default per-proc-data-size is only 128MB, and we
most likely hit this. Using "limit -h" in the csh,
I raised it to the hard limit, which is a much more
reasonable (for us) value of 1 GB. It appears that
the process dying just as memory filled up was a
coincidence: the process was indeed allocating memory
on the fly, and it evidently hit the limit.
Here's Alan's message - thanks to all who replied!
-------------------------------------------------------
There are per-process limits on virtual memory use that
have nothing to do with available memory or page/swap use.
It could be that the program reached one of those. That
it happened when you would have expected the system to
start paging was probably coincidence.
There are three limits that apply: maximum virtual memory
use, data space use and stack space use. Each has two
values of interest, the maximum and the current limit.
Without changing the limits, the data size max. is 1 GB,
with the default limit being 128 MB. The stack size limits
are 32 MB max and 2 MB default. Most shells have a built-in
command that allows raising the current limit up to the maximum.
Look for "limit" and "ulimit" in your shell's manual page.
The following sysconfigtab parameters control these
limits:
per-proc-stack-size
max-per-proc-stack-size
per-proc-data-size
max-per-proc-data-size
max-per-proc-address-space
per-proc-address-space
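Raising the defaults system-wide means setting those attributes in
/etc/sysconfigtab; a sketch of what the proc stanza might look like
(the 1 GB byte values here are illustrative only, chosen to match
the hard limit mentioned above):

```
proc:
        per-proc-data-size = 1073741824
        max-per-proc-data-size = 1073741824
```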
These are part of the "proc" subsystem. There's also a
"vm" subsystem wide limit that controls the maximum virtual
address space available: vm-maxvas.
From the description it sounds like the process hit the
default 128 MB data size limit. Since the process was
already running it was probably allocating memory dynamically
and would have gotten a NULL pointer back from malloc or
sbrk. Why it died depends on how it would have handled
the error.
The quick and dirty check would be to use the shell to
raise the data-size limit and try again. If it makes more
progress, then you may want to look at raising the default
value for everybody.
Received on Fri Sep 08 2000 - 20:37:20 NZST