The original question was how to suspend a process that uses a large amount of
memory while the system is heavily used and make it resume its work when the
load goes down. The most common suggestion was to use
kill -STOP <pid>
to suspend the job, so that the system gradually reclaims its memory by paging
it out, and later issue a
kill -CONT <pid>
to bring it back. I tried it and it seemed to work, even though I had been
thinking of moving the whole job out (checkpointing it) so that, in the event
of a crash, it could be recovered. The programs I had in mind typically require
over a week of CPU time! But this solution is better than nothing.
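
A minimal sketch of how the two kill commands might be automated, so that the
job is suspended when the load is high and resumed when it drops. None of the
respondents posted this; the thresholds HIGH and LOW, the function names
current_load, decide and watch_job, and the parsing of uptime output are all
assumptions that may need adjusting for a given system.

```shell
#!/bin/sh
# Hypothetical helper: suspend a job while the 1-minute load average is high
# and resume it once the load has dropped. All names and values are assumed.

HIGH=4   # suspend when the load is at or above this value (assumed)
LOW=2    # resume when the load is at or below this value (assumed)

# Print the 1-minute load average, parsed from uptime output.
# Assumes the output contains "load average: a, b, c" (or "load averages:").
current_load() {
    LC_ALL=C uptime | sed 's/.*load average[s]*: *//' |
        awk -F'[, ]+' '{ print $1 }'
}

# Print "stop", "cont" or "none" for a given load and job state.
# usage: decide LOAD STATE   (STATE is "running" or "stopped")
decide() {
    awk -v load="$1" -v state="$2" -v high="$HIGH" -v low="$LOW" 'BEGIN {
        if (state == "running" && load + 0 >= high + 0) print "stop"
        else if (state == "stopped" && load + 0 <= low + 0) print "cont"
        else print "none"
    }'
}

# Poll the load and signal the job until the job no longer exists.
# usage: watch_job PID [INTERVAL_SECONDS]
watch_job() {
    pid=$1
    interval=${2:-60}
    state=running
    while kill -0 "$pid" 2>/dev/null; do
        case $(decide "$(current_load)" "$state") in
            stop) kill -STOP "$pid" && state=stopped ;;
            cont) kill -CONT "$pid" && state=running ;;
        esac
        sleep "$interval"
    done
}
```

After sourcing the file, something like "watch_job <pid> 120" would check the
load every two minutes. The gap between HIGH and LOW is there to keep the job
from being stopped and continued repeatedly when the load hovers around a
single threshold. Note that this does nothing for crash recovery: a stopped
job is still lost if the machine goes down.
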
Thanks to:
Ezra Peisach <epeisach_at_MIT.EDU>
John Stoffel <john_at_WPI.EDU>
Mike Iglesias <iglesias_at_draco.acs.uci.edu>
Doug Johnson <drjohn_at_pizero.Colorado.EDU>
Alan Rollow <alan_at_nabeth.cxo.dec.com>
Gyula <szgyula_at_skysrv.Pha.Jhu.EDU>
Phil Farrell <farrell_at_pangea.Stanford.EDU>
for pointing out this solution so quickly!
There was also a suggestion to use Condor (see
http://www.cs.wisc.edu/condor/)
given by Ernie Rael <ernie_at_MasPar.COM>, which does offer checkpointing, as I
mentioned above, if the program is linked with a certain library. However, I
don't think that a system administrator should count on users behaving well
enough to checkpoint their jobs themselves or to make their programs
checkpointable! But it is an interesting solution that I will check in more
detail.
Received on Fri Mar 22 1996 - 01:21:22 NZST