SUMMARY: software to "checkpoint" a process

From: <pgouffon_at_charme.if.usp.br>
Date: Thu, 21 Mar 96 20:37:47 -0300

The original question was how to suspend a process that uses a large amount of
memory while the system is heavily used and make it resume its work when the
load goes down. The most common suggestion was to use

        kill -STOP <pid>
        
to suspend the job, so that the system gradually reclaims its memory by
paging it out, and later issue a

        kill -CONT <pid>

to bring it back. I tried it and it seemed to work, even though what I really
had in mind was checkpointing: moving the whole thing out so that, in the
event of a crash, the job could be recovered. The programs I had in mind
typically require over a week of CPU time! Still, this solution is better
than nothing.
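
The stop/continue approach could probably be automated with a small Bourne
shell wrapper. What follows is only a sketch under stated assumptions: the
function names (load1, decide, watch_job), the HIGH and LOW thresholds, and
the polling interval are all made up for illustration, and the parsing assumes
that uptime prints the usual "load average: a, b, c" text. It has to run as
the owner of the job or as root.

```shell
#!/bin/sh
# Sketch: suspend a job while the load is high, resume it when the load drops.
# All names and default values below are illustrative assumptions.

HIGH=${HIGH:-4}          # suspend when the 1-minute load reaches this value
LOW=${LOW:-2}            # resume when the load has fallen back to this value
INTERVAL=${INTERVAL:-60} # seconds between checks

# Print the integer part of the 1-minute load average, parsed from uptime.
load1() {
    uptime | sed -e 's/.*load average[s]*: *//' -e 's/[,. ].*//'
}

# decide LOAD STATE: print "stop", "cont" or "none".
# STATE is "running" or "stopped". The gap between HIGH and LOW keeps the
# job from being toggled on every small change in load.
decide() {
    if [ "$2" = running ] && [ "$1" -ge "$HIGH" ]; then
        echo stop
    elif [ "$2" = stopped ] && [ "$1" -le "$LOW" ]; then
        echo cont
    else
        echo none
    fi
}

# watch_job PID: poll until the process no longer exists.
watch_job() {
    pid=$1
    state=running
    while kill -0 "$pid" 2>/dev/null; do
        case $(decide "$(load1)" "$state") in
            stop) kill -STOP "$pid" && state=stopped ;;
            cont) kill -CONT "$pid" && state=running ;;
        esac
        sleep "$INTERVAL"
    done
}
```

After sourcing the file, something like "watch_job <pid>" would start the
polling loop. Note that stopping the job lowers the load by itself, so LOW
should be chosen with that in mind, and that a stopped job is still lost if
the machine crashes.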

Thanks to:

Ezra Peisach <epeisach_at_MIT.EDU>
John Stoffel <john_at_WPI.EDU>
Mike Iglesias <iglesias_at_draco.acs.uci.edu>
Doug Johnson <drjohn_at_pizero.Colorado.EDU>
Alan Rollow <alan_at_nabeth.cxo.dec.com>
Gyula <szgyula_at_skysrv.Pha.Jhu.EDU>
Phil Farrell <farrell_at_pangea.Stanford.EDU>

for pointing out this solution so rapidly!

There was also a suggestion from Ernie Rael <ernie_at_MasPar.COM> to use
Condor (see http://www.cs.wisc.edu/condor/), which does provide checkpointing
as I mentioned above, if the program is linked with a certain library.
However, I don't think that a system administrator should count on users
behaving well enough to checkpoint their own jobs or to make their programs
checkpointable! But it is an interesting solution that I will check in more
detail.
Received on Fri Mar 22 1996 - 01:21:22 NZST
