Large Memory Jobs Not releasing RAM from Russell G Auld on 2000-02-02 (tru64-unix-managers)

From: Russell G Auld <rauld_at_grove.ufl.edu>
Date: Tue, 01 Feb 2000 12:00:52 -0500 (EST)

We have a couple of Tru64 5.0 boxes that are being hammered
by a user. He is running a large Fortran job.
Here's the output of 'top':

load averages: 1.37, 1.10, 1.06 11:32:07
67 processes: 2 running, 15 sleeping, 50 idle
CPU states: % user, % nice, % system, % idle
Memory: Real: 321M/493M act/tot Virtual: 1039M use/tot Free: 99M

  PID USERNAME PRI NICE SIZE RES STATE TIME CPU COMMAND
24391 prakit 55 0 555M 32M run 17.2H 97.50% <z-koop21.exe>
24942 rauld 44 0 4096K 1015K run 0:00 0.00% <top>

This is from a machine with 500M RAM and 1024M swap.
The memory seems to be ok on this machine.

There are two other machines that have 1024M RAM and 500M swap.
On these machines, the job that he runs seems to not release the
physical memory.
After running the job once, he gets 'swap space below 10%' errors
and cannot run the job again.
After a reboot of the machine, he is able to run the code again.
Once.

He is not doing any dynamic memory allocation in his code.
He's compiling with 'f90'.

On one of the machines with 1024M RAM, top looks like this:

load averages: 0.00, 0.00, 0.00 11:52:45
59 processes: 1 running, 15 sleeping, 43 idle
CPU states: 0.4% user, 0.0% nice, 5.3% system, 94.1% idle
Memory: Real: 607M/992M act/tot Virtual: 596M use/tot Free: 302M

  PID USERNAME PRI NICE SIZE RES STATE TIME CPU COMMAND
7809 rauld 44 0 5648K 3186K run 0:00 3.10% <top>
  636 root 44 0 14M 9977K sleep 4:57 0.30% <Xdec>

(This is top 3.5beta8)

vmstat results in:
Virtual Memory Statistics: (pagesize = 8192)
  procs       memory          pages                              intr         cpu
  r   w  u    act  free wire  fault  cow zero react  pin pout    in  sy  cs   us sy id
  2 117 26   156K   38K 7071   866K 182K 278K  2387 200K    0    12  84 265    0  0 99
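
The vmstat counts above are in pages, so they need to be multiplied by the 8192-byte page size before they can be compared with the figures from top. A minimal sketch of that conversion follows. It assumes that a trailing 'K' in the vmstat output means a multiple of 1024 (it may mean 1000 on your system), and the helper names are made up for illustration. Under that assumption the 'free' count of 38K pages comes to about 304M, in line with the 302M free that top reports.

```python
PAGE_SIZE = 8192  # bytes, taken from the vmstat header line


def parse_count(text, k=1024):
    """Turn a vmstat count such as '38K' or '7071' into a number of pages.

    Assumption: a trailing 'K' means a multiple of 1024. Pass k=1000 if
    your vmstat scales by thousands instead.
    """
    text = text.strip()
    if text.upper().endswith("K"):
        return int(text[:-1]) * k
    return int(text)


def pages_to_mib(text, page_size=PAGE_SIZE, k=1024):
    """Convert a vmstat page count to mebibytes."""
    return parse_count(text, k) * page_size / (1024 * 1024)


if __name__ == "__main__":
    # Values copied from the 'free' and 'wire' columns of the vmstat sample.
    for name, value in [("free", "38K"), ("wire", "7071")]:
        print(f"{name}: {pages_to_mib(value):.0f} MiB")
```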

There are not enough processes in the process list to account
for 607M of RAM.
I'm thinking that for some reason the kernel isn't freeing up
some of the RAM after the job exits.
The job he runs is exiting of its own accord and not because of an error.
I've told him to check his compiler flags and to make sure he is
explicitly closing all open file handles.
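
The comparison above, per-process memory against the 607M that top reports as active, can be checked by summing the RES column. The sketch below is only a rough illustration: the function names are hypothetical, the sample holds just the two process lines quoted in this message (the full list has 59 processes), and adding up RES values counts shared pages more than once.

```python
def size_to_kib(text):
    """Convert a top size field such as '3186K' or '14M' to KiB."""
    units = {"K": 1, "M": 1024, "G": 1024 * 1024}
    suffix = text[-1].upper()
    if suffix in units:
        return float(text[:-1]) * units[suffix]
    # Assumption: a field with no suffix is a byte count.
    return float(text) / 1024


def total_res_kib(top_lines):
    """Sum the RES column (sixth field) over top process lines."""
    total = 0.0
    for line in top_lines:
        fields = line.split()
        if len(fields) < 6 or not fields[0].isdigit():
            continue  # skip the column header and blank lines
        total += size_to_kib(fields[5])
    return total


# The two process lines shown in the top output above.
sample = """\
  PID USERNAME PRI NICE SIZE RES STATE TIME CPU COMMAND
 7809 rauld 44 0 5648K 3186K run 0:00 3.10% <top>
  636 root 44 0 14M 9977K sleep 4:57 0.30% <Xdec>
""".splitlines()

if __name__ == "__main__":
    print(f"RES total: {total_res_kib(sample) / 1024:.1f} MiB")
```

Feeding the full process list through total_res_kib() would give the figure to hold against the active memory total.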

Not sure what else to do, but having to reboot the machines all the time
is clearly not the best option.
Aside from increasing the amount of swap space, what other solution is
there?
Is there something else to look into?

Thanks,

Russ

O===========================================O
  | R U S S E L L G A U L D |
  | +------------------+------------------+ |
  | Computational Fluid Dynamics Group |
  | Department of Mechanical Engineering |
  | P.O. Box 116300 |
  | University of Florida |
  | Gainesville, FL 32611 |
  | 352.392.4442 |
  | * * * |
O===========================================O
Received on Tue Feb 01 2000 - 17:01:57 NZDT
