[Q] mpi processes terminate incorrectly

From: <kbecker_at_npac.syr.edu>
Date: Wed, 29 Nov 95 10:33:35 -0500

Hi all,

I have an alphafarm consisting of eight Alpha 3000/400
boxes running Digital Unix 3.0. I'm trying to determine
if the problem I'm experiencing is due to an incorrect
operating system configuration or MPI. I believe MPI
to be at fault but would like to verify that fact.

Here's the problem:

MPI processes that terminate incorrectly, either because the user
presses <ctrl> C or because of a problem in the user's MPI program,
leave behind a hung parent process and a <defunct> child process.
All our user home directories are NFS mounted from the fileservers.
The result is that the hung process beats mercilessly on the
fileserver containing the process owner's home directory.
This is the beginning of a domino effect that has the potential to
hang our entire installation.
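
To make the <defunct> symptom concrete, here is a minimal sketch of how
such an entry arises in general. It is not MPI-specific and is not taken
from our machines: it only shows that a child that has exited stays
<defunct> (state "Z") until its parent waits for it, which a hung parent
never does. It assumes Python 3 and a Linux-style /proc file system; the
names proc_state and demo_defunct_child are made up for illustration.

```python
import os


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone.

    Assumption: a Linux-style /proc is mounted.
    """
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state letter is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def demo_defunct_child():
    """Fork a child that dies at once and report its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        # Child: terminate immediately, as if the user's program had crashed.
        os._exit(1)

    # Parent: block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    before = proc_state(pid)  # the child is now a zombie; ps shows it as <defunct>

    # Reaping the child releases its process-table entry.
    os.waitpid(pid, 0)
    after = proc_state(pid)
    return before, after


if __name__ == "__main__":
    print(demo_defunct_child())
```

A parent that is itself hung never reaches the waitpid() step, so the
child entry stays <defunct> until the parent is killed.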

I'm not that knowledgeable about MPI. We have noticed that we
have the same problem on the SP2.

The only way we can minimize the potential damage is to have our
users copy their program to /scratch on the alphas and run it locally.
Many users see this as an inconvenience and I need to make sure
that this is really a problem with MPI and not my OS setup.

I have heard that an MPI process that is hung can generate
a huge number of system calls per second (2500+ system calls
per second per process). Is this true? And, if the problem
is with MPI and lots of system calls, why does this bring the
process owner's file system to its knees?

Sorry this is so long, but I wanted to try to get all the
details in.

Thanks much,

Kathy Becker
NPAC Systems
kbecker_at_npac.syr.edu
Received on Wed Nov 29 1995 - 17:12:27 NZDT
