SUMMARY: MPI problem on MC

From: <K.McManus_at_greenwich.ac.uk>
Date: Wed, 07 Apr 1999 11:45:27 +0100 (BST)

---Original Posting---

> A curious MPI problem running on Memory Channel between two
> quad-processor 4100s.
>
> My 8-process job runs without problems until I get the machines
> all to myself, with no other user processes running. The MPI job
> starts up, all processes communicating, but then dies with...
>
> [ 1] MPID Die - ump2chck.c:91 "ump_wait failure" (-16)
> [ 2] MPID Die - ump2chck.c:91 "ump_wait failure" (-16)
> [ 3] MPID Die - ump2chck.c:91 "ump_wait failure" (-16)
>
> Let the users back onto the boxes and my job runs fine.
> This is not good for benchmarking.
>
> DU4.0C MPI170

---------------

Judging by the silence, there aren't many MPI'ers out there,
but Liam from the Galway team got straight onto the case...

> I suspect that the problem you are running into is a bug in the
> lower memory channel layers which is exposed when running intensive jobs
> over a large number of machines.
>
> You can mask this problem by using the -ump_error_mode none switch
> with dmpirun.

This works just fine, but it is rather like sweeping the dust
under the carpet: the failure is masked, not fixed.
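
For anyone wanting to try the same workaround, here is a rough sketch of
how the switch might be wrapped in a script so that a benchmark can be run
with and without it. Only the -ump_error_mode none switch comes from Liam's
reply. The -np option is assumed to behave as in other mpirun-style
launchers, and NPROCS, PROG, MASK_UMP_ERRORS and ./my_mpi_job are
hypothetical names used purely for illustration.

```shell
#!/bin/sh
# Sketch only. NPROCS, PROG and MASK_UMP_ERRORS are hypothetical names.
# "-ump_error_mode none" is the switch quoted above; "-np" is an assumption.
NPROCS=${NPROCS:-8}
PROG=${PROG:-./my_mpi_job}

# Print the dmpirun command line, with or without the masking switch.
build_cmd() {
    if [ "${MASK_UMP_ERRORS:-yes}" = "yes" ]; then
        echo "dmpirun -np $NPROCS -ump_error_mode none $PROG"
    else
        echo "dmpirun -np $NPROCS $PROG"
    fi
}

CMD=$(build_cmd)
echo "$CMD"

# Launch the job only on a machine where dmpirun is actually installed.
# $CMD is left unquoted on purpose so that it splits into separate words.
if command -v dmpirun >/dev/null 2>&1; then
    $CMD
fi
```

Setting MASK_UMP_ERRORS=no gives the plain invocation, which makes it easy
to check whether the ump_wait failure still shows up on an idle pair of
machines once the lower Memory Channel layers have been patched.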

Ciao

k.mcmanus_at_gre.ac.uk - http://www.gre.ac.uk/~k.mcmanus
-------------------------------------------------------------
Dr Kevin McManus ||
School of Computing & Math Science ||
The University of Greenwich ||
Wellington St. Woolwich ||Tel +44 (0)181 331 8719
London SE18 6PF UK ||Fax +44 (0)181 331 8665
Received on Wed Apr 07 1999 - 10:48:31 NZST
