---Original Posting---
> A curious MPI problem running on Memory Channel between two
> quad-processor 4100s.
>
> My 8 process job runs without problems until I get the machines
> all to myself, no other user processes running. The MPI job
> starts up, all processes communicating, but then dies with...
>
> [ 1] MPID Die - ump2chck.c:91 "ump_wait failure" (-16)
> [ 2] MPID Die - ump2chck.c:91 "ump_wait failure" (-16)
> [ 3] MPID Die - ump2chck.c:91 "ump_wait failure" (-16)
>
> Let the users back onto the boxes and my job runs fine.
> This is not good for benchmarking.
>
> DU4.0C MPI170
---------------
Judging by the silence, there aren't many MPI'ers out there,
but Liam from the Galway team got straight onto the case:
> I suspect that the problem you are running into is a bug in the
> lower memory channel layers which is exposed when running intensive jobs
> over a large number of machines.
>
> You can mask this problem by using the -ump_error_mode none switch
> with dmpirun.
That works just fine, but it is rather like sweeping the dust
under the carpet: the switch masks the failure rather than fixing it.
Ciao
k.mcmanus_at_gre.ac.uk - http://www.gre.ac.uk/~k.mcmanus
-------------------------------------------------------------
Dr Kevin McManus ||
School of Computing & Math Science ||
The University of Greenwich ||
Wellington St. Woolwich ||Tel +44 (0)181 331 8719
London SE18 6PF UK ||Fax +44 (0)181 331 8665
Received on Wed Apr 07 1999 - 10:48:31 NZST