---Original Posting---
> A curious MPI problem running on Memory Channel between two
> quad-processor 4100s.
>  
> My 8-process job runs without problems until I get the machines
> all to myself, with no other user processes running. The MPI job
> starts up, with all processes communicating, but then dies with...
>  
> [    1] MPID Die - ump2chck.c:91 "ump_wait failure" (-16)
> [    2] MPID Die - ump2chck.c:91 "ump_wait failure" (-16)
> [    3] MPID Die - ump2chck.c:91 "ump_wait failure" (-16)
>  
> Let the users back onto the boxes and my job runs fine.
> This is not good for benchmarking.
>  
> DU4.0C MPI170
---------------
Judging by the silence, there aren't many MPI'ers out there,
but Liam from the Galway team got straight onto the case...
>       I suspect that the problem you are running into is a bug in the
> lower memory channel layers which is exposed when running intensive jobs
> over a large number of machines.
>
>       You can mask this problem by using the -ump_error_mode none switch
> with dmpirun.
This works just fine, but it is rather like sweeping the dust
under the carpet.
Ciao
k.mcmanus_at_gre.ac.uk  -  http://www.gre.ac.uk/~k.mcmanus
-------------------------------------------------------------
Dr Kevin McManus                     ||
School of Computing & Math Science   ||
The University of Greenwich          ||
Wellington St.  Woolwich             ||Tel +44 (0)181 331 8719 
London SE18 6PF  UK                  ||Fax +44 (0)181 331 8665
Received on Wed Apr 07 1999 - 10:48:31 NZST