MPI: Why does my MPICH program fail for a large number of processes?

/* C Example */
#include <mpi.h>
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    int rank, size;
    int buffer_length = MPI_MAX_PROCESSOR_NAME;
    char hostname[buffer_length];

    MPI_Init(&argc, &argv);                            /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);              /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);              /* get number of processes */
    MPI_Get_processor_name(hostname, &buffer_length);  /* get hostname */

    printf("Hello world from process %d running on %s of %d\n", rank, hostname, size);

    MPI_Finalize();
    return 0;
}
The above program compiles and runs successfully on Ubuntu 12.04 for a small number of processes, but it fails when I try to execute it with thousands of processes. Why is that?
I was expecting the scheduler to keep the processes in a queue and dispatch them one by one (I am running this code on a single-core machine).
Why does the following error appear for a large number of processes, and how can I resolve it?
root@ubuntu:/home# mpiexec -n 1000 ./hello
[proxy:0:0@ubuntu] HYDU_create_process (./utils/launch/launch.c:26): pipe error (Too many open files)
[proxy:0:0@ubuntu] launch_procs (./pm/pmiserv/pmip_cb.c:751): create process returned error
[proxy:0:0@ubuntu] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:935): launch_procs returned error
[proxy:0:0@ubuntu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@ubuntu] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
Killed

You are running into the open-file limit on your system. The default on Ubuntu is 1024 open files per process. You can try raising the limit for your session with the ulimit command:
ulimit -n 2048
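To check whether the new value took effect, run ulimit -n with no argument. If the hard limit prevents the increase, a persistent change usually goes through /etc/security/limits.conf; the exact mechanism depends on your distribution, so treat the following as a sketch rather than the definitive fix:
ulimit -n          # show the current soft limit
ulimit -Hn         # show the hard limit
ulimit -n 2048     # raise the soft limit for this shell (must not exceed the hard limit)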

Related

OpenMPI "nooversubscribe" does not work when using MPI_Comm_spawn

I have a simple MPI_Comm_spawn-based code that spawns 4 processes:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]){
    MPI_Init(&argc, &argv);

    MPI_Comm intercomm;
    int final_nranks = 4, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Get_processor_name(name, &len);
    MPI_Comm_get_parent(&intercomm);

    if(intercomm == MPI_COMM_NULL){
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "hostfile", "hostfile");
        MPI_Info_set(info, "pernode", "true");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, final_nranks, info, 0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("PARENT %s\n", name);
    } else {
        printf("CHILD %s\n", name);
    }

    MPI_Finalize();
    return 0;
}
When I run the code with
$ mpirun -np 2 --hostfile hostfile --map-by node ./a.out
I would expect something like:
PARENT node00
PARENT node01
CHILD node00
CHILD node01
CHILD node02
CHILD node03
I want each process to run on a different node.
Since Open MPI does not oversubscribe slots by default, I would expect the child processes to be spawned on different nodes, but that is not the case. Instead, I get:
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[node00:50113] *** An error occurred in MPI_Init
[node00:50113] *** reported by process [3761700866,2]
[node00:50113] *** on a NULL communicator
[node00:50113] *** Unknown error
[node00:50113] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node00:50113] *** and potentially your MPI job)
[node00:200682] 3 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[node00:200682] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node00:200682] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
If I remove MPI_Info_set(info, "pernode", "true"); from the code, the run does not crash, but the child processes are spawned on the same host:
PARENT node00
PARENT node01
CHILD node00
CHILD node00
CHILD node00
CHILD node00
OpenMPI version 4.1.2

OpenMPI MPI_Comm_spawn() giving unreachable errors

I've got a master and worker system, where the master uses Open MPI to spawn and communicate with its workers. I've used versions 4.0.4 and 3.1.6; both give similar errors.
master.cxx
#include <mpi.h>

void doStuff() {
    // does stuff
}

int main(int argc, char *argv[]) {
    int NUM_JOBS = 10;

    MPI_Init(&argc, &argv);

    char *args[] = {"Arg1", "Arg2", NULL};

    MPI_Info mpi_info;
    MPI_Info_create(&mpi_info);
    MPI_Info_set(mpi_info, "add-hostfile", "nodefile");

    MPI_Comm child_comm;
    MPI_Comm_spawn("worker", args, NUM_JOBS, mpi_info, 0, MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);

    doStuff();

    MPI_Finalize();
    return 0;
}
worker.cxx
#include <mpi.h>

void doStuff() {
    // does stuff
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    doStuff();
    MPI_Finalize();
    return 0;
}
Both are compiled via CMake. I have 101 cores available, for 1 master and up to 100 workers, with 24 cores per node. For the first run, I'm just launching workers to prove that it can run. I had to add some InfiniBand parameters:
run1.sh
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx4_0:1"
mpirun -n 101 --mca btl openib,self --hostfile nodefile worker
Now, I try to run the workers using the master. Note that the code as written above spawns 10 workers, so the 11 total procs can fit on one 24-core node.
run2.sh
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx4_0:1"
mpirun -n 1 --mca btl openib,self --hostfile nodefile master
That works fine. So I raise NUM_JOBS to 100 in master.cxx and repeat the run using run2.sh.
This time I get an MPI error and the job fails:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[53646,2],88]) is on host: mymachine04
Process 2 ([[53636,1],0]) is on host: unnknown1
BTLs attempted: self openib
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[mymachine04:22480] [[53646,2],88] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
That last line is repeated 77 times in total. Presumably not coincidentally, 101 procs - 24 procs per node = 77. The hostfile is just 101 lines of hostnames, and I get the same error whether I supply it or not. Interestingly, if I remove the first line of the hostfile before spawning my workers, on the grounds that that slot is already used by the master, I get an error about running out of slots. Any idea what I'm doing wrong and how I can get all of my processes to reach each other?

Can I link two separate executables with MPI_Open_port and share port information in a text file?

I'm trying to create a shared MPI communicator between two executables which are both started independently, e.g.
mpiexec -n 1 ./exe1
mpiexec -n 1 ./exe2
I use MPI_Open_port to generate the port details, write them to a file in exe1, and then read them in exe2. This is followed by MPI_Comm_connect/MPI_Comm_accept and then send/recv communication (minimal example below).
My question is: can we write the port information to a file in this way, or is MPI_Publish_name/MPI_Lookup_name required for this to work, as in this, this and this? Since supercomputers usually share a file system, the file-based approach seems simpler and may avoid having to set up a server. (A sketch of the publish/lookup alternative is shown after the minimal code below.)
It seems this should work according to the MPI_Open_port documentation in the MPI 3.1 standard:
port_name is essentially a network address. It is unique within the communication universe to which it belongs (determined by the implementation), and may be used by any client within that communication universe. For instance, if it is an internet (host:port) address, it will be unique on the internet. If it is a low level switch address on an IBM SP, it will be unique to that SP
In addition, according to documentation on the MPI forum:
The following should be compatible with MPI: The server prints out an address to the terminal, the user gives this address to the client program.
MPI does not require a nameserver
A port_name is a system-supplied string that encodes a low-level network address at which a server can be contacted.
By itself, the port_name mechanism is completely portable ...
Writing the port information to a file does work as expected, i.e. it creates a shared communicator and exchanges information with MPICH (3.2), but it hangs at the MPI_Comm_connect/MPI_Comm_accept call with Open MPI versions 2.0.1 and 4.0.1 (on my local workstation running Ubuntu 12.04, though it eventually needs to work on a tier 1 supercomputer). I have raised this as an issue here but welcome a solution or workaround in the meantime.
Further Information
If I use the MPMD mode with OpenMPI,
mpiexec -n 1 ./exe1 : -n 1 ./exe2
this works correctly, so it must be an issue with allowing the jobs to share ompi_global_scope, as in this question. I've also tried adding
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "ompi_global_scope", "true");
with info passed to all commands, with no success. I'm not running a server/client model as both codes run simultaneously, so sharing a URL/PID from one is not ideal, and I cannot get this to work even using the suggested approach, which for OpenMPI 2.0.1 is,
mpirun -n 1 --report-pid + ./OpenMPI_2.0.1 0
1234
mpirun -n 1 --ompi-server pid:1234 ./OpenMPI_2.0.1 1
gives,
ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
pmix server init failed
--> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
and with OpenMPI 4.0.1,
mpirun -n 1 --report-pid + ./OpenMPI_4.0.1 0
1234
mpirun -n 1 --ompi-server pid:1234 ./OpenMPI_4.0.1 1
gives,
ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 50
...
A publish/lookup server was provided, but we were unable to connect
to it - please check the connection info and ensure the server
is alive:
Using 4.0.1 means the error should not be related to this bug in OpenMPI.
Minimal code
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <iostream>
#include <fstream>
using namespace std;
int main( int argc, char *argv[] )
{
int num_errors = 0;
int rank, size;
char port1[MPI_MAX_PORT_NAME];
char port2[MPI_MAX_PORT_NAME];
MPI_Status status;
MPI_Comm comm1, comm2;
int data = 0;
char *ptr;
int runno = strtol(argv[1], &ptr, 10);
for (int i = 0; i < argc; ++i)
printf("inputs %d %d %s \n", i,runno, argv[i]);
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (runno == 0)
{
printf("0: opening ports.\n");fflush(stdout);
MPI_Open_port(MPI_INFO_NULL, port1);
printf("opened port1: <%s>\n", port1);
//Write port file
ofstream myfile;
myfile.open("port");
if( !myfile )
cout << "Opening file failed" << endl;
myfile << port1 << endl;
if( !myfile )
cout << "Write failed" << endl;
myfile.close();
printf("Port %s written to file \n", port1); fflush(stdout);
printf("Attempt to accept port1.\n");fflush(stdout);
//Establish connection and send data
MPI_Comm_accept(port1, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &comm1);
printf("sending 5 \n");fflush(stdout);
data = 5;
MPI_Send(&data, 1, MPI_INT, 0, 0, comm1);
MPI_Close_port(port1);
}
else if (runno == 1)
{
//Read port file
size_t chars_read = 0;
ifstream myfile;
//Wait until file exists and is avaialble
myfile.open("port");
while(!myfile){
myfile.open("port");
cout << "Opening file failed" << myfile << endl;
usleep(30000);
}
while( myfile && chars_read < 255 ) {
myfile >> port1[ chars_read ];
if( myfile )
++chars_read;
if( port1[ chars_read - 1 ] == '\n' )
break;
}
printf("Reading port %s from file \n", port1); fflush(stdout);
remove( "port" );
//Establish connection and recieve data
MPI_Comm_connect(port1, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &comm1);
MPI_Recv(&data, 1, MPI_INT, 0, 0, comm1, &status);
printf("Received %d 1\n", data); fflush(stdout);
}
//Barrier on intercomm before disconnecting
MPI_Barrier(comm1);
MPI_Comm_disconnect(&comm1);
MPI_Finalize();
return 0;
}
The 0 and 1 arguments simply specify whether the code writes the port file or reads it in the example above. It is then run with:
mpiexec -n 1 ./a.out 0
mpiexec -n 1 ./a.out 1
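For comparison, here is a minimal sketch of the MPI_Publish_name/MPI_Lookup_name alternative asked about above. It is an illustration under stated assumptions, not tested code from the question: the service name "exe1_port" is made up, there is no argument checking or retry loop, and with Open MPI a name server (e.g. ompi-server) typically has to be running and passed to mpiexec for the lookup to succeed; with MPICH a Hydra name server plays the same role.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch only: argv[1] == "0" publishes a port under a hypothetical
   service name, argv[1] == "1" looks it up and connects. */
int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm comm;
    int data = 0;
    int runno = (int) strtol(argv[1], NULL, 10);

    MPI_Init(&argc, &argv);

    if (runno == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        /* Publish the port under a service name instead of writing it to a file. */
        MPI_Publish_name("exe1_port", MPI_INFO_NULL, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &comm);
        data = 5;
        MPI_Send(&data, 1, MPI_INT, 0, 0, comm);
        MPI_Unpublish_name("exe1_port", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        /* Resolve the port by service name; assumes the name has already
           been published and a name service is reachable. */
        MPI_Lookup_name("exe1_port", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &comm);
        MPI_Recv(&data, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
        printf("Received %d\n", data);
    }

    MPI_Comm_disconnect(&comm);
    MPI_Finalize();
    return 0;
}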

How can I run multiple threads inside of a given MPI process?

I understand that a single MPI job launches many processes, which could be run on multiple nodes.
How do I run multiple threads inside a given MPI process using MPI_THREAD_MULTIPLE?
I was unable to find enough information on this topic.
Assuming you're using OpenMP to run multiple threads:
You write the OpenMP code as you would without MPI (this statement is oversimplified).
When MPI comes in, you need to consider how your processes will communicate. MPI does not send messages to individual threads but to individual processes. For that reason, MPI provides four levels of thread support:
MPI_THREAD_SINGLE: Only one thread will execute.
MPI_THREAD_FUNNELED: The process may have many threads, but only the master thread makes MPI calls. The master thread is the one that calls MPI_Init_thread.
MPI_THREAD_SERIALIZED: The process may have many threads, but only one can make MPI calls at a time.
MPI_THREAD_MULTIPLE: The process may have many threads, and all of them can make MPI calls at any time.
You request the level you need when initializing MPI, using MPI_Init_thread instead of MPI_Init:
MPI_Init_thread(&argc, &argv, REQUIRED_LEVEL, &provided)
Ex:
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided)
MPI_Init_thread returns the level that was actually granted in the provided argument. Make sure you got a level your code can cope with.
Also, avoid MPI_Probe and MPI_Iprobe, because they are not thread-safe; use MPI_Mprobe and MPI_Improbe instead.
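To illustrate the matched-probe point, here is a minimal sketch (my own illustration, not part of the original answer): rank 1 uses MPI_Mprobe/MPI_Mrecv so that the probe and the receive are matched atomically, which is what makes this pattern usable when several threads receive on the same communicator. The binary name mprobe_demo is made up.
#include <mpi.h>
#include <stdio.h>

/* Sketch: matched probe/receive with MPI_THREAD_MULTIPLE requested.
   Run with two processes, e.g.: mpiexec -n 2 ./mprobe_demo */
int main(int argc, char *argv[])
{
    int provided, rank, value = 0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Message msg;
        MPI_Status status;
        /* Match the message first, then receive exactly that message;
           no other thread can "steal" it between the two calls. */
        MPI_Mprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &msg, &status);
        MPI_Mrecv(&value, 1, MPI_INT, &msg, &status);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}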
Here is a simple 'hello world' example as #ab2050 asked:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int provided;
    int rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided != MPI_THREAD_FUNNELED) {
        fprintf(stderr, "Warning MPI did not provide MPI_THREAD_FUNNELED\n");
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel default(none), \
        shared(rank), \
        shared(ompi_mpi_comm_world), \
        shared(ompi_mpi_int), \
        shared(ompi_mpi_char)
    {
        printf("Hello from thread %d at rank %d parallel region\n",
               omp_get_thread_num(), rank);

        #pragma omp master
        {
            char helloWorld[12];
            if (rank == 0) {
                strcpy(helloWorld, "Hello World");
                MPI_Send(helloWorld, 12, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                printf("Rank %d send: %s\n", rank, helloWorld);
            }
            else {
                MPI_Recv(helloWorld, 12, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("Rank %d received: %s\n", rank, helloWorld);
            }
        }
    }

    MPI_Finalize();
    return 0;
}
You have to run this code on two processes. Because MPI_THREAD_FUNNELED is requested, only the master thread makes MPI calls.
The following variables are listed in the OpenMP data-scoping clauses because gcc 6.1.1 requires them to be declared there; older versions such as 4.8 do not:
ompi_mpi_comm_world
ompi_mpi_char
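For example, assuming the compiled binary is named hello_hybrid (a made-up name), it could be launched with two MPI ranks and four OpenMP threads per rank like this:
OMP_NUM_THREADS=4 mpiexec -n 2 ./hello_hybrid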

mpiexec checkpointing error (RPi)

When I try to run an application (even a simple hello_world.c doesn't work), I receive this error every time:
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 10 -machinefile /tmp/machinefile -n 1 ./app_name
[proxy:0:0@masterpi] requesting checkpoint
[proxy:0:0@masterpi] checkpoint completed
[proxy:0:0@masterpi] requesting checkpoint
[proxy:0:0@masterpi] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0@masterpi] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:0@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@masterpi] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
I just want to take a checkpoint and nothing else (and restart later).
Thanks in advance
UPDATE:
I have tried with MPICH2, with no luck. Or maybe I'm doing something wrong somewhere...
pi@raspberrypi ~ $ mpiexec -n 1 -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 2 ./test3
Count to: 0
[proxy:0:0@raspberrypi] requesting checkpoint
[proxy:0:0@raspberrypi] checkpoint completed
Count to: 1
[proxy:0:0@raspberrypi] requesting checkpoint
[proxy:0:0@raspberrypi] HYDT_ckpoint_checkpoint (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0@raspberrypi] HYD_pmcd_pmip_control_cmd_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip_cb.c:902): checkpoint suspend failed
[proxy:0:0@raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
[mpiexec@raspberrypi] control_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
[mpiexec@raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@raspberrypi] HYD_pmci_wait_for_completion (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
Test3-Code:
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int rank;
    int size;
    int i = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Status status;

    if (rank == 0) {
        for (; i <= 100; i++) {
            int j = 0;
            while (j < 100000000) {
                j++;
            }
            printf("Count to: %i\n", i);
        }
    } else {
    }

    MPI_Finalize();
    return 0;
}
I just need one successful checkpoint and to show the restart.
If someone has a working example (it doesn't matter what it does; a simple working "Hello World" would make me happy!), I would be very glad.
Happy new year!
Unfortunately, the checkpoint/restart code in MPICH 3.0.4 is known to be buggy at the moment. That will hopefully get fixed in a future release. It looks like you're probably using it correctly. It's possible that if you go back to a previous version, you might have better luck.
Here the problem was that the checkpoint interval was too small.
Setting it to 20 s or more solved this problem (but not the other one :( ).
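For example, the earlier run could be repeated with a longer interval (same flags and binary as above; 20 seconds is simply the threshold reported to work here):
mpiexec -n 1 -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 20 ./test3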
