How can I run multiple threads inside of a given MPI process? - mpi

I understand that a single MPI job launches many processes, which may run on multiple nodes.
How do I run multiple threads inside a given MPI process using MPI_THREAD_MULTIPLE?
I was unable to find enough information on the topic.

Assuming you are using OpenMP to run multiple threads:
You write the OpenMP code as you would without MPI (this is an oversimplification).
Once MPI is involved, you need to consider how your processes will communicate. MPI does not send messages to individual threads, only to individual processes. For that reason MPI defines four levels of thread support.
MPI_THREAD_SINGLE: Only one thread will execute.
MPI_THREAD_FUNNELED: The process may have many threads, but only the master thread may make MPI calls. The master thread is the one that called MPI_Init_thread.
MPI_THREAD_SERIALIZED: The process may have many threads, but only one of them may make MPI calls at a time.
MPI_THREAD_MULTIPLE: The process may have many threads, and any of them may make MPI calls at any time.
You need to request the level you want at initialization, so MPI_Init becomes:
MPI_Init_thread(&argc, &argv, HERE_PUT_THE_MODE_YOU_NEED, PROVIDED_MODE)
For example:
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
MPI_Init_thread returns the level actually granted in the provided argument. Make sure you got a level that your code can cope with.
Also, avoid MPI_Probe and MPI_Iprobe in multithreaded code: between the probe and the matching receive, another thread may receive the probed message. Use MPI_Mprobe and MPI_Improbe instead.
Here is a simple 'hello world' example as @ab2050 asked:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include "mpi.h"
int main(int argc, char *argv[]) {
    int provided;
    int rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    /* Thread levels are ordered, so a higher level than requested is fine. */
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "Warning MPI did not provide MPI_THREAD_FUNNELED\n");
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* The ompi_mpi_* names are Open MPI symbols; see the note after the code. */
    #pragma omp parallel default(none), \
        shared(rank), \
        shared(ompi_mpi_comm_world), \
        shared(ompi_mpi_int), \
        shared(ompi_mpi_char)
    {
        printf("Hello from thread %d at rank %d parallel region\n",
               omp_get_thread_num(), rank);
        /* Only the master thread makes MPI calls (MPI_THREAD_FUNNELED). */
        #pragma omp master
        {
            char helloWorld[12];
            if (rank == 0) {
                strcpy(helloWorld, "Hello World");
                MPI_Send(helloWorld, 12, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                printf("Rank %d send: %s\n", rank, helloWorld);
            }
            else {
                MPI_Recv(helloWorld, 12, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("Rank %d received: %s\n", rank, helloWorld);
            }
        }
    }
    MPI_Finalize();
    return 0;
}
You have to run this code on two processes. Because MPI_THREAD_FUNNELED is selected, only the master thread makes MPI calls.
The following variables are listed in the OpenMP data-sharing clauses because gcc version 6.1.1 requires it under default(none). Older versions such as 4.8 do not require them to be declared. They are the Open MPI symbols behind MPI_COMM_WORLD and MPI_CHAR.
ompi_mpi_comm_world
ompi_mpi_char

Related

OpenMPI MPI_Comm_spawn() giving unreachable errors

I've got a master and worker system, where the master uses OpenMPI to spawn and communicate with its workers. I've used versions 4.0.4 and 3.1.6, both are giving similar errors.
master.cxx
#include <mpi.h>
void doStuff() {
    // does stuff
}
int main(int argc, char *argv[]) {
    int NUM_JOBS = 10;
    MPI_Init(&argc, &argv);
    char * args[] = {"Arg1","Arg2",NULL,};
    MPI_Info mpi_info;
    MPI_Info_create(&mpi_info);
    MPI_Info_set(mpi_info, "add-hostfile", "nodefile");
    MPI_Comm child_comm;
    MPI_Comm_spawn("worker", args, NUM_JOBS, mpi_info, 0, MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
    doStuff();
    MPI_Finalize();
    return 0;
}
worker.cxx
#include <mpi.h>
void doStuff() {
    // does stuff
}
int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    doStuff();
    MPI_Finalize();
    return 0;
}
Both are compiled via cmake. I have 101 cores available to me, for 1 master and up to 100 workers, with 24 cores per node. For the first run, I launch only the workers to prove that they can run. I had to add some InfiniBand parameters:
run1.sh
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx4_0:1"
mpirun -n 101 --mca btl openib,self --hostfile nodefile worker
Now, I try to run the workers using the master. Note that the code as written above spawns 10 workers, so the 11 total procs can fit on one 24-core node.
run2.sh
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx4_0:1"
mpirun -n 1 --mca btl openib,self --hostfile nodefile master
That works fine. So I raise NUM_JOBS in master.cxx to 100 and repeat the run using run2.sh.
This time, I get an MPI error and the job fails:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[53646,2],88]) is on host: mymachine04
Process 2 ([[53636,1],0]) is on host: unnknown1
BTLs attempted: self openib
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[mymachine04:22480] [[53646,2],88] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
That last line is repeated 77 times in total. Presumably not by coincidence, 101 procs - 24 procs per node = 77. The hostfile is just 101 lines of hostnames, and I get the same error whether I supply it or not. Interestingly, if I remove the first line of the hostfile before spawning my workers, on the grounds that that slot is already used by the master, I get an error about running out of slots. Any idea what I'm doing wrong, and how I can get all of my processes to reach each other?

Can I link two separate executables with MPI_open_port and share port information in a text file?

I'm trying to create a shared MPI COMM between two executables which are both started independently, e.g.
mpiexec -n 1 ./exe1
mpiexec -n 1 ./exe2
I use MPI_Open_port to generate port details and write these to a file in exe1 and then read with exe2. This is followed by MPI_Comm_connect/MPI_Comm_accept and then send/recv communication (minimal example below).
My question is: can we write port information to file in this way, or is the MPI_Publish_name/MPI_Lookup_name required for MPI to work as in this, this and this? As supercomputers usually share a file system, this file based approach seems simpler and maybe avoids establishing a server.
It seems this should work according to the MPI_Open_port documentation in the MPI 3.1 standard,
port_name is essentially a network address. It is unique within the communication universe to which it belongs (determined by the implementation), and may be used by any client within that communication universe. For instance, if it is an internet (host:port) address, it will be unique on the internet. If it is a low level switch address on an IBM SP, it will be unique to that SP
In addition, according to documentation on the MPI forum:
The following should be compatible with MPI: The server prints out an address to the terminal, the user gives this address to the client program.
MPI does not require a nameserver
A port_name is a system-supplied string that encodes a low-level network address at which a server can be contacted.
By itself, the port_name mechanism is completely portable ...
Writing the port information to a file works as expected with MPICH (3.2), i.e. it creates a shared communicator and exchanges information, but it hangs at the MPI_Comm_connect/MPI_Comm_accept line when using OpenMPI versions 2.0.1 and 4.0.1 (on my local workstation running Ubuntu 12.04, although it eventually needs to work on a tier 1 supercomputer). I have raised this as an issue here but welcome a solution or workaround in the meantime.
Further Information
If I use the MPMD mode with OpenMPI,
mpiexec -n 1 ./exe1 : -n 1 ./exe2
this works correctly, so it must be an issue with allowing the jobs to share ompi_global_scope as in this question. I've also tried adding,
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "ompi_global_scope", "true");
with info passed to all commands, with no success. I'm not running a server/client model, as both codes run simultaneously, so sharing a URL/PID from one is not ideal. In any case, I cannot get this to work even using the suggested approach, which for OpenMPI 2.0.1,
mpirun -n 1 --report-pid + ./OpenMPI_2.0.1 0
1234
mpirun -n 1 --ompi-server pid:1234 ./OpenMPI_2.0.1 1
gives,
ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
pmix server init failed
--> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
and with OpenMPI 4.0.1,
mpirun -n 1 --report-pid + ./OpenMPI_4.0.1 0
1234
mpirun -n 1 --ompi-server pid:1234 ./OpenMPI_4.0.1 1
gives,
ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 50
...
A publish/lookup server was provided, but we were unable to connect
to it - please check the connection info and ensure the server
is alive:
Using 4.0.1 means the error should not be related to this bug in OpenMPI.
Minimal code
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <iostream>
#include <fstream>
using namespace std;
int main( int argc, char *argv[] )
{
    int num_errors = 0;
    int rank, size;
    char port1[MPI_MAX_PORT_NAME];
    char port2[MPI_MAX_PORT_NAME];
    MPI_Status status;
    MPI_Comm comm1, comm2;
    int data = 0;
    char *ptr;
    int runno = strtol(argv[1], &ptr, 10);
    for (int i = 0; i < argc; ++i)
        printf("inputs %d %d %s \n", i,runno, argv[i]);
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (runno == 0)
    {
        printf("0: opening ports.\n");fflush(stdout);
        MPI_Open_port(MPI_INFO_NULL, port1);
        printf("opened port1: <%s>\n", port1);
        //Write port file
        ofstream myfile;
        myfile.open("port");
        if( !myfile )
            cout << "Opening file failed" << endl;
        myfile << port1 << endl;
        if( !myfile )
            cout << "Write failed" << endl;
        myfile.close();
        printf("Port %s written to file \n", port1); fflush(stdout);
        printf("Attempt to accept port1.\n");fflush(stdout);
        //Establish connection and send data
        MPI_Comm_accept(port1, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &comm1);
        printf("sending 5 \n");fflush(stdout);
        data = 5;
        MPI_Send(&data, 1, MPI_INT, 0, 0, comm1);
        MPI_Close_port(port1);
    }
    else if (runno == 1)
    {
        //Read port file
        size_t chars_read = 0;
        ifstream myfile;
        //Wait until file exists and is available
        myfile.open("port");
        while(!myfile){
            myfile.open("port");
            cout << "Opening file failed" << myfile << endl;
            usleep(30000);
        }
        while( myfile && chars_read < 255 ) {
            myfile >> port1[ chars_read ];
            if( myfile )
                ++chars_read;
            if( port1[ chars_read - 1 ] == '\n' )
                break;
        }
        printf("Reading port %s from file \n", port1); fflush(stdout);
        remove( "port" );
        //Establish connection and receive data
        MPI_Comm_connect(port1, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &comm1);
        MPI_Recv(&data, 1, MPI_INT, 0, 0, comm1, &status);
        printf("Received %d 1\n", data); fflush(stdout);
    }
    //Barrier on intercomm before disconnecting
    MPI_Barrier(comm1);
    MPI_Comm_disconnect(&comm1);
    MPI_Finalize();
    return 0;
}
The 0 and 1 simply specify if this code writes a port file or reads it in the example above. This is then run with,
mpiexec -n 1 ./a.out 0
mpiexec -n 1 ./a.out 1

How to do a non-blocking read on a non-socket fd

Is there a way to do a single read() in non-blocking mode on a pipe/terminal/etc, the way I can do it on a socket with recv(MSG_DONTWAIT)?
The reason I need that is because I cannot find any guarantee that a read() on a file-descriptor returned as ready for reading by select() or poll() will not block.
I know I can make the file descriptor non-blocking with fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK) but this will change the mode on that file descriptor globally, not just in the calling thread/process. For example:
% perl -MFcntl=F_SETFL,F_GETFL,O_NONBLOCK -e 'fcntl STDIN, F_SETFL, fcntl(STDIN, F_GETFL, 0) | O_NONBLOCK; select undef, undef, undef, undef'
^Z # put it in the background
% cat
cat: -: Resource temporarily unavailable
This will also make the fd non blocking for both reading and writing, which may confuse the hell out of another process doing the opposite on the same fd, as in:
non_blocking_read | filter | blocking_write
One way I can think of is to save the file status flags on starting up and on SIGCONT, and restore them on exiting and on SIGTSTP (just the way it's done with the termios settings), but this is very limited, race-prone, and will leave a mess behind in the case where the program exits abnormally.
Putting a save/restore with fcntl() before/after each read() also feels ugly and dumb, and may have other issues too. The same goes for an ioctl(FIONREAD) just before the read (which I'm not even sure will work reliably with any fd; assurances in that direction will be welcome, though).
I would be happy even with system specific (eg. linux or bsd-only) solutions.
For reference, here is a discussion about fixing it in linux; the idea didn't seem to get anywhere, though.
A Linux-only solution would be to reopen the file descriptor via
"/dev/stdin"|"/dev/tty"|"/dev/fd/$fd".
C example:
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
int main()
{
    int fd;
    char buf[8];
    int flags;
    /* Reopen stdin; the new descriptor gets its own file status flags. */
    if(0>(fd=open("/dev/stdin", O_RDONLY))) return 1;
    if(0>(flags = fcntl(fd,F_GETFL))) return 1;
    /* O_NONBLOCK is set only on the reopened descriptor. */
    if(0>(flags = fcntl(fd,F_SETFL,flags|O_NONBLOCK))) return 1;
    sleep(3);
    puts("reading");
    ssize_t nr = read(fd, buf, sizeof(buf));
    printf("read=%zd\n", nr);
    return 0;
}
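Unlike a duplicated file descriptor, a reopened file descriptor has independent file status flags, because open() creates a new open file description, whereas dup() returns a descriptor that shares the existing one.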

MPI: Why does my MPICH program fail for a large number of processes?

/* C Example */
#include <mpi.h>
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
int main (int argc, char* argv[])
{
    int rank, size;
    int buffer_length = MPI_MAX_PROCESSOR_NAME;
    char hostname[buffer_length];
    MPI_Init (&argc, &argv); /* starts MPI */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
    MPI_Get_processor_name(hostname, &buffer_length); /* get hostname */
    printf( "Hello world from process %d running on %s of %d\n", rank, hostname, size );
    MPI_Finalize();
    return 0;
}
The above program compiles and runs successfully on Ubuntu 12.04 for a small number of processes, but it fails when I try to execute it with thousands of processes. Why is that?
I expected that the scheduler would keep the threads in a queue and dispatch them one by one (I am running this code on a single-core machine).
Why does the following error occur for a large number of processes, and how can I resolve it?
root@ubuntu:/home# mpiexec -n 1000 ./hello
[proxy:0:0@ubuntu] HYDU_create_process (./utils/launch/launch.c:26): pipe error (Too many open files)
[proxy:0:0@ubuntu] launch_procs (./pm/pmiserv/pmip_cb.c:751): create process returned error
[proxy:0:0@ubuntu] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:935): launch_procs returned error
[proxy:0:0@ubuntu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@ubuntu] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
Killed
You are running into the open-file limit on your system. The default on Ubuntu is 1024. You can try raising the limit in your session with the ulimit command:
ulimit -n 2048

How does a zombie process manifest itself?

kill -s SIGCHLD
The above is the command for killing any zombie process, but my question is:
Is there any way in which a zombie process manifests itself?
steenhulthin is correct, but until it's moved someone may as well answer it here. A zombie process exists between the time that a child process terminates and the time that the parent calls one of the wait() functions to get its exit status.
A simple example:
/* Simple example that creates a zombie process. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main(void)
{
    pid_t cpid;
    char s[4];
    int status;
    cpid = fork();
    if (cpid == -1) {
        puts("Whoops, no child process, bye.");
        return 1;
    }
    if (cpid == 0) {
        puts("Child process says 'goodbye cruel world.'");
        return 0;
    }
    puts("Parent process now cruelly lets its child exist as\n"
         "a zombie until the user presses enter.\n"
         "Run 'ps aux | grep mkzombie' in another window to\n"
         "see the zombie.");
    fgets(s, sizeof(s), stdin);
    wait(&status);
    return 0;
}
