My goal was to send a vector from process 0 to process 1, and then send it back from process 1 to process 0.
I have two questions about my implementation:
1- Why does sending back from process 1 to process 0 take longer than the other direction?
The first send-recv takes ~1e-4 seconds in total and the second send-recv takes ~1 second.
2- When I increase the size of the vector, I get the following error. What is the reason for that?
mpirun noticed that process rank 0 with PID 11248 on node server1 exited on signal 11 (Segmentation fault).
My updated C++ code is as follows:
#include <mpi.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <boost/timer/timer.hpp>
#include <math.h>
using namespace std;
int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(&argc, &argv);
    MPI_Request request, request2, request3, request4;
    MPI_Status status;
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    srand( world_rank );
    int n = 1e3;
    double *myvector = new double[n];
    if (world_rank==0){
        myvector[n-1] = 1;
    }
    if (world_rank==0){
        // Rank 0: send the vector, then receive it back
        boost::timer::cpu_timer timer;
        MPI_Isend(myvector, n, MPI_DOUBLE , 1, 0, MPI_COMM_WORLD, &request);
        boost::timer::cpu_times elapsedTime1 = timer.elapsed();
        cout << " Wallclock time on Process 1:"
             << elapsedTime1.wall / 1e9 << " (sec)" << endl;
        MPI_Irecv(myvector, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &request4);
        MPI_Wait(&request4, &status);
        printf("Test if data is recieved from node 1: %1.0f\n",myvector[n-1]);
        boost::timer::cpu_times elapsedTime2 = timer.elapsed();
        cout <<" Wallclock time on Process 1:"
             << elapsedTime2.wall / 1e9 << " (sec)" << endl;
    } else {
        // Rank 1: receive the vector, modify it, then send it back
        boost::timer::cpu_timer timer;
        MPI_Irecv(myvector, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &request2);
        MPI_Wait(&request2, &status);
        boost::timer::cpu_times elapsedTime1 = timer.elapsed();
        cout << " Wallclock time on Process 2:"
             << elapsedTime1.wall / 1e9 << " (sec)" << endl;
        printf("Test if data is recieved from node 0: %1.0f\n",myvector[n-1]);
        myvector[n-1] = 2;
        MPI_Isend(myvector, n, MPI_DOUBLE , 0, 0, MPI_COMM_WORLD, &request3);
        boost::timer::cpu_times elapsedTime2 = timer.elapsed();
        cout<< " Wallclock time on Process 2:"
            << elapsedTime1.wall / 1e9 << " (sec)" << endl;
    }
    MPI_Finalize();
    return 0;
}
The output is:
Wallclock time on Process 1:2.484e-05 (sec)
Wallclock time on Process 2:0.000125325 (sec)
Test if data is recieved from node 0: 1
Wallclock time on Process 2:0.000125325 (sec)
Test if data is recieved from node 1: 2
Wallclock time on Process 1:1.00133 (sec)
Timing discrepancies
First of all, you don't measure the time to send the message. This is why posting the actual code you use for timing is essential.
You take four measurements. For the two sends, you only time the call to MPI_Isend. This is the immediate version of the API call. As the name suggests, it completes immediately. The timing has nothing to do with the actual time for sending the message.
For the receive operations, you measure MPI_Irecv and the corresponding MPI_Wait. This is the time between initiating the receive and the local availability of the message. This is again different from the message transfer time, as it does not consider the time difference between posting the receive and the corresponding send. In general, you have to consider the late sender and late receiver cases. Furthermore, even for blocking send operations, a local completion does not imply a completed transfer, remote completion, or even initiation.
Timing MPI transfers is difficult.
Checking for completion
There is still the question as to why anything in this code could take an entire second. That is certainly not a sensible time unless your network uses IPoAC. The likely reason is that you do not check for completion of all messages. MPI implementations are often single threaded and can only make progress on communication during the respective API calls. To use immediate messages, you must either periodically call MPI_Test* until the request is finished or complete the request using MPI_Wait*.
I don't know why you chose to use immediate MPI functions in the first place. If you call MPI_Wait right after starting an MPI_Isend/MPI_Irecv, you might as well just call MPI_Send/MPI_Recv. You need immediate functions for concurrent communication and computation, to allow concurrent irregular communication patterns, and to avoid deadlocks in certain situations. If you don't need the immediate functions, use the blocking ones instead.
While I cannot reproduce the segmentation fault, I suspect it is caused by using the same buffer (myvector) for two concurrently running MPI operations. Don't do that. Either use a separate buffer, or make sure the first operation completes. In general, you are not allowed to touch a buffer in any way after passing it to MPI_Isend/MPI_Irecv until you know the request has completed via MPI_Test*/MPI_Wait*.
If you think you need immediate operations to avoid deadlocks while sending and receiving, consider MPI_Sendrecv instead.
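As a rough illustration of the advice above, here is a minimal sketch of the round trip using blocking calls and MPI_Wtime. The message length n = 1000, the variable names, and the barrier before the timer are assumptions made for the example, not part of the original code. The barrier only reduces the late sender and late receiver effects; it does not remove them, so the reported value is a round-trip time seen by rank 0, not a pure transfer time.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    const int n = 1000;                      /* assumed message length */
    double *buf = calloc(n, sizeof *buf);
    if (buf == NULL) MPI_Abort(MPI_COMM_WORLD, 1);

    /* Line the ranks up so that neither side starts much later than the other. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    if (rank == 0) {
        buf[n - 1] = 1.0;
        /* Blocking send: buf may be reused as soon as the call returns. */
        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double round_trip = MPI_Wtime() - t0;
        printf("last element after round trip: %.0f\n", buf[n - 1]);
        printf("round trip: %g s\n", round_trip);
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        buf[n - 1] = 2.0;
        MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Because each blocking call returns before the next one starts, the single buffer is never handed to two operations at once. It would be launched with something like mpiexec -n 2 ./pingpong, where the executable name is only a placeholder.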
I do not want to use mpiexec -n 4 ./a.out to run my program on my core i7 processor (with 4 cores). Instead, I want to run ./a.out, have it detect the number of cores and fire up MPI to run a process per core.
This SO question and answer, "MPI Number of processors?", led me to use mpiexec.
The reason I want to avoid mpiexec is because my code is destined to be a library inside a larger project I'm working on. The larger project has a GUI and the user will be starting long computations that will call my library, which will in turn use MPI. The integration between the UI and the computation code is not trivial... so launching an external process and communicating via a socket or some other means is not an option. It must be a library call.
Is this possible? How do I do it?
This is quite a nontrivial thing to achieve in general. Also, there is hardly any portable solution that does not depend on some MPI implementation specifics. What follows is a sample solution that works with Open MPI and possibly with other general MPI implementations (MPICH, Intel MPI, etc.). It involves a second executable or a means for the original executable to directly call your library when given some special command-line argument. It goes like this.
Assume the original executable was started simply as ./a.out. When your library function is called, it calls MPI_Init(NULL, NULL), which initialises MPI. Since the executable was not started via mpiexec, it falls back to the so-called singleton MPI initialisation, i.e. it creates an MPI job that consists of a single process. To perform distributed computations, you have to start more MPI processes and that's where things get complicated in the general case.
MPI supports dynamic process management, in which one MPI job can start a second one and communicate with it using intercommunicators. This happens when the first job calls MPI_Comm_spawn or MPI_Comm_spawn_multiple. The first one is used to start simple MPI jobs that use the same executable for all MPI ranks while the second one can start jobs that mix different executables. Both need information as to where and how to launch the processes. This comes from the so-called MPI universe, which provides information not only about the started processes, but also about the available slots for dynamically started ones. The universe is constructed by mpiexec or by some other launcher mechanism that takes, e.g., a host file with a list of nodes and the number of slots on each node. In the absence of such information, some MPI implementations (Open MPI included) will simply start the executables on the same node as the original file. MPI_Comm_spawn[_multiple] has an MPI_Info argument that can be used to supply a list of key-value pairs with implementation-specific information. Open MPI supports the add-hostfile key that can be used to specify a hostfile to be used when spawning the child job. This is useful for, e.g., allowing the user to specify via the GUI a list of hosts to use for the MPI computation. But let's concentrate on the case where no such information is provided and Open MPI simply runs the child job on the same host.
Assume the worker executable is called worker. Or that the original executable can serve as worker if called with some special command-line option, -worker for example. If you want to perform computation with N processes in total, you need to launch N-1 workers. This is simple:
(separate executable)
MPI_Comm child_comm;
MPI_Comm_spawn("./worker", MPI_ARGV_NULL, N-1, MPI_INFO_NULL, 0,
               MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
(same executable, with an option)
MPI_Comm child_comm;
char *argv[] = { "-worker", NULL };
MPI_Comm_spawn("./a.out", argv, N-1, MPI_INFO_NULL, 0,
               MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
If everything goes well, child_comm will be set to the handle of an intercommunicator that can be used to communicate with the new job. As intercommunicators are kind of tricky to use and the parent-child job division requires complex program logic, one could simply merge the two sides of the intercommunicator into a "big world" communicator that replaces MPI_COMM_WORLD. On the parent's side:
MPI_Comm bigworld;
MPI_Intercomm_merge(child_comm, 0, &bigworld);
On the child's side:
MPI_Comm parent_comm, bigworld;
MPI_Comm_get_parent(&parent_comm);
MPI_Intercomm_merge(parent_comm, 1, &bigworld);
After the merge is complete, all processes can communicate using bigworld instead of MPI_COMM_WORLD. Note that child jobs do not share their MPI_COMM_WORLD with the parent job.
To put it all together, here is a complete functioning example with two separate program codes.
/* main.c */
#include <stdio.h>
#include <mpi.h>

int main (void)
{
    // Singleton initialisation: no mpiexec required
    MPI_Init(NULL, NULL);

    printf("[main] Spawning workers...\n");

    MPI_Comm child_comm;
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);

    // Merge parent and children into one intracommunicator (parent ranks first)
    MPI_Comm bigworld;
    MPI_Intercomm_merge(child_comm, 0, &bigworld);

    int size, rank;
    MPI_Comm_rank(bigworld, &rank);
    MPI_Comm_size(bigworld, &size);
    printf("[main] Big world created with %d ranks\n", size);

    // Perform some computation
    int data = 1, result;
    MPI_Bcast(&data, 1, MPI_INT, 0, bigworld);
    data *= (1 + rank);
    MPI_Reduce(&data, &result, 1, MPI_INT, MPI_SUM, 0, bigworld);
    printf("[main] Result = %d\n", result);

    printf("[main] Shutting down\n");
    MPI_Finalize();
    return 0;
}
/* worker.c */
#include <stdio.h>
#include <mpi.h>

int main (void)
{
    MPI_Init(NULL, NULL);

    // Obtain the intercommunicator to the parent job
    MPI_Comm parent_comm;
    MPI_Comm_get_parent(&parent_comm);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("[worker] %d of %d here\n", rank, size);

    // Merge with the parent (child ranks are ordered after the parent's)
    MPI_Comm bigworld;
    MPI_Intercomm_merge(parent_comm, 1, &bigworld);

    MPI_Comm_rank(bigworld, &rank);
    MPI_Comm_size(bigworld, &size);
    printf("[worker] %d of %d in big world\n", rank, size);

    // Perform some computation
    int data;
    MPI_Bcast(&data, 1, MPI_INT, 0, bigworld);
    data *= (1 + rank);
    MPI_Reduce(&data, NULL, 1, MPI_INT, MPI_SUM, 0, bigworld);

    printf("[worker] Done\n");
    MPI_Finalize();
    return 0;
}
Here is how it works:
$ mpicc -o main main.c
$ mpicc -o worker worker.c
$ ./main
[main] Spawning workers...
[worker] 0 of 2 here
[worker] 1 of 2 here
[worker] 1 of 3 in big world
[worker] 2 of 3 in big world
[main] Big world created with 3 ranks
[worker] Done
[worker] Done
[main] Result = 6
[main] Shutting down
The child job has to use MPI_Comm_get_parent to obtain the intercommunicator to the parent job. When a process is not part of such a child job, the returned value will be MPI_COMM_NULL. This allows for an easy way to implement both the main program and the worker in the same executable. Here is a hybrid example:
#include <stdio.h>
#include <mpi.h>
MPI_Comm bigworld_comm = MPI_COMM_NULL;
MPI_Comm other_comm = MPI_COMM_NULL;
// Returns 0 in the main process and 1 in a spawned worker
int parlib_init (const char *argv0, int n)
{
    MPI_Init(NULL, NULL);

    MPI_Comm_get_parent(&other_comm);
    if (other_comm == MPI_COMM_NULL)
    {
        // No parent job: this is the main process, so spawn the workers
        printf("[main] Spawning workers...\n");
        MPI_Comm_spawn(argv0, MPI_ARGV_NULL, n-1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &other_comm, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(other_comm, 0, &bigworld_comm);
        return 0;
    }

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("[worker] %d of %d here\n", rank, size);
    MPI_Intercomm_merge(other_comm, 1, &bigworld_comm);
    return 1;
}

int parlib_dowork (void)
{
    int data = 1, result = -1, size, rank;

    MPI_Comm_rank(bigworld_comm, &rank);
    MPI_Comm_size(bigworld_comm, &size);

    if (rank == 0)
    {
        printf("[main] Doing work with %d processes in total\n", size);
        data = 1;
    }

    MPI_Bcast(&data, 1, MPI_INT, 0, bigworld_comm);
    data *= (1 + rank);
    MPI_Reduce(&data, &result, 1, MPI_INT, MPI_SUM, 0, bigworld_comm);

    return result;
}

void parlib_finalize (void)
{
    MPI_Comm_free(&bigworld_comm);
    MPI_Finalize();
}

int main (int argc, char **argv)
{
    if (parlib_init(argv[0], 4))
    {
        // Worker process
        (void)parlib_dowork();
        printf("[worker] Done\n");
        parlib_finalize();
        return 0;
    }

    // Main process
    // Show GUI, save the world, etc.
    int result = parlib_dowork();
    printf("[main] Result = %d\n", result);
    parlib_finalize();

    printf("[main] Shutting down\n");
    return 0;
}
And here is an example output:
$ mpicc -o hybrid hybrid.c
$ ./hybrid
[main] Spawning workers...
[worker] 0 of 3 here
[worker] 2 of 3 here
[worker] 1 of 3 here
[main] Doing work with 4 processes in total
[worker] Done
[worker] Done
[main] Result = 10
[worker] Done
[main] Shutting down
Some things to keep in mind when designing such parallel libraries:
MPI can only be initialised once. If necessary, call MPI_Initialized to check if the library has already been initialised.
MPI can only be finalized once. Again, MPI_Finalized is your friend. It can be used in something like an atexit() handler to implement a universal MPI finalisation on program exit.
When used in threaded contexts (usual when GUIs are involved), MPI must be initialised with support for threads. See MPI_Init_thread.
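The three points above could be combined roughly as in the following sketch. The helper names parlib_ensure_mpi and parlib_atexit are hypothetical, and requesting MPI_THREAD_SERIALIZED is an assumption; the thread level a real GUI application needs depends on how its threads call into MPI.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* atexit handler: finalize MPI once, and only if it was initialised. */
static void parlib_atexit(void) {
    int initialized = 0, finalized = 0;
    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);
    if (initialized && !finalized) MPI_Finalize();
}

/* Hypothetical helper: returns 0 on success, -1 if thread support is too low. */
int parlib_ensure_mpi(void) {
    int initialized = 0;
    MPI_Initialized(&initialized);
    if (!initialized) {
        int provided;
        /* Assumed level: several threads may call MPI, but never at the same time. */
        MPI_Init_thread(NULL, NULL, MPI_THREAD_SERIALIZED, &provided);
        atexit(parlib_atexit);
        if (provided < MPI_THREAD_SERIALIZED) return -1;
    }
    return 0;
}

int main(void) {
    if (parlib_ensure_mpi() != 0) {
        fprintf(stderr, "MPI thread support too low\n");
        return 1;
    }
    parlib_ensure_mpi();   /* a second call does not initialise MPI again */
    printf("MPI ready\n");
    return 0;              /* parlib_atexit finalizes MPI on exit */
}
```

If some other component has already initialised MPI, this sketch leaves it alone and does not check the thread level that component requested; MPI_Query_thread could be used for that.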
You can get the number of CPUs by using, for example, this solution, and then start the MPI processes by calling MPI_Comm_spawn. But you will need to have a separate executable file.
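One possible way to obtain the core count on Linux, macOS, and similar systems is sysconf. This is only a sketch: the function name detect_core_count is made up for the example, _SC_NPROCESSORS_ONLN is a widely available extension rather than a guaranteed POSIX constant, and the value counts logical CPUs, which may exceed the number of physical cores when hyper-threading is enabled.

```c
#include <unistd.h>

/* Returns the number of online logical CPUs, or 1 if it cannot be determined. */
int detect_core_count(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n < 1) ? 1 : (int)n;
}
```

The result could then be used as the total process count, passing detect_core_count() - 1 as the number of workers to MPI_Comm_spawn.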
It was my understanding that MPI communicators restrict the scope of communication, such
that messages sent from one communicator should never be received in a different one.
However, the program inlined below appears to contradict this.
I understand that the MPI_Send call returns before a matching receive is posted because of the internal buffering it does under the hood (as opposed to MPI_Ssend). I also understand that MPI_Comm_free doesn't destroy the communicator right away, but merely marks it for deallocation and waits for any pending operations to finish. I suppose that my unmatched send operation will be forever pending, but then I wonder how come the same object (integer value) is reused for the second communicator!?
Is this normal behaviour, a bug in the MPI library implementation, or is it that my program is just incorrect?
Any suggestions are much appreciated!
LATER EDIT: posted follow-up question
#include "stdio.h"
#include "unistd.h"
#include "mpi.h"
int main(int argc, char* argv[]) {
    int rank, size;
    MPI_Group group;
    MPI_Comm my_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_group(MPI_COMM_WORLD, &group);

    // First communicator: rank 1 sends, nobody receives
    MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
    if (rank == 0) printf("created communicator %d\n", my_comm);

    if (rank == 1) {
        int msg = 123;
        MPI_Send(&msg, 1, MPI_INT, 0, 0, my_comm);
        printf("rank 1: message sent\n");
    }

    if (rank == 0) printf("freeing communicator %d\n", my_comm);
    MPI_Comm_free(&my_comm);

    // Second communicator: rank 0 receives
    MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
    if (rank == 0) printf("created communicator %d\n", my_comm);

    if (rank == 0) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 1, 0, my_comm, MPI_STATUS_IGNORE);
        printf("rank 0: message received\n");
    }

    if (rank == 0) printf("freeing communicator %d\n", my_comm);
    MPI_Comm_free(&my_comm);

    MPI_Finalize();
    return 0;
}
created communicator -2080374784
rank 1: message sent
freeing communicator -2080374784
created communicator -2080374784
rank 0: message received
freeing communicator -2080374784
The number you're seeing is simply a handle for the communicator. It's safe to reuse the handle since you've freed it. As to why you're able to send the message, look at how you're creating the communicator. When you use MPI_Comm_group, you're getting a group containing the ranks associated with the specified communicator. In this case, you get all of the ranks, since you are getting the group for MPI_COMM_WORLD. Then, you are using MPI_Comm_create to create a communicator based on a group of ranks. You are using the same group you just got, which will contain all of the ranks. So your new communicator has all of the ranks from MPI_COMM_WORLD. If you want your communicator to contain only a subset of ranks, you'll need to use a different function (or multiple functions) to make the desired group(s). I'd recommend reading through Chapter 6 of the MPI Standard; it includes all of the functions you'll need. Pick what you need to build the communicator you want.
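As one possible illustration of building a communicator from a subset of ranks, the sketch below uses MPI_Group_incl to select the even-numbered ranks of MPI_COMM_WORLD. The choice of even ranks and the names even_group and even_comm are assumptions made for the example.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Collect the even world ranks: 0, 2, 4, ... */
    int n_even = (world_size + 1) / 2;
    int *even_ranks = malloc(n_even * sizeof *even_ranks);
    if (even_ranks == NULL) MPI_Abort(MPI_COMM_WORLD, 1);
    for (int i = 0; i < n_even; i++) even_ranks[i] = 2 * i;

    MPI_Group world_group, even_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, n_even, even_ranks, &even_group);

    /* Collective over MPI_COMM_WORLD; ranks outside even_group get MPI_COMM_NULL. */
    MPI_Comm even_comm;
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

    if (even_comm != MPI_COMM_NULL) {
        int sub_rank, sub_size;
        MPI_Comm_rank(even_comm, &sub_rank);
        MPI_Comm_size(even_comm, &sub_size);
        printf("world rank %d is rank %d of %d in even_comm\n",
               world_rank, sub_rank, sub_size);
        MPI_Comm_free(&even_comm);
    } else {
        printf("world rank %d is not in even_comm\n", world_rank);
    }

    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);
    free(even_ranks);
    MPI_Finalize();
    return 0;
}
```

Every rank of MPI_COMM_WORLD has to call MPI_Comm_create here, including the ranks that end up outside the new communicator.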
My first thought was that MPI_Scatter and the send-buffer allocation should go inside the if (proc_id == 0) clause, because the data should be scattered only once and each process needs only a portion of the data in the send buffer. However, that didn't work correctly.
It appears that the send-buffer allocation and MPI_Scatter must be executed by all processes for the application to work correctly.
So I wonder, what is the philosophy behind MPI_Scatter, since all processes have access to the send buffer?
Any help would be appreciated.
The code I wrote looks like this:
if (proc_id == 0) {
    int * data = (int *)malloc(size*sizeof(int) * proc_size * recv_size);
    for (int i = 0; i < proc_size * recv_size; i++) data[i] = i;
    ierr = MPI_Scatter(&(data[0]), recv_size, MPI_INT, &recievedata, recv_size, MPI_INT, 0, MPI_COMM_WORLD);
}
I thought that was enough for the root process to scatter the data, and that the other processes only needed to receive it. So I put MPI_Scatter, along with the send-buffer definition and allocation, in the if (proc_id == 0) statement. There was no compile-time or runtime error or warning, but the receive buffers of the other processes didn't receive their corresponding part of the data.
Your question isn't very clear, and would be a lot easier to understand if you showed some code that you were having trouble with. Here's what I think you're asking -- and I'm only guessing this because this is an error I've seen people new to MPI in C make.
If you have some code like this:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int proc_id, size, ierr;
    int *data;
    int recievedata;

    ierr = MPI_Init(&argc, &argv);
    ierr|= MPI_Comm_size(MPI_COMM_WORLD,&size);
    ierr|= MPI_Comm_rank(MPI_COMM_WORLD,&proc_id);

    if (proc_id == 0) {
        data = (int *)malloc(size*sizeof(int));
        for (int i=0; i<size; i++) data[i] = i;
    }

    ierr = MPI_Scatter(&(data[0]), 1, MPI_INT,
            &recievedata, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d recieved <%d>\n", proc_id, recievedata);

    if (proc_id == 0) free(data);

    ierr = MPI_Finalize();
    return 0;
}
why doesn't it work, and why do you get a segmentation fault? Of course the other processes don't have access to data; that's the whole point.
The answer is that in the non-root processes, the sendbuf argument (the first argument to MPI_Scatter()) isn't used. So the non-root processes don't need access to data. But you still can't go around dereferencing a pointer that you haven't defined. So you need to make sure all the C code is valid. But data can be NULL or completely undefined on all the other processes; you just have to make sure you're not accidentally dereferencing it. So this works just fine, for instance:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int proc_id, size, ierr;
    int *data;
    int recievedata;

    ierr = MPI_Init(&argc, &argv);
    ierr|= MPI_Comm_size(MPI_COMM_WORLD,&size);
    ierr|= MPI_Comm_rank(MPI_COMM_WORLD,&proc_id);

    if (proc_id == 0) {
        data = (int *)malloc(size*sizeof(int));
        for (int i=0; i<size; i++) data[i] = i;
    } else {
        data = NULL;
    }

    ierr = MPI_Scatter(data, 1, MPI_INT,
            &recievedata, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d recieved <%d>\n", proc_id, recievedata);

    if (proc_id == 0) free(data);

    ierr = MPI_Finalize();
    return 0;
}
If you're using "multidimensional arrays" in C, and say scattering a row of a matrix, then you have to jump through an extra hoop or two to make this work, but it's still pretty easy.
Note that in the above code, all processes call Scatter, both the sender and the receivers. (Actually, the sender is also a receiver.)
In the message passing paradigm, both the sender and the receiver have to cooperate to send data. In principle, these tasks could be on different computers, housed perhaps in different buildings -- nothing is shared between them. So there's no way for Task 1 to just "put" data into some part of Task 2's memory. (Note that MPI2 has "one sided messages", but even that requires a significant degree of coordination between sender and receiver, as a window has to be put aside to push data into or pull data out of.)
The classic example of this is send/receive pairs; it's not enough that (say) process 0 sends data to process 3, process 3 also has to receive data.
The MPI_Scatter function contains both send and receive logic. The root process (specified here as 0) sends out the data, and all the receivers receive; everyone participating has to call the routine. Scatter is an example of an MPI Collective Operation, where all tasks in the communicator have to call the same routine. Broadcast, barrier, reduction operations, and gather operations are other examples.
If you have only process 0 call the scatter operation, your program will hang, waiting forever for the other tasks to participate.
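To make the data layout concrete, here is a plain C sketch with no MPI calls at all. It only imitates how the root's send buffer is sliced: the buffer holds proc_size * recv_size elements, and the process with rank r receives the recv_size elements starting at index r * recv_size. The helper names scatter_sendbuf_elems and scatter_slice are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Number of elements the root's send buffer must hold. */
size_t scatter_sendbuf_elems(int proc_size, int recv_size) {
    return (size_t)proc_size * (size_t)recv_size;
}

/* Copy the slice that `rank` would receive from the root's buffer.
 * Hypothetical helper: it imitates the layout only and does not communicate. */
void scatter_slice(const int *sendbuf, int recv_size, int rank, int *recvbuf) {
    memcpy(recvbuf, sendbuf + (size_t)rank * recv_size,
           (size_t)recv_size * sizeof *recvbuf);
}
```

With three processes and recv_size = 2, a root buffer {0, 1, 2, 3, 4, 5} would give rank 1 the slice {2, 3}. In real MPI code this copy is what the collective MPI_Scatter call performs across processes, which is why every rank has to take part in it.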
I want to use MPI (MPICH2) on Windows. I call MPI_Barrier and expect it to block all processes until all group members have called it, but that does not happen. Here is a schematic of my code:
int a;
if(myrank == RootProc)
    a = 4;
MPI_Barrier(MPI_COMM_WORLD);
cout << "My Rank = " << myrank << "\ta = " << a << endl;
(With 2 processes:) The root process (0) behaves correctly, but the process with rank 1 doesn't know the value of a, so it displays -858993460 instead of 4.
Can anyone help me?
You're only assigning a in process 0. MPI doesn't share memory, so if you want the a in process 1 to get the value of 4, you need to call MPI_Send from process 0 and MPI_Recv from process 1.
Variable a is not initialized - it is possible that is why it displays that number. In MPI, variable a is duplicated between the processes - so there are two values for a, one of which is uninitialized. You want to write:
int a = 4;
if (myrank == RootProc)
Or, alternatively, call MPI_Send in the root (rank 0) and MPI_Recv in the slave (rank 1), so the value in the root is also set in the slave.
Note: that code triggers a small alarm in my head, so I need to check something and I'll edit this with more info. Until then though, the uninitialized value is most certainly a problem for you.
Ok I've checked the facts - your code was not properly indented and I missed the missing {}. The barrier looks fine now, although the snippet you posted does not do too much, and is not a very good example of a barrier because the slave enters it directly, whereas the root will set the value of the variable to 4 and then enter it. To test that it actually works, you probably want some sort of a sleep mechanism in one of the processes - that will yield (hope it's the correct term) the other process as well, preventing it from printing the cout until the sleep is over.
Blocking is not enough; you have to send the data to the other processes (memory is not shared between processes).
To share data across ALL processes use:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
so in your case:
MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
Here you send one integer, pointed to by &a, from process 0 to all the others.
MPI_Bcast acts as the sender on the root process and as the receiver on non-root processes.
You can also send data to a specific process with:
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm )
and then receive by:
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
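Putting the two approaches side by side, a minimal sketch might look like this. The second variable b and its value 7 are made up for the example so that the two transfers can be told apart in the output.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int a = 0;                 /* exists on every rank, but only rank 0 sets it to 4 */
    if (rank == 0) a = 4;

    /* Option 1: collective. Every rank calls MPI_Bcast; rank 0 is the source. */
    MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("after MPI_Bcast: rank %d has a = %d\n", rank, a);

    /* Option 2: point-to-point, from rank 0 to rank 1 only. */
    int b = (rank == 0) ? 7 : 0;
    if (size >= 2) {
        if (rank == 0) {
            MPI_Send(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    printf("after MPI_Send/MPI_Recv: rank %d has b = %d\n", rank, b);

    MPI_Finalize();
    return 0;
}
```

After the broadcast every rank holds a = 4, while b = 7 reaches only rank 1; any further ranks keep b = 0 because nothing was sent to them.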