mpi gather collect data - mpi

I convinced that MPI_Gather collects data from all processes including root process itself.
How to make MPI_Gather to collect data from all process NOT including root process itself?
Or is there any alternative function?

Duplicate the functionality of MPI_Gather using MPI_Gatherv but specify 0 as the chunk size for the root rank instead. Something like this:
int rank, size, disp = 0;
int *cnts, *displs;
MPI_Comm_size(MPI_COMM_WORLD, &size);
cnts = malloc(size * sizeof(int));
displs = malloc(size * sizeof(int));
for (rank = 0; rank < size; rank++)
{
cnts[i] = (rank != root) ? count : 0;
displs[i] = disp;
disp += cnts[i];
}
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Gatherv(data, cnts[rank], data_type,
bigdata, cnts, displs, data_type,
root, MPI_COMM_WORLD);
free(displs); free(cnts);
Note that MPI_Gatherv could be significantly slower than MPI_Gather because the MPI implementation would be most likely unable to optimise the communication path and would fall back to some dumb linear implementation of the gather operation. So it might make sense to still use MPI_Gather and to provide some dummy data in the root process.
You could also supply MPI_IN_PLACE as the value of the root process send buffer and it would not send data to itself, but then again you would have to reserve place for the root data in the receive buffer (the in-place operation expects that the root would place its data directly in the correct position inside the receive buffer):
if (rank != root)
MPI_Gather(data, count, data_type,
NULL, count, data_type, root, MPI_COMM_WORLD);
else
MPI_Gather(MPI_IN_PLACE, count, data_type,
big_data, count, data_type, root, MPI_COMM_WORLD);

Related

How to do an MPI_Scatter in MPI to all nodes except the root?

In MPI, if I perform an MPI_Scatter on MPI_COMM_WORLD, all the nodes receive some data (including the sending root).
How do I perform an MPI_Scatter from a root node to all the other nodes and make sure the root node does not receive any data?
Is creating a new MPI_Comm containing all the nodes but the root a viable approach?
Let's imagine your code looks like that:
int rank, size; // rank of the process and size of the communicator
int root = 0; // root process of our scatter
int recvCount = 4; // or whatever
double *sendBuf = rank == root ? new double[recvCount * size] : NULL;
double *recvBuf = new double[recvCount];
MPI_Scatter( sendBuf, recvCount, MPI_DOUBLE,
recvBuf, recvCount, MPI_DOUBLE,
root, MPI_COMM_WORLD );
So in here, indeed, the root process will send data to itself although this could be avoided.
Here are the two obvious methods that come to mind to achieve that.
Using MPI_IN_PLACE
The call to MPI_Scatter() wouldn't have to change. The only change in the code would be for the definition of the receiving buffer, which would become something like this:
double *recvBuf = rank == root ?
static_cast<double*>( MPI_IN_PLACE ) :
new double[recvCount];
Using MPI_Scatterv()
With that, you'd have to define an array of integers describing the individual receiving sizes, an array of displacements describing the starting indexes and use them in a call to MPI_Scatterv() which would replace you call to MPI_Scatter() like this:
int sendCounts[size] = {recvCount}; // everybody receives recvCount data
sendCounts[root] = 0; // but the root process
int displs[size];
for ( int i = 0; i < size; i++ ) {
displs[i] = i * recvCount;
}
MPI_Scatterv( sendBuf, sendCounts, displs, MPI_DOUBLE,
recvBuf, recvCount, MPI_DOUBLE,
root, MPI_COMM_WORLD );
Of course in both cases no data would be on receiving buffer for process root and this would have to be accounted for in your code.
I personally prefer the first option, but both work.

Could MPI_Bcast lead to the issue of data uncertainty?

If different process broadcast different value to the other processes in the group of a certain communicator,what would happen?
Take the following program run by two processes as an example,
int rank, size;
int x;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0)
{
x = 0;
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
else if (rank==1)
{
x = 1;
MPI_Bcast(&x, 1, MPI_INT, 1, MPI_COMM_WORLD);
}
cout << "Process " << rank << "'s value is:" << x << endl;
MPI_Finalize();
I think there could be different possibilities of the printed result at the end of the program. If the process 0 runs faster than process 1, it would broadcast its value earlier than process 1, so process 1 would have the same value with process 0 when it starts broadcasting its value, therefore making the printed value of x both 0. But if the process 0 runs slower than process 1, the process 0 would have the same value as process 1 which is 1 at the end. Does what I described happen actually?
I think you don't understand the MPI_Bcast function well. Actually MPI_Bcast is a kind of MPI collective communication method in which every process that belongs to a certain communicator need to get involve. So for the function of MPI_Bcast, not only the process who sends the data to be broadcasted, but also the processes who receive the broadcasted data need to call the function synchronously in order to achieve the goal of data broadcasting among all participating processes.
In your given program, especially this part:
if (rank == 0)
{
x = 0;
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
else if (rank==1)
{
x = 1;
MPI_Bcast(&x, 1, MPI_INT, 1, MPI_COMM_WORLD);
}
I think you meant to let process whose rank is 0 (process 0) broadcasts its value of x to other processes, but in your code, only process 0 calls the MPI_Bcast function when you use the if-else segment. So what do other processes do? For process whose rank is 1 (process 1), it doesn't call the same MPI_Bcast function the process 0 calls, although it does call another MPI_Bcast function to broadcast its value of x (the argument of root is different between those two MPI_Bcast functions).Thus if only process 0 calls the MPI_Bcast function, it just broadcasts the value of x to only itself actually and the value of x stored in other processes wouldn't be affected at all. Also it's the same condition for process 1. As a result in your program, the printed value of x for each process should be the same as when it's assigned initially and there wouldn't be the data uncertainty issue as you concerned.
MPI_Bcast is used primarily so that rank 0 [root] can calculate values and broadcast them so everybody starts with the same values.
Here is a valid usage:
int x;
// not all ranks may get the same time value ...
srand(time(NULL));
// so we must get the value once ...
if (rank == 0)
x = rand();
// and broadcast it here ...
MPI_Bcast(&x,1,MPI_INT,0,MPI_COMM_WORLD);
Notice the difference from your usage. The same MPI_Bcast call for all ranks. The root will do a send and the others will do recv.

MPI message received in different communicator

It was my understanding that MPI communicators restrict the scope of communication, such
that messages sent from one communicator should never be received in a different one.
However, the program inlined below appears to contradict this.
I understand that the MPI_Send call returns before a matching receive is posted because of the internal buffering it does under the hood (as opposed to MPI_Ssend). I also understand that MPI_Comm_free doesn't destroy the communicator right away, but merely marks it for deallocation and waits for any pending operations to finish. I suppose that my unmatched send operation will be forever pending, but then I wonder how come the same object (integer value) is reused for the second communicator!?
Is this normal behaviour, a bug in the MPI library implementation, or is it that my program is just incorrect?
Any suggestions are much appreciated!
LATER EDIT: posted follow-up question
#include "stdio.h"
#include "unistd.h"
#include "mpi.h"
int main(int argc, char* argv[]) {
int rank, size;
MPI_Group group;
MPI_Comm my_comm;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_group(MPI_COMM_WORLD, &group);
MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
if (rank == 0) printf("created communicator %d\n", my_comm);
if (rank == 1) {
int msg = 123;
MPI_Send(&msg, 1, MPI_INT, 0, 0, my_comm);
printf("rank 1: message sent\n");
}
sleep(1);
if (rank == 0) printf("freeing communicator %d\n", my_comm);
MPI_Comm_free(&my_comm);
sleep(2);
MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
if (rank == 0) printf("created communicator %d\n", my_comm);
if (rank == 0) {
int msg;
MPI_Recv(&msg, 1, MPI_INT, 1, 0, my_comm, MPI_STATUS_IGNORE);
printf("rank 0: message received\n");
}
sleep(1);
if (rank == 0) printf("freeing communicator %d\n", my_comm);
MPI_Comm_free(&my_comm);
MPI_Finalize();
return 0;
}
outputs:
created communicator -2080374784
rank 1: message sent
freeing communicator -2080374784
created communicator -2080374784
rank 0: message received
freeing communicator -2080374784
The number you're seeing is simply a handle for the communicator. It's safe to reuse the handle since you've freed it. As to why you're able to send the message, look at how you're creating the communicator. When you use MPI_Comm_group, you're getting a group containing the ranks associated with the specified communicator. In this case, you get all of the ranks, since you are getting the group for MPI_COMM_WORLD. Then, you are using MPI_Comm_create to create a communicator based on a group of ranks. You are using the same group you just got, which will contain all of the ranks. So your new communicator has all of the ranks from MPI_COMM_WORLD. If you want your communicator to only contain a subset of ranks, you'll need to use a different function (or multiple functions) to make the desired group(s). I'd recommend reading through Chapter 6 of the MPI Standard, it includes all of the functions you'll need. Pick what you need to build the communicator you want.

MPI Receive/Gather Dynamic Vector Length

I have an application that stores a vector of structs. These structs hold information about each GPU on a system like memory and giga-flop/s. There are a different number of GPUs on each system.
I have a program that runs on multiple machines at once and I need to collect this data. I am very new to MPI but am able to use MPI_Gather() for the most part, however I would like to know how to gather/receive these dynamically sized vectors.
class MachineData
{
unsigned long hostMemory;
long cpuCores;
int cudaDevices;
public:
std::vector<NviInfo> nviVec;
std::vector<AmdInfo> amdVec;
...
};
struct AmdInfo
{
int platformID;
int deviceID;
cl_device_id device;
long gpuMem;
float sgflops;
double dgflops;
};
Each machine in a cluster populates its instance of MachineData. I want to gather each of these instances, but I am unsure how to approach gathering nviVec and amdVec since their length varies on each machine.
You can use MPI_GATHERV in combination with MPI_GATHER to accomplish that. MPI_GATHERV is the variable version of MPI_GATHER and it allows for the root rank to gather differt number of elements from each sending process. But in order for the root rank to specify these numbers it has to know how many elements each rank is holding. This could be achieved using simple single element MPI_GATHER before that. Something like this:
// To keep things simple: root is fixed to be rank 0 and MPI_COMM_WORLD is used
// Number of MPI processes and current rank
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int *counts = new int[size];
int nelements = (int)vector.size();
// Each process tells the root how many elements it holds
MPI_Gather(&nelements, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
// Displacements in the receive buffer for MPI_GATHERV
int *disps = new int[size];
// Displacement for the first chunk of data - 0
for (int i = 0; i < size; i++)
disps[i] = (i > 0) ? (disps[i-1] + counts[i-1]) : 0;
// Place to hold the gathered data
// Allocate at root only
type *alldata = NULL;
if (rank == 0)
// disps[size-1]+counts[size-1] == total number of elements
alldata = new int[disps[size-1]+counts[size-1]];
// Collect everything into the root
MPI_Gatherv(vectordata, nelements, datatype,
alldata, counts, disps, datatype, 0, MPI_COMM_WORLD);
You should also register MPI derived datatype (datatype in the code above) for the structures (binary sends will work but won't be portable and will not work in heterogeneous setups).

MPI Barrier C++

I want to use MPI (MPICH2) on windows. I write this command:
MPI_Barrier(MPI_COMM_WORLD);
And I expect it blocks all Processors until all group members have called it. But it is not happen. I add a schematic of my code:
int a;
if(myrank == RootProc)
a = 4;
MPI_Barrier(MPI_COMM_WORLD);
cout << "My Rank = " << myrank << "\ta = " << a << endl;
(With 2 processor:) Root processor (0) acts correctly, but processor with rank 1 doesn't know the a variable, so it display -858993460 instead of 4.
Can any one help me?
Regards
You're only assigning a in process 0. MPI doesn't share memory, so if you want the a in process 1 to get the value of 4, you need to call MPI_Send from process 0 and MPI_Recv from process 1.
Variable a is not initialized - it is possible that is why it displays that number. In MPI, variable a is duplicated between the processes - so there are two values for a, one of which is uninitialized. You want to write:
int a = 4;
if (myrank == RootProc)
...
Or, alternatively, do an MPI_send in the Root (id 0), and an MPI_recv in the slave (id 1) so the value in the root is also set in the slave.
Note: that code triggers a small alarm in my head, so I need to check something and I'll edit this with more info. Until then though, the uninitialized value is most certainly a problem for you.
Ok I've checked the facts - your code was not properly indented and I missed the missing {}. The barrier looks fine now, although the snippet you posted does not do too much, and is not a very good example of a barrier because the slave enters it directly, whereas the root will set the value of the variable to 4 and then enter it. To test that it actually works, you probably want some sort of a sleep mechanism in one of the processes - that will yield (hope it's the correct term) the other process as well, preventing it from printing the cout until the sleep is over.
Blocking is not enough, you have to send data to other processes (memory in not shared between processes).
To share data across ALL processes use:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
so in your case:
MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
here you send one integer pointed by &a form process 0 to all other.
//MPI_Bcast is sender for root process and receiver for non-root processes
You can also send some data to specyfic process by:
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm )
and then receive by:
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Resources