Questions about MPI_Scatter execution & its send-buffer allocation - mpi

My first thought was that MPI_Scatter and the send-buffer allocation should go inside an if (proc_id == 0) clause, because the data should be scattered only once and each process needs only a portion of the data in the send buffer; however, it didn't work correctly.
It appears that the send-buffer allocation and MPI_Scatter must be executed by all processes before the application works correctly.
So I wonder, what's the philosophy behind MPI_Scatter if all processes have access to the send buffer?
Any help would be appreciated.
Edit:
The code I wrote looks like this:
if (proc_id == 0) {
    int * data = (int *)malloc(size*sizeof(int) * proc_size * recv_size);
    for (int i = 0; i < proc_size * recv_size; i++) data[i] = i;
    ierr = MPI_Scatter(&(data[0]), recv_size, MPI_INT, &recievedata, recv_size, MPI_INT, 0, MPI_COMM_WORLD);
}
I thought that's enough for the root process to scatter the data; all the other processes need to do is receive it. So I put MPI_Scatter, along with the send-buffer definition & allocation, inside the if(proc_id == 0) statement. There's no compile or runtime error or warning, but the receive buffers of the other processes didn't receive their corresponding parts of the data.

Your question isn't very clear, and it would be a lot easier to understand if you showed some code that you were having trouble with. Here's what I think you're asking -- and I'm only guessing, because this is an error I've seen people new to MPI in C make.
If you have some code like this:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int proc_id, size, ierr;
    int *data;
    int recievedata;

    ierr = MPI_Init(&argc, &argv);
    ierr |= MPI_Comm_size(MPI_COMM_WORLD, &size);
    ierr |= MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);

    if (proc_id == 0) {
        data = (int *)malloc(size*sizeof(int));
        for (int i=0; i<size; i++) data[i] = i;
    }

    ierr = MPI_Scatter(&(data[0]), 1, MPI_INT,
                       &recievedata, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d recieved <%d>\n", proc_id, recievedata);

    if (proc_id == 0) free(data);

    ierr = MPI_Finalize();
    return 0;
}
why doesn't it work, and why do you get a segmentation fault? Of course the other processes don't have access to data; that's the whole point.
The answer is that in the non-root processes, the sendbuf argument (the first argument to MPI_Scatter()) isn't used. So the non-root processes don't need access to data. But you still can't go around dereferencing a pointer that you haven't defined. So you need to make sure all the C code is valid. But data can be NULL or completely undefined on all the other processes; you just have to make sure you're not accidentally dereferencing it. So this works just fine, for instance:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int proc_id, size, ierr;
    int *data;
    int recievedata;

    ierr = MPI_Init(&argc, &argv);
    ierr |= MPI_Comm_size(MPI_COMM_WORLD, &size);
    ierr |= MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);

    if (proc_id == 0) {
        data = (int *)malloc(size*sizeof(int));
        for (int i=0; i<size; i++) data[i] = i;
    } else {
        data = NULL;
    }

    ierr = MPI_Scatter(data, 1, MPI_INT,
                       &recievedata, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d recieved <%d>\n", proc_id, recievedata);

    if (proc_id == 0) free(data);

    ierr = MPI_Finalize();
    return 0;
}
If you're using "multidimensional arrays" in C, and say scattering a row of a matrix, then you have to jump through an extra hoop or two to make this work, but it's still pretty easy.
Update:
Note that in the above code, all processes called Scatter - both the sender and the receivers. (Actually, the sender is also a receiver.)
In the message passing paradigm, both the sender and the receiver have to cooperate to send data. In principle, these tasks could be on different computers, housed perhaps in different buildings -- nothing is shared between them. So there's no way for Task 1 to just "put" data into some part of Task 2's memory. (Note that MPI-2 has "one-sided messages", but even that requires a significant degree of coordination between sender and receiver, as a window has to be set aside to push data into or pull data out of.)
The classic example of this is send/receive pairs; it's not enough that (say) process 0 sends data to process 3, process 3 also has to receive it.
The MPI_Scatter function contains both send and receive logic. The root process (specified here as 0) sends out the data, and all the receivers receive it; everyone participating has to call the routine. Scatter is an example of an MPI Collective Operation, where all tasks in the communicator have to call the same routine. Broadcast, barrier, reduction operations, and gather operations are other examples.
If you have only process 0 call the scatter operation, your program will hang, waiting forever for the other tasks to participate.

Related

Need help understanding MPI_Comm_create

Regarding MPI_Comm_create, the MPI standard says
MPI_COMM_CREATE(comm, group, newcomm)
... The function is collective and must be called by all processes in
the group of comm.
I took this to mean that, for instance, if the comm argument is MPI_COMM_WORLD, then all processes in MPI_COMM_WORLD must call MPI_Comm_create.
However, I wrote a variation on a code available on the internet demonstrating the use of MPI_Comm_create. It is below. You can see that there are two spots where MPI_Comm_create is called, and not by all processes. And yet the code runs just fine.
Did I get lucky? Did I stumble onto some implementation-dependent feature? Am I misunderstanding the MPI standard? Is the idea that the two calls together result in everyone calling MPI_Comm_create so "at the end of the day" it's OK? Thanks. Here's the code:
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv) {
    MPI_Comm even_comm, odd_comm;
    MPI_Group even_group, odd_group, world_group;
    int id, even_id, odd_id;
    int *even_ranks, *odd_ranks;
    int num_ranks, num_even_ranks, num_odd_ranks;
    int err_mpi, i, j;
    int even_sum, odd_sum;

    err_mpi = MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    num_even_ranks = (num_ranks+1)/2;
    even_ranks = (int *)malloc(num_even_ranks*sizeof(int));
    j = 0;
    for (i = 0; i < num_ranks; i++){
        if (i%2 == 0) {
            even_ranks[j] = i;
            j++;
        }
    }

    num_odd_ranks = num_ranks/2;
    odd_ranks = (int *)malloc(num_odd_ranks*sizeof(int));
    j = 0;
    for (i = 0; i < num_ranks; i++){
        if (i%2 == 1) {
            odd_ranks[j] = i;
            j++;
        }
    }

    if (id%2 == 0){
        MPI_Group_incl(world_group, num_even_ranks, even_ranks, &even_group);
        // RIGHT HERE, all procs are NOT calling!
        MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);
        MPI_Comm_rank(even_comm, &even_id);
        odd_id = -1;
    } else {
        MPI_Group_incl(world_group, num_odd_ranks, odd_ranks, &odd_group);
        // RIGHT HERE, all procs are NOT calling!
        MPI_Comm_create(MPI_COMM_WORLD, odd_group, &odd_comm);
        MPI_Comm_rank(odd_comm, &odd_id);
        even_id = -1;
    }

    // Just to have something to do, we'll sum up the ranks of
    // the various procs in each communicator.
    if (even_id != -1) MPI_Reduce(&id, &even_sum, 1, MPI_INT, MPI_SUM, 0, even_comm);
    if (odd_id != -1) MPI_Reduce(&id, &odd_sum, 1, MPI_INT, MPI_SUM, 0, odd_comm);

    if (odd_id == 0) printf("odd sum: %d\n", odd_sum);
    if (even_id == 0) printf("even sum: %d\n", even_sum);

    MPI_Finalize();
}
Although MPI_Comm_create is called from two different lines of code, the important point is that all processes in MPI_COMM_WORLD are calling it at the same time. The fact that the calls are not on the same line of code is not relevant - in fact, the MPI library doesn't even know where MPI_Comm_create is being called from.
A simpler example would be calling Barrier from the two branches; regardless of which line is executed, all processes are executing the same barrier so the code will work as expected.
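In terms of the code above, that analogy would look something like this (a sketch, reusing the id variable from the posted program):

/* Every rank reaches a barrier on MPI_COMM_WORLD, just from different lines;
   the library only sees that all ranks called MPI_Barrier on the same communicator. */
if (id % 2 == 0) {
    MPI_Barrier(MPI_COMM_WORLD);   /* even ranks call it here */
} else {
    MPI_Barrier(MPI_COMM_WORLD);   /* odd ranks call it here */
}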
You could easily rewrite the code to be called from the same line: simply have variables called "num_ranks", "mycomm", "mygroup" and "myid" and an array called "ranks" and set them equal to the even or odd variables depending on rank. All processes could then call:
MPI_Group_incl(world_group, num_ranks, ranks, &mygroup);
MPI_Comm_create(MPI_COMM_WORLD, mygroup, &mycomm);
MPI_Comm_rank(mycomm, &myid);
and if you really wanted you could reassign these back afterwards, e.g.
if (id%2 == 0) even_comm = mycomm;
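Putting that suggestion together, a sketch of the single-call-site version might look like the following (reusing even_ranks/odd_ranks and world_group from the posted code; my_num_ranks is used instead of num_ranks to avoid clashing with the existing world-size variable):

/* Pick the even or odd parameters based on this rank's parity, then make
   exactly one MPI_Comm_create call that every rank in MPI_COMM_WORLD executes. */
int my_num_ranks = (id % 2 == 0) ? num_even_ranks : num_odd_ranks;
int *my_ranks    = (id % 2 == 0) ? even_ranks     : odd_ranks;
MPI_Group mygroup;
MPI_Comm mycomm;
int myid;

MPI_Group_incl(world_group, my_num_ranks, my_ranks, &mygroup);
MPI_Comm_create(MPI_COMM_WORLD, mygroup, &mycomm);
MPI_Comm_rank(mycomm, &myid);

if (id % 2 == 0) { even_comm = mycomm; even_id = myid; odd_id = -1; }
else             { odd_comm  = mycomm; odd_id  = myid; even_id = -1; }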

MPI message received in different communicator

It was my understanding that MPI communicators restrict the scope of communication, such
that messages sent from one communicator should never be received in a different one.
However, the program inlined below appears to contradict this.
I understand that the MPI_Send call returns before a matching receive is posted because of the internal buffering it does under the hood (as opposed to MPI_Ssend). I also understand that MPI_Comm_free doesn't destroy the communicator right away, but merely marks it for deallocation and waits for any pending operations to finish. I suppose that my unmatched send operation will be forever pending, but then I wonder how come the same object (integer value) is reused for the second communicator!?
Is this normal behaviour, a bug in the MPI library implementation, or is it that my program is just incorrect?
Any suggestions are much appreciated!
LATER EDIT: posted follow-up question
#include "stdio.h"
#include "unistd.h"
#include "mpi.h"
int main(int argc, char* argv[]) {
int rank, size;
MPI_Group group;
MPI_Comm my_comm;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_group(MPI_COMM_WORLD, &group);
MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
if (rank == 0) printf("created communicator %d\n", my_comm);
if (rank == 1) {
int msg = 123;
MPI_Send(&msg, 1, MPI_INT, 0, 0, my_comm);
printf("rank 1: message sent\n");
}
sleep(1);
if (rank == 0) printf("freeing communicator %d\n", my_comm);
MPI_Comm_free(&my_comm);
sleep(2);
MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
if (rank == 0) printf("created communicator %d\n", my_comm);
if (rank == 0) {
int msg;
MPI_Recv(&msg, 1, MPI_INT, 1, 0, my_comm, MPI_STATUS_IGNORE);
printf("rank 0: message received\n");
}
sleep(1);
if (rank == 0) printf("freeing communicator %d\n", my_comm);
MPI_Comm_free(&my_comm);
MPI_Finalize();
return 0;
}
outputs:
created communicator -2080374784
rank 1: message sent
freeing communicator -2080374784
created communicator -2080374784
rank 0: message received
freeing communicator -2080374784
The number you're seeing is simply a handle for the communicator. It's safe to reuse the handle since you've freed it.
As for why you're able to send the message, look at how you're creating the communicator. When you use MPI_Comm_group, you're getting a group containing the ranks associated with the specified communicator; in this case you get all of the ranks, since you are getting the group for MPI_COMM_WORLD. Then you are using MPI_Comm_create to create a communicator based on a group of ranks. You are using the same group you just got, which contains all of the ranks, so your new communicator has all of the ranks from MPI_COMM_WORLD.
If you want your communicator to contain only a subset of ranks, you'll need to use a different function (or several functions) to build the desired group(s). I'd recommend reading through Chapter 6 of the MPI Standard; it includes all of the functions you'll need. Pick what you need to build the communicator you want.
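As a rough sketch of what that looks like, here is one way to build a communicator over just the even-numbered ranks (the same idea as the even/odd example in the previous question; the names are only illustrative):

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Group world_group, even_group;
    MPI_Comm even_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* List the even ranks: 0, 2, 4, ... */
    int n_even = (size + 1) / 2;
    int *even_ranks = malloc(n_even * sizeof(int));
    for (int i = 0; i < n_even; i++) even_ranks[i] = 2 * i;

    MPI_Group_incl(world_group, n_even, even_ranks, &even_group);
    /* Collective over MPI_COMM_WORLD: every rank calls it, but ranks that are
       not in even_group get MPI_COMM_NULL back. */
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

    if (even_comm != MPI_COMM_NULL) MPI_Comm_free(&even_comm);
    free(even_ranks);
    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}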

C MPI multiple dynamic array passing

I'm trying to ISend() two arrays: arr1, arr2 and an integer n which is the size of arr1, arr2. I understood from this post that sending a struct that contains all three is not an option since n is only known at run time. Obviously, I need n to be received first since otherwise the receiving process wouldn't know how many elements to receive. What's the most efficient way to achieve this without using the blocking Send()?
Sending the size of the array separately is redundant (and inefficient), as MPI provides a way to probe for incoming messages without receiving them, and the probe yields just enough information to properly allocate memory. Probing is performed with MPI_PROBE, which looks a lot like MPI_RECV except that it takes no buffer-related arguments. The probe operation fills in a status object, which can then be queried with MPI_GET_COUNT for the number of elements of a given MPI datatype that can be extracted from the message, so explicitly sending the number of elements becomes unnecessary.
Here is a simple example with two ranks:
if (rank == 0)
{
    MPI_Request req;

    // Send a message to rank 1
    MPI_Isend(arr1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    // Do not forget to complete the request!
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
else if (rank == 1)
{
    MPI_Status status;

    // Wait for a message from rank 0 with tag 0
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
    // Find out the number of elements in the message -> size goes to "n"
    MPI_Get_count(&status, MPI_DOUBLE, &n);
    // Allocate memory
    arr1 = malloc(n*sizeof(double));
    // Receive the message, ignore the status
    MPI_Recv(arr1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_PROBE also accepts the wildcard rank MPI_ANY_SOURCE and the wildcard tag MPI_ANY_TAG. One can then consult the corresponding entry in the status structure in order to find out the actual sender rank and the actual message tag.
Probing for the message size works because every message carries a header, called an envelope. The envelope consists of the sender's rank, the receiver's rank, the message tag and the communicator. It also carries information about the total message size. Envelopes are sent as part of the initial handshake between the two communicating processes.
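A sketch of the wildcard variant on the receiving side, assuming the same arr1/n variables as above:

MPI_Status status;

// Probe for a message from any rank, with any tag
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

// The envelope tells us who sent it and with what tag...
int sender = status.MPI_SOURCE;
int tag    = status.MPI_TAG;

// ...and how many doubles it carries
MPI_Get_count(&status, MPI_DOUBLE, &n);
arr1 = malloc(n * sizeof(double));

// Receive exactly that message by using the probed source and tag
MPI_Recv(arr1, n, MPI_DOUBLE, sender, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);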
First you need to allocate memory (the full n elements) for arr1 and arr2 on rank 0, i.e. your front-end processor.
Divide the array into parts depending on the number of processors and determine the element count for each processor.
Send this element count to the other processors from rank 0.
The second send is for the array itself, i.e. arr1 and arr2.
On the other processors, allocate arr1 and arr2 according to the element count received from the main processor (rank 0). After receiving the element count, receive the two arrays into the allocated memory.
This is a sample C++ implementation, but C follows the same logic. To make it non-blocking, just interchange Send with Isend.
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc, char* argv[])
{
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int no_of_processors = MPI::COMM_WORLD.Get_size();
    MPI::Status status;

    double *arr1;

    if (rank == 0)
    {
        // Setting some Random n
        int n = 10;

        arr1 = new double[n];
        for (int i = 0; i < n; i++)
        {
            arr1[i] = i;
        }

        int part = n / no_of_processors;
        int offset = n % no_of_processors;
        // cout << part << "\t" << offset << endl;

        for (int i = 1; i < no_of_processors; i++)
        {
            int start = i*part;
            int end = start + part - 1;
            if (i == (no_of_processors-1))
            {
                end += offset;
            }
            // cout << i << " Start: " << start << " END: " << end;

            // Element_Count
            int e_count = end - start + 1;
            // cout << " e_count: " << e_count << endl;

            // Sending the element count
            MPI::COMM_WORLD.Send(&e_count, 1, MPI::INT, i, 0);

            // Sending Arr1
            MPI::COMM_WORLD.Send((arr1+start), e_count, MPI::DOUBLE, i, 1);
        }
    }
    else
    {
        // Element Count
        int e_count;

        // Receiving element count
        MPI::COMM_WORLD.Recv(&e_count, 1, MPI::INT, 0, 0, status);

        arr1 = new double[e_count];

        // Receiving First Array
        MPI::COMM_WORLD.Recv(arr1, e_count, MPI::DOUBLE, 0, 1, status);

        for (int i = 0; i < e_count; i++)
        {
            cout << arr1[i] << endl;
        }
    }

    // if(rank == 0)
    delete [] arr1;

    MPI::Finalize();
    return 0;
}
@Histro The point I want to make is that Irecv/Isend are themselves functions managed by the MPI library. The question you asked depends completely on the rest of your code -- on what you do after the Send/Recv. There are two cases:
Master and Worker
You send part of the problem (say arrays) to the workers (all other ranks except 0 = Master). The worker does some work (on the arrays), then returns the results to the master. The master then adds up the results and conveys new work to the workers. Now, here you would want the master to wait for all the workers to return their results (modified arrays). So you cannot use Isend and Irecv, but rather multiple sends as used in my code and the corresponding receives. If your code goes in this direction, you want to use MPI_Bcast and MPI_Reduce, as in the sketch below.
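A rough sketch of that broadcast-then-reduce pattern; this is a fragment meant to slot into a program that already has rank, nprocs and the root's input array, and do_local_work is a hypothetical helper:

/* Root broadcasts the problem size, the others allocate, then the data follows. */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (rank != 0) input = malloc(n * sizeof(double));
MPI_Bcast(input, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

/* Each rank computes its partial result... */
double partial = do_local_work(input, n, rank, nprocs);

/* ...and the master collects the combined result; every rank must call the collective. */
double total = 0.0;
MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);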
Lazy Master
The master divides the work but doesn't care about the results from its workers. Say you want to compute several different kinds of patterns for the same data. For example, given characteristics of the population of some city, you want to calculate patterns like how many people are above 18, how many have jobs, how many of them work at some company. These results don't have anything to do with one another. In this case you don't have to worry about whether the data has been received by the workers or not; the master can continue to execute the rest of the code. This is where it is safe to use Isend/Irecv.

MPI Receive/Gather Dynamic Vector Length

I have an application that stores a vector of structs. These structs hold information about each GPU on a system like memory and giga-flop/s. There are a different number of GPUs on each system.
I have a program that runs on multiple machines at once and I need to collect this data. I am very new to MPI but am able to use MPI_Gather() for the most part, however I would like to know how to gather/receive these dynamically sized vectors.
class MachineData
{
    unsigned long hostMemory;
    long cpuCores;
    int cudaDevices;

public:
    std::vector<NviInfo> nviVec;
    std::vector<AmdInfo> amdVec;
    ...
};

struct AmdInfo
{
    int platformID;
    int deviceID;
    cl_device_id device;
    long gpuMem;
    float sgflops;
    double dgflops;
};
Each machine in a cluster populates its instance of MachineData. I want to gather each of these instances, but I am unsure how to approach gathering nviVec and amdVec since their length varies on each machine.
You can use MPI_GATHERV in combination with MPI_GATHER to accomplish that. MPI_GATHERV is the variable version of MPI_GATHER and it allows the root rank to gather a different number of elements from each sending process. But in order for the root rank to specify these numbers, it has to know how many elements each rank is holding. This can be achieved with a simple single-element MPI_GATHER beforehand. Something like this:
// To keep things simple: root is fixed to be rank 0 and MPI_COMM_WORLD is used

// Number of MPI processes and current rank
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int *counts = new int[size];
int nelements = (int)vector.size();
// Each process tells the root how many elements it holds
MPI_Gather(&nelements, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);

// Displacements in the receive buffer for MPI_GATHERV
int *disps = new int[size];
// Displacement for the first chunk of data - 0
for (int i = 0; i < size; i++)
    disps[i] = (i > 0) ? (disps[i-1] + counts[i-1]) : 0;

// Place to hold the gathered data
// Allocate at root only
type *alldata = NULL;
if (rank == 0)
    // disps[size-1]+counts[size-1] == total number of elements
    alldata = new type[disps[size-1]+counts[size-1]];

// Collect everything into the root
MPI_Gatherv(vectordata, nelements, datatype,
            alldata, counts, disps, datatype, 0, MPI_COMM_WORLD);
You should also register an MPI derived datatype (datatype in the code above) for the structures (binary sends would work, but won't be portable and will not work in heterogeneous setups).
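As a sketch of what such a datatype might look like for the numeric fields of AmdInfo: the cl_device_id handle is skipped here, since a device handle from one machine means nothing on another, and the struct is mirrored locally just to keep the example self-contained.

#include <stddef.h>   /* offsetof */
#include <mpi.h>

/* Local mirror of AmdInfo without the cl_device_id handle */
struct AmdInfoWire {
    int    platformID;
    int    deviceID;
    long   gpuMem;
    float  sgflops;
    double dgflops;
};

MPI_Datatype make_amdinfo_type(void) {
    MPI_Datatype amdinfo_type, resized;
    int          blocklens[5] = {1, 1, 1, 1, 1};
    MPI_Datatype types[5]     = {MPI_INT, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE};
    MPI_Aint     offsets[5]   = {
        offsetof(struct AmdInfoWire, platformID),
        offsetof(struct AmdInfoWire, deviceID),
        offsetof(struct AmdInfoWire, gpuMem),
        offsetof(struct AmdInfoWire, sgflops),
        offsetof(struct AmdInfoWire, dgflops)
    };

    MPI_Type_create_struct(5, blocklens, offsets, types, &amdinfo_type);
    /* Resize so consecutive array elements match the C struct layout, including padding */
    MPI_Type_create_resized(amdinfo_type, 0, sizeof(struct AmdInfoWire), &resized);
    MPI_Type_commit(&resized);
    MPI_Type_free(&amdinfo_type);
    return resized;
}

The committed type could then be passed as datatype in the MPI_Gatherv call above.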

MPI Barrier C++

I want to use MPI (MPICH2) on windows. I write this command:
MPI_Barrier(MPI_COMM_WORLD);
And I expect it to block all processes until all group members have called it. But that's not what happens. Here is a schematic of my code:
int a;
if(myrank == RootProc)
a = 4;
MPI_Barrier(MPI_COMM_WORLD);
cout << "My Rank = " << myrank << "\ta = " << a << endl;
(With 2 processors:) The root processor (0) acts correctly, but the processor with rank 1 doesn't know the a variable, so it displays -858993460 instead of 4.
Can anyone help me?
Regards
You're only assigning a in process 0. MPI doesn't share memory, so if you want the a in process 1 to get the value of 4, you need to call MPI_Send from process 0 and MPI_Recv from process 1.
Variable a is not initialized - which is probably why it displays that number. In MPI, variable a is duplicated between the processes - so there are two values for a, one of which is uninitialized. You want to write:
int a = 4;
if (myrank == RootProc)
...
Or, alternatively, do an MPI_Send in the root (id 0) and an MPI_Recv in the slave (id 1) so the value in the root is also set in the slave.
Note: that code triggers a small alarm in my head, so I need to check something and I'll edit this with more info. Until then though, the uninitialized value is most certainly a problem for you.
Ok, I've checked the facts - your code was not properly indented and I missed the missing {}. The barrier looks fine now, although the snippet you posted does not do much and is not a very good example of a barrier, because the slave enters it directly, whereas the root sets the variable to 4 and then enters it. To test that it actually works, you probably want some sort of sleep mechanism in one of the processes - that will yield (hope that's the correct term) to the other process as well, preventing it from printing until the sleep is over.
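A sketch of that kind of test in plain C (sleep comes from unistd.h on POSIX systems; on Windows you would use Sleep from windows.h instead):

#include <stdio.h>
#include <unistd.h>   /* sleep() on POSIX */
#include <mpi.h>

int main(int argc, char **argv) {
    int myrank, a = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        sleep(2);      /* make the root arrive at the barrier late on purpose */
        a = 4;
    }
    MPI_Barrier(MPI_COMM_WORLD);
    /* The other rank only prints after the root's sleep finishes, which shows the
       barrier works -- but a is still 0 there, because barriers don't move data. */
    printf("Rank %d: a = %d\n", myrank, a);

    MPI_Finalize();
    return 0;
}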
Blocking is not enough; you have to send data to the other processes (memory is not shared between processes).
To share data across ALL processes use:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
so in your case:
MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
here you send one integer pointed to by &a from process 0 to all the others.
(MPI_Bcast acts as the sender on the root process and as a receiver on the non-root processes.)
You can also send some data to a specific process with:
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm )
and then receive by:
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
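Applied to the snippet above, a minimal sketch of that pairing might be:

int a;
if (myrank == 0) {
    a = 4;
    /* send one int with tag 0 to rank 1 */
    MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (myrank == 1) {
    /* receive one int with tag 0 from rank 0 */
    MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}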
