The MPI standard states that once a buffer has been given to a non-blocking communication function, the application is not allowed to use it until the operation has completed (i.e., until after a successful TEST or WAIT function).
Is this also applied to the situation below:
I have a buffer and every part of that will go to different processors, e.g part of it will be copied from data available on the processor itself.
Am I allowed on every processor to MPI_Irecv different parts of the buffer from other processors, copy the part available in the processor then MPI_Isend the data that should go to others, do my other computations, and MPI_Waitall so my send and receive get completed?
n=0;
for (i = 0; i < size; i++) {
if (i != rank) {
MPI_Irecv(&recvdata[i*100], 100, MPI_INT, i, i, comm, &requests[n]);
n++;
}
}
process(&recvdata[rank*100], 100);
for (i = 0; i < size; i++) {
if (i != rank) {
MPI_Isend(&senddata[i*100], 100, MPI_INT, i, rank, comm, &requests[n]);
n++;
}
}
MPI_Waitall(n, requests, statuses);
I'm not 100% sure I understand what you're asking, so I'll restate the question first:
If I have a large array of data, can I create nonblocking calls to receive data from subsets of the array and then send the data back out to other processes?
The answer to that is yes, as long as you synchronize between the receives and sends. Remember that the data from the MPI_IRECV won't have arrived until you've completed the call with MPI_WAIT, so you can't send it to another process until that's happened. Otherwise, the sends will be sending out whatever garbage happens to be in the buffer at the time.
So your code can look like this and be safe:
for (i = 0; i < size; i++)
MPI_Irecv(&data[i*100], 100, MPI_INT, i, 0, comm, &requests[i]);
/* No touching data in here */
MPI_Waitall(i, requests, statuses);
/* You can touch data here */
for (i = 0; i < size; i++)
MPI_Isend(&data[i*100], 100, MPI_INT, i+1, 0, comm); /* i+1 is wherever you want to send the data */
/* No touching data in here either */
MPI_Waitall(i, requests, statuses);
Throughout the MPI standard the term locations is used and not the term variables in order to prevent such confusion. The MPI library does not care where the memory comes from as long outstanding MPI operations are operating on disjoint sets of memory locations. Different memory locations could be different variables or different elements of a big array. In fact, the whole process memory could be thought as one big anonymous array of bytes.
In many cases, it is possible to achieve the same memory layout given different set of variable declarations. For example, with most x86/x64 C/C++ compilers the following two sets of local variable declarations will result in the same stack layout:
int a, b; int d[3];
int c;
| .... | | .... | |
+--------------+ +--------------+ |
| a | | d[2] | |
+--------------+ +--------------+ | lower addresses
| b | | d[1] | |
+--------------+ +--------------+ |
| c | | d[0] | \|/
+--------------+ +--------------+ V
In that case:
int a, b;
int c;
MPI_Irecv(&a, 1, MPI_INT, ..., &req[0]);
MPI_Irecv(&c, 1, MPI_INT, ..., &req[1]);
MPI_Waitall(2, &req, MPI_STATUSES_IGNORE);
is equivalent to:
int d[3];
MPI_Irecv(&d[2], 1, MPI_INT, ..., &req[0]);
MPI_Irecv(&d[0], 1, MPI_INT, ..., &req[1]);
MPI_Waitall(2, &req, MPI_STATUSES_IGNORE);
In the second case, though d[0] and d[2] belong to the same variable, &d[0] and &d[2] specify different and - in combination with ..., 1, MPI_INT, ... - disjoint memory locations.
In any case, make sure that you are not simultaneously reading from and writing into the same memory location.
A somehow more complex version of the example given by Wesley Bland follows. It overlaps send and receive operations by using MPI_Waitsome instead:
MPI_Request rreqs[size], sreqs[size];
for (i = 0; i < size; i++)
MPI_Irecv(&data[i*100], 100, MPI_INT, i, 0, comm, &rreqs[i]);
while (1)
{
int done_idx[size], numdone;
MPI_Waitsome(size, rreqs, &numdone, done_idx, MPI_STATUSES_IGNORE);
if (numdone == MPI_UNDEFINED)
break;
for (i = 0; i < numdone; i++)
{
int id = done_idx[i];
process(&data[id*100], 100);
MPI_Isend(&data[id*100], 100, MPI_INT, id, 0, comm, &sreqs[id]);
}
}
MPI_Waitall(size, sreqs, MPI_STATUSES_IGNORE);
In that particular case, using size separate arrays could result in somehow more readable code.
Related
Suppose I have a very large array of things and I have to do some operation on all these things.
In case operation fails for one element, I want to stop the work [this work is distributed across number of processors] across all the array.
I want to achieve this while keeping the number of sent/received messages to a minimum.
Also, I don't want to block processors if there is no need to.
How can I do it using MPI?
This seems to be a common question with no easy answer. Both other answer have scalability issues. The ring-communication approach has linear communication cost, while in the one-sided MPI_Win-solution, a single process will be hammered with memory requests from all workers. This may be fine for low number of ranks, but pose issues when increasing the rank count.
Non-blocking collectives can provide a more scalable better solution. The main idea is to post a MPI_Ibarrier on all ranks except on one designated root. This root will listen to point-to-point stop messages via MPI_Irecv and complete the MPI_Ibarrier once it receives it.
The tricky part is that there are four different cases "{root, non-root} x {found, not-found}" that need to be handled. Also it can happen that multiple ranks send a stop message, requiring an unknown number of matching receives on the root. That can be solved with an additional reduction that counts the number of ranks that sent a stop-request.
Here is an example how this can look in C:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
const int iter_max = 10000;
const int difficulty = 20000;
int find_stuff()
{
int num_iters = rand() % iter_max;
for (int i = 0; i < num_iters; i++) {
if (rand() % difficulty == 0) {
return 1;
}
}
return 0;
}
const int stop_tag = 42;
const int root = 0;
int forward_stop(MPI_Request* root_recv_stop, MPI_Request* all_recv_stop, int found_count)
{
int flag;
MPI_Status status;
if (found_count == 0) {
MPI_Test(root_recv_stop, &flag, &status);
} else {
// If we find something on the root, we actually wait until we receive our own message.
MPI_Wait(root_recv_stop, &status);
flag = 1;
}
if (flag) {
printf("Forwarding stop signal from %d\n", status.MPI_SOURCE);
MPI_Ibarrier(MPI_COMM_WORLD, all_recv_stop);
MPI_Wait(all_recv_stop, MPI_STATUS_IGNORE);
// We must post some additional receives if multiple ranks found something at the same time
MPI_Reduce(MPI_IN_PLACE, &found_count, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
for (found_count--; found_count > 0; found_count--) {
MPI_Recv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &status);
printf("Additional stop from: %d\n", status.MPI_SOURCE);
}
return 1;
}
return 0;
}
int main()
{
MPI_Init(NULL, NULL);
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
srand(rank);
MPI_Request root_recv_stop;
MPI_Request all_recv_stop;
if (rank == root) {
MPI_Irecv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &root_recv_stop);
} else {
// You may want to use an extra communicator here, to avoid messing with other barriers
MPI_Ibarrier(MPI_COMM_WORLD, &all_recv_stop);
}
while (1) {
int found = find_stuff();
if (found) {
printf("Rank %d found something.\n", rank);
// Note: We cannot post this as blocking, otherwise there is a deadlock with the reduce
MPI_Request req;
MPI_Isend(NULL, 0, MPI_CHAR, root, stop_tag, MPI_COMM_WORLD, &req);
if (rank != root) {
// We know that we are going to receive our own stop signal.
// This avoids running another useless iteration
MPI_Wait(&all_recv_stop, MPI_STATUS_IGNORE);
MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
MPI_Wait(&req, MPI_STATUS_IGNORE);
break;
}
MPI_Wait(&req, MPI_STATUS_IGNORE);
}
if (rank == root) {
if (forward_stop(&root_recv_stop, &all_recv_stop, found)) {
break;
}
} else {
int stop_signal;
MPI_Test(&all_recv_stop, &stop_signal, MPI_STATUS_IGNORE);
if (stop_signal)
{
MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
printf("Rank %d stopping after receiving signal.\n", rank);
break;
}
}
};
MPI_Finalize();
}
While this is not the simplest code, it should:
Introduce no additional blocking
Scale with the implementation of a barrier (usually O(log N))
Have a worst-case-latency from one found, to all stop of 2 * loop time ( + 1 p2p + 1 barrier + 1 reduction).
If many/all ranks find a solution at the same time, it still works but may be less efficient.
A possible strategy to derive this global stop condition in a non-blocking fashion is to rely on MPI_Test.
scenario
Consider that each process posts an asynchronous receive of type MPI_INT to its left rank with a given tag to build a ring. Then start your computation. If a rank encounters the stop condition it sends its own rank to its right rank. In the meantime each rank uses MPI_Test to check for the completion of the MPI_Irecv during the computation if it is completed then enter a branch first waiting the message and then transitively propagating the received rank to the right except if the right rank is equal to the payload of the message (not to loop).
This done you should have all processes in the branch, ready to trigger an arbitrary recovery operation.
Complexity
The topology retained is a ring as it minimizes the number of messages at most (n-1) however it augments the propagation time. Other topologies could be retained with more messages but lower spatial complexity for example a tree with a n.ln(n) complexity.
Implementation
Something like this.
int rank, size;
MPI_Init(&argc,&argv);
MPI_Comm_rank( MPI_COMM_WORLD, &rank);
MPI_Comm_size( MPI_COMM_WORLD, &size);
int left_rank = (rank==0)?(size-1):(rank-1);
int right_rank = (rank==(size-1))?0:(rank+1)%size;
int stop_cond_rank;
MPI_Request stop_cond_request;
int stop_cond= 0;
while( 1 )
{
MPI_Irecv( &stop_cond_rank, 1, MPI_INT, left_rank, 123, MPI_COMM_WORLD, &stop_cond_request);
/* Compute Here and set stop condition accordingly */
if( stop_cond )
{
/* Cancel the left recv */
MPI_Cancel( &stop_cond_request );
if( rank != right_rank )
MPI_Send( &rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD );
break;
}
int did_recv = 0;
MPI_Test( &stop_cond_request, &did_recv, MPI_STATUS_IGNORE );
if( did_recv )
{
stop_cond = 1;
MPI_Wait( &stop_cond_request, MPI_STATUS_IGNORE );
if( right_rank != stop_cond_rank )
MPI_Send( &stop_cond_rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD );
break;
}
}
if( stop_cond )
{
/* Handle the stop condition */
}
else
{
/* Cleanup */
MPI_Cancel( &stop_cond_request );
}
That is a question I've asked myself a few times without finding any completely satisfactory answer... The only thing I thought of (beside MPI_Abort() that does it but which is a bit extreme) is to create an MPI_Win storing a flag that will be raise by whichever process facing the problem, and checked by all processes regularly to see if they can go on processing. This is done using non-blocking calls, the same way as described in this answer.
The main weaknesses of this are:
This depends on the processes to willingly check the status of the flag. There is no immediate interruption of their work to notifying them.
The frequency of this checking must be adjusted by hand. You have to find the trade-off between the time you waste processing data while there's no need to because something happened elsewhere, and the time it takes to check the status...
In the end, what we would need is a way of defining a callback action triggered by an MPI call such as MPI_Abort() (basically replacing the abort action by something else). I don't think this exists, but maybe I overlooked it.
I'm trying to ISend() two arrays: arr1,arr2 and an integer n which is the size of arr1,arr2. I understood from this post that sending a struct that contains all three is not an option since n is only known at run time. Obviously, I need n to be received first since otherwise the receiving process wouldn't know how many elements to receive. What's the most efficient way to achieve this without using the blokcing Send() ?
Sending the size of the array is redundant (and inefficient) as MPI provides a way to probe for incoming messages without receiving them, which provides just enough information in order to properly allocate memory. Probing is performed with MPI_PROBE, which looks a lot like MPI_RECV, except that it takes no buffer related arguments. The probe operation returns a status object which can then be queried for the number of elements of a given MPI datatype that can be extracted from the content of the message with MPI_GET_COUNT, therefore explicitly sending the number of elements becomes redundant.
Here is a simple example with two ranks:
if (rank == 0)
{
MPI_Request req;
// Send a message to rank 1
MPI_Isend(arr1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
// Do not forget to complete the request!
MPI_Wait(&req, MPI_STATUS_IGNORE);
}
else if (rank == 1)
{
MPI_Status status;
// Wait for a message from rank 0 with tag 0
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
// Find out the number of elements in the message -> size goes to "n"
MPI_Get_count(&status, MPI_DOUBLE, &n);
// Allocate memory
arr1 = malloc(n*sizeof(double));
// Receive the message. ignore the status
MPI_Recv(arr1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_PROBE also accepts the wildcard rank MPI_ANY_SOURCE and the wildcard tag MPI_ANY_TAG. One can then consult the corresponding entry in the status structure in order to find out the actual sender rank and the actual message tag.
Probing for the message size works as every message carries a header, called envelope. The envelope consists of the sender's rank, the receiver's rank, the message tag and the communicator. It also carries information about the total message size. Envelopes are sent as part of the initial handshake between the two communicating processes.
Firstly you need to allocate memory (full memory = n = elements) to arr1 and arr2 with rank 0. i.e. your front end processor.
Divide the array into parts depending on the no. of processors. Determine the element count for each processor.
Send this element count to the other processors from rank 0.
The second send is for the array i.e. arr1 and arr2
In other processors allocate arr1 and arr2 according to the element count received from main processor i.e. rank = 0. After receiving element count, receive the two arrays in the allocated memories.
This is a sample C++ Implementation but C will follow the same logic. Also just interchange Send with Isend.
#include <mpi.h>
#include <iostream>
using namespace std;
int main(int argc, char*argv[])
{
MPI::Init (argc, argv);
int rank = MPI::COMM_WORLD.Get_rank();
int no_of_processors = MPI::COMM_WORLD.Get_size();
MPI::Status status;
double *arr1;
if (rank == 0)
{
// Setting some Random n
int n = 10;
arr1 = new double[n];
for(int i = 0; i < n; i++)
{
arr1[i] = i;
}
int part = n / no_of_processors;
int offset = n % no_of_processors;
// cout << part << "\t" << offset << endl;
for(int i = 1; i < no_of_processors; i++)
{
int start = i*part;
int end = start + part - 1;
if (i == (no_of_processors-1))
{
end += offset;
}
// cout << i << " Start: " << start << " END: " << end;
// Element_Count
int e_count = end - start + 1;
// cout << " e_count: " << e_count << endl;
// Sending
MPI::COMM_WORLD.Send(
&e_count,
1,
MPI::INT,
i,
0
);
// Sending Arr1
MPI::COMM_WORLD.Send(
(arr1+start),
e_count,
MPI::DOUBLE,
i,
1
);
}
}
else
{
// Element Count
int e_count;
// Receiving elements count
MPI::COMM_WORLD.Recv (
&e_count,
1,
MPI::INT,
0,
0,
status
);
arr1 = new double [e_count];
// Receiving FIrst Array
MPI::COMM_WORLD.Recv (
arr1,
e_count,
MPI::DOUBLE,
0,
1,
status
);
for(int i = 0; i < e_count; i++)
{
cout << arr1[i] << endl;
}
}
// if(rank == 0)
delete [] arr1;
MPI::Finalize();
return 0;
}
#Histro The point I want to make is, that Irecv/Isend are some functions themselves manipulated by MPI lib. The question u asked depend completely on your rest of the code about what you do after the Send/Recv. There are 2 cases:
Master and Worker
You send part of the problem (say arrays) to the workers (all other ranks except 0=Master). The worker does some work (on the arrays) then returns back the results to the master. The master then adds up the result, and convey new work to the workers. Now, here you would want the master to wait for all the workers to return their result (modified arrays). So you cannot use Isend and Irecv but a multiple send as used in my code and corresponding recv. If your code is in this direction you wanna use B_cast and MPI_Reduce.
Lazy Master
The master divides the work but doesn't care of the result from his workers. Say you want to program a pattern of different kinds for same data. Like given characteristics of population of some city, you want to calculate the patterns like how many are above 18, how
many have jobs, how much of them work in some company. Now these results don't have anything to do with one another. In this case you don't have to worry about whether the data is received by the workers or not. The master can continue to execute the rest of the code. This is where it is safe to use Isend/Irecv.
I have a struct:
typedef struct
{
double distance;
int* path;
} tour;
Then I trying to gather results from all processes:
MPI_Gather(&best, sizeof(tour), MPI_BEST, all_best, sizeof(tour)*proc_count, MPI_BEST, 0, MPI_COMM_WORLD);
After gather my root see that all_best containts only 1 normal element and trash in others.
Type of all_best is tour*.
Initialisation of MPI_BEST:
void ACO_Build_best(tour *tour,int city_count, MPI_Datatype *mpi_type /*out*/)
{
int block_lengths[2];
MPI_Aint displacements[2];
MPI_Datatype typelist[2];
MPI_Aint start_address;
MPI_Aint address;
block_lengths[0] = 1;
block_lengths[1] = city_count;
typelist[0] = MPI_DOUBLE;
typelist[1] = MPI_INT;
MPI_Address(&(tour->distance), &displacements[0]);
MPI_Address(&(tour->path), &displacements[1]);
displacements[1] = displacements[1] - displacements[0];
displacements[0] = 0;
MPI_Type_struct(2, block_lengths, displacements, typelist, mpi_type);
MPI_Type_commit(mpi_type);
}
Any ideas are welcome.
Apart from passing incorrect lengths to MPI_Gather, MPI actually does not follow pointers to pointers. With such a structured type you would be sending the value of distance and the value of the path pointer (essentially an address which makes no sense when sent to other processes). If one supposes that distance essentially gives the number of elements in path, then you can kind of achieve your goal with a combination of MPI_Gather and MPI_Gatherv:
First, gather the lengths:
int counts[proc_count];
MPI_Gather(&best->distance, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
Now that counts is populated with the correct lengths, you can continue and use MPI_Gatherv to receive all paths:
int disps[proc_count];
disps[0] = 0;
for (int i = 1; i < proc_count; i++)
disps[i] = disps[i-1] + counts[i-1];
// Allocate space for the concatenation of all paths
int *all_paths = malloc((disps[proc_count-1] + counts[proc_count-1])*sizeof(int));
MPI_Gatherv(best->path, best->distance, MPI_INT,
all_paths, counts, disps, MPI_INT, 0, MPI_COMM_WORLD);
Now you have the concatenation of all paths in all_paths. You can examine or extract an individual path by taking counts[i] elements starting at position disps[i] in all_paths. Or you can even build an array of tour structures and make them use the already allocated and populated path storage:
tour *all_best = malloc(proc_count*sizeof(tour));
for (int i = 0; i < proc_count; i++)
{
all_best[i].distance = counts[i];
all_best[i].path = &all_paths[disps[i]];
}
Or you can duplicate the segments instead:
for (int i = 0; i < proc_count; i++)
{
all_best[i].distance = counts[i];
all_best[i].path = malloc(counts[i]*sizeof(int));
memcpy(all_best[i].path, &all_paths[disps[i]], counts[i]*sizeof(int));
}
// all_paths is not needed any more and can be safely free()-ed
Edit: Because I've overlooked the definition of the tour structure, the above code actually works with:
struct
{
int distance;
int *path;
}
where distance holds the number of significant elements in path. This is different from your case, but without some information on how tour.path is being allocated (and sized), it's hard to give a specific solution.
I have an application that stores a vector of structs. These structs hold information about each GPU on a system like memory and giga-flop/s. There are a different number of GPUs on each system.
I have a program that runs on multiple machines at once and I need to collect this data. I am very new to MPI but am able to use MPI_Gather() for the most part, however I would like to know how to gather/receive these dynamically sized vectors.
class MachineData
{
unsigned long hostMemory;
long cpuCores;
int cudaDevices;
public:
std::vector<NviInfo> nviVec;
std::vector<AmdInfo> amdVec;
...
};
struct AmdInfo
{
int platformID;
int deviceID;
cl_device_id device;
long gpuMem;
float sgflops;
double dgflops;
};
Each machine in a cluster populates its instance of MachineData. I want to gather each of these instances, but I am unsure how to approach gathering nviVec and amdVec since their length varies on each machine.
You can use MPI_GATHERV in combination with MPI_GATHER to accomplish that. MPI_GATHERV is the variable version of MPI_GATHER and it allows for the root rank to gather differt number of elements from each sending process. But in order for the root rank to specify these numbers it has to know how many elements each rank is holding. This could be achieved using simple single element MPI_GATHER before that. Something like this:
// To keep things simple: root is fixed to be rank 0 and MPI_COMM_WORLD is used
// Number of MPI processes and current rank
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int *counts = new int[size];
int nelements = (int)vector.size();
// Each process tells the root how many elements it holds
MPI_Gather(&nelements, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
// Displacements in the receive buffer for MPI_GATHERV
int *disps = new int[size];
// Displacement for the first chunk of data - 0
for (int i = 0; i < size; i++)
disps[i] = (i > 0) ? (disps[i-1] + counts[i-1]) : 0;
// Place to hold the gathered data
// Allocate at root only
type *alldata = NULL;
if (rank == 0)
// disps[size-1]+counts[size-1] == total number of elements
alldata = new int[disps[size-1]+counts[size-1]];
// Collect everything into the root
MPI_Gatherv(vectordata, nelements, datatype,
alldata, counts, disps, datatype, 0, MPI_COMM_WORLD);
You should also register MPI derived datatype (datatype in the code above) for the structures (binary sends will work but won't be portable and will not work in heterogeneous setups).
My first thought was MPI_Scatter and send-buffer allocation should be used in if(proc_id == 0) clause, because the data should be scattered only once and each process needs only a portion of data in send-buffer, however it didn't work correctly.
It appears that send-buffer allocation and MPI_Scatter must be executed by all processes before the application goes right.
So I wander, what's the philosophy for the existence of MPI_Scatter since all processes have access to the send-buffer.
Any help will be grateful.
Edit:
Code I wrote like this:
if (proc_id == 0) {
int * data = (int *)malloc(size*sizeof(int) * proc_size * recv_size);
for (int i = 0; i < proc_size * recv_size; i++) data[i] = i;
ierr = MPI_Scatter(&(data[0]), recv_size, MPI_INT, &recievedata, recv_size, MPI_INT, 0, MPI_COMM_WORLD);
}
I thought, that's enough for root processes to scatter data, what the other processes need to do is just receiving data. So I put MPI_Scatter, along with send buffer definition & allocation, in the if(proc_id == 0) statement. No compile/runtime error/warning, but the receive buffer of other processes didn't receive it's corresponding part of data.
Your question isn't very clear, and would be a lot easier to understand if you showed some code that you were having trouble with. Here's what I think you're asking -- and I'm only guessing this because this is an error I've seen people new to MPI in C make.
If you have some code like this:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv) {
int proc_id, size, ierr;
int *data;
int recievedata;
ierr = MPI_Init(&argc, &argv);
ierr|= MPI_Comm_size(MPI_COMM_WORLD,&size);
ierr|= MPI_Comm_rank(MPI_COMM_WORLD,&proc_id);
if (proc_id == 0) {
data = (int *)malloc(size*sizeof(int));
for (int i=0; i<size; i++) data[i] = i;
}
ierr = MPI_Scatter(&(data[0]), 1, MPI_INT,
&recievedata, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("Rank %d recieved <%d>\n", proc_id, recievedata);
if (proc_id == 0) free(data);
ierr = MPI_Finalize();
return 0;
}
why doesn't it work, and why do you get a segmentation fault? Of course the other processes don't have access to data; that's the whole point.
The answer is that in the non-root processes, the sendbuf argument (the first argument to MPI_Scatter()) isn't used. So the non-root processes don't need access to data. But you still can't go around dereferencing a pointer that you haven't defined. So you need to make sure all the C code is valid. But data can be NULL or completely undefined on all the other processes; you just have to make sure you're not accidentally dereferencing it. So this works just fine, for instance:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv) {
int proc_id, size, ierr;
int *data;
int recievedata;
ierr = MPI_Init(&argc, &argv);
ierr|= MPI_Comm_size(MPI_COMM_WORLD,&size);
ierr|= MPI_Comm_rank(MPI_COMM_WORLD,&proc_id);
if (proc_id == 0) {
data = (int *)malloc(size*sizeof(int));
for (int i=0; i<size; i++) data[i] = i;
} else {
data = NULL;
}
ierr = MPI_Scatter(data, 1, MPI_INT,
&recievedata, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("Rank %d recieved <%d>\n", proc_id, recievedata);
if (proc_id == 0) free(data);
ierr = MPI_Finalize();
return 0;
}
If you're using "multidimensional arrays" in C, and say scattering a row of a matrix, then you have to jump through an extra hoop or two to make this work, but it's still pretty easy.
Update:
Note that in the above code, all routines called Scatter - both the sender and the recievers. (Actually, the sender is also a receiver).
In the message passing paradigm, both the sender and the receiver have to cooperate to send data. In principle, these tasks could be on different computers, housed perhaps in different buildings -- nothing is shared between them. So there's no way for Task 1 to just "put" data into some part of Task 2's memory. (Note that MPI2 has "one sided messages", but even that requires a significant degree of cordination between sender and reciever, as a window has to be put asside to push data into or pull data out of).
The classic example of this is send/recieve pairs; it's not enough that (say) process 0 sends data to process 3, process 3 also has to recieve data.
The MPI_Scatter function contains both send and recieve logic. The root process (specified here as 0) sends out the data, and all the recievers recieve; everyone participating has to call the routine. Scatter is an example of an MPI Collective Operation, where all tasks in the communicator have to call the same routine. Broadcast, barrier, reduction operations, and gather operations are other examples.
If you have only process 0 call the scatter operation, your program will hang, waiting forever for the other tasks to participate.