MPI Receive/Gather Dynamic Vector Length - mpi

I have an application that stores a vector of structs. These structs hold information about each GPU on a system like memory and giga-flop/s. There are a different number of GPUs on each system.
I have a program that runs on multiple machines at once and I need to collect this data. I am very new to MPI but am able to use MPI_Gather() for the most part, however I would like to know how to gather/receive these dynamically sized vectors.
class MachineData
{
unsigned long hostMemory;
long cpuCores;
int cudaDevices;
public:
std::vector<NviInfo> nviVec;
std::vector<AmdInfo> amdVec;
...
};
struct AmdInfo
{
int platformID;
int deviceID;
cl_device_id device;
long gpuMem;
float sgflops;
double dgflops;
};
Each machine in a cluster populates its instance of MachineData. I want to gather each of these instances, but I am unsure how to approach gathering nviVec and amdVec since their length varies on each machine.

You can use MPI_GATHERV in combination with MPI_GATHER to accomplish that. MPI_GATHERV is the variable-length version of MPI_GATHER: it allows the root rank to gather a different number of elements from each sending process. But in order for the root rank to specify these numbers, it has to know how many elements each rank is holding. This can be achieved with a simple single-element MPI_GATHER beforehand. Something like this:
// To keep things simple: root is fixed to be rank 0 and MPI_COMM_WORLD is used
// Number of MPI processes and current rank
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int *counts = new int[size];
int nelements = (int)vector.size();
// Each process tells the root how many elements it holds
MPI_Gather(&nelements, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
// Displacements in the receive buffer for MPI_GATHERV
int *disps = new int[size];
// Displacement for the first chunk of data - 0
for (int i = 0; i < size; i++)
disps[i] = (i > 0) ? (disps[i-1] + counts[i-1]) : 0;
// Place to hold the gathered data
// Allocate at root only
type *alldata = NULL;
if (rank == 0)
// disps[size-1]+counts[size-1] == total number of elements
alldata = new type[disps[size-1]+counts[size-1]];
// Collect everything into the root
MPI_Gatherv(vectordata, nelements, datatype,
alldata, counts, disps, datatype, 0, MPI_COMM_WORLD);
You should also register an MPI derived datatype (datatype in the code above) for the structures. Sending them as raw bytes will work, but it is not portable and will break in heterogeneous setups.

Related

How to do an MPI_Scatter in MPI to all nodes except the root?

In MPI, if I perform an MPI_Scatter on MPI_COMM_WORLD, all the nodes receive some data (including the sending root).
How do I perform an MPI_Scatter from a root node to all the other nodes and make sure the root node does not receive any data?
Is creating a new MPI_Comm containing all the nodes but the root a viable approach?
Let's imagine your code looks like that:
int rank, size; // rank of the process and size of the communicator
int root = 0; // root process of our scatter
int recvCount = 4; // or whatever
double *sendBuf = rank == root ? new double[recvCount * size] : NULL;
double *recvBuf = new double[recvCount];
MPI_Scatter( sendBuf, recvCount, MPI_DOUBLE,
recvBuf, recvCount, MPI_DOUBLE,
root, MPI_COMM_WORLD );
So in here, indeed, the root process will send data to itself although this could be avoided.
Here are the two obvious methods that come to mind to achieve that.
Using MPI_IN_PLACE
The call to MPI_Scatter() wouldn't have to change. The only change in the code would be for the definition of the receiving buffer, which would become something like this:
double *recvBuf = rank == root ?
static_cast<double*>( MPI_IN_PLACE ) :
new double[recvCount];
Using MPI_Scatterv()
With that, you'd have to define an array of integers describing the individual receive sizes and an array of displacements describing the starting indexes, and use them in a call to MPI_Scatterv() that replaces your call to MPI_Scatter(), like this:
int *sendCounts = new int[size];
int *displs = new int[size];
for ( int i = 0; i < size; i++ ) {
    sendCounts[i] = recvCount; // everybody receives recvCount data
    displs[i] = i * recvCount;
}
sendCounts[root] = 0; // but the root process
// The root's receive count has to match its (zero) send count
MPI_Scatterv( sendBuf, sendCounts, displs, MPI_DOUBLE,
              recvBuf, rank == root ? 0 : recvCount, MPI_DOUBLE,
              root, MPI_COMM_WORLD );
Of course, in both cases the root process receives no data in its receive buffer, and your code has to account for that.
I personally prefer the first option, but both work.

C MPI multiple dynamic array passing

I'm trying to Isend() two arrays, arr1 and arr2, and an integer n which is the size of arr1 and arr2. I understood from this post that sending a struct that contains all three is not an option since n is only known at run time. Obviously, I need n to be received first, since otherwise the receiving process wouldn't know how many elements to receive. What's the most efficient way to achieve this without using the blocking Send()?
Sending the size of the array is redundant (and inefficient) as MPI provides a way to probe for incoming messages without receiving them, which provides just enough information in order to properly allocate memory. Probing is performed with MPI_PROBE, which looks a lot like MPI_RECV, except that it takes no buffer related arguments. The probe operation returns a status object which can then be queried for the number of elements of a given MPI datatype that can be extracted from the content of the message with MPI_GET_COUNT, therefore explicitly sending the number of elements becomes redundant.
Here is a simple example with two ranks:
if (rank == 0)
{
MPI_Request req;
// Send a message to rank 1
MPI_Isend(arr1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
// Do not forget to complete the request!
MPI_Wait(&req, MPI_STATUS_IGNORE);
}
else if (rank == 1)
{
MPI_Status status;
// Wait for a message from rank 0 with tag 0
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
// Find out the number of elements in the message -> size goes to "n"
MPI_Get_count(&status, MPI_DOUBLE, &n);
// Allocate memory
arr1 = malloc(n*sizeof(double));
// Receive the message. ignore the status
MPI_Recv(arr1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_PROBE also accepts the wildcard rank MPI_ANY_SOURCE and the wildcard tag MPI_ANY_TAG. One can then consult the corresponding entry in the status structure in order to find out the actual sender rank and the actual message tag.
Probing for the message size works as every message carries a header, called envelope. The envelope consists of the sender's rank, the receiver's rank, the message tag and the communicator. It also carries information about the total message size. Envelopes are sent as part of the initial handshake between the two communicating processes.
First, allocate the full arrays arr1 and arr2 (all n elements) on rank 0, i.e. your front-end process.
Divide the arrays into parts according to the number of processes and determine the element count for each process.
Send this element count from rank 0 to the other processes.
The second send carries the array data itself, i.e. arr1 and arr2.
On the other processes, allocate arr1 and arr2 according to the element count received from rank 0, then receive the two arrays into the allocated memory.
This is a sample C++ implementation, but C follows the same logic. To make it non-blocking, swap Send for Isend (and complete each request with a wait).
#include <mpi.h>
#include <iostream>
using namespace std;
int main(int argc, char*argv[])
{
MPI::Init (argc, argv);
int rank = MPI::COMM_WORLD.Get_rank();
int no_of_processors = MPI::COMM_WORLD.Get_size();
MPI::Status status;
double *arr1;
if (rank == 0)
{
// Setting some Random n
int n = 10;
arr1 = new double[n];
for(int i = 0; i < n; i++)
{
arr1[i] = i;
}
int part = n / no_of_processors;
int offset = n % no_of_processors;
// cout << part << "\t" << offset << endl;
for(int i = 1; i < no_of_processors; i++)
{
int start = i*part;
int end = start + part - 1;
if (i == (no_of_processors-1))
{
end += offset;
}
// cout << i << " Start: " << start << " END: " << end;
// Element_Count
int e_count = end - start + 1;
// cout << " e_count: " << e_count << endl;
// Sending
MPI::COMM_WORLD.Send(
&e_count,
1,
MPI::INT,
i,
0
);
// Sending Arr1
MPI::COMM_WORLD.Send(
(arr1+start),
e_count,
MPI::DOUBLE,
i,
1
);
}
}
else
{
// Element Count
int e_count;
// Receiving elements count
MPI::COMM_WORLD.Recv (
&e_count,
1,
MPI::INT,
0,
0,
status
);
arr1 = new double [e_count];
// Receiving FIrst Array
MPI::COMM_WORLD.Recv (
arr1,
e_count,
MPI::DOUBLE,
0,
1,
status
);
for(int i = 0; i < e_count; i++)
{
cout << arr1[i] << endl;
}
}
// if(rank == 0)
delete [] arr1;
MPI::Finalize();
return 0;
}
@Histro The point I want to make is that Isend/Irecv are themselves just functions handled by the MPI library. The answer to your question depends entirely on what the rest of your code does after the Send/Recv. There are two cases:
Master and Worker
You send part of the problem (say, arrays) to the workers (all ranks except 0, the master). Each worker does some work on the arrays and then returns the results to the master. The master then combines the results and hands new work to the workers. Here you want the master to wait for all the workers to return their results (the modified arrays). So you would not use Isend and Irecv here but multiple sends, as in my code, with matching receives. If your code goes in this direction, you probably want to use MPI_Bcast and MPI_Reduce.
Lazy Master
The master divides the work but doesn't care about the results from its workers. Say you want to compute several different patterns over the same data. For example, given the characteristics of the population of some city, you want to calculate how many people are above 18, how many have jobs, and how many of them work in some company. These results have nothing to do with one another. In this case you don't have to worry about whether the data has been received by the workers or not. The master can continue to execute the rest of the code. This is where it is safe to use Isend/Irecv.

mpi_gather for struct with dynamic array

I have a struct:
typedef struct
{
double distance;
int* path;
} tour;
Then I try to gather the results from all processes:
MPI_Gather(&best, sizeof(tour), MPI_BEST, all_best, sizeof(tour)*proc_count, MPI_BEST, 0, MPI_COMM_WORLD);
After the gather, my root sees that all_best contains only one valid element and garbage in the others.
Type of all_best is tour*.
Initialisation of MPI_BEST:
void ACO_Build_best(tour *tour,int city_count, MPI_Datatype *mpi_type /*out*/)
{
int block_lengths[2];
MPI_Aint displacements[2];
MPI_Datatype typelist[2];
MPI_Aint start_address;
MPI_Aint address;
block_lengths[0] = 1;
block_lengths[1] = city_count;
typelist[0] = MPI_DOUBLE;
typelist[1] = MPI_INT;
MPI_Address(&(tour->distance), &displacements[0]);
MPI_Address(&(tour->path), &displacements[1]);
displacements[1] = displacements[1] - displacements[0];
displacements[0] = 0;
MPI_Type_struct(2, block_lengths, displacements, typelist, mpi_type);
MPI_Type_commit(mpi_type);
}
Any ideas are welcome.
Apart from passing incorrect lengths to MPI_Gather, MPI actually does not follow pointers to pointers. With such a structured type you would be sending the value of distance and the value of the path pointer (essentially an address which makes no sense when sent to other processes). If one supposes that distance essentially gives the number of elements in path, then you can kind of achieve your goal with a combination of MPI_Gather and MPI_Gatherv:
First, gather the lengths:
int counts[proc_count];
MPI_Gather(&best->distance, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
Now that counts is populated with the correct lengths, you can continue and use MPI_Gatherv to receive all paths:
int disps[proc_count];
disps[0] = 0;
for (int i = 1; i < proc_count; i++)
disps[i] = disps[i-1] + counts[i-1];
// Allocate space for the concatenation of all paths
int *all_paths = malloc((disps[proc_count-1] + counts[proc_count-1])*sizeof(int));
MPI_Gatherv(best->path, best->distance, MPI_INT,
all_paths, counts, disps, MPI_INT, 0, MPI_COMM_WORLD);
Now you have the concatenation of all paths in all_paths. You can examine or extract an individual path by taking counts[i] elements starting at position disps[i] in all_paths. Or you can even build an array of tour structures and make them use the already allocated and populated path storage:
tour *all_best = malloc(proc_count*sizeof(tour));
for (int i = 0; i < proc_count; i++)
{
all_best[i].distance = counts[i];
all_best[i].path = &all_paths[disps[i]];
}
Or you can duplicate the segments instead:
for (int i = 0; i < proc_count; i++)
{
all_best[i].distance = counts[i];
all_best[i].path = malloc(counts[i]*sizeof(int));
memcpy(all_best[i].path, &all_paths[disps[i]], counts[i]*sizeof(int));
}
// all_paths is not needed any more and can be safely free()-ed
Edit: Because I've overlooked the definition of the tour structure, the above code actually works with:
struct
{
int distance;
int *path;
}
where distance holds the number of significant elements in path. This is different from your case, but without some information on how tour.path is being allocated (and sized), it's hard to give a specific solution.

mpi gather collect data

I am convinced that MPI_Gather collects data from all processes, including the root process itself.
How can I make MPI_Gather collect data from all processes except the root process?
Or is there any alternative function?
Duplicate the functionality of MPI_Gather using MPI_Gatherv but specify 0 as the chunk size for the root rank instead. Something like this:
int rank, size, disp = 0;
int *cnts, *displs;
MPI_Comm_size(MPI_COMM_WORLD, &size);
cnts = malloc(size * sizeof(int));
displs = malloc(size * sizeof(int));
for (rank = 0; rank < size; rank++)
{
    cnts[rank] = (rank != root) ? count : 0;
    displs[rank] = disp;
    disp += cnts[rank];
}
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Gatherv(data, cnts[rank], data_type,
bigdata, cnts, displs, data_type,
root, MPI_COMM_WORLD);
free(displs); free(cnts);
Note that MPI_Gatherv could be significantly slower than MPI_Gather because the MPI implementation would be most likely unable to optimise the communication path and would fall back to some dumb linear implementation of the gather operation. So it might make sense to still use MPI_Gather and to provide some dummy data in the root process.
You could also supply MPI_IN_PLACE as the value of the root process send buffer and it would not send data to itself, but then again you would have to reserve place for the root data in the receive buffer (the in-place operation expects that the root would place its data directly in the correct position inside the receive buffer):
if (rank != root)
MPI_Gather(data, count, data_type,
NULL, count, data_type, root, MPI_COMM_WORLD);
else
MPI_Gather(MPI_IN_PLACE, count, data_type,
big_data, count, data_type, root, MPI_COMM_WORLD);

MPI Barrier C++

I want to use MPI (MPICH2) on windows. I write this command:
MPI_Barrier(MPI_COMM_WORLD);
I expect it to block all processes until every member of the group has called it, but that does not happen. Here is a sketch of my code:
int a;
if(myrank == RootProc)
a = 4;
MPI_Barrier(MPI_COMM_WORLD);
cout << "My Rank = " << myrank << "\ta = " << a << endl;
(With 2 processes:) The root process (0) behaves correctly, but the process with rank 1 doesn't know the value of a, so it displays -858993460 instead of 4.
Can anyone help me?
Regards
You're only assigning a in process 0. MPI doesn't share memory, so if you want the a in process 1 to get the value of 4, you need to call MPI_Send from process 0 and MPI_Recv from process 1.
Variable a is not initialized, which is probably why it displays that number. In MPI, every process has its own copy of a, so there are two values of a, one of which is uninitialized. You want to write:
int a = 4;
if (myrank == RootProc)
...
Or, alternatively, do an MPI_Send in the root (id 0) and an MPI_Recv in the slave (id 1), so the value in the root is also set in the slave.
Note: that code triggers a small alarm in my head, so I need to check something and I'll edit this with more info. Until then though, the uninitialized value is most certainly a problem for you.
Ok I've checked the facts - your code was not properly indented and I missed the missing {}. The barrier looks fine now, although the snippet you posted does not do too much, and is not a very good example of a barrier because the slave enters it directly, whereas the root will set the value of the variable to 4 and then enter it. To test that it actually works, you probably want some sort of a sleep mechanism in one of the processes - that will yield (hope it's the correct term) the other process as well, preventing it from printing the cout until the sleep is over.
Blocking is not enough; you have to send the data to the other processes (memory is not shared between processes).
To share data across ALL processes use:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
so in your case:
MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
Here you send one integer, pointed to by &a, from process 0 to all the others.
//MPI_Bcast is sender for root process and receiver for non-root processes
You can also send data to a specific process with:
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm )
and then receive by:
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Resources