mpi_gather for struct with dynamic array - mpi

I have a struct:
typedef struct
{
double distance;
int* path;
} tour;
Then I try to gather the results from all processes:
MPI_Gather(&best, sizeof(tour), MPI_BEST, all_best, sizeof(tour)*proc_count, MPI_BEST, 0, MPI_COMM_WORLD);
After the gather, the root sees that all_best contains only one valid element and garbage in the others.
The type of all_best is tour*.
Initialisation of MPI_BEST:
void ACO_Build_best(tour *tour,int city_count, MPI_Datatype *mpi_type /*out*/)
{
int block_lengths[2];
MPI_Aint displacements[2];
MPI_Datatype typelist[2];
MPI_Aint start_address;
MPI_Aint address;
block_lengths[0] = 1;
block_lengths[1] = city_count;
typelist[0] = MPI_DOUBLE;
typelist[1] = MPI_INT;
MPI_Address(&(tour->distance), &displacements[0]);
MPI_Address(&(tour->path), &displacements[1]);
displacements[1] = displacements[1] - displacements[0];
displacements[0] = 0;
MPI_Type_struct(2, block_lengths, displacements, typelist, mpi_type);
MPI_Type_commit(mpi_type);
}
Any ideas are welcome.

Apart from the incorrect counts passed to MPI_Gather (counts are numbers of elements of the given datatype, not byte sizes, and the receive count is the count per process), MPI does not follow pointers. With such a structured type you would be sending the value of distance and the value of the path pointer (essentially an address, which makes no sense when sent to other processes). If one supposes that distance gives the number of elements in path, then you can achieve your goal with a combination of MPI_Gather and MPI_Gatherv:
First, gather the lengths:
int counts[proc_count];
MPI_Gather(&best->distance, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
Now that counts is populated with the correct lengths, you can continue and use MPI_Gatherv to receive all paths:
int disps[proc_count];
disps[0] = 0;
for (int i = 1; i < proc_count; i++)
disps[i] = disps[i-1] + counts[i-1];
// Allocate space for the concatenation of all paths
int *all_paths = malloc((disps[proc_count-1] + counts[proc_count-1])*sizeof(int));
MPI_Gatherv(best->path, best->distance, MPI_INT,
all_paths, counts, disps, MPI_INT, 0, MPI_COMM_WORLD);
Now you have the concatenation of all paths in all_paths. You can examine or extract an individual path by taking counts[i] elements starting at position disps[i] in all_paths. Or you can even build an array of tour structures and make them use the already allocated and populated path storage:
tour *all_best = malloc(proc_count*sizeof(tour));
for (int i = 0; i < proc_count; i++)
{
all_best[i].distance = counts[i];
all_best[i].path = &all_paths[disps[i]];
}
Or you can duplicate the segments instead:
for (int i = 0; i < proc_count; i++)
{
all_best[i].distance = counts[i];
all_best[i].path = malloc(counts[i]*sizeof(int));
memcpy(all_best[i].path, &all_paths[disps[i]], counts[i]*sizeof(int));
}
// all_paths is not needed any more and can be safely free()-ed
Edit: Because I've overlooked the definition of the tour structure, the above code actually works with:
struct
{
int distance;
int *path;
}
where distance holds the number of significant elements in path. This is different from your case, but without some information on how tour.path is being allocated (and sized), it's hard to give a specific solution.
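The displacement bookkeeping used with MPI_Gatherv above can be tried out without MPI. Here is a minimal sketch in plain C; build_displacements and segment_of are hypothetical helper names (not MPI functions), and the sketch assumes counts has already been filled, for example by the first MPI_Gather:

```c
/* Fill disps[] with the exclusive prefix sum of counts[] and return the
 * total number of elements, i.e. the length the receive buffer needs.
 * (Hypothetical helper, not part of MPI.) */
int build_displacements(const int *counts, int nprocs, int *disps)
{
    int total = 0;
    for (int i = 0; i < nprocs; i++) {
        disps[i] = total;   /* rank i's data starts where rank i-1's ended */
        total += counts[i];
    }
    return total;
}

/* Return a pointer to the first element contributed by the given rank
 * inside the concatenated buffer. (Hypothetical helper.) */
const int *segment_of(const int *all_paths, const int *disps, int rank)
{
    return all_paths + disps[rank];
}
```

The returned total is the same quantity as disps[proc_count-1] + counts[proc_count-1] in the malloc call above, and segment_of(all_paths, disps, i) is the same address as &all_paths[disps[i]].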

Related

How to do an MPI_Scatter in MPI to all nodes except the root?

In MPI, if I perform an MPI_Scatter on MPI_COMM_WORLD, all the nodes receive some data (including the sending root).
How do I perform an MPI_Scatter from a root node to all the other nodes and make sure the root node does not receive any data?
Is creating a new MPI_Comm containing all the nodes but the root a viable approach?
Let's imagine your code looks like this:
int rank, size; // rank of the process and size of the communicator
int root = 0; // root process of our scatter
int recvCount = 4; // or whatever
double *sendBuf = rank == root ? new double[recvCount * size] : NULL;
double *recvBuf = new double[recvCount];
MPI_Scatter( sendBuf, recvCount, MPI_DOUBLE,
recvBuf, recvCount, MPI_DOUBLE,
root, MPI_COMM_WORLD );
So in here, indeed, the root process will send data to itself although this could be avoided.
Here are the two obvious methods that come to mind to achieve that.
Using MPI_IN_PLACE
The call to MPI_Scatter() wouldn't have to change. The only change in the code would be for the definition of the receiving buffer, which would become something like this:
double *recvBuf = rank == root ?
static_cast<double*>( MPI_IN_PLACE ) :
new double[recvCount];
Using MPI_Scatterv()
With that, you'd have to define an array of integers describing the individual receive sizes and an array of displacements describing the starting indexes, and use them in a call to MPI_Scatterv() which would replace your call to MPI_Scatter() like this:
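// Requires #include <vector>
std::vector<int> sendCounts(size, recvCount); // everybody receives recvCount elements
sendCounts[root] = 0; // except the root process
std::vector<int> displs(size);
for ( int i = 0; i < size; i++ ) {
displs[i] = i * recvCount;
}
// The root's own receive count must match sendCounts[root], i.e. 0
MPI_Scatterv( sendBuf, sendCounts.data(), displs.data(), MPI_DOUBLE,
recvBuf, rank == root ? 0 : recvCount, MPI_DOUBLE,
root, MPI_COMM_WORLD );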
Of course, in both cases the root's receive buffer would receive no data, and your code would have to account for that.
I personally prefer the first option, but both work.
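For the MPI_Scatterv() variant, the two arrays can be filled by a small routine. This is only a sketch in plain C without any MPI calls; scatterv_layout_skip_root is a hypothetical name, and it assumes the send buffer keeps the same layout as in the MPI_Scatter() version (one chunk per rank, including an unused one for the root):

```c
/* Fill the per-rank send counts and displacements so that every rank
 * except root receives chunk elements and root receives nothing.
 * (Hypothetical helper; the arrays must have room for size entries.) */
void scatterv_layout_skip_root(int size, int root, int chunk,
                               int *sendCounts, int *displs)
{
    for (int i = 0; i < size; i++) {
        sendCounts[i] = (i == root) ? 0 : chunk; /* nothing for the root */
        displs[i] = i * chunk;                   /* unchanged buffer layout */
    }
}
```

The resulting arrays would then be passed as the second and third arguments of MPI_Scatterv().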

C MPI multiple dynamic array passing

I'm trying to ISend() two arrays, arr1 and arr2, and an integer n which is the size of arr1 and arr2. I understood from this post that sending a struct that contains all three is not an option since n is only known at run time. Obviously, I need n to be received first since otherwise the receiving process wouldn't know how many elements to receive. What's the most efficient way to achieve this without using the blocking Send()?
Sending the size of the array is redundant (and inefficient) as MPI provides a way to probe for incoming messages without receiving them, which provides just enough information in order to properly allocate memory. Probing is performed with MPI_PROBE, which looks a lot like MPI_RECV, except that it takes no buffer related arguments. The probe operation returns a status object which can then be queried for the number of elements of a given MPI datatype that can be extracted from the content of the message with MPI_GET_COUNT, therefore explicitly sending the number of elements becomes redundant.
Here is a simple example with two ranks:
if (rank == 0)
{
MPI_Request req;
// Send a message to rank 1
MPI_Isend(arr1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
// Do not forget to complete the request!
MPI_Wait(&req, MPI_STATUS_IGNORE);
}
else if (rank == 1)
{
MPI_Status status;
// Wait for a message from rank 0 with tag 0
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
// Find out the number of elements in the message -> size goes to "n"
MPI_Get_count(&status, MPI_DOUBLE, &n);
// Allocate memory
arr1 = malloc(n*sizeof(double));
// Receive the message. ignore the status
MPI_Recv(arr1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_PROBE also accepts the wildcard rank MPI_ANY_SOURCE and the wildcard tag MPI_ANY_TAG. One can then consult the corresponding entry in the status structure in order to find out the actual sender rank and the actual message tag.
Probing for the message size works as every message carries a header, called envelope. The envelope consists of the sender's rank, the receiver's rank, the message tag and the communicator. It also carries information about the total message size. Envelopes are sent as part of the initial handshake between the two communicating processes.
First, allocate the full arrays arr1 and arr2 (all n elements) on rank 0, i.e. your front-end process.
Divide the arrays into parts according to the number of processes and determine the element count for each process.
Send this element count from rank 0 to each of the other processes.
The second send is for the array data itself, i.e. arr1 and arr2.
On the other processes, allocate arr1 and arr2 according to the element count received from rank 0, then receive the two arrays into the allocated memory.
Here is a sample C++ implementation; C follows the same logic. To make it non-blocking, replace Send with Isend (and complete each request before reusing its buffer).
#include <mpi.h>
#include <iostream>
using namespace std;
int main(int argc, char*argv[])
{
MPI::Init (argc, argv);
int rank = MPI::COMM_WORLD.Get_rank();
int no_of_processors = MPI::COMM_WORLD.Get_size();
MPI::Status status;
double *arr1;
if (rank == 0)
{
// Setting some Random n
int n = 10;
arr1 = new double[n];
for(int i = 0; i < n; i++)
{
arr1[i] = i;
}
int part = n / no_of_processors;
int offset = n % no_of_processors;
// cout << part << "\t" << offset << endl;
for(int i = 1; i < no_of_processors; i++)
{
int start = i*part;
int end = start + part - 1;
if (i == (no_of_processors-1))
{
end += offset;
}
// cout << i << " Start: " << start << " END: " << end;
// Element_Count
int e_count = end - start + 1;
// cout << " e_count: " << e_count << endl;
// Sending
MPI::COMM_WORLD.Send(
&e_count,
1,
MPI::INT,
i,
0
);
// Sending Arr1
MPI::COMM_WORLD.Send(
(arr1+start),
e_count,
MPI::DOUBLE,
i,
1
);
}
}
else
{
// Element Count
int e_count;
// Receiving elements count
MPI::COMM_WORLD.Recv (
&e_count,
1,
MPI::INT,
0,
0,
status
);
arr1 = new double [e_count];
// Receiving first array
MPI::COMM_WORLD.Recv (
arr1,
e_count,
MPI::DOUBLE,
0,
1,
status
);
for(int i = 0; i < e_count; i++)
{
cout << arr1[i] << endl;
}
}
// if(rank == 0)
delete [] arr1;
MPI::Finalize();
return 0;
}
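The partitioning arithmetic in the program above (part, offset, start, end) can be isolated into a small function. Here is a sketch in plain C, assuming the same scheme where the last rank takes the remainder; chunk_t and chunk_for_rank are names made up for the illustration:

```c
/* Start index and element count of one rank's slice.
 * (Hypothetical type, introduced only for this sketch.) */
typedef struct {
    int start;
    int count;
} chunk_t;

/* Split n elements over nprocs ranks: every rank gets n / nprocs
 * elements and the last rank additionally gets the n % nprocs leftovers. */
chunk_t chunk_for_rank(int n, int nprocs, int rank)
{
    int part = n / nprocs;
    int rem  = n % nprocs;
    chunk_t c;
    c.start = rank * part;
    c.count = part + ((rank == nprocs - 1) ? rem : 0);
    return c;
}
```

With n = 10 and three processes this gives the slices [0, 3), [3, 6) and [6, 10), which matches the start and e_count values the loop above computes for ranks 1 and 2.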
@Histro The point I want to make is that Isend/Irecv are just functions handled by the MPI library. The answer to the question you asked depends entirely on what the rest of your code does after the Send/Recv. There are two cases:
Master and Worker
You send part of the problem (say, the arrays) to the workers (all ranks other than 0, the master). Each worker does some work on the arrays and then returns the results to the master. The master then combines the results and hands out new work to the workers. Here you want the master to wait for all the workers to return their results (the modified arrays). So you cannot use Isend and Irecv, but rather multiple sends as in my code, with the corresponding receives. If your code goes in this direction, you will want to use MPI_Bcast and MPI_Reduce.
Lazy Master
The master divides the work but doesn't care about the results from its workers. Say you want to compute several different kinds of patterns from the same data. For example, given the characteristics of the population of some city, you want to calculate how many people are above 18, how many have jobs, and how many of them work in some company. These results don't have anything to do with one another. In this case you don't have to worry about whether the data has been received by the workers or not. The master can continue to execute the rest of the code. This is where it is safe to use Isend/Irecv.

Mapping MPI processes to particular nodes

I think this question may be irrelevant here, but I couldn't help myself.
Suppose I have a cluster with 100 nodes with each node having 16 cores.
I have an mpi application whose communication pattern is already known and I also know the cluster topology(i.e hop distance between nodes).
Now I know the processes to node mapping that reduces the contention on the network. For example: process to node mappings are 10->20,30->90.
How do I map the process with rank 10 to the node-20?
Please help me in this.
If you are not constrained with any kind of a queueing system you can control the rank to node mapping by creating your own machinefile.
For instance if the file my_machine_file has the following 1600 lines
node001
node002
node003
....
node100
node001
node002
node003
....
node100
...
[repeat 13 more times]
...
node001
node002
node003
....
node100
it would correspond to the mapping
0-> node001, 1 -> node002, ... 99 -> node100, 100 -> node001, ...
you should run your application with
mpirun -machinefile my_machine_file -n 1600 my_app
When your application needs less than 1600 processes you can edit your machinefile accordingly.
Please remember, though, that the cluster admin has probably numbered the nodes with the interconnect topology in mind. Still, there are reports of a noticeable performance increase (on the order of 10%-20%) through careful exploitation of the cluster topology. (References to follow).
Note: Starting an MPI program with mpirun is neither standardized nor portable. However here the question is clearly related to a specific compute cluster and a specific implementation (OpenMPI) and does not request a portable solution.
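A machinefile like the one above does not have to be written by hand. Here is a minimal sketch in plain C that produces the round-robin layout shown; machinefile_line and write_machinefile are hypothetical helpers, and the node001 to node100 naming is taken from the example rather than from any real cluster. For a custom mapping, the expression rank % num_nodes + 1 would be replaced by a lookup into your own rank-to-node table:

```c
#include <stdio.h>

/* Write the host name that belongs on line `rank` of the round-robin
 * machinefile into buf: rank 0 -> node001, rank 1 -> node002, ...
 * Returns what snprintf returns (the length of the name).
 * (Hypothetical helper; the "nodeNNN" scheme is an assumption.) */
int machinefile_line(int rank, int num_nodes, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "node%03d", rank % num_nodes + 1);
}

/* Write one host name per rank to `out`. Returns 0 on success, -1 on a
 * write error. */
int write_machinefile(FILE *out, int num_ranks, int num_nodes)
{
    char name[32];
    for (int r = 0; r < num_ranks; r++) {
        machinefile_line(r, num_nodes, name, sizeof name);
        if (fprintf(out, "%s\n", name) < 0)
            return -1;
    }
    return 0;
}
```

Calling write_machinefile(f, 1600, 100) on a file opened for writing would produce the 1600 lines listed above.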
A little late to this party, but here's a subroutine in C++ that will give you a node communicator and a master communicator (just for the masters of nodes), as well as the size and rank of each. It's clumsy, but I haven't found a better way to do this unfortunately. Luckily it only adds about 0.1s to the wall times. Maybe you or someone else will get some use out of it.
#define MASTER 0
using namespace std;
/*
* Make a communicator for each node and another for just
* the masters of the nodes. Upon completion, everyone is
* in a new node communicator, knows its size and their rank,
* and the rank of their master in the master communicator,
* which can be useful to use for indexing.
*/
bool CommByNode(MPI::Intracomm &NodeComm,
MPI::Intracomm &MasterComm,
int &NodeRank, int &MasterRank,
int &NodeSize, int &MasterSize,
string &NodeNameStr)
{
bool IsOk = true;
int Rank = MPI::COMM_WORLD.Get_rank();
int Size = MPI::COMM_WORLD.Get_size();
/*
* ======================================================================
* What follows is my best attempt at creating a communicator
* for each node in a job such that only the cores on that
* node are in the node's communicator, and each core groups
* itself and the node communicator is made using the Split() function.
* The end of this (lengthy) process is indicated by another comment.
* ======================================================================
*/
char *NodeName, *NodeNameList;
NodeName = new char [1000];
int NodeNameLen,
*NodeNameCountVect,
*NodeNameOffsetVect,
NodeNameTotalLen = 0;
// Get the name and name character count of each core's node
MPI::Get_processor_name(NodeName, NodeNameLen);
// Prepare a vector for character counts of node names
if (Rank == MASTER)
NodeNameCountVect = new int [Size];
// Gather node name lengths to master to prepare c-array
MPI::COMM_WORLD.Gather(&NodeNameLen, 1, MPI::INT, NodeNameCountVect, 1, MPI::INT, MASTER);
if (Rank == MASTER){
// Need character count information for navigating node name c-array
NodeNameOffsetVect = new int [Size];
NodeNameOffsetVect[0] = 0;
NodeNameTotalLen = NodeNameCountVect[0];
// build offset vector and total char count for all node names
for (int i = 1 ; i < Size ; ++i){
NodeNameOffsetVect[i] = NodeNameCountVect[i-1] + NodeNameOffsetVect[i-1];
NodeNameTotalLen += NodeNameCountVect[i];
}
// char-array for all node names
NodeNameList = new char [NodeNameTotalLen];
}
// Gatherv node names to char-array in master
MPI::COMM_WORLD.Gatherv(NodeName, NodeNameLen, MPI::CHAR, NodeNameList, NodeNameCountVect, NodeNameOffsetVect, MPI::CHAR, MASTER);
string *FullStrList, *NodeStrList;
// Each core keeps its node's name in a str for later comparison
stringstream ss;
ss << NodeName;
ss >> NodeNameStr;
delete [] NodeName; // node name in str, so delete c-array (allocated with new[])
int *NodeListLenVect, NumUniqueNodes = 0, NodeListCharLen = 0;
string NodeListStr;
if (Rank == MASTER){
/*
* Need to prepare a list of all unique node names, so first
* need all node names (incl duplicates) as strings, then
* can make a list of all unique node names.
*/
FullStrList = new string [Size]; // full list of node names, each will be checked
NodeStrList = new string [Size]; // list of unique node names, used for checking above list
// i loops over node names, j loops over characters for each node name.
for (int i = 0 ; i < Size ; ++i){
stringstream ss;
for (int j = 0 ; j < NodeNameCountVect[i] ; ++j)
ss << NodeNameList[NodeNameOffsetVect[i] + j]; // each char into the stringstream
ss >> FullStrList[i]; // stringstream into string for each node name
ss.str(""); // This and below clear the contents of the stringstream,
ss.clear(); // since the >> operator doesn't clear as it extracts
//cout << FullStrList[i] << endl; // for testing
}
delete [] NodeNameList; // master is done with full c-array
bool IsUnique; // flag for breaking from for loop
stringstream ss; // used for a full c-array of unique node names
for (int i = 0 ; i < Size ; ++i){ // Loop over EVERY name
IsUnique = true;
for (int j = 0 ; j < NumUniqueNodes ; ++j)
if (FullStrList[i].compare(NodeStrList[j]) == 0){ // check against list of uniques
IsUnique = false;
break;
}
if (IsUnique){
NodeStrList[NumUniqueNodes] = FullStrList[i]; // add unique names so others can be checked against them
ss << NodeStrList[NumUniqueNodes].c_str(); // build up a string of all unique names back-to-back
++NumUniqueNodes; // keep a tally of number of unique nodes
}
}
ss >> NodeListStr; // make a string of all unique node names
NodeListCharLen = NodeListStr.size(); // char length of all unique node names
NodeListLenVect = new int [NumUniqueNodes]; // list of unique node name lengths
/*
* Because Bcast simply duplicates the buffer of the Bcaster to all cores,
* the buffer needs to be a char* so that the other cores can have a similar
* buffer prepared to receive. This wouldn't work if we passed string.c_str()
* as the buffer, because the receiving cores don't have string.c_str() to
* receive into, and even if they did, c_str() is a method and can't be used
* that way.
*/
NodeNameList = const_cast<char*>(NodeListStr.c_str()); // point at the string's own buffer (no separate allocation, which would leak); c_str() returns const char*, so need to recast
for (int i = 0 ; i < NumUniqueNodes ; ++i) // fill list of unique node name char lengths
NodeListLenVect[i] = NodeStrList[i].size();
/*for (int i = 0 ; i < NumUnique ; ++i)
cout << UniqueNodeStrList[i] << endl;
MPI::COMM_WORLD.Abort(1);*/
delete [] NodeStrList; // arrays allocated with new[] must be released with delete[];
delete [] FullStrList; // NodeStrList is allocated again below for every rank
delete [] NodeNameCountVect;
delete [] NodeNameOffsetVect;
}
/*
* Now we send the list of node names back to all cores
* so they can group themselves appropriately.
*/
// Bcast the number of nodes in use
MPI::COMM_WORLD.Bcast(&NumUniqueNodes, 1, MPI::INT, MASTER);
// Bcast the full length of all node names
MPI::COMM_WORLD.Bcast(&NodeListCharLen, 1, MPI::INT, MASTER);
// prepare buffers for node name Bcast's
if (Rank > MASTER){
NodeListLenVect = new int [NumUniqueNodes];
NodeNameList = new char [NodeListCharLen];
}
// Lengths of node names for navigating c-string
MPI::COMM_WORLD.Bcast(NodeListLenVect, NumUniqueNodes, MPI::INT, MASTER);
// The actual full list of unique node names
MPI::COMM_WORLD.Bcast(NodeNameList, NodeListCharLen, MPI::CHAR, MASTER);
/*
* Similar to what master did before, each core (incl master)
* needs to build an actual list of node names as strings so they
* can compare the c++ way.
*/
int Offset = 0;
NodeStrList = new string[NumUniqueNodes];
for (int i = 0 ; i < NumUniqueNodes ; ++i){
stringstream ss;
for (int j = 0 ; j < NodeListLenVect[i] ; ++j)
ss << NodeNameList[Offset + j];
ss >> NodeStrList[i];
ss.str("");
ss.clear();
Offset += NodeListLenVect[i];
//cout << FullStrList[i] << endl;
}
// Now since everyone has the same list, just check your node and find your group.
int CommGroup = -1;
for (int i = 0 ; i < NumUniqueNodes ; ++i)
if (NodeNameStr.compare(NodeStrList[i]) == 0){
CommGroup = i;
break;
}
if (Rank > MASTER){
delete [] NodeListLenVect;
delete [] NodeNameList;
}
// In case process fails, error prints and job aborts.
if (CommGroup < 0){
cout << "**ERROR** Rank " << Rank << " didn't identify comm group correctly." << endl;
IsOk = false;
}
/*
* ======================================================================
* The above method uses c++ strings wherever possible so that things
* like node name comparisons can be done the c++ way. I'm sure there's
* a better way to do this because that was way too many lines of code...
* ======================================================================
*/
// Create node communicators
NodeComm = MPI::COMM_WORLD.Split(CommGroup, 0);
NodeSize = NodeComm.Get_size();
NodeRank = NodeComm.Get_rank();
// Group for master communicator
int MasterGroup;
if (NodeRank == MASTER)
MasterGroup = 0;
else
MasterGroup = MPI_UNDEFINED;
// Create master communicator
MasterComm = MPI::COMM_WORLD.Split(MasterGroup, 0);
MasterRank = -1;
MasterSize = -1;
if (MasterComm != MPI::COMM_NULL){
MasterRank = MasterComm.Get_rank();
MasterSize = MasterComm.Get_size();
}
MPI::COMM_WORLD.Bcast(&MasterSize, 1, MPI::INT, MASTER);
NodeComm.Bcast(&MasterRank, 1, MPI::INT, MASTER);
return IsOk;
}
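The core of the subroutine above is turning a list of node names into one group index per rank, which is then used as the color for Split(). Stripped of the MPI calls and the string handling, that step can be sketched in plain C like this; assign_node_colors is a hypothetical name, and the sketch assumes all names are already available in one array:

```c
#include <string.h>

/* For each rank, compute a color: the index of its node name among the
 * unique names, in order of first appearance. Returns the number of
 * unique names. (Hypothetical helper, not part of MPI.) */
int assign_node_colors(const char *names[], int nranks, int colors[])
{
    int nunique = 0;
    for (int i = 0; i < nranks; i++) {
        int found = -1;
        /* Look for an earlier rank that reported the same node name. */
        for (int j = 0; j < i; j++) {
            if (strcmp(names[i], names[j]) == 0) {
                found = colors[j];
                break;
            }
        }
        /* Reuse that rank's color, or open a new group. */
        colors[i] = (found >= 0) ? found : nunique++;
    }
    return nunique;
}
```

Here colors[i] plays the role of CommGroup for rank i, and the return value corresponds to NumUniqueNodes.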

MPI Receive/Gather Dynamic Vector Length

I have an application that stores a vector of structs. These structs hold information about each GPU on a system like memory and giga-flop/s. There are a different number of GPUs on each system.
I have a program that runs on multiple machines at once and I need to collect this data. I am very new to MPI but am able to use MPI_Gather() for the most part, however I would like to know how to gather/receive these dynamically sized vectors.
class MachineData
{
unsigned long hostMemory;
long cpuCores;
int cudaDevices;
public:
std::vector<NviInfo> nviVec;
std::vector<AmdInfo> amdVec;
...
};
struct AmdInfo
{
int platformID;
int deviceID;
cl_device_id device;
long gpuMem;
float sgflops;
double dgflops;
};
Each machine in a cluster populates its instance of MachineData. I want to gather each of these instances, but I am unsure how to approach gathering nviVec and amdVec since their length varies on each machine.
You can use MPI_GATHERV in combination with MPI_GATHER to accomplish that. MPI_GATHERV is the variable-count version of MPI_GATHER and it allows the root rank to gather a different number of elements from each sending process. But in order for the root rank to specify these numbers, it has to know how many elements each rank is holding. This can be achieved with a simple single-element MPI_GATHER beforehand. Something like this:
// To keep things simple: root is fixed to be rank 0 and MPI_COMM_WORLD is used
// Number of MPI processes and current rank
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int *counts = new int[size];
int nelements = (int)vector.size();
// Each process tells the root how many elements it holds
MPI_Gather(&nelements, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
// Displacements in the receive buffer for MPI_GATHERV
int *disps = new int[size];
// Displacement for the first chunk of data - 0
for (int i = 0; i < size; i++)
disps[i] = (i > 0) ? (disps[i-1] + counts[i-1]) : 0;
// Place to hold the gathered data
// Allocate at root only
type *alldata = NULL;
if (rank == 0)
// disps[size-1]+counts[size-1] == total number of elements
alldata = new type[disps[size-1]+counts[size-1]];
// Collect everything into the root
MPI_Gatherv(vectordata, nelements, datatype,
alldata, counts, disps, datatype, 0, MPI_COMM_WORLD);
You should also register an MPI derived datatype (datatype in the code above) for the structures. Sending the raw bytes would work, but it is not portable and will not work in heterogeneous setups.
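A derived datatype for a struct is described by the byte offset, length and type of each field. The offsets can be computed portably with offsetof. Here is a sketch in plain C, using a hypothetical trimmed-down copy of AmdInfo called AmdInfoPlain (the cl_device_id member is left out, since an opaque handle is not meaningful on another machine); amd_info_displacements is likewise a made-up name:

```c
#include <stddef.h>

/* Hypothetical plain-data version of AmdInfo (cl_device_id omitted). */
typedef struct {
    int platformID;
    int deviceID;
    long gpuMem;
    float sgflops;
    double dgflops;
} AmdInfoPlain;

enum { AMD_INFO_FIELDS = 5 };

/* Fill disps[] with the byte offset of each field, in declaration order. */
void amd_info_displacements(size_t disps[AMD_INFO_FIELDS])
{
    disps[0] = offsetof(AmdInfoPlain, platformID);
    disps[1] = offsetof(AmdInfoPlain, deviceID);
    disps[2] = offsetof(AmdInfoPlain, gpuMem);
    disps[3] = offsetof(AmdInfoPlain, sgflops);
    disps[4] = offsetof(AmdInfoPlain, dgflops);
}
```

In an MPI program these offsets would be stored as MPI_Aint values and passed, together with block lengths of 1 and the types MPI_INT, MPI_INT, MPI_LONG, MPI_FLOAT and MPI_DOUBLE, to MPI_Type_create_struct, followed by MPI_Type_commit.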

stream.read method accepts length as integer type ??? where as

I am trying to read a file from a stream.
I am using the stream.Read method to read the bytes, so the code looks like this:
FileByteStream.Read(buffer, 0, outputMessage.FileByteStream.Length)
Now the above gives me an error because the last parameter "outputMessage.FileByteStream.Length" returns a long type value but the method expects an integer type.
Please advise.
Convert it to an int...
FileByteStream.Read(buffer, 0, Convert.ToInt32(outputMessage.FileByteStream.Length))
It's probably an int because this operation blocks until it's done reading...so if you're in a high volume application, you may not want to block while you read in a massively large file.
If what you're reading in isn't reasonably sized, you may want to consider looping to read the data into a buffer (example from MSDN docs):
//s is the stream that I'm working with...
byte[] bytes = new byte[s.Length];
int numBytesToRead = (int) s.Length;
int numBytesRead = 0;
while (numBytesToRead > 0)
{
// Read may return anything from 0 to 10; never request more than what is left.
int n = s.Read(bytes, numBytesRead, Math.Min(10, numBytesToRead));
// The end of the file is reached.
if (n == 0)
{
break;
}
numBytesRead += n;
numBytesToRead -= n;
}
This way each call to Read takes a small int count, and if you pick a reasonably large chunk size instead of 10, you'll go through the while loop only a few times.
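The same chunked-read pattern is not specific to .NET. As a language-neutral illustration, here is a sketch of it in plain C using fread on a FILE*; read_in_chunks is a hypothetical helper, and the sketch assumes the caller's buffer holds at least total bytes:

```c
#include <stdio.h>

/* Read up to `total` bytes from `in` into `buf`, requesting at most
 * `chunk` bytes per call. Stops early at end of file or on a read error.
 * Returns the number of bytes actually read. (Hypothetical helper.) */
size_t read_in_chunks(FILE *in, unsigned char *buf, size_t total, size_t chunk)
{
    size_t done = 0;
    while (done < total) {
        size_t want = total - done;
        if (want > chunk)
            want = chunk;          /* never ask for more than one chunk */
        size_t n = fread(buf + done, 1, want, in);
        if (n == 0)
            break;                 /* end of file or error */
        done += n;
    }
    return done;
}
```

As in the loop above, the per-call count stays small no matter how large the stream is, and the loop ends either when everything has been read or when a read returns 0.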
