MPI hangs on MPI_Send for large messages

I have a simple C++ / MPI (MPICH2) program that sends an array of type double. If the size of the array is more than 9000, my program hangs during the call to MPI_Send. If the array is smaller than 9000 (8000, for example), the program works fine. Source code is below:
main.cpp
using namespace std;

Cube** cubes;
int cubesLen;

double* InitVector(int N) {
    double* x = new double[N];
    for (int i = 0; i < N; i++) {
        x[i] = i + 1;
    }
    return x;
}

void CreateCubes() {
    cubes = new Cube*[12];
    cubesLen = 12;
    for (int i = 0; i < 12; i++) {
        cubes[i] = new Cube(9000);
    }
}

void SendSimpleData(int size, int rank) {
    Cube* cube = cubes[0];
    int nodeDest = rank + 1;
    if (nodeDest > size - 1) {
        nodeDest = 1;
    }
    double* coefImOut = (double *) malloc(sizeof (double)*cube->coefficentsImLength);
    cout << "Before send" << endl;
    int count = cube->coefficentsImLength;
    MPI_Send(coefImOut, count, MPI_DOUBLE, nodeDest, 0, MPI_COMM_WORLD);
    cout << "After send" << endl;
    free(coefImOut);

    MPI_Status status;
    double *coefIm = (double *) malloc(sizeof(double)*count);
    int nodeFrom = rank - 1;
    if (nodeFrom < 1) {
        nodeFrom = size - 1;
    }
    MPI_Recv(coefIm, count, MPI_DOUBLE, nodeFrom, 0, MPI_COMM_WORLD, &status);
    free(coefIm);
}

int main(int argc, char *argv[]) {
    int size, rank;
    const int root = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    CreateCubes();
    if (rank != root) {
        SendSimpleData(size, rank);
    }
    MPI_Finalize();
    return 0;
}
class Cube
class Cube {
public:
    Cube(int size);
    Cube(const Cube& orig);
    virtual ~Cube();
    int Id() { return id; }
    void Id(int id) { this->id = id; }
    int coefficentsImLength;
    double* coefficentsIm;
private:
    int id;
};

Cube::Cube(int size) {
    this->coefficentsImLength = size;
    coefficentsIm = new double[size];
    for (int i = 0; i < size; i++) {
        coefficentsIm[i] = 1;
    }
}

Cube::Cube(const Cube& orig) {
}

Cube::~Cube() {
    delete[] coefficentsIm;
}
The program runs on 4 processes:
mpiexec -n 4 ./myApp1
Any ideas?

The details of the Cube class aren't relevant here, so consider a simpler version:
#include <mpi.h>
#include <iostream>
#include <cstdlib>

using namespace std;

int main(int argc, char *argv[]) {
    int size, rank;
    const int root = 0;
    int datasize = atoi(argv[1]);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != root) {
        int nodeDest = rank + 1;
        if (nodeDest > size - 1) {
            nodeDest = 1;
        }
        int nodeFrom = rank - 1;
        if (nodeFrom < 1) {
            nodeFrom = size - 1;
        }

        MPI_Status status;
        int *data = new int[datasize];
        for (int i = 0; i < datasize; i++)
            data[i] = rank;

        cout << "Before send" << endl;
        MPI_Send(data, datasize, MPI_INT, nodeDest, 0, MPI_COMM_WORLD);
        cout << "After send" << endl;
        MPI_Recv(data, datasize, MPI_INT, nodeFrom, 0, MPI_COMM_WORLD, &status);

        delete [] data;
    }
    MPI_Finalize();
    return 0;
}
where running gives
$ mpirun -np 4 ./send 1
Before send
After send
Before send
After send
Before send
After send
$ mpirun -np 4 ./send 65000
Before send
Before send
Before send
If in DDT you looked at the message queue window, you'd see everyone is sending, and no one is receiving, and you have a classic deadlock.
MPI_Send's semantics are, weirdly, not precisely defined; in particular, it is allowed to block until "the receive has been posted". MPI_Ssend is clearer in this regard; it will always block until the receive has been posted. Details about the different send modes can be seen here.
The reason it worked for smaller messages is an accident of the implementation; for "small enough" messages (for your case, it looks to be <64kB), your MPI_Send implementation uses an "eager send" protocol and doesn't block on the receive; for larger messages, where it isn't necessarily safe just to keep buffered copies of the message kicking around in memory, the Send waits for the matching receive (which it is always allowed to do anyway).
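A quick way to check whether a program is relying on this eager buffering is to swap MPI_Send for MPI_Ssend: a correct program still works, while one that depends on buffering deadlocks even for tiny messages. For instance, in the test program above you could replace the send line with (just a sketch of the one-line change; nothing else needs to be touched):
// Synchronous send: guaranteed to block until the matching receive is posted,
// so this hangs for any message size, not just large ones.
MPI_Ssend(data, datasize, MPI_INT, nodeDest, 0, MPI_COMM_WORLD);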
There are a few things you could do to avoid this; all you have to do is make sure that not every rank calls a blocking MPI_Send at the same time. You could (say) have even-ranked processes send first, then receive, and odd-ranked processes receive first, then send. You could use nonblocking communications (Isend/Irecv/Waitall). But the simplest solution in this case is to use MPI_Sendrecv, which is a blocking (send + receive) in one call, rather than a blocking send followed by a blocking receive. The send and receive execute concurrently, and the function blocks until both are complete. So this works:
#include <mpi.h>
#include <iostream>
#include <cstdlib>

using namespace std;

int main(int argc, char *argv[]) {
    int size, rank;
    const int root = 0;
    int datasize = atoi(argv[1]);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != root) {
        int nodeDest = rank + 1;
        if (nodeDest > size - 1) {
            nodeDest = 1;
        }
        int nodeFrom = rank - 1;
        if (nodeFrom < 1) {
            nodeFrom = size - 1;
        }

        MPI_Status status;
        int *outdata = new int[datasize];
        int *indata  = new int[datasize];
        for (int i = 0; i < datasize; i++)
            outdata[i] = rank;

        cout << "Before sendrecv" << endl;
        MPI_Sendrecv(outdata, datasize, MPI_INT, nodeDest, 0,
                     indata,  datasize, MPI_INT, nodeFrom, 0, MPI_COMM_WORLD, &status);
        cout << "After sendrecv" << endl;

        delete [] outdata;
        delete [] indata;
    }
    MPI_Finalize();
    return 0;
}
Running gives
$ mpirun -np 4 ./send 65000
Before sendrecv
Before sendrecv
Before sendrecv
After sendrecv
After sendrecv
After sendrecv
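For completeness, here is roughly what the nonblocking (Isend/Irecv/Waitall) variant mentioned above might look like; this is a sketch of the same ring exchange as the test program, not a drop-in replacement for the original Cube code:
#include <mpi.h>
#include <iostream>
#include <cstdlib>

using namespace std;

int main(int argc, char *argv[]) {
    int size, rank;
    const int root = 0;
    int datasize = atoi(argv[1]);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != root) {
        int nodeDest = rank + 1;
        if (nodeDest > size - 1) nodeDest = 1;
        int nodeFrom = rank - 1;
        if (nodeFrom < 1) nodeFrom = size - 1;

        int *outdata = new int[datasize];
        int *indata  = new int[datasize];
        for (int i = 0; i < datasize; i++)
            outdata[i] = rank;

        // Post both operations first; neither call blocks, so there is no deadlock.
        MPI_Request reqs[2];
        MPI_Isend(outdata, datasize, MPI_INT, nodeDest, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(indata,  datasize, MPI_INT, nodeFrom, 0, MPI_COMM_WORLD, &reqs[1]);

        // Only now wait for both to complete.
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        delete [] outdata;
        delete [] indata;
    }
    MPI_Finalize();
    return 0;
}
Because both operations are posted before either is waited on, every rank makes progress regardless of the order in which its neighbours post theirs.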

Related

MPI_Test returning true flags for requests despite never sending?

I have some code in which, for testing purposes, I removed all sends and kept only non-blocking receives. You can imagine my surprise when, using MPI_Test, the flags indicated that some of the requests were actually completing. My code is set up on a Cartesian grid, with a small replica below, although it doesn't reproduce the error:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h> // for sleep
#include <mpi.h>

void test(int pos);

MPI_Comm comm_cart;

int main(int argc, char *argv[])
{
    int i, j;
    int rank, size;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* code for mpi cartesian grid topology */
    int dim[1];
    dim[0] = 2;
    int periods[1];
    periods[0] = 0;
    int reorder = 1;
    int coords[1];

    MPI_Cart_create(MPI_COMM_WORLD, 1, dim, periods, 1, &comm_cart);
    MPI_Cart_coords(comm_cart, rank, 2, coords);

    test(coords[0]);

    MPI_Finalize();
    return (0);
}

void test(int pos)
{
    float placeholder[4];
    int other = (pos+1) % 2;
    MPI_Request reqs[8];
    int flags[4];

    for(int iter = 0; iter < 20; iter++){
        // Test requests from previous time cycle
        for(int i=0;i<4;i++){
            if(iter == 0) break;
            MPI_Test(&reqs[0], &flags[0], MPI_STATUS_IGNORE);
            printf("Flag: %d\n", flags[0]);
        }
        MPI_Irecv(&placeholder[0], 1, MPI_FLOAT, other, 0, comm_cart, &reqs[0]);
    }
}
Any help would be appreciated.
The issue is with MPI_Test and MPI_PROC_NULL. Quite often when using MPI_Cart_shift you end up with MPI_PROC_NULL neighbours, because if you're on the edge of the grid a neighbouring cell simply doesn't exist in some directions.
I couldn't find this documented clearly anywhere, so I had to discover it myself: when you do an MPI_Irecv with an MPI_PROC_NULL source, it completes instantly, and when tested using MPI_Test the flag will return true for a completed request. Example code below:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int t;
    int flag;
    MPI_Request req;

    MPI_Irecv(&t, 1, MPI_INT, MPI_PROC_NULL, 0, MPI_COMM_WORLD, &req);
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
    printf("Flag: %d\n", flag);

    MPI_Finalize();
    return (0);
}
Which returns the following when run:
Flag: 1
Flag: 1
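For what it's worth, this behaviour is what makes the usual MPI_Cart_shift idiom convenient: on a non-periodic grid the missing neighbours come back as MPI_PROC_NULL, and sends and receives involving them complete immediately as no-ops. A small sketch (it assumes a 1-D, non-periodic grid of 2 processes, like the replica above) that prints the MPI_PROC_NULL neighbours:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int dim[1] = {2};
    int periods[1] = {0};   /* non-periodic, so the edges have no neighbour */
    MPI_Comm comm_cart;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dim, periods, 1, &comm_cart);

    if (comm_cart != MPI_COMM_NULL) {
        int rank, left, right;
        MPI_Comm_rank(comm_cart, &rank);

        /* For ranks on the boundary, the shift returns MPI_PROC_NULL. */
        MPI_Cart_shift(comm_cart, 0, 1, &left, &right);
        printf("rank %d: left=%d, right=%d (MPI_PROC_NULL=%d)\n",
               rank, left, right, MPI_PROC_NULL);
    }

    MPI_Finalize();
    return 0;
}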

Is there some way of avoiding this implicit MPI_Allreduce() synchronisation?

I'm writing an MPI program that uses a library which makes its own MPI calls. In my program, I have a loop that calls a function from the library. The function that I'm calling from the library makes use of MPI_Allreduce.
The problem here is that in my program, some of the ranks can exit the loop before others and this causes the MPI_Allreduce call to just hang since not all ranks will be calling MPI_Allreduce again.
Is there any way of programming around this without modifying the sources of the library I'm using?
Below is the code for an example which demonstrates the execution pattern.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>
#include <math.h>
#include <assert.h>

#define N_ITEMS 100000
#define ITERATIONS 32

float *create_rand_nums(int num_elements) {
    float *rand_nums = (float *)malloc(sizeof(float) * num_elements);
    assert(rand_nums != NULL);
    int i;
    for (i = 0; i < num_elements; i++) {
        rand_nums[i] = (rand() / (float)RAND_MAX);
    }
    return rand_nums;
}

void reduce_stddev(int world_rank, int world_size, int num_elements_per_proc)
{
    fprintf(stdout, "Calling %s: %d\n", __func__, world_rank);
    fflush(stdout);

    srand(time(NULL)*world_rank);
    float *rand_nums = NULL;
    rand_nums = create_rand_nums(num_elements_per_proc);

    float local_sum = 0;
    int i;
    for (i = 0; i < num_elements_per_proc; i++) {
        local_sum += rand_nums[i];
    }

    float global_sum;
    fprintf(stdout, "%d: About to call all reduce\n", world_rank);
    fflush(stdout);
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
    fprintf(stdout, "%d: done calling all reduce\n", world_rank);
    fflush(stdout);

    float mean = global_sum / (num_elements_per_proc * world_size);

    float local_sq_diff = 0;
    for (i = 0; i < num_elements_per_proc; i++) {
        local_sq_diff += (rand_nums[i] - mean) * (rand_nums[i] - mean);
    }

    float global_sq_diff;
    MPI_Reduce(&local_sq_diff, &global_sq_diff, 1, MPI_FLOAT, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (world_rank == 0) {
        float stddev = sqrt(global_sq_diff /
                            (num_elements_per_proc * world_size));
        printf("Mean - %f, Standard deviation = %f\n", mean, stddev);
    }
    free(rand_nums);
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: avg num_elements_per_proc\n");
        exit(1);
    }
    int num_elements_per_proc = atoi(argv[1]);

    MPI_Init(NULL, NULL);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    unsigned long long j = 0;
    for(j = 0; j < ITERATIONS; j++)
    {
        /* Function which calls MPI_Allreduce */
        reduce_stddev(world_rank, world_size, num_elements_per_proc);

        /* Simulates some processes leaving the loop early */
        if( (j == (ITERATIONS/2)) && (world_rank % 2 == 0))
        {
            fprintf(stdout, "%d exiting\n", world_rank);
            fflush(stdout);
            break;
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return EXIT_SUCCESS;
}
This is always an issue in MPI - how do you tell all the other ranks when one rank is finished? The easiest approach is for everyone to set a true/false flag and then do an allreduce to see if anyone finished. Using this code at the end seems to work
for(j = 0; j < ITERATIONS; j++)
{
    /* Function which calls MPI_Allreduce */
    reduce_stddev(world_rank, world_size, num_elements_per_proc);

    int finished = 0;

    /* Simulates some processes leaving the loop early */
    if( (j == (ITERATIONS/2)) && (world_rank % 2 == 0))
    {
        fprintf(stdout, "%d finished\n", world_rank);
        fflush(stdout);
        finished = 1;
    }

    /* Check to see if anyone has finished */
    int anyfinished;
    MPI_Allreduce(&finished, &anyfinished, 1, MPI_INT, MPI_LOR,
                  MPI_COMM_WORLD);

    if (anyfinished)
    {
        fprintf(stdout, "%d exiting\n", world_rank);
        break;
    }
}
OK - I just reread your question and maybe I misunderstood it. Do you want everyone else to keep calculating?
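If the intent is the opposite, i.e. every rank should keep calling the library until all of them have finished their own work, the same trick works with MPI_LAND instead of MPI_LOR: each rank keeps a sticky "finished" flag and only leaves the loop once the allreduce says everyone has set it. A sketch using the same variables as above:
int myfinished = 0;
for (j = 0; j < ITERATIONS; j++)
{
    /* Everyone keeps calling the library, so its internal
       MPI_Allreduce always has a matching call on every rank. */
    reduce_stddev(world_rank, world_size, num_elements_per_proc);

    /* Once a rank has finished its own work, the flag stays set. */
    if ((j == (ITERATIONS/2)) && (world_rank % 2 == 0))
        myfinished = 1;

    /* Leave only when every rank has finished. */
    int allfinished;
    MPI_Allreduce(&myfinished, &allfinished, 1, MPI_INT, MPI_LAND,
                  MPI_COMM_WORLD);
    if (allfinished)
        break;
}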

MPI: how to distinguish send and recv in MPI_Wait

Let's say I use PMPI to write a wrapper for MPI_Wait, which waits for an MPI send or receive to complete.
/* ================== C Wrappers for MPI_Wait ================== */
_EXTERN_C_ int PMPI_Wait(MPI_Request *request, MPI_Status *status);
_EXTERN_C_ int MPI_Wait(MPI_Request *request, MPI_Status *status) {
    int _wrap_py_return_val = 0;
    _wrap_py_return_val = PMPI_Wait(request, status);
    return _wrap_py_return_val;
}
The wrapper is generated by this.
What I would like to do is:
/* ================== C Wrappers for MPI_Wait ================== */
_EXTERN_C_ int PMPI_Wait(MPI_Request *request, MPI_Status *status);
_EXTERN_C_ int MPI_Wait(MPI_Request *request, MPI_Status *status) {
    int _wrap_py_return_val = 0;
    if (is a send request)
        printf("send\n");
    else // is a recv request
        printf("recv\n");
    _wrap_py_return_val = PMPI_Wait(request, status);
    return _wrap_py_return_val;
}
How can I distinguish between a send request and a receive request in Open MPI? Let's say I use Open MPI 3.0.0.
I think that since MPI_Request is opaque (in several releases it is just an int), your only chance is to monitor the created MPI_Request yourself.
Here is a proposition (it is C++ oriented, because that's the way I like it):
#include <mpi.h>
#include <iostream>
#include <map>
#include <string>
#include <cstring>

//To do opaque ordering
struct RequestConverter
{
    char data[sizeof(MPI_Request)];
    RequestConverter(MPI_Request * mpi_request)
    {
        memcpy(data, mpi_request, sizeof(MPI_Request));
    }
    RequestConverter()
    { }
    RequestConverter(const RequestConverter & req)
    {
        memcpy(data, req.data, sizeof(MPI_Request));
    }
    RequestConverter & operator=(const RequestConverter & req)
    {
        memcpy(data, req.data, sizeof(MPI_Request));
        return *this;
    }
    bool operator<(const RequestConverter & request) const
    {
        for(size_t i=0; i<sizeof(MPI_Request); i++)
        {
            if(data[i]!=request.data[i])
            {
                return data[i]<request.data[i];
            }
        }
        return false;
    }
};

//To store the created MPI_Request
std::map<RequestConverter, std::string> request_holder;

extern "C"
{
    int MPI_Isend(
        void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request
    )
    {
        int ier = PMPI_Isend(buf, count, datatype, dest, tag, comm, request);
        request_holder[RequestConverter(request)] = "sending";
        return ier;
    }

    int MPI_Irecv(
        void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request
    )
    {
        int ier = PMPI_Irecv(buf, count, datatype, dest, tag, comm, request);
        request_holder[RequestConverter(request)] = "receiving";
        return ier;
    }

    int MPI_Wait(
        MPI_Request *request,
        MPI_Status * status
    )
    {
        int myid;
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        std::cout << "waiting(" << myid << ")-> " << request_holder[RequestConverter(request)] << std::endl;
        request_holder.erase(RequestConverter(request));
        return PMPI_Wait(request, status);
    }
}
RequestConverter is just a way of imposing an ordering on the opaque MPI_Request type so that it can be used as a std::map key.
MPI_Isend and MPI_Irecv store the request in the global map, and MPI_Wait looks the request up and erases it from the std::map.
A simple test gives:
int main(int argv, char ** args)
{
    int myid, numprocs;
    MPI_Init(&argv, &args);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    int i=123456789;
    MPI_Request request;
    MPI_Status status;
    if(myid==0)
    {
        MPI_Isend(&i, 1, MPI_INT, 1, 44444, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        std::cout << myid << ' ' << i << std::endl;
    }
    else if(myid==1)
    {
        MPI_Irecv(&i, 1, MPI_INT, 0, 44444, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        std::cout << myid << ' ' << i << std::endl;
    }

    int * sb = new int[numprocs];
    for(size_t i=0; i<numprocs; i++){sb[i]=(myid+1)*(i+1);}
    int * rb = new int[numprocs];
    MPI_Alltoall(sb, 1, MPI_INT, rb, 1, MPI_INT, MPI_COMM_WORLD);

    MPI_Finalize();
}
output:
waiting(0)-> sending
0 123456789
waiting(1)-> receiving
1 123456789
However, I also added a test with MPI_Alltoall to check whether collective operations internally go through the wrapped MPI functions or call the PMPI layer directly: only the PMPI functions are called, so the wrappers see nothing of a collective's internal requests. No miracle there.
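If you also wrap MPI_Test (or MPI_Waitall), the same request_holder map can be reused; the one subtlety is that PMPI_Test sets the request to MPI_REQUEST_NULL on completion, so the key has to be captured first. A sketch of such a wrapper, meant to live in the same extern "C" block as the others:
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
{
    // Capture the key before PMPI_Test possibly invalidates *request.
    RequestConverter key(request);
    int ier = PMPI_Test(request, flag, status);
    if (*flag)
    {
        int myid;
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        std::cout << "tested(" << myid << ")-> " << request_holder[key] << std::endl;
        request_holder.erase(key);
    }
    return ier;
}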

QSemaphore producer consumer problem

This is more or less Qt's example with some small changes.
The output is PcPcPcPc...etc. I don't understand why.
Namely, I am confused about how sProducer.acquire(256); works. I believe I understand how sProducer.acquire(1); works. It doesn't make sense to me to acquire anything more than 1 because I don't see how acquiring more than 1 makes any difference logically. Could someone explain this? On the surface, writing 1 byte and reading 1 byte doesn't seem very efficient due to semaphore overhead...but acquiring more resources doesn't seem to make a performance difference nor does the code make sense.
Logically, I think the acquire and the release have to use the same number (whatever that number is). But how can I modify this code so I can acquire more (say 256) and thus reduce semaphore overhead? The code below just doesn't make sense to me when acquire and release are not 1.
#include <QtCore>
#include <iostream>
#include <QTextStream>

//Global variables.
QTextStream out(stdout);
QTextStream in(stdin);
const int DataSize = 1024;
const int BufferSize = 512;
char buffer[BufferSize];

QSemaphore sProducer(BufferSize);
QSemaphore sConsumer(0);
//-----------------------------

class Producer : public QThread
{
public:
    void run();
};

void Producer::run()
{
    for (int i = 0; i < DataSize; ++i) {
        sProducer.acquire(256);
        buffer[i % BufferSize] = 'P';
        sConsumer.release(256);
    }
}

class Consumer : public QThread
{
public:
    void run();
};

void Consumer::run()
{
    for (int i = 0; i < DataSize; ++i) {
        sConsumer.acquire(256);
        std::cerr << buffer[i % BufferSize];
        out << "c";
        out.flush();
        sProducer.release(256);
    }
    std::cerr << std::endl;
}

int main()
{
    Producer producer;
    Consumer consumer;
    producer.start();
    consumer.start();
    producer.wait();
    consumer.wait();
    in.readLine(); //so I can read console text.
    return 0;
}
Since there is only one producer and one consumer, each can freely advance its own private cursor (its i variable) by as many bytes as it wants, as long as there is enough room to do so (something higher than 256 on both sides with a 512-byte buffer could cause a deadlock).
Basically, when a thread successfully acquires 256 bytes, it means it can safely read or write those 256 bytes in one single operation, so you just have to put another loop inside the acquire/release block to handle that number of bytes.
For the producer:
void Producer::run()
{
    for (int i = 0; i < DataSize; ++i) {
        const int blockSize = 256;
        sProducer.acquire(blockSize);
        for(int j = 0; j < blockSize; ++i, ++j) {
            buffer[i % BufferSize] = 'P';
        }
        sConsumer.release(blockSize);
    }
}
And for the consumer:
void Consumer::run()
{
    for (int i = 0; i < DataSize; ++i) {
        const int blockSize = 128;
        sConsumer.acquire(blockSize);
        for(int j = 0; j < blockSize; ++i, ++j) {
            std::cerr << buffer[i % BufferSize];
            out << "c";
            out.flush();
        }
        sProducer.release(blockSize);
    }
    std::cerr << std::endl;
}
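As an aside, the same batching can be written with the outer loop advancing one whole block per iteration, which avoids incrementing i in two places; a sketch of the producer only (it assumes, as here, that DataSize is a multiple of blockSize):
void Producer::run()
{
    const int blockSize = 256;
    for (int i = 0; i < DataSize; i += blockSize) {
        sProducer.acquire(blockSize);              // wait until blockSize bytes are free
        for (int j = 0; j < blockSize; ++j)
            buffer[(i + j) % BufferSize] = 'P';    // fill the whole block
        sConsumer.release(blockSize);              // hand the block to the consumer
    }
}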

MPI Error: No output

The code below uses 4 nodes to communicate using MPI. I am able to compile it successfully on the cluster using mpiicpc.
However, when I run it, the output screen just gives me the warning ‘Warning: Cant read mpd.hosts for list of hosts start only on current’ and the program hangs.
Could you please suggest what the warning means, and whether it is the reason my code hangs?
#include <mpi.h>
#include <fstream>
using namespace std;

#define Cols 96
#define Rows 96
#define beats 1

ofstream fout("Vm0");
ofstream f1out("Vm1");
.....
.....

double V[Cols][Rows];
int r,i,y,ibeat;
int my_rank;
int p;
int source;
int dest;
int tag = 0;

//Allocating Memory
double *A = new double[Rows*sizeof(double)];
double *B = new double[Rows*sizeof(double)];
.....
......

void prttofile ();

// MAIN FUNCTION
int main (int argc, char *argv[])
{
    //MPI Commands
    MPI_Status status;
    MPI_Request send_request, recv_request;
    MPI_Init (&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    for (ibeat=0;ibeat<beats;ibeat++)
    {
        for (i=0; i<Cols/2; i++)
        {
            for (y=0; y<Rows/2; y++)
            {
                if (my_rank == 0)
                    if (i < 48)
                        if (y<48)
                            V[i][y] = 0;
                ....
                .......
                .....
            }
        }

        //Load the Array with the edge values
        for (r=0; r<Rows/2; y++)
        {
            if ((my_rank == 0) || (my_rank == 1))
            {
                A[r] = V[r][48];
                BB[r] = V[r][48];
            }
            .....
            .....
        }

        int test = 2;
        if ((my_rank%test) == 0)
        {
            MPI_Isend(C, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &send_request);
            MPI_Irecv(CC, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &recv_request);
        }
        else if ((my_rank%test) == 1)
            ......
        ......

        ibeat = ibeat+1;
        prttofile ();
    } //close ibeat

    MPI_Finalize ();
} //close main

//Print to File Function to save output values
void prttofile ()
{
    for (i = 0; i<Cols/2; i++)
    {
        for (y = 0; y<Rows/2; y++)
        {
            if (my_rank == 0)
                fout << V[i][y] << " " ;
            ....
            .....
        }
    }
    if (my_rank == 0)
        fout << endl;
    if ....
    ....
}
When you want to run on multiple nodes you have to tell mpirun which ones you want with the -machinefile switch. This machinefile is just a list of nodes, one per line. If you want to put 2 processes on one node, list it twice.
So if your machines are named node1 and node2 and you want to use two cores from each:
$ cat nodes
node1
node1
node2
node2
$ mpirun -machinefile nodes -np 4 ./a.out
If you're using a batch control system like PBS or TORQUE (you use qsub to submit your job) then this node file is created for you and its location is in the $PBS_NODEFILE environment variable:
mpirun -machinefile $PBS_NODEFILE -np 4 ./a.out
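As for the warning itself: the mention of mpd.hosts suggests an MPICH2 installation using the old MPD process manager, which reads its host list from an mpd.hosts file and needs its daemons started before mpiexec will use more than the current node. Something along these lines (an assumption based on the warning text; the exact setup depends on your installation):
$ cat mpd.hosts
node1
node2
$ mpdboot -n 2 -f mpd.hosts
$ mpiexec -n 4 ./a.out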
