MPI Send/Recv deadlock

I write programs using MPI and I have access to two different clusters. I am not good at system administration, so I cannot say anything about the software, OS, or compilers used there. But on one machine I get a deadlock with the following code:
#include "mpi.h"
#include <iostream>
int main(int argc, char **argv) {
int rank, numprocs;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
int x = rank;
if (rank == 0) {
for (int i=0; i<numprocs; ++i)
MPI_Send(&x, 1, MPI_INT, i, 100500, MPI_COMM_WORLD);
}
MPI_Recv(&x, 1, MPI_INT, 0, 100500, MPI_COMM_WORLD, &status);
MPI_Finalize();
return 0;
}
The error message is:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(184): MPI_Send(buf=0x7fffffffceb0, count=1, MPI_INT, dest=0, tag=100500, MPI_COMM_WORLD) failed
MPID_Send(54): DEADLOCK: attempting to send a message to the local process without a prior matching receive
Why is that so? I can't understand why it happens on one machine but not on the other.

MPI_Send is a blocking operation. It may not complete until a matching receive is posted. In your case rank 0 is trying to send a message to itself before it has posted a matching receive. If you must do something like this, you would replace MPI_Send with MPI_Isend (plus an MPI_Wait... after the receive). But you might as well just not have it send a message to itself.
The proper thing to use in your case is MPI_Bcast.
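For example, a minimal sketch of the broadcast variant (using the same x as above); every rank makes the same collective call and rank 0's value ends up everywhere, with no self-send involved:
int x = rank;
// rank 0 supplies the value, all other ranks receive it into x
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);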

Since rank 0 already has the correct value of x, you do not need to send it in a message. This means that in the loop you should skip sending to rank 0 and instead start from rank 1:
if (rank == 0) {
    for (int i=1; i<numprocs; ++i)
        MPI_Send(&x, 1, MPI_INT, i, 100500, MPI_COMM_WORLD);
}
MPI_Recv(&x, 1, MPI_INT, 0, 100500, MPI_COMM_WORLD, &status);
Now rank 0 won't try to talk to itself, but since the receive is outside the conditional, it will still try to receive a message from itself. The solution is to simply make the receive the alternative branch:
if (rank == 0) {
    for (int i=1; i<numprocs; ++i)
        MPI_Send(&x, 1, MPI_INT, i, 100500, MPI_COMM_WORLD);
}
else
    MPI_Recv(&x, 1, MPI_INT, 0, 100500, MPI_COMM_WORLD, &status);
Another more involved solution is to use non-blocking operations to post the receive before the send operation:
MPI_Request req;

MPI_Irecv(&x, 1, MPI_INT, 0, 100500, MPI_COMM_WORLD, &req);
if (rank == 0) {
    int xx = x;
    for (int i=0; i<numprocs; ++i)
        MPI_Send(&xx, 1, MPI_INT, i, 100500, MPI_COMM_WORLD);
}
MPI_Wait(&req, &status);
Now rank 0 will not block in MPI_Send as there is already a matching receive posted earlier. In all other ranks MPI_Irecv will be immediately followed by MPI_Wait, which is equivalent to a blocking receive (MPI_Recv). Note that the value of x is copied to a different variable inside the conditional as simultaneously sending from and receiving into the same memory location is forbidden by the MPI standard for obvious correctness reasons.

Related

How to distribute trivial tasks to workers using MPI

I'm trying to teach myself some simple MPI. As an exercise I thought I'd try to distribute a simple task over some workers. The task, as outlined below, is simply to hold the CPU for some predetermined time. The goal of the program is to verify that a worker asks for new work as soon as it has finished its current task, and conversely that it does not wait for other workers in order to do so.
For three CPUs (one manager and two workers) expected output from the below program is something like
worker #1 waited for 1 seconds
worker #2 waited for 1 seconds
worker #1 waited for 1 seconds
worker #1 waited for 1 seconds
...
worker #2 waited for 5 seconds
worker #1 waited for 1 seconds
worker #2 waited for 1 seconds
...
However, in the current implementation, the program never gets past the first two output lines. I'm thinking this is because the workers do not correctly communicate to the manager that they have finished their tasks, and therefore are never given their next tasks.
Any ideas where it goes wrong?
#include <iostream>
#include <mpi.h>
#include <windows.h>

using namespace std;

void task(int waittime, int worldrank) {
    Sleep(waittime); // use sleep for unix systems
    cout << "worker #" << worldrank << " waited for " << waittime << " seconds" << endl;
}

int main()
{
    int waittimes[] = { 1,1,5,1,1,1,1,1,1,1,1,1,1 };
    int nwaits = sizeof(waittimes) / sizeof(int);

    MPI_Init(NULL, NULL);

    int worldrank, worldsize;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldrank);
    MPI_Comm_size(MPI_COMM_WORLD, &worldsize);

    MPI_Status status;
    int ready = 0;

    if (worldrank == 0)
    {
        for (int k = 0; k < nwaits; k++)
        {
            MPI_Recv(&ready, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            MPI_Send(&waittimes[k], 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
    }
    else
    {
        int waittime;
        ready = 1;
        MPI_Send(&ready, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(&waittime, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        task(waittime, worldrank);
    }

    MPI_Finalize();
    return 0;
}
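The suspicion above about the workers is the natural place to look: each worker asks for work exactly once and then exits, so after the first round the manager blocks in MPI_Recv waiting for requests that never arrive. A minimal sketch of how the exchange could be restructured (the stop value -1 and the loop are assumptions for illustration, not part of the original program):
if (worldrank == 0)
{
    for (int k = 0; k < nwaits; k++)
    {
        MPI_Recv(&ready, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        MPI_Send(&waittimes[k], 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
    }
    int stop = -1; // tell each worker there is nothing left to do
    for (int r = 1; r < worldsize; r++)
    {
        MPI_Recv(&ready, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        MPI_Send(&stop, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
    }
}
else
{
    int waittime;
    ready = 1;
    while (true)
    {
        MPI_Send(&ready, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);             // ask for work
        MPI_Recv(&waittime, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); // get a task or the stop value
        if (waittime == -1) break;
        task(waittime, worldrank);
    }
}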

MPI_Send does not fail with destination -1

I am encountering some strange behavior with MPICH. The following minimal example, which sends a message to the non-existing process with rank -1, causes a deadlock:
// Program: send-err
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // We are assuming at least 2 processes for this task
    if (world_size != 2) {
        fprintf(stderr, "World size must be 2 for %s\n", argv[0]);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int number;
    if (world_rank == 0) {
        number = -1;
        MPI_Send(&number,         // data buffer
                 1,               // buffer size
                 MPI_INT,         // data type
                 -1,              // destination
                 0,               // tag
                 MPI_COMM_WORLD); // communicator
    } else if (world_rank == 1) {
        MPI_Recv(&number,
                 1,                  // buffer size
                 MPI_INT,            // data type
                 0,                  // source
                 0,                  // tag
                 MPI_COMM_WORLD,     // communicator
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
}
If the call to the send function,
MPI_Send( start, count, datatype, destination_rank, tag, communicator )
uses destination_rank = -2, then the program fails with the error message:
> mpirun -np 2 send-err
Abort(402250246) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Send: Invalid rank, error stack:
PMPI_Send(157): MPI_Send(buf=0x7fffeb411b44, count=1, MPI_INT, dest=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD) failed
PMPI_Send(94).: Invalid rank has value -2 but must be nonnegative and less than 2
Based on the error message, I would expect a program that sends a message to the process with rank -1 to fail similarly to the program sending a message to the process with rank -2. What causes this difference in behavior?
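One hypothesis worth checking is whether -1 happens to coincide with a special rank constant. In MPICH's headers MPI_PROC_NULL is -1 and MPI_ANY_SOURCE is -2 (which matches the dest=MPI_ANY_SOURCE shown in the error above), and the MPI standard defines a send to MPI_PROC_NULL as completing immediately without transferring anything; that would let rank 0 return from MPI_Send while rank 1 stays blocked in MPI_Recv. A quick diagnostic sketch:
// print the special rank constants of the MPI library in use;
// if MPI_PROC_NULL prints as -1, the send to -1 above is a legal no-op
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    printf("MPI_PROC_NULL = %d, MPI_ANY_SOURCE = %d\n", MPI_PROC_NULL, MPI_ANY_SOURCE);
    MPI_Finalize();
    return 0;
}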

Using MPI_Barrier leads to a fatal error

I get strange behavior from my simple MPI program. I spent time trying to find an answer myself, but I can't. I read some questions here, like OpenMPI MPI_Barrier problems, MPI_SEND stops working after MPI_BARRIER, Using MPI_Bcast for MPI communication. I also read the MPI tutorial on mpitutorial.
My program just modifies an array that was broadcast from the root process and then gathers the modified arrays into one array and prints them.
So the problem is that when I use the code listed below with MPI_Barrier(MPI_COMM_WORLD) uncommented, I get an error.
#include "mpi/mpi.h"
#define N 4
void transform_row(int* row, const int k) {
for (int i = 0; i < N; ++i) {
row[i] *= k;
}
}
const int root = 0;
int main(int argc, char** argv) {
MPI_Init(&argc, &argv);
int rank, ranksize;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &ranksize);
if (rank == root) {
int* arr = new int[N];
for (int i = 0; i < N; ++i) {
arr[i] = i * i + 1;
}
MPI_Bcast(arr, N, MPI_INT, root, MPI_COMM_WORLD);
}
int* arr = new int[N];
MPI_Bcast(arr, N, MPI_INT, root, MPI_COMM_WORLD);
//MPI_Barrier(MPI_COMM_WORLD);
transform_row(arr, rank * 100);
int* transformed = new int[N * ranksize];
MPI_Gather(arr, N, MPI_INT, transformed, N, MPI_INT, root, MPI_COMM_WORLD);
if (rank == root) {
for (int i = 0; i < ranksize; ++i) {
for (int j = 0; j < N ; j++) {
printf("%i ", transformed[i * N + j]);
}
printf("\n");
}
}
MPI_Finalize();
return 0;
}
The error appears when the number of processes is > 1. The error:
Fatal error in PMPI_Barrier: Message truncated, error stack:
PMPI_Barrier(425)...................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(332)..............: Failure during collective
MPIR_Barrier_impl(327)..............:
MPIR_Barrier(292)...................:
MPIR_Barrier_intra(150).............:
barrier_smp_intra(111)..............:
MPIR_Bcast_impl(1452)...............:
MPIR_Bcast(1476)....................:
MPIR_Bcast_intra(1287)..............:
MPIR_Bcast_binomial(239)............:
MPIC_Recv(353)......................:
MPIDI_CH3U_Request_unpack_uebuf(568): Message truncated; 16 bytes received but buffer size is 1
I understand that there is some problem with a buffer, but when I use MPI_Buffer_attach to attach a big buffer to MPI, it doesn't help.
It seems I need to increase this buffer, but I don't know how to do that.
XXXXXX#XXXXXXXXX:~/test_mpi$ mpirun --version
HYDRA build details:
Version: 3.2
Release Date: Wed Nov 11 22:06:48 CST 2015
So help me please.
One issue is that MPI_Bcast() is invoked twice by the root rank, but only once by the other ranks. The root rank then goes on using an uninitialized arr.
MPI_Barrier() is not the real problem here: commenting it out might hide the issue, but it cannot fix it.
Also, note that if N is "large enough", then the second MPI_Bcast() invoked by root rank will likely hang.
Here is how you can revamp the init/broadcast phase to fix these issues.
int* arr = new int[N];

if (rank == root) {
    for (int i = 0; i < N; ++i) {
        arr[i] = i * i + 1;
    }
}

MPI_Bcast(arr, N, MPI_INT, root, MPI_COMM_WORLD);
Note in this case, you can simply initialize arr on all the ranks so you do not even need to broadcast the array.
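A minimal sketch of that alternative (no broadcast at all, since the initial values are deterministic):
// every rank computes the same initial values locally, so no MPI_Bcast is needed
int* arr = new int[N];
for (int i = 0; i < N; ++i) {
    arr[i] = i * i + 1;
}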
As a side note, MPI programs typically
#include <mpi.h>
and then use mpicc for compilation/linking
(this is a wrapper that invokes the real compiler after setting the include/library paths and linking the MPI libraries).
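For example, something along these lines, matching the mpirun usage shown above (mpicxx is the usual C++ wrapper; the file name is just a placeholder):
# compile with the MPI compiler wrapper, then launch 4 processes
mpicxx -o test_mpi test_mpi.cpp
mpirun -np 4 ./test_mpi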

Anomalous MPI behavior

I am wondering if anyone can offer an explanation.
I'll start with the code:
/*
    Barrier implemented using tournament-style coding
*/

// Constraints: Number of processes must be a power of 2, e.g.
// 2,4,8,16,32,64,128,etc.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

void mybarrier(MPI_Comm);

// global debug bool
int verbose = 1;

int main(int argc, char * argv[]) {
    int rank;
    int size;
    int i;
    int sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int check = size;

    // check to make sure the number of processes is a power of 2
    if (rank == 0){
        while(check > 1){
            if (check % 2 == 0){
                check /= 2;
            } else {
                printf("ERROR: The number of processes must be a power of 2!\n");
                MPI_Abort(MPI_COMM_WORLD, 1);
                return 1;
            }
        }
    }

    // simple task, with barrier in the middle
    for (i = 0; i < 500; i++){
        sum ++;
    }

    mybarrier(MPI_COMM_WORLD);

    for (i = 0; i < 500; i++){
        sum ++;
    }

    if (verbose){
        printf("process %d arrived at finalize\n", rank);
    }

    MPI_Finalize();
    return 0;
}

void mybarrier(MPI_Comm comm){
    // MPI variables
    int rank;
    int size;
    int * data;
    MPI_Status * status;

    // Loop variables
    int i;
    int a;

    int skip;

    int complete = 0;
    int currentCycle = 1;

    // Initialize MPI vars
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    // step 1, gathering
    while (!complete){
        skip = currentCycle * 2;

        // if currentCycle divides rank evenly, then it is a target
        if ((rank % currentCycle) == 0){
            // if skip divides rank evenly, then it needs to receive
            if ((rank % skip) == 0){
                MPI_Recv(data, 0, MPI_INT, rank + currentCycle, 99, comm, status);
                if (verbose){
                    printf("1: %d from %d\n", rank, rank + currentCycle);
                }
            // otherwise, it needs to send. Once sent, the process is done
            } else {
                if (verbose){
                    printf("1: %d to %d\n", rank, rank - currentCycle);
                }
                MPI_Send(data, 0, MPI_INT, rank - currentCycle, 99, comm);
                complete = 1;
            }
        }

        currentCycle *= 2;

        // main process will never send, so this code will allow it to complete
        if (currentCycle >= size){
            complete = 1;
        }
    }

    complete = 0;
    currentCycle = size / 2;

    // step 2, scattering
    while (!complete){
        // if currentCycle is 1, then this is the last loop
        if (currentCycle == 1){
            complete = 1;
        }

        skip = currentCycle * 2;

        // if currentCycle divides rank evenly then it is a target
        if ((rank % currentCycle) == 0){
            // if skip divides rank evenly, then it needs to send
            if ((rank % skip) == 0){
                if (verbose){
                    printf("2: %d to %d\n", rank, rank + currentCycle);
                }
                MPI_Send(data, 0, MPI_INT, rank + currentCycle, 99, comm);
            // otherwise, it needs to receive
            } else {
                if (verbose){
                    printf("2: %d waiting for %d\n", rank, rank - currentCycle);
                }
                MPI_Recv(data, 0, MPI_INT, rank - currentCycle, 99, comm, status);
                if (verbose){
                    printf("2: %d from %d\n", rank, rank - currentCycle);
                }
            }
        }

        currentCycle /= 2;
    }
}
Expected behavior
The code is to increment a sum to 500, wait for all other processes to reach that point using blocking MPI_Send and MPI_Recv calls, and then increment sum to 1000.
Observed behavior on cluster
Cluster behaves as expected
Anomalous behavior observed on my machine
All processes in main function are reported as being 99, which I have linked specifically to the tag of the second while loop of mybarrier.
In addition
My first draft was written with for loops, and with that one, the program executes as expected on the cluster as well, but on my machine execution never finishes, even though all processes call MPI_Finalize (but none move beyond it).
MPI Versions
My machine is running OpenRTE 2.0.2
The cluster is running OpenRTE 1.6.3
The questions
I have observed that my machine seems to run unexpectedly all of the time, while the cluster executes normally. This is true for other MPI code I have written as well. Were there major changes between 1.6.3 and 2.0.2 that I'm not aware of?
At any rate, I'm baffled, and I was wondering if anyone could offer some explanation as to why my machine seems to not run MPI correctly. I hope I have provided enough details, but if not, I will be happy to provide whatever additional information you require.
There is a problem with your code; maybe that's what is causing the weird behavior you are seeing.
You are passing to the MPI_Recv routines a status object that hasn't been allocated. In fact, that pointer is not even initialized, so if it happens not to be NULL, MPI_Recv will end up writing somewhere random in memory, causing undefined behavior. The correct form is the following:
MPI_Status status;
...
MPI_Recv(..., &status);
Or if you want to use the heap:
MPI_Status *status = malloc(sizeof(MPI_Status));
...
MPI_Recv(..., status);
...
free(status);
Also, since you are not using the status returned by the receive, you should use MPI_STATUS_IGNORE instead:
MPI_Recv(..., MPI_STATUS_IGNORE);
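Applied to the barrier above, a zero-count receive could look like this (a sketch; the dummy buffer is an assumption, added because the data pointer in the original is never allocated either, even though a zero-count message never touches it):
int dummy; /* nothing is read from or written to this buffer for a zero-count message */
MPI_Recv(&dummy, 0, MPI_INT, rank + currentCycle, 99, comm, MPI_STATUS_IGNORE);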

MPI_Waitall hangs

I have this MPI program which hangs without completing. Any idea where it is going wrong? Maybe I am missing something, but I cannot think of a possible issue with the code. Changing the order of send and receive doesn't work either. (But I am guessing any order would do, due to the nonblocking nature of the calls.)
#include <mpi.h>

int main(int argc, char** argv) {
    int p = 2;
    int myrank;
    double in_buf[1];
    double out_buf[1];
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(comm, &myrank);
    MPI_Comm_size(comm, &p);

    MPI_Request requests[2];
    MPI_Status statuses[2];

    if (myrank == 0) {
        MPI_Isend(out_buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &requests[0]);
        MPI_Irecv(in_buf, 1, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &requests[1]);
    } else {
        MPI_Irecv(in_buf, 1, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &requests[0]);
        MPI_Isend(out_buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &requests[1]);
    }

    MPI_Waitall(2, requests, statuses);
    printf("Done...\n");
}
Right off the bat it looks like your tags are mismatched.
You post the Isend from rank 0 with tag=0 but post the Irecv from rank 1 with tag=1.
I assume you launch two processes as well, right? The int p = 2 doesn't do anything useful.
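A sketch of pairings that would match (lining up both the tags and the source/destination ranks so ranks 0 and 1 exchange one double each, assuming exactly two processes; the single tag 0 is an arbitrary choice):
if (myrank == 0) {
    MPI_Isend(out_buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &requests[0]); // to rank 1
    MPI_Irecv(in_buf,  1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &requests[1]); // from rank 1
} else {
    MPI_Irecv(in_buf,  1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &requests[0]); // from rank 0
    MPI_Isend(out_buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &requests[1]); // to rank 0
}
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);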
