Anomalous MPI behavior - mpi

I am wondering if anyone can offer an explanation.
I'll start with the code:
/*
Barrier implemented using tournament-style coding
*/
// Constraints: Number of processes must be a power of 2, e.g.
// 2,4,8,16,32,64,128,etc.
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
void mybarrier(MPI_Comm);
// global debug bool
int verbose = 1;
int main(int argc, char * argv[]) {
int rank;
int size;
int i;
int sum = 0;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int check = size;
// check to make sure the number of processes is a power of 2
if (rank == 0){
while(check > 1){
if (check % 2 == 0){
check /= 2;
} else {
printf("ERROR: The number of processes must be a power of 2!\n");
MPI_Abort(MPI_COMM_WORLD, 1);
return 1;
}
}
}
// simple task, with barrier in the middle
for (i = 0; i < 500; i++){
sum ++;
}
mybarrier(MPI_COMM_WORLD);
for (i = 0; i < 500; i++){
sum ++;
}
if (verbose){
printf("process %d arrived at finalize\n", rank);
}
MPI_Finalize();
return 0;
}
void mybarrier(MPI_Comm comm){
// MPI variables
int rank;
int size;
int * data;
MPI_Status * status;
// Loop variables
int i;
int a;
int skip;
int complete = 0;
int currentCycle = 1;
// Initialize MPI vars
MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);
// step 1, gathering
while (!complete){
skip = currentCycle * 2;
// if currentCycle divides rank evenly, then it is a target
if ((rank % currentCycle) == 0){
// if skip divides rank evenly, then it needs to receive
if ((rank % skip) == 0){
MPI_Recv(data, 0, MPI_INT, rank + currentCycle, 99, comm, status);
if (verbose){
printf("1: %d from %d\n", rank, rank + currentCycle);
}
// otherwise, it needs to send. Once sent, the process is done
} else {
if (verbose){
printf("1: %d to %d\n", rank, rank - currentCycle);
}
MPI_Send(data, 0, MPI_INT, rank - currentCycle, 99, comm);
complete = 1;
}
}
currentCycle *= 2;
// main process will never send, so this code will allow it to complete
if (currentCycle >= size){
complete = 1;
}
}
complete = 0;
currentCycle = size / 2;
// step 2, scattering
while (!complete){
// if currentCycle is 1, then this is the last loop
if (currentCycle == 1){
complete = 1;
}
skip = currentCycle * 2;
// if currentCycle divides rank evenly then it is a target
if ((rank % currentCycle) == 0){
// if skip divides rank evenly, then it needs to send
if ((rank % skip) == 0){
if (verbose){
printf("2: %d to %d\n", rank, rank + currentCycle);
}
MPI_Send(data, 0, MPI_INT, rank + currentCycle, 99, comm);
// otherwise, it needs to receive
} else {
if (verbose){
printf("2: %d waiting for %d\n", rank, rank - currentCycle);
}
MPI_Recv(data, 0, MPI_INT, rank - currentCycle, 99, comm, status);
if (verbose){
printf("2: %d from %d\n", rank, rank - currentCycle);
}
}
}
currentCycle /= 2;
}
}
Expected behavior
The code is to increment a sum to 500, wait for all other processes to reach that point using blocking MPI_Send and MPI_Recv calls, and then increment sum to 1000.
Observed behavior on cluster
Cluster behaves as expected
Anomalous behavior observed on my machine
All processes in main function are reported as being 99, which I have linked specifically to the tag of the second while loop of mybarrier.
In addition
My first draft was written with for loops, and with that one, the program executes as expected on the cluster as well, but on my machine execution never finishes, even though all processes call MPI_Finalize (but none move beyond it).
MPI Versions
My machine is running OpenRTE 2.0.2
The cluster is running OpenRTE 1.6.3
The questions
I have observed that my machine seems to run unexpectedly all of the time, while the cluster executes normally. This is true with other MPI code I have written as well. Was there major changes between 1.6.3 and 2.0.2 that I'm not aware of?
At any rate, I'm baffled, and I was wondering if anyone could offer some explanation as to why my machine seems to not run MPI correctly. I hope I have provided enough details, but if not, I will be happy to provide whatever additional information you require.

There is a problem with your code, maybe that's what causing the weird behavior you are seeing.
You are passing to the MPI_Recv routines a status object that hasn't been allocated. In fact, that pointer is not even initialized, so if it happens not to be NULL, the MPI_Recv will endup writing wherever in memory causing undefined behavior. The correct form is the following:
MPI_Status status;
...
MPI_Recv(..., &status);
Or if you want to use the heap:
MPI_Status *status = malloc(sizeof(MPI_Status));
...
MPI_Recv(..., status);
...
free(status);
Also since you are not using the value returned by the receive, you should instead use MPI_STATUS_IGNORE instead:
MPI_Recv(..., MPI_STATUS_IGNORE);

Related

Is there some way of avoiding this implicit MPI_Allreduce() synchronisation?

I'm writing an MPI program that uses a library which makes its own MPI calls. In my program, I have a loop that calls a function from the library. The function that I'm calling from the library makes use of MPI_Allreduce.
The problem here is that in my program, some of the ranks can exit the loop before others and this causes the MPI_Allreduce call to just hang since not all ranks will be calling MPI_Allreduce again.
Is there any way of programming around this without modifying the sources of the library I'm using?
Below is the code for an example which demonstrates the execution pattern.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>
#include <math.h>
#include <assert.h>
#define N_ITEMS 100000
#define ITERATIONS 32
float *create_rand_nums(int num_elements) {
float *rand_nums = (float *)malloc(sizeof(float) * num_elements);
assert(rand_nums != NULL);
int i;
for (i = 0; i < num_elements; i++) {
rand_nums[i] = (rand() / (float)RAND_MAX);
}
return rand_nums;
}
void reduce_stddev(int world_rank, int world_size, int num_elements_per_proc)
{
fprintf(stdout, "Calling %s: %d\n", __func__, world_rank);
fflush(stdout);
srand(time(NULL)*world_rank);
float *rand_nums = NULL;
rand_nums = create_rand_nums(num_elements_per_proc);
float local_sum = 0;
int i;
for (i = 0; i < num_elements_per_proc; i++) {
local_sum += rand_nums[i];
}
float global_sum;
fprintf(stdout, "%d: About to call all reduce\n", world_rank);
fflush(stdout);
MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM,
MPI_COMM_WORLD);
fprintf(stdout, "%d: done calling all reduce\n", world_rank);
fflush(stdout);
float mean = global_sum / (num_elements_per_proc * world_size);
float local_sq_diff = 0;
for (i = 0; i < num_elements_per_proc; i++) {
local_sq_diff += (rand_nums[i] - mean) * (rand_nums[i] - mean);
}
float global_sq_diff;
MPI_Reduce(&local_sq_diff, &global_sq_diff, 1, MPI_FLOAT, MPI_SUM, 0,
MPI_COMM_WORLD);
if (world_rank == 0) {
float stddev = sqrt(global_sq_diff /
(num_elements_per_proc * world_size));
printf("Mean - %f, Standard deviation = %f\n", mean, stddev);
}
free(rand_nums);
}
int main(int argc, char* argv[]) {
if (argc != 2) {
fprintf(stderr, "Usage: avg num_elements_per_proc\n");
exit(1);
}
int num_elements_per_proc = atoi(argv[1]);
MPI_Init(NULL, NULL);
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
unsigned long long j = 0;
for(j = 0; j < ITERATIONS; j++)
{
/* Function which calls MPI_Allreduce */
reduce_stddev(world_rank, world_size, num_elements_per_proc);
/* Simulates some processes leaving the loop early */
if( (j == (ITERATIONS/2)) && (world_rank % 2 == 0))
{
fprintf(stdout, "%d exiting\n", world_rank);
fflush(stdout);
break;
}
}
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
return EXIT_SUCCESS;
}
This is always an issue in MPI - how do you tell all the other ranks when one rank is finished? The easiest approach is for everyone to set a true/false flag and then do an allreduce to see if anyone finished. Using this code at the end seems to work
for(j = 0; j < ITERATIONS; j++)
{
/* Function which calls MPI_Allreduce */
reduce_stddev(world_rank, world_size, num_elements_per_proc);
int finished = 0;
/* Simulates some processes leaving the loop early */
if( (j == (ITERATIONS/2)) && (world_rank % 2 == 0))
{
fprintf(stdout, "%d finished\n", world_rank);
fflush(stdout);
finished = 1;
}
/* Check to see if anyone has finished */
int anyfinished;
MPI_Allreduce(&finished, &anyfinished, 1, MPI_INT, MPI_LOR,
MPI_COMM_WORLD);
if (anyfinished)
{
fprintf(stdout, "%d exiting\n", world_rank);
break;
}
}
OK - I just reread your question and maybe I misunderstood it. Do you want everyone else to keep calculating?

Using of MPI Barrier lead to fatal error

I get a strange behavior of my simple MPI program. I spent time to find an answer myself, but I can't. I red some questions here, like OpenMPI MPI_Barrier problems, MPI_SEND stops working after MPI_BARRIER, Using MPI_Bcast for MPI communication. I red MPI tutorial on mpitutorial.
My program just modify array that was broadcasted from root process and then gather modified arrays to one array and print them.
So, the problem is, that when I use code listed below with uncommented MPI_Barrier(MPI_COMM_WORLD) I get an error.
#include "mpi/mpi.h"
#define N 4
void transform_row(int* row, const int k) {
for (int i = 0; i < N; ++i) {
row[i] *= k;
}
}
const int root = 0;
int main(int argc, char** argv) {
MPI_Init(&argc, &argv);
int rank, ranksize;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &ranksize);
if (rank == root) {
int* arr = new int[N];
for (int i = 0; i < N; ++i) {
arr[i] = i * i + 1;
}
MPI_Bcast(arr, N, MPI_INT, root, MPI_COMM_WORLD);
}
int* arr = new int[N];
MPI_Bcast(arr, N, MPI_INT, root, MPI_COMM_WORLD);
//MPI_Barrier(MPI_COMM_WORLD);
transform_row(arr, rank * 100);
int* transformed = new int[N * ranksize];
MPI_Gather(arr, N, MPI_INT, transformed, N, MPI_INT, root, MPI_COMM_WORLD);
if (rank == root) {
for (int i = 0; i < ranksize; ++i) {
for (int j = 0; j < N ; j++) {
printf("%i ", transformed[i * N + j]);
}
printf("\n");
}
}
MPI_Finalize();
return 0;
}
The error comes with number of thread > 1. The error:
Fatal error in PMPI_Barrier: Message truncated, error stack:
PMPI_Barrier(425)...................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(332)..............: Failure during collective
MPIR_Barrier_impl(327)..............:
MPIR_Barrier(292)...................:
MPIR_Barrier_intra(150).............:
barrier_smp_intra(111)..............:
MPIR_Bcast_impl(1452)...............:
MPIR_Bcast(1476)....................:
MPIR_Bcast_intra(1287)..............:
MPIR_Bcast_binomial(239)............:
MPIC_Recv(353)......................:
MPIDI_CH3U_Request_unpack_uebuf(568): Message truncated; 16 bytes received but buffer size is 1
I understand that some problem with buffer exists, but when I use MPI_buffer_attach to attach big buffer to MPI it don't help.
Seems I need to increase this buffer, but I don't now how to do this.
XXXXXX#XXXXXXXXX:~/test_mpi$ mpirun --version
HYDRA build details:
Version: 3.2
Release Date: Wed Nov 11 22:06:48 CST 2015
So help me please.
One issue is MPI_Bcast() is invoked twice by the root rank, but only once by the other ranks. And then root rank uses an uninitialized arr.
MPI_Barrier() might only hide the problem, but it cannot fix it.
Also, note that if N is "large enough", then the second MPI_Bcast() invoked by root rank will likely hang.
Here is how you can revamp the init/broadcast phase to fix these issues.
int* arr = new int[N];
if (rank == root) {
for (int i = 0; i < N; ++i) {
arr[i] = i * i + 1;
}
MPI_Bcast(arr, N, MPI_INT, root, MPI_COMM_WORLD);
Note in this case, you can simply initialize arr on all the ranks so you do not even need to broadcast the array.
As a side note, MPI program typically
#include <mpi.h>
and then use mpicc for the compilation/linking
(this is a wrapper that invoke the real compiler after setting the include/library paths and using the MPI libs)

MPI Broadcast Very Slow

I am writing an MPI program and the MPI_Bcast function is very slow on one particular machine I am using. In order to narrow down the problem, I have the following two test programs. The first does many MPI_Send/MPI_Recv operations from process 0 to the others:
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#define N 1000000000
int main(int argc, char** argv) {
int rank, size;
/* initialize MPI */
MPI_Init(&argc, &argv);
/* get the rank (process id) and size (number of processes) */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
/* have process 0 do many sends */
if (rank == 0) {
int i, j;
for (i = 0; i < N; i++) {
for (j = 1; j < size; j++) {
if (MPI_Send(&i, 1, MPI_INT, j, 0, MPI_COMM_WORLD) != MPI_SUCCESS) {
printf("Error!\n");
exit(0);
}
}
}
}
/* have the rest receive that many values */
else {
int i;
for (i = 0; i < N; i++) {
int value;
if (MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE) != MPI_SUCCESS) {
printf("Error!\n");
exit(0);
}
}
}
/* quit MPI */
MPI_Finalize( );
return 0;
}
This program runs in only 2.7 seconds or so with 4 processes.
This next program does exactly the same thing, except it uses MPI_Bcast to send the values from process 0 to the other processes:
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#define N 1000000000
int main(int argc, char** argv) {
int rank, size;
/* initialize MPI */
MPI_Init(&argc, &argv);
/* get the rank (process id) and size (number of processes) */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
/* have process 0 do many sends */
if (rank == 0) {
int i, j;
for (i = 0; i < N; i++) {
if (MPI_Bcast(&i, 1, MPI_INT, 0, MPI_COMM_WORLD) != MPI_SUCCESS) {
printf("FAIL\n");
exit(0);
}
}
}
/* have the rest receive that many values */
else {
int i;
for (i = 0; i < N; i++) {
if (MPI_Bcast(&i, 1, MPI_INT, 0, MPI_COMM_WORLD) != MPI_SUCCESS) {
printf("FAIL\n");
exit(0);
}
}
}
/* quit MPI */
MPI_Finalize( );
return 0;
}
Both programs have the same value for N, and neither program returns an error from the communication calls. The second program should be at least a little bit faster. But it is not, it is much slower at roughly 34 seconds - around 12X slower!
This problem only manifests itself on one machine, but not others even though they are running the same operating system (Ubuntu) and don't have drastically different hardware. Also, I'm using OpenMPI on both.
I'm really pulling my hair out, does anyone have an idea?
Thanks for reading!
A couple of observations.
The MPI_Bcast is receiving the result into the "&i" buffer. The MPI_Recv is receiving the result into "&value". Is there some reason that decision was made?
The Send/Recv model will naturally synchronize. The MPI_Send calls are blocking and serialized. The matching MPI_Recv should always be ready when the MPI_Send is called.
In general, collectives tend to have larger advantages as the job size scales up.
I compiled and ran the programs using IBM Platform MPI. I lowered the N value by 100x to 10 Million, to speed up the testing. I changed the MPI_Bcast to receive the result in a "&value" buffer rather than into the "&i" buffer. I ran each case three times, and averaged the times. The times are the "real" value returned by "time" (this was necessary as the ranks were running remotely from the mpirun command).
With 4 ranks over shared memory, the Send/Recv model took 6.5 seconds, the Bcast model took 7.6 seconds.
With 32 ranks (8/node x 4 nodes, FDR InfiniBand), the Send/Recv model took 79 seconds, the Bcast model took 22 seconds.
With 128 ranks (16/node x 8 nodes, FDR Infiniband), the Send/Recv model took 134 seconds, the Bcast model took 44 seconds.
Given these timings AFTER the reduction in the N value by 100x to 10000000, I am going to suggest that the "2.7 second" time was a no-op. Double check that some actual work was done.

Need explanation MPI_Scatter()

Im trying to do the Monte Carlo problem using MPI were we generate x amount of rand. num between 0 and 1 and then send n-length numbers to each processor. I'm using scatter function but my code doesn't run right, it compiles but it doesn't ask for the input. I dont understand how MPI loops by itself without loops, can some explain this and what is wrong with my code?
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include "mpi.h"
main(int argc, char* argv[]) {
int my_rank; /* rank of process */
int p; /* number of processes */
int source; /* rank of sender */
int dest; /* rank of receiver */
int tag = 0; /* tag for messages */
char message[100]; /* storage for message */
MPI_Status status; /* return status for */
double *total_xr, *p_xr, total_size_xr, p_size_xr; /* receive */
/* Start up MPI */
MPI_Init(&argc, &argv);
/* Find out process rank */
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* Find out number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &p);
double temp;
int i, partial_sum, x, total_sum, ratio_p, area;
total_size_xr = 0;
partial_sum = 0;
if(my_rank == 0){
while(total_size_xr <= 0){
printf("How many random numbers should each process get?: ");
scanf("%f", &p_size_xr);
}
total_size_xr = p*p_size_xr;
total_xr = malloc(total_size_xr*sizeof(double));
//xr generator will generate numbers between 1 and 0
srand(time(NULL));
for(i=0; i<total_size_xr; i++)
{
temp = 2.0 * rand()/(RAND_MAX+1.0) -1.0;
//this will make sure if any number computer stays in the boundry of 0 and 1, doesn't go over into the negative
while(temp < 0.0)
{
temp = 2.0 * rand()/(RAND_MAX+1.0) -1.0;
}
//array set to total random numbers generated to be scatter into processors
total_xr[i] = temp;
}
}
else{
//this will be the buffer for the processors to hold their own numbers to add
p_xr = malloc(p_size_xr*sizeof(double));
printf("\n\narray set\n\n");
//scatter xr into processors
MPI_Scatter(total_xr, total_size_xr, MPI_DOUBLE, p_xr, p_size_xr, MPI_DOUBLE, 0, MPI_COMM_WORLD);
//while in processor the partial sum will be caluclated by using xr and the formula sqrt(1-x*x)
for(i=0; i<p_size_xr; i++)
{
x = p_xr[i];
temp = sqrt(1 - (x*x));
partial_sum = partial_sum + temp;
}
//}
//we will send the partial sums to master processor which is processor 0 and add them and place
//the result in total_sum
MPI_Reduce(&partial_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
//once we have all of the sums we need to multiply the total sum and multiply it with 1/N
//N being the number of processors, the area should contain the value of pi.
ratio_p = 1/p;
area = total_sum*ratio_p;
printf("\n\nThe area under the curve of f(x) = sqrt(1-x*x), between 0 and 1 is, %f\n\n", area);
/* Shut down MPI */
MPI_Finalize();
} /* main */
In general, its not good to rely on STDIN/STDOUT for an MPI program. It's possible that MPI implementation could put rank 0 on some node other than the node on which you're launching your jobs. In that case you have to worry about forwarding correctly. While this will work most of the time, it's not usually a good idea.
A better way to do things is to have your user input be in a file that the application can read or via command line variables. Those will be much more portable.
I'm not sure what you mean by MPI looping by itself without loops. Maybe you can clarify that comment if you still need an answer there.

Segmentation fault while using MPI_File_open

I'm trying to read from a file for an MPI application. The cluster has 4 nodes with 12 cores in each node. I have tried running a basic program to compute rank and that works. When I added MPI_File_open it throws an exception at runtime
BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = EXIT CODE: 139
The cluster has MPICH2 installed and has a Network File System. I check MPI_File_open with different parameters like ReadOnly mode, MPI_COMM_WORLD etc.
Can I use MPI_File_open with Network File System?
int main(int argc, char* argv[])
{
int myrank = 0;
int nprocs = 0;
int i = 0;
MPI_Comm icomm = MPI_COMM_WORLD;
MPI_Status status;
MPI_Info info;
MPI_File *fh = NULL;
int error = 0;
MPI_Init(&argc, &argv);
MPI_Barrier(MPI_COMM_WORLD); // Wait for all processor to start
MPI_Comm_size(MPI_COMM_WORLD, &nprocs); // Get number of processes
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); // Get own rank
usleep(myrank*100000);
if ( myrank == 1 || myrank == 0 )
printf("Hello from %d\r\n", myrank);
if (myrank == 0)
{
error = MPI_File_open( MPI_COMM_SELF, "lw1.wei", MPI_MODE_UNIQUE_OPEN,
MPI_INFO_NULL, fh);
if ( error )
{
printf("Error in opening file\r\n");
}
else
{
printf("File successfully opened\r\n");
}
MPI_File_close(fh);
}
MPI_Barrier(MPI_COMM_WORLD); //! Wait for all the processors to end
MPI_Finalize();
if ( myrank == 0 )
{
printf("Number of Processes %d\n\r", nprocs);
}
return 0;
}
You forgot to allocate an MPI_File object before opening the file. You may either change this line:
MPI_File *fh = NULL;
into:
MPI_File fh;
and open file by giving fh's address to MPI_File_open(..., &fh). Or you may simply allocate memory from heap using malloc().
MPI_File *fh = malloc(sizeof(MPI_File));

Resources