I'm trying to synchronize my processes at the beginning of the execution via MPI_Barrier, but my program gets blocked on it. Nevertheless, I can see that all the processes reach that line, because each of them prints to the screen in the instruction right before it.
int num_processes, packet_size, partner_rank;
double start_comm_time, end_comm_time, comm_time;
comm_time = 0;
if(argc==3) {
if (sscanf (argv[1], "%i", &num_processes) != 1) {
fprintf(stderr, "error - parameter 1 not an integer");
} else;
if (sscanf (argv[2], "%i", &packet_size) != 1) {
fprintf(stderr, "error - parameter 2 not an integer");
} else;
}
else {
printf("\n Usage: broadcast $count $packet_size");
return 0;
}
char buf_send[packet_size], buf_recv[packet_size];
buf_send[0] = 0;
// Initialize
MPI_Init(NULL, NULL);
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
printf("\n Comm size %d \n", world_size);
printf("\n Process %d before barrier \n", world_rank);
// Time MPI_Bcast
MPI_Barrier(MPI_COMM_WORLD);
printf("\n Process %d after barrier \n", world_rank);
start_comm_time = MPI_Wtime();
MPI_Bcast(buf_send, packet_size, MPI_CHAR, 0, MPI_COMM_WORLD);
printf("\n Process %d before second barrier \n", world_rank);
MPI_Barrier(MPI_COMM_WORLD);
end_comm_time = MPI_Wtime();
This is what I get in the printout:
Comm size 6
Process 0 before barrier
Comm size 6
Process 2 before barrier
Comm size 6
Process 3 before barrier
Comm size 6
Process 1 before barrier
Comm size 6
Process 4 before barrier
Comm size 6
Process 5 before barrier
I removed the timing and MPI_Bcast parts of your program (because it is incomplete), and it runs well on my machine, without getting blocked. So there could be something wrong with other parts of your code.
Judging from your whole program, I don't think your code has any problems. The problem is probably your MPI environment. I also ran it on my machine using mpirun -n 6 ./xxx 6 1, and this is what I get:
Comm size 6
Process 0 before barrier
Comm size 6
Process 5 before barrier
Comm size 6
Process 2 before barrier
Comm size 6
Comm size 6
Process 1 before barrier
Process 3 before barrier
Comm size 6
Process 4 before barrier
Process 0 after barrier
Process 0 before second barrier
Process 1 after barrier
Process 1 before second barrier
Process 4 after barrier
Process 4 before second barrier
Process 5 after barrier
Process 5 before second barrier
Process 2 after barrier
Process 2 before second barrier
Process 3 after barrier
Process 3 before second barrier
I want to iteratively use insert to modify the first element of a vector<int> (I know that with vector it's better to insert elements at the back, I was just playing).
int main() {
vector<int> v1 = {1,2,2,2,2};
auto itr = v1.begin();
print_vector(v1);
cout<<*itr<<endl; // ok, itr is pointing to first element
v1.insert(itr,3);
cout<<*itr<<endl; // after inserting 3 itr is still pointing to 1
print_vector(v1);
cout<<*itr<<endl; // but now itr is pointing to 3
v1.insert(itr,7);
print_vector(v1);
cout<<*itr<<endl;
return 0;
}
v[]: 1 2 2 2 2
1
1
v[]: 3 1 2 2 2 2
3
v[]: 131072 3 1 2 2 2 2
Process finished with exit code 0
So my problems here are mainly two:
After v1.insert(itr,3), itr is still pointing to 1. After the call to print_vector(), itr is now pointing to 3. Why?
OK, now itr is pointing to 3 (the first element of v1). I call v1.insert(itr,7), but instead of placing 7 as the first element, it places 131072. Again, why?
The print_vector function I have implemented is the following:
void print_vector(vector<int> v){
cout<<"v[]: ";
for(int i:v){
cout<<i<<" ";
}
cout<<endl;
}
After inserting an element into a vector, its iterators may be invalidated: all of them if the insertion triggers a reallocation, and otherwise every iterator at or after the insertion point (which includes your itr, since you insert at the front). Using an invalidated iterator is undefined behavior. You can find a list of iterator invalidation conditions in the answers on Iterator invalidation rules for C++ containers.
Anything you're experiencing after the first v1.insert() call falls under undefined behavior, as you can clearly see from the appearance of 131072 (an arbitrary value).
If you refresh the iterator after every insertion call, you should get normal behavior:
int main()
{
vector<int> v1 = { 1,2,2,2,2 };
auto itr = v1.begin();
print_vector(v1);
cout << *itr << endl;
v1.insert(itr, 3);
itr = v1.begin(); // Iterator refreshed
cout << *itr << endl;
print_vector(v1);
cout << *itr << endl;
v1.insert(itr, 7);
itr = v1.begin(); // Iterator refreshed
print_vector(v1);
cout << *itr << endl;
return 0;
}
And the output:
v[]: 1 2 2 2 2
1
3
v[]: 3 1 2 2 2 2
3
v[]: 7 3 1 2 2 2 2
7
I'm trying to compute a fractal picture in parallel with MPI.
I've divided my program into 4 parts:
Balance the number of rows treated by each rank
Perform the calculation on each row attributed to the rank
Send the number of rows and the rows to rank 0
Process the data in rank 0 (for the test, just print the ints)
Steps 1 and 2 are working, but when I try to send the rows to rank 0 the program stops and blocks. I know that MPI_Send can block, but there is no reason for that here.
Here is the code, step by step:
Step 1 :
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Include the MPI library for function calls */
#include <mpi.h>
/* Define tags for each MPI_Send()/MPI_Recv() pair so distinct messages can be
* sent */
#define OTHER_N_ROWS_TAG 0
#define OTHER_PIXELS_TAG 1
int main(int argc, char **argv) {
const int nRows = 513;
const int nCols = 513;
const int middleRow = 0.5 * (nRows - 1);
const int middleCol = 0.5 * (nCols - 1);
const double step = 0.00625;
const int depth = 100;
int pixels[nRows][nCols];
int row;
int col;
double xCoord;
double yCoord;
int i;
double x;
double y;
double tmp;
int myRank;
int nRanks;
int evenSplit;
int nRanksWith1Extra;
int myRow0;
int myNRows;
int rank;
int otherNRows;
int otherPixels[nRows][nCols];
/* Each rank sets up MPI */
MPI_Init(&argc, &argv);
/* Each rank determines its ID and the total number of ranks */
MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
printf("My rank is %d \n",myRank);
evenSplit = nRows / nRanks;
nRanksWith1Extra = nRows % nRanks;
/*Each rank determine the number of rows that he will have to perform (well balanced)*/
if (myRank < nRanksWith1Extra) {
myNRows = evenSplit + 1;
myRow0 = myRank * (evenSplit + 1);
}
else {
myNRows = evenSplit;
myRow0 = (nRanksWith1Extra * (evenSplit + 1)) +
((myRank - nRanksWith1Extra) * evenSplit);
}
/*__________________________________________________________________________________*/
Step 2 :
/*_____________________PERFORM CALCUL ON EACH PIXEL________________________________ */
for (row = myRow0; row < myRow0 + myNRows; row++) {
/* Each rank loops over the columns in the given row */
for (col = 0; col < nCols; col++) {
/* Each rank sets the (x,y) coordinate for the pixel in the given row and
* column */
xCoord = (col - middleCol) * step;
yCoord = (row - middleRow) * step;
/* Each rank calculates the number of iterations for the pixel in the
* given row and column */
i = 0;
x = 0;
y = 0;
while ((x*x + y*y < 4) && (i < depth)) {
tmp = x*x - y*y + xCoord;
y = 2*x*y + yCoord;
x = tmp;
i++;
}
/* Each rank stores the number of iterations for the pixel in the given
* row and column. The initial row is subtracted from the current row
* so the array starts at 0 */
pixels[row - myRow0][col] = i;
}
//printf("one row performed by %d \n",myRank);
}
printf("work done by %d \n",myRank);
/*_________________________________________________________________________________*/
Step 3:
/*__________________________SEND DATA TO RANK 0____________________________________*/
/* Each rank (including Rank 0) sends its number of rows to Rank 0 so Rank 0
* can tell how many pixels to receive */
MPI_Send(&myNRows, 1, MPI_INT, 0, OTHER_N_ROWS_TAG, MPI_COMM_WORLD);
printf("test \n");
/* Each rank (including Rank 0) sends its pixels array to Rank 0 so Rank 0
* can print it */
MPI_Send(&pixels, sizeof(int)*myNRows * nCols, MPI_BYTE, 0, OTHER_PIXELS_TAG,
MPI_COMM_WORLD);
printf("enter ranking 0 \n");
/*_________________________________________________________________________________*/
Step 4:
/*________________________TREAT EACH ROW IN RANK 0_________________________________*/
/* Only Rank 0 prints so the output is in order */
if (myRank == 0) {
/* Rank 0 loops over each rank so it can receive that rank's messages */
for (rank = 0; rank < nRanks; rank++){
/* Rank 0 receives the number of rows from the given rank so it knows how
* many pixels to receive in the next message */
MPI_Recv(&otherNRows, 1, MPI_INT, rank, OTHER_N_ROWS_TAG,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Rank 0 receives the pixels array from each of the other ranks
* (including itself) so it can print the number of iterations for each
* pixel */
MPI_Recv(&otherPixels, otherNRows * nCols, MPI_INT, rank,
OTHER_PIXELS_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Rank 0 loops over the rows for the given rank */
for (row = 0; row < otherNRows; row++) {
/* Rank 0 loops over the columns within the given row */
for (col = 0; col < nCols; col++) {
/* Rank 0 prints the value of the pixel at the given row and column
* followed by a comma */
printf("%d,", otherPixels[row][col]);
}
/* In between rows, Rank 0 prints a newline character */
printf("\n");
}
}
}
/* All processes clean up the MPI environment */
MPI_Finalize();
return 0;
}
I would like to understand why it blocks; could you explain it to me?
I'm a new user of MPI, and I would like to learn it, not just end up with a program that works.
Thank you in advance.
MPI_Send is, by the definition in the standard, a blocking operation.
Note that blocking means:
it does not return until the message data and envelope have been safely stored away so that the sender is free to modify the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.
Having a rank send messages to itself with MPI_Send and only receive them afterwards with MPI_Recv, as rank 0 does here, is a deadlock.
The idiomatic pattern for your situation is to use the appropriate collective communication operations MPI_Gather and MPI_Gatherv.
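For illustration, here is a minimal sketch of that gather pattern for this row distribution, reusing the question's names (myRank, myNRows, nCols, nRows, pixels, nRanks, rank) and using MPI_INT for the pixel data on both sides; the helper buffers rowCounts, counts, displs and allPixels are new names introduced only for this sketch:
/* Sketch: replace the Send/Recv pairs of steps 3 and 4 with collectives */
int *rowCounts = NULL, *counts = NULL, *displs = NULL, *allPixels = NULL;

if (myRank == 0) {
    /* Rank 0 needs room for every rank's row count and for the whole image */
    rowCounts = malloc(nRanks * sizeof(int));
    counts    = malloc(nRanks * sizeof(int));
    displs    = malloc(nRanks * sizeof(int));
    allPixels = malloc(sizeof(int) * nRows * nCols);
}

/* Rank 0 collects how many rows each rank computed */
MPI_Gather(&myNRows, 1, MPI_INT, rowCounts, 1, MPI_INT, 0, MPI_COMM_WORLD);

if (myRank == 0) {
    int offset = 0;
    for (rank = 0; rank < nRanks; rank++) {
        counts[rank] = rowCounts[rank] * nCols; /* ints contributed by this rank */
        displs[rank] = offset;                  /* where its rows start in allPixels */
        offset += counts[rank];
    }
}

/* Every rank (rank 0 included) contributes its block of rows; no send to
 * self is involved, so the deadlock disappears */
MPI_Gatherv(pixels, myNRows * nCols, MPI_INT,
            allPixels, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

/* Rank 0 can now print allPixels as nRows rows of nCols values */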
Using blocking send/recv constructs when rank 0 sends to itself might cause a deadlock.
From the MPI 3.0 standard, Section 3.2.4:
Source = destination is allowed, that is, a process can send a message to itself. (However, it is unsafe to do so with the blocking send and receive operations described above,
since this may lead to deadlock. See Section 3.5.)
Possible solutions:
Use non-blocking send/recv constructs when sending/receiving to/from rank 0 itself. For more information, take a look at the MPI_Isend, MPI_Irecv and MPI_Wait routines (a minimal sketch follows after this list).
Eliminate communication with rank 0 itself. Since you are rank 0, you already know how many pixels you have to compute.
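For illustration, a minimal sketch of the first option, reusing the question's names (myRank, myNRows, nCols, pixels, otherNRows, otherPixels, the OTHER_* tags) and switching the pixel message to MPI_INT on both sides so the send and receive types match; selfReqs is a new name introduced only for this sketch:
MPI_Request selfReqs[2];

if (myRank == 0) {
    /* Pre-post non-blocking receives for rank 0's own two messages, so the
     * blocking sends below always find a matching receive */
    MPI_Irecv(&otherNRows, 1, MPI_INT, 0, OTHER_N_ROWS_TAG,
              MPI_COMM_WORLD, &selfReqs[0]);
    MPI_Irecv(otherPixels, myNRows * nCols, MPI_INT, 0, OTHER_PIXELS_TAG,
              MPI_COMM_WORLD, &selfReqs[1]);
}

/* Every rank (rank 0 included) still sends its data to rank 0 */
MPI_Send(&myNRows, 1, MPI_INT, 0, OTHER_N_ROWS_TAG, MPI_COMM_WORLD);
MPI_Send(pixels, myNRows * nCols, MPI_INT, 0, OTHER_PIXELS_TAG, MPI_COMM_WORLD);

if (myRank == 0) {
    /* Complete the self-messages, then keep the existing MPI_Recv loop for
     * ranks 1 .. nRanks-1 (rank 0's own rows are already in otherPixels) */
    MPI_Waitall(2, selfReqs, MPI_STATUSES_IGNORE);
}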
As explained in a previous answer, MPI_Send() might block.
From a theoretical MPI point of view, your application is incorrect because of a potential deadlock (rank 0 MPI_Send() to itself when no receive is posted).
From a very pragmatic point of view, MPI_Send() generally returns immediately when a small message is sent (such as myNRows), but blocks until a matching receive is posted when a large message is sent (such as pixels). Please keep in mind that:
small and large depend at least on both the MPI library and the interconnect being used
it is incorrect from an MPI point of view to assume that MPI_Send() will return immediately for small messages
If you really want to make sure your application is deadlock free, you can simply replace MPI_Send() with MPI_Ssend().
Back to your question, there are several options here:
revamp your app so rank 0 does not communicate with itself (all the info is available, so no communication is needed)
post an MPI_Irecv() before MPI_Send(), and replace MPI_Recv(source=0) with MPI_Wait()
revamp your app so rank 0 does not use MPI_Send() nor MPI_Recv(source=0), but MPI_Sendrecv instead (see the sketch below). This is my recommended option since you only have to make a small change to the communication pattern (the computation pattern is kept untouched), which is more elegant imho.
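For illustration, a minimal sketch of the MPI_Sendrecv option, again reusing the question's names and using MPI_INT for the pixel message on both sides so the send and receive types match:
if (myRank == 0) {
    for (rank = 0; rank < nRanks; rank++) {
        if (rank == 0) {
            /* Rank 0 exchanges with itself: the send and the receive are
             * paired in one call, so it cannot deadlock with itself */
            MPI_Sendrecv(&myNRows, 1, MPI_INT, 0, OTHER_N_ROWS_TAG,
                         &otherNRows, 1, MPI_INT, 0, OTHER_N_ROWS_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(pixels, myNRows * nCols, MPI_INT, 0, OTHER_PIXELS_TAG,
                         otherPixels, myNRows * nCols, MPI_INT, 0, OTHER_PIXELS_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&otherNRows, 1, MPI_INT, rank, OTHER_N_ROWS_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(otherPixels, otherNRows * nCols, MPI_INT, rank,
                     OTHER_PIXELS_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* ... print otherNRows rows of otherPixels here, as in step 4 ... */
    }
} else {
    MPI_Send(&myNRows, 1, MPI_INT, 0, OTHER_N_ROWS_TAG, MPI_COMM_WORLD);
    MPI_Send(pixels, myNRows * nCols, MPI_INT, 0, OTHER_PIXELS_TAG, MPI_COMM_WORLD);
}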
I have written a program to print Fibonacci numbers up to the limit the user wants. I wrote it in a recursive fashion, which should give the expected output. It gives the right output, but with extra wrong values appended. This happens if the user wants to print 4 or more Fibonacci numbers. Also, in the recursive function I decrease the count value before passing it to the recursive call; if I decrease the count value in the call's parameters instead, the while loop runs endlessly. When the loop finishes after some steps and the user's limit input is 5, the output is:
Enter the limit number....
5
Fibonacci numbers are: 0 1 1 2 3 3 2 3 3
Finished.........
Can anyone tell me the fault in my program or the exact reason behind this output? Thanks in advance.
Program is as follows:
import java.util.Scanner;

public class FibonacciNumbers
{
public static void main(String[] args)
{
int i=0, j=1;
Scanner sc = new Scanner(System.in);
System.out.println("Enter the limit number....");
int num = sc.nextInt();
System.out.print("Fibonacci numbers are: " + i + " " + j + " " );
fibonacci(num-2, i, j);
System.out.println("\nFinished.........");
}
public static void fibonacci(int count, int i, int j)
{
int sum = 0;
while(count > 0)
{
sum = i+j;
i=j;
j=sum;
System.out.print(sum + " ");
--count;
fibonacci(count, i, j);
}
}
}
You don't need both the while loop AND the recursive function calls. You have to choose between using a loop OR recursive calls.
The recursive solution:
public static void fibonacci(int count, int i, int j) {
if (count>0){
int sum = i+j;
i=j;
j=sum;
System.out.print(sum + " ");
--count;
fibonacci(count, i, j);
}
}
The solution involving a loop:
public static void fibonacci(int count, int i, int j) {
int sum = 0;
while(count > 0) {
sum = i+j;
i=j;
j=sum;
System.out.print(sum + " ");
--count;
}
}
The problem with your code
If you look closely at the following output of your code, you can see that at the beginning of the output there are the actual first 7 Fibonacci numbers, and after that comes an unneeded series of the same Fibonacci numbers. You printed two numbers from main, and then you expected 5 more numbers but got 31:
Enter the limit number.... 7
Fibonacci numbers are: 0 1 1 2 3 5 8 8 5 8 8 3 5 8 8 5 8 8 2 3 5 8 8 5
8 8 3 5 8 8 5 8 8
This happens because when you first call the fibonacci function with count=5, the while loop has 5 iterations, so it prints 5 Fibonacci numbers, and the fibonacci function is called 5 times from there with these count parameters: 4, 3, 2, 1, 0. When the fibonacci function is called with the parameter count=4, it prints 4 numbers and calls fibonacci 4 times with these parameters: 3, 2, 1, 0, because the while loop then has 4 iterations, and so on down to count=0 (which prints nothing).
If you add it all up (5 numbers printed at the top level, plus 15, 7, 3, 1 and 0 numbers printed by the recursive calls with count 4, 3, 2, 1 and 0), you can see that the program prints 31 Fibonacci numbers altogether, which is way too much because you wanted to print only 5! This trouble is caused by using while and recursive calls at the same time. You want either the recursive behaviour with no while loop, or one while loop and no recursion, as shown in the two solutions above.
I am a new user of MVAPICH2, and I ran into trouble when starting out with it.
First, I think I have installed it successfully, with:
./configure --disable-fortran --enable-cuda
make -j 4
make install
There were no errors.
But when I attempted to run the cpi example in the examples directory, I ran into the following situation:
I can connect to nodes gpu-cluster-1 and gpu-cluster-4 through ssh without a password.
I ran the cpi example separately on gpu-cluster-1 and on gpu-cluster-4 using mpirun_rsh, and it worked OK, just like this:
run#gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-1 gpu-cluster-1 ./cpi
Process 0 of 2 is on gpu-cluster-1
Process 1 of 2 is on gpu-cluster-1
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000089
run#gpu-cluster-4:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-4 gpu-cluster-4 ./cpi
Process 0 of 2 is on gpu-cluster-4
Process 1 of 2 is on gpu-cluster-4
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000134
I ran the cpi example on both gpu-cluster-1 and gpu-cluster-4 using mpiexec, and it worked OK, just like this:
run#gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpiexec -np 2 -f hostfile ./cpi
Process 0 of 2 is on gpu-cluster-1
Process 1 of 2 is on gpu-cluster-4
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000352
The content in hostfile is "gpu-cluster-1\ngpu-cluster-4"
But when I ran the cpi example using mpirun_rsh on both gpu-cluster-1 and gpu-cluster-4, a problem came up:
run#gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 -hostfile hostfile ./cpi
Process 1 of 2 is on gpu-cluster-4
-----------------It stuck here, not going on ------------------------
After a long time, I pressed Ctrl + C, and it printed this:
^C[gpu-cluster-1:mpirun_rsh][signal_processor] Caught signal 2, killing job
run#gpu-cluster-1:~/mvapich2-2.1rc1/examples$ [gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
[gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
[gpu-cluster-4:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
[gpu-cluster-4:mpispawn_1][report_error] connect() failed: Connection refused (111)
I have been confused for a long time; could you give me some help to resolve this problem?
Here is the code of the cpi example:
#include "mpi.h"
#include <stdio.h>
#include <math.h>
double f(double);
double f(double a)
{
return (4.0 / (1.0 + a*a));
}
int main(int argc,char *argv[])
{
int n, myid, numprocs, i;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
double startwtime = 0.0, endwtime;
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name,&namelen);
fprintf(stdout,"Process %d of %d is on %s\n",
myid, numprocs, processor_name);
fflush(stdout);
n = 10000; /* default # of rectangles */
if (myid == 0)
startwtime = MPI_Wtime();
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
/* A slightly better approach starts from large i and works back */
for (i = myid + 1; i <= n; i += numprocs)
{
x = h * ((double)i - 0.5);
sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0) {
endwtime = MPI_Wtime();
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));
printf("wall clock time = %f\n", endwtime-startwtime);
fflush(stdout);
}
MPI_Finalize();
return 0;
}
I'm calling the kernel below with GlobalWorkSize 64 4 1 and WorkGroupSize 1 4 1 with the argument output initialized to zeros.
__kernel void kernelB(__global unsigned int * output)
{
uint gid0 = get_global_id(0);
uint gid1 = get_global_id(1);
output[gid0] += gid1;
}
I'm expecting 6 6 6 6 ... as the sum of the gid1's (0 + 1 + 2 + 3). Instead I get 3 3 3 3 ... Is there a way to get this functionality? In general I need the sum of the results of each work-item in a work group.
EDIT: It seems it must be said, I'd like to solve this problem without atomics.
You need to use local memory to store the output from all work items. After the work items have finished their computation, you sum the results with an accumulation step.
__kernel void kernelB(__global unsigned int * output)
{
uint item_id = get_local_id(0);
uint group_id = get_group_id(0);
//memory size is hard-coded to the expected work group size for this example
local unsigned int result[4];
//the computation
result[item_id] = item_id % 3;
//wait for all items to write to result
barrier(CLK_LOCAL_MEM_FENCE);
//simple O(n) reduction using the first work item in the group
if(item_id == 0){
for(int i=1;i<4;i++){
result[0] += result[i];
}
output[group_id] = result[0];
}
}
Multiple work items are accessing the same elements of output in global memory simultaneously, and the result is undefined. You need to use atomic operations or have each work item write to a unique location.
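For illustration, a minimal sketch of the atomic variant of the original kernel (kernelB_atomic is just an illustrative name; note the question's EDIT asks to avoid atomics, so the local-memory reduction in the other answer may be preferable):
__kernel void kernelB_atomic(__global unsigned int * output)
{
    uint gid0 = get_global_id(0);
    uint gid1 = get_global_id(1);
    /* atomic_add serializes the concurrent updates to output[gid0],
     * so none of the four gid1 contributions is lost */
    atomic_add(&output[gid0], gid1);   /* was: output[gid0] += gid1; */
}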