Let's say I have 3 ranks.
Rank 0 receives 1 MPI_INT from rank 1 and receives 10 MPI_INT from rank 2:
MPI_Recv(buf1, 1, MPI_INT,
         1, 0, MPI_COMM_WORLD, &status);
MPI_Recv(buf2, 10, MPI_INT,
         2, 0, MPI_COMM_WORLD, &status);
Rank 1 and rank 2 send 1 and 10 MPI_INT to rank 0, respectively. MPI_Recv is a blocking call. Let's say the 10 MPI_INT from rank 2 arrive before the 1 MPI_INT from rank 1. At this point, rank 0 is blocked waiting for data from rank 1.
In this case, could the first MPI_Recv return? The data from rank 2 arrives first, but it cannot fit into buf1, which only holds one integer.
And then the message from rank 1 arrives. Is MPI able to pick out this message and let the first MPI_Recv return?
Since you specify a source when calling MPI_Recv(), you do not have to worry about the order of the messages. The first MPI_Recv() will return when 1 MPI_INT has been received from rank 1, and the second MPI_Recv() will return when 10 MPI_INT have been received from rank 2.
If you had MPI_Recv(..., source=MPI_ANY_SOURCE, ...) it would have been a different story.
Feel free to write a simple program with some sleep() here and there if you still need to convince yourself.
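For example, here is a minimal sketch along those lines (not from the original answer, and assuming exactly 3 ranks): the sleep() in rank 1 forces rank 2's message to arrive first, yet rank 0's first MPI_Recv still completes only with the message from rank 1, because the source argument restricts matching to that rank.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int rank, buf1, buf2[10];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Recv(&buf1, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("first recv done: got %d from rank 1\n", buf1);
        MPI_Recv(buf2, 10, MPI_INT, 2, 0, MPI_COMM_WORLD, &status);
        printf("second recv done: got 10 ints from rank 2\n");
    } else if (rank == 1) {
        int x = 42;
        sleep(2);                /* make sure rank 2's message arrives first */
        MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {
        int data[10] = {0};
        MPI_Send(data, 10, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}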
Related
I have a problem with understanding scatter and gather. Let's say I have one table[tableSize]. In that table I want to do some calculation for every 25 elements, and I want to divide the work among all the processes I have in MPI. I tried something like this:
MPI_Scatter(table, 25, MPI_INT, tmpTable, 25, MPI_INT, 0, MPI_COMM_WORLD);
tmpTable[12] = doTheCalculation(tmpTable);
MPI_Gather(tmpTable, 25, MPI_INT, table, 25, MPI_INT, 0, MPI_COMM_WORLD);
But it works only if 25 * (number of processes) = tableSize. How should I proceed if I would like to have a tableSize of 125 but only 3 processes running? My goal would be for processes 0 and 1 to compute twice (process 0 handles elements 1-25 and then 75-100, and process 1 handles 25-50 and then 100-125). Are scatter and gather capable of doing this? Should I look into something else? Thanks in advance.
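One possible approach (not an answer given in this thread, and it hands out contiguous slices of possibly unequal size rather than the cyclic 25-element chunks described above) is MPI_Scatterv/MPI_Gatherv, which take per-rank counts and displacements, so tableSize does not have to be a multiple of 25 * number of processes. A rough sketch:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int tableSize = 125;
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* per-rank element counts and offsets into the full table */
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));
    for (int i = 0; i < nprocs; i++) {
        counts[i] = tableSize / nprocs + (i < tableSize % nprocs ? 1 : 0);
        displs[i] = (i == 0) ? 0 : displs[i - 1] + counts[i - 1];
    }

    int *table = NULL;
    if (rank == 0) {
        table = malloc(tableSize * sizeof(int));
        /* ... fill table on the root ... */
    }
    int *tmpTable = malloc(counts[rank] * sizeof(int));

    MPI_Scatterv(table, counts, displs, MPI_INT,
                 tmpTable, counts[rank], MPI_INT, 0, MPI_COMM_WORLD);

    /* ... do the calculation on the counts[rank] local elements ... */

    MPI_Gatherv(tmpTable, counts[rank], MPI_INT,
                table, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    free(tmpTable); free(counts); free(displs);
    if (rank == 0) free(table);
    MPI_Finalize();
    return 0;
}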
I'm new to OpenCL and I'm trying to understand this piece of code:
size_t global_work1[3] = {BLOCK_SIZE, 1, 1};
size_t local_work1[3] = {BLOCK_SIZE, 1, 1};
err = clEnqueueNDRangeKernel(cmd_queue, diag, 2, NULL, global_work1, local_work1, 0, 0, 0);
So, in the clEnqueueNDRangeKernel call, 2 dimensions are specified for the kernel (the work_dim argument), which means that:
in dimension 0 the kernel gets a number of work-items equal to BLOCK_SIZE and only one group (I guess the number of groups can be calculated as global_work1[0] / local_work1[0]);
in dimension 1 the kernel gets a single work-item and only one group.
I also don't understand why a dimension of 2 is specified in the enqueue call when global_work1 and local_work1 have three elements each.
You are telling CL:
"Run this kernel, in this queue, with 2D and these global/local sizes"
CL just takes the first 2 dimensions of each argument and ignores the 3rd one.
As for the difference between 1D and 2D: there is none. OpenCL kernels launched as 1D do not fail on get_global_id(1) and get_global_id(2) calls; those calls simply return 0. So there is no difference at all, apart from the hint that the kernel will probably support bigger sizes for the 2nd dimension (e.g. 512x128).
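A small hypothetical kernel (not from the question's code base) shows what that means for a launch with work_dim = 2 and sizes {BLOCK_SIZE, 1}: get_global_id(0) runs over 0..BLOCK_SIZE-1, while the higher dimensions contribute nothing.

__kernel void show_ids(__global int *out)
{
    size_t gx = get_global_id(0);   /* 0 .. BLOCK_SIZE-1                    */
    size_t gy = get_global_id(1);   /* always 0 for this launch             */
    size_t gz = get_global_id(2);   /* dimension was never launched: also 0 */

    out[gx] = (int)(gx + gy + gz);  /* gy and gz add nothing here           */
}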
Many routines in MPI that describe both sending and receiving - MPI_Sendrecv, MPI_Scatter, etc - have arguments for counts and types for both the send and the receive. For instance, in Fortran the signature for MPI_Scatter is:
MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE,
RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR)
If the amount of data sent has to be the same as the amount received, why are both needed? Doesn't this just introduce the possibility of inconsistency?
What is the use case where different counts/types are needed?
MPI requires that the sending and receiving processes agree on the type of data and the amount (sort of; for point-to-point communications the receiver can request more than is sent). But MPI datatypes also describe the layout of the data in memory, and that's a very common reason to need different types for the sender and the receiver.
You ask in particular about Scatter and Fortran, so let's take a look at that case. Consider scattering a size×n matrix, by rows, to the different processes:
    |---n---|  ---
    0 0 0 0     |
a = 1 1 1 1    size
    2 2 2 2     |
               ---
So that rank 0 gets [0 0 0 0], rank 1 gets [1 1 1 1], etc.
In Fortran, those rows aren't contiguous in memory (Fortran arrays are column-major); so to describe a single row, you would have to use an MPI_Type_vector:
call MPI_Type_vector(n, 1, size, MPI_REAL, row_type, ierr)
That describes n reals, with consecutive elements spaced size reals apart.
On the other hand, if the receiving process is just receiving that data into a 1-d array:
real, dimension(n) :: b
Then it cannot use that type to describe the data; b doesn't have enough room to hold n reals spaced size apart! It wants to receive the data simply as n * MPI_REAL. The same mismatch would occur in C if you had to send columns of data.
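For instance, here is a rough C analogue of that column case (not part of the original answer; the array sizes and names are only illustrative, and it needs at least 2 processes):

#include <mpi.h>
#include <stdio.h>

#define NROWS 3
#define NCOLS 4

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One column of a row-major C array is strided, just like a row of a
       column-major Fortran array: NROWS blocks of 1 int, NCOLS ints apart. */
    MPI_Datatype col_type;
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_INT, &col_type);
    MPI_Type_commit(&col_type);

    if (rank == 0) {
        int a[NROWS][NCOLS];
        for (int i = 0; i < NROWS; i++)
            for (int j = 0; j < NCOLS; j++)
                a[i][j] = 10 * i + j;
        /* the sender describes column 1 as one element of the strided type */
        MPI_Send(&a[0][1], 1, col_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int b[NROWS];
        /* the receiver takes it as a plain contiguous buffer of NROWS ints */
        MPI_Recv(b, NROWS, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got column: %d %d %d\n", b[0], b[1], b[2]);
    }

    MPI_Type_free(&col_type);
    MPI_Finalize();
    return 0;
}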
And so this is a common reason for specifying the type (and thus the count) of data differently; for the scatter-er, the data has to be described with a data type which includes the layout of the larger data structure holding the values to be sent; but the scatter-ee may well be receiving the data into a different data structure, with a different layout.
A working simple example follows.
program scatterdemo
    use mpi
    implicit none
    real, allocatable, dimension(:,:) :: a
    real, allocatable, dimension(:) :: b
    integer :: ierr, rank, comsize
    integer, parameter :: n=4
    integer :: i
    integer :: row_type, row_type_sized, real_size
    integer(kind=MPI_ADDRESS_KIND) :: lb=0, extent

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierr)

    if (rank == 0) then
        allocate( a(comsize, n) )
        do i=1,comsize
            a(i,:) = i-1
        enddo
    endif

    allocate( b(n) )

    call MPI_Type_size(MPI_REAL, real_size, ierr)
    call MPI_Type_vector(n, 1, comsize, MPI_REAL, row_type, ierr)
    extent = real_size*1
    call MPI_Type_create_resized(row_type, lb, extent, row_type_sized, ierr)
    call MPI_Type_commit(row_type_sized, ierr)

    call MPI_Scatter(a, 1, row_type_sized, b, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)

    print *, rank, b

    if (rank == 0) deallocate (a)
    deallocate(b)

    call MPI_Finalize(ierr)
end program scatterdemo
and running it with six processors gives
$ mpirun -np 6 ./scatter
0 0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00
3 3.000000 3.000000 3.000000 3.000000
1 1.000000 1.000000 1.000000 1.000000
5 5.000000 5.000000 5.000000 5.000000
2 2.000000 2.000000 2.000000 2.000000
4 4.000000 4.000000 4.000000 4.000000
When you use MPI derived types, you can view the same array either as n elements of some basic numeric type or as one or more elements of some MPI derived type. In that case, not only the counts but also the datatypes can differ, even though they describe the same buffer.
On the other hand, it is NOT affected by the number of processes in the communicator. The communicator size is always implicit, and you never pass it directly when calling a collective.
For the sender, these two would be the same, but the receivers might not know how many elements were received.
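For the point-to-point case, here is a minimal sketch (not from the original answer) of a receiver that posts a larger count than what is actually sent and then asks MPI how much arrived:

#include <mpi.h>
#include <stdio.h>

#define MAXCOUNT 100

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int data[7] = {1, 2, 3, 4, 5, 6, 7};
        MPI_Send(data, 7, MPI_INT, 1, 0, MPI_COMM_WORLD);     /* sends only 7 */
    } else if (rank == 1) {
        int buf[MAXCOUNT], received;
        MPI_Status status;
        /* the receive is posted for up to MAXCOUNT elements */
        MPI_Recv(buf, MAXCOUNT, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &received);
        printf("rank 1 actually received %d ints\n", received);  /* prints 7 */
    }

    MPI_Finalize();
    return 0;
}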
I'm trying to implement a program using MPI in which I need a block of code to be executed by a particular process, and the other processes must wait until that execution completes. I thought this could be achieved using MPI_Barrier (though I'm not clear on its actual functionality) and tried out the following program.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank = 0, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) { // Block 1
        printf("\nRank 0 Before Barrier");
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        printf("\nRank 1 After Barrier");
        printf("\nRank 1 After Barrier");
    }
    if (rank == 2) {
        printf("\nRank 2 After Barrier");
    }

    MPI_Finalize();
}
I got the following output when I executed it with np = 3:
Rank 1 After Barrier
Rank 0 Before BarrierRank 2 After BarrierRank 1 After Barrier
How could I make the other processes wait until Block 1 completes its execution on the process with rank 0?
Intended Output
Rank 0 Before Barrier
Rank 1 After Barrier //After this, it might be interchanged
Rank 1 After Barrier
Rank 2 After Barrier
Besides the issue with concurrent writing to stdout that eduffy pointed out in a comment, the barrier you've used is only part of what you need to do to ensure ordering. Once all 3 or more ranks pass the one barrier you've inserted, all possible interleavings of rank 1 and 2 are allowed:
Rank 1 After Barrier
Rank 1 After Barrier
Rank 2 After Barrier
or:
Rank 1 After Barrier
Rank 2 After Barrier
Rank 1 After Barrier
or:
Rank 2 After Barrier
Rank 1 After Barrier
Rank 1 After Barrier
You need to do some kind of synchronization between ranks 1 and 2 after the barrier you've got now to make sure that rank 1 gets its first printf done before rank 2 can continue. That could be another barrier, a barrier on a smaller communicator just containing ranks 1 and 2 if you don't want to force other ranks to wait, a blocking MPI_Ssend/MPI_Recv pair with dummy data or similar.
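As an illustration, here is a sketch of the dummy-message variant (assuming 3 ranks; this is just one of the options listed above). It enforces the order in which the printf calls execute, although the MPI launcher may still interleave the forwarded stdout, as noted above:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank = 0, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) { // Block 1
        printf("\nRank 0 Before Barrier");
        fflush(stdout);
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* nobody proceeds until Block 1 has run */

    if (rank == 1) {
        printf("\nRank 1 After Barrier");
        printf("\nRank 1 After Barrier");
        fflush(stdout);
        /* tell rank 2 that rank 1 is done printing */
        MPI_Ssend(&token, 1, MPI_INT, 2, 1, MPI_COMM_WORLD);
    }
    if (rank == 2) {
        /* blocks here until rank 1 has printed and sent the token */
        MPI_Recv(&token, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("\nRank 2 After Barrier");
        fflush(stdout);
    }

    MPI_Finalize();
}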
I have a 2D processor grid (3*3):
P00, P01, P02 are in R0; P10, P11, P12 are in R1; P20, P21, P22 are in R2.
P*0 are on the same computer, and the same holds for P*1 and P*2.
Now I would like to let R0, R1, R2 call MPI_Bcast at the same time to broadcast from P*0 to P*1 and P*2.
I find that when I use MPI_Bcast, it takes three times the time I need to broadcast in only one row.
For example, if I only call MPI_Bcast in R0, it takes 1.00 s.
But if I call three MPI_Bcast in all R[0, 1, 2], it takes 3.00 s in total.
It seems the MPI_Bcast calls cannot work in parallel.
Is there any method to make the MPI_Bcast calls broadcast at the same time?
(ONE node broadcasting over three channels at the same time.)
Thanks.
If I understand your question right, you would like to have simultaneous row-wise broadcasts:
P00 -> P01 & P02
P10 -> P11 & P12
P20 -> P21 & P22
This could be done using subcommunicators, e.g. one that only has processes from row 0 in it, another one that only has processes from row 1 in it and so on. Then you can issue simultaneous broadcasts in each subcommunicator by calling MPI_Bcast with the appropriate communicator argument.
Creating row-wise subcommunicators is extremely easy if you use a Cartesian communicator in the first place. MPI provides the MPI_CART_SUB operation for that. It works like this:
// Create a 3x3 non-periodic Cartesian communicator from MPI_COMM_WORLD
int dims[2] = { 3, 3 };
int periods[2] = { 0, 0 };
MPI_Comm comm_cart;
// We do not want MPI to reorder our processes
// That's why we set reorder = 0
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_cart);
// Split the Cartesian communicator row-wise
int remaindims[2] = { 0, 1 };
MPI_Comm comm_row;
MPI_Cart_sub(comm_cart, remaindims, &comm_row);
Now comm_row will contain a handle to a new subcommunicator that only spans the row that the calling process is in. A single call to MPI_Bcast is now enough to perform three simultaneous row-wise broadcasts:
MPI_Bcast(&data, data_count, MPI_DATATYPE, 0, comm_row);
This works because the comm_row returned by MPI_Cart_sub will be different for processes located in different rows. Here 0 is the rank of the first process in the comm_row subcommunicator, which corresponds to P*0 because of the way the topology was constructed.
If you do not use a Cartesian communicator but operate on MPI_COMM_WORLD instead, you can use MPI_COMM_SPLIT to split the world communicator into three row-wise subcommunicators. MPI_COMM_SPLIT takes a color that is used to group processes into the new subcommunicators: processes with the same color end up in the same subcommunicator. In your case the color should equal the number of the row that the calling process is in. The splitting operation also takes a key that is used to order processes in the new subcommunicator; it should equal the number of the column that the calling process is in, e.g.:
// Compute grid coordinates based on the rank
int proc_row = rank / 3;
int proc_col = rank % 3;
MPI_Comm comm_row;
MPI_Comm_split(MPI_COMM_WORLD, proc_row, proc_col, &comm_row);
Once again comm_row will contain the handle of a subcommunicator that only spans the same row as the calling process.
The MPI-3.0 draft includes a non-blocking MPI_Ibcast collective. While the non-blocking collectives aren't officially part of the standard yet, they are already available in MPICH2 and (I think) in OpenMPI.
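As a rough sketch of the call pattern (buffer, count and row_comm are placeholder names, not from the answer), each process starts the broadcast on its communicator and completes it later:

MPI_Request req;
/* starts the broadcast and returns immediately */
MPI_Ibcast(buffer, count, MPI_INT, 0, row_comm, &req);
/* ... other work or further non-blocking calls can overlap here ... */
MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completes the broadcast */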
Alternatively, you could start the blocking MPI_Bcast calls from separate threads (I'm assuming R0, R1 and R2 are different communicators).
A third possibility (which may or may not be possible) is to restructure the data so that only one broadcast is needed.