MPI_Barrier in C

I'm trying to implement a program using MPI, in which I need one block of code to be executed on a particular processor, and the other processors must wait until that execution completes. I thought this could be achieved using MPI_Barrier (though I'm not clear on its actual functionality) and tried the following program.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank = 0, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) { // Block 1
        printf("\nRank 0 Before Barrier");
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 1) {
        printf("\nRank 1 After Barrier");
        printf("\nRank 1 After Barrier");
    }
    if (rank == 2) {
        printf("\nRank 2 After Barrier");
    }
    MPI_Finalize();
}
I got the following output when I executed it with -np 3:
Rank 1 After Barrier
Rank 0 Before BarrierRank 2 After BarrierRank 1 After Barrier
How could I possibly make the other processors wait until Block 1 completes its execution on the processor with rank 0?
Intended Output
Rank 0 Before Barrier
Rank 1 After Barrier //After this, it might be interchanged
Rank 1 After Barrier
Rank 2 After Barrier

Besides the issue with concurrent writing to stdout that eduffy pointed out in a comment, the barrier you've used is only part of what you need to ensure ordering. Once all 3 ranks pass the one barrier you've inserted, every interleaving of ranks 1 and 2 is allowed:
Rank 1 After Barrier
Rank 1 After Barrier
Rank 2 After Barrier
or:
Rank 1 After Barrier
Rank 2 After Barrier
Rank 1 After Barrier
or:
Rank 2 After Barrier
Rank 1 After Barrier
Rank 1 After Barrier
You need some kind of synchronization between ranks 1 and 2 after the barrier you've got now, to make sure that rank 1 gets its first printf done before rank 2 can continue. That could be another barrier; a barrier on a smaller communicator containing just ranks 1 and 2, if you don't want to force other ranks to wait; a blocking MPI_Ssend/MPI_Recv pair with dummy data; or similar.
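A language-agnostic way to convince yourself of this ordering logic is to model the three ranks as threads: a threading.Barrier stands in for MPI_Barrier, and an Event plays the role of the blocking MPI_Ssend/MPI_Recv handoff between ranks 1 and 2. This is only a sketch of the synchronization pattern, not MPI code:

```python
import threading

log = []                       # stands in for stdout
barrier = threading.Barrier(3) # stands in for MPI_Barrier(MPI_COMM_WORLD)
handoff = threading.Event()    # stands in for the Ssend/Recv pair

def rank_body(rank):
    if rank == 0:
        log.append("Rank 0 Before Barrier")
    barrier.wait()             # all ranks synchronize here
    if rank == 1:
        log.append("Rank 1 After Barrier")
        handoff.set()          # analogous to MPI_Ssend of dummy data to rank 2
        log.append("Rank 1 After Barrier")
    if rank == 2:
        handoff.wait()         # analogous to MPI_Recv from rank 1
        log.append("Rank 2 After Barrier")

threads = [threading.Thread(target=rank_body, args=(r,)) for r in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the handoff is added, rank 0's line is guaranteed to come first and rank 1's first line second; the last two lines may still appear in either order, exactly as the intended output in the question allows.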

Related

Why is this simple OpenCL code not vectorised?

The code below does not vectorise. With 'istart = n * 1;' instead of the 'istart = n * niters;' it does. With 'istart = n * 2;' it again does not.
// Kernel for ERIAS_critical_code.py
__kernel void pi(
    int niters,
    __global float* A_d,
    __global float* S_d,
    __global float* B_d)
{
    int num_wrk_items = get_local_size(0);
    int local_id = get_local_id(0);  // work item id
    int group_id = get_group_id(0);  // work group id
    float accum = 0.0f;
    int i, istart, iend, n;

    n = group_id * num_wrk_items + local_id;
    istart = n * niters;
    iend = istart + niters;
    for (i = istart; i < iend; i++) {
        accum += A_d[i] * S_d[i];
    }
    B_d[n] = accum;
    barrier(CLK_LOCAL_MEM_FENCE); // test: result is correct without this statement
}
If the code cannot be vectorised I get:
Kernel was not vectorized
If it can be:
Kernel was successfully vectorized (8)
Any idea why it is not vectorised?
When niters is 1, the for loop cycles only once. This means every workitem computes its own element, in a coalesced access to memory.
Coalesced access is one of the conditions for mapping N neighboring threads/workitems onto SIMD hardware, such as a unit of width 8.
When niters is greater than 1, neighboring workitems access memory with a stride of niters between them. This makes the SIMD hardware useless: only 1 memory cell per workitem is used at a time.
When niters is 2, at minimum a 2-fold memory bank collision happens. With a very big niters value, memory bank collisions happen more often, making it very slow. Whether the kernel is vectorized or not then hardly matters, as its performance will be limited by the serialized memory read/write latencies.
That for loop is doing a reduction serially. You should make it parallel. There are many examples out there; pick one and apply it to your algorithm. For example, have each workitem compute a sum of the elements at id and id + niters/2, then reduce those on id and id + niters/4, and continue like this until only 1 workitem does the final summation of the elements at id and id + 1.
If the reduction is a global one, you can do a local reduction per workgroup and then apply the partial results the same way in another kernel to do the global reduction.
Since you are only making partial sums per workitem, you could do a "strided sum per workitem", such that each workitem uses the same for loop but leaps by M elements, where M is something that won't disturb the SIMD mapping of the kernel's workitems. Maybe M could be 1/100 of the global number of elements (N), and the for loop would cycle 100 times (or N/M times). Something like this:
             time 1     time 2     time 3     time 4
workitem 1      0         15         30         45
workitem 2      1         16         31         46
workitem 3      2         17         32         47
...
workitem 15    14         29         44         59
            coalesced  coalesced  coalesced  coalesced
to complete 15 partial sums over 60 elements using 15 workitems. If the SIMD length can fit these 15 workitems, that's good.
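The strided pattern above can be sketched on the host side in plain Python (no OpenCL here; data stands in for the A_d[i] * S_d[i] products, and the inner loop over workitems models what the SIMD lanes would do in lockstep):

```python
# 15 "workitems" sum 60 elements, each stepping by M = 15 so that at every
# step the group touches 15 consecutive elements (a coalesced access).
N, WORKITEMS = 60, 15
M = WORKITEMS                      # stride between a workitem's accesses
data = list(range(N))              # stand-in for the A_d[i] * S_d[i] products

partial = [0] * WORKITEMS
for step in range(N // M):         # "time 1", "time 2", "time 3", "time 4"
    for wi in range(WORKITEMS):    # these run in lockstep on real SIMD
        partial[wi] += data[wi + step * M]
```

Workitem 0 ends up summing elements 0, 15, 30, 45 and workitem 14 sums 14, 29, 44, 59, matching the table, while the 15 partial sums together still cover all 60 elements.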
Lastly, the barrier operation is not needed, since the end of a kernel is an implicit global synchronization point for all workitems in it. A barrier is only needed when you need to use the written results from another workitem within the same kernel.

sending multiple messages with different length to the same rank

Let's say I have 3 ranks.
Rank 0 receives 1 MPI_INT from rank 1 and receives 10 MPI_INT from rank 2:
MPI_Recv(buf1, 1, MPI_INT,
         1, 0, MPI_COMM_WORLD, &status);
MPI_Recv(buf2, 10, MPI_INT,
         2, 0, MPI_COMM_WORLD, &status);
Rank 1 and rank 2 send 1 and 10 MPI_INT to rank 0, respectively. MPI_Recv is a blocking call. Let's say the 10 MPI_INT from rank 2 arrive before the 1 MPI_INT from rank 1. At this point, rank 0 blocks there waiting for data from rank 1.
In this case, could the first MPI_Recv return? Data from rank 2 arrives first, but that data couldn't fit into buf1, which can only hold one integer.
And then the message from rank 1 arrives. Is MPI able to pick this message and let the first MPI_Recv return?
Since you specify a source when calling MPI_Recv(), you do not have to worry about the order of the messages. The first MPI_Recv() will return when 1 MPI_INT is received from rank 1, and the second MPI_Recv() will return when 10 MPI_INT are received from rank 2.
If you had MPI_Recv(..., source=MPI_ANY_SOURCE, ...) it would have been a different story.
Feel free to write a simple program with some sleep() here and there if you still need to convince yourself.
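To see why per-source matching makes the arrival order irrelevant, here is a deliberately simplified toy model of the matching logic in plain Python. It is not real MPI: actual matching also involves tags and communicators, which are ignored here.

```python
from collections import defaultdict, deque

inbox = defaultdict(deque)        # one queue per source rank

def deliver(source, payload):     # network delivery, in arrival order
    inbox[source].append(payload)

def recv(source, count):          # toy stand-in for MPI_Recv(buf, count, ..., source, ...)
    msg = inbox[source].popleft() # only messages from the named source match
    assert len(msg) <= count, "message longer than receive buffer"
    return msg

# The 10 ints from rank 2 arrive before the 1 int from rank 1 ...
deliver(2, list(range(10)))
deliver(1, [42])

# ... yet the first receive still gets rank 1's message, and the second
# gets rank 2's: the early source-2 message simply waits in its queue.
first = recv(source=1, count=1)
second = recv(source=2, count=10)
```

With MPI_ANY_SOURCE instead of a named source, the first receive would match whichever message arrived first, which is exactly the different story the answer mentions.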

How to start distinct MPI jobs from a single allocation

Let's say I have started an MPI job with 256 cores on 16 nodes.
I have an MPI program, but unfortunately it is not parallel over one parameter. Fortunately, I can easily create my own MPI program which could handle the parallelization of that parameter, if only I can obtain the output files.
So, how can I start an MPI job (from within an MPI job) which uses a particular subset of these cores, namely only a particular node? Basically I want to run 16 distinct MPI calculations, each with 16 cores, from within a single 256-core MPI job. These calculations take about 10 minutes with 16 cores, and there are about 200 iterations in the outer loop. With 256 cores, that is a reasonable 32 hours. It is not feasible either to resubmit 200 times or to run these 16 calculations sequentially.
To be even more precise, here is some python-pseudo-code for what I want to do:
from ase.parallel import world, rank
from os import system, chdir

while 1:
    node = rank // 16
    subrank = rank % 16
    chdir(mydir + "Calculation_%d" % node)
    # This will not work; one needs to specify somehow that only ranks
    # from node*16 to node*16+15 will be used
    system("mpirun -n 16 nwchem input.nw > nwchem.out")
    analyse_output(mydir + "Calculation_%d/nwchem.out" % node)
    rewrite_input_files()
Basically, with 16 cores and 4 jobs:
rank 0: start nwchem process in /calculation0/ as rank 0/4.
rank 1: start nwchem in /calculation0/ as rank 1/4.
rank 2: start nwchem in /calculation0/ as rank 2/4.
rank 3: start nwchem in /calculation0/ as rank 3/4.
rank 4: start nwchem in /calculation1/ as rank 0/4.
rank 5: start nwchem in /calculation1/ as rank 1/4.
rank 6: start nwchem in /calculation1/ as rank 2/4.
rank 7: start nwchem in /calculation1/ as rank 3/4.
rank 8: start nwchem in /calculation2/ as rank 0/4.
rank 9: start nwchem in /calculation2/ as rank 1/4.
rank 10: start nwchem in /calculation2/ as rank 2/4.
rank 11: start nwchem in /calculation2/ as rank 3/4.
rank 12: start nwchem in /calculation3/ as rank 0/4.
rank 13: start nwchem in /calculation3/ as rank 1/4.
rank 14: start nwchem in /calculation3/ as rank 2/4.
rank 15: start nwchem in /calculation3/ as rank 3/4.
Gather all the results.
Optimize all geometries (this requires knowledge of forces between the calculations).
Repeat until convergence (about 200 times).
Background: in case you are interested, I will elaborate on details here. But the main question is still: "How do I instantiate N MPI calculations from a single MPI calculation of M cores, each of which gets M/N cores?"
NWChem does not have an image-parallel nudged elastic band calculator. Here is an example of this process, with a different code: GPAW.
https://wiki.fysik.dtu.dk/gpaw/tutorials/neb/neb.html
Here it is smooth, because it is so easy to create a sub-communicator with GPAW and its MPI interface. However, I only have the nwchem runtime MPI, and I wish to do the same thing: create many calculators (a band or chain of geometries which are all linked with 'springs') and optimize that chain.
I suppose you are trying to use Dynamic Process Management (DPM) in MPI. You can spawn new processes for the smaller jobs and then, after the calculations, connect to the spawned processes.
Detailed Explanation on DPM
Example program using DPM
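Besides DPM, a common alternative is to split MPI_COMM_WORLD into sub-communicators with MPI_Comm_split (comm.Split in mpi4py), giving each 16-rank group its own communicator to run a calculation on. The rank-to-subjob mapping that split relies on is just the integer arithmetic from the pseudo-code in the question, sketched here in plain Python with no MPI calls:

```python
# Model the color/key assignment an MPI_Comm_split-based approach would
# use: ranks sharing a color end up in the same sub-communicator, and the
# key (subrank) orders them within it.
TOTAL, PER_JOB = 256, 16

groups = {}
for rank in range(TOTAL):
    color = rank // PER_JOB    # which 16-core calculation this rank joins
    subrank = rank % PER_JOB   # its rank inside that sub-communicator
    groups.setdefault(color, []).append(subrank)
```

This yields 16 groups of 16 ranks, each numbered 0 through 15 internally; whether the nwchem runtime can be handed such a sub-communicator is a separate question that depends on its MPI interface.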

Do global_work_size and local_work_size have any effect on application logic?

I am trying to understand how all of the different parameters for dimensions fit together in OpenCL. If my question isn't clear, that's partly because a well-formed question requires bits of the answer, which I don't have.
How do work_dim, global_work_size, and local_work_size work together to create the execution space that you use in a kernel? For example, if I make work_dim 2 then I can
get_global_id(0);
get_global_id(1);
I can divide those two dimensions up into n Work Groups using global_work_size, right? So if I make the global_work_size like so
size_t global_work_size[] = { 4 };
Then each dimension would have 4 work groups for a total of 8? But, as a beginner, I am only using global_id for my indices, so only the global ids matter anyway. As you can tell, I am pretty confused about all of this, so any help you can offer would... help.
[image I made to try to understand this question]
[image describing work groups that I found on Google]
Since you stated yourself that you are a bit confused about the concepts involved in the execution space, I'm gonna try to summarize them before answering your question and give some examples.
The threads/workitems are organized in an NDRange, which can be viewed as a grid of 1, 2, or 3 dimensions.
The NDRange is mainly used to map each thread to the piece of data it will have to manipulate. Therefore each thread should be uniquely identified, and a thread should know which one it is and where it stands in the NDRange. That's where the Work-Item Built-in Functions come in. These functions can be called by all threads to give them info about themselves and the NDRange where they stand.
The dimensions:
As already stated, an NDRange can have up to 3 dimensions. So if you set the dimensions this way:
size_t global_work_size[2] = { 4, 4 };
It doesn't mean that each dimension would have 4 work groups for a total of 8, but that you'll have 4 * 4, i.e. 16 threads in your NDRange. These threads will be arranged in a "square" with sides of 4 units. The workitems can know how many dimensions the NDRange is made of using the uint get_work_dim() function.
The global size:
Threads can also query how big the NDRange is for a specific dimension with size_t get_global_size(uint D). Therefore they can know how big the "line/square/rectangle/cube" NDRange is.
The global unique identifiers:
Thanks to that organization, each thread can be uniquely identified with indexes corresponding to the specific dimensions. Hence the thread (2, 1) refers to a thread that is in the third column and the second row of a 2D range. The function size_t get_global_id(uint D) is used in the kernel to query the id of a thread.
The workgroup (or local) size:
The NDRange can be split into smaller groups called workgroups. This is the local_work_size you were referring to, which also (and logically) has up to 3 dimensions. Note that for OpenCL versions below 2.0, the NDRange size in a given dimension must be a multiple of the workgroup size in that dimension. So to keep your example: since in dimension 0 we have 4 threads, the workgroup size in dimension 0 can be 1, 2, or 4, but not 3. Similarly to the global size, threads can query the local size with size_t get_local_size(uint D).
The local unique identifiers:
Sometimes it is important that a thread can be uniquely identified within a workgroup. Hence the function size_t get_local_id(uint D). Note the "within" in the previous sentence: a thread with a local id (1, 0) will be the only one to have this id in its workgroup (in 2D), but there will be as many threads with local id (1, 0) as there are workgroups in the NDRange.
The number of groups:
Speaking of groups, sometimes a thread might need to know how many groups there are. That's why the function size_t get_num_groups(uint D) exists. Note that, again, you have to pass as a parameter the dimension you are interested in.
Each group has also an id:
...that you can query within a kernel with the function size_t get_group_id(uint D). Note that the format of the group ids will be similar to that of the threads: tuples of up to 3 elements.
Summary:
To wrap things up a bit, if you have a 2D NDRange of a global work size of (4, 6) and a local work size of (2, 2) it means that:
the global size in the dimension 0 will be 4
the global size in the dimension 1 will be 6
the local size (or workgroup size) in the dimension 0 will be 2
the local size (or workgroup size) in the dimension 1 will be 2
the thread global ids in the dimension 0 will range from 0 to 3
the thread global ids in the dimension 1 will range from 0 to 5
the thread local ids in the dimension 0 will range from 0 to 1
the thread local ids in the dimension 1 will range from 0 to 1
The total number of threads in the NDRange will be 4 * 6 = 24
The total number of threads in a workgroup will be 2 * 2 = 4
The total number of workgroups will be (4/2) * (6/2) = 6
the group ids in the dimension 0 will range from 0 to 1
the group ids in the dimension 1 will range from 0 to 2
there will be only one thread with the global id (0, 0), but there will be 6 threads with the local id (0, 0), because there are 6 groups.
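The summary above can be checked mechanically. This plain-Python sketch enumerates the threads of a (4, 6) NDRange with a (2, 2) workgroup size and recomputes the ids the built-in functions would return:

```python
# Enumerate every thread of a 2D NDRange and mirror the OpenCL
# built-ins: get_global_id, get_local_id, get_group_id, get_num_groups.
global_size = (4, 6)   # global work size per dimension
local_size = (2, 2)    # workgroup size per dimension

threads = []
for gx in range(global_size[0]):
    for gy in range(global_size[1]):
        threads.append({
            "global_id": (gx, gy),
            "local_id": (gx % local_size[0], gy % local_size[1]),
            "group_id": (gx // local_size[0], gy // local_size[1]),
        })

num_groups = (global_size[0] // local_size[0],
              global_size[1] // local_size[1])
```

Running this confirms the bullet list: 24 threads in total, (2, 3) groups for 6 workgroups, exactly one thread with global id (0, 0), and six threads with local id (0, 0).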
Example:
Here is a dummy example to use all these concepts together (note that performance would be terrible, it's just a stupid example).
Let's say you have a 2D array of 6 rows and 4 columns of int. You want to group these elements in squares of 2 by 2 elements and sum them up in such a way that, for instance, the elements (0, 0), (0, 1), (1, 0), (1, 1) end up in one group (hope that's clear enough). Because you'll have 6 "squares", you'll have 6 results for the sums, so you'll need an array of 6 elements to store these results.
To solve this, you use the 2D NDRange detailed just above. Each thread will fetch one element from global memory and store it in local memory. Then, after a synchronization, only one thread per workgroup, say the local (0, 0) thread of each group, will sum the elements up (in local memory) and then store the result at a specific place in a 6-element array (in global memory).
//in is a 24 int array, result is a 6 int array, temp is a 4 int array
kernel void foo(global int *in, global int *result, local int *temp){
    //use vectors for conciseness
    int2 globalId = (int2)(get_global_id(0), get_global_id(1));
    int2 localId = (int2)(get_local_id(0), get_local_id(1));
    int2 groupId = (int2)(get_group_id(0), get_group_id(1));
    int2 globalSize = (int2)(get_global_size(0), get_global_size(1));
    int2 localSize = (int2)(get_local_size(0), get_local_size(1));
    int2 numberOfGrp = (int2)(get_num_groups(0), get_num_groups(1));

    //Read from global and store to local
    temp[localId.x + localId.y * localSize.x] = in[globalId.x + globalId.y * globalSize.x];

    //Sync
    barrier(CLK_LOCAL_MEM_FENCE);

    //Only the threads with local id (0, 0) sum elements up
    if(localId.x == 0 && localId.y == 0){
        int sum = 0;
        for(int i = 0; i < localSize.x * localSize.y; i++){
            sum += temp[i];
        }
        //store result in global
        result[groupId.x + numberOfGrp.x * groupId.y] = sum;
    }
}
And finally to answer to your question: Do global_work_size and local_work_size have any effect on application logic?
Usually yes, because it's part of the way you design your algorithm. Note that the size of the workgroup is not picked randomly but matches my need here (2 by 2 squares).
Note also that if you decide to use a 1D NDRange with a size of 24 and a local size of 4, it'll screw things up too, because the kernel was designed to use 2 dimensions.

Reusable Barrier solution has a deadlock?

I have been reading "The Little Book of Semaphores", and on page 41 there is a solution to the Reusable Barrier problem. What I don't understand is why it won't generate a deadlock situation.
 1  # rendezvous
 2
 3  mutex.wait()
 4      count += 1
 5      if count == n:
 6          turnstile2.wait()    # lock the second
 7          turnstile.signal()   # unlock the first
 8  mutex.signal()
 9
10  turnstile.wait()             # first turnstile
11  turnstile.signal()
12
13  # critical point
14
15  mutex.wait()
16      count -= 1
17      if count == 0:
18          turnstile.wait()     # lock the first
19          turnstile2.signal()  # unlock the second
20  mutex.signal()
21
22  turnstile2.wait()            # second turnstile
23  turnstile2.signal()
In this solution, between lines 15 and 20, isn't it a bad habit to call wait() on a semaphore (on line 18) while holding a mutex, which can cause a deadlock? Please explain. Thank you.
mutex protects the count variable. The first mutex-protected block is concerned with incrementing the counter to account for each thread, and the last thread to enter (if count == n) locks the second turnstile in preparation for leaving (see below) and releases the (n-1) threads waiting on line 10. Then each signals to the next.
The second mutex-protected block works similarly to the first, but decrements count (the same mutex protects it). The last thread to enter the mutex block locks turnstile to prepare for the next batch entering (see above) and releases the (n-1) threads waiting on line 22. Then each thread signals to the next.
Thus turnstile coordinates the entries to the critical point, while turnstile2 coordinates the exit from it.
There can be no deadlock: by the time the (last) thread gets to line 18, turnstile is guaranteed not to be held by any other thread (they are all waiting on line 22). Similarly with turnstile2.
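To convince yourself, here is a directly runnable Python translation of the pseudocode using threading.Semaphore; the comments refer to the pseudocode's line numbers. If the wait() on line 18 could deadlock, the test run below would hang instead of completing.

```python
import threading

class ReusableBarrier:
    def __init__(self, n):
        self.n, self.count = n, 0
        self.mutex = threading.Semaphore(1)
        self.turnstile = threading.Semaphore(0)    # first turnstile, starts locked
        self.turnstile2 = threading.Semaphore(1)   # second turnstile, starts open

    def wait(self):
        with self.mutex:                           # lines 3-8
            self.count += 1
            if self.count == self.n:
                self.turnstile2.acquire()          # lock the second
                self.turnstile.release()           # unlock the first
        self.turnstile.acquire()                   # lines 10-11
        self.turnstile.release()
        # line 13: critical point
        with self.mutex:                           # lines 15-20
            self.count -= 1
            if self.count == 0:
                self.turnstile.acquire()           # line 18: lock the first
                self.turnstile2.release()          # unlock the second
        self.turnstile2.acquire()                  # lines 22-23
        self.turnstile2.release()

# Run 3 threads through the barrier 5 times; a deadlock would hang here.
N, ROUNDS = 3, 5
barrier, log = ReusableBarrier(N), []

def worker():
    for i in range(ROUNDS):
        barrier.wait()
        log.append(i)

threads = [threading.Thread(target=worker) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The log comes out grouped by round (all threads finish round i before any finishes round i+1), which shows the barrier really is reusable: no thread can lap the others between cycles.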