MPI_Barrier does not work properly although fflush(stdout) is used - mpi

It doesn't print "-------------Hello------------" first, nor "-----------------end------------------" last. Why?
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("-------------------------HELLO-----------------------\n");
        fflush(stdout);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Process %i says hello\n", rank);
    fflush(stdout);
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("--------------------END----------------------\n");
    MPI_Finalize();
    return 0;
}

MPI gives no guarantees on the order in which standard output from different ranks appears. In fact, the MPI standard doesn't even guarantee that all ranks are able to write to the standard output (it is possible to issue an environment inquiry with the MPI_IO key in order to find out which ranks can).
What happens is that many MPI libraries implement redirection of the standard output, either by sending it over network streams back to mpirun/mpiexec or by directly allowing each rank to write to the controlling terminal (if all ranks run on the same node). In both cases, although the order of the output lines coming from the same thread in each rank is preserved, the order in which text from different ranks or even from different threads of the same rank appears is undefined.
The only way to ensure that text output from different ranks will appear in a certain order is by explicitly sending the text to a single rank (e.g. to rank 0) in the form of MPI messages, combined with some form of synchronisation/flow control, e.g. token passing.
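For illustration, here is a minimal sketch of that approach, funnelling all text through rank 0 (the MAX_LINE constant and the message layout are arbitrary choices for this example, not anything prescribed by MPI):

/* Sketch: deterministic output order by sending every line to rank 0. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define MAX_LINE 128

int main(int argc, char **argv)
{
    int rank, size;
    char line[MAX_LINE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    snprintf(line, sizeof(line), "Process %d says hello", rank);

    if (rank == 0) {
        printf("%s\n", line);   /* rank 0 prints its own line first */
        for (int src = 1; src < size; src++) {
            /* receive the lines in rank order, so the printed order is fixed */
            MPI_Recv(line, MAX_LINE, MPI_CHAR, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("%s\n", line);
        }
    } else {
        /* all other ranks send their text to rank 0 instead of printing it */
        MPI_Send(line, (int)strlen(line) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}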

Related

How to synchronize (specific) work-items based on data, in OpenCL?

Context:
The need is to simulate a net of related discrete elements (a complex electronic circuit). Each component receives input from several other components and outputs to several others.
The intended design is to have a kernel with a configuration argument defining which component it shall represent. Each component of the circuit is represented by a work-item, and the whole circuit fits in a single work-group (or the circuit will be split appropriately so that each work-group can manage all of its components as work-items).
The problem:
Is it possible, and if so how, to have some work-items wait for other work-items' data?
A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before starting its own processing.
The net has no loops, so a single work-item never needs to run twice.
Attempts:
In the following example, each component can have at most one input (to simplify), making the circuit a tree where the input of the circuit is the root and the 3 outputs are leaves.
inputIndex models this tree by indicating, for each component, which other component provides its input. The first component takes itself as input, but the kernel handles this case (for simplification).
result stores the result of each component (voltage, intensity, etc.).
inputModified indicates whether the given component has already calculated its output.
// where the data come from (index in result)
constant int inputIndex[5] = {0, 0, 0, 2, 2};

kernel void update_component(
    local int *result,        // each work-item's result
    local int *inputModified  // whether all inputs are ready (only one input in this example)
) {
    int id = get_local_id(0);
    int size = get_local_size(0);
    int barrierCount = 0;

    // inputModified is a boolean indicating if the input is ready
    inputModified[id] = (id != 0 ? 0 : 1);

    // make sure all inputs are false by default (except the first input)
    barrier(CLK_LOCAL_MEM_FENCE);

    // wait until all inputs are ready (only one in this example)
    while (!inputModified[inputIndex[id]] && size > barrierCount++)
    {
        // if the input is not ready, wait for it
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // all inputs are ready, compute the output
    if (id != 0) result[id] = result[inputIndex[id]] + 1;
    else result[0] = 42;

    // make sure any other work-item depending on this one is unblocked
    inputModified[id] = 1;

    // even when finished, we need to hit the barrier for the other work-items
    while (size > barrierCount++)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
This example has N barriers for N components, making it worse than a sequential solution.
Note: this is only the kernel; the minimal C++ host is quite long. If needed, I can find a way to add it.
Question:
Is it possible, efficiently and from within the kernel itself, to have the different work-items wait for their data to be provided by other work-items? Or what solution would be efficient?
This problem is (for me) not trivial to explain, and I am far from an expert in OpenCL. Please be patient and feel free to ask if anything is unclear.
From the documentation of barrier:
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/barrier.html
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue
execution beyond the barrier.
But the while loop (containing a barrier) in the kernel has this condition:
inputModified[inputIndex[id]]
which can evaluate differently depending on the work-item id and therefore lead to undefined behavior. Besides, the barrier before it,
barrier(CLK_LOCAL_MEM_FENCE);
already synchronizes all work-items in the work-group, so the while loop is redundant even if it happens to work.
The last barrier loop is also redundant:
while (size > barrierCount++)
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
When the kernel ends, all work-items are synchronized anyway.
If you mean to send some message to work-items outside the work-group, then you can only use atomic variables. Even when using atomics, you should not assume any execution/issuing order between any two work-items.
Your question
how to have some work-items wait for other work-items' data? A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before starting its processing. The net has no loops, so a single work-item never needs to run twice.
can be answered with an OpenCL 2.x feature, "dynamic parallelism", which lets a work-item spawn new workgroups/kernels from inside a kernel. It is much more efficient than waiting in a spin-wait loop and far more hardware-independent than relying on the number of in-flight threads a GPU supports (when the GPU can't handle that many in-flight threads, any spin-wait will deadlock, regardless of the order of the threads).
When you use barrier, you don't need to inform other threads about "inputModified". The data in result is already visible within the workgroup.
If you can't use OpenCL 2.x, then you should process the tree using BFS (breadth-first search), one kernel launch per level (a plain C sketch of this scheme follows the list):
start 1 work-item for the top node
process it, prepare its K outputs, and push them into a queue
end kernel
start K work-items (each pops an element from the queue)
process them, prepare their N outputs, and push them into the queue
end kernel
repeat until the queue has no more elements
The number of kernel calls equals the maximum depth of the tree, not the number of nodes.
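For what it's worth, here is a plain C sketch of that level-by-level scheme; each iteration of the outer loop stands in for one kernel launch, the tree layout comes from the inputIndex array in the question, and the +1 "computation" and array sizes are just placeholders:

/* Plain C sketch of the level-by-level (BFS) scheme described above. */
#include <stdio.h>

#define N 5

int main(void)
{
    /* inputIndex[i] = which component feeds component i (node 0 is the root) */
    const int inputIndex[N] = {0, 0, 0, 2, 2};
    int result[N] = {0};

    int current[N], next[N];   /* node queues for this level and the next */
    int ncur = 0, nnext = 0;

    current[ncur++] = 0;       /* start with the root only */

    while (ncur > 0) {         /* one iteration == one kernel launch */
        nnext = 0;
        for (int i = 0; i < ncur; i++) {
            int id = current[i];
            result[id] = (id == 0) ? 42 : result[inputIndex[id]] + 1;
            /* push every child of id into the next level's queue */
            for (int c = 1; c < N; c++)
                if (inputIndex[c] == id)
                    next[nnext++] = c;
        }
        for (int i = 0; i < nnext; i++)
            current[i] = next[i];
        ncur = nnext;
    }

    for (int i = 0; i < N; i++)
        printf("result[%d] = %d\n", i, result[i]);
    return 0;
}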
If you need quicker synchronization than kernel launches, then use a single workgroup for the whole tree and use barrier instead of kernel re-launches. Or process the first few steps on the CPU, obtain multiple sub-trees, and send them to different OpenCL workgroups. Perhaps computing on the CPU until there are N sub-trees, where N = the number of compute units of the GPU, would be better for workgroup-barrier based, faster, asynchronous computation of the sub-trees.
There is also a barrierless, atomicless, single-kernel-call way to do this: start the tree from the bottom and go up.
Map all deepest-level child nodes to work-items. Move each of them up towards the top while recording their path (node id, etc.) in private memory / some other fast memory. Then have them traverse that recorded path back top-down, computing on the move, without any synchronization or even atomics. This is less work-efficient than the barrier/kernel-call versions, but the lack of barriers and the totally asynchronous paths should make it fast enough.
If the tree has depth 10, this means 10 node pointers to save, which is not much for private registers. If the tree depth is about 30-40, use local memory with fewer threads per workgroup; if it is even more, allocate global memory.
But you may need to sort the work-items by their spatiality / the tree's topology to make them work together faster with less branching.
This way looks simplest to me, so I suggest you try this barrierless version first.
If you want data visibility only per work-item instead of per group or kernel, use a fence: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/mem_fence.html

MPI rank determination

I am new to MPI and I often see the following pattern in MPI code:
if (rank == 0) {
    MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
}
else {
    MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
}
It seems that the rank determines which process is sending and which process is receiving. But how is the rank of a process determined by calling MPI_Comm_rank(MPI_COMM_WORLD, &rank);?
Is it related to the command line arguments for mpirun?
For example:
mpirun -n 2 -host localhost1,localhost2 ./a.out
(localhost1 is rank 0 and localhost2 is rank 1?)
How is the program going to determine who has rank 0 and who has rank 1?
Is there a way for me to specify something such that, say, localhost1 is sending and localhost2 is receiving?
Usually, if you're trying to think about communication in your MPI program based on physical processors/machines, you're not going about it in the right way. Most of the time, it doesn't matter which actual machine each rank is mapped to. All that matters is that you call mpiexec or mpirun (they're usually the same thing), something inside your MPI implementation starts up n processes which could be located locally, remotely, or some combination of the two, and assigns them ranks. Theoretically those ranks could be assigned arbitrarily, though it's usually in some predictable way (often something like round-robin over the entire group of hosts that are available). Inside your program, it usually makes very little difference whether you're running rank 0 on host0 or host1. The important thing is that you are doing specific work on rank 0, that requires communication from rank 1.
That being said, there are more rare times where it might be important which rank is mapped to which processor. Examples might be:
If you have GPUs on some nodes and not others and you need certain ranks to be able to control a GPU.
You need certain processes to be mapped to the same physical node to optimize communication patterns for things like shared memory.
You have data staged on certain hosts that needs to map to specific ranks.
These are all advanced examples. Usually if you're in one of these situations, you've been using MPI long enough to know what you need to do here, so I'm betting that you're probably not in this scenario.
Just remember, it doesn't really matter where my ranks are. It just matters that I have the right number of them.
Disclaimer: All of that being said, it does matter that you launch the correct number of processes. What I mean by that is, if you have 2 hosts that each have a single quad-core processor, it doesn't make sense to start a job with 16 ranks. You'll end up spending all of your computational time context switching your processes in and out. Try not to have more ranks than you have compute cores.
When you call mpirun, there is a process manager which determines the node/rank assignment of your processes. I suggest you have a look at Controlling Process Placement with the Intel MPI library, and for Open MPI
check the -npernode and -pernode options.
Use this Hello world test to check if this is what you want.
You can also simply change the condition (rank == 1) if you want to swap which process does which work.
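For reference, a minimal hello-world program of the kind referred to above might look like this (a sketch, not the exact test linked in the answer):

/* Minimal MPI hello world: each rank reports its rank, the total number
   of ranks, and the host it runs on. Illustrative sketch only. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(name, &name_len);

    printf("Rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

Running it with, for example, mpirun -n 2 -host localhost1,localhost2 ./a.out shows which host each rank was actually placed on.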

Is Open MPI's reduce synchronized?

I am looking at the code here, which I am working through for practice.
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/main.html
I am confused about the part shown here.
MPI::COMM_WORLD.Reduce(&mypi, &pi, 1, MPI::DOUBLE, MPI::SUM, 0);
if (rank == 0)
    cout << "pi is approximately " << pi
         << ", Error is " << fabs(pi - PI25DT)
         << endl;
My question is: does the MPI reduce function know when all the other processes (in this case the programs with ranks 1-3) have finished, so that its result is complete?
All collective communication calls (Reduce, Gather, Scatter, etc) are blocking.
@g.inozemtsev is correct. The MPI collective calls -- including those in Open MPI -- are "blocking" in the MPI sense of the word, meaning that you can use the buffer when the call returns. In an operation like MPI_REDUCE, it means that the root process will have the answer in its buffer when it returns. Further, it means that non-root processes in an MPI_REDUCE can safely overwrite their buffer when MPI_REDUCE returns (which usually means that their part in the reduction is complete).
However, note that as mentioned above, the return from a collective operation such as an MPI_REDUCE in one process has no bearing on the return of the same collective operation in a peer process. The only exception to this rule is MPI_BARRIER, because barrier is defined as an explicit synchronization, whereas all the other MPI-2.2 collective operations do not necessarily need to explicitly synchronize.
As a concrete example, say that all non-root processes call MPI_REDUCE at time X. The root finally calls MPI_REDUCE at time X+N (for this example, assume N is large). Depending on the implementation, the non-root processes may return much earlier than X+N or they may block until X+N(+M). The MPI standard is intentionally vague on this point to allow MPI implementations to do what they want / need (which may also be dictated by resource consumption/availability).
Hence, @g.inozemtsev's point of "You cannot rely on synchronization" (except for MPI_BARRIER) is correct.
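For reference, a hedged sketch of the same pattern in the plain C API (the C++ bindings used above were deprecated in MPI-2.2 and removed in MPI-3.0): after MPI_Reduce returns on rank 0, the root can safely read the reduced value, while nothing is implied about when the other ranks returned.

/* Sketch: each rank contributes a partial value; only the root (rank 0)
   has the completed sum in `total` once its MPI_Reduce returns. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double mypart, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    mypart = 1.0 / (rank + 1);   /* arbitrary per-rank contribution */

    /* Blocking in the MPI sense: when it returns, `mypart` can be reused
       on every rank, and `total` is valid on the root only. */
    MPI_Reduce(&mypart, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of contributions = %f\n", total);

    MPI_Finalize();
    return 0;
}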

Variable memory allocation in MPI Code

In a cluster running MPI code, is a copy of all the declared variables sent to all nodes, so that all nodes can access them locally and do not have to perform remote memory accesses?
No, MPI itself can't do this for you in a single call.
Every MPI process has its own memory state, and any value may differ from one MPI process to another.
The only way of sending/receiving data is to use explicit MPI calls, like Send or Recv. You can pack most of your data into some memory area and send that area to each MPI process, but it will not contain 'every declared variable', only the variables placed into it manually.
Update:
Each node runs a copy of the program. Each copy initializes its variables as it wants (the initialization can be the same everywhere, or individual, based on the MPI process number, called the rank, obtained from the MPI_Comm_rank function). So every variable exists in N copies: one set per MPI process. Every process sees variables, but only the set it owns. Variable values are not synchronized automatically.
So it is the programmer's task to synchronize the values of variables between the nodes (MPI processes).
E.g. here is a small MPI program to compute Pi:
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/cpi_c.htm
It sends the value of the 'n' variable from the first process to all others (MPI_Bcast); and every process sends its own 'mypi' after the calculation into the 'pi' variable of the first process (the individual values are added together via the MPI_Reduce function).
Only the first process is able to read N from the user (via scanf), and this code is executed conditionally based on the rank of the process; the other processes must get N from the first one, because they didn't read it from the user directly.
Update 2 (sorry for the late answer):
This is the syntax of MPI_Bcast. The programmer gives the address of a variable to this function. Each MPI process passes the address of its own 'n' variable (the addresses can differ). MPI_Bcast checks the rank of the current process and compares it with the other argument, the rank of the "broadcaster".
If the current process is the broadcaster, MPI_Bcast reads the value placed in memory at the given address (it reads the value of the 'n' variable on the "broadcaster"); then the value is sent over the network.
Otherwise, if the current process is not the broadcaster, it is a "receiver". MPI_Bcast at a receiver gets the value from the "broadcaster" (using MPI library internals, over the network) and stores it in the memory of the current process at the given address.
So the address is given to this function because on some nodes the function will write to the variable, and only the value is sent over the network.
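A minimal sketch of this pattern follows; the prototype is int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm). The variable name n mirrors the cpi example, everything else is purely illustrative:

/* Sketch: rank 0 reads n and broadcasts it; every rank passes the address
   of its own copy of n, and after MPI_Bcast all copies hold the same value. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        printf("Enter n: ");
        fflush(stdout);
        if (scanf("%d", &n) != 1)
            n = 0;
    }

    /* On rank 0 (the broadcaster) MPI_Bcast reads n; on every other rank
       it writes the received value into that rank's own n. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d now has n = %d\n", rank, n);

    MPI_Finalize();
    return 0;
}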

Add received data to an existing receive buffer in MPI_Sendrecv

I am trying to send data (forces) between 2 processes using MPI_Sendrecv. Usually the received data overwrites the receive buffer; I do not want to overwrite the data in the receive buffer, but to add the received data to it instead.
I could do the following: store the data from the previous time step in a different array and then add it after receiving. But I have a huge number of nodes and I do not want to allocate memory for that storage at every time step (or keep overwriting the same array).
My question is: is there a way to add the received data directly into the buffer, i.e. to accumulate it in the receive memory using MPI?
Any help in this direction would be much appreciated.
I am sure collective communication calls (MPI_Reduce) cannot be made to work here. Are there any other commands that can do this?
In short: no, but you should be able to do this.
In long: Your suggestion makes a great deal of sense and the MPI Forum is currently considering new features that would enable essentially what you want.
It is incorrect to suggest that the data must be received before it can be accumulated. MPI_Accumulate does a remote accumulation in a one-sided fashion. What you want is something like an MPI_Sendrecv_accumulate (which does not exist in the standard today) rather than MPI_Sendrecv_replace. This makes perfect sense, and an implementation can internally do much better than you can, because it can buffer on a per-packet basis, for example.
For suszterpatt: MPI internally buffers in the eager protocol, and in the rendezvous protocol it can set up a pipeline to minimize buffering.
The implementation of such an MPI_Recv_accumulate (considering only the receive side for simplicity, since the MPI_Send part needs no changes) would look like this:
int MPI_Recv_accumulate(void *buf, int count, MPI_Datatype datatype, MPI_Op op,
                        int source, int tag, MPI_Comm comm, MPI_Status *status)
{
    if (eager)
        MPI_Reduce_local(_eager_buffer, buf, count, datatype, op);
    else /* rendezvous */
    {
        /* malloc _buffer */
        while (mycount < count)
        {
            /* receive part of the incoming data into _buffer */
            /* reduce_local from _buffer into buf */
        }
    }
}
In short: no.
In long: your suggestion doesn't really make sense. The machine can't perform any operations on your received value without first putting it into local memory somewhere. You'll need a buffer to receive the newest value, and a separate sum that you will increment by the buffer's content after every receive.
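To illustrate that buffer-and-add approach (using MPI_Reduce_local, mentioned in the sketch above, for the in-place addition), here is a hedged example; the array size NF, the use of doubles for the forces, and the two-process partner logic are assumptions, not details from the question:

/* Sketch: exchange force arrays with a partner rank, then add the received
   values into the running sum instead of replacing it. */
#include <stdio.h>
#include <mpi.h>

#define NF 8   /* number of force values per process (assumed) */

int main(int argc, char **argv)
{
    int rank;
    double my_forces[NF], recv_buf[NF], force_sum[NF];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NF; i++) {
        my_forces[i] = rank + 1.0;   /* arbitrary local contribution */
        force_sum[i] = my_forces[i]; /* running sum starts with the local values */
    }

    int partner = (rank == 0) ? 1 : 0;   /* assumes exactly 2 processes */

    /* Receive into a temporary buffer... */
    MPI_Sendrecv(my_forces, NF, MPI_DOUBLE, partner, 0,
                 recv_buf,  NF, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ...then accumulate it into the existing sum in place. */
    MPI_Reduce_local(recv_buf, force_sum, NF, MPI_DOUBLE, MPI_SUM);

    if (rank == 0)
        printf("force_sum[0] = %f\n", force_sum[0]);

    MPI_Finalize();
    return 0;
}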
