Rank in MPI_Bcast

MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
This function does not require the rank parameter. How does it know the rank of each process?
Should we call MPI_Comm_rank() before the broadcast? Does some data structure (like the communicator) store the rank of each process?

Perhaps you didn't think it possible, but functions inside the MPI library can internally make the same MPI calls that you use to obtain a process' rank or the size of a communicator. MPI_Bcast() doesn't need the rank of the calling process because it simply calls the internal implementation of MPI_Comm_rank() to obtain it. Here is a small excerpt from one of the MPI_Bcast() implementations in Open MPI (specifically, from the split binary tree implementation in the tuned module of the coll framework, which provides the algorithms implementing the collective operations):
int
ompi_coll_tuned_bcast_intra_split_bintree ( void* buffer,
                                            int count,
                                            struct ompi_datatype_t* datatype,
                                            int root,
                                            struct ompi_communicator_t* comm,
                                            mca_coll_base_module_t *module,
                                            uint32_t segsize )
{
    ...
    int rank, size;
    ...
    size = ompi_comm_size(comm);
    rank = ompi_comm_rank(comm);
    ...
}
As you can see, it calls the internal implementation of MPI_Comm_size() and MPI_Comm_rank(). These are very cheap calls in Open MPI. The rank of the process is stored in the process group that is associated with the communicator and is copied to a field in the communicator structure (to save a few CPU cycles dereferencing the pointer to the group) during the creation of a communicator (for more information refer to openmpi-source/ompi/communicator/communicator.h and openmpi-source/ompi/group/group.h).
As a matter of fact, no MPI communication primitive ever explicitly takes the rank of the calling process - it is always resolved internally. You only specify where to send the data (e.g. in MPI_SEND), where to receive the data from (e.g. in MPI_RECV), or the data root in those collective operations that have one.

Consider three possible implementations of MPI_Bcast():
The root sends to root+1, then to root+2, then to root+3, etc. This takes a number of steps linear in the number of processes.
Starting with the root, each process that has a copy of the data at iteration N forwards it to the process whose rank is its own rank XOR 2^N. This takes a logarithmic number of steps.
The root uses the network's router to multicast to every process. This takes a constant number of steps.
In each of these scenarios, the MPI_Bcast() function knows which process will get the next message. In the first and third case, any non-root process will simply receive the data; in the second, each process continues the forwarding once it has received the data. In all implementations, though, the order of sends and receives is deterministic on the basis of which process is the root. (That's why all processes must invoke MPI_Bcast(), whether root or not.)
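To make the second, logarithmic scheme concrete, here is a hedged sketch (not Open MPI's actual code) of a recursive-doubling broadcast built from point-to-point calls, assuming the root is rank 0:

/* Illustration only: recursive-doubling broadcast with root == 0.
 * With a power-of-two communicator size every rank is reached in
 * log2(size) steps. */
void log_bcast_from_rank0(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);   /* resolved internally, just as MPI_Bcast does */
    MPI_Comm_size(comm, &size);

    for (int step = 1; step < size; step <<= 1) {
        if (rank < step) {                    /* this rank already holds the data */
            int peer = rank ^ step;           /* partner for this iteration */
            if (peer < size)
                MPI_Send(buf, count, type, peer, 0, comm);
        } else if (rank < 2 * step) {         /* this rank receives the data now */
            MPI_Recv(buf, count, type, rank ^ step, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}

Each rank decides locally, from nothing but its own rank and the communicator size, whether it sends or receives at every step; that is exactly why every process has to make the call.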

You are right, the rank is stored in the communicator, and is available to the implementation of MPI_Bcast internally. Ranks are assigned when a communicator is created. For example, MPI_COMM_WORLD is created by MPI_Init.
MPI_Comm_rank simply gets the rank value from the communicator. There's no requirement to call it before a broadcast. However, knowing the rank is usually necessary to do any meaningful programming.
Note that since MPI_Bcast is a collective call, it needs to be performed by all processes in the communicator.

int root is the rank of the broadcast root; essentially, MPI broadcast sends a message from rank root to all the other ranks.
I would also consider it a best practice to call the following after MPI_Init:
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
This assigns each process an int rank value from 0 to n-1.
And:
MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);
This sets the int Numprocs to the total number of processes.
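Putting those pieces together, a minimal sketch (the variable names are just illustrative) looks like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, Numprocs, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* rank is 0 .. Numprocs-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);  /* total number of processes */

    if (rank == 0)
        n = 42;                                /* only the root sets the value */

    /* every rank must call MPI_Bcast; afterwards all ranks hold n == 42 */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d of %d sees n = %d\n", rank, Numprocs, n);
    MPI_Finalize();
    return 0;
}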

Related

How to synchronize (specific) work-items based on data, in OpenCL?

Context:
The need is to simulate a net of related discrete elements (a complex electronic circuit). Thus each component receives input from several other components and outputs to several others.
The intended design is to have a kernel with a configuration argument defining which component it shall represent. Each component of the circuit is represented by a work-item, and the whole circuit fits in a single work-group (or the circuit will be split appropriately so that each work-group can manage all of its components as work-items).
The problem:
Is it possible (and if so, how?) to have some work-items wait for other work-items' data?
A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before it can start its own processing.
The net has no loops, so a single work-item never needs to run twice.
Attempts:
In the following example, each component can have at most one input (to simplify), making the circuit a tree where the input of the circuit is the root and the 3 outputs are leaves.
inputIndex models this tree by indicating, for each component, which other component provides its input. The first component takes itself as input, but the kernel handles this case (for simplification).
result stores the result of each component (voltage, intensity, etc.).
inputModified indicates whether the given component has already calculated its output.
// where the data come from (index in result)
constant int inputIndex[5] = {0, 0, 0, 2, 2};

kernel void update_component(
    local int *result,        // each work-item's result
    local int *inputModified  // whether all inputs are ready (only one input in this example)
) {
    int id = get_local_id(0);
    int size = get_local_size(0);
    int barrierCount = 0;

    // inputModified is a boolean indicating if the input is ready
    inputModified[id] = (id != 0 ? 0 : 1);
    // make sure all inputs are false by default (except the first input).
    barrier(CLK_LOCAL_MEM_FENCE);

    // Wait until all inputs are ready (only one in this example)
    while (!inputModified[inputIndex[id]] && size > barrierCount++)
    {
        // If the input is not ready, wait for it
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // all inputs are ready, compute output
    if (id != 0) result[id] = result[inputIndex[id]] + 1;
    else result[0] = 42;

    // make sure any other work-item depending on this is unblocked
    inputModified[id] = 1;

    // Even if finished, we need to "barrier" for other work-items.
    while (size > barrierCount++)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
This example has N barriers for N components, making it worse than a sequential solution.
Note: this is only the kernel, as the minimal C++ host is quite long. If needed, I can find a way to add it.
Question:
Is it possible, efficiently and from within the kernel itself, to have the different work-items wait for their data to be provided by other work-items? If not, what solution would be efficient?
This problem is (for me) not trivial to explain and I am far from an expert in OpenCL. Please be patient and feel free to ask if anything is unclear.
From documentation of barrier
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/barrier.html
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue
execution beyond the barrier.
But the while loop (containing a barrier) in the kernel has this condition:
inputModified[inputIndex[id]]
which can differ between work-items, so some work-items may execute the barrier while others skip it; that leads to undefined behavior. Besides, the barrier before it,
barrier(CLK_LOCAL_MEM_FENCE);
already synchronizes all work-items in the work-group, so the while loop is redundant even if it works.
The last barrier loop is also redundant:
while (size > barrierCount++)
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
When the kernel ends, all work-items are synchronized anyway.
If you mean to send a message to work-items outside the work-group, then you can only use atomic variables. Even when using atomics, you should not assume any execution/issuing order between any two work-items.
Your question
Is it possible (and if so, how?) to have some work-items wait for other work-items' data? A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before it can start its own processing. The net has no loops, so a single work-item never needs to run twice.
can be answered with the OpenCL 2.x feature "dynamic parallelism", which lets a work-item enqueue new kernels/work-groups from inside a kernel. It is much more efficient than waiting in a spin-wait loop and far more hardware-independent than relying on the number of in-flight threads a GPU supports (when the GPU can't keep that many threads in flight, any spin-wait will deadlock, regardless of thread ordering).
When you use a barrier, you don't need to inform other work-items through inputModified: the data in result is already visible within the work-group after the barrier.
If you can't use OpenCL 2.x, then you should process the tree using BFS (see the per-level kernel sketch below):
start 1 work-item for the top node
process it, prepare its K outputs, and push them into a queue
end kernel
start K work-items (each pops an element from the queue)
process them, prepare their N outputs, and push them into the queue
end kernel
repeat until the queue has no more elements
The number of kernel calls equals the maximum depth of the tree, not the number of nodes.
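A hedged sketch of such a per-level kernel (the level buffer and the host-side re-launch per tree level are assumptions, not part of the question; inputIndex and result are as in the question):

// One BFS level: each work-item handles one node of the current frontier.
kernel void process_level(global const int *level,       // node ids of this frontier
                          global const int *inputIndex,  // parent of each node
                          global int *result)            // per-node results
{
    int i  = get_global_id(0);
    int id = level[i];
    if (id == 0)
        result[0] = 42;                           // root: seed value
    else
        result[id] = result[inputIndex[id]] + 1;  // parent was computed in an earlier launch
}

The host fills level with the node ids of the current tree level and enqueues one launch per level, so parents are always finished before their children run.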
If you need faster synchronization than kernel launches, then use a single work-group for the whole tree and a barrier instead of kernel re-launches (see the sketch below). Or process the first few levels on the CPU to obtain multiple sub-trees and hand them to different OpenCL work-groups. Computing on the CPU until there are N sub-trees, where N is the number of compute units of the GPU, may work well for this work-group-barrier based asynchronous computation of sub-trees.
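A sketch of the single-work-group variant, assuming a precomputed level array giving each node's depth (an assumption, not in the question's code); one barrier per level replaces the flag polling:

kernel void update_component_levels(local int *result,
                                    constant int *inputIndex, // parent of each node
                                    constant int *level,      // level[id] = depth of node id
                                    int depth)                // maximum depth of the tree
{
    int id = get_local_id(0);
    if (id == 0)
        result[0] = 42;                          // root value
    for (int d = 1; d <= depth; ++d) {
        barrier(CLK_LOCAL_MEM_FENCE);            // results of level d-1 are now visible
        if (level[id] == d)
            result[id] = result[inputIndex[id]] + 1;
    }
}

Every work-item executes the barrier the same number of times, so the barrier rules are respected, and the number of barriers equals the tree depth rather than the number of components.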
There is also a barrierless, atomicless, single-kernel-call way to do this: start the tree from the bottom and go up.
Map all deepest-level child nodes to work-items. Move each of them towards the top while recording their path (node id, etc.) in private memory or some other fast memory. Then have them traverse back top-down through that recorded path, computing on the way, without any synchronization or even atomics. This is less work-efficient than the barrier/kernel-call versions, but the lack of barriers and the totally asynchronous paths should make it fast enough.
If the tree has depth 10, this means 10 node pointers to save, which is not much for private registers. If the tree depth is around 30-40, use local memory with fewer threads per work-group; if it is even deeper, allocate global memory.
But you may need to sort the work-items by their spatial location / the tree's topology so they work together faster with less branching.
This way looks simplest to me, so I suggest you try this barrierless version first.
If you only want data visibility per work-item instead of per work-group or per kernel, use a fence: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/mem_fence.html

MPI rank determination

I am new to MPI and I often see code like the following in MPI programs:
if (rank == 0) {
    MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
}
else {
    MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
}
It seems that the rank determines which process is sending and which process is receiving. But
how is the rank of a process determined by calling MPI_Comm_rank(MPI_COMM_WORLD, &rank);?
Is it related to the command line arguments for mpirun ?
For example:
mpirun -n 2 -host localhost1,localhost2 ./a.out
(localhost1 is rank 0 and localhost2 is rank 1?)
How is the program going to determine who has rank 0 and who has rank 1?
Is there a way for me to specify something such that say localhost1 is sending and localhost2 is receiving?
Usually, if you're trying to think about communication in your MPI program based on physical processors/machines, you're not going about it in the right way. Most of the time, it doesn't matter which actual machine each rank is mapped to. All that matters is that you call mpiexec or mpirun (they're usually the same thing), something inside your MPI implementation starts up n processes which could be located locally, remotely, or some combination of the two, and assigns them ranks. Theoretically those ranks could be assigned arbitrarily, though it's usually in some predictable way (often something like round-robin over the entire group of hosts that are available). Inside your program, it usually makes very little difference whether you're running rank 0 on host0 or host1. The important thing is that you are doing specific work on rank 0, that requires communication from rank 1.
That being said, there are more rare times where it might be important which rank is mapped to which processor. Examples might be:
If you have GPUs on some nodes and not others and you need certain ranks to be able to control a GPU.
You need certain processes to be mapped to the same physical node to optimize communication patterns for things like shared memory.
You have data staged on certain hosts that needs to map to specific ranks.
These are all advanced examples. Usually if you're in one of these situations, you've been using MPI long enough to know what you need to do here, so I'm betting that you're probably not in this scenario.
Just remember, it doesn't really matter where my ranks are. It just matters that I have the right number of them.
Disclaimer: All of that being said, it does matter that you launch the correct number of processes. What I mean by that is, if you have 2 hosts that each have a single quad-core processor, it doesn't make sense to start a job with 16 ranks. You'll end up spending all of your computational time context switching your processes in and out. Try not to have more ranks than you have compute cores.
When you call mpirun, there is a process manager which determines the node/rank assignment of your processes. I suggest you have a look at Controlling Process Placement with the Intel MPI Library; for Open MPI, check the -npernode and -pernode options.
Use this Hello world test to check whether this is what you want; a minimal version is sketched below.
You can also simply change the condition (rank == 1) if you want to swap which process does which work.
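A minimal hello-world along these lines (a sketch, not necessarily the exact test referenced above) prints which host each rank actually runs on:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);   /* name of the node this rank runs on */

    printf("rank %d of %d running on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}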

Variable memory allocation in MPI Code

In a cluster running MPI code, is a copy of all the declared variables sent to all nodes, so that all nodes can access them locally and do not have to perform remote memory accesses?
No, MPI itself can't do this for you in a single call.
Every MPI process has its own memory state, and every value may differ between processes.
The only way of sending/receiving data is to use explicit MPI calls, like MPI_Send or MPI_Recv. You can pack most of your data into some memory area and send that area to each MPI process, but it will not contain 'every declared variable', only the variables you placed into it manually.
Update:
Each node runs a copy of the program. Each copy initializes its variables as it wants (the initialization can be the same everywhere, or individual, based on the MPI process number, called the rank and obtained from the MPI_Comm_rank function). So every variable exists in N copies: one set per MPI process. Every process sees the variables, but only the set it owns. The values of the variables are not synchronized automatically.
So the programmer's task is to synchronize the values of variables between nodes (MPI processes).
E.g. here is a small MPI program to compute pi:
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/cpi_c.htm
It sends the value of the 'n' variable from the first process to all others (MPI_Bcast), and after the calculation every process sends its own 'mypi' into the 'pi' variable of the first process (the individual values are added together via the MPI_Reduce function).
Only the first process is able to read n from the user (via scanf), and this code is executed conditionally based on the rank of the process; the other processes must get n from the first one because they didn't read it from the user directly.
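A condensed sketch of that pattern (not the full linked program) looks roughly like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, n = 0;
    double mypi = 0.0, pi = 0.0, h, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        scanf("%d", &n);                  /* only rank 0 reads n from the user */

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* now every rank has n */

    h = 1.0 / (double)n;
    for (int i = rank + 1; i <= n; i += size) {     /* each rank takes every size-th interval */
        x = h * ((double)i - 0.5);
        mypi += 4.0 / (1.0 + x * x);
    }
    mypi *= h;

    /* the individual 'mypi' values are added into 'pi' on rank 0 */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}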
Update 2 (sorry for the late answer):
This is the syntax of MPI_Bcast: the programmer passes the address of a variable to this function. Each MPI process passes the address of its own 'n' variable (the addresses can differ). MPI_Bcast then checks the rank of the current process and compares it with the other argument, the rank of the "broadcaster" (root).
If the current process is the broadcaster, MPI_Bcast reads the value placed in memory at the given address (it reads the value of the 'n' variable on the broadcaster); then the value is sent over the network.
Otherwise, if the current process is not the broadcaster, it is a receiver. MPI_Bcast at the receiver gets the value from the broadcaster (using MPI library internals, over the network) and stores it in the memory of the current process at the given address.
So the address is given to this function because on some nodes the function will write to the variable. Only the value is sent over the network.

What is the difference between ranks and processes in MPI?

What is the difference between ranks and processes in MPI?
Here is the resource I learned all my MPI from, you might find it useful.
As to your question: processes are the actual instances of the program that are running. MPI allows you to create logical groups of processes, and in each group, a process is identified by its rank. This is an integer in the range [0, N-1] where N is the size of the group. Communicators are objects that handle communication between processes. An intra-communicator handles processes within a single group, while an inter-communicator handles communication between two distinct groups.
By default, you have a single group that contains all your processes, and the intra-communicator MPI_COMM_WORLD that handles communication between them. This is sufficient for most applications, and does blur the distinction between process and rank a bit. The main thing to remember is that the rank of a process is always relative to a group. If you were to split your processes into two groups (e.g. one group to read input and another group to process data), then each process would now have two ranks: the one it originally had in MPI_COMM_WORLD, and one in its new group.
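A small sketch of that last point: splitting MPI_COMM_WORLD by even/odd rank (the splitting criterion is just an example) leaves each process with two ranks, one per communicator.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, sub_rank;
    MPI_Comm sub;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* split into two groups: even world ranks and odd world ranks */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub);
    MPI_Comm_rank(sub, &sub_rank);

    printf("world rank %d has rank %d in its sub-communicator\n",
           world_rank, sub_rank);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}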
Rank is a logical way of numbering processes. For instance, you might have 16 parallel processes running; if you query for the current process' rank via MPI_Comm_rank you'll get 0-15.
Rank is used to distinguish processes from one another. In basic applications you'll probably have a "primary" process on rank = 0 that sends out messages to "secondary" processes on rank 1-15. For more advanced applications you can divide workloads even further using ranks (i.e. 0 rank primary process, 1-7 perform function A, 8-15 perform function B).
Every process that belongs to a communicator is uniquely identified by its rank. The rank of a process is an integer that ranges from zero up to the size of the communicator minus one. A process can determine its rank in a communicator by using the MPI_Comm_rank function that takes two arguments: the communicator and an integer variable rank:
int MPI_Comm_rank(MPI_Comm comm, int *rank)
The parameter rank will store the rank of the process.
Note that each process that calls this function must belong to the supplied communicator, otherwise an error will occur.

Add received data to existing receive buffer in MPI_Sendrecv

I am trying to send data (forces) between 2 processes using MPI_Sendrecv. Normally the data in the receive buffer is overwritten; I do not want to overwrite it, but instead add the received data to it.
I could do the following: store the data from the previous time step in a different array and then add it after receiving. But I have a huge number of nodes and I do not want to allocate memory for that storage every time step (or keep overwriting it).
My question: is there a way, using MPI, to add the received data directly into the existing buffer?
Any help in this direction would be really appreciated.
I am sure collective communication calls (MPI_Reduce) cannot be used here. Are there any other calls that can do this?
In short: no, but you should be able to do this.
In long: Your suggestion makes a great deal of sense and the MPI Forum is currently considering new features that would enable essentially what you want.
It is incorrect to suggest that the data must be received before it can be accumulated. MPI_Accumulate does a remote accumulation in a one-sided fashion. You want MPI_Sendrecv_accumulate rather than MPI_Sendrecv_replace. This makes perfect sense and an implementation can internally do much better than you can because it can buffer on a per-packet basis, for example.
For suszterpatt: MPI internally buffers in the eager protocol, and in the rendezvous protocol it can set up a pipeline to minimize buffering.
A sketch of the implementation of such an MPI_Recv_accumulate (only the receive side, for simplicity, as the MPI_Send part need not be considered) looks like this:
int MPI_Recv_accumulate(void *buf, int count, MPI_Datatype datatype, MPI_Op op,
                        int source, int tag, MPI_Comm comm, MPI_Status *status)
{
    if (eager)
        MPI_Reduce_local(_eager_buffer, buf, count, datatype, op);
    else /* rendezvous */
    {
        malloc _buffer
        while (mycount < count)
        {
            receive part of the incoming data into _buffer
            reduce_local from _buffer into buf
        }
    }
}
In short: no.
In long: your suggestion doesn't really make sense. The machine can't perform any operations on your received value without first putting it into local memory somewhere. You'll need a buffer to receive the newest value, and a separate sum that you will increment by the buffer's content after every receive.
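A sketch of that workaround: receive into a temporary buffer with MPI_Sendrecv, then add it into the persistent array with MPI_Reduce_local (the function and variable names here are illustrative, not from the question). The temporary buffer can be allocated once and reused across time steps, which avoids the per-step allocation the question worries about.

#include <mpi.h>

/* exchange forces with 'peer' and add the received values into 'forces';
 * 'tmp' is a reusable scratch buffer of at least n doubles */
void exchange_and_add(double *forces, double *tmp, int n, int peer, MPI_Comm comm)
{
    /* send our forces and receive the peer's forces into tmp */
    MPI_Sendrecv(forces, n, MPI_DOUBLE, peer, 0,
                 tmp,    n, MPI_DOUBLE, peer, 0,
                 comm, MPI_STATUS_IGNORE);

    /* forces[i] += tmp[i] for every element */
    MPI_Reduce_local(tmp, forces, n, MPI_DOUBLE, MPI_SUM);
}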
