What is the difference between ranks and processes in MPI?

What is the difference between ranks and processes in MPI?

Here is the resource I learned all my MPI from; you might find it useful.
As to your question: processes are the actual instances of the program that are running. MPI allows you to create logical groups of processes, and in each group, a process is identified by its rank. This is an integer in the range [0, N-1] where N is the size of the group. Communicators are objects that handle communication between processes. An intra-communicator handles processes within a single group, while an inter-communicator handles communication between two distinct groups.
By default, you have a single group that contains all your processes, and the intra-communicator MPI_COMM_WORLD that handles communication between them. This is sufficient for most applications, and does blur the distinction between process and rank a bit. The main thing to remember is that the rank of a process is always relative to a group. If you were to split your processes into two groups (e.g. one group to read input and another group to process data), then each process would now have two ranks: the one it originally had in MPI_COMM_WORLD, and one in its new group.
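For example, a minimal sketch of such a split using MPI_Comm_split (the half-and-half split criterion and the variable names are just choices made for this illustration):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Put the lower half of the ranks in one group and the upper half
       in another; the split criterion here is arbitrary. */
    int color = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    int sub_rank;
    MPI_Comm_rank(sub_comm, &sub_rank);

    /* The same process now has two ranks: one per communicator. */
    printf("world rank %d has rank %d in subgroup %d\n",
           world_rank, sub_rank, color);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}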

Rank is a logical way of numbering processes. For instance, you might have 16 parallel processes running; if you query for the current process' rank via MPI_Comm_rank you'll get 0-15.
Rank is used to distinguish processes from one another. In basic applications you'll probably have a "primary" process on rank = 0 that sends out messages to "secondary" processes on ranks 1-15. For more advanced applications you can divide workloads even further using ranks (e.g. the rank 0 primary process coordinates, ranks 1-7 perform function A, and ranks 8-15 perform function B).
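A minimal sketch of that kind of division, assuming 16 processes and hypothetical functions do_primary_work, function_A and function_B:

#include <mpi.h>

/* Hypothetical work functions; their contents are application-specific. */
void do_primary_work(void) { /* coordinate the run */ }
void function_A(void)      { /* work of type A */ }
void function_B(void)      { /* work of type B */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        do_primary_work();   /* primary process */
    else if (rank <= 7)
        function_A();        /* ranks 1-7 */
    else
        function_B();        /* ranks 8-15 */

    MPI_Finalize();
    return 0;
}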

Every process that belongs to a communicator is uniquely identified by its rank. The rank of a process is an integer that ranges from zero up to the size of the communicator minus one. A process can determine its rank in a communicator by using the MPI_Comm_rank function that takes two arguments: the communicator and an integer variable rank:
int MPI_Comm_rank(MPI_Comm comm, int *rank)
The parameter rank will store the rank of the process.
Note that each process that calls this function must belong to the supplied communicator, otherwise an error will occur.
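A minimal usage sketch (MPI_Comm_size, shown alongside it, returns the size of the communicator):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes in the communicator */

    printf("Process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}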

Related

how does parallel GADriver support distributed memory

When I set run_parallel = True for the SimpleGADriver, how is the memory handled? Does it do anything with distributed memory? Does it send each point in the generation to a single memory (in case I have a setup that connects multiple nodes, each with its own memory)?
I am not sure I completely understand your question, but I can give an overview of how it works.
When "run_parallel" is True, and you are running under MPI with n processors, the SimpleGADriver will use those procs to evaluate the newly generated population design values. To start, the GA runs on each processor with local values in local memory. When a new set of points is generated, the values from rank 0 are broadcast to all ranks and placed into a list. Then those points are evaluated based on the processor rank, so that each proc is evaluating a different point. When completed, all of the values are allgathered, after which, every processor has all of the objective values for the new generation. This process continues until termination criteria are reached.
So essentially, we are just using multiple processors to speed up objective function evaluation (i.e., running the model), which can be significant for slower models.
One caveat is that the total population size needs to be divisible by the number of processors or an exception will be raised.
The choice to broadcast the population from rank 0 (rather than any other rank) is arbitrary, but those values come from a process that includes random crossover and tournament selection, so each processor does generate a new valid unique population and we just choose one.
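To make the broadcast / evaluate / allgather cycle concrete, here is a rough MPI sketch of that pattern in C. This is not OpenMDAO's actual implementation (which is in Python); the evaluate_point function, the population size, and the random design values are placeholders, and the population size is assumed to be divisible by the number of processes:

#include <mpi.h>
#include <stdlib.h>

/* Placeholder objective function: evaluates one design point. */
double evaluate_point(double x) { return x * x; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int pop_size = 16;             /* assumed divisible by nprocs */
    const int per_proc = pop_size / nprocs;

    double points[16];
    if (rank == 0) {
        /* Rank 0's newly generated population (stand-in for crossover/selection). */
        for (int i = 0; i < pop_size; i++)
            points[i] = (double)rand() / RAND_MAX;
    }

    /* Broadcast rank 0's population to every rank. */
    MPI_Bcast(points, pop_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each rank evaluates its own slice of the population. */
    double local_obj[16];
    for (int i = 0; i < per_proc; i++)
        local_obj[i] = evaluate_point(points[rank * per_proc + i]);

    /* Gather all objective values so every rank sees the full generation. */
    double all_obj[16];
    MPI_Allgather(local_obj, per_proc, MPI_DOUBLE,
                  all_obj, per_proc, MPI_DOUBLE, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}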

Is `MPI_COMM_WORLD` a bad idea if only pairs of ranks ever communicate?

I have very little experience with MPI, so please forgive the naïveté of this question.
I have what I thought was a relatively simple MPI program: a large amount of independent tasks need to be computed, and assembled into a certain array. I do this by letting all ranks talk to rank 0, requesting new entries to compute, and reporting the result back to rank 0. Each communication is of the order of just hundreds of bytes, and always happens between rank 0 and any one of the other ranks. Is it a bad idea to keep all the ranks in the MPI_COMM_WORLD communicator in such a setup? Should I split into a bunch of separate communicators consisting of rank 0 & 1, rank 0 & 2, rank 0 & 3, … , rank 0 & n? Or is it OK to stay in MPI_COMM_WORLD as long as all communications just go to one other rank anyway?
Communicators are heavy-weight objects. Creating a new communicator takes time and consumes internal MPI resources.
It is better to create new communicators for the following cases:
When different abstraction layers within your application need “safe” communication scopes.
When you need to subset the processes from a parent communicator (e.g., you have a significant operation that only needs to be performed by half the processes in MPI_COMM_WORLD; see the sketch after this answer).
When you need to re-order the processes from a parent communicator (e.g., your MPI processes have a logical ordering that is different than their “native” MPI_COMM_WORLD rank).
In short: Only create new communicators when you need a whole new/safe communication scope, or you need to change your existing scope (e.g., subset and/or reorder the member processes).
Reference: Here
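As an illustration of the subsetting case mentioned above, here is a minimal sketch using MPI_Comm_split; the even split and the name half_comm are just choices for this example:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Only the lower half of the ranks join the new communicator;
       the rest pass MPI_UNDEFINED and get MPI_COMM_NULL back. */
    int color = (rank < size / 2) ? 0 : MPI_UNDEFINED;
    MPI_Comm half_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &half_comm);

    if (half_comm != MPI_COMM_NULL) {
        /* ... perform the operation that only needs half the processes ... */
        MPI_Comm_free(&half_comm);
    }

    MPI_Finalize();
    return 0;
}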

MPI rank process

I am an MPI beginner, so I'd like to know exactly what the rank in an MPI program means, and why we need it.
For example, there are 2 lines of code here:
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
To understand this, you need to realise that MPI uses the SPMD (Single Program Multiple Data) model. This means that if you run this program in parallel, e.g. on 4 processes at the same time, every process runs its own independent copy of the same program. So, the basic question is: why doesn't every process do the same thing? To make use of parallel programming, you need processes to do different things. For example, you might want one process to act as a controller sending jobs to multiple workers. The rank is the fundamental identifier for each process. If run on 4 processes, then the above program would return ranks of 0, 1, 2 and 3 on the different processes. Once a process knows its rank it can then act appropriately, e.g. "if my rank is zero then call the controller function else call the worker function".
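For example, a minimal sketch of that pattern, where controller() and worker() stand in for whatever application-specific functions you would write:

#include <mpi.h>

/* Placeholder functions; their contents are application-specific. */
void controller(void) { /* hand out jobs, collect results */ }
void worker(void)     { /* request jobs, compute, report back */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    if (world_rank == 0)
        controller();
    else
        worker();

    MPI_Finalize();
    return 0;
}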

MPI rank determination

I am new to MPI and I often see the following codes in MPI code:
if (rank == 0) {
    MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
}
else {
    MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
}
It seems that the rank determines which process is sending and which process is receiving. But how is the rank of a process determined by calling MPI_Comm_rank(MPI_COMM_WORLD, &rank)? Is it related to the command line arguments for mpirun?
For example:
mpirun -n 2 -host localhost1,localhost2 ./a.out
(localhost1 is rank 0 and localhost2 is rank 1?)
How is the program going to determine who has rank 0 and who has rank 1?
Is there a way for me to specify something such that say localhost1 is sending and localhost2 is receiving?
Usually, if you're trying to think about communication in your MPI program based on physical processors/machines, you're not going about it in the right way. Most of the time, it doesn't matter which actual machine each rank is mapped to. All that matters is that when you call mpiexec or mpirun (they're usually the same thing), something inside your MPI implementation starts up n processes, which could be located locally, remotely, or some combination of the two, and assigns them ranks. Theoretically those ranks could be assigned arbitrarily, though it's usually done in some predictable way (often something like round-robin over the entire group of hosts that are available). Inside your program, it usually makes very little difference whether you're running rank 0 on host0 or host1. The important thing is that you are doing specific work on rank 0 that requires communication with rank 1.
That being said, there are rarer cases where it might be important which rank is mapped to which processor. Examples might be:
If you have GPUs on some nodes and not others and you need certain ranks to be able to control a GPU.
You need certain processes to be mapped to the same physical node to optimize communication patterns for things like shared memory.
You have data staged on certain hosts that needs to map to specific ranks.
These are all advanced examples. Usually if you're in one of these situations, you've been using MPI long enough to know what you need to do here, so I'm betting that you're probably not in this scenario.
Just remember, it doesn't really matter where my ranks are. It just matters that I have the right number of them.
Disclaimer: All of that being said, it does matter that you launch the correct number of processes. What I mean by that is, if you have 2 hosts that each have a single quad-core processor, it doesn't make sense to start a job with 16 ranks. You'll end up spending all of your computational time context switching your processes in and out. Try not to have more ranks than you have compute cores.
When you call mpirun, there is a process manager which determines the node/rank assignment of your processes. I suggest you have a look at Controlling Process Placement with the Intel MPI library; for Open MPI, check the -npernode and -pernode options.
Use this Hello world test to check if this is what you want.
You can also simply flip the condition (e.g. check rank == 1 instead) if you want to swap which process does the sending and which does the receiving.
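The linked Hello world test is not reproduced here, but a minimal check of that kind could look like the sketch below; it uses MPI_Get_processor_name to report which host each rank was placed on:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);

    /* Shows which host each rank landed on. */
    printf("rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}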

Variable memory allocation in MPI Code

In a cluster running MPI code, is a copy of all the declared variables sent to all nodes, so that all nodes can access them locally and do not have to perform remote memory accesses?
No, MPI itself can't do this for you in a single call.
Every MPI process has its own memory, and the value of any variable may be different in each MPI process.
The only way to send or receive data is through explicit MPI calls such as MPI_Send and MPI_Recv. You can pack most of your data into some area of memory and send that area to every MPI process, but it will not contain 'every declared variable', only the variables you placed into that area manually.
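A minimal sketch of that "pack an area of memory and send it explicitly" idea, using MPI_Pack and MPI_Unpack between two ranks (the variables n and x are just placeholders; run with at least two processes):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two variables we choose to share; any variable not packed here
       stays local to each process. */
    int    n = 0;
    double x = 0.0;

    char buf[64];
    int  position = 0;

    if (rank == 0) {
        n = 42;
        x = 3.14;
        /* Pack the chosen variables into one contiguous buffer ... */
        MPI_Pack(&n, 1, MPI_INT,    buf, sizeof(buf), &position, MPI_COMM_WORLD);
        MPI_Pack(&x, 1, MPI_DOUBLE, buf, sizeof(buf), &position, MPI_COMM_WORLD);
        /* ... and send it explicitly to rank 1. */
        MPI_Send(buf, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Unpack(buf, sizeof(buf), &position, &n, 1, MPI_INT,    MPI_COMM_WORLD);
        MPI_Unpack(buf, sizeof(buf), &position, &x, 1, MPI_DOUBLE, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}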
Update:
Each node runs a copy of the program. Each copy initializes its variables as it wants (the initialization can be the same everywhere, or individual, based on the MPI process number, called the rank, obtained from the MPI_Comm_rank function). So every variable exists in N copies: one set per MPI process. Every process sees variables, but only the set it owns, and the values of variables are not synchronized automatically.
So the task of the programmer is to synchronize the values of variables between nodes (MPI processes).
E.g. here is small MPI program to compute Pi:
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/cpi_c.htm
It sends the value of the 'n' variable from the first process to all the others (MPI_Bcast), and after the calculation every process sends its own 'mypi' to be combined into the 'pi' variable of the first process (the individual values are added together via the MPI_Reduce function).
Only the first process is able to read N from the user (via scanf), and that code is conditionally executed based on the rank of the process; the other processes must get N from the first process because they did not read it from the user directly.
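A condensed sketch of that kind of program (not the linked file verbatim): rank 0 reads n, MPI_Bcast distributes it, every rank computes its own partial sum mypi, and MPI_Reduce adds the partial results into pi on rank 0:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 0;
    if (rank == 0) {
        printf("Number of intervals: ");
        fflush(stdout);
        scanf("%d", &n);                 /* only rank 0 reads from the user */
    }

    /* Every rank now receives rank 0's value of n. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Each rank integrates its own subset of the intervals (midpoint rule). */
    double h = 1.0 / n, mypi = 0.0;
    for (int i = rank; i < n; i += size) {
        double x = h * (i + 0.5);
        mypi += h * 4.0 / (1.0 + x * x);
    }

    /* Sum the partial results into pi on rank 0. */
    double pi = 0.0;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}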
Update 2 (sorry for the late answer):
This is the syntax of MPI_Bcast (the C binding is reproduced after this explanation): the programmer gives the address of a variable to this function. Each MPI process gives the address of its own 'n' variable (the address can be different in each process), and MPI_Bcast checks the rank of the current process and compares it with another argument, the rank of the "broadcaster" (the root).
If the current process is the broadcaster, MPI_Bcast reads the value placed in memory at the given address (the value of the 'n' variable on the broadcaster); that value is then sent over the network.
Otherwise the current process is a receiver: MPI_Bcast on a receiver gets the value from the broadcaster (using the MPI library internals, over the network) and stores it in the memory of the current process at the given address.
So the address is given to this function because on some nodes the function will write to the variable; only the value is sent over the network.
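For reference, the C binding of MPI_Bcast is:
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
Here root is the rank of the broadcaster, and buffer is read on the root and written on every other rank of the communicator.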
