MPI rank determination - mpi

I am new to MPI and I often see the following pattern in MPI code:
if (rank == 0) {
    MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
}
else {
    MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
}
It seems that the rank determines which process is sending and which process is receiving. But
how is the rank of a process determined by calling MPI_Comm_rank(MPI_COMM_WORLD, &rank);?
Is it related to the command line arguments for mpirun?
For example:
mpirun -n 2 -host localhost1,localhost2 ./a.out
(localhost1 is rank 0 and localhost2 is rank 1?)
How is the program going to determine who has rank 0 and who has rank 1?
Is there a way for me to specify that, say, localhost1 is sending and localhost2 is receiving?

Usually, if you're trying to think about communication in your MPI program in terms of physical processors or machines, you're not going about it the right way. Most of the time, it doesn't matter which actual machine each rank is mapped to. All that matters is that when you call mpiexec or mpirun (they're usually the same thing), something inside your MPI implementation starts n processes, which may be local, remote, or some combination of the two, and assigns them ranks. In theory those ranks could be assigned arbitrarily, though in practice the assignment is usually predictable (often something like round-robin over the hosts that are available). Inside your program, it usually makes very little difference whether rank 0 is running on host0 or host1. The important thing is that you do specific work on rank 0 that requires communication with rank 1.
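To make the round-robin idea concrete, here is a small sketch of two placement policies a launcher might use: dealing ranks out to hosts one at a time, or filling each host before moving to the next. The function names and the fixed slots-per-host parameter are hypothetical; this is plain C that mimics the bookkeeping, not the algorithm of any particular MPI implementation.

```c
#include <stdio.h>

/* Hypothetical policy 1: deal ranks out to hosts like cards (round-robin). */
int host_by_node(int rank, int num_hosts) {
    return rank % num_hosts;
}

/* Hypothetical policy 2: fill every slot on a host before using the next one. */
int host_by_slot(int rank, int slots_per_host) {
    return rank / slots_per_host;
}

/* Print where each of num_ranks ranks would land under both policies. */
void print_placement(int num_ranks, int num_hosts, int slots_per_host) {
    for (int rank = 0; rank < num_ranks; rank++) {
        printf("rank %d: round-robin -> host%d, fill-first -> host%d\n",
               rank, host_by_node(rank, num_hosts),
               host_by_slot(rank, slots_per_host));
    }
}
```

With 4 ranks on 2 hosts of 2 slots each, rank 1 lands on host1 under the first policy and on host0 under the second, while the program's rank == 0 logic is the same either way. To see the real placement of a running job, each rank can print the name returned by MPI_Get_processor_name().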
That being said, there are rarer cases where it might matter which rank is mapped to which processor. Examples might be:
If you have GPUs on some nodes and not others and you need certain ranks to be able to control a GPU.
You need certain processes to be mapped to the same physical node to optimize communication patterns for things like shared memory.
You have data staged on certain hosts that needs to map to specific ranks.
These are all advanced examples. Usually if you're in one of these situations, you've been using MPI long enough to know what you need to do here, so I'm betting that you're probably not in this scenario.
Just remember, it doesn't really matter where my ranks are. It just matters that I have the right number of them.
Disclaimer: All of that being said, it does matter that you launch the correct number of processes. What I mean by that is, if you have 2 hosts that each have a single quad-core processor, it doesn't make sense to start a job with 16 ranks. You'll end up spending all of your computational time context switching your processes in and out. Try not to have more ranks than you have compute cores.

When you call mpirun, a process manager determines which node each rank of your job is placed on. I suggest you have a look at Controlling Process Placement with the Intel MPI library, and for Open MPI
check the -npernode and -pernode options.
Use this Hello world test to check whether the placement is what you want.
You can also simply change the condition to (rank == 1), and swap the source and destination ranks in the two calls, if you want to exchange the roles of the two processes.

Related

MPI_Scatter: order of scatter

In my work, I noticed that even if I scatter the same amount of data to each process, it takes more time to transfer data from the root to the highest-rank process. I tested this on a distributed-memory machine. If an MWE is needed I will prepare one, but before that I would like to know whether MPI_Scatter favors lower-rank processes.
The MPI standard does not say such a thing, so MPI libraries are free to implement MPI_Scatter() the way they want regarding which task might return earlier than others.
Open MPI, for example, can do either a linear or a binomial scatter (by default, the algorithm is chosen based on the communicator and message sizes).
That being said, all data has to be sent from the root process to the other nodes, so obviously some nodes will be served first. If the root process has rank zero, I would expect the highest-rank process to receive its data last (I am not aware of any MPI library implementing a topology-aware MPI_Scatter(), but that might come some day). If the root process is not rank zero, then MPI might internally renumber the ranks (so that the root is always virtual rank zero), and if this pattern is implemented, the last process to receive the data would be (root + size - 1) % size.
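The renumbering arithmetic can be written out as a short sketch. The helper names below are hypothetical and only illustrate the modular formula from the paragraph above; they are not functions of any MPI library, and a real implementation is free to order its sends differently.

```c
/* Rank as seen by a root-relative algorithm: the root becomes virtual rank 0. */
int virtual_rank(int rank, int root, int size) {
    return (rank - root + size) % size;
}

/* Real rank that holds the highest virtual rank, i.e. the one served last
   by a scheme that hands out data in virtual-rank order. */
int last_served(int root, int size) {
    return (root + size - 1) % size;
}
```

For example, with 4 processes and root 2, the virtual order is 2, 3, 0, 1, so real rank 1 would be served last under this scheme.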
If this is suboptimal from your application's point of view, you always have the option to re-implement MPI_Scatter() your own way (your version can call the library-provided PMPI_Scatter() if needed). Another approach would be to call MPI_Comm_split() (with a single color, using the key argument to set the new order) to renumber the ranks, and use the new communicator for MPI_Scatter().

MPI rank process

I am an MPI beginner, so I'd like to know the exact definition of the rank in an MPI program, and why we need it.
For example, there are 2 lines of code here:
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
To understand this, you need to realise that MPI uses the SPMD (Single Program Multiple Data) model. This means that if you run this program in parallel, e.g. on 4 processes at the same time, every process runs its own independent copy of the same program. So, the basic question is: why doesn't every process do the same thing? To make use of parallel programming, you need processes to do different things. For example, you might want one process to act as a controller sending jobs to multiple workers. The rank is the fundamental identifier for each process. If run on 4 processes, then the above program would return ranks of 0, 1, 2 and 3 on the different processes. Once a process knows its rank it can then act appropriately, e.g. "if my rank is zero then call the controller function else call the worker function".
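As a rough sketch of "act appropriately once you know your rank", the two helpers below show two common uses: picking a role, and picking a share of the work. Both names are hypothetical, and in a real program the rank and size arguments would come from MPI_Comm_rank and MPI_Comm_size.

```c
/* Role chosen purely from the rank, as in "if my rank is zero ...". */
const char *role_for_rank(int rank) {
    return rank == 0 ? "controller" : "worker";
}

/* Half-open range [*begin, *end) of the n items owned by this rank when
   the work is split as evenly as possible over `size` processes. */
void slice_for_rank(int rank, int size, int n, int *begin, int *end) {
    int base = n / size;    /* items every rank gets */
    int extra = n % size;   /* the first `extra` ranks get one more item */
    *begin = rank * base + (rank < extra ? rank : extra);
    *end = *begin + base + (rank < extra ? 1 : 0);
}
```

Every process runs the same two functions, but because each passes in a different rank, each ends up with a different role and a different slice of the data.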

Is it possible to write with several processors to the same file, at the end of the file, in an ordered way?

I have 2 processors (this is an example), and I want these 2 processors to write to a file. I want them to write at the end of the file, but not in a mixed pattern like this:
[file content]
proc0
proc1
proc0
proc1
proc0
proc1
(and so on..)
I'd like to make them write following this kind of pattern :
[file content]
proc0
proc0
proc0
proc1
proc1
proc1
(and so on..)
Is it possible? If so, what's the setting to use?
The sequence in which your processes have outputs ready to report is, essentially, unknowable in advance. Even repeated runs of exactly the same MPI program will show differences in the ordering of outputs. So something, somewhere, is going to have to impose an ordering on the writes to the file.
A very common pattern, the one Wesley has already mentioned, is to have all processes send their outputs to one process, often process 0, and let it deal with the writing to file. This master-writer could sort the outputs before writing, but this creates a couple of problems: allocating space to store output before writing it and, more difficult to deal with, determining when a collection of output records can be sorted and written to file and the output buffers reused. How long does the master-writer wait, and how does it know whether a process is still working?
So it's common to have the master-writer write outputs as it gets them and for another program to order the output file as desired after the parallel program has finished. You could tack this on to your parallel program as a step after mpi_finalize or you could use a completely separate program (such as sort on a Linux machine). Of course, for this to work each output record has to contain some sequencing information on which to sort.
Another common pattern is to only have one process which does any writing at all, that is, none of the other processes do any output at all. This completely avoids the non-determinism of the sequencing of the writing.
Another pattern, less common partly because it is more difficult to implement and partly because it depends on underlying mechanisms which are not always available, is to use MPI-IO. With MPI-IO multiple processes can write to different parts of a file as if simultaneously. To actually write simultaneously, the program needs to be executing on hardware, a network and an operating system that support parallel I/O. It can be tricky to implement this even with the right platform, especially when the volume of output from processes is uncertain.
In my experience here on SO, people asking questions such as yours are probably at too early a stage in their MPI experience to be tackling parallel I/O, even if they have access to the necessary hardware.
I disagree with High Performance Mark. MPI-IO isn't so tricky in 2014 (as long as you have access to any file system besides NFS -- install PVFS if you need a cheap easy parallel file system).
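If you know how much data each process has, you can use MPI_SCAN to efficiently compute how much data was written by "earlier" processes, then use MPI_FILE_WRITE_AT_ALL to carry out the I/O efficiently. Because MPI_SCAN is inclusive, each process must subtract its own contribution from the result to get its starting offset (and, when appending to a file that already has content, add the file's current size). Here's one way you might do this:
incr = (count*datatype_size);
/* inclusive prefix sum: the result includes this rank's own incr */
MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT,
         MPI_SUM, MPI_COMM_WORLD);
new_offset -= incr; /* bytes written by lower ranks only */
MPI_File_write_at_all(mpi_fh, new_offset, buf, count,
                      datatype, status);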
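The offset arithmetic behind this approach can be checked without MPI. The serial sketch below computes the starting byte each rank should write at, which is what an inclusive scan minus the rank's own byte count yields. The function name and the file_size parameter (the bytes already in the file when appending) are assumptions made for illustration.

```c
#include <stddef.h>

/* Fill offsets[r] with the starting byte for rank r: the size already in the
   file plus the bytes contributed by ranks 0..r-1 (an exclusive prefix sum).
   Returns the file size after every rank has written. */
long long compute_offsets(const long long *bytes_per_rank, size_t num_ranks,
                          long long file_size, long long *offsets) {
    long long running = file_size;
    for (size_t r = 0; r < num_ranks; r++) {
        offsets[r] = running;           /* rank r starts where rank r-1 ended */
        running += bytes_per_rank[r];
    }
    return running;
}
```

Since each rank's region starts exactly where the previous rank's region ends, all of rank 0's records precede all of rank 1's in the file, regardless of the order in which the writes actually complete.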
The answer to your question is no. If you do things that way, you'll end up with jumbled output from all over the place.
However, you can get the same result by sending your output to a single processor and having it do all of the writing itself. For example, at the end of your application, just have every rank send its output to rank 0 and have rank 0 write it all to a file.

MPI and global variables

I have to implement an MPI program. There are some global variables (4 arrays of float numbers and 6 other single float variables) which are first initialized by the main process reading data from a file. Then I call MPI_Init and, while the process of rank 0 waits for results, the other processes (ranks 1, 2, 3, 4) work on the arrays, etc.
The problem is that those arrays no longer seem to be initialized; everything is set to 0. I tried moving the global variables inside the main function, but the result is the same. When MPI_Init() is called all processes are created by fork right? So every process has a copy of the parent's memory, so why do they see uninitialized arrays?
I fear you have misunderstood.
It is probably best to think of each MPI process as an independent program, albeit one with the same source code as every other process in the computation. Operations that process 0 carries out on variables in its address space have no impact on the contents of the address spaces of other processes.
I'm not sure that the MPI standard even requires process 0 to have values for variables which were declared and initialised prior to the call to mpi_init, that is before process 0 really exists.
Whether it does or not you will have to write code to get the values into the variables in the address space of the other processes. One way to do this would be to have process 0 send the values to the other processes, either one by one or using a broadcast. Another way would be for all processes to read the values from the input files; if you choose this option watch out for contention over i/o resources.
In passing, I don't think it is common for MPI implementations to create processes by forking at the call to mpi_init. I think that most MPI implementations actually create the processes when you make a call to mpiexec; the call to mpi_init is the formality which announces that your program is starting its parallel computations.
When MPI_Init() is called all processes are created by fork right?
Wrong.
MPI spawns multiple instances of your program. These instances are separate processes, each with its own memory space. Each process has its own copy of every variable, including globals. MPI_Init() only initializes the MPI environment so that other MPI functions can be called.
As the other answers say, that's not how MPI works. Data is unique to each process and must be explicitly transferred between processes using the API available in the MPI specification.
However, there are programming models that allow this sort of behavior. If, when you say parallel computing, you mean multiple cores on one processor, you might be better served by using something like OpenMP to share your data between threads.
Alternatively, if you do in fact need to use multiple processors (either because your data is too big to fit in one processor's memory, or for some other reason), you can take a look at one of the Partitioned Global Address Space (PGAS) languages. In those models, you have memory that is globally available to all processes in an execution.
Last, there is a part of MPI that does allow you to expose memory from one process to other processes. It's the Remote Memory Access (RMA) or One-Sided chapter. It can be complex, but powerful if that's the kind of computing model you need.
All of these models will require changing the way your application works, but it sounds like they might map to your problem better.

Asynchronous MPI with SysV shared memory

We have a large Fortran/MPI code-base which makes use of system-V shared memory segments on a node. We run on fat nodes with 32 processors, but only 2 or 4 NICs, and relatively little memory per CPU; so the idea is that we set up a shared memory segment, on which each CPU performs its calculation (in its block of the SMP array). MPI is then used to handle inter-node communications, but only on the master in the SMP group. The procedure is double-buffered, and has worked nicely for us.
The problem came when we decided to switch to asynchronous comms, for a bit of latency hiding. Since only a couple of CPUs on the node communicate over MPI, but all of the CPUs see the received array (via shared memory), a CPU doesn't know when the communicating CPU has finished, unless we enact some kind of barrier, and then why do asynchronous comms?
The ideal, hypothetical solution would be to put the request tags in an SMP segment and run mpi_request_get_status on the CPU which needs to know. Of course, the request tag is only registered on the communicating CPU, so it doesn't work! Another proposed possibility was to branch a thread off on the communicating thread and use it to run mpi_request_get_status in a loop, with the flag argument in a shared memory segment, so all the other images can see. Unfortunately, that's not an option either, since we are constrained not to use threading libraries.
The only viable option we've come up with seems to work, but feels like a dirty hack. We put an impossible value in the upper-bound address of the receive buffer, that way once the mpi_irecv has completed, the value has changed and hence every CPU knows when it can safely use the buffer. Is that ok? It seems that it would only work reliably if the MPI implementation can be guaranteed to transfer data consecutively. That almost sounds convincing, since we've written this thing in Fortran and so our arrays are contiguous; I would imagine that the access would be also.
Any thoughts?
Thanks,
Joly
Here's a pseudo-code template of the kind of thing I'm doing. Haven't got the code as a reference at home, so I hope I haven't forgotten anything crucial, but I'll make sure when I'm back to the office...
pseudo(array_arg1(:,:), array_arg2(:,:)...)
  integer, parameter : num_buffers=2
  Complex64bit, smp : buffer(:,:,num_buffers)
  integer : prev_node, next_node
  integer : send_tag(num_buffers), recv_tag(num_buffers)
  integer : current, next
  integer : num_nodes
  boolean : do_comms
  boolean, smp : safe(num_buffers)
  boolean, smp : calc_complete(num_cores_on_node,num_buffers)
  allocate_arrays(...)
  work_out_neighbours(prev_node,next_node)
  am_i_a_slave(do_comms)
  setup_ipc(buffer,...)
  setup_ipc(safe,...)
  setup_ipc(calc_complete,...)
  current = 1
  next = mod(current,num_buffers)+1
  safe=true
  calc_complete=false
  work_out_num_nodes_in_ring(num_nodes)
  do i=1,num_nodes
    if(do_comms)
      check_all_tags_and_set_safe_flags(send_tag, recv_tag, safe) # just in case anything else has finished.
      check_tags_and_wait_if_need_be(current, send_tag, recv_tag)
      safe(current)=true
    else
      wait_until_true(safe(current))
    end if
    calc_complete(my_rank,current)=false
    calc_complete(my_rank,current)=calculate_stuff(array_arg1,array_arg2..., buffer(current), bounds_on_process)
    if(not calc_complete(my_rank,current)) error("fail!")
    if(do_comms)
      check_all_tags_and_set_safe(send_tag, recv_tag, safe)
      check_tags_and_wait_if_need_be(next, send_tag, recv_tag)
      recv(prev_node, buffer(next), recv_tag(next))
      safe(next)=false
      wait_until_true(all(calc_complete(:,current)))
      check_tags_and_wait_if_need_be(current, send_tag, recv_tag)
      send(next_node, buffer(current), send_tag(current))
      safe(current)=false
    end if
    work_out_new_bounds()
    current=next
    next=mod(next,num_buffers)+1
  end do
end pseudo
So ideally, I would have liked to have run "check_all_tags_and_set_safe_flags" in a loop in another thread on the communicating process, or even better: do away with "safe flags" and make the handle to the sends / receives available on the slaves, then I could run: "check_tags_and_wait_if_need_be(current, send_tag, recv_tag)" (mpi_wait) before the calculation on the slaves instead of "wait_until_true(safe(current))".
"...unless we enact some kind of barrier, and then why do asynchronous comms?"
That sentence is a bit confused. The purpose of asynchronous communications is to overlap communication and computation, so that you can hopefully get some real work done while the communication is going on. But this means you now have two tasks occurring which eventually have to be synchronized, so there has to be something which blocks the tasks at the end of the first communication phase before they go on to the second computation phase (or whatever).
The question of what to do in this case to implement things nicely (it seems like what you've got now works but you're rightly concerned about the fragility of the result) depends on how you're doing the implementation. You use the word threads, but (a) you're using sysv shared memory segments, which you wouldn't need to do if you had threads, and (b) you're constrained not to be using threading libraries, so presumably you actually mean you're fork()ing processes after MPI_Init() or something?
I agree with Hristo that your best bet is almost certainly to use OpenMP for on-node distribution of computation, which would probably greatly simplify your code. It would help to know more about your constraint to not use threading libraries.
Another approach which would still avoid you having to "roll your own" process-based communication layer that you use in addition to MPI would be to have all the processes on the node be MPI processes, but create a few communicators - one to do the global communications, and one "local" communicator per node. Only a couple of processes per node would be a part of a communicator which actually does off-node communications, and the others do work on the shared memory segment. Then you could use MPI-based methods for synchronization (Wait, or Barrier) for the on-node synchronization. The upcoming MPI3 will actually have some explicit support for using local shared memory segments this way.
Finally, if you're absolutely bound and determined to keep doing things through what's essentially your own local-node-only IPC implementation --- since you're already using SysV shared memory segments, you might as well use SysV semaphores to do the synchronization. You're already using your own (somewhat delicate) semaphore-like mechanism to "flag" when the data is ready for computation; here you could use a more robust, already-written semaphore to let the non-MPI processes know when the data is ready for computation (and a similar mechanism to let the MPI process know when the others are done with the computation).
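A minimal sketch of that semaphore handshake is below, assuming a POSIX system with System V IPC available. The function names are hypothetical, and a forked child stands in for a non-MPI compute process while the parent stands in for the communicating process; a plain assignment takes the place of the completed receive.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__linux__)
/* glibc leaves this union for the caller to define. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };
#endif

/* Add delta to semaphore 0; a delta of -1 blocks until the count is positive. */
static int sem_add(int semid, short delta) {
    struct sembuf op = { .sem_num = 0, .sem_op = delta, .sem_flg = 0 };
    return semop(semid, &op, 1);
}

/* Hypothetical demo: returns the value the child saw in the shared buffer
   (only the low 8 bits survive the exit status), or -1 on error. */
int handoff_demo(int payload) {
    int result = -1;
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    int *buffer = (shmid >= 0) ? shmat(shmid, NULL, 0) : (void *)-1;
    union semun init = { .val = 0 };          /* count 0 means "not ready" */

    if (semid >= 0 && buffer != (void *)-1 &&
        semctl(semid, 0, SETVAL, init) == 0) {
        *buffer = -1;                          /* placeholder until data arrives */
        pid_t child = fork();
        if (child == 0) {                      /* compute process */
            sem_add(semid, -1);                /* sleep until data is ready */
            _exit(*buffer & 0xff);             /* report what it saw */
        }
        if (child > 0) {                       /* communicating process */
            int status = 0;
            *buffer = payload;                 /* stands in for a completed receive */
            sem_add(semid, 1);                 /* signal "ready" */
            if (waitpid(child, &status, 0) == child && WIFEXITED(status))
                result = WEXITSTATUS(status);
        }
    }
    /* Release the IPC objects so they do not outlive the demo. */
    if (buffer != (void *)-1) shmdt(buffer);
    if (shmid >= 0) shmctl(shmid, IPC_RMID, NULL);
    if (semid >= 0) semctl(semid, 0, IPC_RMID);
    return result;
}
```

Unlike polling a sentinel value in the buffer, the waiting process sleeps in the kernel and is woken only after the communicating process has explicitly declared the whole buffer ready. With several compute processes per node, the communicating process would post the semaphore once per waiter.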

Resources