if (mpi_rank .eq. master) then
   call mpi_bcast(a, 1, mpi_integer, mpi_rank, mpi_comm_world, ierr)
endif
The above will not work, since not all processes actually call mpi_bcast. Does that mean we can only use send/receive in this case?
In this case, only the master process knows that it is the master; the others do not.
I have tried many approaches, but this seems difficult to me: only the master knows that it is the master, and it needs to share some value with all the other processes.
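For what it's worth, a minimal sketch of the send/receive fallback (in C for brevity; i_am_master, my_rank, and nprocs are placeholder variables): only the master knows its role, so it sends to every other rank, while the others receive from MPI_ANY_SOURCE because they do not know who the master is.
/* Sketch only: the master loops over explicit sends; everyone else posts a
   receive from MPI_ANY_SOURCE since nobody else knows the master's rank. */
if (i_am_master) {                      /* flag known only to the master */
    for (int r = 0; r < nprocs; ++r)
        if (r != my_rank)
            MPI_Send(&a, 1, MPI_INT, r, 0, MPI_COMM_WORLD);
} else {
    MPI_Recv(&a, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}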
Related
I'd like the MPI function MPI_Sendrecv() to run on the GPU. Normally I use something like:
#pragma acc host_data use_device(send_buf, recv_buf)
{
    MPI_Sendrecv(send_buf, N, MPI_DOUBLE, proc[0], 0,
                 recv_buf, N, MPI_DOUBLE, proc[0], 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
And it works fine. However, now I call MPI_Sendrecv() inside a loop. If I try to accelerate this loop (with #pragma acc parallel loop), or even to accelerate the whole routine (#pragma acc routine) where the loop and the MPI call are located, I get an error:
64, Accelerator restriction: loop contains unsupported statement type
78, Accelerator restriction: unsupported statement type: opcode=ACCHOSTDATA
How can I make the call run on the device if, as in this case, the call is inside an accelerated region?
An alternative might be to not accelerate the routine and the loop, and use #pragma acc host_data use_device(send_buf, recv_buf) alone, but then the goal of having everything on the GPU would be lost.
EDIT
I removed the #pragma. Anyway, the application now runs hundreds of times slower and I cannot figure out why.
I'm using nsight-sys to check. Do you have any idea why MPI_Sendrecv is slowing down the app? Now the whole routine where it's called runs on the host. If I move the mouse pointer over the NVTX (MPI) section, it prints "ranges on this row have been projected from the CPU on the GPU". What does this mean?
Sorry if this is not clear, but I lack experience with Nsight and I don't know how to analyze the results properly. If you need more details I'm happy to give them to you.
However, it seems weird to me that the MPI calls appear in the GPU section.
You can't make MPI calls from within device code.
Also, "host_data" says to use device pointers within host code, so it can't be used within device code. Device pointers are used by default in device code, hence there's no need for the "host_data" construct there.
Questions after edit:
Do you have any idea why MPI_Sendrecv is slowing down the app?
Sorry, no idea. I don't know what you're comparing against, or anything about your app, so it's hard for me to tell. Though MPI_Sendrecv is a blocking call, so putting it in a loop will cause each send/receive to wait on the previous one before proceeding. Are you able to rewrite the code to use MPI_Isend and MPI_Irecv instead?
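As a hedged sketch of that suggestion (assuming CUDA-aware MPI, buffers already present on the device, and a placeholder NUM_NEIGHBORS for the loop count; send_buf, recv_buf, N, and proc[] are taken from the question): keep the loop on the host, post the nonblocking transfers through host_data, and complete them once at the end.
/* Sketch only: host-side loop posting nonblocking transfers with device
   pointers, completed by a single MPI_Waitall. */
MPI_Request reqs[2 * NUM_NEIGHBORS];
int nreq = 0;

for (int i = 0; i < NUM_NEIGHBORS; ++i) {
    #pragma acc host_data use_device(send_buf, recv_buf)
    {
        MPI_Isend(send_buf + i * N, N, MPI_DOUBLE, proc[i], 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
        MPI_Irecv(recv_buf + i * N, N, MPI_DOUBLE, proc[i], 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
    }
}
MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);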
"ranges on this row have been projected from the CPU on the GPU". What
does this mean?
I haven't seen this before, but I presume it just means that even though these are host calls, the NVTX instrumentation is able to project them onto the GPU timeline, most likely so that the CUDA-aware MPI device-to-device data transfers can be correlated with the MPI region.
I know that MPI_SENDRECV allows one to avoid the deadlock problems that can arise with the classic MPI_SEND and MPI_RECV functions.
I would like to know if MPI_SENDRECV(sent_to_process_1, receive_from_process_0) is equivalent to:
MPI_ISEND(sent_to_process_1, request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request1)
MPI_WAIT(request2)
with the asynchronous MPI_ISEND and MPI_IRECV functions?
From what I have seen, MPI_ISEND and MPI_IRECV create a fork (i.e. two processes). So if I follow this logic, the first call of MPI_ISEND generates two processes. One does the communication and the other calls MPI_IRECV, which itself forks into two processes.
But once the communication of the first MPI_ISEND is finished, does the second process call MPI_IRECV again? With this logic, the above equivalence doesn't seem to be valid...
Maybe I should change to this:
MPI_ISEND(sent_to_process_1, request1)
MPI_WAIT(request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request2)
But I think that this could also create deadlocks.
Could anyone give me another solution, using MPI_ISEND, MPI_IRECV and MPI_WAIT, that achieves the same behaviour as MPI_SENDRECV?
There are some dangerous lines of thought in the question and the other answers. When you start a non-blocking MPI operation, the MPI library doesn't create a new process/thread/etc. You're thinking of something more like a parallel region in OpenMP, I believe, where new threads/tasks are created to do some work.
In MPI, starting a non-blocking operation is like telling the MPI library that you have some things that you'd like to get done whenever MPI gets a chance to do them. There are lots of equally valid options for when they are actually completed:
It could be that they all get done later when you call a blocking completion function (like MPI_WAIT or MPI_WAITALL). These functions guarantee that when the blocking completion call is done, all of the requests that you passed in as arguments are finished (in your case, the MPI_ISEND and the MPI_IRECV). Regardless of when the operations actually take place (see next few bullets), you as an application can't consider them done until they are actually marked as completed by a function like MPI_WAIT or MPI_TEST.
The operations could get done "in the background" during another MPI operation. For instance, if you do something like the code below:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
The MPI_ISEND and the MPI_IRECV would probably actually do the data transfers in the background during the MPI_BARRIER. This is because as an application, you are transferring "control" of your application to the MPI library during the MPI_BARRIER call. This lets the library make progress on any ongoing MPI operation that it wants. Most likely, when the MPI_BARRIER is complete, so are most other things that finished first.
Some MPI libraries allow you to specify that you want a "progress thread". This tells the MPI library to start up another thread (note that thread != process) in the background that will actually do the MPI operations for you while your application continues in the main thread.
Remember that all of these in the end require that you actually call MPI_WAIT or MPI_TEST or some other function like it to ensure that your operation is actually complete, but none of these spawn new threads or processes to do the work for you when you call your nonblocking functions. Those really just act like you stick them on a list of things to do (which in reality, is how most MPI libraries implement them).
The best way to think of how MPI_SENDRECV is implemented is to do two non-blocking calls with one completion function:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
How I usually do this on node i communicating with node i+1:
call mpi_isend(send_to_process_iPlus1, n, mpi_double_precision, i+1, tag, mpi_comm_world, requests(1), ierr)
call mpi_irecv(recv_from_process_iPlus1, n, mpi_double_precision, i+1, tag, mpi_comm_world, requests(2), ierr)
...
call mpi_waitall(2, requests, mpi_statuses_ignore, ierr)
You can see how ordering your calls this way with non-blocking communication allows you (during the ... above) to perform any computation that does not rely on the send/recv buffers while the communication is in progress. Overlapping computation with communication is often crucial for maximizing performance.
mpi_sendrecv, on the other hand (while avoiding any deadlock issues), is still a blocking operation. Thus, your program must remain in that routine for the entire duration of the send/recv.
Final point: you can initiate more than two requests and wait on all of them in the same way as with two requests. For instance, it's quite easy to start communication with node i-1 as well and wait on all four of the requests. With mpi_sendrecv you must always have a paired send and receive; what if you only want to send?
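For instance, a hedged C sketch of that four-request exchange with both neighbors (rank, size, n, and the four buffers are placeholders; boundary ranks talk to MPI_PROC_NULL so the same code works everywhere):
int up   = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;
int down = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;

MPI_Request reqs[4];
MPI_Isend(send_up,   n, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(recv_up,   n, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[1]);
MPI_Isend(send_down, n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[2]);
MPI_Irecv(recv_down, n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[3]);
/* ... computation that does not touch the four buffers ... */
MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);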
My pseudo code has something like this:
//Some work
MPI_Send(....)
MPI_Irecv(&yesno, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &request); /* last argument is an MPI_Request, not an MPI_Status */
//Do some WORK1 (background work) until the MPI_Irecv completes, and when the receive completes do WORK2.
--WORK1
--WORK2
Where do I put WORK1 and WORK2, and how do I redirect control to WORK2 as soon as the receive completes?
MORE INFO:
I am trying to implement a master-slave program where the master issues work to the slaves. The master continues working after it has issued work to, say, slave 1, and needs to be interrupted by slave 1 when that work is done. When this interrupt happens, some code needs to be executed by the master; then the master needs to continue from where it left off before the interrupt.
If WORK1 is a loop, then use MPI_Test() at the start or end of each iteration to see if the request posted by the MPI_Irecv has finished. If it has, then do WORK2, post a new MPI_Irecv, and continue with the next iteration of WORK1.
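A hedged sketch of that polling pattern (reusing yesno from the question, and assuming the master listens to any slave; more_work1_left(), work1_step(), and work2() are placeholder functions):
/* Poll the outstanding receive once per WORK1 iteration. */
MPI_Request req;
MPI_Irecv(&yesno, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &req);

while (more_work1_left()) {
    work1_step();                        /* one chunk of WORK1 */

    int done = 0;
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    if (done) {
        work2();                         /* react to the slave's message */
        MPI_Irecv(&yesno, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &req); /* re-post the receive */
    }
}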
If WORK1 is a sequential task, then you still have two options. You could simply pepper it with MPI_Test() calls as described above, but that's not very elegant. Instead, you might want to spawn a new thread in which you post your initial MPI_Irecv and then simply use MPI_Wait() to wait for it to complete while the original thread is doing WORK1. If your MPI implementation is halfway decent, MPI_Wait() should only block the thread that calls it, so WORK1 will proceed just fine.
I am new to MPI and I often see code like the following in MPI programs:
if (rank == 0) {
    MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
}
else {
    MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
}
It seems that the rank determines which process is sending and which is receiving, but how is the rank of a process determined when MPI_Comm_rank(MPI_COMM_WORLD, &rank); is called? Is it related to the command line arguments for mpirun?
For example:
mpirun -n 2 -host localhost1,localhost2 ./a.out
(localhost1 is rank 0 and localhost2 is rank 1?)
How is the program going to determine who has rank 0 and who has rank 1?
Is there a way for me to specify something such that say localhost1 is sending and localhost2 is receiving?
Usually, if you're trying to think about communication in your MPI program in terms of physical processors/machines, you're not going about it the right way. Most of the time, it doesn't matter which actual machine each rank is mapped to. All that matters is that you call mpiexec or mpirun (they're usually the same thing), something inside your MPI implementation starts up n processes, which could be located locally, remotely, or some combination of the two, and assigns them ranks. Theoretically those ranks could be assigned arbitrarily, though it's usually done in some predictable way (often something like round-robin over the entire group of hosts that are available).
Inside your program, it usually makes very little difference whether you're running rank 0 on host0 or host1. The important thing is that you are doing specific work on rank 0 that requires communication from rank 1.
That being said, there are rarer cases where it might be important which rank is mapped to which processor. Examples might be:
If you have GPUs on some nodes and not others and you need certain ranks to be able to control a GPU.
You need certain processes to be mapped to the same physical node to optimize communication patterns for things like shared memory.
You have data staged on certain hosts that needs to map to specific ranks.
These are all advanced examples. Usually if you're in one of these situations, you've been using MPI long enough to know what you need to do here, so I'm betting that you're probably not in this scenario.
Just remember, it doesn't really matter where my ranks are. It just matters that I have the right number of them.
Disclaimer: All of that being said, it does matter that you launch the correct number of processes. What I mean by that is, if you have 2 hosts that each have a single quad-core processor, it doesn't make sense to start a job with 16 ranks. You'll end up spending all of your computational time context switching your processes in and out. Try not to have more ranks than you have compute cores.
When you call mpirun, there is a process manager which determines the node/rank attribution of your processes. I suggest you have a look at Controlling Process Placement with the Intel MPI Library, and for Open MPI
check the -npernode and -pernode options.
Use this Hello world test to check if this is what you want.
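For reference (not necessarily the exact test linked above), a minimal hello-world sketch that prints each rank, the total size, and the host it landed on might look like:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */
    MPI_Get_processor_name(name, &namelen);  /* host this rank landed on */

    printf("rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
Running it with something like mpirun -n 2 -host localhost1,localhost2 ./a.out (or with -npernode 1) shows which host each rank was placed on; exact flag spellings vary between MPI implementations and versions.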
You can also simply change the condition to (rank == 1) if you want to swap which process does what.
I am interested in implementing a sort of event-driven dispatch queue using MPI (Message Passing Interface). The basic problem I want to solve is this: I have a master process which inserts jobs into a global queue, and each available slave process retrieves the next job in the queue if there is one.
From what I've read about MPI, it seems that sending and receiving processes must be in agreement about when they send and receive. That is, suppose one process sends a message but the other process does not know it needs to receive it, or vice versa; then everything deadlocks. Is there any way to make each process a bit more independent?
You can do that as follows:
Declare one master node (rank 0) that is going to distribute the tasks. On this node the pseudocode is:
int sendTo
for task in tasks:
    MPI_Recv(...sendTo, MPI_INT, MPI_ANY_SOURCE,...)
    MPI_Send(job,..., receiver: sendTo)
for node in nodes:
    MPI_Recv(...sendTo, MPI_INT, MPI_ANY_SOURCE,...)
    MPI_Send(job_null,..., receiver: sendTo)
In the slave nodes the code would be:
while (true)
    MPI_Send(myNodenum to 0, MPI_INT)
    MPI_Recv(job from 0)
    if (job == job_null)
        break
    else
        execute job
I think this should work.
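For concreteness, here is a hedged C sketch of that scheme, assuming jobs can be described by plain ints; JOB_NULL, next_job(), and do_job() are placeholders standing in for a real job description and real work.
#include <mpi.h>

#define JOB_NULL (-1)                       /* placeholder "no more work" sentinel */

static int  next_job(int j) { return j; }   /* placeholder: build job number j */
static void do_job(int job) { (void)job; }  /* placeholder: execute a job */

static void master(int nworkers, int njobs) {
    int worker, job;
    /* hand out one job per request until the queue is empty */
    for (int j = 0; j < njobs; ++j) {
        MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);        /* a slave announces itself */
        job = next_job(j);
        MPI_Send(&job, 1, MPI_INT, worker, 0, MPI_COMM_WORLD);
    }
    /* then answer every remaining request with the stop sentinel */
    for (int w = 0; w < nworkers; ++w) {
        MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        job = JOB_NULL;
        MPI_Send(&job, 1, MPI_INT, worker, 0, MPI_COMM_WORLD);
    }
}

static void worker(int rank) {
    int job;
    while (1) {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* ask for work */
        MPI_Recv(&job, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (job == JOB_NULL)
            break;                                           /* queue drained */
        do_job(job);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        master(size - 1, 100);   /* 100 jobs, arbitrary; assumes at least one slave */
    else
        worker(rank);
    MPI_Finalize();
    return 0;
}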
You might want to look at Charm++; it's not explicitly an event-driven framework, but it does provide an abstraction mechanism for performing tasks and distributing those tasks dynamically.