I have the code snippet below in C++ which calculates pi using the classic Monte Carlo technique.
srand48((unsigned)time(0) + my_rank);
for(int i = 0; i < part_points; i++)
{
    double x = drand48();
    double y = drand48();
    if( (pow(x,2)+pow(y,2)) < 1){ ++count; }
}
MPI_Reduce(&count, &total_hits, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
if(my_rank == root)
{
    pi = 4*(total_hits/(double)total_points);
    cout << "Calculated pi: " << pi << " in " << end_time-start_time << endl;
}
I am just wondering if the MPI_Barrier call is necessary. Does MPI_Reduce make sure that the body of the if statement won't be executed before the reduce operation is completely finished? Hope I was clear. Thanks
Yes, all collective communication calls (Reduce, Scatter, Gather, etc) are blocking. There's no need for the barrier.
Blocking yes, a barrier, no. It is very important to call MPI_Barrier() alongside MPI_Reduce() when executing in a tight loop. Without the MPI_Barrier() call, the receive buffers of the reducing process will eventually run full and the application will abort, because the other participating processes only need to send and continue, while the reducing process has to receive and reduce.
The above code does not need the barrier if my_rank == root == 0 (which is probably the case). Anyway... MPI_Reduce() does not perform a barrier or any other form of synchronization. AFAIK even MPI_Allreduce() isn't guaranteed to synchronize (at least not by the MPI standard).
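A minimal sketch of what that looks like in a tight loop; the loop body and variable names are illustrative assumptions, not code from the question:
/* Sketch: a reduction in a tight loop, with a barrier paired with each
   MPI_Reduce so the non-root ranks cannot run arbitrarily far ahead of
   the reducing root. 'num_iters' and 'compute_local_value' are
   hypothetical placeholders. */
for (int iter = 0; iter < num_iters; iter++) {
    double local  = compute_local_value(iter);  /* hypothetical helper */
    double global = 0.0;

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);   /* throttles the faster, non-root ranks */
}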
Ask yourself if that barrier is needed. Suppose you are not the root: you call Reduce, which sends off your data. Is there any reason to sit and wait until the root has the result? Answer: no, so you don't need the barrier.
Suppose you're the root. You issue the reduce call. Semantically you are now forced to sit and wait until the result is fully assembled. So why the barrier? Again, no barrier call is needed.
In general, you almost never need a barrier because you don't care about temporal synchronization. The semantics guarantee that your local state is correct after the reduce call.
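For reference, here is a minimal self-contained sketch of the same computation without the barrier. The point counts and variable declarations are assumptions, since the question does not show them, and the counters are reduced as MPI_LONG rather than MPI_DOUBLE so the datatype matches the buffers:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    int my_rank, world_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const long total_points = 10000000;              /* illustrative value */
    const long part_points  = total_points / world_size;

    srand48((unsigned)time(0) + my_rank);
    long count = 0, total_hits = 0;
    for (long i = 0; i < part_points; i++) {
        double x = drand48();
        double y = drand48();
        if (x * x + y * y < 1.0)
            ++count;
    }

    /* Counts are integers here, so MPI_LONG matches the buffer type. */
    MPI_Reduce(&count, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    /* No MPI_Barrier needed: on the root, MPI_Reduce only returns once
       total_hits already holds the fully reduced value. */
    if (my_rank == 0) {
        double pi = 4.0 * (double)total_hits / (double)(part_points * world_size);
        printf("Calculated pi: %f\n", pi);
    }

    MPI_Finalize();
    return 0;
}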
According to the specs:
If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier.
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.
In my understanding this means that in any kernel:
if(0 == (get_local_id(0)%2)){
    //block a
    work_group_barrier(CLK_GLOBAL_MEM_FENCE);
    //block a part 2
}else{
    //block b
    work_group_barrier(CLK_GLOBAL_MEM_FENCE);
    //block b part 2
}
//common operations
Should one worker reach //block a, every other worker needs to reach it too.
By this logic it is not possible to correctly synchronize every odd local worker with every even one ( blocks a and b ) to be run at the same time.
Is this understanding correct?
What would be a good synchronisation strategy for a situation like this,
where every other worker needs to execute different logic, but by //block a part 2 and //block b part 2 the workers inside one work group should be synced up?
In the actual use case there are more than two phases, and I'd like every phase to be synchronised.
Would a logic like this be an acceptable solution?
__local int number_finished = 0;
if(0 == (get_local_id(0)%2)){
    //block a
    atomic_add(&number_finished, 1);
    while(number_finished < get_local_size(0));
    //block a part 2
}else{
    //block b
    atomic_add(&number_finished, 1);
    while(number_finished < get_local_size(0));
    //block b part 2
}
work_group_barrier(CLK_GLOBAL_MEM_FENCE);
//common operations
By this logic it is not possible to correctly synchronize every odd local worker with every even one ( blocks a and b ) to be run at the same time.
Is this understanding correct?
It is not possible. When using a work_group_barrier, you must ensure all work-items in the work group reach it. If there are paths within your code that don't reach the barrier, it may lead to undefined behavior.
What would be a good synchronization strategy for a situation like this ?
Usually, barriers are used outside of any conditional sections / loop:
if (cond)
{
    // all work items perform work that needs to be synchronized later,
    // such as local memory writes
}
barrier(CLK_LOCAL_MEM_FENCE); // note the CLK_LOCAL_MEM_FENCE flag
// now every work item can read the data the other work items wrote before the barrier.
for (...)
{
}
barrier(CLK_LOCAL_MEM_FENCE);
Would a logic like this be an acceptable solution?
It might work, but a barrier outside the conditional section would be more efficient than a busy wait.
if(0 == (get_local_id(0)%2)){
    //block a
}else{
    //block b
}
barrier(CLK_LOCAL_MEM_FENCE); // ensures number_finished == get_local_size(0)
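For the multi-phase case mentioned in the question, the same pattern extends naturally: keep the divergent work inside the conditionals and put a single barrier between phases, outside any conditional. A sketch, with placeholder phase bodies and a hypothetical kernel name:
// Sketch of a multi-phase kernel: each phase may branch on the work-item id,
// but every work-item reaches the single barrier that ends the phase.
__kernel void phased_example(__global float *data)
{
    const size_t lid = get_local_id(0);

    // Phase 1: odd and even work-items do different work
    if (lid % 2 == 0) {
        // even-item work for phase 1 (placeholder)
    } else {
        // odd-item work for phase 1 (placeholder)
    }
    barrier(CLK_LOCAL_MEM_FENCE);   // all work-items are synced before phase 2

    // Phase 2: again divergent work, again one barrier afterwards
    if (lid % 2 == 0) {
        // even-item work for phase 2 (placeholder)
    } else {
        // odd-item work for phase 2 (placeholder)
    }
    barrier(CLK_LOCAL_MEM_FENCE);   // all work-items are synced before the common part

    // common operations
}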
The MPI_Irecv and MPI_Isend operations return an MPI_Request that can later be marked as cancelled with MPI_Cancel. Is there a similar mechanism for the blocking MPI_Probe and MPI_Mprobe?
The context of the question is the latest implementation of Boost.MPI request handlers using Probe.
EDIT - Here is an example of how a hypothetical MPI_Probecancel could be used:
#include <mpi.h>
#include <chrono>
#include <future>
#include <thread>   // for std::this_thread::sleep_for
using namespace std::literals::chrono_literals;

// Executed in a thread; MPI_Probe as a type and MPI_Probecancel are hypothetical
void async_cancel(MPI_Probe *probe)
{
    std::this_thread::sleep_for(1s);
    int res = MPI_Probecancel(probe);
}

int main(int argc, char* argv[])
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        // A handle to the probe (similar to a request)
        MPI_Probe probe;

        // Start a thread
        // `probe` will be filled with the next call, pretty ugly
        // Ideally, this should be done in two steps like MPI_Irecv, MPI_Wait
        auto res = std::async(std::launch::async, &async_cancel, &probe);

        MPI_Message message;
        MPI_Status status;
        // Hypothetical variant of MPI_Mprobe that also takes the probe handle
        MPI_Mprobe(1, 123, MPI_COMM_WORLD, &message, &status, &probe);

        if (!probe.cancelled)
        {
            int buffer;
            MPI_Mrecv(&buffer, 1, MPI_INT, &message, &status);
        }
    }
    else
        std::this_thread::sleep_for(2s);

    MPI_Finalize();
    return 0;
}
First, the premise / nomenclature of your question is wrong. It is the nonblocking calls MPI_Irecv and MPI_Isend which return a request object that you may cancel, and for these calls you cancel the local operation.
MPI_Probe and MPI_Mprobe are in fact blocking. You cannot cancel these operations, in the sense that control flow only leaves them once a message is available.
On the other hand, MPI_Iprobe and MPI_Improbe are nonblocking, meaning they always complete immediately, telling you whether a message is available.
For neither kind of probe call is there any local state left after completion, so there is nothing that could be cancelled locally after the functions return.
That said, if a probe tells you that a message is available, you should definitely receive it. Otherwise a send operation may block and you would leak resources on all sides. But that's just a receive operation.
Edit: Regarding your idea to cancel an ongoing local MPI_Probe from a concurrent thread: this is not directly supported.
Theoretically, you could emulate this on a conforming implementation with MPI_THREAD_MULTIPLE by running the probe on MPI_ANY_SOURCE and sending a message to the same rank from the other thread. That, of course, has the consequence that you must then probe for messages from any incoming rank.
Realistically, if you have to do this, you would probably just use a loop like while(!cancelled) MPI_Iprobe();.
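A hedged sketch of that polling pattern; the cancellation flag, the helper name, and the back-off comment are assumptions, and all MPI calls stay on the polling thread:
#include <mpi.h>
#include <stdatomic.h>

/* Set from another thread (or a signal handler) to abandon the probe loop.
   The flag and the helper name are illustrative. */
static atomic_int cancelled;

/* Polls with MPI_Iprobe instead of blocking in MPI_Probe, so the loop can be
   abandoned once 'cancelled' becomes non-zero. Returns 1 if a matching
   message is available, 0 if the wait was cancelled. */
static int probe_or_cancel(int source, int tag, MPI_Comm comm, MPI_Status *status)
{
    int flag = 0;
    while (!atomic_load(&cancelled)) {
        MPI_Iprobe(source, tag, comm, &flag, status);
        if (flag)
            return 1;
        /* optionally sleep or back off here to avoid burning a core */
    }
    return 0;
}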
That said, I would again question the scenario: How would another thread on your rank suddenly know to cancel a local MPI_Probe operation? It would probably have to be based on information received from a remote rank - in which case that would be covered by actually being able to receive information from it, i.e. the actual Probe would complete.
Maybe for some high-level abstraction it makes sense to offer a local cancel, but in an actual practical situation I believe you could design an idiomatic pattern without needing this.
I have some MPI processes which should write to the same file after they finish their task. The problem is that the length of the results is variable and I cannot assume that each process will write at a certain offset.
A possible approach would be to open the file in every process, to write the output at the end and then to close the file. But this way a race condition could occur.
How can I open and write to that file so that the result would be the expected one?
You might think you want the shared file or ordered mode routines. But these routines get little use and so are not well optimized (so they get little use... quite the cycle...)
I hope you intend to do this collectively. Then you can use MPI_SCAN to collect the offsets and call MPI_FILE_WRITE_AT_ALL to have the MPI library optimize the I/O for you.
(If you are doing this independently, then you will have to do something like... master slave? passing a token? fall back to the shared file pointer routines even though I hate them?)
Here's an approach for a good collective method:
incr = (count*datatype_size);

/* you can skip this call and assume 'offset' is zero if you don't care
   about the contents of the file */
MPI_File_get_position(mpi_fh, &offset);

MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT,
         MPI_SUM, MPI_COMM_WORLD);
new_offset -= incr;
new_offset += offset;

ret = MPI_File_write_at_all(mpi_fh, new_offset, buf, count,
                            datatype, status);
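Putting it together, here is a hedged sketch of a small helper built around that approach; the function name, file mode, and the use of MPI_STATUS_IGNORE are illustrative assumptions:
#include <mpi.h>

/* Sketch: every rank appends a variable-length buffer to one shared file.
   'filename' and the access mode are illustrative; 'buf', 'count', and
   'datatype' describe whatever this rank wants to write. */
void append_collectively(const char *filename, const void *buf,
                         int count, MPI_Datatype datatype)
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, filename,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    int type_size;
    MPI_Type_size(datatype, &type_size);
    long long incr = (long long)count * type_size;

    /* Inclusive prefix sum of the write sizes, then subtract our own
       contribution to get the offset where this rank starts writing. */
    long long new_offset = 0;
    MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT, MPI_SUM, MPI_COMM_WORLD);
    new_offset -= incr;

    MPI_File_write_at_all(fh, (MPI_Offset)new_offset, buf, count,
                          datatype, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}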
I'm trying to implement MPI_Bcast, and I'm planning to do that with MPI_Send and MPI_Recv, but it seems I cannot send a message to myself?
The code is as follows:
void My_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
    int comm_rank, comm_size, i;
    MPI_Comm_rank(comm, &comm_rank);
    MPI_Comm_size(comm, &comm_size);
    if(comm_rank=root){
        for(i = 0; i < comm_size; i++){
            MPI_Send(buffer, count, datatype, i, 0, comm);
        }
    }
    MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
}
Any suggestions on that? Or should I never send a message to myself and just do a memory copy?
Your program is erroneous on multiple levels. First of all, there is an error in the conditional:
if(comm_rank=root){
This does not compare comm_rank to root but rather assigns root to comm_rank, so the loop would only execute if root is non-zero, and besides it would be executed by all ranks.
Second, the root process does not need to send data to itself since the data is already there. Even if you'd like to send and receive anyway, you should notice that both MPI_Send and MPI_Recv use the same buffer space, which is not correct. Some MPI implementations use direct memory copy for self-interaction, i.e. the library might use memcpy() to transfer the message. Using memcpy() with overlapping buffers (including using the same buffer) leads to undefined behaviour.
The proper way to implement linear broadcast is:
void My_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
{
    int comm_rank, comm_size, i;

    MPI_Comm_rank(comm, &comm_rank);
    MPI_Comm_size(comm, &comm_size);

    if (comm_rank == root)
    {
        for (i = 0; i < comm_size; i++)
        {
            if (i != comm_rank)
                MPI_Send(buffer, count, datatype, i, 0, comm);
        }
    }
    else
        MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
}
The usual ways for a process to talk to itself without deadlocking are:
using a combination of MPI_Isend and MPI_Recv or a combination of MPI_Send and MPI_Irecv;
using buffered send MPI_Bsend;
using MPI_Sendrecv or MPI_Sendrecv_replace.
The combination of MPI_Irecv and MPI_Send works well in cases where multiple sends are done in a loop like yours. For example:
MPI_Request req;
// Start a non-blocking receive
MPI_Irecv(buff2, count, datatype, root, 0, comm, &req);
// Send to everyone
for (i = 0; i < comm_size; i++)
    MPI_Send(buff1, count, datatype, i, 0, comm);
// Complete the non-blocking receive
MPI_Wait(&req, MPI_STATUS_IGNORE);
Note the use of separate buffers for send and receive. Probably the only point-to-point MPI communication call that allows the same buffer to be used both for sending and receiving is MPI_Sendrecv_replace, as well as the in-place modes of the collective MPI calls. But these are implemented internally in such a way that at no time is the same memory area used both for sending and receiving.
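For completeness, a hedged sketch of the MPI_Sendrecv option from the list above, using distinct send and receive buffers; comm and comm_rank are as in the code above, and the buffer values are illustrative:
/* Sketch: a rank sends an integer to itself without deadlocking by using
   MPI_Sendrecv with distinct send and receive buffers. */
int sendbuf = 42, recvbuf = 0;
MPI_Sendrecv(&sendbuf, 1, MPI_INT, comm_rank, 0,   /* send to myself      */
             &recvbuf, 1, MPI_INT, comm_rank, 0,   /* receive from myself */
             comm, MPI_STATUS_IGNORE);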
This is an incorrect program. You cannot rely on doing a blocking MPI_Send to yourself, because it may block. MPI only guarantees that MPI_Send returns once the send buffer is available for reuse; in some cases this means it will block until the message has been received by the destination. In your program, the destination may never call MPI_Recv, because it is still trying to send.
Now in your My_MPI_Bcast example, the root process already has the data. Why send or copy it at all?
The MPI_Send / MPI_Recv block on the root node can be a deadlock.
Converting to MPI_Isend could be used to resolve the issue. However, there may be issues because the send buffer is being reused and root is VERY likely to reach the MPI_Recv "early" and may then alter that buffer before it is transmitted to the other ranks. This is especially likely on large jobs. Also, if this routine is ever called from Fortran, there could be issues with the buffer being corrupted on each MPI_Send call.
The use of MPI_Sendrecv could be limited to the root process. That would allow the MPI_Send's to all non-root ranks to "complete" (e.g. the send buffer can be safely altered) before the root process enters a dedicated MPI_Sendrecv. The for loop would simply begin with "1" instead of "0", and the MPI_Sendrecv call would be added at the bottom of that loop. (Why is a better question, since the data is in "buffer" and is going to "buffer".)
However, all this begs the question: why are you doing this at all? If this is a simple "academic exercise" in writing a collective with point-to-point calls, so be it. But your approach is naive at best, and this overall strategy would be beaten by any of the MPI_Bcast algorithms in any reasonably implemented MPI.
I think you should call MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE); only on the non-root ranks, otherwise it will probably hang
I have a GPU with CC 3.0, so it should support 16 concurrent kernels. I am starting 10 kernels by looping through clEnqueueNDRangeKernel for 10 times. How do I get to know that the kernels are executing concurrently?
One way which I have thought of is to get the time before and after the NDRangeKernel statement. I might have to use events to ensure the execution of the kernel has completed. But I still feel that the loop will start the kernels sequentially. Can someone help me out?
To determine if your kernel executions overlap, you have to profile them. This requires several steps:
1. Creating the command-queues
Profiling data is only collected if the command-queue is created with the property CL_QUEUE_PROFILING_ENABLE:
cl_command_queue queues[10];
for (int i = 0; i < 10; ++i) {
    queues[i] = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE,
                                     &errcode);
}
2. Making sure all kernels start at the same time
You are right in your assumption that the CPU queues the kernels sequentially. However, you can create a single user event and add it to the wait list for all kernels. This causes the kernels not to start running before the user event is completed:
// Create the user event
cl_event user_event = clCreateUserEvent(context, &errcode);
// Reserve space for kernel events
cl_event kernel_events[10];
// Enqueue kernels
for (int i = 0; i < 10; ++i) {
    clEnqueueNDRangeKernel(queues[i], kernel, work_dim, global_work_offset,
                           global_work_size, NULL /* local work size */,
                           1, &user_event, &kernel_events[i]);
}
// Start all kernels by completing the user event
clSetUserEventStatus(user_event, CL_COMPLETE);
3. Obtain profiling times
Finally, we can collect the timing information for the kernel events:
// Block until all kernels have run to completion
clWaitForEvents(10, kernel_events);

for (int i = 0; i < 10; ++i) {
    cl_ulong start;
    clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);

    cl_ulong end;
    clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    printf("Event %d: start=%llu, end=%llu\n", i, start, end);
}
4. Analyzing the output
Now that you have the start and end times of all kernel runs, you can check for overlaps (either by hand or programmatically). The output units are nanoseconds. Note however that the device timer is only accurate to a certain resolution. You can query the resolution using:
size_t resolution;
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                sizeof(resolution), &resolution, NULL);
FWIW, I tried this on an NVIDIA device with CC 2.0 (which should support concurrent kernels) and observed that the kernels were run sequentially.
You can avoid all the boilerplate code suggested in the other answers (which are correct by the way) by using C Framework for OpenCL, which simplifies this task a lot, and gives you detailed information about OpenCL events (kernel execution, data transfers, etc), including a table and a plot dedicated to overlapped execution of said events.
I developed this library in order to, among other things, simplify the process described in the other answers. You can see a basic usage example here.
Yes, as you suggest, try to use the events, and analyze all the QUEUED, SUBMIT, START, END values. These should be absolute values in "device time", and you may be able to see if processing (START to END) overlaps for the different kernels.
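A hedged sketch of such a query for a single kernel event evt, assumed to come from clEnqueueNDRangeKernel on a queue created with CL_QUEUE_PROFILING_ENABLE:
// Query all four profiling timestamps of one kernel event (values are in
// nanoseconds of device time).
cl_ulong queued, submitted, started, ended;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED,
                        sizeof(queued), &queued, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT,
                        sizeof(submitted), &submitted, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(started), &started, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(ended), &ended, NULL);
printf("queued=%llu submit=%llu start=%llu end=%llu\n",
       (unsigned long long)queued, (unsigned long long)submitted,
       (unsigned long long)started, (unsigned long long)ended);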