Does the following code invoke all 4 kernels in parallel, and will waiting on the 4 events ensure that they have all completed?
event1 = event2 = event3 = event4 = 0;
printf("sending enqueue task..\n");
clEnqueueTask(command_queue, calculate1, 0, NULL, &event1);
clEnqueueTask(command_queue, calculate2, 0, NULL, &event2);
clEnqueueTask(command_queue, calculate3, 0, NULL, &event3);
clEnqueueTask(command_queue, calculate4, 0, NULL, &event4);
printf("waiting after enquing task..\n");
clWaitForEvents(1, &event1);
clWaitForEvents(1, &event2);
clWaitForEvents(1, &event3);
clWaitForEvents(1, &event4);
Or is this the right way to invoke all the kernels in parallel? Is it even possible? What device info do I need to check to confirm this?
These tasks might execute in parallel if you are using an out-of-order command queue and the device supports executing multiple kernels in parallel. Unfortunately, there isn't a device info query you can perform to verify whether the device has this capability, so you'll have to check the start/end times of the resulting events if you want to confirm that this actually happened. An alternative way of achieving parallel kernel execution is to use multiple command queues (as discussed in the comments). Note that this kind of coarse-grained task parallelism won't execute particularly efficiently on massively parallel architectures such as GPUs.
Rather than waiting for each event individually, you could just call clFinish(command_queue) to wait for all commands to complete. You might also want to experiment with calling clFlush(command_queue) immediately after enqueuing all the tasks to ensure that they are all submitted to the device.
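For reference, here is a minimal sketch (not the original poster's code) of the profiling approach: create a queue with out-of-order execution and profiling enabled, enqueue the tasks, and compare the resulting start/end timestamps. It assumes that context, device, and the calculate1/calculate2 kernels already exist; error checking is omitted.
cl_int err;
cl_command_queue_properties props =
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE;
cl_command_queue queue = clCreateCommandQueue(context, device, props, &err);

cl_event events[2];
clEnqueueTask(queue, calculate1, 0, NULL, &events[0]);
clEnqueueTask(queue, calculate2, 0, NULL, &events[1]);
clFlush(queue);   /* submit the commands to the device */
clFinish(queue);  /* wait for everything to complete */

cl_ulong start[2], end[2];
for (int i = 0; i < 2; ++i) {
    clGetEventProfilingInfo(events[i], CL_PROFILING_COMMAND_START,
                            sizeof(start[i]), &start[i], NULL);
    clGetEventProfilingInfo(events[i], CL_PROFILING_COMMAND_END,
                            sizeof(end[i]), &end[i], NULL);
}
/* The two tasks overlapped if each started before the other ended. */
if (start[0] < end[1] && start[1] < end[0])
    printf("the two kernels overlapped\n");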
Related
As far as I know, MPI_BUFFER_ATTACH must be called by a process if it is going to do buffered communication. But does this include the standard MPI_SEND as well? We know that MPI_SEND may behave either as a synchronous send or as a buffered send.
You need to call MPI_Buffer_attach() only if you plan to perform (explicitly) buffered sends via MPI_Bsend().
If you only plan to MPI_Send() or MPI_Isend(), then you do not need to invoke MPI_Buffer_attach().
FWIW, buffered sends are error-prone and I strongly encourage you not to use them.
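That said, for completeness, here is a minimal, self-contained sketch (my illustration, not code from the question) showing that only MPI_Bsend needs the attached buffer, sized with MPI_Pack_size plus MPI_BSEND_OVERHEAD, while the plain MPI_Send at the end works without it.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int payload[20] = {0};
    if (rank == 0) {
        /* Size the attached buffer with MPI_Pack_size + MPI_BSEND_OVERHEAD. */
        int packed_size;
        MPI_Pack_size(20, MPI_INT, MPI_COMM_WORLD, &packed_size);
        int bufsize = packed_size + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        /* Explicitly buffered send: requires the attached buffer. */
        MPI_Bsend(payload, 20, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Detach blocks until the buffered message has left the buffer. */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);

        /* A standard send needs no attached buffer at all. */
        MPI_Send(payload, 20, MPI_INT, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(payload, 20, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(payload, 20, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}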
MPI_Buffer_attach
Attaches a user-provided buffer for sending
Synopsis
int MPI_Buffer_attach(void *buffer, int size)
Input Parameters
buffer
initial buffer address (choice)
size
buffer size, in bytes (integer)
Notes
The size given should be the sum of the sizes of all outstanding
Bsends that you intend to have, plus MPI_BSEND_OVERHEAD for each Bsend
that you do. For the purposes of calculating size, you should use
MPI_Pack_size. In other words, in the code
MPI_Buffer_attach( buffer, size );
MPI_Bsend( ..., count=20, datatype=type1, ... );
...
MPI_Bsend( ..., count=40, datatype=type2, ... );
the value of size in the MPI_Buffer_attach call should be greater than the value computed by
MPI_Pack_size( 20, type1, comm, &s1 );
MPI_Pack_size( 40, type2, comm, &s2 );
size = s1 + s2 + 2 * MPI_BSEND_OVERHEAD;
MPI_BSEND_OVERHEAD gives the maximum amount of space in the buffer that the BSEND routines may use for their own bookkeeping. This value is defined in mpi.h (for C) and mpif.h (for Fortran).
Thread and Interrupt Safety
The user is responsible for ensuring that multiple threads do not try to update the same MPI object from different threads. This routine should not be used from within a signal handler.
The MPI standard defined a thread-safe interface but this does not mean that all routines may be called without any thread locks. For example, two threads must not attempt to change the contents of the same MPI_Info object concurrently. The user is responsible in this case for using some mechanism, such as thread locks, to ensure that only one thread at a time makes use of this routine. Because the buffer for buffered sends (e.g., MPI_Bsend) is shared by all threads in a process, the user is responsible for ensuring that only one thread at a time calls this routine or MPI_Buffer_detach.
Notes for Fortran
All MPI routines in Fortran (except for MPI_WTIME and MPI_WTICK) have an additional argument ierr at the end of the argument list. ierr is an integer and has the same meaning as the return value of the routine in C. In Fortran, MPI routines are subroutines, and are invoked with the call statement.
All MPI objects (e.g., MPI_Datatype, MPI_Comm) are of type INTEGER in Fortran.
Errors
All MPI routines (except MPI_Wtime and MPI_Wtick) return an error value; C routines as the value of the function and Fortran routines in the last argument. Before the value is returned, the current MPI error handler is called. By default, this error handler aborts the MPI job. The error handler may be changed with MPI_Comm_set_errhandler (for communicators), MPI_File_set_errhandler (for files), and MPI_Win_set_errhandler (for RMA windows). The MPI-1 routine MPI_Errhandler_set may be used but its use is deprecated. The predefined error handler MPI_ERRORS_RETURN may be used to cause error values to be returned. Note that MPI does not guarantee that an MPI program can continue past an error; however, MPI implementations will attempt to continue whenever possible.
MPI_SUCCESS
No error; MPI routine completed successfully.
MPI_ERR_BUFFER
Invalid buffer pointer. Usually a null buffer where one is not valid.
MPI_ERR_INTERN
An internal error has been detected. This is fatal. Please send a bug report to mpi-bugs@mcs.anl.gov.
See Also MPI_Buffer_detach, MPI_Bsend
Refer Here For More
Buffer allocation and usage
Programming with MPI
MPI - Bsend usage
My application uses one-sided communications (MPI_Rget, MPI_Raccumulate) with synchronization primitives like MPI_Win_Lock and MPI_Win_Unlock for its passive target synchronization.
I profiled my application and found that most of the time is spent in the MPI_Win_Unlock function (not MPI_Win_Lock), and I cannot understand why.
(1) Does anyone know why the MPI_Win_Unlock function takes so much time? (Maybe it's an implementation issue.)
(2) Can this situation improve if I move to the S/C/P/W (start/complete/post/wait) synchronization model?
I just need to be sure that all the one-sided operations are not concurrently overlapped.
I am using Intel's MPI Library version 5.1, which implements MPI-3.
I have appended some snippets of my code (actually, that's all of it :D).
Each MPI process runs 'Run()'
Run ()
// Join
For each Target_Proc i in MPI_COMM_WORLD
RequestDataFrom ( (i + k) % nprocs ); // requests k-step away neighbor's data asynchronously
ConsumeDataFrom (i);
JoinWithMyData (my_rank, i);
WriteBackDataTo (i);
goto the above 'For loop' again if the termination condition does not hold.
MPI_Barrier(MPI_COMM_WORLD);
// Update Data in Window
UpdateMyWindow (my_rank);
RequestDataFrom (target_rank_id)
MPI_Win_Lock (MPI_LOCK_SHARED, target_rank_id, win)
MPI_Rget (from target_rank_id, win, &requests[target_rank_id])
MPI_Win_Unlock (target_rank_id, win)
ConsumeDataFrom (target_rank_id)
MPI_Wait (&requests[target_rank_id])
GetPointerToBuffer (target_rank_id)
WriteBackDataTo (target_rank_id)
MPI_Win_Lock (MPI_LOCK_EXCLUSIVE, target_rank_id, win)
MPI_Rput (to target_rank_id, win, &requests[target_rank_id])
MPI_Win_Unlock (target_rank_id, win)
UpdateMyWindow ()
MPI_Win_Lock (MPI_LOCK_EXCLUSIVE, target_rank_id, win)
Update()
MPI_Win_Unlock (target_rank_id, win)
The function MPI_Win_unlock will block until all RMA operations of the access epoch have been completed.
As such it is no surprise that your profiler will show that this function takes the majority of time. It will block till the MPI implementation has completed all one-sided communication operations that were posted since the corresponding MPI_Win_lock.
Note that one-sided operations (Put, Get, etc.) merely dispatch the operation and do not block until it is completed. As such, these operations are effectively very similar to non-blocking communication functions (MPI_Isend/MPI_Irecv), just without the MPI_Request object. To continue the analogy, MPI_Win_unlock waits for all operations to complete, similar to MPI_Waitall.
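To illustrate where the waiting happens, here is a sketch of a single passive-target epoch (win, target, recvbuf and count are placeholders, not names from the question):
MPI_Request req;

MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

/* MPI_Rget only dispatches the transfer; it does not block. */
MPI_Rget(recvbuf, count, MPI_DOUBLE,
         target, /* target_disp = */ 0, count, MPI_DOUBLE,
         win, &req);

/* You may complete this particular operation early via its request ... */
MPI_Wait(&req, MPI_STATUS_IGNORE);

/* ... but the unlock is what guarantees that *all* RMA operations of the
 * epoch are complete, which is why a profiler attributes the waiting time
 * to MPI_Win_unlock rather than to the Rget itself. */
MPI_Win_unlock(target, win);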
I have a loop within which I am launching multiple kernels onto a GPU. Below is the snippet:
for (int idx = start; idx <= end; idx++) {
ret = clEnqueueNDRangeKernel(command_queue, memset_kernel, 1, NULL,
&global_item_size_memset, &local_item_size, 0, NULL, NULL);
ASSERT_CL(ret, "Error after launching 1st memset_kernel !");
ret = clEnqueueNDRangeKernel(command_queue, cholesky_kernel, 1, NULL,
&global_item_size_cholesky, &local_item_size, 0, NULL, NULL);
ASSERT_CL(ret, "Error after launching 1st cholesky_kernel !");
ret = clEnqueueNDRangeKernel(command_queue, ckf_kernel1, 1, NULL,
&global_item_size_kernel1, &local_item_size, 0, NULL, NULL);
ASSERT_CL(ret, "Error after launching ckf_kernel1[i] !");
clFinish(command_queue);
ret = clEnqueueNDRangeKernel(command_queue, memset_kernel, 1, NULL,
&global_item_size_memset, &local_item_size, 0, NULL, NULL);
ASSERT_CL(ret, "Error after launching 2nd memset_kernel !");
ret = clEnqueueNDRangeKernel(command_queue, cholesky_kernel, 1, NULL,
&global_item_size_cholesky, &local_item_size, 0, NULL, NULL);
ASSERT_CL(ret, "Error after launching 2nd cholesky_kernel !");
ret = clSetKernelArg(ckf_kernel2, 4, sizeof(idx), (void *)&idx);
ret = clEnqueueNDRangeKernel(command_queue, ckf_kernel2, 1, NULL,
&global_item_size_kernel2, &local_item_size, 0, NULL, NULL);
ASSERT_CL(ret, "Error after launching ckf_kernel2 !");
}
Now I want to use this code on a system which has multiple GPUs, so I have completed the following steps:
created a single context for all the GPUs.
created one command queue per device.
created separate kernels for each device (code snippet below assuming two gpus)
allocated separate device buffers for each device
cl_kernel ckf_kernel1[2];
cl_kernel ckf_kernel2[2];
cl_kernel cholesky_kernel[2];
cl_kernel memset_kernel[2];
// read get kernel.
ckf_kernel1[0] = clCreateKernel(program, "ckf_kernel1", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel1[i]!");
ckf_kernel2[0] = clCreateKernel(program, "ckf_kernel2", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel2!");
memset_kernel[0] = clCreateKernel(program, "memset_zero", &ret);
ASSERT_CL(ret, "Cannot load memset_kernel!");
cholesky_kernel[0] = clCreateKernel(program, "cholesky_kernel", &ret);
ASSERT_CL(ret, "Cannot load cholesky_kernel!");
ckf_kernel1[1] = clCreateKernel(program, "ckf_kernel1", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel1[i]!");
ckf_kernel2[1] = clCreateKernel(program, "ckf_kernel2", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel2!");
memset_kernel[1] = clCreateKernel(program, "memset_zero", &ret);
ASSERT_CL(ret, "Cannot load memset_kernel!");
cholesky_kernel[1] = clCreateKernel(program, "cholesky_kernel", &ret);
ASSERT_CL(ret, "Cannot load cholesky_kernel!");
Now, I am not sure how to launch the kernels onto the different devices within the loop. How do I get them to execute in parallel? Please note that there is a clFinish command within the loop above.
Another question: is it standard practice to use multiple threads/processes on the host where each thread/process is responsible for launching kernels on a single GPU?
You need not create separate contexts for all the devices. You only need to do that if they are from different platforms.
You need not create separate kernels either. You can compile your kernels for multiple devices at the same time (clBuildProgram supports multi-device compilation), and when you launch a kernel on a device, the runtime will know whether the kernel object holds a device binary valid for that device.
The easiest approach is: create one context, fetch all the devices you need, place them in an array, use that array when building your kernels, and create one command_queue for every device in the array.
clEnqueueNDRangeKernel is non-blocking. The only reason your for loop doesn't dash through is the clFinish() statement, and most likely the fact that you are using an in-order queue, which means that the single-device case would work fine without clFinish too.
The general idea for making best use of multiple GPUs in OpenCL is to create the context, kernels, and queues the way I mentioned, and make the queues out-of-order. That way commands are allowed to execute in parallel if they don't have unmet dependencies; e.g. if the input of command2 is not the output of command1, then command2 is free to start executing in parallel with command1. If you use this method, however, you HAVE to use the final few parameters of clEnqueueNDRangeKernel, because you have to build this chain of dependencies using cl_events. Every clEnqueueWhatever can wait on an array of events that originate from other commands. Execution of a command in the queue will only start once all its dependencies are met.
There is one issue that you have not touched upon, and that is the idea of buffers. If you want to get multi-GPU running, you need to explicitly create buffers for your devices separately and partition your data. It is not valid to set the same buffer as an argument on 2 devices while both of them are trying to write to it. At best, the runtime will serialize your work and the 2 devices will not work in parallel. This is because buffers are handles to memory, and the runtime is responsible for moving the contents of a buffer to the devices that need it. (This can happen implicitly (lazy memory movement), or explicitly if you call clEnqueueMigrateMemObjects.) The runtime is forbidden to give the same buffer with CL_MEM_READ_WRITE or CL_MEM_WRITE_ONLY flags to 2 devices simultaneously. Even though you, as the programmer, know that the 2 devices might not be writing the same part of the buffer, the runtime does not. You have to tell it. The elegant way is to create 2 sub-buffers that are part of the larger/original buffer; the less elegant way is to simply create 2 buffers. The first approach is better, because it is easier to collect data from multiple devices back to the host: you need to fetch only the large buffer, and the runtime will know which sub-buffers have been modified on which devices and will take care of collecting the data.
If I saw your clSetKernelArg calls and the buffers you are using, I could see what the dependencies of your kernels are and write out what you need to do, but I think this is a fairly good start for getting multi-device running. Ultimately, it's all about the data. (And start using out-of-order queues, because they have the potential to be faster, and they force you to start using events, which make it explicit to you and anyone reading the code which kernels are allowed to run in parallel.)
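As a rough starting point, here is a sketch of the sub-buffer approach with two devices sharing one context. It assumes context, devices[2], and the ckf_kernel1 array from your code already exist, uses argument index 0 purely as an example, and omits error checking.
cl_int err;

/* One queue per device (out-of-order here, as suggested above). */
cl_command_queue queue[2];
for (int d = 0; d < 2; ++d)
    queue[d] = clCreateCommandQueue(context, devices[d],
                                    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

/* One big buffer, split into two halves via sub-buffers.
 * Note: each sub-buffer origin must respect CL_DEVICE_MEM_BASE_ADDR_ALIGN. */
size_t N = 1024;  /* example element count, assumed even */
cl_mem big = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            N * sizeof(float), NULL, &err);
cl_buffer_region region[2] = {
    { 0,                       (N / 2) * sizeof(float) },
    { (N / 2) * sizeof(float), (N / 2) * sizeof(float) },
};
cl_mem half[2];
for (int d = 0; d < 2; ++d)
    half[d] = clCreateSubBuffer(big, CL_MEM_READ_WRITE,
                                CL_BUFFER_CREATE_TYPE_REGION, &region[d], &err);

/* Each device works on its own half, so the two launches may overlap. */
cl_event done[2];
size_t gws = N / 2;
for (int d = 0; d < 2; ++d) {
    clSetKernelArg(ckf_kernel1[d], 0, sizeof(cl_mem), &half[d]);
    clEnqueueNDRangeKernel(queue[d], ckf_kernel1[d], 1, NULL,
                           &gws, NULL, 0, NULL, &done[d]);
    clFlush(queue[d]);
}
clWaitForEvents(2, done);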
I am just getting started with OpenCL and still learning.
Kernel Code:
__kernel void gpu_kernel(__global float* data)
{
printf("workitem %d invoked\n", get_global_id(0));
int x = 0;
if (get_global_id(0) == 1) {
while (x < 1) {
x = 0;
}
}
printf("workitem %d completed\n", get_global_id(0));
}
C code for invoking kernel
size_t global_item_size = 4; // number of workitems total
size_t local_item_size = 1; // number of workitems per group
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Output:
workitem 3 invoked
workitem 3 completed
workitem 0 invoked
workitem 0 completed
workitem 1 invoked
workitem 2 invoked
workitem 2 completed
## Here the code waits at the terminal for work-item #1 to finish, which will never happen
This clearly shows that all work-items run in parallel (but each in a different work-group).
Another C snippet for invoking the kernel (1 work-group with 4 work-items)
size_t global_item_size = 4; // number of workitems total
size_t local_item_size = 4; // number of workitems per group
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Output:
workitem 0 invoked
workitem 0 completed
workitem 1 invoked
## Here the code waits at the terminal for work-item #1 to finish, which will never happen
This clearly shows that the work-items run in sequence (that's why the 1st work-item completed, the 2nd got stuck, and the rest were never executed).
My Question:
I need to invoke 1 work-group with 4 work-items that run in parallel, so that I can use a barrier in my code (which I guess is only possible within a single work-group). Is that possible?
Any help/suggestion/pointer will be appreciated.
Your second host code snippet correctly launches a single work-group that contains 4 work-items. You have no guarantees that these work-items will run in parallel, since the hardware might not have the resources to do so. However, they will run concurrently, which is exactly what you need in order to be able to use work-group synchronisation constructs such as barriers. See this Stack Overflow question for a concise description of the difference between parallelism and concurrency. Essentially, the work-items in a work-group will make forward progress independently of each other, even if they aren't actually executing in parallel.
OpenCL 1.2 Specification (Section 3.2: Execution Model)
The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.
Based on your previous question on a similar topic, I assume you are using AMD's OpenCL implementation targeting the CPU. The way most OpenCL CPU implementations work is by serialising all work-items from a work-group into a single thread. This thread then executes each work-item in turn (ignoring vectorisation for the sake of argument), switching between them when they either finish or hit a barrier. This is how they achieve concurrent execution, and gives you all the guarantees you need in order to safely use barriers within your kernel. Parallel execution is achieved by having multiple work-groups (as in your first example), which will result in multiple threads executing on multiple cores (if available).
If you replaced your infinite loop with a barrier, you would clearly see that this does actually work.
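For illustration, here is a sketch of the same kernel with the infinite loop replaced by a barrier; with a single work-group of 4 work-items, every work-item prints "invoked" before any of them prints "completed":
__kernel void gpu_kernel(__global float* data)
{
    printf("workitem %d invoked\n", (int)get_global_id(0));
    /* All work-items of the work-group must reach this point before any
     * of them may continue past it. */
    barrier(CLK_GLOBAL_MEM_FENCE);
    printf("workitem %d completed\n", (int)get_global_id(0));
}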
I have a GPU with CC 3.0, so it should support 16 concurrent kernels. I am starting 10 kernels by calling clEnqueueNDRangeKernel in a loop 10 times. How do I find out whether the kernels are executing concurrently?
One way I have thought of is to get the time before and after the NDRangeKernel statement. I might have to use events to ensure that the execution of the kernel has completed. But I still feel that the loop will start the kernels sequentially. Can someone help me out?
To determine if your kernel executions overlap, you have to profile them. This requires several steps:
1. Creating the command-queues
Profiling data is only collected if the command-queue is created with the property CL_QUEUE_PROFILING_ENABLE:
cl_command_queue queues[10];
for (int i = 0; i < 10; ++i) {
queues[i] = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE,
&errcode);
}
2. Making sure all kernels start at the same time
You are right in your assumption that the CPU queues the kernels sequentially. However, you can create a single user event and add it to the wait list of all kernels. This prevents the kernels from starting to run before the user event is completed:
// Create the user event
cl_event user_event = clCreateUserEvent(context, &errcode);
// Reserve space for kernel events
cl_event kernel_events[10];
// Enqueue kernels
for (int i = 0; i < 10; ++i) {
clEnqueueNDRangeKernel(queues[i], kernel, work_dim, global_work_offset,
                       global_work_size, NULL /* local_work_size */,
                       1, &user_event, &kernel_events[i]);
}
// Start all kernels by completing the user event
clSetUserEventStatus(user_event, CL_COMPLETE);
3. Obtain profiling times
Finally, we can collect the timing information for the kernel events:
// Block until all kernels have run to completion
clWaitForEvents(10, kernel_events);
for (int i = 0; i < 10; ++i) {
cl_ulong start;
clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
cl_ulong end;
clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
printf("Event %d: start=%llu, end=%llu\n", i,
       (unsigned long long)start, (unsigned long long)end);
}
4. Analyzing the output
Now that you have the start and end times of all kernel runs, you can check for overlaps (either by hand or programmatically). The output units are nanoseconds. Note however that the device timer is only accurate to a certain resolution. You can query the resolution using:
size_t resolution;
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
sizeof(resolution), &resolution, NULL);
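If you would rather check programmatically than by hand, a small sketch (assuming the start/end values from step 3 were stored in arrays start[10] and end[10] instead of being printed) could look like this:
/* Two kernel runs overlap if each one starts before the other one ends. */
for (int i = 0; i < 10; ++i) {
    for (int j = i + 1; j < 10; ++j) {
        if (start[i] < end[j] && start[j] < end[i])
            printf("kernels %d and %d overlapped\n", i, j);
    }
}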
FWIW, I tried this on a NVIDIA device with CC 2.0 (which should support concurrent kernels) and observed that the kernels were run sequentially.
You can avoid all the boilerplate code suggested in the other answers (which are correct by the way) by using C Framework for OpenCL, which simplifies this task a lot, and gives you detailed information about OpenCL events (kernel execution, data transfers, etc), including a table and a plot dedicated to overlapped execution of said events.
I developed this library in order to, among other things, simplify the process described in the other answers. You can see a basic usage example here.
Yes, as you suggest, try to use the events, and analyze all the QUEUED, SUBMIT, START, END values. These should be absolute values in "device time", and you may be able to see if processing (START to END) overlaps for the different kernels.
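For example, a sketch of querying all four timestamps for one event from a profiling-enabled queue:
cl_ulong queued, submit, start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_QUEUED,
                        sizeof(queued), &queued, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT,
                        sizeof(submit), &submit, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
printf("queued=%llu submit=%llu start=%llu end=%llu\n",
       (unsigned long long)queued, (unsigned long long)submit,
       (unsigned long long)start, (unsigned long long)end);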