clEnqueueNDRangeKernel function - OpenCL

Please correct me if my understanding is wrong.
err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, global, wg, 0, NULL, &gpuExec);
1) Does the CPU code written between these two function calls execute on the CPU at the same time the kernel is executing on the GPU, i.e., do they run simultaneously?
err=clEnqueueReadBuffer(command_queue,output,CL_TRUE,0,sizeof(cl_int)*100,results,0,NULL,NULL);
2) Does this function
err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, global, wg, 0, NULL, &gpuExec);
return immediately, so that the CPU can do other work after the call (i.e., the kernel starts executing on the GPU's cores while the CPU simultaneously does other work)?

Both of your questions can be answered with yes. Kernels execute asynchronously on the GPU, so you can do other work on the CPU in the meantime. If you need to synchronize, you can wait explicitly on your gpuExec event using clWaitForEvents(1, &gpuExec).
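As a sketch of the pattern (assuming command_queue, kernel, output, results, global, and wg were created earlier as in the question; do_other_cpu_work() is a hypothetical placeholder, and this won't run without an OpenCL device):

```c
/* Sketch: overlap CPU work with an asynchronously executing kernel. */
cl_event gpuExec;
cl_int err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                    global, wg, 0, NULL, &gpuExec);

do_other_cpu_work();  /* hypothetical; runs while the kernel executes on the GPU */

/* Either block explicitly on the kernel... */
clWaitForEvents(1, &gpuExec);

/* ...or let a blocking read (CL_TRUE) do the waiting for you: */
err = clEnqueueReadBuffer(command_queue, output, CL_TRUE, 0,
                          sizeof(cl_int) * 100, results, 0, NULL, NULL);
```

Note that the blocking read alone already guarantees the kernel has finished before results is valid, since the in-order queue executes commands sequentially.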

Related

Issue with #pragma acc host_data use_device

I'd like the MPI function MPI_Sendrecv() to run on the GPU. Normally I use something like:
#pragma acc host_data use_device(send_buf, recv_buf)
{
    MPI_Sendrecv(send_buf, N, MPI_DOUBLE, proc[0], 0,
                 recv_buf, N, MPI_DOUBLE, proc[0], 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
And it works fine. However, now I call MPI_Sendrecv() inside a loop. If I try to accelerate this loop (with #pragma acc parallel loop), or even accelerate the whole routine (#pragma acc routine) containing the loop and the MPI call, I get an error:
64, Accelerator restriction: loop contains unsupported statement type
78, Accelerator restriction: unsupported statement type: opcode=ACCHOSTDATA
How can I make the call run on the device if, as in this case, the call is inside an accelerated region?
An alternative might be to not accelerate the routine and the loop and use #pragma acc host_data use_device(send_buf, recv_buf) alone, but that would defeat the goal of having everything on the GPU.
EDIT
I removed the #pragma. However, the application now runs hundreds of times slower and I cannot figure out why.
I'm using nsight-sys to check: do you have any idea why MPI_Sendrecv is slowing down the app? The whole routine where it's called now runs on the host. If I move the mouse pointer over the NVTX (MPI) section, it prints "ranges on this row have been projected from the CPU on the GPU". What does this mean?
Sorry if this is not clear, but I lack practical experience with nsight and don't know how to analyze the results properly. If you need more details I'm happy to provide them.
It also seems weird to me that the MPI calls appear in the GPU section.
You can't make MPI calls from within device code.
Also, "host_data" says to use a device pointer within host code, so it can't be used within device code. Device pointers are used by default in device code, hence there is no need for the "host_data" construct there.
Questions after edit:
Do you have and idea why MPI_Sendrecv is slowing down the app?
Sorry, no idea. I don't know what you're comparing against or anything about your app, so it's hard for me to tell. Note, though, that MPI_Sendrecv is a blocking call, so putting it in a loop will cause each send/receive to wait on the previous one before proceeding. Are you able to rewrite the code to use MPI_Isend and MPI_Irecv instead?
"ranges on this row have been projected from the CPU on the GPU". What
does this mean?
I haven't seen this before, but I presume it just means that even though these are host calls, the NVTX instrumentation is able to project them onto the GPU timeline, most likely so that CUDA-aware MPI device-to-device data transfers can be correlated with the MPI region.
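If the loop can be rewritten with non-blocking calls as suggested, a sketch might look like this (host code, outside any OpenACC compute region; send_buf, recv_buf, N, and proc are taken from the question):

```c
/* Sketch: replace the blocking MPI_Sendrecv with non-blocking
 * MPI_Isend/MPI_Irecv so iterations don't serialize on each other.
 * This is still host code -- the host_data region only substitutes
 * the device addresses of the buffers for the MPI calls. */
MPI_Request reqs[2];
#pragma acc host_data use_device(send_buf, recv_buf)
{
    MPI_Irecv(recv_buf, N, MPI_DOUBLE, proc[0], 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, N, MPI_DOUBLE, proc[0], 0, MPI_COMM_WORLD, &reqs[1]);
}
/* ...other host work can overlap with the communication here... */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
```

Whether this actually helps depends on how much work is available to overlap between iterations.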

Is clWaitForEvents required for an in-order queue?

I've created an in-order OpenCL queue. My pipeline enqueues multiple kernels into the queue.
queue = clCreateCommandQueue(cl.context, cl.device, 0, &cl.error);
for (i = 0; i < num_kernels; i++) {
    clEnqueueNDRangeKernel(queue, kernels[i], dims, NULL,
                           global_work_group_size, local_work_group_size,
                           0, NULL, &event);
}
The output of kernels[0] is input to kernels[1], the output of kernels[1] is input to kernels[2], and so on.
Since my command queue is an in-order queue, my assumption is kernels[1] will start only after kernels[0] is completed.
Is my assumption valid?
Should I use clWaitForEvents to make sure the previous kernel is completed before enqueuing the next kernel?
Is there any way I can stack multiple kernels into the queue and just pass the input to kernels[0] and directly get the output from the last kernel, without having to enqueue every kernel one by one?
Your assumption is valid. You do not need to wait for events in an in-order queue. Take a look at the OpenCL documentation:
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html
If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed. If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A. If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is set, then there is no guarantee that kernel A will finish before kernel B starts execution.
As to the other question: yes, you'll need to explicitly enqueue every kernel that you want to run. Consider it a good thing, as there is no magic happening.
Of course you can always write your own helpers in C/C++ (or whatever host language you are using) that simplify this, and potentially hide the cumbersome kernel calls. Or use some GPGPU abstraction library to do the same.
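One possible shape for such a helper, as a sketch (check() is a hypothetical error handler; kernel arguments are assumed to be wired up beforehand via clSetKernelArg so that each kernel's output buffer is the next kernel's input):

```c
/* Sketch: enqueue a chain of kernels into one in-order queue.
 * Because the queue is in-order, kernels[i+1] only starts after
 * kernels[i] has finished -- no per-kernel event waiting needed. */
static void enqueue_pipeline(cl_command_queue queue,
                             cl_kernel *kernels, size_t num_kernels,
                             cl_uint dims,
                             const size_t *global, const size_t *local)
{
    for (size_t i = 0; i < num_kernels; i++) {
        cl_int err = clEnqueueNDRangeKernel(queue, kernels[i], dims, NULL,
                                            global, local, 0, NULL, NULL);
        check(err);  /* hypothetical error handler */
    }
}
```

A single blocking clEnqueueReadBuffer on the final output buffer then retrieves the pipeline's result.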

OpenCL enqueued kernels using lots of host memory

I am executing monte carlo sweeps on a population of replicas of my system using OpenCL kernels. After the initial debugging phase I increased some of the arguments to more realistic values and noticed that the program is suddenly eating up large amounts of host memory. I am executing 1000 sweeps on about 4000 replicas, each sweep consists of 2 kernel invocations. That results in about 8 million kernel invocations.
The source of the memory usage was easy to find (see screenshot).
While the kernel executions are being enqueued, the memory usage goes up.
While the kernels are executing, the memory usage stays constant.
Once the kernels finish, the usage goes down to its original state.
I did not allocate any memory, as can be seen in the memory snapshots.
That means the OpenCL driver is using the memory. I understand that it must keep a copy of all the arguments to the kernel invocations and also the global and local workgroup size, but that does not add up.
The peak memory usage was 4.5GB. Before enqueuing the kernels about 250MB were used. That means OpenCL used about 4.25GB for 8 million invocations, i.e. about half a kilobyte per invocation.
So my questions are:
Is that kind of memory usage normal and to be expected?
Are there good/known techniques to reduce memory usage?
Maybe I should not enqueue so many kernels simultaneously, but how would I do that without causing a full synchronization, e.g. with clFinish()?
Enqueueing a large number of kernel invocations needs to be done in a somewhat controlled manner so that the command queue does not eat too much memory. First, clFlush may help to some degree. Then clWaitForEvents is necessary to create a synchronization point in the middle, such that, for example, 2000 kernel invocations are enqueued and clWaitForEvents waits for the 1000th one. The device is not going to pause, because we have another 1000 invocations of work pre-batched already. Then a similar thing needs to be repeated again and again. This could be illustrated this way:
enqueue 999 kernel commands
while (invocations < 8000000)
{
    enqueue 1 kernel command with an event
    enqueue 999 kernel commands
    wait for the event
}
The optimal number of kernel invocations to wait after may differ from the one presented here, so it needs to be worked out for the given scenario.

Translating C code to OpenCL

I am trying to translate a small program written in C into OpenCL. I am supposed to transfer some input data to the GPU and then perform ALL calculations on the device using successive kernel calls.
However, I am facing difficulties with the parts of the code that are not suitable for parallelization, since I must avoid transferring data back and forth between CPU and GPU because of the amount of data involved.
Is there a way to execute some kernels without parallel processing so I can replace those parts of the code with them? Is this achieved by setting the global work size to 1?
Yes, you can execute code serially on OpenCL devices. To do this, write your kernel code the same way you would in C and then execute it with the clEnqueueTask() function.
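As a sketch (assuming queue and serial_kernel already exist; note that clEnqueueTask is defined as equivalent to clEnqueueNDRangeKernel with work dimension 1 and global/local work size 1, and was deprecated in OpenCL 2.0 in favor of that form):

```c
/* Sketch: run a kernel as a single work-item, i.e. serially. */
cl_int err = clEnqueueTask(queue, serial_kernel, 0, NULL, NULL);

/* Equivalent, non-deprecated form: */
size_t one = 1;
err = clEnqueueNDRangeKernel(queue, serial_kernel, 1, NULL,
                             &one, &one, 0, NULL, NULL);
```

Either way the kernel executes as one work-item, so the sequential code keeps working on buffers that stay resident on the device.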
You could manage two devices :
the GPU for highly parallelized code
the CPU for sequential code
This is a bit complex, as you must manage one command queue per device to schedule each kernel on the appropriate device.
If the devices are part of the same platform (typically AMD), you can use the same context; otherwise you will have to create an additional context for the CPU.
Moreover, if you want more fine-grained CPU task parallelization, you could use device fission if your CPU supports it.

Multiple OpenCl Kernels

I just wanted to ask if somebody can give me a heads-up on what to pay attention to when using several simple kernels one after another.
Can I use the same CommandQueue? Can I just call clCreateProgramWithSource several times, creating a different cl_program each time? What did I forget?
Thanks!
You can either create and compile several programs (and create kernel objects from those), or you can put all kernels into the same program (clCreateProgramWithSource takes several strings, after all) and create all your kernels from that one. Either should work fine using the same CommandQueue. Using more than one CommandQueue to execute kernels that should run serially on the same device is not a good idea anyway, because in that case you have to manually wait for event completion instead of asynchronously enqueuing all kernels and then waiting on the result. (At least some operations should execute in parallel on device and host, so waiting at the last possible moment is generally faster and easier.)
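A sketch of the one-program approach (the source strings, kernel names, and surrounding context/device variables are hypothetical; error checking omitted):

```c
/* Sketch: compile several kernels from one program object. */
const char *sources[] = { blur_src, sharpen_src };  /* hypothetical kernel sources */
cl_int err;
cl_program program = clCreateProgramWithSource(context, 2, sources, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

/* One cl_kernel per __kernel function name in the combined source: */
cl_kernel blur    = clCreateKernel(program, "blur", &err);
cl_kernel sharpen = clCreateKernel(program, "sharpen", &err);

/* Both kernels can then be enqueued, in order, on the same command queue. */
```

Keeping everything in one program also means a single build log to inspect when compilation fails.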
