Asynchronous data transfer in CUDA

Consider the CUDA code below:
cudaMemcpyAsync(H2D, data1...., StreamA);
KernelB<<<..., StreamB>>>(data1,...);
cudaMemcpyAsync(D2H, output using data1, ...., StreamA);
When does "cudaMemcpyAsync(D2H....., StreamA);" start? Does it start after KernelB has finished executing? Should I replace "cudaMemcpyAsync(D2H....., StreamA);" with "cudaMemcpy(D2H....., StreamA);" if I have to copy the output of KernelB back to the host?
Also, is pinned memory absolutely required for asynchronous data transfers?
Thanks in advance.

User-created CUDA streams are asynchronous with respect to each other and with respect to the host. Tasks issued to the same CUDA stream are serialized. So in your case, cudaMemcpyAsync(D2H, output using data1, ...., StreamA); will wait for the previous memory copy to finish. But there is no guarantee that the kernel will have finished executing when this memory copy starts, because StreamA and StreamB are asynchronous with respect to each other. For the same reason, KernelB is not guaranteed to wait for the host-to-device copy issued in StreamA.
Also, the host will not wait for these streams to finish execution.
If you want the host to wait for the streams, you can use cudaDeviceSynchronize or cudaStreamSynchronize.
If you do not use pinned memory, the memory copies will not overlap with kernel execution.
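One way to enforce the copy, kernel, copy order across the two streams without blocking the host is to record events and make the other stream wait on them. The following is only a minimal sketch of that idea: the kernel body, the buffer names, and the function runOrderedAcrossStreams are hypothetical stand-ins for the elided parts of the question, and error checking is reduced to the final synchronization.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for KernelB: doubles each input element.
__global__ void kernelB(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Returns true if the host receives the kernel's output (assumes n > 0).
bool runOrderedAcrossStreams(int n) {
    const size_t bytes = n * sizeof(float);
    float *hIn = nullptr, *hOut = nullptr, *dIn = nullptr, *dOut = nullptr;
    cudaMallocHost(&hIn, bytes);   // pinned host buffers
    cudaMallocHost(&hOut, bytes);
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    for (int i = 0; i < n; ++i) { hIn[i] = float(i); hOut[i] = -1.0f; }

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);
    cudaEvent_t h2dDone, kernelDone;
    cudaEventCreateWithFlags(&h2dDone, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&kernelDone, cudaEventDisableTiming);

    // H2D copy in StreamA; StreamB waits for it before running the kernel.
    cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, streamA);
    cudaEventRecord(h2dDone, streamA);
    cudaStreamWaitEvent(streamB, h2dDone, 0);
    kernelB<<<(n + 255) / 256, 256, 0, streamB>>>(dIn, dOut, n);

    // StreamA waits for the kernel before starting the D2H copy.
    cudaEventRecord(kernelDone, streamB);
    cudaStreamWaitEvent(streamA, kernelDone, 0);
    cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, streamA);

    // Only now does the host block, until everything in StreamA is done.
    bool ok = (cudaStreamSynchronize(streamA) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (hOut[i] == 2.0f * hIn[i]);

    cudaEventDestroy(h2dDone);
    cudaEventDestroy(kernelDone);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return ok;
}
```

A simpler alternative, if no overlap is needed, is to issue all three operations to the same stream, since tasks within one stream are serialized.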


User mode and kernel mode: different programs at the same time

Is it possible that one process is running in kernel mode and another in user mode at the same time?
I know it's not a coding question, but please guide me if someone knows the answer.
For two processes to actually be running at the same time, you must have multiple CPUs. And indeed, when you have multiple CPUs, what runs on the different CPUs is very loosely coupled, and you can definitely have one process running user code on one CPU while another process runs kernel code (e.g., doing some work inside a system call) on another CPU.
If you are asking about just one CPU, then you can't have two processes running at the same time. What you can have is two runnable processes, meaning two processes that are both ready to run; since there is just one CPU, only one of them can actually run. One of the runnable processes might be in user mode, e.g., a long-running tight loop that was preempted when its time quota ran out. Another runnable process might be in kernel mode, e.g., a process that made a read() system call on a disk file: the kernel sent the read request to the disk, and the request has now completed, so the process is ready to run again in kernel mode and finish the read() call.
Yes, it is possible. Even multiple processes can be in the kernel mode at the same time.
Just that a single process cannot be in both the modes at the same time.
Correct me if I'm wrong, but I suppose there are no processes in kernel mode, only threads.

Effect of not using clWaitForEvents

I'm new to OpenCL programming. In one of my OpenCL applications, I use clWaitForEvents after launching every kernel.
Case 1:
cl_event event;
cl_int status = clEnqueueNDRangeKernel(queue, ..., &event);
clWaitForEvents(1, &event);
Time taken : 250 ms (with clWaitForEvents)
If I remove clWaitForEvents(), my kernel runs faster with the same output.
Case 2:
cl_event event;
cl_int status = clEnqueueNDRangeKernel(queue, ..., &event);
Time taken: 220 ms (without clWaitForEvents)
I've to launch 10 different kernels sequentially. Every kernel is dependent on the output of the previous kernel. Using clWaitForEvents after every kernel increases my execution time by a few hundred ms.
Can the outputs go wrong if I do not use clWaitForEvents? I would like to understand what might possibly go wrong if I do not use clWaitForEvents or clFinish.
Any pointers are appreciated.
Hopefully a slightly less complicated answer:
"I've to launch 10 different kernels sequentially. Every kernel is dependent on the output of the previous kernel."
If you don't explicitly set CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property in clCreateCommandQueue() call (= the usual case), it will be an in-order queue. You don't need to synchronize commands in them (actually you shouldn't, as you see it can considerably slow down execution). See the docs:
If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed. If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A.
"I would like to understand what might possibly go wrong if I do not use clWaitForEvents or clFinish."
If you're doing simple stuff on a single in-order queue, you don't need clWaitForEvents() at all. It's mostly useful if you want to wait for multiple events from multiple queues, or you're using out-of-order queues, or you want to enqueue 20 commands but wait for the 4th, or something similar.
For a single in-order queue, after clFinish() returns, all commands will be completed and any and all events will have their status updated to complete or failed. So in the simplest case you don't need to deal with events at all: just enqueue everything you need (check the enqueues for errors though) and call clFinish().
Note that if you don't use any form of wait/flush (WaitForEvents / Finish / a blocking command), the implementation may take as much time as it wants to actually push those commands to a device. In other words, you must either 1) use WaitForEvents or Finish, or 2) enqueue a blocking command (read/write/map/unmap) as the last command.
An in-order queue implicitly waits for each command to complete in the order they were enqueued, but only on the device side. The host is not told when anything completes.
An out-of-order queue does not guarantee any command order, so dependent commands can run in the wrong order unless you synchronize them explicitly.
'Wait-for-event' waits on host side for an event of a command.
'Finish' waits on host side until all commands are complete.
'Non blocking buffer read/write' does not wait on host side.
'Blocking buffer read/write' waits on host side but does not wait for other commands.
Recommended solutions:
Inter-command sync (for using the output of one command as the input of the next command):
  use an in-order queue
  or pass the event of one command to another (if it's an out-of-order queue)
Inter-queue (or out-of-order queue) sync (for overlapping buffer copies and kernel executions):
  pass events from command(s) to another command
Device-host sync (for getting the latest data to RAM, getting the first data from RAM, or pausing the host):
  enable the blocking option on buffer commands
  or add a clFinish
  or use clWaitForEvents
Being informed when a command is complete (for reasons like benchmarking):
  use an event callback
  or constantly query the event state (CPU/PCIe usage increases)
Enqueueing 1 non-blocking buffer write + 1000 x kernels + 1 blocking buffer read on an in-order-queue can successfully execute a chain of 1000 kernels on initial data and get latest results on host side.

Overlapping transfer and execution: ensure that commands are performed in the right order

The OpenCL Best Practices Guide suggests in section 3.1.3 using clFlush to ensure that commands happen in the right order, e.g. that processing doesn't happen before the data transfer:
Transfer the data for queue0
clFlush for queue0
Run the kernel for queue0, transfer the data for queue1
clFlush for queue0 and queue1
Run the kernel for queue1 and retrieve the data for queue0
clFlush for them both
Retrieve the data for queue1
The reply here suggests using events to achieve what seems to be the same thing.
My question is: Did I get it right, and do both clFlush and events serve the same purpose (avoiding simultaneous execution) in this case? Does it matter which of them to use?
clFlush only guarantees that the previously enqueued commands (data transfers or kernel executions) are actually submitted to the device; it does not guarantee that they have finished. There are several cases where you need to use events:
1 - If you use non-blocking calls for the data transfers, you need events to make sure a transfer has finished before you start executing your kernel, and when you copy results back to the host you need to wait for the read event to complete.
2 - If you have dependencies between kernels that you are executing in both queues, then again you have to use events to order the kernels correctly.
So the answer depends on what kind of dependencies you have between kernel executions and whether you are using non-blocking calls to transfer data. If you have no dependencies and you are using blocking calls for data transfer, clFlush will do the job. Otherwise, you need events.

OpenCL clEnqueueReadBuffer During Kernel Execution?

Can queued kernels continue to execute while an OpenCL clEnqueueReadBuffer operation is occurring?
In other words, is clEnqueueReadBuffer a blocking operation on the device?
From a host API point of view, clEnqueueReadBuffer can be blocking or not, depending on if you set the blocking_read parameter to CL_TRUE or CL_FALSE.
If you set it to not block, then the read just gets queued and you should use an event (or subsequent blocking call) to determine when it has finished (i.e., before you access the memory that you are reading to).
If you set it to block, the call won't return until the read is done. The memory being read to will be correct. Also (and answering your actual question) any operations you queued prior to the clEnqueueReadBuffer will all have to finish first before the read starts (see exception note below).
All clEnqueue* API calls are asynchronous, but some have "blocking" parameters you can set. Setting one is equivalent to using the non-blocking version and then calling clFinish. The command queue will be flushed to the device and your host thread won't continue until the work has finished. Of course, it is hard to keep the GPU always busy this way, since it then has no work queued, but if you queue up new work fast enough you can still keep it reasonably busy.
This all assumes a single, in-order command queue. If your command queue is out-of-order and your device supports out-of-order queues then enqueued items can execute in any order that doesn't violate the event_wait_list parameters you provided. Likewise, you can have multiple command queues, which can again be executed in any order that doesn't violate the event_wait_list parameters you provided. Typically, they are used to overlap memory transfers and compute, and to keep multiple compute units busy. Out-of-order command queues and multiple command queues are both advanced OpenCL concepts and shouldn't be attempted until you fully understand and have experience with in-order command queues.
Clarification added later after DarkZeros pointed out the "on the device" part of the OP's question: My answer was from the host thread API point of view. On the device, with an in-order command queue all downstream commands are blocked by the current command. With an out-of-order queue they are only blocked by the event_wait_list. However, out-of-order command queues are not well supported in today's drivers. With multiple command queues, in theory commands are only blocked by prior commands (if in-order) and the event_wait_list. In reality, there are sometimes special vendor rules that prevent the free flowing of potentially non-blocked commands that you might like. This is often because the multiple OpenCL command queues get transferred to device-side memory and compute queues, and get executed in-order there. So depending on the order that you add commands to your multiple command queues, they might get interleaved in such a way that they block in sub-optimal ways. The best solution I'm aware of is to either be careful about the order you enqueue (based on knowledge of this implementation detail), or use one queue for memory and one for compute, which matches the device-side queueing.
If overlap of memory and compute is your goal, both AMD and NVIDIA provide examples of how to overlap memory and compute operations and, for GPUs that support multiple concurrent compute operations, how to do that too. The NVIDIA examples are hard to get hold of, but they are out there (from the CUDA 4 days).

Asynchronous CUDA transfer calls not behaving asynchronously

I am using my GPU concurrently with my CPU. When I profile memory transfers I find that the async calls in cuBLAS do not behave asynchronously.
I have code that does something like the following
cudaEvent_t event;
// time-point A
cublasSetVectorAsync(n, elemSize, x, incx, y, incy, 0);
// time-point B
// time-point C
I'm using sys/time.h to profile (code omitted for clarity). I find that the cublasSetVectorAsync call dominates the time, as though it were behaving synchronously. That is, the duration A-B is much longer than the duration B-C and increases as I increase the size of the transfer.
What are possible reasons for this? Is there some environment variable I need to set somewhere or an updated driver that I need to use?
I'm using a GeForce GTX 285 with Cuda compilation tools, release 4.1, V0.2.1221
cublasSetVectorAsync is a thin wrapper around cudaMemcpyAsync. Unfortunately, in some circumstances, the name of this function is a misnomer, as explained on this page from the CUDA reference manual.
For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.
For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed.
So the solution to your problem is likely to just allocate x, your host data array, using cudaHostAlloc, rather than standard malloc (or C++ new).
Alternatively, if your GPU and CUDA version support it, you can use malloc and then call cudaHostRegister on the malloc-ed pointer. Note the condition in the documentation that you must create your CUDA context with the cudaDeviceMapHost flag in order for the cudaHostRegisterMapped flag to have any effect (see the documentation for cudaSetDeviceFlags).
In cuBLAS/cuSPARSE, things take place in stream 0 if you don't specify a different stream. To specify a stream, you have to use cublasSetStream (see cuBLAS documentation).
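As a rough illustration of the pinned-memory suggestion, the sketch below allocates the host buffers with cudaHostAlloc and issues the copies with cudaMemcpyAsync on a non-default stream. It calls cudaMemcpyAsync directly instead of cublasSetVectorAsync to avoid a cuBLAS dependency; the function name pinnedAsyncRoundTrip and the buffer names are hypothetical, and the error handling is deliberately minimal.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Copies n floats host -> device -> host through pinned buffers.
// Returns true if the data survives the round trip (assumes n > 0).
bool pinnedAsyncRoundTrip(size_t n) {
    const size_t bytes = n * sizeof(float);
    float *hSrc = nullptr, *hDst = nullptr, *dBuf = nullptr;
    cudaStream_t stream;

    // cudaHostAlloc returns page-locked (pinned) host memory, unlike malloc.
    bool ok = cudaHostAlloc(&hSrc, bytes, cudaHostAllocDefault) == cudaSuccess
           && cudaHostAlloc(&hDst, bytes, cudaHostAllocDefault) == cudaSuccess
           && cudaMalloc(&dBuf, bytes) == cudaSuccess
           && cudaStreamCreate(&stream) == cudaSuccess;
    if (!ok) return false;  // sketch only: resources are not released here

    for (size_t i = 0; i < n; ++i) hSrc[i] = float(i);
    std::memset(hDst, 0, bytes);

    // With pinned buffers these calls only enqueue the DMA transfers;
    // the host thread can do other work right after they return.
    cudaMemcpyAsync(dBuf, hSrc, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(hDst, dBuf, bytes, cudaMemcpyDeviceToHost, stream);

    // hDst must not be read until the stream has finished.
    ok = (cudaStreamSynchronize(stream) == cudaSuccess)
      && (std::memcmp(hSrc, hDst, bytes) == 0);

    cudaStreamDestroy(stream);
    cudaFree(dBuf);
    cudaFreeHost(hSrc);
    cudaFreeHost(hDst);
    return ok;
}
```

Timing the cudaMemcpyAsync calls separately from the cudaStreamSynchronize call, as in the question's A-B and B-C measurements, is one way to check whether the transfer really returns early on a given setup.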
