Is cublasSetMatrixAsync on the default stream blocking? - asynchronous

I would like to copy data from host to device and run some kernels in parallel. There seems to be conflicting information on whether running a cublasSetMatrixAsync function call will be blocking on the default stream?
I am seeing it block execution and am wondering what the correct way to use it is. Should cublasSetMatrixAsync be on the non-default stream? If so, is there an easy way for default stream to block if it needs the matrix on device for some kernel in the future?

Yes, it has blocking behavior.
From the programming guide:
Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
...
• any CUDA command to the default stream,
cublasSetMatrixAsync is not exempt from this.
A general rule for CUDA concurrency is, if you want it, don't use the default stream.
is there an easy way for default stream to block if it needs the matrix on device for some kernel in the future?
issue a cudaDeviceSynchronize()
That will force all cuda device activity, in any stream associated with that device, to finish before any subsequent commands, issued to any stream associated with that device, can begin.

Related

Is it possible to attach an asynchronous callback/continuation to a SYCL kernel?

I have a collection of thousands of SYCL kernels to execute. Once each of these kernels has finished, I need to execute a function on a cl::sycl::buffer written to by said kernel.
The methods I'm aware of for achieving this are:
by using RAII; the requisite global memory is copied back to the host upon destruction of the cl::sycl::buffer
by constructing a host cl::sycl::accessor (with cl::sycl::access::target::host_buffer)
Both of these methods are synchronous and blocking. Is it possible to instead attach an asynchronous callback/continuation when submitting kernels to a cl::sycl::queue that executes as soon as the kernel has finished? Or even better, can the same functionality be achieved with C++2a coroutines? If not, is such a feature planned for SYCL?
The feature to attach callbacks or execute on the host from a SYCL queue did not make the cut for SYCL 1.2.1.
There are some proposals being discussed at the moment to bring that feature into the next version of the standard, but everything is still internal to the SYCL group.
In the meantime, if you use ComputeCpp, you can use the host_handler extension, which allows you to execute a lambda on the host based on dependencies from the device.
The open source compiler doesn't have that feature yet that I've seen.

OpenCL clEnqueueReadBuffer During Kernel Execution?

Can queued kernels continue to execute while an OpenCL clEnqueueReadBuffer operation is occurring?
In other words, is clEnqueueReadBuffer a blocking operation on the device?
From a host API point of view, clEnqueueReadBuffer can be blocking or not, depending on if you set the blocking_read parameter to CL_TRUE or CL_FALSE.
If you set it to not block, then the read just gets queued and you should use an event (or subsequent blocking call) to determine when it has finished (i.e., before you access the memory that you are reading to).
If you set it to block, the call won't return until the read is done. The memory being read to will be correct. Also (and answering your actual question) any operations you queued prior to the clEnqueueReadBuffer will all have to finish first before the read starts (see exception note below).
All clEnqueue* API calls are asynchronous, but some have "blocking" parameters you can set. Using it is the equivalent to using a non-blocking version and then calling clFinish instead. The command queue will be flushed to the device and your host thread won't continue until the work has finished. Of course, it is hard to keep the GPU always busy doing it this way, since now it doesn't have any work, but if you queue up new work fast enough you can still keep it reasonably busy.
This all assumes a single, in-order command queue. If your command queue is out-of-order and your device supports out-of-order queues then enqueued items can execute in any order that doesn't violate the event_wait_list parameters you provided. Likewise, you can have multiple command queues, which can again be executed in any order that doesn't violate the event_wait_list parameters you provided. Typically, they are used to overlap memory transfers and compute, and to keep multiple compute units busy. Out-of-order command queues and multiple command queues are both advanced OpenCL concepts and shouldn't be attempted until you fully understand and have experience with in-order command queues.
Clarification added later after DarkZeros pointed out the "on the device" part of the OP's question: My answer was from the host thread API point of view. On the device, with an in-order command queue all downstream commands are blocked by the current command. With an out-of-order queue they are only blocked by the event_wait_list. However, out-of-order command queues are not well supported in today's drivers. With multiple command queues, in theory commands are only blocked by prior commands (if in-order) and the event_wait_list. In reality, there are sometimes special vendor rules that prevent the free flowing of potentially non-blocked commands that you might like. This is often because the multiple OpenCL command queues get transferred to device-side memory and compute queues, and get executed in-order there. So depending on the order that you add commands to your multiple command queues, they might get interleaved in such a way that they block in sub-optimal ways. The best solution I'm aware of is to either be careful about the order you enqueue (based on knowledge of this implementation detail), or use one queue for memory and one for compute, which matches the device-side queueing.
If overlap of memory and compute is your goal, both AMD and NVIDIA both provide examples of how to overlap memory and compute operations, and for GPUs that support multiple compute operations, how to do that too. NVIDIA examples are hard to get ahold of but they are out there (from CUDA 4 days).

Non-blocking write into a in-order queue

I have a buffer created with CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE flags. I have used this in one kernel and then downloaded (queue.enqueueReadBuffer(...)) the data back to the host memory set when the buffer was created. I have modified these data on CPU and now I'd like to use them in another kernel.
When I have uploaded (queue.enqueueWriteBuffer) the data manually using non-blocking write and then enqueued kernel with this buffer as argument, it returned the CL_OUT_OF_RESOURCES error. Blocking write was just fine.
Why did this happen? I thought that the blocking/non-blocking version only controls if I can work with the memory on CPU after the enqueueWriteBuffer call returns, with in-order queue there should be no difference for the kernel.
Second question is whether I have to upload it manually at all - does the CL_MEM_USE_HOST_PTR mean that the data has to be uploaded from host to device in for every time some kernel uses the buffer as argument? As I have to download the data manually when I require them, has the above mentioned flag any pros?
Thanks
I can't be sure of the specific problem for your CL_OUT_OF_RESOURCES error. This error seems to be raised as kind of a catch-all for problems in the system, so the actual error you're getting might be caused by something else in your program (maybe the kernel).
In regards to using the CL_MEM_USE_HOST_PTR, you still still have to manually upload the data. The OpenCL specification states:
This flag is valid only if host_ptr is not NULL. If specified, it
indicates that the application wants the OpenCL implementation to use
memory referenced by host_ptr as the storage bits for the memory
object. OpenCL implementations are allowed to cache the buffer
contents pointed to by host_ptr in device memory. This cached copy can
be used when kernels are executed on a device.
For some devices the data will be cached on the device memory. In order to sync your data you would have to use some clEnqueueReadBuffer / clEnqueueWriteBuffer or clEnqueueMapBuffer / clEnqueueUnmapBuffer. For discrete CPU+GPU combinations (i.e. seperate GPU card), I'm not sure what benefit there would be to doing CL_MEM_USE_HOST_PTR, since the data will be cached anyway.
Upon reading the specification, there might be some performance benefit for using clEnqueueMapBuffer / clEnqueueUnmapBuffer instead of clEnqueueReadBuffer / clEnqueueWriteBuffer, but I haven't tested this for any real devices.
Best of luck!

What are some tips for buffer usage and tuning in custom TCP services?

I've been researching a number of networking libraries and frameworks lately such as libevent, libev, Facebook Tornado, and Concurrence (Python).
One thing I notice in their implementations is the use of application-level per-client read/write buffers (e.g. IOStream in Tornado) -- even HAProxy has such buffers.
In addition to these application-level buffers, there's the OS kernel TCP implementation's buffers per socket.
I can understand the app/lib's use of a read buffer I think: the app/lib reads from the kernel buffer into the app buffer and the app does something with the data (deserializes a message therein for instance).
However, I have confused myself about the need/use of a write buffer. Why not just write to the kernel's send/write buffer? Is it to avoid the overhead of system calls (write)? I suppose the point is to be ready with more data to push into the kernel's write buffer when the kernel notifies the app/lib that the socket is "writable" (e.g. EPOLLOUT). But, why not just do away with the app write buffer and configure the kernel's TCP write buffer to be equally large?
Also, consider a service for which disabling the Nagle algorithm makes sense (e.g a game server). In such a configuration, I suppose I'd want the opposite: no kernel write buffer but an application write buffer, yes? When the app is ready to send a complete message, it writes the app buffer via send() etc. and the kernel passes it through.
Help me to clear up my head about these understandings if you would. Thanks!
Well, speaking for haproxy, it has no distinction between read and write buffers, a buffer is used for both purposes, which saves a copy. However, it is really painful to do some changes. For instance, sometimes you have to rewrite an HTTP header and you have to manage to move data correctly for your rewrite, and to save some state about the previous header's value. In haproxy, the connection header can be rewritten, and its previous and new states are saved because they are need later, after being rewritten. Using a read and a write buffer, you don't have this complexity, as you can always look back in your read buffer if you need any original data.
Haproxy is also able to make use of splicing between sockets on Linux. This means that it does not read nor write data, it just tells the kernel what to take where, and where to move it. The kernel then automatically moves pointers without copying data to transfer TCP segments from a network card to another one (when possible), but data are then never transferred to user space, thus avoiding a double copy.
You're completely right about the fact that in general you don't need to copy data between buffers. It's a waste of memory bandwidth. Haproxy runs at 10Gbps with 20% CPU with splicing, but without splicing (2 more copies), it's close to 100%. But then consider the complexity of the alternatives, and make your choice.
Hoping this helps.
When you use asynchronous socket IO operation, the asynchronous read/write operation returns immediately, since the asynchronous operation does not guaranty dealing all the data (ie put all the required data to TCP socket buffer or get all the required data from it) successfully with one invocation, the partial data must outlive through mutiple operations. Then you need an application buffer space to keep the data as long as IO operations last.

TCP Socket Piping

Suppose that you have 2 sockets(each will be listened by other TCP peers) each resides on the same process, how these sockets could be bound, meaning input stream of each other will be bound to output stream of other. Sockets will continuously carry data, no waiting will happen. Normally thread can solve this problem but, rather than creating threads is there more efficient way of piping sockets?
If you need to connect both ends of the socket to the same process, use the pipe() function instead. This function returns two file descriptors, one used for writing and the other used for reading. There isn't really any need to involve TCP for this purpose.
Update: Based on your clarification of your use case, no, there isn't any way to tell the OS to connect the ends of two different sockets together. You will have to write code to read from one socket and write the same data to the other. Depending on the architecture of your process, you may or may not need an additional thread to do this work. For example, if your application is based on a select() loop, then creating another thread is not necessary.
You can avoid threads with an event queue within the process. The WP Message queue article assumes you want interprocess message passing, but if you are using sockets, you kind of are doing interprocess message passing over the same process.

Resources