I'm optimizing a pipeline by adding more queues, but is it possible to query the maximum number of supported command queues so I can make this dynamic instead of hand-tuned?
For example, AMD's R7 240 (a low-end graphics card) supports 16 queues, but I don't know about NVIDIA and Intel.
I have an OpenCL buffer created with the read/write flag. Can I access the same memory simultaneously? Say, calling enqueueReadBuffer and a kernel that doesn't modify the contents, out of order and without wait lists, or two calls to kernels that only read from the buffer.
Yes, you can do so. Create 2 queues, then call clEnqueueReadBuffer and clEnqueueNDRangeKernel on different queues.
It ultimately depends on whether the device and driver support executing different queues at the same time. Most GPUs can, while embedded devices may or may not.
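A minimal sketch of that pattern, assuming the context, device, read-only kernel and buffer already exist (all names here are illustrative, and error checks are omitted):

    #include <CL/cl.h>

    /* Overlap a read-only kernel with a read-back of the same buffer by
       putting them on two separate in-order queues with no wait list. */
    void overlap_read_and_kernel(cl_context ctx, cl_device_id dev,
                                 cl_kernel read_only_kernel, cl_mem buf)
    {
        cl_int err;
        cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
        cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

        size_t gsize = 1024;
        float host_copy[1024];

        /* The kernel (which only reads buf) goes to q0... */
        clEnqueueNDRangeKernel(q0, read_only_kernel, 1, NULL, &gsize, NULL,
                               0, NULL, NULL);
        /* ...while q1 reads the same buffer back; nothing orders the two. */
        clEnqueueReadBuffer(q1, buf, CL_FALSE, 0, sizeof(host_copy),
                            host_copy, 0, NULL, NULL);

        clFinish(q0);
        clFinish(q1);
        clReleaseCommandQueue(q0);
        clReleaseCommandQueue(q1);
    }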
I have the AMD S10000 GPU, which has 2 GPUs inside. When I run clinfo, the output looks like these are treated as separate GPUs. To run my kernel across both of these GPUs, do I need to create 2 separate OpenCL queues and partition my work-groups? Do these two GPUs share memory?
Yes, you will need to create separate command queues for each GPU and manually partition the workload between them. The GPUs do not share memory, so you will also have to make sure data is transferred to both GPUs as necessary.
If you create a single context containing both GPUs, the implementation will automatically deal with moving buffers between the GPUs as and when needed. However, in my experience it is often better to do this explicitly, as sometimes the implementation will generate false dependencies between kernels that both use the same buffer and will serialise kernel execution.
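A sketch of the explicit approach, assuming a single context already created over both devices and a kernel built for it; the names and the 50/50 split are illustrative, and error handling is omitted:

    #include <CL/cl.h>

    /* Split an N-item 1D NDRange evenly across the card's two GPUs,
       using one context but one queue per device. */
    void run_on_both_gpus(cl_context ctx, cl_device_id gpus[2],
                          cl_kernel k, size_t N)
    {
        cl_int err;
        cl_command_queue q[2];
        for (int i = 0; i < 2; ++i)
            q[i] = clCreateCommandQueue(ctx, gpus[i], 0, &err);

        /* First half of the range on GPU 0, second half on GPU 1. */
        size_t half = N / 2;
        size_t offset[2] = { 0, half };
        for (int i = 0; i < 2; ++i)
            clEnqueueNDRangeKernel(q[i], k, 1, &offset[i], &half, NULL,
                                   0, NULL, NULL);

        clFinish(q[0]);
        clFinish(q[1]);
        clReleaseCommandQueue(q[0]);
        clReleaseCommandQueue(q[1]);
    }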
Can queued kernels continue to execute while an OpenCL clEnqueueReadBuffer operation is occurring?
In other words, is clEnqueueReadBuffer a blocking operation on the device?
From a host API point of view, clEnqueueReadBuffer can be blocking or not, depending on whether you set the blocking_read parameter to CL_TRUE or CL_FALSE.
If you set it to not block, the read just gets queued and you should use an event (or a subsequent blocking call) to determine when it has finished, i.e., before you access the memory you are reading into.
If you set it to block, the call won't return until the read is done and the memory being read into is correct. Also (and answering your actual question), any operations you queued prior to the clEnqueueReadBuffer will have to finish before the read starts (see the exception note below).
All clEnqueue* API calls are asynchronous, but some have "blocking" parameters you can set. Setting one is equivalent to using the non-blocking version and then calling clFinish: the command queue is flushed to the device, and your host thread won't continue until the work has finished. Of course, it is hard to keep the GPU busy this way, since the device is left with nothing to do afterwards, but if you queue up new work fast enough you can still keep it reasonably busy.
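A short sketch contrasting the two modes; the queue and buffer are assumed to exist, and error checks are omitted:

    #include <CL/cl.h>

    void read_back(cl_command_queue q, cl_mem buf)
    {
        float out[256];
        cl_event read_done;

        /* Non-blocking: returns immediately. Wait on the event before
           touching 'out'. */
        clEnqueueReadBuffer(q, buf, CL_FALSE, 0, sizeof(out), out,
                            0, NULL, &read_done);
        /* The host is free to enqueue more work here. */
        clWaitForEvents(1, &read_done);
        clReleaseEvent(read_done);

        /* Blocking: does not return until 'out' holds the data, which on
           an in-order queue also means everything enqueued earlier has
           finished. */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(out), out,
                            0, NULL, NULL);
    }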
This all assumes a single, in-order command queue. If your command queue is out-of-order and your device supports out-of-order queues, then enqueued items can execute in any order that doesn't violate the event_wait_list parameters you provided. Likewise, you can have multiple command queues, whose commands can again execute in any order subject to the same event_wait_list constraints. They are typically used to overlap memory transfers and compute, and to keep multiple compute units busy. Out-of-order command queues and multiple command queues are both advanced OpenCL concepts and shouldn't be attempted until you fully understand, and have experience with, in-order command queues.
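For illustration, a sketch of an out-of-order queue where only the wait list imposes ordering (names are assumed; queue creation will fail with CL_INVALID_QUEUE_PROPERTIES if the device lacks out-of-order support):

    #include <CL/cl.h>

    void out_of_order_example(cl_context ctx, cl_device_id dev,
                              cl_kernel kernA, cl_kernel kernB,
                              cl_mem buf, void *host, size_t nbytes)
    {
        cl_int err;
        cl_command_queue ooq = clCreateCommandQueue(
            ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

        size_t gsize = 4096;
        cl_event deps[2];

        /* kernA and kernB have no wait list, so the device may run them
           in any order, possibly concurrently. */
        clEnqueueNDRangeKernel(ooq, kernA, 1, NULL, &gsize, NULL,
                               0, NULL, &deps[0]);
        clEnqueueNDRangeKernel(ooq, kernB, 1, NULL, &gsize, NULL,
                               0, NULL, &deps[1]);

        /* The read is ordered after both kernels only because of its
           event_wait_list. */
        clEnqueueReadBuffer(ooq, buf, CL_TRUE, 0, nbytes, host,
                            2, deps, NULL);

        clReleaseEvent(deps[0]);
        clReleaseEvent(deps[1]);
        clReleaseCommandQueue(ooq);
    }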
Clarification added later after DarkZeros pointed out the "on the device" part of the OP's question: my answer was from the host thread API point of view. On the device, with an in-order command queue, all downstream commands are blocked by the current command. With an out-of-order queue they are only blocked by the event_wait_list. However, out-of-order command queues are not well supported in today's drivers.
With multiple command queues, in theory commands are only blocked by prior commands (if in-order) and the event_wait_list. In reality, there are sometimes vendor-specific rules that prevent the free flow of potentially non-blocked commands. This is often because the multiple OpenCL command queues get mapped to device-side memory and compute queues, and are executed in-order there. So, depending on the order in which you add commands to your multiple command queues, they might get interleaved in a way that blocks sub-optimally.
The best solution I'm aware of is either to be careful about the enqueue order (based on knowledge of this implementation detail), or to use one queue for memory and one for compute, which matches the device-side queueing.
If overlapping memory transfers and compute is your goal, both AMD and NVIDIA provide examples of how to overlap memory and compute operations, and, for GPUs that support multiple concurrent compute operations, how to do that too. The NVIDIA examples are harder to get hold of, but they are out there (from the CUDA 4 days).
I am running my OpenCL C code on our institution's GPU cluster, which has 8 nodes, each with an 8-core Intel Xeon processor and 3 NVIDIA Tesla M2070 GPUs (24 GPUs in total). I need a way, from my host code, to identify which GPUs are already occupied and which are free, and to submit my jobs to the available GPUs. The closest answers I could find were:
How to programmatically discover specific GPU on platform with multiple GPUs (OpenCL 1.1)?
How to match OpenCL devices with a specific GPU given PCI vendor, device and bus IDs in a multi-GPU system?.
Can anyone help me out with how to choose a node, and a GPU on it, that is free for computation? I am writing in OpenCL C.
Gerald
Unfortunately, there is no standard way to do such a thing.
If you want to squeeze the full power of GPUs for computation and your problem is not a memory hog, I can suggest using two contexts per device: as kernels in the first one finish computing, kernels in the second one are still working, so you have time to fill the first context's buffers with data and start its next task, and vice versa. In my case (AMD GPU, OpenCL 1.2) it saves from 0 to 20% of computation time. Three contexts sometimes give slower execution, sometimes faster, so I do not recommend that as a standard technique, but you can try. Four or more contexts are useless, in my experience.
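A sketch of that ping-pong scheme, with each context reduced to a queue, kernel and input buffer created elsewhere; all names and the batch layout are illustrative:

    #include <CL/cl.h>

    /* Alternate between two contexts' queues: while one context's kernel
       runs, the host refills the other context's input buffer. */
    void ping_pong(cl_command_queue queue[2], cl_kernel kern[2],
                   cl_mem input[2], const float *host_in,
                   size_t chunk_elems, int num_batches)
    {
        size_t chunk_bytes = chunk_elems * sizeof(float);
        size_t gsize = chunk_elems;

        for (int batch = 0; batch < num_batches; ++batch) {
            int cur = batch & 1;  /* alternate 0,1,0,1,... */

            /* Refill this context's input while the other computes... */
            clEnqueueWriteBuffer(queue[cur], input[cur], CL_FALSE, 0,
                                 chunk_bytes,
                                 host_in + batch * chunk_elems,
                                 0, NULL, NULL);
            /* ...then launch and push the work to the device. */
            clEnqueueNDRangeKernel(queue[cur], kern[cur], 1, NULL, &gsize,
                                   NULL, 0, NULL, NULL);
            clFlush(queue[cur]);
        }
        clFinish(queue[0]);
        clFinish(queue[1]);
    }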
Have a command queue for each device, then use OpenCL events with each kernel submission, and check their state before submitting a new kernel for execution. Whichever command queue has the fewest unfinished kernels is the one you should enqueue to.
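A sketch of that bookkeeping, assuming you keep the event of the most recent kernel submitted to each device (names are illustrative):

    #include <CL/cl.h>

    /* Return the index of a device whose last kernel has finished,
       or -1 if every device is still busy. */
    int pick_free_device(cl_event last_event[], int num_devices)
    {
        for (int i = 0; i < num_devices; ++i) {
            cl_int status;
            if (last_event[i] == NULL)   /* nothing submitted yet */
                return i;
            clGetEventInfo(last_event[i],
                           CL_EVENT_COMMAND_EXECUTION_STATUS,
                           sizeof(status), &status, NULL);
            if (status == CL_COMPLETE)
                return i;
        }
        return -1;
    }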
Is it possible to achieve the same level of parallelism with a multi-core CPU device as with multiple heterogeneous devices (like a GPU and a CPU) in OpenCL?
I have an Intel i5 and am looking to optimise my code. When I query the platform for devices, I get only one device returned: the CPU. I was wondering how I could optimise my code using this.
Also, if I used a single command queue for this device, would the application automatically assign the kernels to different compute devices or does it have to be done manually by the programmer?
Can a CPU device achieve the same level of parallelism as a GPU? Pretty much always no.
The number of compute units in a GPU is almost always higher than in a CPU. For example, $50 can get you a video card with 10 compute units (Radeon 6450). The cheapest 8-core CPUs on Newegg are going for $189 (desktop) and $269 (server).
The compute units of a CPU run faster due to clock speed, and execute branching code much better than a GPU's. You want a CPU if your workload has a lot of conditional statements.
A GPU will execute the same instructions on many pieces of data. The 6450 GPU has 16 'stream processors' per compute unit to make this happen. GPUs are great when you have to perform the same (small/medium) task many times. Matrix multiplication, n-body computations, reduction operations, and some sorting algorithms run much better on GPU/accelerator hardware than on a CPU.
I answered a similar question with more detail a few weeks ago. (This one)
Getting back to your question about the "same level of parallelism": CPUs don't have the same level of parallelism as GPUs, except in cases where the GPU underperforms on the execution of the actual kernel.
On your i5 system, there would be only one CPU device. This represents the entire CPU. When you query the number of compute units, OpenCL will return the number of cores you have. If you want to use all cores, you just run the kernel on your device, and OpenCL will use all of the compute units (cores) for you.
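For example, a small query of that count (a sketch; the device handle is assumed to come from clGetDeviceIDs):

    #include <CL/cl.h>
    #include <stdio.h>

    void print_compute_units(cl_device_id dev)
    {
        cl_uint cus;
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cus), &cus, NULL);
        /* On a CPU device this typically equals the number of hardware
           threads the runtime will use. */
        printf("compute units: %u\n", cus);
    }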
Short answer: yes, it will run in parallel, and no, there is no need to do it manually.
Long answer:
Also, if I used a single command queue for this device, would the application automatically assign the kernels to different compute devices [...]
Either you need to revise your OpenCL vocabulary or I didn't understand your question. You only have one device and core != device!
One CPU, regardless of how many cores it has, is one device. The same goes for a GPU: one GPU, which has hundreds of cores, is only one device. You send jobs to the device through the queue and the device's driver. Your jobs can (and will) be split up into work-items. Then some work-items (how many depends on the device/driver) are executed in parallel. On the GPU as well as on the CPU, one work-item is executed by one kernel instance. (This might not be completely true, but it is a very helpful abstraction.)
If you enqueue several kernels in one queue (without connecting them through a wait event!), the driver may or may not run them in parallel.
It is the very goal of OpenCL to let you compute work-items in parallel, regardless of whether that happens across several devices' cores or on a single device's cores.
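To make that concrete, a sketch of a single-device launch: one queue, one enqueue, and the runtime distributes the work-items over all compute units (cores) by itself. Names are illustrative.

    #include <CL/cl.h>

    void run_on_all_cores(cl_command_queue q, cl_kernel k,
                          size_t num_items)
    {
        /* One work-item per data element; no manual per-core
           assignment is needed. */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &num_items, NULL,
                               0, NULL, NULL);
        clFinish(q);
    }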
If this confuses you, watch these really good (and long) videos: http://macresearch.org/opencl
How are you determining the OpenCL device count? I have an Intel i3 laptop that gives me 2 OpenCL compute units; it has 2 cores.
According to Intel's spec, an i5-2300 has 4 cores and supports 4 threads; it isn't hyper-threaded. I would expect an OpenCL query of the device's compute-unit count to give you 4.