I am comparing the performance of OpenMP with that of OpenCL on CPUs, and my system has 8 cores. I need comparisons for 2, 4, 6 and 8 cores. In OpenMP I can set the number of cores with the "omp_set_num_threads(n)" function or the OMP_NUM_THREADS environment variable, but I don't know how to do the same in OpenCL. Is there an OpenCL equivalent of OpenMP's omp_set_num_threads API?
There is no standard way to do this. OpenCL will try to use all of the resources available on an OpenCL device.
One possibility you could look into is the device fission extension. It allows you to divide the device (in this case the CPU) into smaller logical devices. It is currently supported on CPUs by AMD's implementation at least. Do a search and you'll find some more resources from AMD as well.
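As a rough illustration of what a fission request looks like, the sketch below builds the property list for a "partition by counts" request, once for each core count in the question. It assumes the OpenCL 1.2 core form of device fission (clCreateSubDevices with CL_DEVICE_PARTITION_BY_COUNTS); the older cl_ext_device_fission extension uses _EXT-suffixed names instead. The helper partition_by_counts is hypothetical, and the numeric constants are copied from the Khronos cl.h header as an assumption, so real code should take them from the header or binding in use.

```python
# Assumed values, mirroring the Khronos cl.h header; real code should use the
# constants supplied by the header or OpenCL binding rather than these literals.
CL_DEVICE_PARTITION_BY_COUNTS = 0x1087
CL_DEVICE_PARTITION_BY_COUNTS_LIST_END = 0x0


def partition_by_counts(counts, max_compute_units):
    """Build the zero-terminated property list for a partition-by-counts request.

    counts holds the number of compute units wanted in each sub-device.
    (Hypothetical helper, not part of any OpenCL binding.)
    """
    if not counts or any(c <= 0 for c in counts):
        raise ValueError("each sub-device needs at least one compute unit")
    if sum(counts) > max_compute_units:
        raise ValueError("requested more compute units than the device has")
    return [CL_DEVICE_PARTITION_BY_COUNTS, *counts,
            CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0]


# One sub-device per benchmark configuration on the 8-core CPU from the question.
for cores in (2, 4, 6, 8):
    print(cores, partition_by_counts([cores], max_compute_units=8))
```

The resulting list is what would be passed as the properties argument of clCreateSubDevices; the benchmark would then create its context and command queue on the returned sub-device rather than on the full CPU device.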
CUDA MPS allows you to run multiple processes in parallel on the GPU, thus fully utilizing the GPU for operations that don't take full advantage. Is there an equivalent for OpenCL? Or is there a different approach in OpenCL?
If you use multiple OpenCL command queues that don't have event interdependencies, an OpenCL runtime could keep the GPU cores busy with varied work from each queue. It's really up to the implementation as to whether this actually happens. You'd need to check each vendor's OpenCL guide to see if they support concurrent GPU kernels.
I am using OpenCL to execute a procedure on different GPUs and CPUs simultaneously to get high-performance results. The Intel OpenCL implementation always reports that the kernel is not vectorized, so it runs on different cores but does not use SIMD instructions. My question is: if I rewrite the code so that the OpenCL code can exploit SIMD instructions, will that also increase GPU performance?
Yes, but beware that this is not necessary for good performance on AMD GCN-based APUs/GPUs or on Nvidia Fermi and later GPUs; they execute scalar operations with high utilization. CPUs and Intel's GPUs, however, can benefit greatly from SIMD instructions, which is what the vector operations boil down to.
I'm a newbie to OpenCL and have just started learning. I wanted to know whether it is possible to execute few threads on GPU and remaining threads on CPU? In other words, if I launch 100 threads and assume that I've 8 core CPU then is it possible that 8 threads out of 100 threads will execute on CPU and remaining 92 threads will run on GPU? Can OpenCL help me do this job smoothly?
I wanted to know whether it is possible to execute few threads on GPU and remaining threads on CPU?
Yes
In other words, if I launch 100 threads and assume that I've 8 core CPU then is it possible that 8 threads out of 100 threads will execute on CPU and remaining 92 threads will run on GPU?
No. That description suggests that you'd be viewing the GPU & CPU as a single compute resource. You can't do that.
That doesn't mean you can't have both working on the same task.
The GPU and CPU will be considered to be separate OpenCL devices.
You can write code that can talk to multiple devices.
You can compile the same kernel for multiple devices.
You can ask for multiple devices to do work at the same time.
...but...
None of this is automatic.
OpenCL won't split a single NDRange (or equivalent) call between multiple devices.
This means you'd have to schedule tasks between the two devices yourself.
There's going to be quite a large disparity in speed, so keeping it optimal will require more than "92 here, 8 there".
What I've found works better is having the CPU work on a different task whilst the GPU is working. Maybe preparing the next piece of work for the GPU, or post-processing the results from the GPU. Sometimes this is normal code. Sometimes it's OpenCL.
You can use multiple OpenCL devices to work on your algorithm, but the workload needs to be partitioned carefully so that the work is balanced properly across devices; otherwise the overhead may make your runtime worse.
It is stated clearly in the AMD OpenCL Programming Guide section 4.7 about using multiple OpenCL devices, so my answer is, yes, you can divide the work to be executed with multiple devices, smoothly, if and only if your scheduling algorithm is smart enough to balance the whole thing.
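To make the manual scheduling concrete, here is a minimal sketch of one possible static split: the index range is divided between devices in proportion to throughputs that you have measured yourself. The function split_work and the throughput figures are hypothetical and not taken from the AMD guide. Each resulting (offset, count) pair would correspond to the global_work_offset and global_work_size arguments of a separate clEnqueueNDRangeKernel call on that device's own queue.

```python
def split_work(total_items, throughputs):
    """Split total_items into contiguous (offset, count) ranges, one per device.

    throughputs holds one measured rate per device (for example items per
    second from a calibration run). Hypothetical helper for illustration.
    """
    if total_items < 0 or not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need a non-negative item count and positive throughputs")
    total_rate = sum(throughputs)
    ranges = []
    offset = 0
    for i, rate in enumerate(throughputs):
        if i == len(throughputs) - 1:
            # The last device takes whatever is left so that nothing is dropped.
            count = total_items - offset
        else:
            count = min(int(total_items * rate / total_rate), total_items - offset)
        ranges.append((offset, count))
        offset += count
    return ranges


# A CPU measured at 1 unit/s and a GPU at 9 units/s sharing 1000 work-items.
print(split_work(1000, [1.0, 9.0]))
```

In practice each count usually also has to be rounded to a multiple of the work-group size, and a static split like this only balances well if the measured throughputs stay representative for the actual kernel.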
OpenCL code is compiled at run time for the selected device (CPU, or a particular model of GPU).
You can switch which target you use for different tasks, but you can't (with any implementation I know of) have a single kernel launch split automatically between CPU and GPU.
Is it possible to achieve the same level of parallelism with a multi-core CPU device as with multiple heterogeneous devices (like a GPU and a CPU) in OpenCL?
I have an intel i5 and am looking to optimise my code. When I query the platform for devices I get only one device returned: the CPU. I was wondering how I could optimise my code by using this.
Also, if I used a single command queue for this device, would the application automatically assign the kernels to different compute devices or does it have to be done manually by the programmer?
Can a cpu device achieve the same level of parallelism as a gpu? Pretty much always no.
The number of compute units in a gpu is almost always more than in a cpu. For example, $50 can get you a video card with 10 compute units (Radeon 6450). The cheapest 8-core cpus on newegg are going for $189 (desktop cpu) and $269 (server).
The compute units of a cpu will run faster due to clock speed, and execute branching code much better than a gpu. You want a cpu if your workload has a lot of conditional statements.
A gpu will execute the same instructions on many pieces of data. The 6450 gpu has 16 'stream processors' per compute unit to make this happen. Gpus are great when you have to do the same (small/medium) tasks many times. Matrix multiplication, n-body computations, reduction operations, and some sorting algorithms run much better on gpu/accelerator hardware than on a cpu.
I answered a similar question with more detail a few weeks ago. (This one)
Getting back to your question about the "same level of parallelism": cpus don't have the same level of parallelism as gpus, except in cases where the gpu underperforms on the execution of the actual kernel.
On your i5 system, there would be only one cpu device. This represents the entire cpu. When you query for the number of compute units, opencl will return the number of cores you have. If you want to use all cores, you just run the kernel on your device, and opencl will use all of the compute units (cores) for you.
Short answer: yes, it will run in parallel and no, no need to do it manually.
Long answer:
Also, if I used a single command queue for this device, would the application automatically assign the kernels to different compute devices [...]
Either you need to revise your OpenCL vocabulary or I didn't understand your question. You only have one device and core != device!
One CPU, regardless of how many cores it has, is one device. The same goes for a GPU: one GPU, which has hundreds of cores, is only one device. You send jobs to the device through the queue and the device's driver. Your jobs can (and will) be split up into work-items. Then, some work-items (how many depends on the device and driver) are executed in parallel. On the GPU as well as on the CPU, one work-item is executed by one core. (This might not be completely true but it is a very helpful abstraction.)
If you enqueue several kernels in one queue (without connecting them through a wait event!), the driver may or may not run them in parallel.
It is the very goal of OpenCL to allow you to compute work-items in parallel, regardless of whether it is using several devices' cores in parallel or only a single device's cores.
If this confuses you, watch these really good (and long) videos: http://macresearch.org/opencl
How are you determining the OpenCL device count? I have an Intel i3 laptop that gives me 2 OpenCL compute units. It has 2 cores.
According to Intel's spec, an i5-2300 has 4 cores and supports 4 threads. It isn't hyper-threaded. I would expect an OpenCL query for the number of compute units to return 4.
With OpenCL's getDeviceInfo one can get the number of available compute units (CL_DEVICE_MAX_COMPUTE_UNITS). On my nVidia Geforce 8600GTS I have 4 compute units with 8 cores per unit. With getDeviceInfo(...CL_DEVICE_MAX_COMPUTE_UNITS...) I get 4 as answer for the compute units. But, how can I get the information about the number of cores per compute unit?
The OpenCL specification does not give any hint on that subject. Does anyone know how to retrieve the number of cores per compute unit in a standard way?
There is no way I am aware of - even the underlying CUDA APIs don't presently expose the multiprocessor internal configuration. In the context of OpenCL, where a compute unit might well be the core of a CPU, exposing the internal SIMD configuration via the API doesn't make that much sense, and isn't really all that useful anyway.
NVIDIA do provide the cl_nv_device_attribute_query extension which will give you the CUDA compute capability of the device. This then maps to cores per compute unit as:
1.0, 1.1, 1.2, 1.3: 8 cores per compute unit
2.0: 32 cores per compute unit
2.1: 48 cores per compute unit
It would be up to you to code this into a subroutine and keep it up to date as hardware changes. Being based on specifics of NVIDIA hardware and relying on an NVIDIA OpenCL extension, all of the above is totally non-portable to other platforms.
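Such a subroutine could look like the sketch below. The table contains only the mappings listed above and returns None for anything else, so newer hardware shows up as "unknown" rather than as a wrong number. The function names are hypothetical; the major and minor values are assumed to come from the CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV and CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV queries of the extension.

```python
# Compute capability (major, minor) -> cores per compute unit.
# Only the entries listed in the answer; extend as hardware changes.
CORES_PER_COMPUTE_UNIT = {
    (1, 0): 8,
    (1, 1): 8,
    (1, 2): 8,
    (1, 3): 8,
    (2, 0): 32,
    (2, 1): 48,
}


def cores_per_compute_unit(major, minor):
    """Return cores per compute unit, or None for an unknown compute capability."""
    return CORES_PER_COMPUTE_UNIT.get((major, minor))


def total_cores(compute_units, major, minor):
    """Multiply CL_DEVICE_MAX_COMPUTE_UNITS by the per-unit core count, if known."""
    per_unit = cores_per_compute_unit(major, minor)
    return None if per_unit is None else compute_units * per_unit


# The card from the question reports 4 compute units; compute capability 1.1
# is an assumption here and should be queried from the device in real code.
print(total_cores(4, 1, 1))
```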