When we have multi-core CPU, OpenCL treats it as a single device with multiple compute units and for every device we can create some command queues.How can CPU as a host, create a command queue on itself? I think in this situation it become multithreading rather than parallel computing.
Some devices, including most CPU devices can be segmented into sub-devices using the extension "cl_ext_device_fission". When you use device fission, you are still getting parallel processing in that the host thread can do other tasks while the kernel is running on some CPU cores.
When not using device fission, the CPU device will block essentially block the host program while a kernel is running. Even if some opencl implementations are non-blocking during kernel execution, the performance hit to the host would be too great to allow much work to be done by the host thread.
So it's still parallel computation, but I guess the host application's core is technically multi-threading during kernel execution.
It's parallel computing using multithreading. When you use an OpenCL CPU driver and enqueue kernels for the CPU device, the driver uses threads to execute the kernel, in order to fully leverage all of the CPUs cores (and also typically uses vector instructions like SSE to fully utilize each core).
Related
So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU the instructions to run. e.g. I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level what instructions are called in the host process to tell the GPU "here's what you're going to run." What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to an in detail explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
The buffer is stored entirely in System RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
The enqueue the kernel means ask an OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which will add the dispatch compute workload command to the command queue - cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger HW to process commands recorded into a command queue (which holds all actual commands in the format which particular HW understands). HW might have several queues and process them in parallel. Anyway after the workload from a queue is processed, HW will inform the KMD driver via an interrupt, and KMD is responsible to propagate this update to OpenCL driver via OpenCL supported event mechanism, which allows user to track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get better idea how the OpenCL driver interacts with a HW you could take a look into the opensource implementation, see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c
I want to run heterogeneous kernels that execute on a single GPU asynchronously. I think this is possible in Nvidia Kepler K20(Or any device having compute capability 3.5+) by launching each of this kernels to a different stream and the runtime system maps them to different hardware queues based on the resource availability.
Is this feature accessible in OpenCL?
If it is so, what is the equivalent of a CUDA 'Stream' in OpenCL?
Do Nvidia drivers support such an execution on their K20 cards through OpenCL?
Is their any AMD GPU that has similar feature(or is there anything on development)?
Answer for any of these questions will help me a lot.
In principle, you can use OpenCL command queues to achieve CKE (Concurrent Kernel Execution). You can launch them from different CPU threads. Here are few links that might help you get started:
How do I know if the kernels are executing concurrently?
http://devgurus.amd.com/thread/142485
I am not sure how would it work with NVIDIA Kepler GPUs as we are having strange issues using OpenCL on K20 GPU.
Little disclaimer: This is more the kind of theoretical / academic question than an actual problem I've got.
The usual way of setting up a parallel program in OpenCL is to write a C/C++ program, which sets up the devices (GPU and/or other CPUs), kernel and data buffers for executing the kernel on the device.
This program gets launched from the host, which used to be a CPU.
Would it be possible to write a OpenCL program where the host is a GPU and the devices other GPUs and/or CPUs?
What would be the prerequisites for such a scenario?
Do one need a special GPU or would it be possible to use any OpenCL-capable GPU?
Are you looking for a complete host or just a kernel launcher?
Up coming CUDA (v 5.0) introduces a feature to launch a kernel inside a kernel. Therefore, a device can be used for launching a kernel on itself. May be this feature will be supported by OpenCL too in near future.
I was just wondering how is it possible that OpenMP (shared memory) and MPI (distributed memory) could run on normal desktop CPUs like i7 for example. Is there some kind of a virtual machine that can simulate shared and distributed memory on these CPUs? I am asking it because when learnig OpenMP and MPI, the structures of supercomputers is shown, with shared memory or different nodes for distributed memory, each node with its own processor and memory.
MPI assumes nothing about how and where MPI processes run. As far as MPI is concerned, processes are just entites that have a unique address known as their rank and MPI gives them the ability to send and receive data in the form of messages. How exactly are the messages transfered is left to the implementation. The model is so general that MPI can run virtually on any platform imaginable.
OpenMP deals with shared memory programming using threads. Threads are just concurrent instruction flows that can access a shared memory space. They can execute in a timesharing fashion on a single CPU core or they can execute on multiple cores inside a single CPU chip, or they can be distributed among multiple CPUs connected together by some sophisticated network that allows them to access each others memory.
Given all that, MPI does not require that each process executes on a dedicated CPU core or that millions of cores should be necessarily put on separate boards connected with some high speed network - performance does, as well as technical limitations. You can happily run a 100 processes MPI job on a single CPU core though performance would be very very bad but it will still work (given enough memory is available). The same applies to OpenMP - it does not require that each thread is scheduled on a dedicated CPU core but doing so gives the best performance.
That's why MPI and OpenMP are called abstractions - they are general enough that the execution hardware can vary greatly while source code is kept the same.
A modern multicore-CPU-based PC is a shared-memory computer. It is a sensible approximation to think of each core as a processor, and that they all have equal access to the same RAM. This approximation hides a lot of details of processor and chip architectures.
It has always (well, perhaps not always, but for almost as long as MPI has been around) been possible to use message-passing (of which MPI is one standard) on a shared-memory computer so that you can run the same MPI-enabled program as you would on a genuinely distributed-memory machine.
At the application level a programmer only cares about calls to MPI routines. At the systems level the MPI run-time translates these calls into, well on a cluster or supercomputer, into instructions to send stuff over the interconnect. On a shared-memory computer it could instead translate these calls into instructions to send stuff over the internal bus.
This is by no means a comprehensive introduction to the topics you've raised, but that's what Google and all the published sources out there are for.
I am new to OpenCL, please tell me that the host cpu can be used only for allocating memory to the device, or we can use it can as an openCL device. (Because after the allocation is done, the host cpu will be idle).
You can use a cpu as a compute device. Opencl even allows multicore/processor systems to segment cores into separate compute units. I like to use this feature to divide the cpus on my system into groups based on NUMA nodes. It is possible to divide a cpu into compute devices which all share the same level of cache memory (L1, L2, L3 or L4).
You need a platform that supports it, such as AMD's SDK. I know there are ways to have Nvidia and AMD platforms on the same machine, but I have never had to do so myself.
Also, the opencl event/callback system allows you to use your cpu as you normally would while the gpu kernels are executing. In this way, you can use openmp or any other code on the host while you wait for the gpu kernel to finish.
There's no reason the CPU has to be idle, but it needs a separate job to do. Once you've submitted work to OpenCL you can:
Get on with something else, like preparing the next set of work, or performing calculation on something completely different.
Have the CPU set up as another compute device, and so submit a piece of work to it.
Personally I tend to find myself needing the first case more often as it's rare I find myself with two tasks that are independent and lend themselves to OpenCL style. The trick is keeping things balanced so you're not waiting a long time for the GPU task to finish, or having the GPU idle while the CPU is getting on with other work.
It's the same problem OpenGL coders had to conquer. Avoiding being CPU or GPU bound, and balancing between the two for best performance.