HEVC Deblocking with parallel processing on OpenCL

I have been working on HEVC for the past 2 years, and recently I was asked to port the x265 code to OpenCL for parallel processing. I am still at the starting stage and already see one concern: passing a class is not a possibility, and x265 uses many classes. Would it be possible to pass a structure instead, given that I have some function prototypes within the class? Is it possible to replicate the same structure on the GPU?

Yes, as you mentioned, you will not be able to pass a class to the kernel function. However, you can gather the relevant data members into a plain C struct and pass that to the GPU. You can refer to this link: passing parameters of an kernel function as C++ struct?
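For illustration, here is a minimal sketch of passing a struct by value to a kernel. The struct layout and field names are hypothetical, as are the kernel and frame_buf handles; the same struct definition must be visible to both the host compiler and the OpenCL C compiler, and only plain data members (no pointers to host memory, no member functions) are allowed:

    /* Shared struct definition; it must appear verbatim in the kernel
       source as well, with identical layout. Field names are
       illustrative only. */
    typedef struct {
        int   frame_width;
        int   frame_height;
        float beta_offset;
    } DeblockParams;

    /* OpenCL C side (inside the kernel source string):
       __kernel void deblock(__global uchar *frame, DeblockParams p) { ... }
    */

    /* Host side: pass the struct by value as a kernel argument. */
    DeblockParams p = { 1920, 1080, 0.0f };
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &frame_buf);
    clSetKernelArg(kernel, 1, sizeof(DeblockParams), &p);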

Related

Is there a way to simplify OpenCL kernel usage?

To use an OpenCL kernel, the following steps are needed:
Put the kernel code in a string
call clCreateProgramWithSource
call clBuildProgram
call clCreateKernel
call clSetKernelArg (once per argument)
call clEnqueueNDRangeKernel
This needs to be done for each kernel. Is there a way to do this with less repeated code per kernel?
There is no way to skip these steps; you need to go through them one by one, as you listed.
But it is important to know why these steps are needed, to understand how flexible the chain is.
clCreateProgramWithSource: Allows you to combine different strings from different sources into one program. Some strings might be static, while others might be downloaded from a server or loaded from disk. This allows the CL code to be dynamic and updated over time.
clBuildProgram: Builds the program for a given device. Maybe you have 8 devices, so you need to call this multiple times; each device will produce different binary code.
clCreateKernel: Creates a kernel. A kernel is just an entry point in a binary, so it is possible to create multiple kernels from one program (one per function). The same kernel might also be created multiple times, since a kernel object holds its arguments; this is useful for having ready-to-launch instances with their parameters already set.
clSetKernelArg: Changes one parameter in the kernel instance. (The argument is stored in the kernel object, so it can be reused multiple times in the future.)
clEnqueueNDRangeKernel: Launches it, configuring the size of the launch and the chain of dependencies with other operations.
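Concretely, the last two steps might look like this (a sketch; kernel, queue, buf, and n are hypothetical handles and values):

    /* Set the kernel's arguments, then enqueue a 1D ND-range launch. */
    size_t global_size = 1024;   /* hypothetical number of work items */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(int), &n);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, NULL);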
So, even if you had a single "getKernelFromString()" call, its functionality would be very limited and not very flexible.
You can have a look at wrapper libraries:
https://streamhpc.com/knowledge/for-developers/opencl-wrappers/
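As a sketch of what such a wrapper can hide, a minimal helper folding the create/build/create-kernel chain into one call might look like this (error handling reduced to a build-log dump; not a production implementation):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Create, build, and extract one kernel from an OpenCL C source
       string, for a single device. Returns NULL on failure. */
    cl_kernel kernel_from_source(cl_context ctx, cl_device_id dev,
                                 const char *src, const char *name)
    {
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        if (err != CL_SUCCESS)
            return NULL;

        err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
        if (err != CL_SUCCESS) {
            char log[4096];
            clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                                  sizeof(log), log, NULL);
            fprintf(stderr, "build failed:\n%s\n", log);
            clReleaseProgram(prog);
            return NULL;
        }

        cl_kernel k = clCreateKernel(prog, name, &err);
        clReleaseProgram(prog);  /* the kernel keeps the program alive */
        return (err == CL_SUCCESS) ? k : NULL;
    }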
I suggest you look into SYCL. The build steps are performed offline, saving execution time by skipping clCreateProgramWithSource. The argument setting is done automatically by the runtime, which extracts the information from the user's lambda.
There is also CLU: https://github.com/Computing-Language-Utility/CLU - see https://www.khronos.org/assets/uploads/developers/library/2012-siggraph-opencl-bof/OpenCL-CLU-and-Intel-SIGGRAPH_Aug12.pdf for more info. It is a very simple tool, but should make life a bit easier.

What's the difference between a normal memory object and OpenCL's pipe?

Pipes are one of OpenCL 2.0's new features, demonstrated in the AMD APP SDK's producer/consumer example. I've read some articles about pipes' use cases, and they all follow the producer/consumer pattern.
My question is: the same functionality can be achieved by creating a global memory object and passing the pointer to two kernel functions, given that OpenCL 2.0 provides shared virtual memory. So what's the difference between a pipe object and a global memory object? Or is it invented just for optimization?
Pipes relate to buffers much as std::queue relates to std::vector: one is useful for storing data, while the other is useful for storing packets.
Packets are indeed data, but it is much easier to handle them as small units than as one big block.
Pipes in OpenCL let a kernel consume these small packets without the indexing + storing + pointers + for-loop hell that would result from implementing a pipe mechanism manually in the kernel.
Pipes are useful, for example, when each work item can generate a variable number of outputs. Prior to OpenCL 2.0 this was difficult to handle.
Pipes may also reside in faster memory (this is vendor specific); Altera, for example, recommends using pipes to exchange data between kernels instead of going through global memory.
Pipes are designed to transfer data from one kernel to another kernel (or kernels) without storing and loading the data in global or host memory. On an FPGA this is essentially a FIFO on the device, so access to the data is much faster than going through DDR or host memory. This is probably a big part of the reason to use an FPGA as an accelerator.
Sometimes DDR is still used to share data between kernels. One example is a SIMD kernel that wants to share data with a single-task kernel that depends on the input data order, since pipes are consumed out of order when the producer runs in a SIMD fashion.
Besides pipes, you can use Altera channels for more functionality, but those are not portable to other OpenCL devices.
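For reference, a minimal producer/consumer pair using an OpenCL 2.0 pipe might look like the following sketch (the kernel names and int payload are illustrative; the host would create the pipe object with clCreatePipe):

    /* Producer: each work item writes one packet into the pipe. */
    __kernel void producer(__global const int *in,
                           __write_only pipe int out_pipe)
    {
        int gid = get_global_id(0);
        write_pipe(out_pipe, &in[gid]);
    }

    /* Consumer: each work item reads one packet from the pipe.
       Packets arrive in no guaranteed order. */
    __kernel void consumer(__read_only pipe int in_pipe,
                           __global int *out)
    {
        int v;
        if (read_pipe(in_pipe, &v) == 0)   /* 0 means success */
            out[get_global_id(0)] = v;
    }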
Hope this can help. :)

How to get kernel information

I want to get the following information about compiled OpenCL kernels: the list of argument types and their order (if possible, with memory and access qualifiers). The kernels are built from source at application run time.
OpenCL 1.2 already has an appropriate function for such a query, clGetKernelArgInfo, but due to project restrictions I have to find a way to achieve this functionality using pure OpenCL 1.0 without any extensions.
At present, I am thinking about three approaches:
write a simple ANSI C parser to extract the kernel's signature directly from the OpenCL kernel source
use macros in the OpenCL code to mark the kernel's arguments for simple in-app parsing (by extending this idea; see the sketch below)
define a list of the most likely combinations of kernel arguments using macros and helper classes (due to my project's constraints, only 3-5 common argument types need to be handled)
My question: are there any other ways to get information about a compiled kernel?
I want to use this info to reduce the amount of OpenCL boilerplate in client code by encapsulating the calls to clCreateBuffer, clEnqueueWriteBuffer/clEnqueueReadBuffer, and clSetKernelArg in a small wrapper that checks the provided parameters, allocates device-side buffers, copies data to/from the host, and so on.
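A sketch of the macro-marking approach (the second item above), assuming a simple host-side scanner that searches the kernel source string for the markers; the macro name and the kernel itself are hypothetical:

    /* In the kernel source: the macro expands to an ordinary parameter
       declaration, but leaves an easily parseable marker for the host. */
    #define KERNEL_ARG(qualifier, type, name) qualifier type name

    __kernel void scale(KERNEL_ARG(__global, float *, data),
                        KERNEL_ARG(const, float, factor))
    {
        data[get_global_id(0)] *= factor;
    }

    /* On the host, a trivial scanner can search the source string for
       "KERNEL_ARG(" and split each occurrence on commas to recover the
       qualifier, type, and name of every argument, in order. */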
The Khronos WebCL Validator gives you the equivalent of clGetKernelArgInfo, including all qualifiers.
The necessary downside is that it's a complete parser, based on Clang/LLVM. It takes roughly the same amount of time to run as a typical OpenCL compiler (not a coincidence), and adds around 10 megabytes to your executable size.

Multiple programs with the same kernel names

I have the following situation:
Two threads handle two OpenCL devices which share the same context. Each thread loads a different version of the OpenCL device code, creates a cl::Program instance, and compiles the code for its specific cl::Device. However, after the program builds successfully, the createKernels function fails with error code -47, CL_INVALID_KERNEL_DEFINITION:
"if the function definition for __kernel function given by kernel_name such as the number of arguments, the argument types are not the same for all devices for which the program executable has been built."
With multiple cl::Context instances (one per device) this worked well. Looking at the OpenCL class diagram (http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/classDiagram.html), I don't see why it should not be possible to use multiple programs with multiple kernels within one context, as they are clearly distinguishable via the associated programs.
I'm using Nvidia's OpenCL implementation from the CUDA SDK 5.5. The questions that arise for me are:
Is this a general misunderstanding of the OpenCL structure, and is there a rule that every kernel within a context must have a unique name, or is this one of Nvidia's non-standard-conforming ways of handling this particular use case?
I really want multiple devices within one context so that I can copy from one cl::Buffer to another even if their memory resides on different devices.

What should replace "memcpy" inside OpenCL kernels?

The OpenCL language, which extends C99, does not provide the memcpy function. What should be used instead?
As far as I know, nothing like that is defined in OpenCL. OpenCL does not provide a concept of dynamic memory, and therefore such functionality is mostly unnecessary.
You can simply run over your array with a for loop and copy the data element by element, as in the sketch below. The target array has a fixed size anyway, since the array length must be specified at compile time.
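A minimal sketch of such an element-wise copy as a helper function in OpenCL C (the function name and float element type are illustrative):

    /* Hand-rolled replacement for memcpy inside kernel code:
       copies n elements from src to dst, one element per iteration. */
    void copy_floats(__global float *dst,
                     __global const float *src,
                     int n)
    {
        for (int i = 0; i < n; ++i)
            dst[i] = src[i];
    }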
More broadly, OpenCL (and OpenGL, as a kind of ancestor) was designed in a more static way: the data is provided to the GPU up front, the result size is defined in advance, and the GPU computes from the input into pre-defined output locations. It is not meant to spawn additional work on the GPU, and it is not meant to allocate memory dynamically, so that it does not interfere with the host doing so.
