OpenCL: writing a program to capture API calls (or draw calls) - opencl

I was wondering what the best way is to get started coding a middleware that will capture all OpenCL API calls. I could then write a program to replay the trace on a different system.
I'm assuming this will not need any special hooks in the driver. If it does, then I suppose we will not be able to do it.
I could not find examples on the internet. If you are aware of any resources (websites or books), can you please let me know?

Develop a wrapper library (say, myopencl.so) and link it against the system OpenCL library (say, libopencl.so).
In myopencl.so, implement entry points for all OpenCL API calls as follows:
typedef cl_int (*clEnqueueNDRangeKernel_fptr)(/*arguments here*/);
cl_int clEnqueueNDRangeKernel(/*arguments here*/)
{
// Do whatever you want to do (e.g. log the call and its arguments);
// Get function pointer from system OpenCL library;
// dlsym() takes a handle first: RTLD_NEXT (needs _GNU_SOURCE and <dlfcn.h>) or a handle returned by dlopen();
clEnqueueNDRangeKernel_fptr enqueue_fptr = (clEnqueueNDRangeKernel_fptr)dlsym(RTLD_NEXT, "clEnqueueNDRangeKernel");
// Call actual OpenCL function;
return enqueue_fptr(/*arguments here*/);
}
Link your application against myopencl.so instead of libopencl.so. You'll have to write a lot of boilerplate code, though.
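To make the capture-and-replay idea concrete, here is a minimal sketch in plain C that does not depend on the OpenCL headers. Everything in it is an assumption made for illustration: `enqueue_fn`, `fake_driver_enqueue`, `traced_enqueue`, `replay_line` and the text trace format are hypothetical names, and the function pointer that a real wrapper would obtain with dlsym() is simply set to a stand-in function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical signature standing in for a real OpenCL entry point. */
typedef int (*enqueue_fn)(size_t global_size, size_t local_size);

/* Stand-in for the driver's implementation. A real wrapper would get
   this pointer from dlsym() instead. */
static int fake_driver_enqueue(size_t global_size, size_t local_size)
{
    return (local_size != 0 && global_size % local_size == 0) ? 0 : -1;
}

static enqueue_fn real_enqueue = fake_driver_enqueue;

#define TRACE_MAX 16
static char trace_lines[TRACE_MAX][64];
static int trace_count = 0;

/* Wrapper: forward the call, then record the arguments and the result. */
int traced_enqueue(size_t global_size, size_t local_size)
{
    assert(real_enqueue != NULL);
    int rc = real_enqueue(global_size, local_size);
    if (trace_count < TRACE_MAX) {
        snprintf(trace_lines[trace_count], sizeof trace_lines[0],
                 "enqueue %zu %zu -> %d", global_size, local_size, rc);
        trace_count++;
    }
    return rc;
}

/* Replay: parse one recorded line and issue the same call again. */
int replay_line(const char *line)
{
    size_t global_size, local_size;
    if (sscanf(line, "enqueue %zu %zu", &global_size, &local_size) != 2)
        return -1;
    return real_enqueue(global_size, local_size);
}
```

Calling `traced_enqueue(1024, 64)` stores one line in `trace_lines`, and `replay_line(trace_lines[0])` issues the same call again. Scalar arguments are the easy case. Real OpenCL calls also pass opaque handles (contexts, buffers, kernels) whose values differ from run to run, so a replayer would likely need a table that maps recorded handles to the handles created during replay.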

Related

Is there a way to simplify OpenCL kernel usage?

To use an OpenCL kernel, the following is needed:
Put the kernel code in a string
call clCreateProgramWithSource
call clBuildProgram
call clCreateKernel
call clSetKernelArg (x number of arguments)
call clEnqueueNDRangeKernel
This needs to be done for each kernel. Is there a way to do this while repeating less code for each kernel?
There is no way to shorten the process. You need to go step by step, as you listed.
But it is important to know why these steps are needed, to understand how flexible the chain is.
clCreateProgramWithSource: Lets you combine strings from different sources to generate the program. Some strings might be static, but some might be downloaded from a server or loaded from disk. It allows the CL code to be dynamic and updated over time.
clBuildProgram: Builds the program for a given device. Maybe you have 8 devices, so the program has to be built for each of them (either by passing a device list or by calling this multiple times). Each device will produce different binary code.
clCreateKernel: Creates a kernel. But a kernel is an entry point in a binary, so it is possible to create multiple kernels from one program (for different functions). The same kernel might also be created multiple times, since it holds the arguments. This is useful for having ready-to-be-launched instances with the proper parameters.
clSetKernelArg: Changes the parameters in the instance of the kernel. (They are stored there, so they can be used multiple times in the future.)
clEnqueueNDRangeKernel: Launches it, configuring the size of the launch and the chain of dependencies with other operations.
So, even if you had a way to just call "getKernelFromString()", the functionality would be very limited and not very flexible.
You can have a look at wrapper libraries:
https://streamhpc.com/knowledge/for-developers/opencl-wrappers/
I suggest you look into SYCL. The build steps are performed offline, saving execution time by skipping clCreateProgramWithSource. The argument setting is done automatically by the runtime, which extracts the information from the user lambda.
There is also CLU: https://github.com/Computing-Language-Utility/CLU - see https://www.khronos.org/assets/uploads/developers/library/2012-siggraph-opencl-bof/OpenCL-CLU-and-Intel-SIGGRAPH_Aug12.pdf for more info. It is a very simple tool, but should make life a bit easier.

HEVC Deblocking with parallel processing on OpenCL

I have been working on HEVC for the past 2 years, and recently I was asked to port the x265 code to OpenCL for parallel processing. I am still at the starting stage and already see some concerns: x265 uses many classes, and classes are not an option in OpenCL kernels. Would it be possible to pass a structure instead, given that I have some function prototypes within the class? Is it possible to replicate the same on the GPU?
Yes, as you have mentioned, we will not be able to pass a class to the kernel function. However, you can move the data members into a plain structure and pass that to the GPU; the member functions then have to become ordinary functions in the kernel source, because OpenCL C structs cannot contain functions. You can refer to this related question: "passing parameters of an kernel function as C++ struct?"
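A minimal sketch of that idea in plain C, assuming hypothetical names throughout (`DeblockParams`, its fields and `deblock_pixel_count` are made up for illustration and are not taken from x265). The data lives in a struct of fixed-width members so that host and device agree on the layout, and what would be a member function in C++ becomes a free function that takes the struct. On the host side the OpenCL headers provide types such as cl_int for this purpose; `int32_t` stands in for them here so the sketch compiles without those headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block: only plain data, no member functions. */
typedef struct {
    int32_t width;
    int32_t height;
    int32_t qp;
    int32_t stride;
} DeblockParams;

/* What would be a C++ member function becomes a free function
   that receives the struct explicitly. */
static int32_t deblock_pixel_count(const DeblockParams *p)
{
    assert(p != NULL);
    return p->width * p->height;
}
```

Keeping all members the same width avoids padding surprises; checking `sizeof` and `offsetof` on the host is a cheap way to confirm the layout before passing the struct as a kernel argument.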

Difference between write() and printf()

Recently I have been studying operating systems. I just want to know:
What’s the difference between a system call (like write()) and a standard library function (like printf())?
A system call is a call to a function that is not part of the application but is inside the kernel. The kernel is a software layer that provides you some basic functionalities to abstract the hardware to you. Roughly, the kernel is something that turns your hardware into software.
You always ultimately use write() to write anything to a peripheral, whatever kind of device you write to. write() is designed to write only a sequence of bytes, and nothing more. But as write() is considered too basic (you may want to write an integer in base ten, or a floating-point number in scientific notation, etc.), different programming environments provide different libraries to make this easier.
For example, the C programming language gives you printf(), which lets you write data in many different formats. So you can understand printf() as a function that converts your data into a formatted sequence of bytes and then calls write() to write those bytes to the output. C++ gives you cout, Java gives you System.out.println, and so on. Each of these functions ends in a call to write() (at least on POSIX systems).
One important thing to know is that such a system call is costly! It is not a simple function call, because you need to call something that is outside of your own code, and the system must ensure that you are not trying to do nasty things, etc. So it is very common for higher-level print-like functions to have some buffering built in: write() is not always called, but your data is kept in some hidden structure and written only when it is really necessary (the buffer is full, or you really want to see the result of your print).
This is exactly what happens when you manage your money. If many people give you 5 bucks each, you won't go and deposit each bill at the bank! You keep them in your wallet (this is the print) until it is full or you don't want to keep them anymore. Then you go to the bank and make one big deposit (this is the write). And you know that putting 5 bucks in your wallet is much, much faster than going to the bank and making a deposit. The bank is the kernel/OS.
System calls are implemented by the operating system, and run in kernel mode. Library functions are implemented in user mode, just like application code. Library functions might invoke system calls (e.g. printf eventually calls write), but that depends on what the library function is for (math functions usually don't need to use the kernel).
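The split between formatting in the library and output through a system call can be sketched as follows. This is an illustration only, not how any particular C library implements printf(); `print_int` is a hypothetical helper that formats an integer in user space with snprintf() and then hands the resulting bytes to write().

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: format in user space, then make one system call. */
ssize_t print_int(int fd, int value)
{
    assert(fd >= 0);
    char buf[32];
    /* Library work: turn the int into a base-ten byte sequence. */
    int len = snprintf(buf, sizeof buf, "%d\n", value);
    if (len < 0 || (size_t)len >= sizeof buf)
        return -1;
    /* Kernel work: write() only ever sees raw bytes. */
    return write(fd, buf, (size_t)len);
}
```

For example, `print_int(STDOUT_FILENO, 42)` sends the three bytes '4', '2' and a newline to standard output. Unlike printf(), this sketch does no buffering, so every call costs one system call.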
System calls are used to interact with the OS. E.g. write() could be used to write something into the system or into a program.
Standard library functions, on the other hand, are program specific. E.g. printf() will print something out, but only in the GUI/command line, and it won't affect the system.
I am writing a small program. At the moment it just reads each line from stdin and prints it to stdout. I can add a call to write in the loop, and it would add a few characters at the end of each line. But when I use printf instead, then all the extra characters are clustered and appear all at once, instead of appearing on each line.
It seems that using printf causes stdout to be buffered. Adding fflush(stdout); after calling printf fixes the discrepancy in output.
I'd like to mention another point: the stdio buffers are maintained in the process's user-space memory, while the write system call transfers data directly to a kernel buffer. This means that if you fork a process after a write call and a printf call, flushing may produce the output three times, depending on whether line buffering or block buffering is in effect. Two of the copies come from the printf call, since the stdio buffers are duplicated in the child by fork.
printf() is one of the APIs or interfaces exposed to user space for calling functions from the C library.
printf() actually uses the write() system call. The write() system call is what is actually responsible for sending data to the output.

How to get kernel information

I want to get the following information about compiled OpenCL kernels: the list of argument types and the parameter order (if possible, with memory and access qualifiers). Kernels are built from source at run time of the app.
Actually, OpenCL 1.2 already has an appropriate function for such queries, clGetKernelArgInfo, but due to project restrictions I have to find a way to achieve such functionality using pure OpenCL 1.0 without any extensions.
At present, I am thinking about three approaches:
write a simple ANSI C parser to get info about the kernel's signature directly from the OpenCL kernel's source
use macros in the OpenCL code to mark the kernel's arguments for simple in-app parsing (by extending this idea)
define a list of the most likely combinations of kernel arguments using macros and helper classes (due to my project's constraints it is possible to operate with 3-5 common argument types)
My question: are there any other ways to get info about a compiled kernel?
I want to use this info to decrease the amount of OpenCL routine in client code by encapsulating calls to clCreateBuffer, clEnqueueWrite/Read and clSetKernelArg in a small wrapper, which should check the provided params, allocate device-side pointers, copy data from/to the host and so on.
The Khronos WebCL Validator gives you the equivalent of clGetKernelArgInfo, including all qualifiers.
The necessary downside is that it's a complete parser, based on Clang/LLVM. It takes roughly the same amount of time to run as a typical OpenCL compiler (not a coincidence), and adds around 10 megabytes to your executable size.
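For comparison, the first approach from the question (a small hand-written parser) can be sketched in a few lines of C. This is a deliberately naive illustration under stated assumptions: `kernel_arg_count` is a hypothetical helper, it only counts parameters, and it assumes the source spells the qualifier as `__kernel`, contains no comments or macros in the signature, and puts no `__attribute__((...))` between `__kernel` and the kernel name.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Counts the parameters of kernel `name` in OpenCL C source `src`.
   Returns -1 if the kernel is not found or the signature is unterminated. */
int kernel_arg_count(const char *src, const char *name)
{
    assert(src != NULL && name != NULL);
    size_t name_len = strlen(name);
    const char *p = src;
    while ((p = strstr(p, "__kernel")) != NULL) {
        p += strlen("__kernel");
        const char *open = strchr(p, '(');
        if (open == NULL)
            return -1;
        /* Walk back from '(' to isolate the identifier in front of it. */
        const char *end = open;
        while (end > p && isspace((unsigned char)end[-1]))
            end--;
        const char *start = end;
        while (start > p && (isalnum((unsigned char)start[-1]) || start[-1] == '_'))
            start--;
        if ((size_t)(end - start) != name_len || strncmp(start, name, name_len) != 0)
            continue;   /* some other kernel, keep searching */
        /* Count top-level commas up to the matching ')'. */
        int depth = 0, commas = 0, nonblank = 0;
        for (const char *q = open; *q != '\0'; q++) {
            if (*q == '(') {
                depth++;
            } else if (*q == ')') {
                if (--depth == 0)
                    return nonblank ? commas + 1 : 0;
            } else if (depth == 1) {
                if (*q == ',')
                    commas++;
                else if (!isspace((unsigned char)*q))
                    nonblank = 1;
            }
        }
        return -1;
    }
    return -1;
}
```

Extending this to report types and address-space qualifiers means splitting each comma-separated parameter into tokens, at which point the limitations above start to matter and a real parser such as the Clang-based one becomes attractive.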

Replacing WaitForMultipleObjects in Qt

I am not familiar with WinAPI, and I am looking for a way to replace WaitForMultipleObjects, used in an example I'm porting to Qt, with something that uses Qt only. Is it possible?
EDIT: (Providing more information as requested in comments)
A 3rd party API provides an array of events:
HANDLE m_hEv[MAX_EV];
In an endless loop of a thread, the program waits for the events like this:
WaitForMultipleObjects(m_EvMax, m_hEv, FALSE, INFINITE)
The HANDLE type seems to be void*.
So I wonder, if any Qt class could observe m_hEv for changes and unlock thread execution.
There is no simple way of porting WaitForMultipleObjects outside WinAPI. WinAPI has the "advantage" that all lockable resources (sockets, files, processes) provide the same generic, non-typesafe HANDLE, which is your void*. Unlike other platforms, which have different ways of locking and signalling per type of resource, event handling in WinAPI is largely independent of the resources. That is why a generic function like WaitForMultipleObjects can exist, which doesn't need to care who produced the HANDLEs. So you'll have to understand what the code is trying to do and mimic it differently per scenario.
The biggest difference is in the third parameter of WaitForMultipleObjects, which is FALSE in your case. This means that it stops waiting as soon as any single event in the array is signalled. That is the easier scenario and can be replaced with a QWaitCondition.
Instead of m_hEv, you will pass a QWaitCondition* into the code which signals the event (most probably via WinAPI SetEvent(m_hEv[x]))
Instead of WaitForMultipleObjects, do QWaitCondition::wait().
Instead of SetEvent(), do QWaitCondition::wakeOne().
If the third parameter were TRUE, the WinAPI code would wait until ALL m_hEv events are signalled. The established name for such functionality is a synchronization barrier, and it can be simulated with QWaitCondition too, but it does not come with Qt out of the box. I never needed to write one myself, but SO has some ideas on how to do it:
Qt synchronization barrier?
WaitForMultipleObjects is a kind of generic function that works with many things: threads, processes, mutexes, etc. Qt is an OOP library where every class exposes the operations it supports. So the equivalent operation in Qt depends on what class you're using. For example, with threads, use QThread::wait. With mutexes, use QMutex::lock.
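The "wait for any event" pattern described above needs a little more than a bare condition: a mutex and a flag per event, so that a signal sent before the waiter starts waiting is not lost and the waiter can tell which event fired. The sketch below shows that structure in plain C with POSIX threads rather than Qt, purely as an illustration. `EventSet`, `event_set_signal` and `event_set_wait_any` are hypothetical names, and auto-reset behaviour is an assumption. In Qt the same roles would be played by QMutex, QWaitCondition::wait(&mutex) and QWaitCondition::wakeAll().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_EV 4

/* One mutex and one condition variable guard all event flags. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool signalled[MAX_EV];
} EventSet;

#define EVENT_SET_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, { false } }

/* Counterpart of SetEvent(m_hEv[i]). */
void event_set_signal(EventSet *es, int i)
{
    assert(i >= 0 && i < MAX_EV);
    pthread_mutex_lock(&es->lock);
    es->signalled[i] = true;
    pthread_cond_broadcast(&es->cond);
    pthread_mutex_unlock(&es->lock);
}

/* Counterpart of WaitForMultipleObjects(n, h, FALSE, INFINITE):
   blocks until any flag is set, clears it, and returns its index. */
int event_set_wait_any(EventSet *es)
{
    int hit = -1;
    pthread_mutex_lock(&es->lock);
    while (hit < 0) {
        for (int i = 0; i < MAX_EV; i++) {
            if (es->signalled[i]) {
                es->signalled[i] = false;   /* auto-reset, an assumption */
                hit = i;
                break;
            }
        }
        if (hit < 0)
            pthread_cond_wait(&es->cond, &es->lock);
    }
    pthread_mutex_unlock(&es->lock);
    return hit;
}
```

A worker thread would loop on `event_set_wait_any(&es)` and dispatch on the returned index, while the code that used to call SetEvent() calls `event_set_signal(&es, i)` instead. When several flags are set at once, the lowest index is returned first.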

Resources