I have two related questions here and so I am asking as one question:
1- We compile the OpenCL program at run time using
clCreateProgramWithSource(context, 1, (const char**)&source, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
My question is: every time my OpenCL application runs, it will do this compilation, and it might take considerable time. Is there a way for the compilation to happen only on the first run, so that subsequent application runs use the binary from the previous compilation?
2- What are the different ways to speed up the compilation using clBuildProgram()? Maybe using compiler flags or something else?
At the expense of portability, you can use clCreateProgramWithBinary.
To save your compiled OpenCL code to run on the same device, you need to do the following:
Compile the code using clCreateProgramWithSource
Use clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, //...) to obtain the size of the binary
Use clGetProgramInfo(program, CL_PROGRAM_BINARIES, //...) to write the binary to a char buffer.
Write the buffer to disk.
Then in future, you can use clCreateProgramWithBinary rather than compile from source.
There's an example of how to do all of this in this code. You can trim it down to suit your needs.
As mentioned in the comments (thanks @Dithermaster) and to reiterate my first point, the compiled binary is very specific to the system on which it was compiled. If there are any changes to the system, a new binary must be compiled.
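For illustration, here is a minimal sketch of that save/reuse flow, assuming a single device, a made-up cache file name kernel.bin, and with all error handling omitted:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Sketch only: build from source on the first run, cache the binary,
   and reuse it on later runs. "kernel.bin" is a hypothetical file name. */
static cl_program load_or_build(cl_context ctx, cl_device_id dev, const char *src)
{
    cl_program prog;
    int from_cache = 0;
    FILE *f = fopen("kernel.bin", "rb");
    if (f) {
        /* Reuse the binary cached by a previous run. */
        fseek(f, 0, SEEK_END);
        size_t size = (size_t)ftell(f);
        fseek(f, 0, SEEK_SET);
        unsigned char *bin = malloc(size);
        fread(bin, 1, size, f);
        fclose(f);
        const unsigned char *bins[1] = { bin };
        prog = clCreateProgramWithBinary(ctx, 1, &dev, &size, bins, NULL, NULL);
        free(bin);
        from_cache = 1;
    } else {
        /* First run: compile from source. */
        prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    }
    /* clBuildProgram is required in both cases. */
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);

    if (!from_cache) {
        /* Query the binary and write it to disk for next time. */
        size_t size;
        clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);
        unsigned char *bin = malloc(size);
        unsigned char *bins[1] = { bin };
        clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
        FILE *out = fopen("kernel.bin", "wb");
        fwrite(bin, 1, size, out);
        fclose(out);
        free(bin);
    }
    return prog;
}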
Related
To use an OpenCL kernel, the following is needed:
Put the kernel code in a string
call clCreateProgramWithSource
call clBuildProgram
call clCreateKernel
call clSetKernelArg (once per argument)
call clEnqueueNDRangeKernel
This needs to be done for each kernel. Is there a way to do this with less repeated code per kernel?
There is no way to speed up the process. You need to go step by step as you listed.
But it is important to know why these steps are needed, to understand how flexible the chain is.
clCreateProgramWithSource: Lets you combine different strings from different sources to generate the program. Some strings might be static, but some might be downloaded from a server or loaded from disk. It allows the CL code to be dynamic and updated over time.
clBuildProgram: Builds the program for a given device. Maybe you have 8 devices, so you need to call this multiple times. Each device will produce a different binary code.
clCreateKernel: Creates a kernel. But a kernel is an entry point in a binary. So it is possible you create multiple kernels from a program (for different functions). Also the same kernel might be created multiple times, since it holds the arguments. This is useful for having ready-to-be-launched instances with proper parameters.
clSetKernelArg: Changes the parameters in the instance of the kernel. (They are stored there, so the kernel can be used multiple times in the future.)
clEnqueueNDRangeKernel: Launches it, configuring the size of the launch and the chain of dependencies with other operations.
So, even if you could have a way to just call "getKernelFromString()", the functionality would be very limited and not very flexible.
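That said, nothing stops you from folding the fixed part of the chain into a small helper of your own. A minimal sketch, with error handling omitted and a single source string, device, and kernel assumed (create_kernel_from_source is my name, not a CL API):

#include <CL/cl.h>

/* Hypothetical convenience wrapper: folds create-program, build, and
   create-kernel into one call, at the cost of all the flexibility above. */
static cl_kernel create_kernel_from_source(cl_context ctx, cl_device_id dev,
                                           const char *src, const char *name)
{
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, name, NULL);
    clReleaseProgram(prog);   /* the kernel keeps the program alive */
    return k;
}

clSetKernelArg and clEnqueueNDRangeKernel still have to happen per launch, which is exactly the flexibility argument made above.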
You can have a look at wrapper libraries:
https://streamhpc.com/knowledge/for-developers/opencl-wrappers/
I suggest you look into SYCL. The building steps are performed offline, saving execution time by skipping the clCreateProgramWithSource step. The argument setting is done automatically by the runtime, which extracts the information from the user lambda.
There is also CLU: https://github.com/Computing-Language-Utility/CLU - see https://www.khronos.org/assets/uploads/developers/library/2012-siggraph-opencl-bof/OpenCL-CLU-and-Intel-SIGGRAPH_Aug12.pdf for more info. It is a very simple tool, but should make life a bit easier.
What does Qt Quick Compiler do exactly? My understanding was that it "compiles" QML/JS into C++ and integrates this into the final binary/executable. So, there is no JIT compilation or any other JS-related things during runtime.
However, I saw somewhere an article that claimed that it's not like this and actually it only "bundles" QML/JS into final binary/executable, but there is still some QML/JS-related overhead during runtime.
At the documentation page there is this explanation:
.qml files as well as accompanying .js files can be translated into intermediate C++ source code. After compilation with a traditional compiler, the code is linked into the application binary.
What is this "intermediate C++ source code"? Why not just "C++ source code"? That confuses me, but the last statement kinda promises that yes, it is C++ code, and after compiling it with a C++ compiler you will have a binary/executable without any additional compilation/interpretation at runtime.
Is it how it actually is?
The code is of an intermediate nature because it doesn't map Javascript directly to C++. E.g. var i = 1, j = 2, k = i+j is not translated to the C++ equivalent double i = 1., j = 2., k = i+j. Instead, the code is translated to a series of operations that directly manipulate the state of the JS virtual machine. JS semantics are not something you can get for free from C++: there will be runtime costs no matter how you implement it. There is no additional compiling nor interpretation, but the virtual machine that implements the JS state still has to exist.
That's not an overhead that is easy to get rid of without emitting a lot of mostly dead code to cover all contexts in which a given piece of code might run, or doing the just-in-time compilation that you wanted to avoid. That's the primary problem with JavaScript: its semantics are such that it's generally not possible to translate it to typical imperative, statically typed code that gives rise to "standard" machine code.
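As a toy illustration of the difference (this is not Qt's actual generated code; the mini-VM and all names are made up), compare a direct translation with a VM-state-manipulating one:

#include <stdio.h>

/* A toy stack machine standing in for the JS virtual machine. */
typedef struct {
    double stack[16];
    int    sp;
    double locals[8];
} toy_vm;

static void   vm_push(toy_vm *vm, double v)  { vm->stack[vm->sp++] = v; }
static double vm_pop(toy_vm *vm)             { return vm->stack[--vm->sp]; }
static void   vm_store(toy_vm *vm, int slot) { vm->locals[slot] = vm_pop(vm); }
static void   vm_load(toy_vm *vm, int slot)  { vm_push(vm, vm->locals[slot]); }
/* A real JS "add" must also handle strings, coercions, etc. */
static void   vm_add(toy_vm *vm)             { double b = vm_pop(vm); double a = vm_pop(vm); vm_push(vm, a + b); }

int main(void)
{
    /* Direct C equivalent of: var i = 1, j = 2, k = i + j */
    double k_direct = 1.0 + 2.0;

    /* VM-state-manipulating form, closer in spirit to the generated code. */
    toy_vm vm = {0};
    enum { SLOT_I, SLOT_J, SLOT_K };
    vm_push(&vm, 1.0); vm_store(&vm, SLOT_I);   /* var i = 1 */
    vm_push(&vm, 2.0); vm_store(&vm, SLOT_J);   /* var j = 2 */
    vm_load(&vm, SLOT_I); vm_load(&vm, SLOT_J);
    vm_add(&vm); vm_store(&vm, SLOT_K);         /* k = i + j */

    printf("%g %g\n", k_direct, vm.locals[SLOT_K]);
    return 0;
}

The second form computes the same result, but the VM state it manipulates has to exist at runtime, which is the overhead in question.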
Your question already contains the answer.
It compiles the code into C++. That C++ is of an intermediate nature because C++ code alone is not enough: you need binaries. So after the translation to C++, the files are compiled into binaries, which are then linked.
The statement only says: we do not compile to binary, but to C++ instead. You need to compile it into a binary with a C++ compiler of your choice.
The bundling happens if you only put the files into the resources (qrc file). Putting them into the resources does not imply that you use the compiler.
Then there is the JIT compiler, which might (on supported platforms) do just-in-time compilation. More on this here
I want to get the following information about compiled OpenCL kernels: the list of argument types and their order (if possible, with memory and access qualifiers). The kernels are built from source at run time of the app.
OpenCL 1.2 already has an appropriate function for such a query, clGetKernelArgInfo, but due to project restrictions I have to find a way to achieve this using pure OpenCL 1.0 without any extensions.
At present, I am thinking about three approaches:
write a simple ANSI C parser to get info about a kernel's signature directly from the OpenCL kernel's source
use macros in the OpenCL code to mark the kernel's arguments for simple in-app parsing (by extending this idea)
define a list of the most likely combinations of kernel arguments using macros and helper classes (due to my project's constraints it is possible to operate with 3-5 common argument types); sketched below
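For the third approach, I imagine something like a hand-maintained signature table (all names illustrative; "vec_add" is a made-up example):

/* Host-side description of a kernel argument. */
typedef struct {
    const char *type;   /* e.g. "__global float*" */
    const char *name;
} kernel_arg_desc;

typedef struct {
    const char *kernel_name;
    int num_args;
    kernel_arg_desc args[8];
} kernel_sig;

/* One entry per kernel the app can build. */
static const kernel_sig signatures[] = {
    { "vec_add", 3, { { "__global const float*", "a" },
                      { "__global const float*", "b" },
                      { "__global float*",       "out" } } },
};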
My question: are there any other ways to get info about a compiled kernel?
I want to use this info to decrease the amount of OpenCL routine in client code by encapsulating calls to clCreateBuffer, clEnqueueWrite/Read, and clSetKernelArg in a small wrapper, which should check the provided params, allocate device-side pointers, copy data from/to the host, and so on.
The Khronos WebCL Validator gives you the equivalent of clGetKernelArgInfo, including all qualifiers.
The necessary downside is that it's a complete parser, based on Clang/LLVM. It takes roughly the same amount of time to run as a typical OpenCL compiler (not a coincidence), and adds around 10 megabytes to your executable size.
In many languages, such as C++, having lots of different source files is normal, but it doesn't seem like this is the case very often with PIC microcontroller programs -- at least not with any of the tutorials or books I've read.
I'm wondering how I can have a source (.c) file with a bunch of routines, global variables and defines in it that can be used by my main.c file. Is this even possible?
Thanks for your advice!
This is absolutely possible with PIC development. Size is certainly a concern, both from a code and a data perspective, but it's still just C code, meaning most of the rules of C apply (see compiler documentation for exceptions), including having multiple source files that get compiled and linked into a single output (usually a .hex file). For example, in a C file separate from your main.c, say test.c:
int AddNumbers(int a, int b)
{
return a + b;
}
You could then declare that in a header file test.h:
int AddNumbers(int a, int b);
Include test.h at the top of your main.c file:
#include "test.h"
You should then be able to call AddNumbers(4,5) from main.c. I have not tested this code but provide it simply as an example of the process.
Typically, most code for PIC18 is included from other files. So rather than the high-level technique of compile-then-link, it is more common to include (and include from includes) all of the code so that there is a single stream going to the compiler. I think you can do it under PIC18, but I never spent enough time to get it to work. Most of the libraries and such are designed as include files, rather than as separately translated units.
It's a different mindset, but there is a reason. I think this is due to the historic need to keep things as small as possible. Therefore, things are done with MUCH heavier use of chip-specific macros, and much less (linkable) library development.
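A sketch of that include-everything style, with adc.c and uart.c as hypothetical module names:

/* main.c: the whole program is one translation unit. Each module is
   pulled in as source, so the compiler sees a single stream and no
   separate linking of user code is needed. */
#include "adc.c"    /* hypothetical module */
#include "uart.c"   /* hypothetical module */

void main(void)     /* many 8-bit PIC compilers use void main */
{
    adc_init();
    uart_init();
    for (;;) {
        /* main loop */
    }
}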
The PIC32 compiler is much better in its library support.
In dynamic languages, how is dynamically typed code JIT compiled into machine code? More specifically: does the compiler infer the types at some point? Or is it strictly interpreted in these cases?
For example, if I have something like the following pseudocode:
def func(arg)
if (arg)
return 6
else
return "Hi"
How can the execution platform know before running the code what the return type of the function is?
In general, it doesn't. However, it can assume either type, and optimize for that. The details depend on what kind of JIT it is.
The so-called tracing JIT compilers interpret and observe the program, and record types, branches, etc. for a single run (e.g. a loop iteration). They record these observations, insert a (quite fast) check that these assumptions still hold when the code is executed, and then optimize the heck out of the following code based on these assumptions. For example, if your function is called in a loop with a constantly true argument and one is added to its result, the JIT compiler first records instructions like this (we'll ignore call frame management, memory allocation, variable indirection, etc., not because those aren't important, but because they take a lot of code and are optimized away too):
; calculate arg
guard_true(arg)
boxed_object o1 = box_int(6)
guard_is_boxed_int(o1)
int i1 = unbox_int(o1)
int i2 = 1
int i3 = add_int(i1, i2)
and then optimizes it like this:
; calculate arg
; may even be elided, arg may be constant without you realizing it
guard_true(arg)
; guard_is_boxed_int constant-folded away
; unbox_int constant-folded away
; add_int constant-folded away
int i3 = 7
Guards can also be moved to allow optimizing earlier code, combined to have fewer guards, elided if they are redundant, strengthened to allow more optimizations, etc.
If guards fail too frequently, or some code is otherwise rendered useless, it can be discarded, or at least patched to jump to a different version on guard failure.
Other JITs take a more static approach. For instance, you can do quick, inaccurate type inference to at least recognize a few operations. Some JIT compilers only operate on function scope (they are thus called method JIT compilers by some), so they probably can't make much of your code snippet (which is one reason tracing JIT compilers are so popular). Nevertheless, they exist -- an example is the latest revision of Mozilla's JavaScript engine, IonMonkey, although it apparently takes inspiration from tracing JITs as well. You can also add not-always-valid optimizations (e.g. inline a function that may be changed later) and remove them when they become wrong.
When all else fails, you can do what interpreters do: box objects, use pointers to them, tag the data, and select code based on the tag. But this is extremely inefficient; the whole purpose of JIT compilers is to get rid of that overhead, so they will only do that when there is no reasonable alternative (or when they are still warming up).
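A minimal sketch of that boxed/tagged representation in C (names are illustrative, not from any particular engine):

#include <stdio.h>

/* Every value carries a type tag; every operation dispatches on it.
   This per-operation dispatch is the overhead a JIT tries to remove. */
typedef enum { TAG_INT, TAG_STR } tag_t;

typedef struct {
    tag_t tag;
    union {
        long        i;
        const char *s;
    } as;
} value_t;

static value_t make_int(long i) { value_t v = { TAG_INT, { .i = i } }; return v; }

/* A dynamic "add": fast path for ints, everything else is a slow path. */
static value_t add(value_t a, value_t b)
{
    if (a.tag == TAG_INT && b.tag == TAG_INT)
        return make_int(a.as.i + b.as.i);
    /* strings, coercions, type errors... omitted in this sketch */
    return make_int(0);
}

int main(void)
{
    value_t r = add(make_int(6), make_int(1));
    printf("%ld\n", r.as.i);   /* prints 7, echoing the trace above */
    return 0;
}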