What should replace "memcpy" inside OpenCL kernels?

The OpenCL language, which extends C99, does not provide the memcpy function. What should be used instead?

As far as I know, there is nothing like that defined in OpenCL. OpenCL does not provide dynamic memory allocation, so such functionality is largely unnecessary.
You can simply loop over your array with a for loop and copy the data element by element, as in the sketch below. Note that the target array has a fixed size, since array lengths inside a kernel must be known at compile time.
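A minimal sketch of such an element-by-element copy in OpenCL C (the kernel name, float element type, and the fixed block size N are illustrative):

#define N 16  /* size known at compile time; OpenCL C has no malloc/memcpy in kernels */

__kernel void copy_block(__global const float *src, __global float *dst)
{
    size_t gid = get_global_id(0);
    /* each work-item copies its own block of N elements */
    for (int i = 0; i < N; ++i)
        dst[gid * N + i] = src[gid * N + i];
}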
More broadly, OpenCL (and OpenGL, from which it partly originates) was designed in a more static way. The data has to be provided to the GPU and the result size has to be defined up front; the GPU then computes from the input into the pre-defined output location. It is not meant to spawn additional work on its own, nor to allocate memory dynamically, so that it does not interfere with the host, which handles allocation.

Related

Declaring Half precision floating point memory in SYCL

I would like to know and understand how one can declare half-precision buffers and pointers in SYCL, namely in the following ways:
Via the buffer class.
Using malloc_device() function.
Also, suppose I have an existing fp32 matrix/array on the host side. How can I copy its contents to fp16 memory on the GPU side?
TIA
For half-precision, you can just use sycl::half as the template parameter for either of these.
For copying, you'll need to convert the data from fp32 to fp16, which you could probably do with a kernel that performs the conversion element by element:
accHalf[i] = static_cast<sycl::half>(accFloat[i]);
This seems to be a well-documented problem with existing solutions; see this thread.
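A sketch of both declaration styles plus a host-to-device fp32-to-fp16 conversion, assuming SYCL 2020 (the queue q, the hostData vector, and the function name are illustrative):

#include <sycl/sycl.hpp>
#include <vector>

void to_half_on_device(sycl::queue &q, const std::vector<float> &hostData) {
    const size_t n = hostData.size();

    // 1) Via the buffer class: a 1-D buffer of sycl::half elements
    //    (declared here only to show the syntax).
    sycl::buffer<sycl::half, 1> halfBuf{sycl::range<1>{n}};

    // 2) Via USM: device allocations created with malloc_device().
    float      *devFloat = sycl::malloc_device<float>(n, q);
    sycl::half *devHalf  = sycl::malloc_device<sycl::half>(n, q);

    // Copy the fp32 host data to the device, then convert element-wise in a kernel.
    q.memcpy(devFloat, hostData.data(), n * sizeof(float)).wait();
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        devHalf[i] = static_cast<sycl::half>(devFloat[i]);
    }).wait();

    sycl::free(devFloat, q);
    sycl::free(devHalf, q);
}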

Is there a way to simplify OpenCL kernel usage?

To use an OpenCL kernel, the following steps are needed:
Put the kernel code in a string
call clCreateProgramWithSource
call clBuildProgram
call clCreateKernel
call clSetKernelArg (x number of arguments)
call clEnqueueNDRangeKernel
This needs to be done for each kernel. Is there a way to do this with less repeated code for each kernel?
There is no way to speed up the process. You need to go step by step as you listed.
But it is important to understand why these steps are needed, and how flexible the chain is.
clCreateProgramWithSource: Allows you to combine strings from different sources to generate the program. Some strings might be static, others downloaded from a server or loaded from disk. This lets the CL code be dynamic and updated over time.
clBuildProgram: Builds the program for a given device. Maybe you have 8 devices, so you need to call this multiple times. Each device will produce a different binary code.
clCreateKernel: Creates a kernel. A kernel is an entry point into a binary, so you can create multiple kernels from one program (one per function). The same kernel might also be created multiple times, since each kernel object holds its arguments; this is useful for having ready-to-be-launched instances with the proper parameters.
clSetKernelArg: Changes one parameter in a kernel instance. (The value is stored in the kernel object, so it can be reused in future launches.)
clEnqueueNDRangeKernel: Launches it, configuring the size of the launch and the chain of dependencies with other operations.
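Put together, the whole chain for one kernel looks roughly like this (a sketch with error checking omitted; ctx, device, queue, and buf are assumed to exist already):

/* Host-side sketch: ctx, device, queue and buf are assumed to already exist. */
const char *src =
    "__kernel void scale(__global float *d, float f) { d[get_global_id(0)] *= f; }";

cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
clBuildProgram(program, 1, &device, "", NULL, NULL);

cl_kernel kernel = clCreateKernel(program, "scale", NULL);

float factor = 2.0f;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);    /* one call per argument */
clSetKernelArg(kernel, 1, sizeof(float), &factor);

size_t global = 1024;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);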
So even if you had a way to just call "getKernelFromString()", the functionality would be very limited and not very flexible.
You can have a look at wrapper libraries:
https://streamhpc.com/knowledge/for-developers/opencl-wrappers/
I suggest you look into SYCL. The build steps are performed offline, saving execution time by skipping clCreateProgramWithSource. Argument setting is done automatically by the runtime, which extracts the information from the user's lambda.
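For comparison, a minimal SYCL launch looks like this; the arguments are just the variables captured by the lambda, so there is no equivalent of clSetKernelArg (a sketch, assuming data is a USM device or shared pointer and the names are illustrative):

#include <sycl/sycl.hpp>

void scale(sycl::queue &q, float *data, size_t n, float factor) {
    // data, n and factor are captured by the lambda and forwarded
    // to the kernel by the SYCL runtime.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        data[i] *= factor;
    }).wait();
}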
There is also CLU: https://github.com/Computing-Language-Utility/CLU - see https://www.khronos.org/assets/uploads/developers/library/2012-siggraph-opencl-bof/OpenCL-CLU-and-Intel-SIGGRAPH_Aug12.pdf for more info. It is a very simple tool, but should make life a bit easier.

How to get kernel information

I want to get the following information about compiled OpenCL kernels: the list of argument types and their order (if possible, with memory and access qualifiers). The kernels are built from source at run time.
OpenCL 1.2 already provides an appropriate function for such a query, clGetKernelArgInfo, but due to project restrictions I have to find a way to achieve this using pure OpenCL 1.0, without any extensions.
At present, I am thinking about three approaches:
write a simple ANSI C parser to extract the kernel's signature directly from the OpenCL kernel source
use macros in the OpenCL code to mark kernel arguments for simple in-app parsing (by extending this idea)
define a list of the most likely combinations of kernel arguments using macros and helper classes (due to my project's constraints it is possible to get by with 3-5 common argument types)
My question: are there any other ways to get information about a compiled kernel?
I want to use this information to reduce the amount of OpenCL boilerplate in client code by encapsulating calls to clCreateBuffer, clEnqueueWrite/Read, and clSetKernelArg in a small wrapper that checks the provided parameters, allocates device-side pointers, copies data from/to the host, and so on.
The Khronos WebCL Validator gives you the equivalent of clGetKernelArgInfo, including all qualifiers.
The necessary downside is that it's a complete parser, based on Clang/LLVM. It takes roughly the same amount of time to run as a typical OpenCL compiler (not a coincidence), and adds around 10 megabytes to your executable size.
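For reference, the OpenCL 1.2 query the question is trying to replicate looks roughly like this (a sketch; it requires the program to be built with -cl-kernel-arg-info, so it is not available under a pure 1.0 constraint):

#include <stdio.h>
#include <CL/cl.h>

/* OpenCL 1.2 only: print argument type names, names and address qualifiers
   of a built kernel. */
void print_kernel_args(cl_kernel kernel)
{
    cl_uint numArgs = 0;
    clGetKernelInfo(kernel, CL_KERNEL_NUM_ARGS, sizeof(numArgs), &numArgs, NULL);

    for (cl_uint i = 0; i < numArgs; ++i) {
        char typeName[64] = {0};
        char argName[64] = {0};
        cl_kernel_arg_address_qualifier addr;

        clGetKernelArgInfo(kernel, i, CL_KERNEL_ARG_TYPE_NAME, sizeof(typeName), typeName, NULL);
        clGetKernelArgInfo(kernel, i, CL_KERNEL_ARG_NAME, sizeof(argName), argName, NULL);
        clGetKernelArgInfo(kernel, i, CL_KERNEL_ARG_ADDRESS_QUALIFIER, sizeof(addr), &addr, NULL);

        printf("arg %u: %s %s (address qualifier 0x%x)\n", i, typeName, argName, (unsigned)addr);
    }
}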

STD Classes in CUDA Kernel

I know that there is no way to use std classes such as string, vector, map, or set in a CUDA kernel. However, it's very uncomfortable without them. I have to write a lot of code in CUDA kernels, so I would like to use at least strings and vectors. I'm not talking about something like thrust. I want to be able to write something like this:
__global__ void kernel()
{
    cuda_vector<int> a;
    for (int i = 0; i < 10; i++)
        a.push_back(i);
}

int main()
{
    kernel<<<1, 512>>>();
    return 0;
}
This should create 512 threads, and in each thread I want to create a cuda_vector object and use it like std::vector. I didn't find any solution on the internet, so I started to write my own class. Each function of this class is defined as both a __host__ and __device__ function so that I can use it on both CPU and GPU.
Theoretically it can be implemented, though only on the Fermi architecture, because we need to allocate memory dynamically in device code. I have a GTX 580 and have started to write my own vector, but it is tiring and takes a lot of time. Isn't there any implementation I can use? I can't believe there isn't one. Do so many software developers write CUDA code without it? Has nobody tried to write their own version?
The reason you don't find something like std::vector for CUDA is performance. A traditional vector object doesn't fit well with the CUDA model. If you plan on using only 512 threads, and each one manages a std::vector-like object, your performance is going to be worse than running the same code on the CPU.
GPU threads are not like CPU threads; they should be as light as possible. Use thread blocks and shared memory to have the threads cooperate. If you are manipulating a string, each thread should work on one character; if you are using vectors on the CPU, pass an array of their contents to the GPU and have each thread work on one element. Basically, think about how to solve the problem with the CUDA programming model, as opposed to solving it with a CPU approach and then translating it to CUDA.
I've not used it, but the CuPP framework may be of interest to you, especially the vector<T> implementation. Looks like it could do what you need it to do.

How do you work around the inability to pass a list of cl_mem into a kernel invocation?

There are lots of real-world reasons you'd want to do this. Ours is because we have a list of variable length data structures, and we want to be able to change the size of one of the elements without recopying them all.
Here's a few things I've tried:
Just have a lot of kernel arguments. Sure, sounds hacky, but works for small N. This is actually what we've been doing.
Do 1) with some sort of macro loop which extends the kernel args to the max size (which I think is device dependent). I don't really want to do this... it sounds bad.
Create some sort of list of structs which contain pointers, and fill it before your kernel invocation. I tried this, and I think it violates the spec. According to what I've seen on the nVidia forums, preserving the address of a device pointer beyond one kernel invocation is illegal. If anyone can point to where in the spec it says this, I'd love to know, because I can't find it. However, this definitely breaks on ATI hardware, as it moves the objects around.
Give up, store the variable sized objects in a big array, and write a clever algorithm to use empty space so the whole array must be reflowed less often. This will work, but is an inelegant, complicated design. Also, it requires lots of scary pointer arithmetic...
Does anyone else have other ideas? What about experiences trying to do this; is there a least hacky way? Why?
To 3:
OpenCL 1.1 spec page 193 says "Arguments to kernel functions in a program cannot be declared as a pointer to a pointer(s)."
A struct containing a pointer (a pointer to a buffer object) might not be against a strict reading of this sentence, but it is within its spirit: no pointers to buffer objects may be passed as arguments from host code to a kernel, even if they are hidden inside a user-defined struct.
I'd opt for option 5: do not use variable-size data structures. If you have any way of making them constant size, by all means do it. It will make your life a whole lot easier. To be precise, there is no 'variable-size structure': every struct definition produces constant-sized structs, so if the size has changed then the struct itself has changed and therefore requires another mem object. Every pointer passed to a kernel function must have a single type.
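A sketch of that constant-size approach in kernel code: pad every object to one worst-case struct and pass a single buffer of them (the struct layout and sizes are illustrative):

#define MAX_ELEMS 64

typedef struct {
    int   count;            /* how many entries of data[] are valid */
    float data[MAX_ELEMS];  /* padded to the worst-case size */
} Record;

__kernel void sum_records(__global const Record *records, __global float *out)
{
    size_t gid = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < records[gid].count; ++i)
        acc += records[gid].data[i];
    out[gid] = acc;
}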
In addition to sharpnelis' answer (option 5):
If the objects have similar sizes, you could use a union sized for the biggest possible object, but make sure you use explicit alignment. Pass a second buffer identifying which union member is used for each object in your variable-sized-objects-in-static-size-union buffer.
I resorted to this when using OpenCL library code that only allowed one variable array of arbitrary type: I simply used cl_float2 to pass two floats. Since the cl_floatN types are implemented as unions, what works for the built-in types will work for you as well.
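A sketch of that union trick on the kernel side (the member types and tag values are illustrative; alignment/packing has to match whatever the host writes into the buffer):

/* Objects of different types packed into one union, with a separate tag
   buffer saying which member each element actually uses. */
typedef union {
    float  f;
    int    i;
    float2 pair;
} Payload;

__kernel void process(__global const Payload *objs,
                      __global const int *tag,     /* 0 = float, 1 = int, 2 = float2 */
                      __global float *out)
{
    size_t gid = get_global_id(0);
    switch (tag[gid]) {
        case 0: out[gid] = objs[gid].f; break;
        case 1: out[gid] = (float)objs[gid].i; break;
        case 2: out[gid] = objs[gid].pair.x + objs[gid].pair.y; break;
    }
}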
