OpenCL: maintaining separate versions of kernels

The Intel SDK says:
If you need separate versions of kernels, one way to keep the source
code base same, is using the preprocessor to create CPU-specific or
GPU-specific optimized versions of the kernels. You can run
clBuildProgram twice on the same program object, once for CPU with
some flag (compiler input) indicating the CPU version, the second time
for GPU and corresponding compiler flags. Then, when you create two
kernels with clCreateKernel, the runtime has two different versions
for each kernel.
Let us say I use clBuildProgram twice with flags for CPU and GPU. This will compile two versions of the program: one optimized for the CPU and another optimized for the GPU. But how will I create two kernels now, since there is no CPU/GPU-specific option in clCreateKernel()?

The sequence of calls to build the program for CPU and GPU devices and obtain the different kernels could look like this:
cl_program program = clCreateProgramWithSource(...);
// Build once for the CPU devices with CPU-specific options and create the CPU kernel...
clBuildProgram(program, numCpuDevices, cpuDeviceList, cpuOptions, NULL, NULL);
cl_kernel cpuKernel = clCreateKernel(program, ...);
// ...then build again for the GPU devices and create the GPU kernel.
clBuildProgram(program, numGpuDevices, gpuDeviceList, gpuOptions, NULL, NULL);
cl_kernel gpuKernel = clCreateKernel(program, ...);
(Note: I could not test this at the moment. If there's something wrong, I'll delete this answer)

clCreateKernel creates an entry point into a program, and the program has already been compiled for a specific device (CPU or GPU). So there is nothing you can do at the kernel-creation level if the program is already compiled one way or the other.
By passing different compiled program objects, clCreateKernel will create different kernel objects for different devices.
The key to control the GPU/CPU mode is at the clBuildProgram step, where a device has to be specified.
Additionally, the compilation can be further refined with external defines to disable/enable pieces of code specifically designed for the CPU or GPU.

You would create only one kernel, with the same name. To discriminate between devices you would use #ifdef directives inside the kernel, e.g.:
kernel void foo(global float *bar)
{
#ifdef HAVE_CPU
    bar[0] = 23.0f;
#elif defined(HAVE_GPU)
    bar[0] = 42.0f;
#endif
}
You set this flag at build time, e.g. with the C++ wrapper:
program.build({device}, "-DHAVE_CPU");
or with "-DHAVE_GPU" for the GPU build. (Remark: -D... is not a typo; it is the standard compiler option for defining a preprocessor macro.)
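Combining the two answers, a hedged sketch in the C API might look like this (the kernel name foo, the device lists, and the error handling are placeholders; the rebuild-after-create ordering carries the same untested caveat as the first answer, since implementations may refuse to rebuild a program while kernel objects are still attached):
cl_int err;
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);

/* Build for the CPU devices with the CPU define, then create the CPU kernel. */
clBuildProgram(program, numCpuDevices, cpuDeviceList, "-DHAVE_CPU", NULL, NULL);
cl_kernel cpuKernel = clCreateKernel(program, "foo", &err);

/* Rebuild for the GPU devices with the GPU define, then create the GPU kernel. */
clBuildProgram(program, numGpuDevices, gpuDeviceList, "-DHAVE_GPU", NULL, NULL);
cl_kernel gpuKernel = clCreateKernel(program, "foo", &err);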

Related

Local memory using C++ Wrappers

I wish to use local work-groups for my kernels, but I'm having some issues passing the 'NULL' parameters to my kernels. I would like to know how to pass these parameters using the methods that I'm using, which I will show below, as opposed to setArg, which I saw here: How to declare local memory in OpenCL?
I have the following host code for my kernel:
initialized in a .h file:
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer>> setInputKernel;
in host code:
this->setInputKernel.reset(new cl::make_kernel<cl::Buffer, cl::Buffer>(program, "setInputs"));
enqueue kernel code:
(*setInputKernel)(cl::EnqueueArgs(*queue, cl::NDRange(1000),cl::NDRange(1000)),
cl::Buffer, cl::Buffer);
kernel code:
kernel void setInputs(global float* restrict inputArr, global float* restrict inputs)
I have already set the appropriate sizes and settings for my local work-group parameters. However, I did not successfully pass the data into the kernel.
The kernel with the local work group updates:
kernel void setInputs(global float* restrict inputArr, global float* restrict inputs,
                      local float* inputArrLoc, local float* inputsLoc)
I had tried to change my code accordingly by using NULL or cl::Buffer for the input params of the kernels, but it didn't work:
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer, NULL, NULL>> setInputKernel;
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer, cl::Buffer, cl::Buffer>> setInputKernel;
with the first attempt giving me compiler errors saying that the function expects a value while I did not give one, and the second attempt returning a clSetKernelArg error when I try to run the kernel. In both examples, I ensured that the parameters in the headers and host files were consistent.
I also tried to just put NULL behind my cl::Buffers when I enqueue the kernel, but this returns an error telling me that there is no matching function for the call.
How do I pass parameters to my kernel in my example?
There is a LocalSpaceArg type and a Local helper function to do this.
The type of your kernel would be this:
cl::make_kernel<cl::Buffer, cl::Buffer, cl::LocalSpaceArg, cl::LocalSpaceArg>
You would then specify the size of the local memory allocations when you enqueue the kernel by using cl::Local(size) (where size is the number of bytes you wish to allocate).
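Applied to the kernel above, the enqueue could then look like this (a sketch; the buffer variables, the queue, and the local size of 1000 floats are assumptions carried over from the question):
cl::make_kernel<cl::Buffer, cl::Buffer, cl::LocalSpaceArg, cl::LocalSpaceArg>
    setInputs(program, "setInputs");

// One cl::Local entry per 'local' kernel parameter, sized in bytes.
setInputs(cl::EnqueueArgs(*queue, cl::NDRange(1000), cl::NDRange(1000)),
          inputArrBuf, inputsBuf,
          cl::Local(1000 * sizeof(float)),
          cl::Local(1000 * sizeof(float)));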

What are kernel blocks in OpenCL?

In the article "How to set up Xcode to run OpenCL code, and how to verify the kernels before building" NeXTCoder referred to some code as the "Short Answer", i.e. https://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/XCodeHelloWorld/XCodeHelloWorld.html.
In that code the author says "Wrap your kernel code into a kernel block:" without explaining what a "kernel block" is. (The OpenCL Programming Guide for Mac OS X by Apple makes no mention of kernel blocks.)
The host program calls "square_kernel", but the sample kernel is called "square" and the sample kernel block is labelled "kernelName" (in italics). Can you please tell me how to put the 3 pieces together: kernel, kernel block & host program, to run in Xcode 5.1? I only have one kernel. Thanks.
It's not really jargon. It's a closure-like entity.
OpenCL C 2.0 adds support for the clang block syntax. You use the ^ operator to declare a Block variable and to indicate the beginning of a Block literal. The body of the Block itself is contained within {}, as shown in the example (as usual with C, ; indicates the end of the statement). The Block is able to make use of variables from the same scope in which it was defined.
Example:
int multiplier = 7;
int (^myBlock)(int) = ^(int num) {
    return num * multiplier;
};
printf("%d\n", myBlock(3));
// prints 21
Source:
https://www.khronos.org/registry/cl/sdk/2.1/docs/man/xhtml/blocks.html
The term "kernel block" only seems to be a jargon to refer to the "part of the code that is the kernel". Particularly, the kernel block in this case is simply the function that is declared to be a kernel, by adding kernel before its declaration. Or, even simpler, and from the way how the term is used on this website, I would say that "kernel block" is the same as "kernel".
The kernelName (in italics) is a placeholder. The code there shows the general pattern of how to define any kernel:
It is prefixed with kernel
It returns void
It has a name ... the kernelName, which may for example be square
It has several input and output parameters
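Following that pattern, a complete square kernel might look like this (a sketch consistent with the pattern above, not necessarily Apple's exact sample code):
kernel void square(global float* input, global float* output)
{
    size_t i = get_global_id(0);
    output[i] = input[i] * input[i];
}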
The reason why the kernel is called square but invoked with square_kernel seems to be some magic done by Xcode: it appears to read the .cl file and create a .h file containing additional declarations derived from the .cl file (as can be seen in this question, where a kernel called rebound is defined, and GCL generated a rebound_kernel declaration).
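For instance, for the square kernel above, the generated header would contain a block declaration along these lines (an illustrative reconstruction based on the naming convention, not Apple's verbatim output):
extern void (^square_kernel)(const cl_ndrange *ndrange,
                             cl_float *input, cl_float *output);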

How to pass char pointer into opencl kernel?

I am trying to pass a char pointer into an OpenCL kernel function like this:
char *rp=(char*)malloc(something);
ciErr = clSetKernelArg(ckKernel, 0, sizeof(cl_char*), (char *)&rp);
and my kernel is as
__kernel void subFilter(char *rp)
{
    // do something
}
When I run the kernel, I get
error -48 in clsetkernelargs 1
Also, I tried to modify the kernel as
__kernel void subFilter(__global char *rp)
{
    // do something
}
I got this error:
error -38 in clsetkernelargs 1
which says invalid mem object. I just want to access the memory location pointed to by rp in the kernel.
Any help would be of great help.
Thanks,
Piyush
Any arrays and memory objects that you use in an OpenCL kernel need to be allocated via the OpenCL API (e.g. using clCreateBuffer). This is because the host and device don't always share the same physical memory. A pointer to data allocated on the host (via malloc) means absolutely nothing to a discrete GPU, for example.
To pass an array of characters to an OpenCL kernel, you should write something along the lines of:
char *h_rp = (char*)malloc(length);
cl_mem d_rp = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                             length, h_rp, &err);
err = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), &d_rp);
and declare the argument with the __global (or __constant) qualifier in your kernel. You can then copy the data back to the host with clEnqueueReadBuffer.
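For example, a blocking read of the results back into the original host allocation could look like this (queue is assumed to be an existing command queue):
err = clEnqueueReadBuffer(queue, d_rp, CL_TRUE, 0, length, h_rp, 0, NULL, NULL);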
If you do know that host and device share the same physical memory, then you can allocate memory that is visible to both host and device by creating a buffer with the CL_MEM_ALLOC_HOST_PTR flag, and using clEnqueueMapBuffer when you wish to access the data from the host. The new shared virtual memory (SVM) features of OpenCL 2.0 also improve the way that you can share buffers between host and device on unified-memory architectures.
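A sketch of that mapping approach (context and queue creation, as well as error checking, are omitted):
cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            length, NULL, &err);

/* Map to get a host-accessible pointer, write through it, then unmap. */
char *ptr = (char*)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                      0, length, 0, NULL, NULL, &err);
/* ... fill ptr ... */
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);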

Valgrind equivalent for OpenCL

To check memory access violations on the CPU, there's Valgrind/memcheck; for CUDA code on the GPU, there's CUDA memcheck. Is there an equivalent to these tools for OpenCL?
There is now an OpenCL device simulator called Oclgrind that works in a similar manner to Valgrind to provide a 'memcheck' feature (among other things).
It's open-source, and there are binary builds available for various platforms. Like Valgrind it's not fast, but using it is very straightforward:
$ oclgrind ./myapp
Invalid write of size 4 at global memory address 0x3000000000010
Kernel: write_out_of_bounds
Entity: Global(4,0,0) Local(4,0,0) Group(0,0,0)
store i32 %tmp15, i32 addrspace(1)* %tmp19, align 4, !dbg !24
At line 4 of input.cl:
c[i] = a[i] + b[i]
Have you looked at http://github.com/KhronosGroup/webcl-validator? It takes your OpenCL kernel source and instruments it with bounds-checking code. Out-of-bounds reads/writes are currently discarded, but you could modify the instrumented kernel (or the tool itself) to make it report the access violation.

MPI - one function for MPI_Init and MPI_Init_thread

Is it possible to have one function wrap both MPI_Init and MPI_Init_thread? The purpose of this is to have a cleaner API while maintaining backward compatibility. What happens to a call to MPI_Init_thread when it is not supported by the MPI runtime? How do I keep my wrapper function working for MPI implementations where MPI_Init_thread is not supported?
MPI_INIT_THREAD is part of the MPI-2.0 specification, which was released 15 years ago. Virtually all existing MPI implementations are MPI-2 compliant except for some really archaic ones. You might not get the desired level of thread support, but the function should be there and you should still be able to call it instead of MPI_INIT.
Your best and most portable option is to have a configure-like mechanism probe for MPI_Init_thread in the MPI library, e.g. by trying to compile a very simple MPI program and seeing if it fails with an unresolved symbol reference; alternatively, you can directly examine the export table of the MPI library with nm (for archives) or objdump (for shared ELF objects). Once you've determined that the MPI library has MPI_Init_thread, you can have a preprocessor symbol defined, e.g. CONFIG_HAS_INITTHREAD. Then have your wrapper look similar to this one:
int init_mpi(int *pargc, char ***pargv, int desired, int *provided)
{
#if defined(CONFIG_HAS_INITTHREAD)
    return MPI_Init_thread(pargc, pargv, desired, provided);
#else
    *provided = MPI_THREAD_SINGLE;
    return MPI_Init(pargc, pargv);
#endif
}
Of course, if the MPI library is missing MPI_INIT_THREAD, then MPI_THREAD_SINGLE and the other thread support level constants will also not be defined in mpi.h, so you might need to define them somewhere.
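A guarded fallback could look like this (the numeric values below are illustrative assumptions, not normative; real implementations define their own):
/* Fallbacks for pre-MPI-2 headers; values are illustrative only. */
#ifndef MPI_THREAD_SINGLE
#define MPI_THREAD_SINGLE     0
#define MPI_THREAD_FUNNELED   1
#define MPI_THREAD_SERIALIZED 2
#define MPI_THREAD_MULTIPLE   3
#endif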
