Local memory using C++ Wrappers - opencl

I want to use local work groups for my kernels, but I'm having trouble passing the 'NULL' parameters to them. I would like to know how to pass these parameters with the approach shown below, rather than with setArg as described here: How to declare local memory in OpenCL?
I have the following host code for my kernel:
initialized in a .h file:
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer>> setInputKernel;
in host code:
this->setInputKernel.reset(new cl::make_kernel<cl::Buffer, cl::Buffer>(program, "setInputs"));
enqueue kernel code:
(*setInputKernel)(cl::EnqueueArgs(*queue, cl::NDRange(1000),cl::NDRange(1000)),
cl::Buffer, cl::Buffer);
kernel code:
kernel void setInputs(global float* restrict inputArr, global float* restrict inputs)
I have already set the appropriate sizes and settings for my local work group parameters. However, I have not managed to pass the data into the kernel.
The kernel with the local work group updates:
kernel void setInputs(global float* restrict inputArr, global float*
restrict inputs, local float* inputArrLoc, local float* inputsLoc)
I tried to change my code accordingly by using NULL or cl::Buffer for the kernel's input parameters, but neither worked:
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer, NULL, NULL>> setInputKernel;
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer, cl::Buffer, cl::Buffer>> setInputKernel;
with the first attempt giving me compiler issues saying that the function expects a value while I did not give one, and the second attempt returning clSetKernelArg error when I try to run the kernel. In both examples, I had ensured that all the parameters for the headers and host files were consistent.
I also tried to just put NULL behind my cl::Buffers when I enqueue the kernel, but this returns an error telling me that there is no function for call.
How do I pass parameters to my kernel in my example?

There is a cl::LocalSpaceArg type and a cl::Local helper function for this.
The type of your kernel would be this:
cl::make_kernel<cl::Buffer, cl::Buffer, cl::LocalSpaceArg, cl::LocalSpaceArg>
You would then specify the size of the local memory allocations when you enqueue the kernel by using cl::Local(size) (where size is the number of bytes you wish to allocate).
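Because cl::Local takes a byte count rather than an element count, it can help to keep the two apart with a small helper. The sketch below is plain C++ with no OpenCL dependency. localBytes, kWorkGroupSize and kLocalArrayBytes are hypothetical names, the buffer names in the trailing comment (inputArrBuf, inputsBuf) are placeholders rather than names from the question, and one float per work-item per local array is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes of local memory needed for `count` elements of T.
template <typename T>
constexpr std::size_t localBytes(std::size_t count) {
    return count * sizeof(T);
}

// The question uses a work-group of 1000 work-items; this sketch assumes
// one float per work-item in each local array.
constexpr std::size_t kWorkGroupSize = 1000;
constexpr std::size_t kLocalArrayBytes = localBytes<float>(kWorkGroupSize);

// With the OpenCL C++ wrapper available, the enqueue call would then look
// roughly like this (inputArrBuf and inputsBuf are placeholder cl::Buffer
// objects, not names from the question):
//
//   (*setInputKernel)(cl::EnqueueArgs(*queue, cl::NDRange(1000), cl::NDRange(1000)),
//                     inputArrBuf, inputsBuf,
//                     cl::Local(kLocalArrayBytes), cl::Local(kLocalArrayBytes));
```

The two cl::Local arguments line up with the two cl::LocalSpaceArg template parameters, in the same order as the local pointers in the kernel signature.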

GCC, linker-script: Variables that resolve to manually defined addresses?

I'll use a simple specific example to illustrate what I'm trying to do.
file main.c:
#include <stdio.h>
unsigned int X;
int main()
{
printf("&X = %p\r\n", (void *)&X);
return 0;
}
I want to know if it's possible (using a linker-script/gcc options) to manually specify an address for X at compile/link time, because I know it lies somewhere in memory, outside my executable.
I only want to know if this is possible, I know I can use a pointer (i.e. unsigned int*) to access a specific memory location (r/w) but that's not what I'm after.
What I'm after is making GCC generate code in which all accesses to global variables/static function variables are either done through a level of indirection, i.e. through a pointer (-fPIC not good enough because static global vars are not accessed via GOT) or their addresses can be manually specified (at link/compile time).
Thank you
What I'm after is making GCC generate code in which all accesses to
global variables/static function variables ... their addresses can be
manually specified (at link/compile time).
You can specify the addresses of the .bss and .data sections (which contain the uninitialized and initialized variables respectively) with linker commands. The relative placement of the variables in the sections is up to the compiler/linker.
If you only need to place individual variables, you can declare them extern and specify their addresses in a file, e.g. addresses.ld:
X = 0x12345678;
(note: spaces around = needed), which is added to the compiler/linker arguments:
cc main.c addresses.ld

OpenCL: maintaining separate version of kernels

The Intel SDK says:
If you need separate versions of kernels, one way to keep the source
code base same, is using the preprocessor to create CPU-specific or
GPU-specific optimized versions of the kernels. You can run
clBuildProgram twice on the same program object, once for CPU with
some flag (compiler input) indicating the CPU version, the second time
for GPU and corresponding compiler flags. Then, when you create two
kernels with clCreateKernel, the runtime has two different versions
for each kernel.
Let us say I call clBuildProgram twice, once with flags for the CPU and once with flags for the GPU. This will compile two versions of the program, one optimized for the CPU and another optimized for the GPU. But how do I create the two kernels, since there is no CPU/GPU-specific option in clCreateKernel()?
The sequence of calls to build the kernel for CPU- and GPU devices and obtain the different kernels could look like this:
cl_program program = clCreateProgramWithSource(...);
clBuildProgram(program, numCpuDevices, cpuDeviceList, cpuOptions, NULL, NULL);
cl_kernel cpuKernel = clCreateKernel(program, ...);
clBuildProgram(program, numGpuDevices, gpuDeviceList, gpuOptions, NULL, NULL);
cl_kernel gpuKernel = clCreateKernel(program, ...);
(Note: I could not test this at the moment. If there's something wrong, I'll delete this answer)
clCreateKernel creates an entry point to a program, and the program has already been compiled for a specific device (CPU or GPU). So there is nothing you can do at the kernel creation level once the program has been compiled one way or the other.
By passing different compiled program objects, clCreateKernel will create different kernel objects for different devices.
The key to control the GPU/CPU mode is at the clBuildProgram step, where a device has to be specified.
Additionally the compilation can be further refined with external defines to disable/enable pieces of code specifically designed for CPU/GPU.
You would create only one kernel, with the same name for both devices. To discriminate between devices, use #ifdef checks inside the kernel, e.g.:
kernel void foo(global float *bar)
{
#ifdef HAVE_CPU
bar[0] = 23.0;
#elif defined(HAVE_GPU)
bar[0] = 42.0;
#endif
}
You can set this flag when building the program:
program.build({device}, "-DHAVE_CPU")
or -DHAVE_GPU. Remark: -D... is not a typo.
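The same selection logic can be tried on the host, since OpenCL C uses the C preprocessor. In this sketch the #define stands in for passing -DHAVE_GPU as a build option (which defines the macro as 1), and selectedValue is a hypothetical function that mirrors the kernel body. The #else branch is an addition that makes a missing flag visible.

```cpp
#include <cassert>

// Stand-in for the build option "-DHAVE_GPU"; in OpenCL this would come from
// the options string passed to the program build call instead.
#define HAVE_GPU 1

// Hypothetical host-side mirror of the kernel body: returns the value the
// kernel would have written to bar[0].
float selectedValue() {
#if defined(HAVE_CPU)
    return 23.0f;
#elif defined(HAVE_GPU)
    return 42.0f;
#else
    return -1.0f;  // neither flag was set
#endif
}
```

Using defined(...) in both branches keeps the check valid even when a macro is defined with an empty value.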

What are kernel blocks in OpenCL?

In the article "How to set up Xcode to run OpenCL code, and how to verify the kernels before building" NeXTCoder referred to some code as the "Short Answer", i.e. https://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/XCodeHelloWorld/XCodeHelloWorld.html.
In that code the author says "Wrap your kernel code into a kernel block:" without explaining what a "kernel block" is. (The OpenCL Programmer Guide for Mac OS X by Apple makes no mention of kernel blocks.)
The host program calls "square_kernel", but the sample kernel is called "square" and the sample kernel block is labelled "kernelName" (in italics). Can you please tell me how to put the 3 pieces together (kernel, kernel block and host program) to run in Xcode 5.1? I only have one kernel. Thanks.
It's not really jargon. It's a closure-like entity.
OpenCL C 2.0 adds support for the clang block syntax. You use the ^ operator to declare a Block variable and to indicate the beginning of a Block literal. The body of the Block itself is contained within {}, as shown in the example (as usual with C, ; indicates the end of the statement). The Block is able to make use of variables from the same scope in which it was defined.
Example:
int multiplier = 7;
int (^myBlock)(int) = ^(int num) {
return num * multiplier;
};
printf("%d\n", myBlock(3));
// prints 21
Source:
https://www.khronos.org/registry/cl/sdk/2.1/docs/man/xhtml/blocks.html
The term "kernel block" seems to be just jargon for "the part of the code that is the kernel". In this case, the kernel block is simply the function that is declared to be a kernel by adding kernel before its declaration. Or, even simpler, judging from how the term is used on this website, I would say that "kernel block" is the same as "kernel".
The kernelName (in italics) is a placeholder. The code there shows the general pattern of how to define any kernel:
It is prefixed with kernel
It returns void
It has a name ... the kernelName, which may for example be square
It has several input- and output parameters
The reason why the kernel is called square but invoked as square_kernel seems to be some magic done by Xcode: it appears to read the .cl file and create a .h file containing additional declarations derived from the .cl file (as can be seen in this question, where a kernel called rebound is defined and GCL generated a rebound_kernel declaration).

Passing a pointer to device memory between classes in CUDA

I would appreciate some help involving CUDA device memory pointers. Basically I want to split my CUDA kernel code into multiple files for readability and because it is a large program. So what I want to do is be able to pass the same device memory pointers to multiple CUDA kernels, not simultaneously. Below is a rough example of what I need
//random.h
class random{
public:
int* dev_pointer_numbers;
};
so the object simply needs to store the pointer to device memory
//random_kernel.cu
__global__ void doSomething(int *values){
//do some processing
}
extern "C" void init_memory(int *devPtr,int *host_memory,int arraysize)
{
cudaMalloc(&devPtr,arraysize*sizeof(int));
cudaMemcpy(devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
extern "C" void runKernel(int *devPtr){
doSomething<<<1,1>>>(devPtr);
}
and the main file:
//main.cpp
//ignoring all the details etc
random rnd;
void CUDA(int *hostArray)
{
init_memory(rnd.dev_pointer_numbers,hostArray,10);
runKernel(rnd.dev_pointer_numbers);
}
I understand that when I run the kernel code with the object's pointer, it isn't mapped in device memory, and that's why the kernel code fails. What I want to know is how I can store the pointer to a particular block of device memory in my main file so that it can be reused by other CUDA kernel files.
You're losing your pointer!
Check out your init_memory function:
void init_memory(int *devPtr,int *host_memory,int arraysize)
{
cudaMalloc(&devPtr,arraysize*sizeof(int));
cudaMemcpy(devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
So you pass in a pointer, at which point the function has its own local copy named devPtr. Then you call cudaMalloc() with the address of that local copy. When the function returns, the local copy (on the stack) is destroyed, so you have lost the pointer to the allocation and the caller's pointer was never changed.
Instead try this:
void init_memory(int **devPtr,int *host_memory,int arraysize)
{
cudaMalloc(devPtr,arraysize*sizeof(int));
cudaMemcpy(*devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
...
init_memory(&rnd.dev_pointer_numbers,hostArray,10);
As a side note, consider removing the extern "C". Since you're calling this from C++ (main.cpp), there's no point in it and it just clutters your code.
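The pointer loss described above is ordinary pass-by-value behaviour and can be reproduced without CUDA. In this sketch std::malloc stands in for cudaMalloc and std::memcpy for cudaMemcpy. The names init_by_value, init_by_address and pointer_demo are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Stand-in for the broken init_memory: devPtr is a copy of the caller's
// pointer, so the allocation never reaches the caller.
void init_by_value(int *devPtr, const int *host, int n) {
    devPtr = static_cast<int *>(std::malloc(n * sizeof(int)));
    if (!devPtr) return;
    std::memcpy(devPtr, host, n * sizeof(int));
    std::free(devPtr);  // freed here only so that the sketch does not leak
}

// Stand-in for the fixed init_memory: the caller passes the address of its
// pointer, so the allocation is written back into the caller's variable.
void init_by_address(int **devPtr, const int *host, int n) {
    *devPtr = static_cast<int *>(std::malloc(n * sizeof(int)));
    if (!*devPtr) return;
    std::memcpy(*devPtr, host, n * sizeof(int));
}

// Returns true when the by-value call leaves the caller's pointer untouched
// and the by-address call fills it in with the copied data.
bool pointer_demo() {
    const int host[3] = {1, 2, 3};

    int *lost = nullptr;
    init_by_value(lost, host, 3);

    int *kept = nullptr;
    init_by_address(&kept, host, 3);

    const bool ok = (lost == nullptr) && (kept != nullptr) && (kept[2] == 3);
    std::free(kept);
    return ok;
}
```

The same reasoning applies to rnd.dev_pointer_numbers: only the version that receives its address can update it.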

OpenCL kernel fails to compile asking for address space qualifier

The following OpenCL code fails to compile.
typedef struct {
double d;
double* da;
long* la;
uint ui;
} MyStruct;
__kernel void MyKernel (__global MyStruct* s) {
}
The error message is as follows.
line 11: error: kernel pointer arguments must point to addrSpace global, local, or constant
__kernel void MyKernel (__global MyStruct* s) {
^
As you can see I have clearly qualified the argument with '__global' as the error suggests I should. What am I doing wrong and how can I resolve this error?
Obviously this happens during kernel compilation so I haven't posted my host code here as it doesn't even get further than this.
Thanks.
I think the problem is that your struct contains pointers, which is not allowed. A kernel cannot point to host memory like that, so pointers inside kernel argument structs don't make much sense. In OpenCL a variable-sized array is backed by a cl_mem object on the host, and each one counts as a whole argument, so as far as I know you can only pass variable-sized arrays directly as kernel arguments (and adjust the number of work-items accordingly, of course).
You might prefer to put size information in your struct and pull out the arrays as standalone kernel arguments.
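A minimal host-side sketch of that layout, assuming the arrays are described only by their element counts. MyStructHeader, KernelInputs and makeInputs are hypothetical names, and the fixed-width standard types stand in for the cl_double / cl_uint / cl_long typedefs that a real host program would take from the OpenCL headers so that the layout matches the device.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical pointer-free version of MyStruct: the arrays are replaced by
// their element counts, so the struct can be copied into a buffer as-is.
struct MyStructHeader {
    double d;
    std::uint32_t ui;
    std::uint32_t daCount;  // number of elements in the former `da`
    std::uint32_t laCount;  // number of elements in the former `la`
};
static_assert(std::is_trivially_copyable<MyStructHeader>::value,
              "header must be safe to copy byte-for-byte");

// Host-side bundle: each member would be uploaded to its own buffer and
// passed as its own kernel argument.
struct KernelInputs {
    MyStructHeader header;
    std::vector<double> da;
    std::vector<std::int64_t> la;
};

KernelInputs makeInputs(double d, std::uint32_t ui,
                        std::vector<double> da, std::vector<std::int64_t> la) {
    KernelInputs in;
    in.header.d = d;
    in.header.ui = ui;
    in.header.daCount = static_cast<std::uint32_t>(da.size());
    in.header.laCount = static_cast<std::uint32_t>(la.size());
    in.da = std::move(da);
    in.la = std::move(la);
    return in;
}

// Matching kernel signature (OpenCL C), shown for orientation only:
//   __kernel void MyKernel(__global const MyStructHeader* s,
//                          __global double* da, __global long* la)
```

The kernel then reads the lengths from the header and indexes da and la directly, with no pointers stored inside the struct.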
