How to pass a char pointer into an OpenCL kernel?

I am trying to pass a char pointer into an OpenCL kernel function like this:
char *rp=(char*)malloc(something);
ciErr=clSetKernelArg(ckKernel,0,sizeof(cl_char*),(char *)&rp);
and my kernel is:
__kernel void subFilter(char *rp)
{
// do something
}
When I run the kernel, I get
error -48 in clsetkernelargs 1
I also tried modifying the kernel to:
__kernel void subFilter(__global char *rp)
{
// do something
}
I got this error:
error -38 in clsetkernelargs 1
which means CL_INVALID_MEM_OBJECT (invalid memory object).
I just want to access, inside the kernel, the memory that rp points to.
Any help would be greatly appreciated.
Thanks,
Piyush

Any arrays and memory objects that you use in an OpenCL kernel need to be allocated via the OpenCL API (e.g. using clCreateBuffer). This is because the host and device don't always share the same physical memory. A pointer to data that is allocated on the host (via malloc) means absolutely nothing to a discrete GPU, for example.
To pass an array of characters to an OpenCL kernel, you should write something along the lines of:
cl_int err;
char *h_rp = (char*)malloc(length);
// ... fill h_rp with the data the kernel should read ...
cl_mem d_rp = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, length, h_rp, &err);
err = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), &d_rp);
and declare the argument with the __global (or __constant) qualifier in your kernel. You can then copy the data back to the host with clEnqueueReadBuffer.
If you do know that host and device share the same physical memory, then you can allocate memory that is visible to both host and device by creating a buffer with the CL_MEM_ALLOC_HOST_PTR flag, and using clEnqueueMapMemObject when you wish to access the data from the host. The new shared-virtual-memory (SVM) features of OpenCL 2.0 also improve the way that you can share buffers between host and device on unified-memory architectures.

Related

OpenCL Image reading not working on dynamically allocated array

I'm writing an OpenCL program that applies a convolution matrix to an image. Everything works fine if I store all pixels in an array image[height*width][4] (line 65, commented out). (Sorry, I speak Spanish, and I code mostly in Spanish.) But since the images I'm working with are really large, I need to allocate the memory dynamically. When I run the code, I get a segmentation fault.
After some poor man's debugging, I found that the problem arises after executing the kernel and reading the output image back into the host, storing the data in the dynamically allocated array. I can't access the array's data without getting the error.
I think the problem is the way the clEnqueueReadImage function (line 316) writes the image data into the image array. This array was allocated dynamically, so it has no predefined "structure".
I need a solution, but I can't find one, either on my own or on the Internet.
The C program and the OpenCL kernel are here:
https://gist.github.com/MigAGH/6dd0fddfa09f5aabe7eb0c2934e58cbe
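Don't use a pointer to pointers (unsigned char**). clEnqueueReadImage writes the whole image as one contiguous block of bytes starting at the pointer you give it, so the destination must be a single flat allocation; an unsigned char** points to an array of pointers, not to the pixel data. Use a regular pointer instead:
unsigned char* image = (unsigned char*)malloc(sizeof(unsigned char)*ancho*alto*4); // ancho = width, alto = height
Then in the for loop:
for(i=0; i<ancho*alto; i++){
unsigned char pixel[4]; // one 4-byte pixel; no per-pixel malloc/free needed
fread(pixel, 4, 1, bmp_entrada);
image[i*4] = pixel[0];
image[i*4+1] = pixel[1];
image[i*4+2] = pixel[2];
image[i*4+3] = pixel[3];
}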
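The same idea can be taken one step further. The sketch below is plain C, not the asker's program, and the names read_rgba_pixels and pixel_channel are hypothetical. It assumes the file is already positioned at the start of the pixel data and stores 4 bytes per pixel with no row padding (which holds for 32-bit BMPs); rows are kept in file order, and BMP files usually store the bottom row first. It reads all pixels with a single fread into one flat buffer, the same tightly packed layout clEnqueueReadImage produces when row_pitch is 0, and shows how to address channel c of pixel (x, y).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read width*height pixels of 4 bytes each into one contiguous buffer.
   Assumes f is positioned at the first pixel and rows have no padding.
   Returns NULL on allocation or read failure; caller frees the buffer. */
unsigned char *read_rgba_pixels(FILE *f, size_t width, size_t height)
{
    size_t nbytes = width * height * 4;
    unsigned char *image = malloc(nbytes);
    if (image == NULL)
        return NULL;
    if (fread(image, 1, nbytes, f) != nbytes) { /* one read for all pixels */
        free(image);
        return NULL;
    }
    return image;
}

/* Channel c (0..3) of pixel (x, y) in a tightly packed 4-channel buffer:
   row y starts at y * width pixels, each pixel is 4 bytes. */
unsigned char pixel_channel(const unsigned char *image, size_t width,
                            size_t x, size_t y, size_t c)
{
    assert(c < 4);
    return image[(y * width + x) * 4 + c];
}
```

Because the buffer is flat, the same pointer should also be usable as the host pointer for clEnqueueWriteImage or clEnqueueReadImage with a row_pitch of 0.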

Local memory using C++ Wrappers

I want to use local memory in my kernels, but I'm having trouble passing the 'NULL' (local memory) arguments to them. I'd like to know how to pass these arguments with the approach I'm using, shown below, rather than with setArg, which I saw here: How to declare local memory in OpenCL?
I have the following host code for my kernel:
declared in a .h file:
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer>> setInputKernel;
in host code:
this->setInputKernel.reset(new cl::make_kernel<cl::Buffer, cl::Buffer>(program, "setInputs"));
enqueue kernel code:
(*setInputKernel)(cl::EnqueueArgs(*queue, cl::NDRange(1000),cl::NDRange(1000)),
cl::Buffer, cl::Buffer);
kernel code:
kernel void setInputs(global float* restrict inputArr, global float* restrict inputs)
I have already set appropriate sizes and settings for my local work-group parameters. However, I have not managed to pass the data into the kernel.
The kernel with the local work group updates:
kernel void setInputs(global float* restrict inputArr, global float*
restrict inputs, local float* inputArrLoc, local float* inputsLoc)
I tried changing my code accordingly, using NULL or cl::Buffer for the new kernel parameters, but neither worked:
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer, NULL, NULL>> setInputKernel;
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer, cl::Buffer, cl::Buffer>> setInputKernel;
The first attempt gives a compiler error saying that the function expects a value where I did not provide one, and the second returns a clSetKernelArg error when I try to run the kernel. In both cases, I made sure that the parameters in the header and host files were consistent.
I also tried simply appending NULL after my cl::Buffer arguments when enqueuing the kernel, but this gives an error saying that there is no matching function for the call.
How do I pass parameters to my kernel in my example?
The C++ wrapper provides a cl::LocalSpaceArg type and a cl::Local helper function for this.
The type of your kernel would be this:
cl::make_kernel<cl::Buffer, cl::Buffer, cl::LocalSpaceArg, cl::LocalSpaceArg>
You would then specify the size of the local memory allocations when you enqueue the kernel by using cl::Local(size) (where size is the number of bytes you wish to allocate).

OpenCL: maintaining separate version of kernels

The Intel SDK says:
If you need separate versions of kernels, one way to keep the source
code base same, is using the preprocessor to create CPU-specific or
GPU-specific optimized versions of the kernels. You can run
clBuildProgram twice on the same program object, once for CPU with
some flag (compiler input) indicating the CPU version, the second time
for GPU and corresponding compiler flags. Then, when you create two
kernels with clCreateKernel, the runtime has two different versions
for each kernel.
Let's say I call clBuildProgram twice, with flags for the CPU and for the GPU. This compiles two versions of the program, one optimized for the CPU and another optimized for the GPU. But how do I then create two kernels, since clCreateKernel() has no CPU/GPU-specific option?
The sequence of calls to build the kernel for CPU and GPU devices and obtain the different kernels could look like this:
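// One program object per device class: clBuildProgram returns
// CL_INVALID_OPERATION if kernel objects are already attached to the program.
cl_program cpuProgram = clCreateProgramWithSource(...);
clBuildProgram(cpuProgram, numCpuDevices, cpuDeviceList, cpuOptions, NULL, NULL);
cl_kernel cpuKernel = clCreateKernel(cpuProgram, ...);
cl_program gpuProgram = clCreateProgramWithSource(...);
clBuildProgram(gpuProgram, numGpuDevices, gpuDeviceList, gpuOptions, NULL, NULL);
cl_kernel gpuKernel = clCreateKernel(gpuProgram, ...);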
(Note: I could not test this at the moment. If there's something wrong, I'll delete this answer)
clCreateKernel creates an entry point into a program, and the program has already been compiled for a specific device (CPU or GPU). So there is nothing you can do at the kernel-creation level once the program has been compiled one way or the other.
By passing differently compiled program objects, clCreateKernel will create different kernel objects for different devices.
The key to controlling the GPU/CPU mode is the clBuildProgram step, where a device has to be specified.
Additionally, the compilation can be further refined with external defines that enable or disable pieces of code designed specifically for the CPU or GPU.
You would create only one kernel, with the same name for every device. To distinguish between devices, use #ifdef checks inside the kernel, e.g.:
kernel void foo(global float *bar)
{
#ifdef HAVE_CPU
bar[0] = 23.0;
#elif HAVE_GPU
bar[0] = 42.0;
#endif
}
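You set this macro when building the program:
program.build({device}, "-DHAVE_CPU")
or with -DHAVE_GPU for a GPU build. Note: -D... is not a typo; as with a C compiler's -D option, it defines a preprocessor macro.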
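On the host, the only per-device difference is then the options string. As a sketch, a hypothetical helper make_build_options could compose it before each build; the result would be passed as the options argument of clBuildProgram (or as the second argument of cl::Program::build). The helper name and the is_gpu flag are assumptions for illustration; -cl-fast-relaxed-math is used in the checks only as an example of an extra standard build option.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write "-DHAVE_GPU" or "-DHAVE_CPU", optionally
   followed by a space and extra options, into buf.
   Returns 0 on success, -1 if the result does not fit in bufsize bytes. */
int make_build_options(char *buf, size_t bufsize, int is_gpu, const char *extra)
{
    assert(buf != NULL && bufsize > 0);
    int n = snprintf(buf, bufsize, "%s%s%s",
                     is_gpu ? "-DHAVE_GPU" : "-DHAVE_CPU",
                     (extra != NULL && extra[0] != '\0') ? " " : "",
                     extra != NULL ? extra : "");
    /* snprintf returns the length it wanted to write; >= bufsize means truncated */
    return (n < 0 || (size_t)n >= bufsize) ? -1 : 0;
}
```

Keeping the device-specific part in one place means both builds compile exactly the same kernel source and differ only in which #ifdef branch is active.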

Passing a pointer to device memory between classes in CUDA

I would appreciate some help involving CUDA device memory pointers. Basically I want to split my CUDA kernel code into multiple files for readability and because it is a large program. So what I want to do is be able to pass the same device memory pointers to multiple CUDA kernels, not simultaneously. Below is a rough example of what I need
//random.h
class random{
public:
int* dev_pointer_numbers;
};
So the object simply needs to store the pointer to device memory.
//random_kernel.cu
__global__ void doSomething(int *values){
//do some processing
}
extern "C" void init_memory(int *devPtr,int *host_memory,int arraysize)
{
cudaMalloc(&devPtr,arraysize*sizeof(int));
cudaMemcpy(devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
extern "C" void runKernel(int *devPtr){
doSomething<<<1,1>>>(devPtr);
}
and the main file:
//main.cpp
//ignoring all the details etc
random rnd;
void CUDA(int *hostArray)
{
init_memory(rnd.dev_pointer_numbers,hostArray,10);
runKernel(rnd.dev_pointer_numbers);
}
I understand that when I run the kernel with the object's pointer, it isn't mapped to device memory, which is why the kernel fails. What I want to know is how I can store the pointer to a particular block of device memory in my main file so that it can be reused by other CUDA kernel files.
You're losing your pointer!
Check out your init_memory function:
void init_memory(int *devPtr,int *host_memory,int arraysize)
{
cudaMalloc(&devPtr,arraysize*sizeof(int));
cudaMemcpy(devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
So you pass in a pointer, and inside the function devPtr is a local copy of it. Then you call cudaMalloc() with the address of that local copy. When the function returns, the local copy (on the stack) is destroyed, so you have lost the pointer, and the caller's rnd.dev_pointer_numbers is never updated.
Instead try this:
void init_memory(int **devPtr,int *host_memory,int arraysize)
{
cudaMalloc(devPtr,arraysize*sizeof(int));
cudaMemcpy(*devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
...
init_memory(&rnd.dev_pointer_numbers,hostArray,10);
As a side note, consider removing the extern "C": since you're calling this from C++ (main.cpp), there's no point, and it just clutters your code.
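The pointer-to-pointer fix can be checked without a GPU. In the plain-C sketch below, malloc and memcpy stand in for cudaMalloc and cudaMemcpy, and struct random_state is a hypothetical stand-in for the random class (renamed so it does not clash with the POSIX random() function). Because init_memory receives the address of rnd.dev_pointer_numbers, the assignment through *devPtr updates the caller's member instead of a local copy. In the real CUDA version you would keep cudaMalloc(devPtr, ...) and cudaMemcpy(*devPtr, ..., cudaMemcpyHostToDevice) as in the answer, release the memory with cudaFree, and never dereference the device pointer on the host.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder mirroring the question's `random` class. */
struct random_state {
    int *dev_pointer_numbers;
};

/* Allocate arraysize ints, copy host data in, and store the new pointer
   through devPtr so the caller's variable is updated.
   malloc/memcpy stand in for cudaMalloc/cudaMemcpy in this sketch.
   Returns 0 on success, -1 on allocation failure. */
int init_memory(int **devPtr, const int *host_memory, size_t arraysize)
{
    assert(devPtr != NULL);
    int *p = malloc(arraysize * sizeof *p);
    if (p == NULL)
        return -1;
    memcpy(p, host_memory, arraysize * sizeof *p);
    *devPtr = p; /* write through the pointer-to-pointer, not a local copy */
    return 0;
}
```

An equivalent design is to return the new pointer from init_memory and assign it to rnd.dev_pointer_numbers at the call site; either way, the caller ends up owning the allocation.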

OpenCL kernel fails to compile asking for address space qualifier

The following opencl code fails to compile.
typedef struct {
double d;
double* da;
long* la;
uint ui;
} MyStruct;
__kernel void MyKernel (__global MyStruct* s) {
}
The error message is as follows.
line 11: error: kernel pointer arguments must point to addrSpace global, local, or constant
__kernel void MyKernel (__global MyStruct* s) {
^
As you can see I have clearly qualified the argument with '__global' as the error suggests I should. What am I doing wrong and how can I resolve this error?
Obviously, this happens during kernel compilation, so I haven't posted my host code here; it doesn't get any further than this.
Thanks.
I think the problem is that you have pointers in your struct, which is not allowed; the error most likely refers to those pointer members (da and la), which have no address space qualifier, rather than to the __global on s itself. You cannot point to host memory from your kernel like that, so pointers in kernel-argument structs don't make much sense. Variable-sized arrays are backed in OpenCL by a cl_mem object, which counts as one whole argument, so as far as I know, you can only pass variable-sized arrays directly as kernel arguments (and adjust the number of work-items accordingly, of course).
You might prefer to put size information in your struct and pull out the arrays as standalone kernel arguments.
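A sketch of that layout in plain C follows. It is an illustration under stated assumptions, not a drop-in replacement: the struct name MyParams, the computation, and the function my_kernel_item are hypothetical, and the function simulates a single work-item, with gid standing in for get_global_id(0). The struct keeps only scalars and element counts, while da and la become separate arguments, which in the real kernel would be separate __global buffers set with their own clSetKernelArg calls. OpenCL C's long is always 64-bit and uint is 32-bit, so the host mirror uses int64_t and uint32_t (cl_long and cl_uint would also work).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pointer-free argument struct: scalars plus the lengths of the arrays
   that are now passed as separate buffers. */
typedef struct {
    double d;
    uint32_t ui;
    uint32_t da_len; /* number of elements in the separate da buffer */
    uint32_t la_len; /* number of elements in the separate la buffer */
} MyParams;

/* Plain-C stand-in for one work-item of a kernel declared (hypothetically) as
     __kernel void MyKernel(__global const MyParams *s,
                            __global const double *da,
                            __global const long *la,
                            __global double *out)
   gid plays the role of get_global_id(0). */
void my_kernel_item(const MyParams *s, const double *da, const int64_t *la,
                    double *out, size_t gid)
{
    assert(s != NULL && da != NULL && la != NULL && out != NULL);
    /* bounds come from the size fields, since the arrays are separate */
    if (gid < s->da_len && gid < s->la_len)
        out[gid] = s->d * da[gid] + (double)la[gid];
}
```

On the device, the host-side loop over gid is replaced by launching one work-item per element, with the global size set from da_len.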
