In OpenCL, can I set kernel argument as following?
cl_uint a = 0;
kernel.setArg(0, sizeof(a), &a);
I want to read&write one value from/to a kernel function, not only write to.
Setting a kernel argument in this manner can only be used for inputs to the kernel. Any output you want to read (either in a subsequent kernel or from the host program) must be written to a buffer or an image. In your case, that means you need to create a single-element buffer and pass the buffer to the kernel.
One way to think about this is that when you call setArg with the parameter &a, the OpenCL kernel is using the value of a, not the location of a. If the kernel were to write to kernel argument zero, your host program would have no way of recovering the value that was written.
Your code creates an argument of type unsigned int, not pointer to unsigned int.
clSetKernelArg takes a pointer to the argument value, not the value itself.
If you want to pass a pointer argument, you will have to create a buffer with clCreateBuffer (even if it's just one value in there) and call clSetKernelArg with the resulting cl_mem.
The following code creates a buffer for 1 cl_uint in __global memory, and copies the value of my_value to it. After running the kernel, it copies the (possibly modified) value back to my_value.
cl_uint my_value = 0;
const unsigned int count = 1;
// Allocate buffer
cl_mem hDeviceMem = clCreateBuffer(hContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, count * sizeof(cl_uint), &my_value, &nError);
// Set pointer to buffer as argument
clSetKernelArg(hKernel, 0, sizeof(cl_mem), &hDeviceMem);
// Run kernel
clEnqueueNDRangeKernel(...);
// Copy values back
clEnqueueReadBuffer(hCmdQueue, hDeviceMem, CL_TRUE, 0, count * sizeof(cl_uint), &my_value, 0, NULL, NULL);
Your kernel should then look like this:
__kernel void myKernel(__global unsigned int* value)
{
// read/write to *value here
}
This should work the same as sending a 1-length vector as a param. You might have to use __global uint aParam in your kernel definition.
Related
I want to do some experiments in OpenCL and I want to know possibility to change states during kernel execution from host code using buffer.
I attempted to alter the state of a while loop in the kernel code by modifying the buffer value from within the host code, however the execution is hung.
void my_kernel(
__global bool *in,
__global int *out)
{
int i = get_global_id(0);
while(1) {
if(1 == *in) {
printf("while loop is finished");
break;
}
}
printf("out[0] = %d\n", out[0]);
}
I call second time the function clEnqueueWriteBuffer() to change state of input value.
input[0] = 1;
err = clEnqueueWriteBuffer(commands, input_buffer,
CL_TRUE, 0, sizeof(int), (void*)input,
0, NULL,NULL);
At least for OpenCL 1.x, this is not permitted, and any behaviour you may observe in one implementation cannot be relied upon.
See the NOTE in the OpenCL 1.2 specification, section 5.2.2, Reading, Writing and Copying Buffer Objects:
Calling clEnqueueWriteBuffer to update the latest bits in a region of the buffer object with the ptr argument value set to host_ptr + offset, where host_ptr is a pointer to the memory region specified when the buffer object being written is created with CL_MEM_USE_HOST_PTR, must meet the following requirements in order to avoid undefined behavior:
The host memory region given by (host_ptr + offset, cb) contains the latest bits when the enqueued write command begins execution.
The buffer object or memory objects created from this buffer object are not mapped.
The buffer object or memory objects created from this buffer object are not used by any command-queue until the write command has finished execution.
The final condition is not met by your code, therefore its behaviour is undefined.
I am not certain if the situation is different with OpenCL 2.x's Shared Virtual Memory (SVM) feature, as I have no practical experience using it, perhaps someone else can contribute an answer for that.
I created a buffer on the OpenCL device (a GPU), and from the host I need to know the global on-device pointer address so that I can put that on-device address in another buffer so that the kernel can then read from that buffer that contains the address of the first buffer so that then it can access the contents of that buffer.
If that's confusing here's what I'm trying to do: I create a generic floats-containing buffer representing a 2D image, then from the host I create a todo list of all the things my kernel needs to draw, which lines, which circles, which images... So from that list the kernel has to know where to find that image, but the reference to that image cannot be passed as a kernel argument, because that kernel might draw no image, or a thousand different images, all depending on what the list says, so it has to be referenced in that buffer that serves as a todo list for my kernel.
The awkward way I've done it so far:
To do so I tried making a function that calls a kernel after the creation of the image buffer that gets the buffer and returns the global on-device address as a ulong in another buffer, then the host stores that value in a 64-bit integer, like this:
uint64_t get_clmem_device_address(clctx_t *clctx, cl_mem buf)
{
const char kernel_source[] =
"kernel void get_global_ptr_address(global void *ptr, global ulong *devaddr) \n"
"{ \n"
" *devaddr = (ulong) ptr; \n"
"} \n";
int32_t i;
cl_int ret;
static int init=1;
static cl_program program;
static cl_kernel kernel;
size_t global_work_size[1];
static cl_mem ret_buffer;
uint64_t devaddr;
if (init)
{
init=0;
ret = build_cl_program(clctx, &program, kernel_source);
ret = create_cl_kernel(clctx, program, &kernel, "get_global_ptr_address");
ret_buffer = clCreateBuffer(clctx->context, CL_MEM_WRITE_ONLY, 1*sizeof(uint64_t), NULL, &ret);
}
if (kernel==NULL)
return ;
// Run the kernel
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), &ret_buffer);
global_work_size[0] = 1;
ret = clEnqueueNDRangeKernel(clctx->command_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL); // enqueue the kernel
ret = clEnqueueReadBuffer(clctx->command_queue, ret_buffer, CL_FALSE, 0, 1*sizeof(uint64_t), &devaddr, 0, NULL, NULL); // copy the value
ret = clFlush(clctx->command_queue);
clFinish(clctx->command_queue);
return devaddr;
}
Apparently this works (it does return a number, although it's hard to know if it's correct), but then I put this devaddr (a 64-bit integer on the host) in the todo list buffer that the kernel uses to know what to do, and then if necessary (according to the list) the kernel calls the function below, le here being a pointer to the relevant entry in the todo list, and the 64-bit address being the first element:
float4 blit_sprite(global uint *le, float4 pv)
{
const int2 p = (int2) (get_global_id(0), get_global_id(1));
ulong devaddr;
global float4 *im;
int2 im_dim;
devaddr = ((global ulong *) le)[0]; // global address for the start of the image as a ulong
im_dim.x = le[2];
im_dim.y = le[3];
im = (global float4 *) devaddr; // ulong is turned into a proper global pointer
if (p.x < im_dim.x)
if (p.y < im_dim.y)
pv += im[p.y * im_dim.x + p.x]; // this gives me a CL_OUT_OF_RESOURCES error, even when changing it to im[0]
return pv;
}
but big surprise this doesn't work, it gives me a CL_OUT_OF_RESOURCES error, which I assume means my im pointer isn't valid. Actually it works, it didn't work when I used two different contexts. But it's still pretty unwieldy.
Is there a less weird way to do what I want to do?
OpenCL standard doesn't guarantee that memory objects will not be physically reallocated between kernel calls. So, original Device-side address is valid only within single kernel NDRange. That's one of the reasons why OpenCL memory objects are represented on Host side as transparent structure pointers.
Though, you can save offset to memory object's first byte in 1st kernel and pass it to 2nd kernel. Every time you launch your kernel, you will obtain actual Device-side address within your kernel & increment it by saved shift value. That would be perfectly "legal".
I have allocated a buffer on the device:
cl_mem buff;
I want to pass this buffer plus an offset to my kernel
i.e.
buff + offset;
I find that this is not allowed. If I instead pass buff into my kernel and then
calculate the offset buffer inside the kernel, then this is fine. But it adds a needless calculation to each kernel run.
So, I get that the device memory space is different than the host, so I can't do simple pointer arithmetic. But, is there a way of taking an address to a device memory buffer,
calculating an offset, and passing this offset buffer into the kernel?
I think this may be possible with clCreateSubBuffer, but the offset needs to be aligned to the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN, and this is not always possible for my kernel.
Using clCreateSubBuffer
If offset can be calculated statically, export macro, when building Program of your Kernel;
Assuming you are using C++
std::string macro;
std::stringstream ss;
// e. g. let it be 2^10
std::size_t offset = 1024;
ss << offset;
macro = "-D offset=";
macro += ss.str();
...
// When building Programm
clBuildProgram(..., macro.c_str(), ...);
//Inside your Kernel macro "offset" is defined
void __kenel my(
__global const uchar* data)
{
__global const uchar* data_with_shift = data + offset;
return;
}
Though, calculations inside kernel are extreamly cheap, so Marco13 gave you good advice.
I want to traverse a tree at GPU with OpenCL, so i assemble the tree in a contiguous block at host and i change the addresses of all pointers so as to be consistent at device as follows:
TreeAddressDevice = (size_t)BaseAddressDevice + ((size_t)TreeAddressHost - (size_t)BaseAddressHost);
I want the base address of the memory buffer:
At host i allocate memory for the buffer, as follows:
cl_mem tree_d = clCreateBuffer(...);
The problem is that cl_mems are objects that track an internal representation of the data. Technically they're pointers to an object, but they are not pointers to the data. The only way to access a cl_mem from within a kernel is to pass it in as an argument via setKernelArgs.
Here http://www.proxya.net/browse.php?u=%3A%2F%2Fwww.khronos.org%2Fmessage_boards%2Fviewtopic.php%3Ff%3D37%26amp%3Bt%3D2900&b=28 i found the following solution, but it doesnot work:
__kernel void getPtr( __global void *ptr, __global void *out )
{
*out = ptr;
}
that can be invoked as follows
Code:
...
cl_mem auxBuf = clCreateBuffer( context, CL_MEM_READ_WRITE, sizeof(void*), NULL, NULL );
void *gpuPtr;
clSetKernelArg( getterKernel, 0, sizeof(cl_mem), &myBuf );
clSetKernelArg( getterKernel, 1, sizeof(cl_mem), &auxBuf );
clEnqueueTask( commandQueue, getterKernel, 0, NULL, NULL );
clEnqueueReadBuffer( commandQueue, auxBuf, CL_TRUE, 0, sizeof(void*), &gpuPtr, 0, NULL, NULL );
clReleaseMemObject(auxBuf);
...
Now "gpuPtr" should contain the address of the beginning of "myBuf" in GPU memory space.
The solution is obvious and i can't find it? How can I get back a pointer to device memory when creating buffers?
It's because in the OpenCL model, host memory and device memory are disjoint. A pointer in device memory will have no meaning on the host.
You can map a device buffer to host memory using clEnqueueMapBuffer. The mapping will synchronize device to host, and unmapping will synchronize back host to device.
Update. As you explain in the comments, you want to send a tree structure to the GPU. One solution would be to store all tree nodes inside an array, replacing pointers to nodes with indices in the array.
As Eric pointed out, there are two sets of memory to consider: host memory and device memory. Basically, OpenCL tries to hide the gritty details of this interaction by introducing the buffer object for us to interact with in our program on the host side. Now, as you noted, the problem with this methodology is that it hides away the details of our device when we want to do something trickier than the OpenCL developers intended or allowed in their scope. The solution here is to remember that OpenCL kernels use C99 and that the language allows us to access pointers without any issue. With this in mind, we can just demand the pointer be stored in an unsigned integer variable to be referenced later.
Your implementation was on the right track, but it needed a little bit more C syntax to finish up the transfer.
OpenCL Kernel:
// Kernel used to obtain pointer from target buffer
__kernel void mem_ptr(__global char * buffer, __global ulong * ptr)
{
ptr[0] = &buffer[0];
}
// Kernel to demonstrate how to use that pointer again after we extract it.
__kernel void use_ptr(__global ulong * ptr)
{
char * print_me = (char *)ptr[0];
/* Code that uses all of our hard work */
/* ... */
}
Host Program:
// Create the buffer that we want the device pointer from (target_buffer)
// and a place to store it (ptr_buffer).
cl_mem target_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
MEM_SIZE * sizeof(char), NULL, &ret);
cl_mem ptr_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
1 * sizeof(cl_ulong), NULL, &ret);
/* Setup the rest of our OpenCL program */
/* .... */
// Setup our kernel arguments from the host...
ret = clSetKernelArg(kernel_mem_ptr, 0, sizeof(cl_mem), (void *)&target_buffer);
ret = clSetKernelArg(kernel_mem_ptr, 1, sizeof(cl_mem), (void *)&ptr_buffer);
ret = clEnqueueTask(command_queue, kernel_mem_ptr, 0, NULL, NULL);
// Now it's just a matter of storing the pointer where we want to use it for later.
ret = clEnqueueCopyBuffer(command_queue, ptr_buffer, dst_buffer, 0, 1 * sizeof(cl_ulong),
sizeof(cl_ulong), 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue, ptr_buffer, CL_TRUE, 0,
1 * sizeof(cl_ulong), buffer_ptrs, 0, NULL, NULL);
There you have it. Now, keep in mind that you don't have to use the char variables I used; it works for any type. However, I'd recommend using cl_ulong for the storing of pointers. This shouldn't matter for devices with less than 4GB of accessible memory. But for devices with a larger address space, you have to use cl_ulong. If you absolutely NEED to save space on your device but have a device whose memory > 4GB, then you might be able to create a struct that can store the lower 32 LSB of the address into a uint type, with the MSB's being stored in a small type.
I'm brand new to using OpenCL, and this seems like it should be very simple, so bear with me.
I'm writing a simple kernel to scan an array and look for a particular value. If that value is found anywhere in the array, I'd like a flag to be set. If the value is not found, the flag should remain 0;
Currently I'm creating a cl_mem object to hold an int
cl_mem outputFlag = clCreateBuffer(mCLContext, CL_MEM_WRITE_ONLY, sizeof(cl_int), NULL, NULL);
setting it as a kernel argument
clSetKernelArg(mCLKernels[1],1, sizeof(cl_mem), &outputFlag);
and executing my kernel which looks like:
__kernel void checkForHole(__global uchar *image , __global int found, uchar holeValue)
{
int i = get_global_id(0);
int j = get_global_id(1);
uchar sample = image[i*j];
if (sample == holeValue) {
found = 1;
}
}
Note that my array is 2D, though it shouldn't matter.
When I put a printf statement inside my found condition, it does get called (the value is found). But when I read back my value via:
cl_int result;
errorCode = clEnqueueReadBuffer(mCLCommandQueue, outputFlag, CL_TRUE
, 0, sizeof(cl_int), &result, 0, NULL, NULL);
I get 0. Is there a proper way to set a flag in openCL? it would also be nice if there was a way to halt the entire execution and just return my value if it is found.
Can I write a bool return type kernel and just return true?
Thanks!
In the kernel the output flag should be a pointer to an int.
Change the kernel parameter to __global int *found
I always seem to figure out my issues just by writing them here....
If anyone knows a way to halt the execution though, or if it's even possible, I'd still be interested :)