I happen to have a library that already gives me a GPU pointer in OpenCL internally.
auto* gpuImagePtr = upImpl->upCaffeNet->blobs().at(0)->mutable_gpu_data();
cl::Buffer imageBuffer = cl::Buffer((cl_mem)gpuImagePtr);
I want to write data into the memory that imageBuffer points to, without allocating new memory. How would I actually do that? If I just construct a cl::Buffer the normal way, it simply allocates a new buffer at a new address. I don't want to write a kernel to do this.
In CUDA I would do this:
auto* gpuImagePtr = upImpl->upCaffeNet->blobs().at(0)->mutable_gpu_data();
cudaMemcpy(gpuImagePtr, inputData.getConstPtr(), inputData.getVolume() * sizeof(float),
           cudaMemcpyHostToDevice);
In OpenCL, I tried doing this but it just segfaults:
auto* gpuImagePtr = upImpl->upCaffeNet->blobs().at(0)->mutable_gpu_data();
cl::Buffer imageBuffer = cl::Buffer((cl_mem)gpuImagePtr);
op::CLManager::getInstance(upImpl->mGpuId)->getQueue().enqueueWriteBuffer(imageBuffer, true, 0, inputData.getVolume() * sizeof(float), inputData.getConstPtr());
Okay, it turns out I need to use the retain flag:
cl::Buffer imageBuffer = cl::Buffer((cl_mem)gpuImagePtr, true);
Now it doesn't crash anymore.
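For reference, here is the full working sequence as a minimal sketch (same objects as above; the second constructor argument is retainObject). With retainObject = true the wrapper calls clRetainMemObject on the handle, so when the temporary cl::Buffer is destroyed it does not release Caffe's only reference to the allocation, which is what appears to have made the earlier version crash:

auto* gpuImagePtr = upImpl->upCaffeNet->blobs().at(0)->mutable_gpu_data();

// Wrap the existing cl_mem without taking ownership of it:
// retainObject = true bumps the reference count, so this wrapper's
// destructor will not free the device memory out from under Caffe.
cl::Buffer imageBuffer((cl_mem)gpuImagePtr, true);

// Blocking write straight into the existing device allocation; no new memory.
op::CLManager::getInstance(upImpl->mGpuId)->getQueue().enqueueWriteBuffer(
    imageBuffer, CL_TRUE, 0,
    inputData.getVolume() * sizeof(float), inputData.getConstPtr());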
Related
I have OpenCL 1.1, one device, and an out-of-order-execution command queue,
and I want multiple kernels to output their results into one buffer, in different, non-overlapping, arbitrary regions.
Is that possible?
cl::CommandQueue commandQueue(context, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
cl::Buffer buf_as(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, &as[0]);
cl::Buffer buf_bs(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, &bs[0]);
cl::Buffer buf_rs(context, CL_MEM_WRITE_ONLY, data_size, NULL);
cl::Kernel kernel(program, "dist");
kernel.setArg(0, buf_as);
kernel.setArg(1, buf_bs);
int const N = 4;
int const d = data_size / N;
std::vector<cl::Event> events(N);
for(int i = 0; i != N; ++i) {
    int const beg = d * i;
    int const len = d;
    kernel.setArg(2, beg);
    kernel.setArg(3, len);
    // NB: the output buffer buf_rs must also be bound with setArg before launching.
    commandQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(block_size_x), cl::NDRange(block_size_x), NULL, &events[i]);
}
commandQueue.enqueueReadBuffer(buf_rs, CL_FALSE, 0, data_size, &rs[0], &events, NULL);
commandQueue.finish();
I wanted to give an official committee response to this. We realise the specification is ambiguous and have made modifications to rectify this.
This is not guaranteed under OpenCL 1.x or indeed 2.0 rules. cl_mem objects are only guaranteed to be consistent at synchronization points, even when processed only on a single device and even when used by OpenCL 2.0 kernels using memory_scope_device.
Multiple child kernels of an OpenCL 2.0 parent kernel can share the parent's cl_mem objects at device scope.
Coarse-grained SVM objects can be shared at device scope between multiple kernels, as long as the memory locations written to are not overlapping.
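To make the coarse-grained SVM case concrete, here is a minimal OpenCL 2.0 sketch; ctx, queue, and the two kernels are assumed to exist, and each kernel writes only the half it is handed:

size_t n = 1024, half = n / 2;

// One coarse-grained SVM allocation shared by both kernels.
float* svm = (float*)clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0);

// Hand each kernel a disjoint region; the written ranges must not overlap.
clSetKernelArgSVMPointer(kernelA, 0, svm);         // writes [0, n/2)
clSetKernelArgSVMPointer(kernelB, 0, svm + half);  // writes [n/2, n)
clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &half, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &half, NULL, 0, NULL, NULL);

// The blocking map is a synchronization point: only now may the host read.
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_READ, svm, n * sizeof(float), 0, NULL, NULL);
/* ... read results ... */
clEnqueueSVMUnmap(queue, svm, 0, NULL, NULL);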
The writes should work fine if the global memory addresses are non-overlapping as you have described. Just make sure both kernels are finished before reading the results back to the host.
I don't think it is defined. Although you say you are writing to non-overlapping regions at the software level, it is not guaranteed that at the hardware level the accesses won't map onto the same cache lines, in which case you'll have multiple modified versions flying around.
I'm trying to send an SFML Image to OpenCL, but I only get corrupted data back. My guess is that I'm handling the pixelData wrong, but I don't know how to do it correctly :/
void ImageWindow::xBlend()
{
    size_t size = sizeof(cl_uchar4) * textureSize.x * textureSize.y;

    cl::Buffer bufferIn(manager->context, CL_MEM_READ_ONLY, size);
    cl::Buffer bufferOut(manager->context, CL_MEM_WRITE_ONLY, size);

    auto pixelData = (cl_uchar4*)image.getPixelsPtr();

    manager->queue.enqueueWriteBuffer(bufferIn, CL_TRUE, 0, size, pixelData);
    functorXBlend(bufferIn, bufferOut);
    manager->queue.enqueueReadBuffer(bufferOut, CL_TRUE, 0, size, pixelData);

    newTexture.update(image);
}
Found the error myself. Most stupid mistake ever. The above code is correct, the problem was that I messed up the local worksize.
functorXBlend = cl::KernelFunctor(
    cl::Kernel(manager->program, "xBlend"),
    manager->queue, cl::NullRange,
    cl::NDRange(textureSize.x, textureSize.y),
    cl::NullRange);
Whereas before I had this:
functorXBlend = cl::KernelFunctor(
    cl::Kernel(manager->program, "xBlend"),
    manager->queue, cl::NullRange,
    cl::NDRange(textureSize.x, textureSize.y),
    cl::NDRange(textureSize.x, textureSize.y));
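For completeness, the reason the second variant fails is that a single work-group of textureSize.x * textureSize.y work-items almost certainly exceeds the device limit. A quick way to check, as a sketch (assuming a cl::Device named device is available):

// The product of the local dimensions must not exceed this device-wide limit
// (and CL_KERNEL_WORK_GROUP_SIZE for a given kernel may be smaller still).
size_t maxWG = device.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>();
if ((size_t)textureSize.x * textureSize.y > maxWG)
    std::cerr << "Local work size too large; device limit is " << maxWG << std::endl;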
I have the following lines of code, which I use first to determine the file size of the .cl file I am reading from and load it into a buffer, and then to build my program and kernel from that buffer. Assume calculate.cl contains a simple vector-addition kernel.
//get size of kernel source
FILE *f = fopen("calculate.cl", "r");
fseek(f, 0, SEEK_END);
size_t programSize = ftell(f);
rewind(f);
//load kernel into buffer
char *programBuffer = (char*)malloc(programSize + 1);
programBuffer[programSize] = '\0';
fread(programBuffer, sizeof(char), programSize, f);
fclose(f);
//create program from buffer
cl_program program = clCreateProgramWithSource(context, 1, (const char**) &programBuffer, &programSize, &status);
//build program for devices
status = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
//create the kernel
cl_kernel calculate = clCreateKernel(program, "calculate", &status);
However, when I run my program, the output produced is all zeros instead of the intended vector-addition results. I've verified that the problem is not with the kernel itself (loading the kernel by a different method worked and gave me the intended results), but I am still curious as to why this initial method did not work.
Any help?
The problem's been solved.
Following bl0z0's suggestion and looking up the error, I found the solution here:
OpenCL: Expected identifier in kernel
thanks everyone :D I really appreciate it!
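For anyone landing here later: the usual culprit with this exact pattern is opening the file in text mode, where ftell can report more bytes than fread will actually deliver, leaving garbage between the read data and the terminator. A minimal sketch of the safer variant (binary mode, and using the count fread actually returned):

// Open in binary mode so ftell and fread agree on the byte count.
FILE *f = fopen("calculate.cl", "rb");
fseek(f, 0, SEEK_END);
size_t programSize = ftell(f);
rewind(f);

char *programBuffer = (char*)malloc(programSize + 1);
// Null-terminate at the number of bytes actually read, not at ftell's guess.
size_t bytesRead = fread(programBuffer, 1, programSize, f);
programBuffer[bytesRead] = '\0';
fclose(f);

cl_program program = clCreateProgramWithSource(context, 1,
    (const char**)&programBuffer, &bytesRead, &status);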
I believe this gives the program size in terms of the number of chars:
size_t programSize = ftell(f);
and here you need to allocate in terms of bytes:
char *programBuffer = (char*)malloc(programSize + 1);
so I think that previous line should be
char *programBuffer = (char*)malloc(programSize * sizeof(char) + 1);
Double check this by just printing the programBuffer.
I want to traverse a tree on the GPU with OpenCL, so I assemble the tree in a contiguous block on the host and rewrite all of its pointers so that they are consistent on the device, as follows:
TreeAddressDevice = (size_t)BaseAddressDevice + ((size_t)TreeAddressHost - (size_t)BaseAddressHost);
I want the base address of the memory buffer. On the host I allocate memory for the buffer as follows:
cl_mem tree_d = clCreateBuffer(...);
The problem is that cl_mems are objects that track an internal representation of the data. Technically they're pointers to an object, but they are not pointers to the data. The only way to access a cl_mem from within a kernel is to pass it in as an argument via clSetKernelArg.
Here http://www.proxya.net/browse.php?u=%3A%2F%2Fwww.khronos.org%2Fmessage_boards%2Fviewtopic.php%3Ff%3D37%26amp%3Bt%3D2900&b=28 I found the following solution, but it does not work:
__kernel void getPtr( __global void *ptr, __global void *out )
{
    *out = ptr;
}
which can be invoked as follows:
...
cl_mem auxBuf = clCreateBuffer( context, CL_MEM_READ_WRITE, sizeof(void*), NULL, NULL );
void *gpuPtr;
clSetKernelArg( getterKernel, 0, sizeof(cl_mem), &myBuf );
clSetKernelArg( getterKernel, 1, sizeof(cl_mem), &auxBuf );
clEnqueueTask( commandQueue, getterKernel, 0, NULL, NULL );
clEnqueueReadBuffer( commandQueue, auxBuf, CL_TRUE, 0, sizeof(void*), &gpuPtr, 0, NULL, NULL );
clReleaseMemObject(auxBuf);
...
Now "gpuPtr" should contain the address of the beginning of "myBuf" in GPU memory space.
Is the solution obvious and I just can't see it? How can I get back a pointer to device memory when creating buffers?
It's because in the OpenCL model, host memory and device memory are disjoint. A pointer in device memory will have no meaning on the host.
You can map a device buffer to host memory using clEnqueueMapBuffer. The mapping will synchronize device to host, and unmapping will synchronize back host to device.
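A minimal sketch of that map/unmap round trip with the C API (queue and the buffer's size, here bufferSize, are assumed):

cl_int err;
// Map the device buffer into host address space (blocking call).
void* hostView = clEnqueueMapBuffer(queue, tree_d, CL_TRUE,
    CL_MAP_READ | CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &err);

/* ... read or fill the tree contents through hostView ... */

// Unmapping hands the (possibly modified) contents back to the device.
clEnqueueUnmapMemObject(queue, tree_d, hostView, 0, NULL, NULL);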
Update. As you explain in the comments, you want to send a tree structure to the GPU. One solution would be to store all tree nodes inside an array, replacing pointers to nodes with indices in the array.
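A hedged sketch of that index-based layout (the Node fields here are just illustrative): because the struct contains no pointers, the same bytes are valid on both host and device.

// Shared by host and kernel code: children are array indices, -1 means none.
typedef struct {
    float value;
    int   left;
    int   right;
} Node;

// Host side: build the tree into a Node array (nodes[0] as the root) and
// upload it with a single clCreateBuffer/clEnqueueWriteBuffer pair.
// Device side, traversal then needs no pointers at all:
//     int i = 0;                                   // start at the root
//     while (i != -1)
//         i = (key < nodes[i].value) ? nodes[i].left : nodes[i].right;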
As Eric pointed out, there are two sets of memory to consider: host memory and device memory. Basically, OpenCL tries to hide the gritty details of this interaction by introducing the buffer object for us to interact with in our program on the host side. Now, as you noted, the problem with this methodology is that it hides away the details of our device when we want to do something trickier than the OpenCL developers intended or allowed in their scope.

The solution here is to remember that OpenCL kernels use C99 and that the language allows us to access pointers without any issue. With this in mind, we can just demand the pointer be stored in an unsigned integer variable to be referenced later.
Your implementation was on the right track, but it needed a little bit more C syntax to finish up the transfer.
OpenCL Kernel:
// Kernel used to obtain a pointer from the target buffer
__kernel void mem_ptr(__global char * buffer, __global ulong * ptr)
{
    // The pointer-to-integer conversion must be an explicit cast in OpenCL C.
    ptr[0] = (ulong)&buffer[0];
}

// Kernel to demonstrate how to use that pointer again after we extract it.
__kernel void use_ptr(__global ulong * ptr)
{
    // The address space qualifier is required when casting the integer back.
    __global char * print_me = (__global char *)ptr[0];
    /* Code that uses all of our hard work */
    /* ... */
}
Host Program:
// Create the buffer that we want the device pointer from (target_buffer)
// and a place to store it (ptr_buffer).
cl_mem target_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                      MEM_SIZE * sizeof(char), NULL, &ret);
cl_mem ptr_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                   1 * sizeof(cl_ulong), NULL, &ret);
/* Setup the rest of our OpenCL program */
/* .... */
// Setup our kernel arguments from the host...
ret = clSetKernelArg(kernel_mem_ptr, 0, sizeof(cl_mem), (void *)&target_buffer);
ret = clSetKernelArg(kernel_mem_ptr, 1, sizeof(cl_mem), (void *)&ptr_buffer);
ret = clEnqueueTask(command_queue, kernel_mem_ptr, 0, NULL, NULL);
// Now it's just a matter of storing the pointer where we want to use it for later.
ret = clEnqueueCopyBuffer(command_queue, ptr_buffer, dst_buffer, 0,
                          1 * sizeof(cl_ulong), sizeof(cl_ulong), 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue, ptr_buffer, CL_TRUE, 0,
                          1 * sizeof(cl_ulong), buffer_ptrs, 0, NULL, NULL);
There you have it. Now, keep in mind that you don't have to use the char type I used; this works for any type. However, I'd recommend cl_ulong for storing the pointers. A uint would suffice on devices with less than 4 GB of accessible memory, but for devices with a larger address space you have to use cl_ulong. If you absolutely need to save space on a device whose memory is larger than 4 GB, you could use a struct that stores the lower 32 bits of the address in a uint, with the remaining MSBs stored in a smaller type.
I am getting some unusual behaviour in my OpenCL program.
In the host part of the program I create an array of doubles and set all elements to zero. That array is copied to the GPU using:
memObjects[4] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               sizeof(double) * I_numel, I, NULL);
Inside the kernel some elements are set to 1 depending on some condition and then I read it back to the host with:
errNum = clEnqueueReadBuffer(commandQueue, memObjects[4], CL_TRUE, 0,
                             I_numel * sizeof(double), I, 0, NULL, NULL);
However, some of the elements that were supposed to be zero have changed to very small (6.953267903e-310) or very large (2.0002319483e+161) numbers!?
I've tried changing double to float, but the results are similar. I am using the NVIDIA implementation of OpenCL, version 1.1. Does anyone know what the problem is?
I suspect there's something wrong with your kernel code. What happens if you do just the clEnqueueReadBuffer without running the kernel at all; do you then get all zeros? How about if you drop the CL_MEM_COPY_HOST_PTR and clear the buffer with clEnqueueWriteBuffer instead?
I tried to reproduce the issue with this simplified kernel, but the output was just alternating zeros and ones, as expected:
kernel void enqueueReadBuffer(global float* outputValueArray) {
int gid = get_global_id(0);
if (gid % 2 == 0) {
outputValueArray[gid] = 1.0f;
}
}
I ran this on three different OpenCL drivers on Windows 7, including NVIDIA Quadro FX4800 (R307.45), and got the correct result on all of them.
Try replacing the shown code with this and then post the error numbers:
cl_int err;
memObjects[4] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               sizeof(double) * I_numel, I, &err);
printf("Buffer creation error no = %d\n", err);
And for the copy back
cl_int err2;
err2 = clEnqueueReadBuffer(commandQueue, memObjects[4], CL_TRUE, 0,
                           I_numel * sizeof(double), I, 0, NULL, NULL);
printf("Copy back error no = %d\n", err2);