OpenCL write to buffer choices [duplicate]

This question already has answers here (closed 11 years ago).
Possible duplicate: Two ways to create a buffer object in OpenCL: clCreateBuffer vs. clCreateBuffer + clEnqueueWriteBuffer
What is the difference between copying data to the device immediately upon buffer creation vs. later?
i.e.
cl_mem memObj = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size, dataPtr, NULL);
or
cl_mem memObj = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, NULL);
clEnqueueWriteBuffer( commandQueue, memObj, CL_TRUE, 0, size, dataPtr, 0, NULL, NULL);
I'm brand new to OpenCL, so I'm just trying to figure things out, i.e. which method is best to use.
Thanks!

The whole point of the create/enqueue split (in general, not just in OpenCL) is that once you create a buffer, you can write to it later, after you have computed what you want to write, and you can write to it an arbitrary number of times. There is no functional difference between creating a buffer that is initialized with data and creating a buffer and then writing the data into it. Furthermore, any performance difference is implementation-dependent: with CL_MEM_COPY_HOST_PTR the driver is free to schedule the copy however it likes, so in practice the two approaches usually perform similarly.

Related

Code never runs for arrays larger than 8000 entries with errors

I just started writing code for parallel computation with OpenCL.
As far as I understand, data generated on the CPU side (the host) is transferred through buffers (clCreateBuffer -> clEnqueueWriteBuffer -> clSetKernelArg) and then processed by the device.
I mainly have to deal with arrays (or matrices) of large size with double precision.
However, I realized that the code always fails with errors for arrays larger than 8000 entries.
(This makes sense, because 8000 double-precision numbers are 64 kB.)
The error codes were either -6 (CL_OUT_OF_HOST_MEMORY) or -30 (CL_INVALID_VALUE).
One more thing: when I set the argument to a 2-dimensional array, I could set the size up to 8000 x 8000.
So far, my guess is that the maximum data size for double precision is 8000 elements (64 kB) for 1D arrays, but I have no idea what happens for 2D or 3D arrays.
Is there another way to transfer data larger than 64 kB?
If I did something wrong in the OpenCL setup for the data transfer, what would you recommend?
I appreciate your kind answer.
The hardware that I'm using is Tesla V100 which is installed on the HPC cluster in my school.
The following is part of the code snippet with which I'm testing the data transfer.
bfr(0) = clCreateBuffer(context,
& CL_MEM_READ_WRITE + CL_MEM_COPY_HOST_PTR,
& sizeof(a), c_loc(a), err);
if(err.ne.0) stop "Couldn't create a buffer";
err=clEnqueueWriteBuffer(queue,bfr(0),CL_TRUE,0_8,
& sizeof(a),c_loc(a),0,C_NULL_PTR,C_NULL_PTR)
err = clSetKernelArg(kernel, 0,
& sizeof(bfr(0)), C_LOC(bfr(0)))
print*, err
if(err.ne.0)then
print *, "clSetKernelArg kernel"
print*, err
stop
endif
The code was built with Fortran using the clfortran module.
Thank you again for your answer.
You can use much larger arrays in OpenCL, as large as the available memory. For example, I commonly work with linearized 4D arrays of 2 billion floats in a CFD application.
Buffers are 1D only; if you have 2D or 3D data, linearize it, for example with n = x + y*size_x for 2D -> 1D indexing. Some older devices only allow single buffers 1/4 the size of the device memory; however, modern devices typically have an extension to the OpenCL specification that enables larger buffers.
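The linearization the answer describes is plain index arithmetic; here is a minimal sketch (the helper names are mine, not from the answer):

```c
#include <stddef.h>

/* 2D -> 1D index: n = x + y*size_x, as suggested above. */
static size_t idx2d(size_t x, size_t y, size_t size_x) {
    return x + y * size_x;
}

/* 3D -> 1D index: n = x + size_x*(y + size_y*z). */
static size_t idx3d(size_t x, size_t y, size_t z,
                    size_t size_x, size_t size_y) {
    return x + size_x * (y + size_y * z);
}
```

The same formula is then used on the device side to address the flat buffer inside the kernel.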
Here is a quick overview of what the OpenCL C bindings do:
clCreateBuffer allocates memory on the device side (video memory for GPUs, RAM for CPUs). Buffers can be as large as host/device memory allows, or on some older devices 1/4 of the device memory.
clEnqueueWriteBuffer copies memory over PCIe from RAM to video memory. The buffer must be allocated beforehand on both the host and the device side. There is no limit on the transfer size; it can cover the entire buffer or only a subrange of it.
clSetKernelArg links the GPU buffers to the input parameters of the kernel, so the runtime knows which kernel parameter corresponds to which buffer. Make sure the data types of the buffers and the kernel arguments match, as you won't get an error if they don't. Also make sure the order of the kernel arguments matches.
In your case there are several possible causes for the error:
1. Integer overflow during the computation of the array size. In this case, use 64-bit integers to compute the array size/indices.
2. You are out of memory because other buffers already take up too much of it. Do some bookkeeping to keep track of total (video) memory utilization.
3. You have selected the wrong device, for example the integrated graphics instead of the dedicated GPU, in which case much less memory is available and you end up with cause 2.
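The integer-overflow cause is easy to reproduce with plain arithmetic; a minimal demonstration (hypothetical helpers, not from the question's code):

```c
#include <stdint.h>

/* Byte size of an n x n matrix of doubles (8 bytes each).
 * In 32-bit arithmetic the product wraps around modulo 2^32 for
 * large n; in 64-bit arithmetic it does not. */
static uint32_t matrix_bytes_32(uint32_t n) {
    return n * n * 8u;    /* wraps for n >= ~23171 */
}
static uint64_t matrix_bytes_64(uint64_t n) {
    return n * n * 8ull;  /* correct for any realistic n */
}
```

A 30000 x 30000 matrix of doubles is 7,200,000,000 bytes; the 32-bit version silently returns a much smaller (wrapped) value, which is exactly the kind of bogus size argument that triggers CL_INVALID_VALUE.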
To give you a more definitive answer, please provide some additional details:
What hardware do you use?
Show a code snippet of how you allocate device memory.
UPDATE
I see some errors in your code:
The length argument of clCreateBuffer and clEnqueueWriteBuffer is the number of bytes in your array a. If a is of type double, this is a_length*sizeof(double), where a_length is the number of elements in a. sizeof(double) is the number of bytes in one double, which is 8. So the length argument is 8 bytes times the number of elements in the array.
For combining multiple flags, you typically use a bitwise OR instead of +: in C that is the | operator, in Fortran ior(). With these particular flags the result happens to be the same, but + is unconventional.
You had "0_8" as the buffer offset. Make sure this is zero and that its integer kind matches what clfortran expects for the offset argument.
integer, parameter :: a_length = 8000
real(8), target :: a(a_length)
bfr(0) = clCreateBuffer(context,
& ior(CL_MEM_READ_WRITE, CL_MEM_COPY_HOST_PTR),
& int(a_length, 8)*8_8, c_loc(a), err)
if(err.ne.0) stop "Couldn't create a buffer"
err = clEnqueueWriteBuffer(queue, bfr(0), CL_TRUE, 0_8,
& int(a_length, 8)*8_8, c_loc(a), 0, C_NULL_PTR, C_NULL_PTR)
err = clSetKernelArg(kernel, 0, sizeof(bfr(0)), C_LOC(bfr(0)))
print*, err
if(err.ne.0) then
print *, "clSetKernelArg kernel"
stop
endif
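As a quick sanity check on the length argument described above (a plain C helper, not part of clfortran):

```c
#include <stddef.h>

/* Length argument for clCreateBuffer/clEnqueueWriteBuffer:
 * the byte count is element count times element size. */
static size_t buffer_bytes(size_t n_elements, size_t element_size) {
    return n_elements * element_size;
}
```

For 8000 doubles this gives 8000 * 8 = 64000 bytes, matching the ~64 kB limit the question ran into.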

How to clear buffers in OpenCL 1.2 c++

In my program, after the calculations, a lot of memory is left uncleared and just sits there. I need to clear the buffer memory. What command does this in C++?
Buffer myBuffer = Buffer(context, CL_MEM_READ_ONLY, count * sizeof(double));
queue.enqueueWriteBuffer(myBuffer, CL_TRUE, 0, count * sizeof(double), openF);
clEnqueueFillBuffer
But to be honest, I don't quite understand why you need it after the calculation. It is usually done before a calculation; after it, you just release the buffer.

OpenCL Sub Buffer Host pointer

I have created a buffer with the attributes CL_MEM_READ_WRITE and CL_MEM_ALLOC_HOST_PTR and enqueued it to GPU kernels. The GPU kernels process the given input and fill these buffers; during this process the CPU is made to wait. I have modified this design by partitioning the buffer into three uniform sections using sub-buffers. Now, once the GPU has filled one sub-buffer, the CPU can start processing. This reduces the CPU wait to one sub-buffer as opposed to one full frame of processing.
The problem I am facing is that the mapped (CPU-side) pointers of the sub-buffers and the buffer are strange. The mapped pointer of the first sub-buffer and of the buffer are the same, which is fine. But the mapped pointer of the second sub-buffer is not equal to the mapped pointer of the buffer plus the offset of the second sub-buffer. I tried this on integrated GPUs (Intel HD Graphics 4000) and it worked fine, but when I run it on a dedicated graphics card (Nvidia Zotac) I see this problem. Have you encountered such a scenario before? Can you provide some pointers on where to look to fix this?
typedef struct opencl_buffer {
cl_mem opencl_mem;
void *mapped_pointer;
int size;
}opencl_buffer;
// alloc gpu output buffers
opencl->opencl_mem = clCreateBuffer(
opencl->context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
3 * alloc_size, NULL, &status);
if (status != CL_SUCCESS)
goto fail;
// create output sub buffers
for (sub_idx = 0; sub_idx < 3; ++sub_idx) {
cl_buffer_region sf_region;
SubFrameInfo subframe;
sf_region.origin = alloc_size * sub_idx;
sf_region.size = alloc_size;
opencl->gpu_output_sub_buf[sub_idx].size = sf_region.size;
opencl->gpu_output_sub_buf[sub_idx].opencl_mem =
clCreateSubBuffer(opencl->opencl_mem,
CL_MEM_READ_WRITE,
CL_BUFFER_CREATE_TYPE_REGION,
&sf_region, &status);
if (status != CL_SUCCESS)
goto fail;
}
Now, when I map gpu_output_sub_buf[0].opencl_mem and gpu_output_sub_buf[1].opencl_mem, the difference between the CPU-side pointers is expected to be alloc_size (assuming char pointers). This happens to be the case on Intel HD Graphics, but the Nvidia platform gives a different result.
There is no specification-based reason a mapped sub-buffer should be at an address that is a known offset from the mapped main buffer (or from a mapped sub-buffer that aligns with it). Mapping only creates a range of host memory that you can use; you then unmap to get the data back on the device. The mapping doesn't even have to be at the same address each time.
Of course OpenCL 2.0 SVM changes all this, but you didn't say you're using SVM, and NVIDIA doesn't support OpenCL 2.0 today anyway.

Changing the size of the array in an OpenCL kernel

I hope someone can help me with this.
I need to pass a long array representing a matrix to an OpenCL kernel, using something like this:
memObjects[2] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
sizeof(double) * dets_numel, dets, NULL);
Inside the kernel I would like to remove some rows of the matrix, depending on some condition, and then read the matrix back to the host using something like:
errNum = clEnqueueReadBuffer(commandQueue, memObjects[2], CL_TRUE, 0,
dims1[0] * dims1[1] * sizeof(double), dets,
0, NULL, NULL);
Is there a way to let the host part of the program know the exact size of the array (matrix) without executing another kernel that will compute the size and read the result from the buffer back to the host?
Maybe there is a workaround for your specific problem, but in general: no. You either infer the new size implicitly, or read back an explicit value (for example, a count written by the kernel).
I think you can't change the size of the allocated device memory, but you can write to just part of it and read that part back to the host:
For the first part, you should use a somewhat different mapping in your kernel, according to what you want.
For the second part, try using the clEnqueueReadBufferRect() function.
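The "read back explicit value" approach usually means the kernel compacts surviving rows into the front of the buffer while counting them (on the device the counter would be a global int incremented with atomic_inc). A host-side sketch of that compaction logic in plain C, with hypothetical names and a stand-in condition:

```c
#include <stddef.h>

/* Compact a rows x cols matrix, keeping only rows whose first element
 * is non-negative (a stand-in for "some condition").  Returns the
 * number of surviving rows; in an OpenCL kernel the same counter would
 * be maintained with atomic_inc, and the host would read back both the
 * counter and the compacted rows. */
static size_t compact_rows(const double *in, size_t rows, size_t cols,
                           double *out) {
    size_t kept = 0;
    for (size_t r = 0; r < rows; ++r) {
        if (in[r * cols] >= 0.0) {           /* condition on this row */
            for (size_t c = 0; c < cols; ++c)
                out[kept * cols + c] = in[r * cols + c];
            ++kept;                          /* device: atomic_inc(&counter) */
        }
    }
    return kept;
}
```

The host then reads back only kept * cols * sizeof(double) bytes, which is exactly the "read that part back" suggestion in the second answer.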

Effect of local_work_size on performance, and why

Hello everyone,
I am new to OpenCL and trying to explore it further.
What does local_work_size do in an OpenCL program, and how does it affect performance?
I am working on an image-processing algorithm, and for my OpenCL kernel I set:
size_t local_item_size = 1;
size_t global_item_size = (int) (ceil((float)(D_can_width*D_can_height)/local_item_size))*local_item_size; // Process the entire lists
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,&global_item_size, &local_item_size, 0, NULL, NULL);
and for the same kernel, when I changed it to
size_t local_item_size = 16;
keeping everything else the same,
I got around 4-5 times faster performance.
The local-work-size, aka work-group-size, is the number of work-items in each work-group.
Each work-group is executed on a compute-unit which is able to handle a bunch of work-items, not only one.
So when you use work-groups that are too small, you waste some computing power and only get a coarse parallelization at the compute-unit level.
But if you have too many work-items in a group, you can also lose some opportunity for parallelism, as some compute-units may go unused while others are overused.
So you can either test many values to find the best one, or let OpenCL pick a good one for you by passing NULL as the local work size.
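The global-size rounding the question's code does with a floating-point ceil can also be written in integer arithmetic (a hypothetical helper, not from the question):

```c
#include <stddef.h>

/* Round total_items up to the next multiple of local_size, so that
 * the global size is evenly divisible by the local size, as
 * clEnqueueNDRangeKernel requires in OpenCL 1.x when an explicit
 * local work size is given. */
static size_t round_up_global(size_t total_items, size_t local_size) {
    return ((total_items + local_size - 1) / local_size) * local_size;
}
```

The kernel then guards against the padded tail, e.g. with `if (get_global_id(0) >= total_items) return;`.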
PS: I'd be interested to know the performance with OpenCL's choice compared to your previous values, so could you please run that test and post the results?
Thanks :)
