In my program, a lot of memory remains allocated after the calculations finish and is never cleared. I need to clear this buffer memory. Which command does this in C++?
Buffer myBuffer = Buffer(context, CL_MEM_READ_ONLY, count * sizeof(double));
queue.enqueueWriteBuffer(myBuffer, CL_TRUE, 0, count * sizeof(double), openF);
clEnqueueFillBuffer
But to be honest, I don't quite understand why you need this after the calculation. Clearing is usually done before a calculation; afterwards you simply release the buffer.
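A minimal sketch that zeroes the buffer from the snippet above (this needs OpenCL 1.2+; the ()-operator on the C++ wrapper objects returns the underlying C handles, and the wrapper also offers cl::CommandQueue::enqueueFillBuffer):
double zero = 0.0; // the pattern, repeated across the filled range
cl_int err = clEnqueueFillBuffer(queue(), myBuffer(), &zero, sizeof(zero),
                                 0, count * sizeof(double), // offset and size in bytes
                                 0, NULL, NULL);
And once the results are no longer needed at all, releasing the buffer (clReleaseMemObject, or simply letting the cl::Buffer go out of scope) frees the device memory outright.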
I just started building a code for parallel computation with OpenCL.
As far as I understand, the data generated on the CPU side (host) is transferred through buffers (clCreateBuffer -> clEnqueueWriteBuffer -> clSetKernelArg), then processed by the device.
I mainly have to deal with arrays (or matrices) of large size with double precision.
However, I realized the code never runs for arrays larger than 8000 entries; it fails with errors.
(This makes sense to me because 64 kB is equivalent to 8000 double-precision numbers.)
The error codes were either -6 (CL_OUT_OF_HOST_MEMORY) or -31 (CL_INVALID_VALUE).
One more thing: when I set the argument to a 2-dimensional array, I could set the size up to 8000 x 8000.
So far, my guess is that the maximum data size for double precision is 8000 elements (64 kB) for a 1D array, but I have no idea what happens for 2D or 3D arrays.
Is there any other way to transfer the data larger than 64kb?
If I did something wrong in my OpenCL setup for the data transfer, what would you recommend?
I appreciate your kind answer.
The hardware that I'm using is Tesla V100 which is installed on the HPC cluster in my school.
The following is the part of my code snippet where I'm testing the data transfer.
bfr(0) = clCreateBuffer(context,
& CL_MEM_READ_WRITE + CL_MEM_COPY_HOST_PTR,
& sizeof(a), c_loc(a), err);
if(err.ne.0) stop "Couldn't create a buffer";
err=clEnqueueWriteBuffer(queue,bfr(0),CL_TRUE,0_8,
& sizeof(a),c_loc(a),0,C_NULL_PTR,C_NULL_PTR)
err = clSetKernelArg(kernel, 0,
& sizeof(bfr(0)), C_LOC(bfr(0)))
print*, err
if(err.ne.0)then
print *, "clSetKernelArg kernel"
print*, err
stop
endif
The code was built in Fortran using the clfortran module.
Thank you again for your answer.
You can use much larger arrays in OpenCL, as large as the available memory allows. For example, I'm commonly working with linearized 4D arrays of 2 billion floats in a CFD application.
Arrays need to be 1D only; if you have 2D or 3D arrays, linearize them, for example with n = x + y*size_x for 2D->1D indexing. Some older devices only allow buffers 1/4 the size of device memory; however, modern devices typically have an extension to the OpenCL specification that enables larger buffers.
Here is a quick overview of what the OpenCL C bindings do:
clCreateBuffer allocates memory on the device side (video memory for GPUs, RAM for CPUs). Buffers can be as large as host/device memory allows or on some older devices 1/4 of device memory.
clEnqueueWriteBuffer copies memory over PCIe from RAM to video memory. Buffers must be allocated beforehand on both the CPU and GPU sides. There is no limit on transfer size; it can be as large as the entire buffer, or only a subrange of a buffer.
clSetKernelArg links the GPU buffers to the input parameters of the kernel, so the kernel knows which parameter corresponds to which buffer. Make sure the data types of the buffers and kernel arguments match, as you won't get an error if they don't. Also make sure the order of the kernel arguments matches.
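Put together, a minimal sketch of the three steps (assuming context, queue, kernel, and an element count n already exist; host_data stands in for your host array):
size_t bytes = n * sizeof(double); // buffer size in bytes
cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &err); // allocate device memory
err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host_data, 0, NULL, NULL); // copy host -> device
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf); // bind the buffer to kernel argument 0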
In your case there are several possible causes for the error:
Maybe you have an integer overflow during the computation of the array size. In that case, compute the array size/indices with 64-bit integers instead (see the sketch after this list).
You are out of memory because other buffers already take up too much of it. Do some bookkeeping to keep track of total (video) memory utilization.
You have selected the wrong device, for example integrated graphics instead of the dedicated GPU, in which case much less memory is available and you end up with cause 2.
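To illustrate the first cause (the sizes are made up): with 32-bit int arithmetic, the byte count of a 20000 x 20000 double matrix overflows, while promoting to 64-bit size_t first keeps it exact:
const int nx = 20000, ny = 20000;
size_t bytes = (size_t)nx * (size_t)ny * sizeof(double); // 3,200,000,000 bytes, correct
// int bad = nx * ny * (int)sizeof(double);              // overflows a 32-bit int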
To give you a more definitive answer, please provide some additional details:
What hardware do you use?
Show a code snippet of how you allocate device memory.
UPDATE
I see some errors in your code:
The length argument in clCreateBuffer and clEnqueueWriteBuffer must be the number of bytes in your array a. If a is of type double, this is a_length*sizeof(double), where a_length is the number of elements in a and sizeof(double) is 8, the number of bytes in one double. So the length argument is 8 bytes times the number of elements in the array.
For combining multiple flags, the convention is bitwise OR (| in C, ior() in Fortran) rather than +. It shouldn't be an issue here, since these flags don't share any bits, but it is unconventional.
You had "0_8" as buffer offset. This needs to be zero (0).
integer(8), parameter :: a_length = 8000
real(8), target :: a(a_length)
bfr(0) = clCreateBuffer(context,
& ior(CL_MEM_READ_WRITE, CL_MEM_COPY_HOST_PTR),
& a_length*8_8, c_loc(a), err) ! length in bytes: elements times 8 bytes per double
if(err.ne.0) stop "Couldn't create a buffer"
err = clEnqueueWriteBuffer(queue, bfr(0), CL_TRUE, 0_8,
& a_length*8_8, c_loc(a), 0, C_NULL_PTR, C_NULL_PTR)
err = clSetKernelArg(kernel, 0, sizeof(bfr(0)), C_LOC(bfr(0)))
print*, err
if(err.ne.0) then
print *, "clSetKernelArg kernel"
stop
endif
I am trying to calculate the throughput of my kernel, which is written in OpenCL, but I am not sure how to do that. I found a file generated after compilation that shows a throughput of 0.435 (in the .attrb file), but I'm not sure what that means. Is there any other way to find the throughput?
The throughput of a kernel in OpenCL is calculated as:
(NumReadBytes + NumWriteBytes) / ElapsedTime
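For example (numbers made up for illustration): a kernel that reads 2 GB and writes 1 GB in 50 ms has a throughput of (2 GB + 1 GB) / 0.05 s = 60 GB/s.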
For measuring the time, use a cl_event. Note that the command queue must be created with the CL_QUEUE_PROFILING_ENABLE property, otherwise the profiling info below is not available.
double getDuration(cl_event event)
{
    cl_ulong start_time, end_time;
    clWaitForEvents(1, &event); // profiling info is only valid once the command has completed
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                            sizeof(cl_ulong), &start_time, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                            sizeof(cl_ulong), &end_time, NULL);
    return (end_time - start_time) * 1e-6; // timestamps are in nanoseconds; result is milliseconds
}
cl_event timer;
int ret = clEnqueueNDRangeKernel(cq, kernel, 1, p_global_work_offset, &global_work_size,
                                 &local_work_size, 0, NULL, &timer); // timer records this kernel's start/end
printf("G:%zu L:%zu T:%fms\n", global_work_size, local_work_size, getDuration(timer));
This is a very vague question.
Do you mean only the kernel without loading the data?
What does the kernel do? On what kind of hardware are you running it? How is your data organized? How do you manage your buffers?
Is everything in global memory? Are you accounting for latencies as well? Do you need to maximize the throughput? Are you optimizing for specific hardware?
Many questions arise for me.
I hope someone can help me with this.
I need to pass a long array representing a matrix to an OpenCL kernel using something like this:
memObjects[2] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
sizeof(double) * dets_numel, dets, NULL);
Inside the kernel I would like to remove some rows of the matrix depending on some condition and then read it back to the host using something like:
errNum = clEnqueueReadBuffer(commandQueue, memObjects[2], CL_TRUE, 0,
dims1[0] * dims1[1] * sizeof(double), dets,
0, NULL, NULL);
Is there a way to let the host part of the program know the exact size of the array (matrix) without executing another kernel that will compute the size and read the result from the buffer back to the host?
Maybe there is a workaround specific to your problem, but in general: no. You either find out the new size implicitly, or you read back an explicit value.
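One common way to do the latter (a sketch with hypothetical names, not code from the question): have the kernel compact the surviving rows through an atomic counter, then read that counter back alongside the data. Note that this does not preserve row order.
__kernel void filter_rows(__global const double* dets_in,
                          __global double* dets_out,
                          __global int* out_count, // must be zeroed by the host before launch
                          const int n_rows, const int row_len)
{
    int r = get_global_id(0);
    if (r >= n_rows) return;
    if (dets_in[r * row_len] > 0.0) { // hypothetical keep-condition
        int slot = atomic_inc(out_count); // reserve an output row
        for (int c = 0; c < row_len; c++)
            dets_out[slot * row_len + c] = dets_in[r * row_len + c];
    }
}
On the host, read the 4-byte counter first and then only the surviving rows:
int kept = 0;
clEnqueueReadBuffer(commandQueue, countBuf, CL_TRUE, 0, sizeof(int), &kept, 0, NULL, NULL);
clEnqueueReadBuffer(commandQueue, outBuf, CL_TRUE, 0, (size_t)kept * row_len * sizeof(double), dets, 0, NULL, NULL);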
I think you can't change the size of the allocated device memory, but you can write to just part of it and read that part back to the host:
For the first part, you should map your kernel a bit differently, according to what you want.
For the second part, try using the clEnqueueReadBufferRect() function (see the sketch below).
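A sketch of clEnqueueReadBufferRect() for that second part (names here are hypothetical): reading an h-row by w-column sub-block of a row-major matrix with n_cols doubles per row. Note that x-extents are given in bytes and y-extents in rows.
size_t buffer_origin[3] = { col0 * sizeof(double), row0, 0 }; // start of the sub-block in the buffer
size_t host_origin[3]   = { 0, 0, 0 };                        // start of the destination in host memory
size_t region[3]        = { w * sizeof(double), h, 1 };       // width in bytes, height in rows
errNum = clEnqueueReadBufferRect(commandQueue, memObjects[2], CL_TRUE,
             buffer_origin, host_origin, region,
             n_cols * sizeof(double), 0, // buffer row pitch and slice pitch
             w * sizeof(double), 0,      // host row pitch and slice pitch
             hostBlock, 0, NULL, NULL);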
Hello everyone,
I am new to OpenCL and trying to explore more about it.
What does local_work_size do in an OpenCL program, and how does it affect performance?
I am working on an image-processing algorithm, and for my OpenCL kernel I set:
size_t local_item_size = 1;
size_t global_item_size = (int) (ceil((float)(D_can_width*D_can_height)/local_item_size))*local_item_size; // Process the entire lists
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,&global_item_size, &local_item_size, 0, NULL, NULL);
and for the same kernel, when I changed it to
size_t local_item_size = 16;
keeping everything else the same,
I got around 4-5 times faster performance.
The local-work-size, aka work-group-size, is the number of work-items in each work-group.
Each work-group is executed on a compute-unit which is able to handle a bunch of work-items, not only one.
So when you use groups that are too small, you waste some computing power and only get coarse parallelization at the compute-unit level.
But if you have too many work-items in a group, you can also lose opportunities for parallelization, as some compute-units may be left unused while others are overused.
So you could test with many values to find the best one, or just let OpenCL pick a good one for you by passing NULL as the local work size, as in the sketch below.
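For that option, a sketch using the variables from the question:
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, NULL, 0, NULL, NULL); // NULL local size: the runtime decides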
PS: I'd be interested in knowing the performance with OpenCL's choice compared to your previous values, so could you please run that test and post the results?
Thanks :)
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Two ways to create a buffer object in opencl: clCreateBuffer vs. clCreateBuffer + clEnqueueWriteBuffer
What is the difference between copying data to the device immediately upon buffer creation vs. later?
ie.
cl_mem memObj = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               size, dataPtr, NULL);
or
cl_mem memObj = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, NULL);
clEnqueueWriteBuffer(commandQueue, memObj, CL_TRUE, 0, size, dataPtr, 0, NULL, NULL);
I'm brand new to OpenCL, so I'm just trying to figure things out, i.e. which method is best to use.
Thanks!
The whole point of the create/enqueue split (in general, not just in OpenCL) is that once you create a buffer, you can write to it later, after you compute what you want to write, and you can write to it an arbitrary number of times. There's no functional difference between initializing a buffer with data at creation time and creating a buffer and then writing the data into it. Furthermore, any performance difference should be negligible, as the OpenCL runtime is free to optimize either path.