I hope someone can help me with this.
I need to pass a long array representing a matrix to an OpenCL kernel, using something like this:
memObjects[2] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               sizeof(double) * dets_numel, dets, NULL);
Inside the kernel I would like to remove some rows of the matrix, depending on some condition, and then read the result back to the host using something like:
errNum = clEnqueueReadBuffer(commandQueue, memObjects[2], CL_TRUE, 0,
                             dims1[0] * dims1[1] * sizeof(double), dets,
                             0, NULL, NULL);
Is there a way to let the host part of the program know the exact size of the array (matrix) without executing another kernel that will compute the size and read the result from the buffer back to the host?
Maybe there is a workaround for your specific problem, but in general, no: you either determine the new size implicitly, or you read an explicit value back.
I don't think you can change the size of the allocated device memory, but you can write to just part of it and read only that part back to the host.
For the first part, you will need a slightly different index mapping in your kernel, according to which rows you want to keep.
For the second part, try using the clEnqueueReadBufferRect() function.
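If you do go the explicit route from the first answer, it doesn't necessarily need a separate counting kernel: the filtering kernel itself can count the rows it keeps in a small counter buffer, and the host reads those few bytes back first. A rough sketch with one work-item per row; keep_row() is a hypothetical stand-in for whatever your condition is:
__kernel void filter_rows(__global const double* dets, __global double* out,
                          const int row_len, __global int* kept_count)
{
    int row = get_global_id(0);
    // keep_row() stands in for your row-removal condition
    if (keep_row(&dets[row * row_len], row_len)) {
        int dst = atomic_inc(kept_count);  // reserve the next free output row
        for (int c = 0; c < row_len; c++)
            out[dst * row_len + c] = dets[row * row_len + c];
    }
}
The host initializes the counter buffer to zero before launch, reads back sizeof(int) from it with clEnqueueReadBuffer, and then reads exactly kept * row_len * sizeof(double) bytes of the compacted output. Note that the row order in the output is not preserved, since the atomic hands out slots in completion order.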
I just started building a code for parallel computation with OpenCL.
As far as I understand, data generated on the CPU side (host) is transferred through buffers (clCreateBuffer -> clEnqueueWriteBuffer -> clSetKernelArg) and then processed by the device.
I mainly have to deal with arrays (or matrices) of large size with double precision.
However, I realized the code fails with errors for arrays larger than 8000 entries.
(This makes sense insofar as 64 kB is equivalent to 8000 double-precision numbers.)
The error codes were either -6 (CL_OUT_OF_HOST_MEMORY) or -30 (CL_INVALID_VALUE).
One more thing: when I set the argument to a 2-dimensional array, I could set the size up to 8000 x 8000.
So far, I guess the maximum data size for double precision is 8000 elements (64 kB) for 1D arrays, but I have no idea what happens for 2D or 3D arrays.
Is there any other way to transfer data larger than 64 kB?
If I did something wrong in my OpenCL setup for the data transfer, what would you recommend?
I appreciate your kind answer.
The hardware that I'm using is a Tesla V100, which is installed on the HPC cluster at my school.
The following is part of the code snippet where I'm testing the data transfer.
bfr(0) = clCreateBuffer(context,
& CL_MEM_READ_WRITE + CL_MEM_COPY_HOST_PTR,
& sizeof(a), c_loc(a), err);
if(err.ne.0) stop "Couldn't create a buffer";
err=clEnqueueWriteBuffer(queue,bfr(0),CL_TRUE,0_8,
& sizeof(a),c_loc(a),0,C_NULL_PTR,C_NULL_PTR)
err = clSetKernelArg(kernel, 0,
& sizeof(bfr(0)), C_LOC(bfr(0)))
print*, err
if(err.ne.0)then
print *, "clSetKernelArg kernel"
print*, err
stop
endif
The code was built in Fortran, using the clfortran module.
Thank you again for your answer.
You can use much larger arrays in OpenCL, as large as the available memory allows. For example, I commonly work with linearized 4D arrays of 2 billion floats in a CFD application.
Arrays need to be 1D only; if you have 2D or 3D arrays, linearize them, for example with n=x+y*size_x for 2D->1D indexing. Some older devices only allow buffers up to 1/4 the size of the device memory, but modern devices typically have an extension to the OpenCL specification that enables larger buffers.
Here is a quick overview of what the OpenCL C bindings do:
clCreateBuffer allocates memory on the device side (video memory for GPUs, RAM for CPUs). Buffers can be as large as host/device memory allows or on some older devices 1/4 of device memory.
clEnqueueWriteBuffer copies memory over PCIe from RAM to video memory. The buffer must be allocated beforehand on both the host and device side. There is no limit on transfer size; you can copy the entire buffer or only a subrange of it.
clSetKernelArg links the GPU buffers to the input parameters of the kernel, so it knows which kernel argument corresponds to which buffer. Make sure the data types of the buffers and kernel arguments match, as you won't get an error if they don't. Also make sure the order of kernel arguments matches.
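Put together, a minimal host-side sketch in C for an N-element double array (N and host_ptr are placeholders; error checking abbreviated):
size_t bytes = (size_t)N * sizeof(double);  /* size in bytes, computed in 64 bits */
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &err);
err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host_ptr, 0, NULL, NULL);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);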
In your case there are several possible causes for the error:
Maybe you have an integer overflow during the computation of the array size. In this case, use 64-bit integers to compute the array size/indices (see the sketch after this list).
You are out of memory because other buffers already take up too much memory. Do some bookkeeping to keep track of total (video) memory utilization.
You have selected the wrong device, for example integrated graphics instead of the dedicated GPU, in which case much less memory is available and you end up with cause 2.
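To illustrate the first cause: in C, the multiplication happens in 32-bit int before any conversion to size_t, so a large array size silently overflows:
const int nx = 50000, ny = 50000;
size_t bytes_bad  = nx * ny * sizeof(double);          /* nx*ny overflows 32-bit int first */
size_t bytes_good = (size_t)nx * ny * sizeof(double);  /* promoted to 64 bits: 20 GB */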
To give you a more definitive answer, please provide some additional details:
What hardware do you use?
Show a code snippet of how you allocate device memory.
UPDATE
I see some errors in your code:
The length argument in clCreateBuffer and clEnqueueWriteBuffer expects the size of your array a in bytes. If a is of type double, this is a_length*sizeof(double), where a_length is the number of elements in a. sizeof(double) is the number of bytes in one double, which is 8, so the length argument is 8 times the number of elements in the array.
For multiple flags, you typically use a bitwise OR (| in C, ior() in Fortran) instead of +. This shouldn't be an issue here, since the flags don't share bits, but it is unconventional.
You had "0_8" as buffer offset. This needs to be zero (0).
Corrected, in consistent Fortran:
      integer(c_size_t), parameter :: a_length = 8000
      real(c_double), target :: a(a_length)
      bfr(0) = clCreateBuffer(context,
     & ior(CL_MEM_READ_WRITE, CL_MEM_COPY_HOST_PTR),
     & a_length*8_8, c_loc(a), err)  ! length in bytes: elements * 8
      if(err.ne.0) stop "Couldn't create a buffer"
      err = clEnqueueWriteBuffer(queue, bfr(0), CL_TRUE, 0_8,
     & a_length*8_8, c_loc(a), 0, C_NULL_PTR, C_NULL_PTR)
      err = clSetKernelArg(kernel, 0, sizeof(bfr(0)), C_LOC(bfr(0)))
      print*, err
      if(err.ne.0) then
        print *, "clSetKernelArg kernel"
        print*, err
        stop
      endif
Is there a way of creating a RAWSXP vector that is backed by an existing C char* ptr?
Below I show my current working version, which needs to reallocate and copy the bytes, and a second, imagined version that doesn't exist.
// My current slow solution that uses lots of memory
SEXP getData() {
// has size, and data
Response resp = expensive_call();
//COPY OVER BYTE BY BYTE
SEXP respVec = Rf_allocVector(RAWSXP, resp.size);
Rbyte* ptr = RAW(respVec);
memcpy(ptr, resp.data, resp.size);
// free the memory
free(resp.data);
return respVec;
}
// My imagined solution
SEXP getDataFast() {
// has size, and data
Response resp = expensive_call();
// reuse the ptr
SEXP respVec = Rf_allocVectorViaPtr(RAWSXP, resp.data, resp.size);
return respVec;
}
I also noticed Rf_allocVector3, which seems to give control over the vector's memory allocation, but I couldn't get it to work. This is my first time writing an R extension, so I imagine I must be doing something stupid. I'm trying to avoid the copy as the data will be around a GB (very large, though sparse, matrices).
Copying 1 GB takes well under a second. If your call is expensive, the copy may be a marginal cost; profile to see whether it is really a bottleneck.
The way you are trying to do things is probably not possible, because how would R know how to garbage collect the data?
But assuming you are using STL containers, one neat trick I've recently seen is to use the second template argument of STL containers -- the allocator.
template<
class T,
class Allocator = std::allocator<T>
> class vector;
The general outline of the strategy is like this:
Create a custom allocator using R-memory that meets all the requirements (essentially you just need allocate and deallocate)
Every time you need to return data to R from an STL container, make sure you initialize it with your custom allocator
On returning the data, pull out the underlying R data created by your R-memory allocator -- no copy
This approach gives you all the flexibility of STL containers while using only memory R is aware of.
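A minimal sketch of such an allocator, assuming the standard R C API (Rf_allocVector, R_PreserveObject, R_ReleaseObject); the RAllocator name and the registry bookkeeping are my own additions so that deallocate can find the backing SEXP again:
#include <cstddef>
#include <map>
#include <Rinternals.h>

template <typename T>
struct RAllocator {
    using value_type = T;

    // Shared registry mapping each allocation back to its backing SEXP.
    static std::map<void*, SEXP>& registry() {
        static std::map<void*, SEXP> m;
        return m;
    }

    RAllocator() = default;
    template <typename U> RAllocator(const RAllocator<U>&) {}

    T* allocate(std::size_t n) {
        SEXP s = Rf_allocVector(RAWSXP, (R_xlen_t)(n * sizeof(T)));
        R_PreserveObject(s);               // keep it alive across GCs
        void* p = RAW(s);
        registry()[p] = s;
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) {
        auto it = registry().find(p);
        if (it != registry().end()) {
            R_ReleaseObject(it->second);   // hand the memory back to R's GC
            registry().erase(it);
        }
    }
};

template <typename T, typename U>
bool operator==(const RAllocator<T>&, const RAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const RAllocator<T>&, const RAllocator<U>&) { return false; }
On return you would look up the SEXP for buf.data() in the registry (and preserve it again, or copy the handle out, before the container destructs and releases it). The bookkeeping is the price for keeping all allocation inside memory R is aware of.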
I have the following OpenCL kernel, which copies values from one buffer to another, optionally inverting the value (the 'invert' arg can be 1 or -1):-
__kernel void extraction(__global const short* src_buff, __global short* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
The source buffer contains one or more "records", each containing N (record_len) short values. All records in the buffer are of equal length, and record_len is always a multiple of 32.
The global size is 2D (number of records in the buffer, record length), and I chose this as it seemed to make best use of the GPU parallel processing, with each thread being responsible for copying just one value in one record in the buffer.
(The local work size is set to NULL by the way, allowing OpenCL to determine the value itself).
After reading about vectors recently, I was wondering if I could use these to improve on the performance? I understand the concept of vectors but I'm not sure how to use them in practice, partly due to lack of good examples.
I'm sure the kernel's performance is pretty reasonable already, so this is mainly out of curiosity to see what difference it would make using vectors (or other more suitable approaches).
At the risk of being a bit naive here, could I simply change the two buffer arg types to short16, and change the second value in the 2-D global size from "record length" to "record length / 16"? Would this result in each kernel thread copying a block of 16 short values between the buffers?
Your naive assumption is basically correct, though you may want to add a hint to the compiler that this kernel is optimized for the vector type (Section 6.7.2 of the spec). In your case, you would add
__attribute__((vec_type_hint(short16)))
above your kernel function. So in your example, you would have
__attribute__((vec_type_hint(short16)))
__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
You are correct that your 2nd global dimension should be divided by 16, and your record_len argument should also be divided by 16. Also, if you were to specify the local size instead of passing NULL, you would want to divide that by 16 as well.
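On the host side, that would look roughly like this (a sketch; num_records and the argument index are placeholders):
int record_len_vec = record_len / 16;  /* record_len is a multiple of 32, so this is exact */
size_t global[2] = { num_records, (size_t)record_len_vec };
clSetKernelArg(kernel, 2, sizeof(int), &record_len_vec);
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);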
There are some other things to consider though.
You might think choosing the largest vector size should give the best performance, especially with such a simple kernel, but in my experience it rarely is the optimal size. You can try asking clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, but for me this is rarely accurate (it may also return 1, meaning the compiler will try auto-vectorization or the device has no vector hardware). It is best to try different vector sizes and benchmark which is fastest.
If your device supports auto-vectorization and you want to give it a go, it may help to remove your record_len parameter and replace it with get_global_size(1), so the compiler/driver can take care of dividing record_len by whatever vector size it picks. I would recommend doing this anyway, assuming record_len equals the global size you gave that dimension; a sketch follows.
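For example, the scalar kernel with record_len derived from the NDRange instead of passed in (a sketch of the suggestion above):
__kernel void extraction(__global const short* src_buff,
                         __global short* dest_buff,
                         const int invert)
{
    int i = get_global_id(0);             // record index
    int j = get_global_id(1);             // value index within record
    int record_len = get_global_size(1);  // one work-item per value
    dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}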
Also, you gave NULL to the local size argument so that the implementation picks a size automatically. It is guaranteed to pick a size that works, but it will not necessarily pick the most optimal size.
Lastly, for general OpenCL optimizations, you may want to take a look at the NVIDIA OpenCL Best Practices Guide for NVIDIA hardware, or the AMD APP SDK OpenCL User Guide for AMD GPUs. The NVIDIA one is from 2009, and I'm not sure how much their hardware has changed since. Note, though, that it actually says:
The CUDA architecture is a scalar architecture. Therefore, there is no performance
benefit from using vector types and instructions. These should only be used for
convenience.
Older AMD hardware (pre-GCN) benefited from using vector types, but AMD suggests not using them on GCN devices (see mogu's comment). Also if you are targeting a CPU, it will use AVX hardware if available.
I have a list of vectors that I need to transform by matrices on the CPU. I am storing these as a dynamically allocated array (Eigen::Vector4f*). Once they have been transformed, I need to run an OpenCL kernel on the vectors. I'm wondering what the best way is to pass this data into OpenCL without having to copy it from Eigen::Vector to a float array, as this would be fairly costly. My understanding is that Eigen internally stores the vector values in some sort of buffer I can access?
There are many ways.
1 - The best is probably to use a Matrix4Xf, because it lets you work on the whole set of vectors at once:
Matrix4Xf vecs(4,n);
Matrix4f transform;
vecs = transform * vecs;
vecs.row(1) // read-write access to all y components
vecs.col(i) // read-write access to i-th vector
float* raw_ptr = vecs.data();
2 - use a std::vector<Vector4f> (same as Vector4f*, but without the memory-management issues):
std::vector<Vector4f> vecs(n);
for(auto& v:vecs) v = transform * v;
float* raw_ptr = vecs[0].data(); // assuming vecs is not empty
// you can still see it as Matrix4Xf:
Map<Matrix4Xf> vecs_as_mat(raw_ptr,4,n);
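In both cases the raw pointer can then go straight into the OpenCL buffer without an intermediate float array, e.g. (a sketch; context and err are assumed to exist):
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float) * 4 * n, raw_ptr, &err);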
Okay -- did a bit more research. The solution is to use the raw buffers exposed by the Eigen::Map class:
https://eigen.tuxfamily.org/dox/group__TutorialMapClass.html
I can create a raw buffer of floats and then create Eigen::Map objects that wrap the float buffer into vectors.
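Concretely, reusing transform and n from the answer above (a sketch; Eigen's default column-major layout makes each column one 4-float vector):
float* raw = new float[4 * n];        // buffer you can also hand to OpenCL
Map<Matrix4Xf> vecs_map(raw, 4, n);   // wraps the buffer, no copy
vecs_map = transform * vecs_map;      // safe: Eigen evaluates the product into a temporary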
Possible Duplicate:
Two ways to create a buffer object in opencl: clCreateBuffer vs. clCreateBuffer + clEnqueueWriteBuffer
What is the difference between copying data to the device immediately upon buffer creation vs. later?
i.e.
cl_mem memObj = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               size, dataPtr, NULL);
or
cl_mem memObj = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, NULL);
clEnqueueWriteBuffer(commandQueue, memObj, CL_TRUE, 0, size, dataPtr, 0, NULL, NULL);
I'm brand new to OpenCL, so I'm just trying to figure things out, i.e. which method is best to use.
Thanks!
The whole point of the create/enqueue split (in general, not just in OpenCL) is that once you create a buffer, you can write to it later, after you compute what you want to write, and you can write to it an arbitrary number of times. There is no functional difference between initializing a buffer with data at creation and creating the buffer and then writing the data. Furthermore, any performance difference should be negligible; the driver is free to optimize either path.
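For instance, the same buffer can be refilled once per iteration (a sketch; update_input() is a placeholder for your own host-side computation):
cl_mem memObj = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, NULL);
for (int pass = 0; pass < num_passes; pass++) {
    update_input(dataPtr, pass);  /* compute the next batch on the host */
    clEnqueueWriteBuffer(commandQueue, memObj, CL_TRUE, 0, size, dataPtr, 0, NULL, NULL);
    /* ... set args and enqueue the kernel that reads memObj ... */
}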