Eigen library: pass array of Vector4f to OpenCL kernel as floats

I have a list of vectors that I need to transform by matrices on the CPU. I am storing these as a dynamically allocated array (Eigen::Vector4f*). Once they have been transformed I need to run an OpenCL kernel on the vectors. I'm wondering what the best way is to pass this data into OpenCL without having to copy the data from Eigen::Vector to a float array, as this would be fairly costly. My understanding is that Eigen internally stores the vector's values in some sort of buffer I can access?

There are many ways:
1 - The best option is probably to use a Matrix4Xf, because it lets you work on the whole set of vectors at once:
Matrix4Xf vecs(4,n);
Matrix4f transform;
vecs = transform * vecs;
vecs.row(1) // read-write access to all y components
vecs.col(i) // read-write access to i-th vector
float* raw_ptr = vecs.data();
2 - use a std::vector<Vector4f> (same layout as Vector4f*, but without the memory management issues; note that prior to C++17, a std::vector of fixed-size vectorizable Eigen types such as Vector4f needs Eigen::aligned_allocator):
std::vector<Vector4f> vecs(n);
for(auto& v:vecs) v = transform * v;
float* raw_ptr = vecs[0].data(); // assuming vecs is not empty
// you can still see it as Matrix4Xf:
Map<Matrix4Xf> vecs_as_mat(raw_ptr,4,n);
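Either way the data ends up as one contiguous block of floats, so it can be handed to OpenCL directly. A minimal sketch, assuming an already-created context (using CL_MEM_USE_HOST_PTR instead of CL_MEM_COPY_HOST_PTR can avoid even this copy on devices that share memory with the host):
cl_int err = CL_SUCCESS;
cl_mem cl_vecs = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(float) * 4 * n, // n vectors, 4 floats each
                                raw_ptr, &err);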

Okay -- did a bit more research. The solution is to wrap raw buffers using the Eigen::Map class:
https://eigen.tuxfamily.org/dox/group__TutorialMapClass.html
I can create a raw buffer of floats, then create Eigen::Map objects that present the float buffer as Eigen vectors.
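A minimal sketch of that approach (assuming transform is a Matrix4f as above, and i indexes a vector within a buffer laid out as 4 consecutive floats per vector):
float* buf = new float[4 * n];               // raw buffer shared with OpenCL
Eigen::Map<Eigen::Vector4f> v(buf + 4 * i);  // view of the i-th vector, no copy
v = transform * v;                           // result written straight back into buf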

Related

How to create a RAWSXP vector from C char* ptr without reallocation

Is there a way of creating a RAWSXP vector that is backed by an existing C char* ptr?
Below I show my current working version which needs to reallocate and copy the bytes,
and a second imagined version that doesn't exist.
// My current slow solution that uses lots of memory
SEXP getData() {
    // has size, and data
    Response resp = expensive_call();
    // COPY OVER BYTE BY BYTE
    SEXP respVec = Rf_allocVector(RAWSXP, resp.size);
    Rbyte* ptr = RAW(respVec);
    memcpy(ptr, resp.data, resp.size);
    // free the original buffer
    free(resp.data);
    return respVec;
}
// My imagined solution
SEXP getDataFast() {
    // has size, and data
    Response resp = expensive_call();
    // reuse the ptr
    SEXP respVec = Rf_allocVectorViaPtr(RAWSXP, resp.data, resp.size);
    return respVec;
}
I also noticed Rf_allocVector3, which seems to give control over memory allocations of the vector, but I couldn't get this to work. This is my first time writing an R extension, so I imagine I must be doing something stupid. I'm trying to avoid the copy as the data will be around a GB (very large, though sparse, matrices).
Copying 1 GB takes well under a second. If your call is expensive, the copy may be a marginal cost; profile to see whether it's really a bottleneck.
The way you are trying to do things is probably not possible, because how would R know how to garbage collect the data?
But assuming you are using STL containers, one neat trick I've recently seen is to use the second template argument of STL containers -- the allocator.
template<
    class T,
    class Allocator = std::allocator<T>
> class vector;
The general outline of the strategy is like this:
Create a custom allocator using R-memory that meets all the requirements (essentially you just need allocate and deallocate)
Every time you need to return data to R from an STL container, make sure you initialize it with your custom allocator
On returning the data, pull out the underlying R data created by your R-memory allocator -- no copy
This approach gives you all the flexibility of STL containers while using only memory R is aware of.
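A minimal sketch of such an allocator, assuming the documented R C API (Rf_allocVector, R_PreserveObject and R_ReleaseObject are real entry points; the side table and class shape are illustrative and untested):
#include <R.h>
#include <Rinternals.h>
#include <cstddef>
#include <map>

// Maps each allocation back to the SEXP that backs it, so deallocate()
// can release the right object. A real implementation needs error
// handling and care around re-entrancy.
static std::map<void*, SEXP> g_backing;

template <class T>
struct RAllocator {
    using value_type = T;
    RAllocator() = default;
    template <class U> RAllocator(const RAllocator<U>&) {}

    T* allocate(std::size_t n) {
        SEXP s = Rf_allocVector(RAWSXP, n * sizeof(T));
        R_PreserveObject(s);              // keep it alive across GC
        void* p = RAW(s);
        g_backing[p] = s;
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) {
        R_ReleaseObject(g_backing[p]);    // hand it back to the GC
        g_backing.erase(p);
    }
};
// On return, look up the SEXP backing vec.data() in g_backing -- no copy.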

Using vector types to improve OpenCL kernel performance

I have the following OpenCL kernel, which copies values from one buffer to another, optionally inverting the value (the 'invert' arg can be 1 or -1):
__kernel void extraction(__global const short* src_buff, __global short* dest_buff, const int record_len, const int invert)
{
    int i = get_global_id(0); // Index of record in buffer
    int j = get_global_id(1); // Index of value in record
    dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
The source buffer contains one or more "records", each containing N (record_len) short values. All records in the buffer are of equal length, and record_len is always a multiple of 32.
The global size is 2D (number of records in the buffer, record length), and I chose this as it seemed to make best use of the GPU parallel processing, with each thread being responsible for copying just one value in one record in the buffer.
(The local work size is set to NULL by the way, allowing OpenCL to determine the value itself).
After reading about vector types recently, I was wondering if I could use them to improve performance. I understand the concept, but I'm not sure how to use them in practice, partly due to a lack of good examples.
I'm sure the kernel's performance is pretty reasonable already, so this is mainly out of curiosity to see what difference it would make using vectors (or other more suitable approaches).
At the risk of being a bit naive here, could I simply change the two buffer arg types to short16, and change the second value in the 2-D global size from "record length" to "record length / 16"? Would this result in each kernel thread copying a block of 16 short values between the buffers?
Your naive assumption is basically correct, though you may want to add a hint to the compiler that this kernel is optimized for the vector type (Section 6.7.2 of the spec). In your case, you would add
__attribute__((vec_type_hint(short16)))
above your kernel function. So in your example, you would have
__attribute__((vec_type_hint(short16)))
__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int record_len, const int invert)
{
    int i = get_global_id(0); // Index of record in buffer
    int j = get_global_id(1); // Index of value in record
    dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
You are correct that your 2nd global dimension should be divided by 16, and your record_len should also be divided by 16. Also, if you were to specify the local size instead of passing NULL, you would want to divide its second dimension by 16 as well.
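To make that concrete, a minimal host-side sketch of the adjusted launch (queue, kernel and num_records are illustrative names):
size_t global[2] = { num_records, record_len / 16 }; // 2nd dimension now counts short16 elements
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                    global, NULL, // NULL local size, as before
                                    0, NULL, NULL);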
There are some other things to consider though.
You might think choosing the largest vector size should provide the best performance, especially with such a simple kernel. But in my experience, that is rarely the optimal size. You may try asking clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, but for me this is rarely accurate (also, it may give you 1, meaning the compiler will try auto-vectorization or the device doesn't have vector hardware). It is best to try different vector sizes and see which is fastest.
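For reference, a minimal sketch of that query (device is an existing cl_device_id):
cl_uint preferred_width = 0;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT,
                sizeof(preferred_width), &preferred_width, NULL);
// A value of 1 suggests scalar hardware, or that the driver prefers auto-vectorization.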
If your device supports auto-vectorization and you want to give it a go, it may help to remove your record_len parameter and replace it with get_global_size(1), so the compiler/driver can take care of dividing record_len by whatever vector size it picks. I would recommend doing this anyway, assuming record_len is equal to the global size you gave that dimension.
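With that change, the indexing line of the scalar kernel would become (a sketch; nothing else changes):
dest_buff[(i * get_global_size(1)) + j] = src_buff[(i * get_global_size(1)) + j] * invert;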
Also, you gave NULL for the local size argument so that the implementation picks a size automatically. It is guaranteed to pick a size that works, but it will not necessarily be the optimal one.
Lastly, for general OpenCL optimizations, you may want to take a look at the NVIDIA OpenCL Best Practices Guide for NVidia hardware, or the AMD APP SDK OpenCL User Guide for AMD GPU hardware. The NVidia one is from 2009, and I'm not sure how much their hardware has changed since. Notice though that it actually says:
"The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit from using vector types and instructions. These should only be used for convenience."
Older AMD hardware (pre-GCN) benefited from using vector types, but AMD suggests not using them on GCN devices (see mogu's comment). Also if you are targeting a CPU, it will use AVX hardware if available.

OpenCL Kernel : Parallel Summation within a __global variable

I've already learnt about the Parallel Sum Reduction technique. However, I want to know if it is possible to add up different values from different threads into the same __global variable, like this:
float a = ...; // Assign a different value of a in each thread
Gvar[1] += a;  // Sum into the same global variable index from all threads simultaneously
Thanks
For updating the same global memory location from different work-groups you will need to use atomic functions.
You can do so from different work-items in the same group, but this is usually a bad idea. It is almost always more efficient to perform reduction in local memory within the group and only update a global memory location once in each work-group.
Note that most atomic functions are not available for floating-point data types, so you will need to either use integer types or an alternative method to solve your problem.
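A minimal sketch of the usual compare-and-swap workaround for adding floats atomically (32-bit global integer atomics assumed; this serializes contended updates, so the local-reduction-first advice above still applies):
// Atomically adds val to *addr by retrying until no other work-item
// modified the location between the read and the swap.
void atomic_add_float(volatile __global float* addr, float val)
{
    union { unsigned int u; float f; } expected, next;
    do {
        expected.f = *addr;
        next.f = expected.f + val;
    } while (atomic_cmpxchg((volatile __global unsigned int*)addr,
                            expected.u, next.u) != expected.u);
}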

Append OpenCL result to list / Reduce solution room

I have an OpenCL kernel with multiple work items. Let's assume for discussion that I have a 2-D workspace with x*y elements working on an equally sized, but sparse, array of input elements. Few of these input elements produce a result that I want to keep; most don't. I want to enqueue another kernel that only takes the kept results as an input.
Is it possible in OpenCL to append results to some kind of list to pass them as input to another Kernel or is there a better idea to reduce the volume of the solution space? Furthermore: Is this even a good question to ask with the programming model of OpenCL in mind?
What I would do, if the amount of result data is a small percentage (i.e. 0-10%), is use local atomics and global atomics, with a global counter.
Data interface between Kernel 1 <----> Kernel 2:
int counter;                 // used by atomics to know where to write
data_type results[counter];  // used to store the results
Kernel1:
Create a kernel function that does the operation on the data
Work items that do produce a result:
Save the result to local memory, and ensure no data races occur using local atomics in a local counter.
Use work item 0 to save all the local results back to global memory using global atomics.
Kernel2:
Work items with a global ID lower than "counter" do work; the others just return (a sketch of this gate follows).
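A minimal sketch of that gate (data_type is the placeholder used above; the counter lives in the shared interface buffer):
__kernel void kernel2(__global const data_type* results,
                      __global const int* counter)
{
    int idx = get_global_id(0);
    if (idx >= *counter)
        return;           // no result stored at this index
    // ... process results[idx] ...
}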

OpenCL select/delete points from large array

I have an array of 2M+ points (planned to be increased to 20M in due course) that I am running calculations on via OpenCL. I'd like to delete any points that fall within a random triangle geometry.
How can I do this within an OpenCL kernel process?
I can already:
identify those points that fall outside the triangle (simple point in poly algorithm in the kernel)
pass their coordinates to a global output array.
But:
an OpenCL global output array cannot be dynamically sized, so I initialise it to match the input array of points in size
As a result, (0,0) points occur in the final output when a point falls within the triangle
The output array therefore does not result in any reduction per se.
Can the (0,0) points be deleted within the OpenCL context?
n.b. I am coding in OpenFrameworks, so C++ implementations link to .cl files
Just an alternative for the case where most of the points satisfy the condition:
It is possible to keep a local counter updated with local atomics, and then merge it into the global value with a single atomic_add(), which returns the "previous" global value. So you just copy the results starting at that offset.
It should be a noticeable speed-up, since the threads sync locally and only once globally. The global copy can be done in parallel, since the address ranges never overlap.
For example:
__kernel void mykernel(__global MyType * global_out, __global int * global_count, __global MyType * global_in)
{
    int lid = get_local_id(0);
    int lws = get_local_size(0);
    int idx = get_global_id(0);
    __local int local_count;
    __local int global_val;
    // I am using a local container, but a local array of pointers to global is possible as well
    __local MyType local_out[WG_SIZE]; // Ensure WG_SIZE is at least your work-group size
    if (lid == 0) {
        local_count = 0; global_val = -1;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // Classify them
    if (global_in[idx] == ....)
        local_out[atomic_inc(&local_count)] = global_in[idx];
    barrier(CLK_LOCAL_MEM_FENCE);
    // If not, we are done
    if (local_count > 0) {
        // Only the first local ID does the atomic to global
        if (lid == 0)
            global_val = atomic_add(global_count, local_count);
        // Resync all the local workers here
        barrier(CLK_LOCAL_MEM_FENCE);
        // Copy all the data, each work-item taking every lws-th element
        for (int i = lid; i < local_count; i += lws)
            global_out[global_val + i] = local_out[i];
    }
}
NOTE: I didn't compile it, but it should more or less work.
If I understood your problem, you can do the following:
In your kernel, identify the points that survived the triangle test and compact them:
    if (element[idx].x != 0.0f || element[idx].y != 0.0f) // assuming a float2-style point type
        output_array[atomic_inc(number_of_elems)] = element[idx];
Finally, the first number_of_elems entries of output_array on the host will contain your kept points.
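On the host side, a hedged sketch of reading back only the valid prefix (queue, buffer and vector names are illustrative):
cl_int n_valid = 0;
clEnqueueReadBuffer(queue, count_buf, CL_TRUE, 0, sizeof(cl_int),
                    &n_valid, 0, NULL, NULL);
clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, n_valid * sizeof(cl_float2),
                    host_points.data(), 0, NULL, NULL);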
I hope this helps,
Best
There are alternatives, all working better or worse depending on what the data looks like. I put one below.
Deleting the identified points can also be done by registering them in a separate array per work-group - you need to use the same atomic_inc as in Moises's answer (see my remark there about doing this at work-group level!). The end result is a list of start points and end points of the parts that don't need to be deleted. You can then copy those parts of the array with different threads. This is less effective if you have clusters of points that need to be deleted.
