OpenCL fast iteration through all pairs

I am quite new to OpenCL and still struggle to reason about all the consequences of GPU execution. I am writing a simulation: I have 2D points and need to calculate the "gravity" forces acting between all pairs of them. My best idea for an OpenCL kernel looks like this:
kernel void ker_fun(global const double* pts, uint pts_size, global double* fxs, global double* fys, double gravityConstant)
{
    double x = pts[2*get_global_id(0)];
    double y = pts[2*get_global_id(0)+1];
    double fx = 0;
    double fy = 0;
    for (size_t i = get_global_id(0)+1; i < pts_size; ++i) {
        double dx = x - pts[2*i];       // point[i] -> point[THIS]
        double dy = y - pts[2*i+1];
        double r2 = dx*dx + dy*dy;
        r2 = max(r2, 0.0001);           // prevent the (r2 == 0) issue
        double f = gravityConstant/r2;
        double ratio = f/sqrt(r2);      // scale (dx, dy) to the force magnitude
        dx *= ratio;
        dy *= ratio;
        fx += dx;
        fy += dy;
        atomic_add_double(&fxs[i], -dx); // equal and opposite force on point i
        atomic_add_double(&fys[i], -dy);
    }
    atomic_add_double(&fxs[get_global_id(0)], fx);
    atomic_add_double(&fys[get_global_id(0)], fy);
}
where fxs and fys are the force values in the X and Y directions (i.e. my result), and the atomic_add_double function is copied from this site (OpenCL - using atomic reduction for double).
This function works and calculates the desired result, but it is slow. Could you please advise me how to do this in a different, better way?
Thank you for your time and help.
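For reference, the usual restructuring that removes the atomics is to let every work-item loop over all pts_size points (computing each pair twice) and accumulate only its own fx/fy, so no two work-items ever write to the same location. A minimal CPU sketch of that access pattern, in C++ rather than OpenCL C (force_one_item is an illustrative name, not part of the kernel above):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU sketch of the atomics-free pattern: the body of one "work-item" gid.
// Each work-item reads all points but writes only fxs[gid]/fys[gid].
void force_one_item(const std::vector<double>& pts, std::size_t gid,
                    double gravityConstant,
                    std::vector<double>& fxs, std::vector<double>& fys)
{
    const std::size_t n = pts.size() / 2;
    const double x = pts[2*gid], y = pts[2*gid + 1];
    double fx = 0, fy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i == gid) continue;                      // skip self-interaction
        double dx = x - pts[2*i];
        double dy = y - pts[2*i + 1];
        double r2 = std::max(dx*dx + dy*dy, 0.0001); // same r2 clamp as the kernel
        double ratio = gravityConstant / (r2 * std::sqrt(r2));
        fx += dx * ratio;
        fy += dy * ratio;
    }
    fxs[gid] = fx;   // private result: no atomics needed
    fys[gid] = fy;
}
```

This does roughly twice the arithmetic of the half-triangle loop, but it removes all atomic traffic, which on GPUs is usually a large net win.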

Related

Does code hoisting exist in OpenCL? If not, is there any way to implement it?

I would like to know whether code hoisting exists on a variety of platforms, including Nvidia, AMD, and Intel. I created a simple example, and it seems like this feature does not exist. Since I'm still new to OpenCL, I don't know if I tested it correctly. The example code just performs a matrix addition and adds a constant to each entry. Here is the code:
// Just some complicated operations on the variable random_private
#define zero random_private[0]*random_private[1]*random_private[2]*random_private[3]*random_private[4]*random_private[5]*random_private[6]*random_private[7]*random_private[8]*random_private[9]
#define zero1 powr((double)zero*zero+zero,10)
#define zero2 zero1/(zero1+1)
#define zero3 zero2+zero2*zero2
//Test if the code hoisting exist
//C=A+B+something
kernel void matrix_add1(global double *A, global double *B, global double *C, global uint *random) {
    uint rowNum = 10000;
    uint colNum = 100;
    // Privatize `random` to make the hoisting valid (otherwise another thread could
    // change `random` while the loop executes, and hoisting would give a wrong answer)
    uint random_private[10] = {random[0], random[1], random[2], random[3], random[4],
                               random[5], random[6], random[7], random[8], random[9]};
    for (uint j = 0; j < colNum; j++) {
        for (uint i = 0; i < rowNum; i++) {
            // zero3 is a macro doing a complicated computation on random_private
            C[i + j*rowNum] = A[i + j*rowNum] - B[i + j*rowNum] + zero3;
        }
    }
}
//Manually do the code hoisting
kernel void matrix_add2(global double *A, global double *B, global double *C, global uint *random) {
    uint rowNum = 10000;
    uint colNum = 100;
    uint random_private[10] = {random[0], random[1], random[2], random[3], random[4],
                               random[5], random[6], random[7], random[8], random[9]};
    // Hoist the loop-invariant computation out of the loops manually
    uint tmp = zero3;
    for (uint j = 0; j < colNum; j++) {
        for (uint i = 0; i < rowNum; i++) {
            C[i + j*rowNum] = A[i + j*rowNum] - B[i + j*rowNum] + tmp;
        }
    }
}
The example runs 20 times with just one thread; here are the results on my computer:
Nvidia 1070:
matrix_add1: 28.46 sec
matrix_add2: 4.3 sec
AMD 1600X:
matrix_add1: 5.78 sec
matrix_add2: 0.16 sec
The function matrix_add1 is much slower than the function matrix_add2. Did I make any mistake in this example? Or is there any third-party compiler that can implement such an optimization and generate the intermediate code for us? Thanks!

Declaring and defining pointer vectors of vectors in an OpenCL kernel

I have a variable which is a vector of vectors. In C++ I can easily declare and define it, but in an OpenCL kernel I am facing issues. Here is an example of what I am trying to do:
std::vector<std::vector<double>> filters;
for (int m = 0; m < 3; m++)
{
    const auto& w = filters[m];
    // ... sum operation using w
}
Here, I can easily reference the values of filters[m] through w, but I am not able to do this in an OpenCL kernel file. Here is what I have tried, but it gives me the wrong output.
In the host code:
filter_dev = cl::Buffer(context,CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR,filter_size,(void*)&filters,&err);
filter_dev_buff = cl::Buffer(context,CL_MEM_READ_WRITE,filter_size,NULL,&err);
kernel.setArg(0, filter_dev);
kernel.setArg(1, filter_dev_buff);
In kernel code:
__kernel void forward_shrink(__global double* filters, __global double* weight)
{
    int i = get_global_id(0); // I tried individual values of i to index into filters,
                              // just to check the output, but it does not match
                              // the serial C++ implementation
    weight = &filters[i];
    // ... sum operations using weight
}
Can anyone help me? Where am I wrong, or what could be the solution?
You are doing multiple things wrong with your vectors.
First of all, (void*)&filters doesn't do what you want it to do: &filters doesn't return a pointer to the beginning of the actual data. For that you'll have to use filters.data().
Second, you can't use an array of arrays in OpenCL (and a vector of vectors even less). You'll have to flatten the data yourself into a 1D array before you pass it to an OpenCL kernel.
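A minimal host-side sketch of that flattening (assuming all inner vectors have the same length; flatten and inner_len are illustrative names, not part of the asker's code):

```cpp
#include <cstddef>
#include <vector>

// Flatten a vector of equally sized inner vectors into one contiguous buffer,
// so flat.data() can be handed to cl::Buffer / clCreateBuffer.
std::vector<double> flatten(const std::vector<std::vector<double>>& filters)
{
    std::vector<double> flat;
    if (filters.empty()) return flat;
    const std::size_t inner_len = filters[0].size();
    flat.reserve(filters.size() * inner_len);
    for (const auto& f : filters)
        flat.insert(flat.end(), f.begin(), f.end());
    return flat;
}
// Inside the kernel, element (m, k) is then filters_flat[m * inner_len + k].
```

The kernel then receives inner_len (or equivalent size information) as an extra argument and does the 2D-to-1D index arithmetic itself.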

Copy portion of global array to local memory

I'm using PyOpenCL to let my GPU do some regression on a large data set. Right now the GPU is slower than the CPU, probably because there is a loop that accesses global memory on every iteration (I think...). The data set is too large to fit into local memory, but each loop does not require the entire data set, so I want to copy a portion of this array to local memory. My question is: how do I do this? In Python one can easily slice a portion, but I don't think that's possible in OpenCL.
Here's the OpenCL code I'm using; if you spot any more potential optimisations, please shout:
__kernel void gpu_slope(__global double *data, __global double *time, __global int *win_results,
                        const unsigned int N, const unsigned int Nmax,
                        const double e, __global double *result) {
    /* Per-work-item scalars belong in private memory, not __local:
       a __local variable is shared by the whole work-group. */
    unsigned int n = get_global_id(0);
    unsigned int leftlim = win_results[n*2];
    unsigned int rightlim = win_results[n*2+1];
    double sumx = 0, sumy = 0, xy = 0, xx = 0;
    unsigned int length = rightlim - leftlim;
    for (unsigned int i = leftlim; i <= rightlim; i++) {
        double x = time[i]; /* I think this is fetched from global memory */
        double y = data[i];
        sumx += x;
        sumy += y;
        xy += x*y;
        xx += x*x;
    }
    double invlen = 1.0/length;
    double a = xy - (sumx*sumy)*invlen;
    double b = xx - (sumx*sumx)*invlen;
    result[n] = a/b;
}
I'm new to OpenCL, so please bear with me. Thanks!
The main(ish) point in GPU computing is trying to utilize hardware parallelism as much as possible. Instead of using the loop, launch a kernel with a different thread for every one of the coordinates. Then, either use atomic operations (the quick-to-code, but slow-performance option), or parallel reduction, for the various sums.
AMD has a tutorial on this subject. (Nvidia does too, but theirs would be CUDA-based...)
You will find examples copying to local memory in PyOpenCL's examples folder: https://github.com/inducer/pyopencl/tree/master/examples
I recommend you read, run, and customize several of these examples to learn.
I also recommend the Udacity parallel programming course: https://www.udacity.com/course/cs344 This course will help solidify your grasp of fundamental OpenCL concepts.
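The parallel-reduction pattern mentioned in the first answer can be sketched sequentially. Here is a minimal C++ illustration of the tree-style pairwise sum (tree_reduce_sum is an illustrative name; on a GPU, all additions within one level of the tree would run concurrently across work-items, with a barrier between levels):

```cpp
#include <cstddef>
#include <vector>

// Tree-style (pairwise) reduction: the summation pattern used for parallel
// reductions on GPUs. Each level halves the number of active elements.
double tree_reduce_sum(std::vector<double> v)
{
    if (v.empty()) return 0.0;
    for (std::size_t stride = 1; stride < v.size(); stride *= 2)
        for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride)
            v[i] += v[i + stride];   // one addition of the current tree level
    return v[0];
}
```

For N elements this takes O(log N) parallel steps instead of one O(N) serial loop, which is why it is the standard replacement for per-item accumulation loops.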

OpenCL matrix vector multiplication code gives correct and incorrect solutions from run to run

I am working on OpenCL code for sparse matrix operations, and I find that the code, including the kernel, works when executed once or twice, but every few runs the answer is slightly off. Here is the very simple kernel I am using:
__kernel void dsmv(int N, __global int *IA,
                   __global int *JA, __global float *A,
                   __global float *X, __global float *Y) {
    int IBGN, ICOL, IEND, ii;
    ICOL = get_global_id(0);
    if (ICOL < N)
    {
        IBGN = JA[ICOL]-1;
        IEND = JA[ICOL+1]-1-1;
        for (ii = IBGN; ii <= IEND; ii++)
        {
            Y[IA[ii]-1] += A[ii]*X[ICOL];
        }
    }
}
I can also post the fortran code that uses this kernel. I am using FortranCL.
What could cause the multiplication to give different answers from run to run?
This line looks suspicious:
Y[IA[ii]-1] += A[ii]*X[ICOL];
It seems that two work-items may increment the same memory location, so there is a potential race condition here, and since += is not an atomic operation this is a problem.
Unfortunately you can't use the built-in atomic_add instead, because it doesn't support floats. However, atomic_cmpxchg lets you build a floating-point atomic add by reinterpreting the float's bits as an integer - or just look at this existing implementation of an atomic add for floats.
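The compare-and-exchange idea can be sketched outside OpenCL as well. Here is a minimal C++ illustration using std::atomic, with the float stored as its 32-bit pattern (atomic_add_float is an illustrative name; the OpenCL version would use atomic_cmpxchg on a __global uint in the same way):

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// CAS-loop float add: keep trying to replace the stored bit pattern with the
// pattern of (old value + val) until no other thread interferes.
void atomic_add_float(std::atomic<std::uint32_t>* target, float val)
{
    std::uint32_t expected = target->load();
    for (;;) {
        float old_f;
        std::memcpy(&old_f, &expected, sizeof(float));   // bits -> float
        float new_f = old_f + val;
        std::uint32_t desired;
        std::memcpy(&desired, &new_f, sizeof(float));    // float -> bits
        // If *target still holds `expected`, swap in `desired`; on failure
        // `expected` is refreshed with the current value and we retry.
        if (target->compare_exchange_weak(expected, desired))
            return;
    }
}
```

Note that this serializes colliding updates, so it fixes correctness at a performance cost; a parallel reduction is usually faster when many work-items hit the same location.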

OpenCL - is it possible to invoke another function from within a kernel?

I am following along with a tutorial located here: http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201
The kernel they have listed is this one, which computes the elementwise sum of two vectors and stores the result in the output variable:
__kernel void vector_add_gpu(__global const float* src_a,
                             __global const float* src_b,
                             __global float* res,
                             const int num)
{
    /* get_global_id(0) returns the ID of the thread in execution.
       As many threads are launched at the same time, executing the same kernel,
       each one will receive a different ID, and consequently perform a different computation. */
    const int idx = get_global_id(0);
    /* Now each work-item asks itself: "is my ID inside the vector's range?"
       If the answer is YES, the work-item performs the corresponding computation. */
    if (idx < num)
        res[idx] = src_a[idx] + src_b[idx];
}
1) Say for example that the operation performed was much more complex than a summation - something that warrants its own function. Let's call it ComplexOp(in1, in2, out). How would I go about implementing this function such that vector_add_gpu() can call and use it? Can you give example code?
2) Now let's take the example to the extreme, and I now want to call a generic function that operates on the two numbers. How would I set it up so that the kernel can be passed a pointer to this function and call it as necessary?
Yes, it is possible. You just have to remember that OpenCL is based on C99, with some caveats. You can create other functions either inside the same kernel file or in a separate file that you include at the beginning. Auxiliary functions do not need to be declared inline; however, keep in mind that OpenCL will inline them when called. Function pointers are not available when calling auxiliary functions.
Example
float4 hit(float4 ray_p0, float4 ray_p1, float4 tri_v1, float4 tri_v2, float4 tri_v3)
{
    // logic to detect whether the ray intersects the triangle
}

__kernel void detection(__global float4* trilist, float4 ray_p0, float4 ray_p1)
{
    int gid = get_global_id(0);
    float4 hitlocation = hit(ray_p0, ray_p1, trilist[3*gid], trilist[3*gid+1], trilist[3*gid+2]);
}
You can have auxiliary functions for use in the kernel; see OpenCL user defined inline functions. You cannot pass function pointers into the kernel.
