I have a kernel running on an NVidia GTX 680 whose execution time increased when I switched from using global memory to local memory.
My kernel, which is part of a finite element ray tracer, now loads each element into local memory before processing it. The data for each element is stored in a struct fastTriangle, which has the following definition:
typedef struct fastTriangle {
    float cx, cy, cz, cw;
    float nx, ny, nz, nd;
    float ux, uy, uz, ud;
    float vx, vy, vz, vd;
} fastTriangle;
I pass an array of these objects to the kernel, which is written as follows (I have removed the irrelevant code for brevity):
__kernel void testGPU(int n_samples, const int n_objects, global const fastTriangle *objects, __local int *x_res, __global int *hits) {
    // Get gid, lid, and lsize
    // Set up random number generator and thread variables

    // Local storage for the two triangles being processed
    __local fastTriangle triangles[2];

    for(int i = 0; i < n_objects; i++) { // Fire rays from each object
        event_t evt = async_work_group_copy((local float*)&triangles[0], (global float*)&objects[i], sizeof(fastTriangle)/sizeof(float), 0);

        // Initialise local memory x_res to 0's
        barrier(CLK_LOCAL_MEM_FENCE);
        wait_group_events(1, &evt);

        Vector wsNormal = { triangles[0].cw*triangles[0].nx, triangles[0].cw*triangles[0].ny, triangles[0].cw*triangles[0].nz };

        for(int j = 0; j < n_samples; j += 4) {
            // Generate a float4 of random numbers here (rands)

            for(int v = 0; v < 4; v++) { // For each ray in ray packet
                // Load the first object to be intersected
                evt = async_work_group_copy((local float*)&triangles[1], (global float*)&objects[0], sizeof(fastTriangle)/sizeof(float), 0);

                // Some initialising code and calculate ray here
                // Should have ray fully specified at this point

                for(int w = 0; w < n_objects; w++) { // Check the ray for intersection against each object
                    wait_group_events(1, &evt);

                    // Check for intersection against object w
                    float det = wsDir.x*triangles[1].nx + wsDir.y*triangles[1].ny + wsDir.z*triangles[1].nz;
                    float dett = triangles[1].nd - (triangles[0].cx*triangles[1].nx + triangles[0].cy*triangles[1].ny + triangles[0].cz*triangles[1].nz);
                    float detpx = det*triangles[0].cx + dett*wsDir.x;
                    float detpy = det*triangles[0].cy + dett*wsDir.y;
                    float detpz = det*triangles[0].cz + dett*wsDir.z;
                    float detu = detpx*triangles[1].ux + detpy*triangles[1].uy + detpz*triangles[1].uz + det*triangles[1].ud;
                    float detv = detpx*triangles[1].vx + detpy*triangles[1].vy + detpz*triangles[1].vz + det*triangles[1].vd;

                    // Interleave the copy of the next triangle
                    evt = async_work_group_copy((local float*)&triangles[1], (global float*)&objects[w+1], sizeof(fastTriangle)/sizeof(float), 0);

                    // Complete intersection calculations
                } // end for each object intersected

                if(objectNo != -1) atomic_inc(&x_res[objectNo]);
            } // end for sub rays
        } // end for each ray

        barrier(CLK_LOCAL_MEM_FENCE);
        // Add all the local x_res to global array hits
        barrier(CLK_GLOBAL_MEM_FENCE);
    } // end for each object
}
When I first wrote this kernel I did not buffer each object in local memory and instead just accessed it from global memory, i.e. instead of triangles[0].cx I would use objects[i].cx.
When setting out to optimise, I switched to using local memory as listed above, but then observed an execution time increase of around 25%.
Why would performance be worse when using local memory to buffer the objects instead of directly accessing them in global memory?
Whether local memory helps your program run faster really depends on the program. There are two things to consider when using local memory:
You spend additional work copying the data from global to local memory, and from local back to global again.
I see three calls to barrier(...); these barriers are performance killers. Every work-item has to wait at the barrier for all the others, which hinders parallelism and means the work-items no longer run independently.
Local memory is great when you read the same data many times in your computation, but the fast reads and writes have to gain you more performance than the copying and synchronizing costs.
I'm writing a renderer from scratch using OpenCL and I have a small compilation problem in my kernel, with the error:
CL_BUILD_PROGRAM : error: program scope variable must reside in constant address space static float* objects;
The problem is that this program compiles on my desktop (with Nvidia drivers) but doesn't on my laptop (also with Nvidia drivers); I also have the exact same kernel file in another project that works fine on both computers...
Does anyone have an idea what I could be doing wrong ?
As a clarification, I'm coding a raymarcher whose kernel takes a list of objects "encoded" in a float array that is needed throughout the program, which is why I need it accessible to the whole kernel.
Here is the kernel code simplified :
float* objects;

float4 getDistCol(float3 position) {
    int arr_length = objects[0];
    float4 distCol = {INFINITY, 0, 0, 0};
    int index = 1;
    while (index < arr_length) {
        float objType = objects[index];
        if (compare(objType, SPHERE)) {
            // Treats the part of the buffer as a sphere
            index += SPHERE_ATR_LENGTH;
        } else if (compare(objType, PLANE)) {
            // Treats the part of the buffer as a plane
            index += PLANE_ATR_LENGTH;
        } else {
            float4 errCol = {500, 1, 0, 0};
            return errCol;
        }
    }
    return distCol;
}
__kernel void mkernel(__global int *image, __constant int *dimension,
                      __constant float *position, __constant float *aimDir, __global float *objs) {
    objects = objs;
    // Gets ray direction and stuff
    // ...
    // ...
    float4 distCol = RayMarch(ro, rd);
    float3 impact = rd*distCol.x + ro;
    col = distCol.yzw * GetLight(impact);
    image[dimension[0]*dimension[1] - idx*dimension[1]+idy] = toInt(col);
}
getDistCol(float3 position) gets called a lot, by a lot of functions, and I would like to avoid having to pass my float buffer to every function that needs to call getDistCol()...
There is no "static" variables allowed in OpenCL C that you can declare outside of kernels and use across kernels. Some compilers might still tolerate this, others might not. Nvidia has recently changed their OpenCL compiler from LLVM 3.4 to NVVM 7 in a driver update, so you may have the 2 different compilers on your desktop/laptop GPUs.
In your case, the solution is to hand the global kernel parameter pointer over to the function:
float4 getDistCol(float3 position, __global float *objects) {
    int arr_length = objects[0]; // access objects normally, as you would in the kernel
    // ...
}

kernel void mkernel(__global int *image, __constant int *dimension, __constant float *position, __constant float *aimDir, __global float *objs) {
    // ...
    getDistCol(position, objs); // hand global objs pointer over to function
    // ...
}
Lonely variables out in the wild are only allowed in the constant address space, which is useful for large tables. They are cached in L2$, so read-only access is potentially faster. Example:
constant float objects[1234] = {
1.0f, 2.0f, ...
};
I would like to use the local/shared memory optimization to reduce global memory access, so I basically have this function
float __attribute__((always_inline)) test_unoptimized(const global float* data, ...) {
    // ...
    for(uint j=0; j<def_data_length; j++) {
        const float x = data[j];
        // do some computation with x, like finding the minimum value ...
    }
    // ...
    return x_min;
}
and do the usual local/shared memory optimization on it:
float __attribute__((always_inline)) test_optimized(const global float* data, ...) {
    // ...
    const uint lid = get_local_id(0); // shared memory optimization (only works with first ray)
    local float cache_x[def_ws];
    for(uint j=0; j<def_data_length; j+=def_ws) {
        cache_x[lid] = data[j+lid];
        barrier(CLK_LOCAL_MEM_FENCE);
        #pragma unroll
        for(uint k=0; k<min(def_ws, def_data_length-j); k++) {
            const float x = cache_x[k];
            // do some computation with x, like finding the minimum value ...
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // ...
    return x_min;
}
Now the difficulty is that test_optimized is called in the kernel only in one of two possible if/else branches. If only some threads in a workgroup execute the else-branch, all other threads must not choose the if-branch for the local memory optimization in test_optimized to work. So I created a workaround: The condition for each thread in the workgroup is atomic_or-ed into an integer and then the integer, which is the same for all threads, is checked for branching. This ensures that, if 1 or more threads in the thread block choose the else-branch, all the others do too.
kernel void test_kernel(const global float* data, global float* result, ...) {
    const uint n = get_global_id(0);
    // ...
    const bool condition = ...; // here I get some condition based on the thread ID n and global data

    local uint condition_any; // make sure all threads within a workgroup are in the if/else part
    condition_any = 0u;
    barrier(CLK_LOCAL_MEM_FENCE);
    atomic_or(&condition_any, condition);
    barrier(CLK_LOCAL_MEM_FENCE);

    if(condition_any==0u) {
        // if-part is very short
        result[n] = 0;
        return;
    } else {
        // else-part calls test_optimized function
        const float x_min = test_optimized(data, ...);
        result[n] = condition ? x_min : 0;
    }
}
The above code works flawlessly and is about 25% faster than with the test_unoptimized function. But atomically jamming a bit into the same local memory location from all threads in the workgroup seems a bit like a hack to me, and it only runs efficiently for small workgroup sizes (def_ws) of 32, 64 or 128, but not 256 or greater.
Is this trick used in other codes and does it have a name?
If not: Is there a better way to do it?
With OpenCL 1.2 or older, I don't think there's a way to do this any faster. (I'm not aware of any relevant vendor extensions, but check your implementation's list for anything promising.)
With OpenCL 2.0+, you can use workgroup functions, in this case specifically work_group_any() for this sort of thing.
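A minimal sketch (not the poster's code) of how the kernel could look with work_group_any(); the per-thread condition here is a hypothetical placeholder, and test_optimized stands in for the question's function with its extra arguments omitted. It has to be compiled with -cl-std=CL2.0 on a device that reports OpenCL 2.0 support:

kernel void test_kernel(const global float* data, global float* result) {
    const uint n = get_global_id(0);
    const bool condition = (data[n] > 0.0f);      // hypothetical per-thread condition
    // work_group_any() returns a non-zero value for every work-item in the
    // workgroup if the predicate is non-zero for at least one of them, so the
    // whole group takes the same branch without atomics or extra barriers.
    if(!work_group_any(condition)) {
        result[n] = 0.0f;                         // short if-part, taken by the whole group
    } else {
        const float x_min = test_optimized(data); // question's function, arguments abridged
        result[n] = condition ? x_min : 0.0f;
    }
}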
I am trying to implement a general matrix-matrix multiplication OpenCL kernel, one that conforms to C = α*A*B + β*C.
The Kernel
I did some research online and decided to use a modified kernel from this website as a starting point. The main modification I have made is that the allocation of local memory as working space is now dynamic. Below is the kernel I have written:
__kernel
void clkernel_gemm(const uint M, const uint N, const uint K, const float alpha,
                   __global const float* A, __global const float* B, const float beta,
                   __global float* C, __local float* Asub, __local float* Bsub) {

    const uint row = get_local_id(0);
    const uint col = get_local_id(1);
    const uint TS = get_local_size(0); // Tile size
    const uint globalRow = TS * get_group_id(0) + row; // Row ID of C (0..M)
    const uint globalCol = TS * get_group_id(1) + col; // Col ID of C (0..N)

    // Initialise the accumulation register
    float acc = 0.0f;

    // Loop over all tiles
    const int numtiles = K / TS;
    for (int t = 0; t < numtiles; t++) {
        const int tiledRow = TS * t + row;
        const int tiledCol = TS * t + col;
        Asub[col * TS + row] = A[tiledCol * M + globalRow];
        Bsub[col * TS + row] = B[globalCol * K + tiledRow];

        barrier(CLK_LOCAL_MEM_FENCE);

        for(int k = 0; k < TS; k++) {
            acc += Asub[k * TS + row] * Bsub[col * TS + k] * alpha;
        }

        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[globalCol * M + globalRow] = fma(beta, C[globalCol * M + globalRow], acc);
}
Tile Size (TS) is now a value defined in the calling code, which looks like this:
// A, B and C are 2D matrices, their cl::Buffers have already been set up
// and values appropriately set.
kernel.setArg(0, (cl_int)nrowA);
kernel.setArg(1, (cl_int)ncolB);
kernel.setArg(2, (cl_int)ncolA);
kernel.setArg(3, alpha);
kernel.setArg(4, A_buffer);
kernel.setArg(5, B_buffer);
kernel.setArg(6, beta);
kernel.setArg(7, C_buffer);
kernel.setArg(8, cl::Local(sizeof(float) * nrowA * ncolB));
kernel.setArg(9, cl::Local(sizeof(float) * nrowA * ncolB));
cl::NDRange global(nrowA, ncolB);
cl::NDRange local(nrowA, ncolB);
status = cmdq.enqueueNDRangeKernel(kernel, cl::NDRange(0), global, local);
The Problem
The problem I am encountering is that unit tests I have written (with Google's gtest) will randomly fail, but only for this particular kernel. (I have 20 other kernels in the same .cl source file that pass tests 100% of the time.)
I have a test that multiplies a 1x4 float matrix {0.0, 1.0, 2.0, 3.0} with a transposed version of itself {{0.0}, {1.0}, {2.0}, {3.0}}. The expected output is {14.0}.
However, I can get this correct result maybe just 75% of the time.
Sometimes, I can get 23.0 (GTX 970), 17.01 (GTX 750) or just -nan and 0.0 (all 3 devices). The curious part is, the respective incorrect results seem to be unique to the devices; I cannot seem to, for example, get 23.0 on the Intel CPU or the GTX 750.
I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.
What am I doing wrong here?
Things I have tried
I have verified that the data going into the kernels are correct.
I have tried to initialize both __local memory buffers to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize them properly)
I have written a test program that only executes this kernel to rule out any race conditions interacting with the rest of my program, but the bug still happens.
Other points to note
I am using the C++ wrapper retrieved directly from the Github page.
To use the wrapper, I have defined CL_HPP_MINIMUM_OPENCL_VERSION 120 and CL_HPP_TARGET_OPENCL_VERSION 120.
I am compiling the kernels with the -cl-std=CL1.2 flag.
All cl::Buffers are created with only the CL_MEM_READ_WRITE flag.
I am testing this on Ubuntu 16.04, Ubuntu 14.04, and Debian 8.
I have tested this on Intel CPUs with the Intel OpenCL Runtime 16.1 for Ubuntu installed. The runtime reports that it supports up to OpenCL 1.2
I have tested this on both Nvidia GTX 760 and 970. Nvidia only supports up to OpenCL 1.2.
All 3 platforms exhibit the same problem with varying frequency.
This looks like a complicated one. There are several things to address and they won't fit into comments, so I'll post all this as an answer even though it does not solve your problem (yet).
I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.
Such behavior is a typical indicator of a race condition.
I have tried to initialize both __local memory buffers to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize them properly)
Actually this is a good thing. Finally we have some consistency.
Initializing local memory
Initializing local memory can be done using the work items, e.g. if you have a 1D workgroup of 16 items and your local memory consists of 16 floats, just do this:
local float* ptr = ... // your pointer to local memory
int idx = get_local_id(0); // get the index for the current work-item
ptr[idx] = 0.f; // init with value 0
barrier(CLK_LOCAL_MEM_FENCE); // synchronize local memory access within workgroup
If your local memory is larger, e.g. 64 floats, you will have to use a loop where each work item initializes 4 values; at least, that is the most efficient way. However, no one will stop you from having every work item initialize every value in the local memory, even though that is complete nonsense, since you're essentially initializing it multiple times.
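A minimal sketch of such a strided initialization loop, assuming a 1D workgroup of 16 items and 64 floats of local memory (ptr is the same illustrative pointer as above):

local float* ptr = ...                    // your pointer to local memory, here 64 floats
const int lid = get_local_id(0);          // 0..15
for(int k = lid; k < 64; k += get_local_size(0)) {
    ptr[k] = 0.f;                         // each work item writes 4 distinct slots
}
barrier(CLK_LOCAL_MEM_FENCE);             // make the zeroed memory visible to the whole workgroup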
Your changes
The original algorithm looks like it is especially designed to use quadratic tiles.
__local float Asub[TS][TS];
__local float Bsub[TS][TS];
Not only that, but the size of the local memory matches the workgroup size, in their example 32x32.
When I look at your kernel arguments for local memory, I can see that you use sizes based on what are defined as M and N in the original algorithm (nrowA and ncolB), not on the tile size. This doesn't seem correct.
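For illustration only (not the poster's code), the local buffers and the workgroup would be sized consistently along these lines, assuming a tile size TS that divides the matrix dimensions:

const int TS = 16;                                     // hypothetical tile size
kernel.setArg(8, cl::Local(sizeof(float) * TS * TS));  // Asub: one TS x TS tile
kernel.setArg(9, cl::Local(sizeof(float) * TS * TS));  // Bsub: one TS x TS tile
cl::NDRange global(nrowA, ncolB);
cl::NDRange local(TS, TS);                             // workgroup matches the tile size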
Update 1
Since you have not said whether the original algorithm works for you, this is what you should do to find your error:
Create a set of test data. Make sure you only use data sizes that are actually supported by the original algorithm (e.g. minimum size, multiples of x, etc.). Also, use large data sets, since some errors only show up if multiple workgroups are dispatched.
Use the original, unaltered algorithm with your testdata sets and verify the results.
Change the algorithm only so that it uses dynamically sized local memory instead of fixed-size local memory, but make sure it has the same size as in the fixed-size approach. This is what you tried, but I think it failed due to what I have described under "Your changes".
I'm using PyOpenCL to let my GPU do some regression on a large data set. Right now the GPU is slower than the CPU, probably because there is a loop that requires access to the global memory during each increment (I think...). The data set is too large to store into the local memory, but each loop does not require the entire data set, so I want to copy a portion of this array to the local memory. My question is: how do I do this? In Python one can easily slice a portion, but I don't think that's possible in OpenCL.
Here's the OpenCL code I'm using, if you spot any more potential optimisations, please shout:
__kernel void gpu_slope(__global double * data, __global double * time, __global int * win_results, const unsigned int N, const unsigned int Nmax, const double e, __global double * result) {
    __local unsigned int n, length, leftlim, rightlim, i;
    __local double sumx, sumy, x, y, xx, xy, invlen, a, b;

    n = get_global_id(0);
    leftlim = win_results[n*2];
    rightlim = win_results[n*2+1];
    sumx = 0;
    sumy = 0;
    xy = 0;
    xx = 0;
    length = rightlim - leftlim;

    for(i = leftlim; i <= rightlim; i++) {
        x = time[i]; /* I think this is fetched from global memory */
        y = data[i];
        sumx += x;
        sumy += y;
        xy += x*y;
        xx += x*x;
    }

    invlen = 1.0/length;
    a = xy-(sumx*sumy)*invlen;
    b = xx-(sumx*sumx)*invlen;
    result[n] = a/b;
}
I'm new to OpenCL, so please bear with me. Thanks!
The main(ish) point in GPU computing is trying to utilize hardware parallelism as much as possible. Instead of using the loop, launch a kernel with a different thread for every one of the coordinates. Then, either use atomic operations (the quick-to-code, but slow-performance option), or parallel reduction, for the various sums.
AMD has a tutorial on this subject. (NVidia does too, but theirs would be CUDA-based...)
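As a rough illustration of the parallel-reduction idea, here is a minimal sketch that sums one of the quantities (the equivalent of sumx) per workgroup. The kernel name and signature are illustrative rather than taken from the question; it assumes the workgroup size is a power of two and that the device supports doubles (cl_khr_fp64):

__kernel void reduce_sumx(__global const double * time, __local double * scratch, const unsigned int N, __global double * partial_sums) {
    const unsigned int gid = get_global_id(0);
    const unsigned int lid = get_local_id(0);
    scratch[lid] = (gid < N) ? time[gid] : 0.0;        /* one element per work item */
    barrier(CLK_LOCAL_MEM_FENCE);
    for(unsigned int s = get_local_size(0)/2; s > 0; s >>= 1) {   /* tree reduction in local memory */
        if(lid < s) scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) partial_sums[get_group_id(0)] = scratch[0];      /* one partial sum per workgroup */
}

The host then sums the small partial_sums array (or runs the kernel again on it) to get the final value; the same pattern would apply to sumy, xy and xx.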
You will find examples copying to local memory in PyOpenCL's examples folder: https://github.com/inducer/pyopencl/tree/master/examples
I recommend you read, run, and customize several of these examples to learn.
I also recommend the Udacity parallel programming course: https://www.udacity.com/course/cs344 This course will help solidify your grasp of fundamental OpenCL concepts.
I want to pass a structure to an OpenCL kernel; the structure is
struct test
{
    int *x;
    float *y;
    char *z;
};
and the memory allocation and initialization look like this:
struct test t;
t.x = (int*) malloc(sizeof(int) * 100);
t.y = (float*) malloc(sizeof(float) * 50);
t.z = (char*) malloc(sizeof(char) * 25);

for(i = 0; i < 100; i++)
{
    t.x[i] = i;
    if(i < 50)
    {
        t.y[i] = i;
        if(i < 25)
            t.z[i] = 'a';
    }
}
Can I pass such a structure to an OpenCL kernel?
You can pass such a structure, but it will be pointless because x, y and z point to different memory regions. Each of these memory buffers must be transferred on its own.
Instead of a structure, it is better to allocate the buffers on the host side and pass them to the kernel as separate kernel arguments.
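A minimal sketch of that approach, with the kernel taking the three arrays as separate arguments (the kernel name is illustrative):

__kernel void process(__global const int *x, __global const float *y, __global const char *z)
{
    int i = get_global_id(0);
    // use x[i], y[i] and z[i] directly; no struct of pointers is needed
}

On the host side, each array becomes its own buffer, e.g. clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int) * 100, t.x, &err), followed by one clSetKernelArg call per buffer.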