Problems compiling with atomic functions - OpenCL

I'm trying to compile this OpenCL code:
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
__kernel void nQueens( __global int * data, __global int * result,
                       __local int * stack, __local int *stack_size,
                       int board_size)
{
    atom_inc( stack_size );
}
And I get this error:
Your OpenCL kernels failed to compile: Error: Code selection failed to
select: 0x5307370: i32,ch = AtomicLoadAdd 0x53072e8, 0x5303d68,
0x53011a8 <0x4edf478:0> alignment=4
Error: CL_BUILD_PROGRAM_FAILURE
Thanks.

atom_inc is the 64-bit version and atomic_inc is the 32-bit version. Also, stack_size should be declared volatile. Since you are using 32-bit integers, you should use atomic_inc instead.
From http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/atomic_inc.html :
int atomic_inc (volatile __local int *p )
"A 64-bit version of this function, atom_inc, is enabled by cl_khr_int64_base_atomics. "

Static variable in OpenCL C

I'm writing a renderer from scratch using OpenCL, and I have a small compilation problem in my kernel, with the error:
CL_BUILD_PROGRAM : error: program scope variable must reside in constant address space static float* objects;
The problem is that this program compiles on my desktop (with nvidia drivers) and doesn't work on my laptop (with nvidia drivers). I also have the exact same kernel file in another project that works fine on both computers...
Does anyone have an idea what I could be doing wrong?
As a clarification, I'm coding a raymarcher whose kernel takes a list of objects "encoded" in a float array that is needed a lot in the program, and that's why I need it accessible to the whole kernel.
Here is the kernel code, simplified:
float* objects;

float4 getDistCol(float3 position) {
    int arr_length = objects[0];
    float4 distCol = {INFINITY, 0, 0, 0};
    int index = 1;
    while (index < arr_length) {
        float objType = objects[index];
        if (compare(objType, SPHERE)) {
            // Treats the part of the buffer as a sphere
            index += SPHERE_ATR_LENGTH;
        } else if (compare(objType, PLANE)) {
            // Treats the part of the buffer as a plane
            index += PLANE_ATR_LENGTH;
        } else {
            float4 errCol = {500, 1, 0, 0};
            return errCol;
        }
    }
    return distCol; // the fall-through case was missing a return
}
__kernel void mkernel(__global int *image, __constant int *dimension,
                      __constant float *position, __constant float *aimDir,
                      __global float *objs) {
    objects = objs;
    // Gets ray direction and stuff
    // ...
    // ...
    float4 distCol = RayMarch(ro, rd);
    float3 impact = rd*distCol.x + ro;
    float3 col = distCol.yzw * GetLight(impact);
    image[dimension[0]*dimension[1] - idx*dimension[1]+idy] = toInt(col);
}
Where getDistCol(float3 position) gets called a lot by a lot of functions and I would like to avoid having to pass my float buffer to every function that needs to call getDistCol()...
There are no "static" variables in OpenCL C that you can declare outside of kernels and use across kernels. Some compilers might still tolerate this; others might not. Nvidia has recently changed their OpenCL compiler from LLVM 3.4 to NVVM 7 in a driver update, so you may have two different compilers on your desktop/laptop GPUs.
In your case, the solution is to hand the global kernel parameter pointer over to the function:
float4 getDistCol(float3 position, __global float *objects) {
    int arr_length = objects[0]; // access objects normally, as you would in the kernel
    // ...
}
kernel void mkernel(__global int *image, __constant int *dimension,
                    __constant float *position, __constant float *aimDir,
                    __global float *objs) {
    // ...
    getDistCol(position, objs); // hand the global objs pointer over to the function
    // ...
}
Lonely variables out in the wild are only allowed in the constant address space, which is useful for large tables. They are cached in L2$, so read-only access is potentially faster. Example:
constant float objects[1234] = {
    1.0f, 2.0f, ...
};

OpenCL says CL_KERNEL_WORK_GROUP_SIZE is 256 - but accepts 1024; why?

On NVIDIA GPUs from recent years, one can run kernels with up to 1024 "threads" per block (or work-items per workgroup, in OpenCL parlance) as long as the kernel uses fewer than 64 registers. Specifically, that holds for a simple kernel such as:
__kernel void vectorAdd(
    __global unsigned char * __restrict C,
    __global unsigned char const * __restrict A,
    __global unsigned char const * __restrict B,
    unsigned long length)
{
    int i = get_global_id(0);
    if (i < length)
        C[i] = A[i] + B[i] + 2;
}
however, if we build this kernel and run:
built_kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);
this yields 256, not 1024 (on a Quadro RTX 6000 with CUDA 11.6.55 for example).
That seems like a bug in the API. Is it? Or is there some legitimate reason why CL_KERNEL_WORK_GROUP_SIZE should be 256?

OpenCL: copy from constant memory directly to global output corrupts data

Is this a driver bug, or are you required to copy to local memory before going back out to global? The broken version has the same byte position corrupted in each output.
__kernel void test(__constant item_t items[], __constant uint *xs,
                   uint stride, __global ushort8 *output)
{
    ushort8 stats;
    size_t id = get_global_id(0);
    xs += id * stride;
    //stats = items[xs[0]].stats; output[id] = stats; -- this works
    output[id] = items[xs[0]].stats; // this doesn't.
}
Tested on Geforce GTX 280, driver 331.82, Windows 8.1 64bit.
Edit:
Never mind: copying locally to 'stats' doesn't fix it either.
Edit2:
__constant ushort8 input --> corrupted results.
__global ushort8 input --> OK results.
__constant ushort[8] --> OK.
__global ushort[8] --> OK.
https://devtalk.nvidia.com/default/topic/470881/use-of-constant-memory-breaks-with-opencl-1-1-constant-memory-worked-fine-in-1-0-but-fails-in-1-1/
OK, so the problem is that by declaring all the global-scope variables as __constant, I exceeded the maximum number of __constant variables the device supports; when this happens, the contents of the __constant variables become random...
In simpler words: count all the uses of the __constant keyword in the *.cl source file; this number has to be less than the maximum number of __constant variables supported by the device in use.
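The relevant limits can be queried from the device at runtime with clGetDeviceInfo. A hedged host-side sketch (it assumes a valid cl_device_id named device, obtained earlier from clGetDeviceIDs, and the standard OpenCL headers):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Print the device's __constant limits; 'device' is assumed valid. */
void print_constant_limits(cl_device_id device)
{
    cl_uint max_args;   /* max number of __constant arguments */
    cl_ulong max_size;  /* max total __constant buffer size, in bytes */

    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_ARGS,
                    sizeof(max_args), &max_args, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(max_size), &max_size, NULL);

    printf("max __constant args: %u, max __constant buffer: %lu bytes\n",
           (unsigned)max_args, (unsigned long)max_size);
}
```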

OpenCL scalar vs vector

I have a simple kernel:
__kernel void vecadd(__global const float *A,
                     __global const float *B,
                     __global float *C)
{
    int idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}
Why, when I change float to float4, does the kernel run more than 30% slower?
All the tutorials say that using vector types speeds up computation...
On the host side, memory allocated for the float4 arguments is 16-byte aligned, and global_work_size for clEnqueueNDRangeKernel is 4 times smaller.
Kernel runs on AMD HD5770 GPU, AMD-APP-SDK-v2.6.
Device info for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT returns 4.
EDIT:
global_work_size = 1024*1024 (and greater)
local_work_size = 256
Time measured using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
For smaller global_work_size (8196 for float / 2048 for float4), the vectorized version is faster, but I would like to know why.
I don't know which tutorials you refer to, but they must be old.
Both ATI and NVIDIA have used scalar GPU architectures for at least half a decade now.
Nowadays, using vectors in your code is mostly a syntactic convenience; it bears no performance benefit over plain scalar code.
It turns out a scalar architecture is better for GPUs than a vector one - it is better at utilizing the hardware resources.
I am not sure why the vectors would be that much slower for you without knowing more about the workgroup and global sizes; I would expect at least the same performance.
If it is suitable for your kernel, can you start with C having the values in A? This would cut down memory access by 33%. Maybe this applies to your situation?
__kernel void vecadd(__global const float4 *B,
                     __global float4 *C)
{
    int idx = get_global_id(0);
    C[idx] += B[idx];
}
Also, have you tried reading the values into a private vector, then adding? Or maybe both strategies:
__kernel void vecadd(__global const float4 *A,
                     __global const float4 *B,
                     __global float4 *C)
{
    int idx = get_global_id(0);
    float4 tmp = A[idx] + B[idx];
    C[idx] = tmp;
}

Error CL_INVALID_KERNEL_NAME when I use cl_khr_fp64 in a kernel

I have an error in an OpenCL kernel: when I try to use the cl_khr_fp64 extension, the kernel compiles and the build log is empty, but when I call clCreateKernel, I get a CL_INVALID_KERNEL_NAME error.
The source that fails:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void simple( __global char *x, __global char *y ){
    int id = get_global_id(0);
    y[id]=2*x[id];
}
This source compiles fine:
__kernel void simple( __global char *x, __global char *y ){
    int id = get_global_id(0);
    y[id]=2*x[id];
}
I'm using OpenCL 1.0 with a Tesla C1060 that has cl_khr_fp64 in CL_DEVICE_EXTENSIONS, driver 280.13, and CL_PLATFORM_VERSION = OpenCL 1.1 CUDA 4.0.1.
The problem was that before the call to clCreateProgramWithSource, we removed the newlines from the source. E.g., the source:
"__kernel void f( __global char *x ){\nint id = get_global_id(0);\nx[id]=2;\n}"
becomes:
"__kernel void f( __global char *x ){"
"int id = get_global_id(0);"
"x[id]=2;}"
This causes no problem until we add the preprocessor directive: it's the OpenCL preprocessor that actually wants the newlines to be there. So it should be written as:
"__kernel void f( __global char *x ){\n"
"int id = get_global_id(0);\n"
"x[id]=2;}\n"
This is one of the very things that has been bugging me. The problem, I think, is that the previously compiled code is cached somewhere and reused, so your new changes give weird errors.
To fix it (NOT a "real solution", but it works for me), try changing your program name (and kernel name, maybe); e.g. if the program is a.out, then the next time you compile, make it a2.out and see if that fixes it. I hope this helps.
If you find a better solution, please let us know.
I also came across such a bug a few days ago and I just worked it out. So I'm sharing my solution here, although it's quite weird and I still don't know why it works.
static inline void CreateOCLKernels()
{
    std::cout << "ocl lowlevelengine: Creating ocl kernels ...\n";
    filterSubsample_ocl_kernel = clCreateKernel(program, "filterSubsampleUChar4Kernel", &clError);
    checkErr(clError, "clCreateKernel0");
    filterSubsampleWithHoles_float4_ocl_kernel = clCreateKernel(program, "filterSubsampleWithHolesFloat4Kernel", &clError);
    checkErr(clError, "clCreateKernel1");
    filterSubsampleWithHoles_float_ocl_kernel = clCreateKernel(program, "filterSubsampleWithHolesFloatKernel", &clError);
    checkErr(clError, "clCreateKernel2");
    gradientX_ocl_kernel = clCreateKernel(program, "gradientXKernel", &clError);
    checkErr(clError, "clCreateKernel3");
    gradientY_ocl_kernel = clCreateKernel(program, "gradientYKernel", &clError);
    checkErr(clError, "clCreateKernel4");
    // type-dependent ocl memset kernels
    memset_ocl_kernel_Vector4s = clCreateKernel(program, "memsetKernelVector4s", &clError);
    checkErr(clError, "clCreateKernel5");
}
This is my original code, a static function called by the constructor of some class. The constructor can be called without any problem. However, each time the above function was called, I would receive the "invalid kernel name" error, because OpenCL could not find the kernel "filterSubsampleUChar4Kernel".
I tried a lot of things, and none of them worked. Today, almost by accident, I tried changing the kernel name and it succeeded. What I did was nothing more than change "filterSubsampleUChar4Kernel" to "filterSubsampleKernel". I also tried other names, e.g. "filterSubsampleKernel_test" and "filterSubsample1Kernel", but they didn't work. Quite weird, isn't it?
I guess you write your OpenCL code as a string, such as:
std::string code =
    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable"
    "__kernel void simple( __global char *x, __global char *y )"
    "{"
    "int id = get_global_id(0);"
    "y[id]=2*x[id];"
    "}";
Just add "\n" at the end of the #pragma line:
std::string code =
    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
    "__kernel void simple( __global char *x, __global char *y )"
    "{"
    "int id = get_global_id(0);"
    "y[id]=2*x[id];"
    "}";