Yesterday I got to run the unit tests of our current application on the new notebooks and got a CL_OUT_OF_RESOURCES error doing so. The code itself runs without errors on ATI cards and Intel CPUs.
The thing that made me suspicious is that the M2000M supports 'OpenCL 1.2 CUDA'. Is this standard 'OpenCL 1.2', or does it differ and do I need to modify the code?
Here is the code:
__kernel void pointNormals(__global const uint* cellLinkIds, __global const uint* cellLinks,
__global const float3* cellnormals, __global float3* pointnormals,
const uint nrPoints)
{
const uint gid = get_global_id(0);
if(gid < nrPoints)
{
const uint first = select(cellLinkIds[gid-1], (uint)0, gid==0);
const uint last = cellLinkIds[gid];
float3 pointnormal = (float3)0.f;
for(uint i = first; i < last; ++i)
{
pointnormal += cellnormals[cellLinks[i]];
}
pointnormals[gid] = normalize(pointnormal);
}
}
/edit:
In the tests I get six errors: the first at the call to clWaitForEvents, the others from clEnqueueWriteBuffer.
Found the cause:
The line const uint first = select(cellLinkIds[gid-1], (uint)0, gid==0); caused an invalid memory access when gid is 0 (the first element). select() is a function, so both of its arguments are evaluated; cellLinkIds[gid-1] is therefore read even when gid is 0, and the unsigned gid-1 wraps around to a huge out-of-bounds index.
Fixed it with const uint first = gid == 0 ? (uint)0 : cellLinkIds[gid - 1];. But what I don't get is why AMD cards worked with that bug while Nvidia returned an error.
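For reference, a sketch of the corrected kernel. The only change is the ternary: unlike a call to select(), whose arguments are always both evaluated, it never reads cellLinkIds[gid - 1] when gid is 0:
__kernel void pointNormals(__global const uint* cellLinkIds, __global const uint* cellLinks,
                           __global const float3* cellnormals, __global float3* pointnormals,
                           const uint nrPoints)
{
    const uint gid = get_global_id(0);
    if(gid < nrPoints)
    {
        /* the ternary evaluates only the taken branch */
        const uint first = gid == 0 ? (uint)0 : cellLinkIds[gid - 1];
        const uint last = cellLinkIds[gid];
        float3 pointnormal = (float3)0.f;
        for(uint i = first; i < last; ++i)
            pointnormal += cellnormals[cellLinks[i]];
        pointnormals[gid] = normalize(pointnormal);
    }
}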
I work interchangeably with 32-bit floats and 32-bit integers. I want two kernels that do exactly the same thing, but one is for integers and one is for floats. At first I thought I could use templates or something similar, but it does not seem possible to specify two kernels with the same name but different argument types?
import pyopencl as cl
import numpy as np
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, """
__kernel void arange(__global int *res_g)
{
int gid = get_global_id(0);
res_g[gid] = gid;
}
__kernel void arange(__global float *res_g)
{
int gid = get_global_id(0);
res_g[gid] = gid;
}
""").build()
Error:
<kernel>:8:15: error: conflicting types for 'arange'
__kernel void arange(__global float *res_g)
^
<kernel>:2:15: note: previous definition is here
__kernel void arange(__global int *res_g)
What is the most convenient way of doing this?
A #define, passed as a build option (-DTYPE=...), can be used for that:
code = """
__kernel void arange(__global TYPE *res_g)
{
int gid = get_global_id(0);
res_g[gid] = gid;
}
"""
prg_int = cl.Program(ctx, code).build("-DTYPE=int")
prg_float = cl.Program(ctx, code).build("-DTYPE=float")
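The same trick works with the raw C API as well. A minimal host-side sketch (ctx, dev, and src are assumed to be set up already; error checking omitted):
cl_int err;
cl_program prg_int = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);   /* same source string */
err = clBuildProgram(prg_int, 1, &dev, "-DTYPE=int", NULL, NULL);
cl_program prg_float = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
err = clBuildProgram(prg_float, 1, &dev, "-DTYPE=float", NULL, NULL);
Each built program then exposes its own arange kernel; in pyopencl you invoke them as usual (prg_int.arange(...), prg_float.arange(...)).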
In my OpenCL kernel I'm checking whether the global ID is inside the global problem size, but it is not working.
If the global problem size is divisible by the work-group size, everything is fine. If not, the kernel does nothing at all.
__kernel void move_points(const unsigned int points,
const unsigned int floors,
const unsigned int gridWidth,
const unsigned int gridHeight,
__global const GraphData *graph,
__global const float *pin_x,
__global const float *pin_y,
__global const float *pin_z,
__global float *pout_x,
__global float *pout_y,
__global float *pout_z,
__global clrngMrg31k3pHostStream *streams)
{
int id = get_global_id(0);
if (id < points) {
do kernel things...
}
}
Does somebody know where the problem is?
Thanks a lot, Robin.
If your global size is not divisible by your local size (work-group size), then the kernel will not run at all.
The enqueueNDRangeKernel() call will return CL_INVALID_WORK_GROUP_SIZE, as specified in the OpenCL documentation.
If you really want to follow the CUDA model, where you may have unused work items, then keep the check in the kernel (as you already have) and use a bigger global size that is a multiple of your local size.
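A minimal host-side sketch of that padding (variable names are illustrative; error handling omitted):
size_t local_size = 64;                      /* chosen work-group size */
size_t global_size = ((points + local_size - 1) / local_size) * local_size;  /* round up */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
The if (id < points) guard in the kernel then makes the padded work items do nothing.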
I get error -9999
Breakpoint 7, cl::detail::errHandler (err=-9999, errStr=0x43cc1f "clWaitForEvents") at /opt/AMDAPPSDK-3.0-0-Beta/include/CL/cl.hpp:321
at event.wait(), because of the following line:
valid[id] = 1;
where valid is __global int* valid.
The .cl code is:
__kernel void validateRecords(__global const char* buffer, __global const struct RecordInfo* allRecords, __global int* valid, const unsigned int n)
{
const int id=get_global_id(0);
if (id < n)
{
char* record = buffer[allRecords[id].position];
int size = allRecords[id].length;
int updateTimeLen = findFixed(record, size, ',');
if(updateTimeLen == -1 || updateTimeLen != UPDATE_TIME_LEN)
{
valid[id] = 1;
return;
}
}
}
and I get error code -9999 at valid[id] = 1;.
I just noticed that if I comment out valid[id] = 1; or int updateTimeLen = findFixed(record, size, ','); everything is fine, but when both are present I get the above error.
The device is a GTX 980 with OpenCL 1.1. Can you help, please?
I think I found the issue. First of all, I am using the AMD SDK on an Nvidia card. This worked well for me in the past (with a GTX 780 Ti). However, this time I noticed a few problems. The first one is this line:
char* record = buffer[allRecords[id].position];
which I initially (correctly) wrote as
char* record = buffer + allRecords[id].position;
but it didn't compile, so I mechanically changed it to the above, which did compile. That was the first problem, but it had nothing to do with the error.
The second was that passing a __private char* continued to throw error -9999 no matter what I did (perhaps because of the different SDK, but it might be something else), so I passed the __global char* buffer to the function instead and everything worked fine.
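Putting both fixes together, a sketch of the corrected kernel (RecordInfo, findFixed and UPDATE_TIME_LEN come from the rest of the original code and are not shown; findFixed is assumed here to accept a __global pointer):
__kernel void validateRecords(__global const char* buffer, __global const struct RecordInfo* allRecords, __global int* valid, const unsigned int n)
{
    const int id = get_global_id(0);
    if (id < n)
    {
        /* pointer arithmetic instead of indexing: record points into buffer */
        __global const char* record = buffer + allRecords[id].position;
        int size = allRecords[id].length;
        int updateTimeLen = findFixed(record, size, ',');
        if (updateTimeLen == -1 || updateTimeLen != UPDATE_TIME_LEN)
        {
            valid[id] = 1;
            return;
        }
    }
}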
I've written a bubble sort kernel. The clBuildProgram call in my createProgram helper function is giving an error.
My kernel looks like:
__kernel void sort_kernel(__global const float *a, __global const float *b)
{
const int n=100;
int j;
float temp;
int gid = get_global_id(0);
b[gid]=a[gid];
for(j=0; j < n-gid; j++)
{
if(b[j+1]<b[j])
{
temp=b[j];
b[j]=b[j+1];
b[j+1]=temp;
}
}
}
clBuildProgram is giving an error; the build log reports:
Error in kernel:
:1:1: error: unknown type name '_kernel'
_kernel void sort_kernel(__global const float *a, __global const float *b) //,
^
:1:9: error: expected identifier or '('
_kernel void sort_kernel(__global const float *a, __global const float *b) //,
        ^
:21:3: error: expected external declaration
}
^
:23:1: error: expected external declaration
}
^
:23:1: error: expected external declaration
Please tell me what the error is and how I can rectify it.
You missed a _ in your program; the error message makes that obvious. I don't think the code pasted here is the same code you actually run.
Correct _kernel to __kernel in your program.
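For reference, the declaration the compiler expects begins with a double underscore:
__kernel void sort_kernel(__global const float *a, __global float *b)
Note that b is also written to inside the kernel, so it cannot keep its const qualifier; writing through a __global const float * would be the next build error once the _kernel typo is fixed.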
I am trying to use atomic functions in my OpenCL kernel. Multiple work items are trying to write to a single memory location in parallel, and I want them to execute that particular line of code serially. I have never used an atomic function before.
I found similar problems on many blogs and forums, and I am trying one solution: two functions, 'acquire' and 'release', for locking and unlocking a semaphore. I have included the necessary OpenCL extensions, which are all supported by my device (NVIDIA GeForce GTX 630M).
My kernel execution configuration:
global_item_size = 8;
ret = clEnqueueNDRangeKernel(command_queue2, kernel2, 1, NULL, &global_item_size2, &local_item_size2, 0, NULL, NULL);
Here is my code: reducer.cl
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
typedef struct data
{
double dattr[10];
int d_id;
int bestCent;
}Data;
typedef struct cent
{
double cattr[5];
int c_id;
}Cent;
__global void acquire(__global int* mutex)
{
int occupied;
do {
occupied = atom_xchg(mutex, 1);
} while (occupied>0);
}
__global void release(__global int* mutex)
{
atom_xchg(mutex, 0); //the previous value, which is returned, is ignored
}
__kernel void reducer(__global int *keyMobj, __global int *valueMobj,__global Data *dataMobj,__global Cent *centMobj,__global int *countMobj,__global double *sumMobj, __global int *mutex)
{
__local double sum[2][2];
__local int cnt[2];
int i = get_global_id(0);
int n,j;
if(i<2)
cnt[i] = countMobj[i];
barrier(CLK_GLOBAL_MEM_FENCE);
n = keyMobj[i];
for(j=0; j<2; j++)
{
barrier(CLK_GLOBAL_MEM_FENCE);
acquire(mutex);
sum[n][j] += dataMobj[i].dattr[j];
release(mutex);
}
if(i<2)
{
for(j=0; j<2; j++)
{
sum[i][j] = sum[i][j]/countMobj[i];
centMobj[i].cattr[j] = sum[i][j];
}
}
}
Unfortunately, the solution doesn't seem to work for me. When I read centMobj back into host memory using
ret = clEnqueueReadBuffer(command_queue2, centMobj, CL_TRUE, 0, (sizeof(Cent) * 2), centNode, 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue2, sumMobj, CL_TRUE, 0, (sizeof(double) * 2 * 2), sum, 0, NULL, NULL);
I get error code -5 (CL_OUT_OF_RESOURCES) for both centMobj and sumMobj.
I can't tell whether the problem is in my atomic function code or in reading the data back into host memory. If I am using the atomic functions incorrectly, please correct me.
Thank you in advance.
In OpenCL, synchronization between work items is possible only inside a work-group. Code that tries to synchronize work items across different work-groups may work in some very specific (and implementation/device-dependent) cases, but it will fail in the general case.
The solution is either to use atomics to serialize accesses to the same memory location (without blocking any work item) or to redesign the code.
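As an illustration of the atomics route (a sketch, not the code from the question): integer updates can use atomic_add directly, while for the double sums a common workaround on devices exposing cl_khr_int64_base_atomics is a compare-and-swap loop on the value's bit pattern:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
/* Atomically adds val to *addr: retries a 64-bit compare-and-swap on the
   double's bit pattern until no other work item has interfered. */
void atomic_add_double(volatile __global double *addr, double val)
{
    union { double f; ulong i; } old_val, new_val;
    do {
        old_val.f = *addr;
        new_val.f = old_val.f + val;
    } while (atom_cmpxchg((volatile __global ulong *)addr,
                          old_val.i, new_val.i) != old_val.i);
}
This variant works on __global memory, so the __local sum array from the question would have to move to a __global double buffer (a hypothetical sum_global here): sum[n][j] += dataMobj[i].dattr[j]; would become atomic_add_double(&sum_global[n * 2 + j], dataMobj[i].dattr[j]);.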