OpenCL: Basic example not working. clSetKernelArg -38 Error

I am attempting a very simple OpenCL example and have written the code below. It compiles a simple kernel, then allocates a float array and wraps it in a cl::Buffer. However, when I call kernel.setArg(), it crashes with error -38. This error states that my cl::Buffer is invalid. I have no idea why this is happening:
#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#include <CL/cl2.hpp>
#define MULTI_LINE_STRING(ARG) #ARG
namespace op
{
const char *resizeAndMergeKernel = MULTI_LINE_STRING(
__kernel void testKernel(__global float* image)
{
}
);
}
void testCL(){
cl::Device device;
cl::Context context;
cl::CommandQueue queue;
int deviceId = 0;
// Load Device
std::vector<cl::Platform> platforms;
std::vector<cl::Device> devices;
std::string deviceName;
cl_uint i, type;
cl::Platform::get(&platforms);
type = platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
if( type == CL_SUCCESS)
{
// Get only relevant device
cl::Context allContext(devices);
std::vector<cl::Device> gpuDevices;
gpuDevices = allContext.getInfo<CL_CONTEXT_DEVICES>();
bool deviceFound = false;
for(int i=0; i<gpuDevices.size(); i++){
if(i == deviceId){
device = gpuDevices[i];
context = cl::Context(device);
queue = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE);
deviceFound = true;
cout << "Made new GPU Instance: " << deviceId << endl;
break;
}
}
if(!deviceFound)
{
throw std::runtime_error("Error: Invalid GPU ID");
}
}
// Create Kernel
cl::Program program = cl::Program(context, op::resizeAndMergeKernel, true);
cl::Kernel kernel = cl::Kernel(program, "testKernel");
// Simple Buffer
cl_int err;
float* test = new float[3*224*224];
cl::Buffer x = cl::Buffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(float) * 3 * 224 * 224, (void*)test, &err);
cout << err << endl;
kernel.setArg(0,x); // CRASHES WITH cl::Error -38
}
As you can see the last line kernel.setArg(0,x) crashes with error -38.

It's not a "crash"; it's an error code. OpenCL error -38 is CL_INVALID_MEM_OBJECT, which means the memory object passed to the call is not valid. You are passing a cl::Buffer object to setArg, but you need to pass the cl_mem handle that represents that buffer. cl::Buffer's operator() returns that handle, so use kernel.setArg(0,x()). The added () is the only change (yes, it's subtle).
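A numeric status such as -38 is easier to act on once it is mapped back to its symbolic name. The following is a minimal sketch and not part of the original answer: the function name clErrorName is hypothetical, and the table covers only a handful of codes, with values assumed to match the constants in CL/cl.h.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: map a few OpenCL status codes to their names.
// The numeric values mirror constants from CL/cl.h; the table is
// deliberately incomplete.
std::string clErrorName(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -34: return "CL_INVALID_CONTEXT";
        case -38: return "CL_INVALID_MEM_OBJECT";
        case -42: return "CL_INVALID_BINARY";
        default:  return "unknown (" + std::to_string(code) + ")";
    }
}
```

With exceptions enabled, a catch block could pass the caught error's code to a helper like this one before printing it.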

Related

openCL trouble saving compiled binaries for CPU and GPU simultaneously

I'm writing an OpenCL program that runs on both the CPU and the GPU, and I'm trying to cache the binaries after creating my program with clCreateProgramWithSource(). I create my context and program with CL_DEVICE_TYPE_ALL and build the source for all of those devices.
I then take the binaries and store them to disk (one binary file per device) so that on subsequent starts my program automatically calls clCreateProgramWithBinary.
The problem is that if I save the binaries that were created with CL_DEVICE_TYPE_ALL, the binary for the CPU gets corrupted and clCreateProgramWithBinary reports an error.
To get all the binary files saved to disk properly, I've had to edit my code to first run with CL_DEVICE_TYPE_CPU and save the CPU binary on its own, then edit it again to run with CL_DEVICE_TYPE_GPU and save the GPU binaries, and finally switch it back to CL_DEVICE_TYPE_ALL. If I do this, clCreateProgramWithBinary loads the binary for each device type correctly and my program executes.
So is this just a quirk of OpenCL, that I can't build binaries for GPUs and CPUs together? Or am I doing this incorrectly?
I'm basing my code on the implementation of binary saving found here: https://code.google.com/p/opencl-book-samples/source/browse/trunk/src/Chapter_6/HelloBinaryWorld/HelloBinaryWorld.cpp?r=42 with modifications in place to handle multiple devices.
Here are some portions of my code:
/*----Initial setup of platform, context and devices---*/
cl_int err, deviceCount;
cl_device_id *devices;
cl_platform_id platform;
cl_context context;
cl_program program;
err = clGetPlatformIDs(1, &platform, NULL);
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &deviceCount);
devices = new cl_device_id[deviceCount];
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, deviceCount, devices, NULL);
context = clCreateContext(NULL, deviceCount, devices, NULL, NULL, &err);
/*---Build Program---*/
int numFiles = 2;
const char *sourceFiles[] =
{
"File1.cl",
"File2.cl",
};
char *sourceStrings[numFiles];
for(int i = 0; i < numFiles; i++)
{
sourceStrings[i] = ReadFile(sourceFiles[i]);
}
/*---Create the compute program from the source buffer---*/
program = clCreateProgramWithSource(context, numFiles, (const char **)sourceStrings, NULL, &err);
/*---Build the program executable---*/
err = clBuildProgram(program, deviceCount, devices, NULL, NULL, NULL);
/*----Save binary to disk---*/
//Determine the size of each program binary
size_t *programBinarySizes = new size_t[deviceCount];
err = clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t) * deviceCount, programBinarySizes, NULL);
if(err != CL_SUCCESS)
{
delete [] devices;
delete [] programBinarySizes;
return false;
}
unsigned char **programBinaries = new unsigned char*[deviceCount];
for(cl_uint i = 0; i < deviceCount; i++)
{
programBinaries[i] = new unsigned char[programBinarySizes[i]];
}
//Get all of the program binaries
err = clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char *) * deviceCount, programBinaries, NULL);
if (err != CL_SUCCESS)
{
delete [] devices;
delete [] programBinarySizes;
for (cl_uint i = 0; i < deviceCount; i++)
{
delete [] programBinaries[i];
}
delete [] programBinaries;
}
//Store the binaries
for(cl_uint i = 0; i < deviceCount; i++)
{
// Store the binary for all devices
std::string currFile = binaryFile + to_string(i) + ".txt";
FILE *fp = fopen(currFile.c_str(), "wb");
fwrite(programBinaries[i], 1, programBinarySizes[i], fp);
fclose(fp);
}
// Cleanup
delete [] programBinarySizes;
for (cl_uint i = 0; i < deviceCount; i++)
{
delete [] programBinaries[i];
}
delete [] programBinaries;
And then on the next run my code will call this function to create the program from the binaries:
unsigned char **programBinaries = new unsigned char *[deviceCount];
size_t sizes[deviceCount];
for(int i = 0; i < deviceCount; i++)
{
string currFile = binaryFile + to_string(i) + ".txt";
FILE *fp = fopen(currFile.c_str(), "rb");
if(!fp) return NULL;
size_t binarySize;
fseek(fp, 0, SEEK_END);
binarySize = ftell(fp);
sizes[i] = binarySize;
rewind(fp);
programBinaries[i] = new unsigned char[binarySize];
fread(programBinaries[i], 1, binarySize, fp);
fclose(fp);
}
cl_int errNum = 0;
cl_program program;
cl_int binaryStatus;
program = clCreateProgramWithBinary(context,
deviceCount,
devices,
sizes,
(const unsigned char **)programBinaries,
&binaryStatus,
&errNum);
delete [] programBinaries;
errNum = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
I have a Retina MacBook Pro (rMBP) that exposes three devices on its single Apple platform. I ran your code on it and encountered the same problem. I don't know the solution, but I can give you some hints for debugging.
Do not use ftell to compute the size of a regular file; see the reason here.
I modified your snippet as follows:
#include <fcntl.h> // open, O_RDONLY
#include <sys/stat.h> // fstat, S_ISREG
#include <unistd.h> // close
unsigned char **programBinaries = new unsigned char *[deviceCount];
size_t sizes[deviceCount];
int fd;
struct stat st;
for(cl_uint i = 0; i < deviceCount; i++)
{
string currFile = binaryFile + to_string(i) + ".txt";
fd = open(currFile.c_str(), O_RDONLY);
if (fd == -1) {
return -1;
}
// Only accept regular files
if ((fstat(fd, &st) != 0) || (!S_ISREG(st.st_mode))) {
close(fd);
return -2;
}
size_t binarySize;
FILE *fp = fdopen(fd, "rb");
if (fseeko(fp, 0 , SEEK_END) != 0) {
fclose(fp);
return -3;
}
binarySize = ftello(fp);
cout << "device " << i << ": " << binarySize << endl;
sizes[i] = binarySize;
rewind(fp);
programBinaries[i] = new unsigned char[binarySize];
fread(programBinaries[i], 1, binarySize, fp);
// fclose also closes the underlying fd, so no separate close(fd) is needed
fclose(fp);
}
On my system, however, I got the same result as with your original code.
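Another way to avoid a separate size query is to let a stream read until end-of-file and take the size from the resulting container. The following is a minimal sketch under that assumption, not the code from the question or the answer; the names readBinaryFile and writeBinaryFile are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file into memory in binary mode.
// Returns an empty vector if the file cannot be opened.
std::vector<unsigned char> readBinaryFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return {};
    }
    // Copy every byte up to end-of-file; the vector's size is the file size.
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}

// Hypothetical counterpart for a round trip: write the bytes unchanged.
bool writeBinaryFile(const std::string& path,
                     const std::vector<unsigned char>& bytes) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    out.write(reinterpret_cast<const char*>(bytes.data()),
              static_cast<std::streamsize>(bytes.size()));
    out.flush();
    return out.good();
}
```

For each device, the vector's size() and data() would then supply the length and pointer entries that clCreateProgramWithBinary takes.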
According to the specification, clCreateProgramWithBinary has this signature:
cl_program clCreateProgramWithBinary ( cl_context context,
cl_uint num_devices,
const cl_device_id *device_list,
const size_t *lengths,
const unsigned char **binaries,
cl_int *binary_status,
cl_int *errcode_ret)
binary_status: Returns whether the program binary for each device specified in device_list was loaded successfully or not. It is an array of num_devices entries and returns CL_SUCCESS in binary_status[i] if binary was successfully loaded for device specified by device_list[i]; otherwise returns CL_INVALID_VALUE if lengths[i] is zero or if binaries[i] is a NULL value or CL_INVALID_BINARY in binary_status[i] if program binary is not a valid binary for the specified device. If binary_status is NULL, it is ignored.
If you modify your code like this:
cl_int binaryStatus[deviceCount];
program = clCreateProgramWithBinary(context,
deviceCount,
devices,
sizes,
(const unsigned char **)programBinaries,
binaryStatus,
&errNum);
for (cl_uint i = 0; i < deviceCount; ++i)
{
cout << "device: " << i << ": " << binaryStatus[i] << endl;
}
then you will normally get the following results:
device: 0: 0
device: 1: -42
The first line means that the first program binary (for the CPU) was loaded successfully. The -42 in the second line corresponds to CL_INVALID_BINARY, which means the program binary failed to load for that device.
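Scanning the per-device status array by eye gets tedious with more devices. As a small sketch (the name failedDevices is hypothetical, and the statuses are passed as plain ints rather than cl_int so the example stands alone), the rejected devices can be collected like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: given the per-device status array filled in by
// clCreateProgramWithBinary, return the indices of the devices whose
// binary was rejected. Any nonzero status counts; 0 is CL_SUCCESS.
std::vector<std::size_t> failedDevices(const std::vector<int>& binaryStatus) {
    std::vector<std::size_t> failed;
    for (std::size_t i = 0; i < binaryStatus.size(); ++i) {
        if (binaryStatus[i] != 0) {
            failed.push_back(i);
        }
    }
    return failed;
}
```

For the output shown above, the result would contain only index 1.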
I also tried to retrieve the build options from the program, but got nothing.
// set device_id to 0, 1, 2, ...
cl_uint device_id = 0;
cl_build_status status;
// Determine the reason for the error
char buildOptions[16384];
char buildLog[16384];
clGetProgramBuildInfo(program, devices[device_id], CL_PROGRAM_BUILD_STATUS,
sizeof(cl_build_status), &status, NULL);
std::cout << "status: " << status << endl;
clGetProgramBuildInfo(program, devices[device_id], CL_PROGRAM_BUILD_OPTIONS,
sizeof(buildOptions), buildOptions, NULL);
std::cout << "build options: " << endl;
std::cout << buildOptions;
clGetProgramBuildInfo(program, devices[device_id], CL_PROGRAM_BUILD_LOG,
sizeof(buildLog), buildLog, NULL);
std::cout << "build log: " << endl;
std::cout << buildLog;
My guess is that this is a bug in the OpenCL driver. I hope the above is helpful.

Mandelbrot in OpenCL

I have this Mandelbrot kernel written for an OpenCL program. As a test, I've decided to store the whole complex plane in a flat array. My problem is that when I print the output I get a list of 1s (the values the results array was initialized with) and not the result of the kernel's work.
Where could the problem be?
#include <iostream>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
int main(){
using namespace std;
int xPixel=100;
int yPixel=100;
float ics[xPixel];
for(int i=0;i<xPixel;++i)
ics[i]=-2+i*((float)4/xPixel);
float ypsilon[yPixel];
for(int i=0;i<yPixel;++i)
ypsilon[i]=-2+i*((float)4/yPixel);
int results[xPixel*yPixel];
for(int i=0;i<xPixel*yPixel;++i)
results[i]=1;
cl_context context;
cl_context_properties properties[3];
cl_kernel kernel;
cl_command_queue command_queue;
cl_program program;
cl_int err;
cl_uint num_of_platforms=0;
cl_platform_id platform_id;
cl_device_id device_id;
cl_uint num_of_devices=0;
cl_mem memX, memY, memOutput;
size_t global;
const char *KernelSource =
"__kernel void mandelbrot(__global float *ics, __global float *ypsilon, __global int *output){\n"\
"size_t id=get_global_id(0);\n"\
"int yPixel=100;\n"\
"for(int i=0;i<yPixel;i++){\n"\
"float x=0;\n"\
"float y=0;\n"\
"int counter=0;\n"\
"while(counter<1000){\n"\
"if(x*x+y*y>2*2){\n"\
"output[(id*yPixel)+i]=counter;\n"\
"break;\n"\
"}\n"\
"float xTemp=x*x-y*y+ics[id];\n"\
"y=2*x*y+ypsilon[i];\n"\
"x=xTemp;\n"\
"counter++;\n"\
"}\n"\
"}\n"\
"}\n";
// retrieves a list of platforms available
if (clGetPlatformIDs(1, &platform_id, &num_of_platforms)!= CL_SUCCESS){
cout<<"Unable to get platform_id\n"<<endl;
return 1;
}
// try to get a supported GPU device
if (clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id,&num_of_devices) != CL_SUCCESS){
cout<<"Unable to get device_id\n"<<endl;
return 1;
}
//context properties list - must be terminated with 0
properties[0]=CL_CONTEXT_PLATFORM;
properties[1]=(cl_context_properties)platform_id;
properties[2]=0;
//create a context with the GPU device
context=clCreateContext(properties,1,&device_id,NULL,NULL,&err);
//create a command queue using the context and device
command_queue=clCreateCommandQueue(context,device_id,0,&err);
//create a program from the kernel source code
program=clCreateProgramWithSource(context,1,(const char**)&KernelSource,NULL,&err);
//compile the program
if(clBuildProgram(program,0,NULL,NULL,NULL,NULL)!=CL_SUCCESS){
cout<<"Error building program"<<endl;
return 1;
}
//specify which kernel from the program to execute
kernel=clCreateKernel(program,"mandelbrot",&err);
//create buffers for input and output
memX=clCreateBuffer(context,CL_MEM_READ_ONLY,sizeof(float)*xPixel,NULL,NULL);
memY=clCreateBuffer(context,CL_MEM_READ_ONLY,sizeof(float)*yPixel,NULL,NULL);
memOutput=clCreateBuffer(context,CL_MEM_WRITE_ONLY,sizeof(int)*(xPixel*yPixel),NULL,NULL);
//load data into the input buffer
clEnqueueWriteBuffer(command_queue,memX,CL_TRUE,0,sizeof(float)*xPixel,ics,0,NULL,NULL);
clEnqueueWriteBuffer(command_queue,memY,CL_TRUE,0,sizeof(float)*yPixel,ypsilon,0,NULL,NULL);
//set the argument list for the kernel command
clSetKernelArg(kernel,0,sizeof(cl_mem),&memX);
clSetKernelArg(kernel,1,sizeof(cl_mem),&memY);
clSetKernelArg(kernel,2,sizeof(cl_mem),&memOutput);
global=xPixel*yPixel;
//enqueue the kernel command for execution
clEnqueueNDRangeKernel(command_queue,kernel,1,NULL,&global,NULL,0,NULL,NULL);
clFinish(command_queue);
//copy the results from out of the output buffer
clEnqueueReadBuffer(command_queue,memOutput,CL_TRUE,0,sizeof(int)*(xPixel*yPixel),results,0,NULL,NULL);
//print output
for(int i=0;i<xPixel;++i){
for(int j=0;j<yPixel;++j){
cout<<results[(i*yPixel)+j]<<" ";
}
cout<<endl;
}
//cleanup - release OpenCL resources
clReleaseMemObject(memX);
clReleaseMemObject(memY);
clReleaseMemObject(memOutput);
clReleaseProgram(program);
clReleaseKernel(kernel);
clReleaseCommandQueue(command_queue);
clReleaseContext(context);
}
I'm not seeing the exact reason, but I do have a question: If you're running this on every element then what is the "i" looping over "yPixel" for? It seems like you're doing X*Y*Y work instead of X*Y work (your global size is X*Y then the kernel loops on Y again).
If you add "output[(id*yPixel)+i]=42" before the "i" loop then what does your output buffer hold? That will tell you if the problem lies in your kernel or your host code.
To help anyone else looking at this, I've reformatted the kernel code:
__kernel void mandelbrot(__global float *ics, __global float *ypsilon, __global int *output)
{
size_t id=get_global_id(0);
int yPixel=100;
for(int i=0;i<yPixel;i++)
{
float x=0;
float y=0;
int counter=0;
while(counter<1000)
{
if(x*x+y*y>2*2)
{
output[(id*yPixel)+i]=counter;
break;
}
float xTemp=x*x-y*y+ics[id];
y=2*x*y+ypsilon[i];
x=xTemp;
counter++;
}
}
}
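One way to act on the X*Y versus X*Y*Y observation is to give each work item exactly one pixel and derive both coordinates from its flat id. The following CPU-side sketch illustrates that mapping; it is an assumption about the intended design, not the asker's kernel. The names escapeCount and mandelbrotGrid are hypothetical, and unlike the kernel above it also writes a result for points that never escape.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Escape-time count for one point c = (cx, cy): the number of iterations
// before |z|^2 exceeds 4, or maxIter if that never happens.
int escapeCount(float cx, float cy, int maxIter) {
    float x = 0.0f;
    float y = 0.0f;
    int counter = 0;
    while (counter < maxIter && x * x + y * y <= 4.0f) {
        const float xTemp = x * x - y * y + cx;
        y = 2.0f * x * y + cy;
        x = xTemp;
        ++counter;
    }
    return counter;
}

// CPU stand-in for a 1-D launch of xPixel * yPixel work items: each flat id
// handles exactly one pixel and always writes a result.
std::vector<int> mandelbrotGrid(const std::vector<float>& ics,
                                const std::vector<float>& ypsilon,
                                int maxIter) {
    const std::size_t yPixel = ypsilon.size();
    std::vector<int> output(ics.size() * yPixel);
    for (std::size_t id = 0; id < output.size(); ++id) {
        const std::size_t ix = id / yPixel;  // index into ics
        const std::size_t iy = id % yPixel;  // index into ypsilon
        output[id] = escapeCount(ics[ix], ypsilon[iy], maxIter);
    }
    return output;
}
```

In a kernel, the body of the for loop would become the whole work item, with id taken from get_global_id(0), so no index ever runs past the end of ics or output.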

What might cause OpenCL to crash on cl::Program.build?

This program crashes when I try to cl::Program.build() but I don't know why. It crashes on the last line of this block of code:
#define __NO_STD_VECTOR
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hPP>
#include <iostream>
#include <fstream>
#include <string>
#include <CL/cl.h>
using namespace std;
using namespace cl;
int _tmain(int argc, _TCHAR* argv[])
{
int tmpSize = 1024;
float **my2D = new float*[tmpSize];
for(int i = 0; i < tmpSize; i++)
{
my2D[i] = new float[tmpSize];
for(int i2 = 0; i2 < tmpSize; i2++)
{
my2D[i][i2] = 5;
}
}
cl::vector <Platform> platforms;
Platform::get(&platforms);
cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[1]()), 0};
Context context(CL_DEVICE_TYPE_ALL, cps);
cl::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
CommandQueue queue = CommandQueue(context, devices[0], 0);
int W = tmpSize; //i.width();
int H = tmpSize; //i.height();
Buffer d_ip = Buffer(context, CL_MEM_READ_ONLY, W*H*sizeof(float));
Buffer d_op = Buffer(context, CL_MEM_WRITE_ONLY, W*H*sizeof(float));
queue.enqueueWriteBuffer(d_ip, CL_TRUE, 0, W*H*sizeof(float), my2D);
std::ifstream sourceFileName("c:\\users\\me\\desktop\\img_rotate_kernel.cl");
std::string sourceFile(istreambuf_iterator<char>(sourceFileName), (istreambuf_iterator<char>()));
Program::Sources rotn_source(1,std::make_pair(sourceFile.c_str(), sourceFile.length() + 1));
Program rotn_program(context, rotn_source);
rotn_program.build(devices); // <----- CRASHES HERE
}
It uses this kernel:
__kernel void img_rotate(__global float* dest_data, __global float* src_data, int W, int H, float sinTheta, float cosTheta)
const int ix = get_global_id(0);
const int iy = get_global_id(1);
float x0 = W/2.0f;
float y0 = W/2.0f;
float xOff = ix-x0;
float yOff = iy - y0;
int xpos = (int)(xOff*cosTheta + yOff*sinTheta + x0);
int ypos = (int)(yOff*cosTheta - yOff*sinTheta + y0);
if(((int)xpos>=0) && ((int)xpos < W) && ((int)ypos>=0) && ((int)ypos<H))
{
dest_data[iy*W+ix] = src_data[ypos*W+xpos];
}
}
Here is the exception dialog I get when it crashes
From the OpenCL C++ wrapper spec:
cl::Program::Program returns a valid program object and err is set to CL_SUCCESS if the program object is
created successfully. Otherwise, it returns one of the following error values returned in err [...]
Your program object was likely not created properly. Change your program construction call to use the err parameter, following this signature:
cl::Program::Program(const Context& context, const Sources& sources, cl_int * err = NULL)
And make sure err == CL_SUCCESS before doing anything else with your program object.
Most OpenCL calls allow you to pass a pointer to an error parameter. You should do so and check it after each call (at least in debug builds) to reduce future headaches.
I modified your source code a little. Here it is; I'll explain my changes right after.
#define __NO_STD_VECTOR
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <iostream>
#include <fstream>
#include <string>
#include <CL/cl.h>
#define ARRAY_SIZE 128
using namespace std;
using namespace cl;
int main(int, char**)
{
int err;
float my2D[ARRAY_SIZE * ARRAY_SIZE] = { 0 };
for(int i = 0; i < ARRAY_SIZE * ARRAY_SIZE; i++)
{
my2D[i] = 5;
}
cl::vector <Platform> platforms;
err = Platform::get(&platforms);
if(err != CL_SUCCESS) {
std::cout << "Platform::get failed - " << err << std::endl;
std::cin.get();
}
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0]()), 0 };
Context context(CL_DEVICE_TYPE_ALL, cps, nullptr, nullptr, &err);
if(err != CL_SUCCESS) {
std::cout << "Context::Context failed - " << err << std::endl;
std::cin.get();
}
cl::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>(&err);
if(err != CL_SUCCESS) {
std::cout << "Context::getInfo failed - " << err << std::endl;
std::cin.get();
}
CommandQueue queue = CommandQueue(context, devices[0], 0, &err);
if(err != CL_SUCCESS) {
std::cout << "CommandQueue::CommandQueue failed - " << err << std::endl;
std::cin.get();
}
int W = ARRAY_SIZE; //i.width();
int H = ARRAY_SIZE; //i.height();
Buffer d_ip = Buffer(context, CL_MEM_READ_ONLY, W*H*sizeof(float), nullptr, &err);
if(err != CL_SUCCESS) {
std::cout << "Buffer::Buffer 1 failed - " << err << std::endl;
std::cin.get();
}
Buffer d_op = Buffer(context, CL_MEM_WRITE_ONLY, W*H*sizeof(float), nullptr, &err);
if(err != CL_SUCCESS) {
std::cout << "Buffer::Buffer 2 failed - " << err << std::endl;
std::cin.get();
}
err = queue.enqueueWriteBuffer(d_ip, CL_TRUE, 0, W*H*sizeof(float), &my2D[0]);
if(err != CL_SUCCESS) {
std::cout << "Queue::enqueueWriteBuffer failed - " << err << std::endl;
std::cin.get();
}
std::ifstream sourceFileName("so_question.cl");
std::string sourceFile(std::istreambuf_iterator<char>(sourceFileName), (std::istreambuf_iterator<char>()));
Program::Sources rotn_source(1,std::make_pair(sourceFile.c_str(), sourceFile.length() + 1));
Program rotn_program(context, rotn_source, &err);
if(err != CL_SUCCESS) {
std::cout << "Program::Program failed - " << err << std::endl;
std::cin.get();
}
err = rotn_program.build(devices);
if(err != CL_SUCCESS) {
std::cout << "Program::build failed - " << err << std::endl;
std::cin.get();
}
}
You'll notice I added a lot more error checks. This allowed me to find out that the call to Context::Context actually failed in your initial program. The issue was likely that platforms[1] didn't exist (there was only one element in the vector), so I switched it to platforms[0].
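The repeated if(err != CL_SUCCESS) blocks can be folded into one small helper. This is a minimal sketch rather than part of the code above: the name checkStatus is hypothetical, the status is taken as a plain int so the example stands alone, and it throws instead of pausing on std::cin.get().

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: turn a nonzero OpenCL-style status into an exception
// that names the failing call. Zero (CL_SUCCESS) passes through silently.
void checkStatus(int err, const std::string& what) {
    if (err != 0) {
        throw std::runtime_error(what + " failed - " + std::to_string(err));
    }
}
```

Each call site then shrinks to one line, for example checkStatus(err, "Context::Context") right after the constructor call.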
Once that was fixed, I got an access violation on the queue.enqueueWriteBuffer() call. The issue was that your two-dimensional array was actually an array of separately heap-allocated arrays. That's a problem because OpenCL expects to read the data from contiguous memory, which is not what you get when allocating with new in a loop as you did: there is no guarantee that the rows sit next to each other in memory.
To solve this point, I allocated a one dimensional array on the stack (see the loop at the beginning). The call then becomes
queue.enqueueWriteBuffer(d_ip, CL_TRUE, 0, W*H*sizeof(float), &my2D[0]);
However, you probably won't be able to do this with a 1024 x 1024 array of float because you'll exhaust the stack space. If you need an array that big, allocate a single one-dimensional array with new, large enough to hold all your data, and perform the index arithmetic yourself. This ensures that the entire storage is one contiguous chunk.
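The index arithmetic mentioned above can be wrapped so that call sites still read like two-dimensional access. The following is a minimal sketch assuming row-major layout; the FlatImage type is hypothetical, and it uses std::vector in place of a raw new[] to get the same contiguous heap storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: a width x height image stored as one contiguous
// block and addressed with manual row-major index arithmetic.
struct FlatImage {
    std::size_t width;
    std::size_t height;
    std::vector<float> pixels;  // heap-allocated, so large sizes are fine

    FlatImage(std::size_t w, std::size_t h, float fill)
        : width(w), height(h), pixels(w * h, fill) {}

    float& at(std::size_t row, std::size_t col) {
        assert(row < height && col < width);
        return pixels[row * width + col];
    }

    // Pointer and byte count in the form a buffer-write call expects.
    const float* data() const { return pixels.data(); }
    std::size_t byteSize() const { return pixels.size() * sizeof(float); }
};
```

The values returned by data() and byteSize() are what would be handed to enqueueWriteBuffer in place of &my2D[0] and W*H*sizeof(float).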
The code now errors with CL_BUILD_PROGRAM_FAILURE on the err = rotn_program.build() call which means there's probably an error in your CL program code. Since this is an entirely different issue, I'll let you figure this one out.

clBuildProgram yields AccessViolationException when building this specific kernel

This is a part of some sort of parallel reduction/extremum kernel. I have reduced it to the minimum code that still gets clBuildProgram crashing (note that it really crashes, and doesn't just return an error code):
EDIT: It seems like this also happens when local_value is declared global instead of local.
EDIT2 / SOLUTION: The problem was that there was an infinite loop. I should have written remaining_items >>= 1 instead of remaining_items >> 1. As has been said in the answers, the nvidia compiler seems not very robust when it comes to compile/optimization errors.
kernel void testkernel(local float *local_value)
{
size_t thread_id = get_local_id(0);
int remaining_items = 1024;
while (remaining_items > 1)
{
// throw away the right half of the threads
remaining_items >> 1; // <-- SPOTTED THE BUG
if (thread_id > remaining_items)
{
return;
}
// look for a greater value in the right half of the memory space
int right_index = thread_id + remaining_items;
float right_value = local_value[right_index];
if (right_value > local_value[thread_id])
{
local_value[thread_id] = right_value;
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
Removing the lines return; and/or local_value[thread_id] = right_value; causes clBuildProgram to finish successfully.
I can reproduce this problem on all of my computers (NVIDIA GTX 560, GT 555M, GT 540M; they're all Fermi 2.1 architecture). It occurs with the NVIDIA CUDA Toolkit SDK versions 4.0, 4.1 and 4.2, with either the x64 or the x86 libraries.
Does anyone have an idea what could be the problem?
Is it possible that local (aka shared) memory is automatically assumed to be (WORK_GROUP_SIZE) * sizeof(its_base_type)? That would explain why it works when the lines I mentioned above are removed.
Minimal host code (C99 compatible) for reproduction:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#define RETURN_THROW(expression) do { cl_int ret = expression; if (ret) { printf(#expression " FAILED: %d\n" , ret); exit(1); } } while (0)
#define REF_THROW(expression) do { cl_int ret; expression; if (ret) { printf(#expression " FAILED: %d\n" , ret); exit(1); } } while (0)
int main(int argc, char **argv)
{
// Load the kernel source code into the array source_str
FILE *fp;
fp = fopen("testkernel.cl", "rb");
if (!fp)
{
fprintf(stderr, "Failed to load kernel.\n");
exit(1);
}
fseek(fp, 0, SEEK_END);
int filesize = ftell(fp);
rewind(fp);
char *source_str = (char*)calloc(filesize + 1, sizeof(char)); // +1 leaves room for the terminating 0 written below
size_t bytes_read = fread(source_str, 1, filesize, fp);
source_str[bytes_read] = 0;
fclose(fp);
// Get platform information
cl_uint num_platforms;
RETURN_THROW(clGetPlatformIDs(0, NULL, &num_platforms));
cl_platform_id *platform_ids = (cl_platform_id *)calloc(num_platforms, sizeof(cl_platform_id));
RETURN_THROW(clGetPlatformIDs(num_platforms, platform_ids, NULL));
cl_device_id selected_device_id = NULL;
printf("available platforms:\n");
for (cl_uint i = 0; i < num_platforms; i++)
{
char platform_name[50];
RETURN_THROW(clGetPlatformInfo(platform_ids[i], CL_PLATFORM_NAME, 50, platform_name, NULL));
printf("%s\n", platform_name);
// get devices for this platform
cl_uint num_devices;
RETURN_THROW(clGetDeviceIDs(platform_ids[i], CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices));
cl_device_id *device_ids = (cl_device_id *)calloc(num_devices, sizeof(cl_device_id));
RETURN_THROW(clGetDeviceIDs(platform_ids[i], CL_DEVICE_TYPE_GPU, num_devices, device_ids, NULL));
// select first nvidia device
if (strstr(platform_name, "NVIDIA")) // ADAPT THIS ACCORDINGLY
{
selected_device_id = device_ids[0];
}
}
if (selected_device_id == NULL)
{
printf("No NVIDIA device found\n");
exit(1);
}
// Create an OpenCL context
cl_context context;
REF_THROW(context = clCreateContext(NULL, 1, &selected_device_id, NULL, NULL, &ret));
// Create a program from the kernel source
cl_program program;
REF_THROW(program = clCreateProgramWithSource(context, 1, (const char **)&source_str, NULL, &ret));
// Build the program
cl_int ret = clBuildProgram(program, 1, &selected_device_id, NULL, NULL, NULL);
if (ret)
{
printf("BUILD ERROR\n");
// build error - get build log and display it
size_t build_log_size;
ret = clGetProgramBuildInfo(program, selected_device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &build_log_size);
char *build_log = (char *)malloc(build_log_size); // malloc instead of new[] keeps this valid C99
ret = clGetProgramBuildInfo(program, selected_device_id, CL_PROGRAM_BUILD_LOG, build_log_size, build_log, NULL);
printf("%s\n", build_log);
exit(1);
}
printf("build finished successfully\n");
return 0;
}
In my experience the NVIDIA compiler isn't very robust when it comes to handling build errors, so you probably have a compile error somewhere.
I think your problem is indeed the return, or more to the point its combination with barrier. According to the OpenCL spec on barriers:
All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.
If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier.
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.
So I think your problem is probably that a lot of threads would return before getting to the barrier, making this code invalid. Maybe you should try something like this:
kernel void testkernel(local float *local_value) {
size_t thread_id = get_local_id(0);
int remaining_items = 1024;
while (remaining_items > 1) {
remaining_items >>= 1;// throw away the right half of the threads
if (thread_id <= remaining_items) {
// look for a greater value in the right half of the memory space
int right_index = thread_id + remaining_items;
float right_value = local_value[right_index];
if (right_value > local_value[thread_id])
local_value[thread_id] = right_value;
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
Edit: Furthermore, as noted in the comments, it needs to be remaining_items >>= 1 instead of remaining_items >> 1 in order to avoid an infinite loop.
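Both points can be checked on the CPU before touching the kernel. The sketch below is a sequential stand-in, not the kernel itself: halvingSteps shows that the compound assignment makes the loop terminate, and maxByHalving runs the same tree reduction with the end of each round standing where the barrier would be. The function names are hypothetical, and the inner loop uses i < stride so that i + stride stays inside the array, which is an assumption about the intended indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how often the loop body runs when the stride is halved with the
// compound assignment. A plain `n >> 1` would compute a value, discard it,
// and leave n unchanged, so the loop would never end.
int halvingSteps(int n) {
    int steps = 0;
    while (n > 1) {
        n >>= 1;
        ++steps;
    }
    return steps;
}

// Sequential stand-in for the tree reduction: in each round, element i
// (for i < stride) takes the larger of itself and element i + stride.
// Assumes a non-empty input whose size is a power of two.
float maxByHalving(std::vector<float> values) {
    assert(!values.empty());
    assert((values.size() & (values.size() - 1)) == 0);
    std::size_t stride = values.size();
    while (stride > 1) {
        stride >>= 1;
        for (std::size_t i = 0; i < stride; ++i) {
            if (values[i + stride] > values[i]) {
                values[i] = values[i + stride];
            }
        }
        // End of round: this is where every work item must reach the barrier.
    }
    return values[0];
}
```

Because every index i takes part in every round here, no "thread" leaves early, which mirrors the requirement that all work items reach the barrier on each iteration.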

OpenCL enqueueNDRangeKernel causes Access Violation error

I keep getting an access violation error with all the kernels I try to build myself. Other kernels that I take from books seem to work fine.
https://github.com/ssarangi/VideoCL - This is where the code is.
Something seems to be missing here. Could someone help me with this?
Thanks so much.
[James] - Thanks for the suggestion, you are right. I am on Windows 7 with an AMD Redwood card, the Catalyst 11.7 drivers and AMD APP SDK 2.5. I am posting the code below.
#include <iostream>
#include "bmpfuncs.h"
#include "CLManager.h"
void main()
{
float theta = 3.14159f/6.0f;
int W ;
int H ;
const char* inputFile = "input.bmp";
const char* outputFile = "output.bmp";
float* ip = readImage(inputFile, &W, &H);
float *op = new float[W*H];
//We assume that the input image is the array “ip”
//and the angle of rotation is theta
float cos_theta = cos(theta);
float sin_theta = sin(theta);
try
{
CLManager* clMgr = new CLManager();
// Build the Source
unsigned int pgmID = clMgr->buildSource("rotation.cl");
// Create the kernel
cl::Kernel* kernel = clMgr->makeKernel(pgmID, "img_rotate");
// Create the memory Buffers
cl::Buffer* clIp = clMgr->createBuffer(CL_MEM_READ_ONLY, W*H*sizeof(float));
cl::Buffer* clOp = clMgr->createBuffer(CL_MEM_READ_WRITE, W*H*sizeof(float));
// Get the command Queue
cl::CommandQueue* queue = clMgr->getCmdQueue();
queue->enqueueWriteBuffer(*clIp, CL_TRUE, 0, W*H*sizeof(float), ip);
// Set the arguments to the kernel
kernel->setArg(0, clOp);
kernel->setArg(1, clIp);
kernel->setArg(2, W);
kernel->setArg(3, H);
kernel->setArg(4, sin_theta);
kernel->setArg(5, cos_theta);
// Run the kernel on specific NDRange
cl::NDRange globalws(W, H);
queue->enqueueNDRangeKernel(*kernel, cl::NullRange, globalws, cl::NullRange);
queue->enqueueReadBuffer(*clOp, CL_TRUE, 0, W*H*sizeof(float), op);
storeImage(op, outputFile, H, W, inputFile);
}
catch(cl::Error error)
{
std::cout << error.what() << "(" << error.err() << ")" << std::endl;
}
}
I am getting the error at the queue->enqueueNDRangeKernel line.
I have the queue and the kernel stored in a class.
CLManager::CLManager()
: m_programIDs(-1)
{
// Initialize the Platform
cl::Platform::get(&m_platforms);
// Create a Context
cl_context_properties cps[3] = {
CL_CONTEXT_PLATFORM,
(cl_context_properties)(m_platforms[0])(),
0
};
m_context = cl::Context(CL_DEVICE_TYPE_GPU, cps);
// Get a list of devices on this platform
m_devices = m_context.getInfo<CL_CONTEXT_DEVICES>();
cl_int err;
m_queue = new cl::CommandQueue(m_context, m_devices[0], 0, &err);
}
cl::Kernel* CLManager::makeKernel(unsigned int programID, std::string kernelName)
{
cl::CommandQueue queue = cl::CommandQueue(m_context, m_devices[0]);
cl::Kernel* kernel = new cl::Kernel(*(m_programs[programID]), kernelName.c_str());
m_kernels.push_back(kernel);
return kernel;
}
I checked your code, although I'm on Linux. At runtime I get error -38, which is CL_INVALID_MEM_OBJECT, so I went and checked your buffers.
cl::Buffer* clIp = clMgr->createBuffer(CL_MEM_READ_ONLY, W*H*sizeof(float));
cl::Buffer* clOp = clMgr->createBuffer(CL_MEM_READ_WRITE, W*H*sizeof(float));
Then you pass the buffers as pointers:
kernel->setArg(0, clOp);
kernel->setArg(1, clIp);
But setArg expects a value, so the buffer pointers should be dereferenced:
kernel->setArg(0, *clOp);
kernel->setArg(1, *clIp);
After those changes the cat rotates ;)
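A likely reason the pointer version compiles without complaint is that a generic setArg accepts any argument type and forwards its bytes, so a pointer is treated as just another value. The following sketch illustrates that idea with stand-in types only; FakeBuffer, forwardsPointerBytes and checkedArgSize are hypothetical and do not reproduce the real wrapper's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a wrapper object that owns one opaque handle.
struct FakeBuffer {
    void* handle;
};

// Hypothetical stand-in for a generic setArg that accepts any type T and
// would forward sizeof(T) bytes starting at &value. Returns true when T is
// a pointer, i.e. when the bytes forwarded would be the address of the
// wrapper rather than the wrapper's contents.
template <typename T>
bool forwardsPointerBytes(const T& value) {
    (void)value;
    return std::is_pointer<T>::value;
}

// A stricter variant that refuses pointer arguments at compile time.
template <typename T>
std::size_t checkedArgSize(const T& value) {
    static_assert(!std::is_pointer<T>::value,
                  "pass the object itself (dereference the pointer)");
    (void)value;
    return sizeof(T);
}
```

Under this reading, the runtime receives the address of the wrapper object where it expects a memory-object handle, which would explain CL_INVALID_MEM_OBJECT; dereferencing the pointer passes the wrapper by value instead.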
