OpenCL enqueueNDRangeKernel causes Access Violation error

I am continuously getting an Access Violation error with all the kernels I try to build. Other kernels, which I took from books, seem to work fine.
https://github.com/ssarangi/VideoCL - This is where the code is.
Something seems to be missing here. Could someone help me with this?
Thanks so much.
[James] - Thanks for the suggestion, and you are right. I am doing it on Windows 7 with an AMD Redwood card, with the Catalyst 11.7 drivers and AMD APP SDK 2.5. I am posting the code below.
#include <iostream>
#include "bmpfuncs.h"
#include "CLManager.h"

void main()
{
    float theta = 3.14159f/6.0f;
    int W;
    int H;
    const char* inputFile = "input.bmp";
    const char* outputFile = "output.bmp";

    float* ip = readImage(inputFile, &W, &H);
    float* op = new float[W*H];

    // We assume that the input image is the array "ip"
    // and the angle of rotation is theta
    float cos_theta = cos(theta);
    float sin_theta = sin(theta);

    try
    {
        CLManager* clMgr = new CLManager();

        // Build the source
        unsigned int pgmID = clMgr->buildSource("rotation.cl");

        // Create the kernel
        cl::Kernel* kernel = clMgr->makeKernel(pgmID, "img_rotate");

        // Create the memory buffers
        cl::Buffer* clIp = clMgr->createBuffer(CL_MEM_READ_ONLY, W*H*sizeof(float));
        cl::Buffer* clOp = clMgr->createBuffer(CL_MEM_READ_WRITE, W*H*sizeof(float));

        // Get the command queue
        cl::CommandQueue* queue = clMgr->getCmdQueue();
        queue->enqueueWriteBuffer(*clIp, CL_TRUE, 0, W*H*sizeof(float), ip);

        // Set the arguments to the kernel
        kernel->setArg(0, clOp);
        kernel->setArg(1, clIp);
        kernel->setArg(2, W);
        kernel->setArg(3, H);
        kernel->setArg(4, sin_theta);
        kernel->setArg(5, cos_theta);

        // Run the kernel on a specific NDRange
        cl::NDRange globalws(W, H);
        queue->enqueueNDRangeKernel(*kernel, cl::NullRange, globalws, cl::NullRange);
        queue->enqueueReadBuffer(*clOp, CL_TRUE, 0, W*H*sizeof(float), op);

        storeImage(op, outputFile, H, W, inputFile);
    }
    catch (cl::Error error)
    {
        std::cout << error.what() << "(" << error.err() << ")" << std::endl;
    }
}
I am getting the error at the queue->enqueueNDRangeKernel line.
I have the queue and the kernel stored in a class.
CLManager::CLManager()
    : m_programIDs(-1)
{
    // Initialize the platform
    cl::Platform::get(&m_platforms);

    // Create a context
    cl_context_properties cps[3] = {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties)(m_platforms[0])(),
        0
    };
    m_context = cl::Context(CL_DEVICE_TYPE_GPU, cps);

    // Get a list of devices on this platform
    m_devices = m_context.getInfo<CL_CONTEXT_DEVICES>();

    cl_int err;
    m_queue = new cl::CommandQueue(m_context, m_devices[0], 0, &err);
}

cl::Kernel* CLManager::makeKernel(unsigned int programID, std::string kernelName)
{
    cl::CommandQueue queue = cl::CommandQueue(m_context, m_devices[0]);
    cl::Kernel* kernel = new cl::Kernel(*(m_programs[programID]), kernelName.c_str());
    m_kernels.push_back(kernel);
    return kernel;
}

I checked your code. I'm on Linux though. At runtime I'm getting Error -38, which means CL_INVALID_MEM_OBJECT. So I went and checked your buffers.
cl::Buffer* clIp = clMgr->createBuffer(CL_MEM_READ_ONLY, W*H*sizeof(float));
cl::Buffer* clOp = clMgr->createBuffer(CL_MEM_READ_WRITE, W*H*sizeof(float));
Then you pass the buffers as pointers:
kernel->setArg(0, clOp);
kernel->setArg(1, clIp);
But the templated setArg expects the argument by value; passing a cl::Buffer* makes it copy the raw pointer bytes as the kernel argument, which is not a valid cl_mem handle (hence CL_INVALID_MEM_OBJECT). Dereference the buffer pointers instead:
kernel->setArg(0, *clOp);
kernel->setArg(1, *clIp);
After those changes the cat rotates ;)
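For context, here is a simplified sketch of what the templated setArg in cl.hpp effectively does (illustrative, condensed from the real header):

template <typename T>
cl_int setArg(cl_uint index, T value)
{
    // Copies sizeof(T) bytes of 'value' as the raw kernel argument.
    // With T = cl::Buffer this passes the underlying cl_mem handle;
    // with T = cl::Buffer* it passes the host pointer's bytes instead,
    // which the runtime rejects with CL_INVALID_MEM_OBJECT (-38).
    return clSetKernelArg(object_, index, sizeof(T), &value);
}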

Related

OpenCL: Basic example not working. clSetKernelArg -38 Error

I am attempting a very simple OpenCL example. I have developed the code below. It compiles a simple kernel, and then I create a simple float* buffer and wrap it in a cl::Buffer. However, when I call the kernel.setArg() function, it crashes with error -38. This error states that my cl::Buffer is invalid, and I have no idea why:
#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#include <CL/cl2.hpp>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>
using std::cout; using std::endl;

#define MULTI_LINE_STRING(ARG) #ARG

namespace op
{
    const char *resizeAndMergeKernel = MULTI_LINE_STRING(
        __kernel void testKernel(__global float* image)
        {
        }
    );
}

void testCL(){
    cl::Device device;
    cl::Context context;
    cl::CommandQueue queue;
    int deviceId = 0;

    // Load device
    std::vector<cl::Platform> platforms;
    std::vector<cl::Device> devices;
    std::string deviceName;
    cl_uint type;
    cl::Platform::get(&platforms);
    type = platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
    if( type == CL_SUCCESS)
    {
        // Get only the relevant device
        cl::Context allContext(devices);
        std::vector<cl::Device> gpuDevices;
        gpuDevices = allContext.getInfo<CL_CONTEXT_DEVICES>();
        bool deviceFound = false;
        for(int i=0; i<gpuDevices.size(); i++){
            if(i == deviceId){
                device = gpuDevices[i];
                context = cl::Context(device);
                queue = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE);
                deviceFound = true;
                cout << "Made new GPU Instance: " << deviceId << endl;
                break;
            }
        }
        if(!deviceFound)
        {
            throw std::runtime_error("Error: Invalid GPU ID");
        }
    }

    // Create the kernel
    cl::Program program = cl::Program(context, op::resizeAndMergeKernel, true);
    cl::Kernel kernel = cl::Kernel(program, "testKernel");

    // Simple buffer
    cl_int err;
    float* test = new float[3*224*224];
    cl::Buffer x = cl::Buffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(float) * 3 * 224 * 224, (void*)test, &err);
    cout << err << endl;
    kernel.setArg(0,x); // CRASHES WITH cl::Error -38
}
As you can see the last line kernel.setArg(0,x) crashes with error -38.
It's not a "crash", it's an error code. OpenCL error -38 is CL_INVALID_MEM_OBJECT: the cl_mem object is not valid. This is because you are passing a cl::Buffer object to setArg, but you need to pass the cl_mem handle that represents the buffer instead. The cl::Buffer operator() method returns that handle, so use kernel.setArg(0, x()). Note that the () is the added part (yes, it's subtle).
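Spelled out as a minimal sketch (reusing the question's variables):

cl_mem handle = x();        // cl::Buffer::operator() returns the raw cl_mem handle
kernel.setArg(0, handle);   // same effect as kernel.setArg(0, x())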

OpenCL Segmentation Error at clEnqueueNDRangeKernel

I have been working on convolution using OpenCL in Eclipse. It gives a segmentation fault after enqueueNDRangeKernel.
Here is my host code. I have taken the input image using OpenCV, and then:
const int width = image.size().width;
const int height = image.size().height;
std::cout<<"width: \t"<<width<<"\t height: "<<height<<std::endl;
std::size_t in_imagesize = (width*height)*sizeof(float);
std::vector<float> ptr(width*height,0);
const float filter[3] = {1,2,3};
std::size_t filter_size = 3*sizeof(float);
const int FilterRadius = 1;
cv::Mat result_image = cv::Mat(cvSize(width,height), CV_32FC1);
std::size_t out_imagesize = sizeof(float)*(width*height);
std::vector<float> read_buffer(width*height,0);
Then the context, command queue, and kernel program are created, and after that:
cl::Buffer input_dev, filter_kernel, output_dev;
input_dev = cl::Buffer(ctx, CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR, in_imagesize, image.data, &err);
if(err != CL_SUCCESS){
    std::cout<<"Input Buffer Failed "<<std::endl;
}
output_dev = cl::Buffer(ctx, CL_MEM_READ_WRITE, out_imagesize, NULL, &err);
if(err != CL_SUCCESS){
    std::cout<<"Output Buffer Failed "<<std::endl;
}
filter_kernel = cl::Buffer(ctx, CL_MEM_READ_ONLY, filter_size, NULL, &err);
if(err != CL_SUCCESS){
    std::cout<<"Filter Buffer Failed "<<std::endl;
}
std::cout<<"filter_kernel write buffer "<<std::endl;
queue.enqueueWriteBuffer(filter_kernel, CL_TRUE, 0, 3*sizeof(float), filter, NULL, NULL);

// Create the kernel
std::cout<<"Now try create kernel objects .."<<std::endl;
cl::Kernel kernel(prg, "ConvH_naive", &err);
if(err != CL_SUCCESS)
{
    std::cout<<"create Kernel_naive failed \n"<<std::endl;
}
Then Kernel Arguments and after that: -
cl::NDRange globalsize(width,height);
cl::NDRange localsize(1,1);
cl::NDRange offset(0,0);
std::cout<<"Enqueuing the Kernel"<<std::endl;
if(queue.enqueueNDRangeKernel(kernel, offset, globalsize, localsize, NULL, NULL) != CL_SUCCESS)
{
    std::cout<<"Failed enqueuing the Kernel"<<std::endl;
}
queue.finish();
After this come the read buffer and imshow, but the code stops after this statement with a segmentation fault.
Can anyone help? Is it possible that the problem is in the kernel code? Shall I add that too?
A local size of (1,1) is typically a very bad choice.
What platform are you running on? What device (e.g. CPU, GPU)?
It could be that you are segfaulting because you are not handling boundary conditions and are accessing a buffer out of bounds.
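The kernel code wasn't posted, but a minimal sketch of the kind of bounds guard meant here (kernel name and arguments are hypothetical) looks like this:

__kernel void ConvH_guarded(__global const float* in,
                            __global float* out,
                            const int width, const int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width || y >= height)
        return;                             // padded work-items do nothing
    out[y * width + x] = in[y * width + x]; // index is now guaranteed in bounds
}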

OpenCL corrupt input WIN32, valid on OSX Lion

I am having an issue with my OpenCL kernel: the input arguments are corrupt when they are passed to the kernel. What makes this strange is that the exact same kernel executes flawlessly on Mac OS X. Once I started porting my code over to Windows (Windows 8, 64-bit) I started having this issue.
I have provided an example using my camera struct. The x, y, z coordinates are defined as <0, 0, 200>. However, when they make it to my kernel they show up as <0, -0.00132704, -0.00132704>.
I have a kernel that accepts two structs.
typedef struct {
    cl_float d;
    cl_float3 eye;
    cl_float3 lookat;
    cl_float3 u;
    cl_float3 v;
    cl_float3 w;
    cl_float3 up;
} rt_cl_camera;

typedef struct {
    float r;
    float g;
    float b;
} rt_cl_rgb;
I have slimmed down my kernel for the sake of testing. After tracking down the issue I noticed that my input parameters were not coming over correctly. However, I have determined that my output is being passed back correctly.
__kernel void ray_trace_scene( __global rt_cl_rgb* output,
                               __global rt_cl_camera* camera,
                               const unsigned int pcount)
{
    int pixel = get_global_id(0);
    if(pixel < pcount){
        output[pixel].r = camera->eye.x;
        output[pixel].g = camera->eye.y;
        output[pixel].b = camera->eye.z;
    } // End pixel computation
} // End kernel
I am creating my input buffer with the following:
cl_mem cam_input;
cl_uint cam_error;
cam_input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(rt_cl_camera), NULL, &cam_error);
I am also checking to make sure my buffer was created successfully:
if (cam_error != CL_SUCCESS || !cam_input) {
    throw std::runtime_error(CLERROR_FAILED_DEVBUFF);
}
I then write my data into my buffer with the following.
cl_int err = 0;
err = clEnqueueWriteBuffer(commands, cam_input, CL_TRUE, 0, sizeof(rt_cl_camera), cam_ptr, 0, NULL, NULL);
if (err != CL_SUCCESS) {
    throw std::runtime_error("Failed to write camera");
}
and finally binding my argument to the appropriate kernel argument slot. Please note that slot zero is being used for my output.
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &cam_input);
and checking that everything was successful:
if (err != CL_SUCCESS) {
    throw std::runtime_error(CLERROR_FAILED_CMDARGS);
}
I am not receiving any error messages from openCL at any step of the process. Has anyone run into this? Any help is greatly appreciated.
Side note: at each step of the way I am printing out my local variables to make sure they are correct and valid before I pass them over to the GPU.
This looks like an alignment/packing issue. Try using float4 instead of float3 in the struct, and move float d to the end.
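A sketch of the suggested layout, assuming the same field names (cl_float3 has the same 16-byte size and alignment as cl_float4, so mixing it with a leading cl_float invites host/device padding mismatches):

typedef struct {
    cl_float4 eye;    // was cl_float3; a float4 slot makes the padding explicit
    cl_float4 lookat;
    cl_float4 u;
    cl_float4 v;
    cl_float4 w;
    cl_float4 up;
    cl_float  d;      // scalar moved to the end so it cannot shift the vectors
} rt_cl_camera;

The struct declared inside the kernel source must use float4 with the fields in the same order.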

clBuildProgram yields AccessViolationException when building this specific kernel

This is a part of some sort of parallel reduction/extremum kernel. I have reduced it to the minimum code that still gets clBuildProgram crashing (note that it really crashes, and doesn't just return an error code):
EDIT: It seems like this also happens when local_value is declared global instead of local.
EDIT2 / SOLUTION: The problem was that there was an infinite loop. I should have written remaining_items >>= 1 instead of remaining_items >> 1. As has been said in the answers, the nvidia compiler seems not very robust when it comes to compile/optimization errors.
kernel void testkernel(local float *local_value)
{
    size_t thread_id = get_local_id(0);
    int remaining_items = 1024;
    while (remaining_items > 1)
    {
        // throw away the right half of the threads
        remaining_items >> 1; // <-- SPOTTED THE BUG
        if (thread_id > remaining_items)
        {
            return;
        }
        // look for a greater value in the right half of the memory space
        int right_index = thread_id + remaining_items;
        float right_value = local_value[right_index];
        if (right_value > local_value[thread_id])
        {
            local_value[thread_id] = right_value;
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
Removing the lines return; and/or local_value[thread_id] = right_value; causes clBuildProgram to finish successfully.
I can reproduce this problem on all of my computers (NVIDIA GTX 560, GT 555M, GT 540M, they're all Fermi 2.1 architecture). It's apparent on the NVIDIA CUDA Toolkit SDK versions 4.0, 4.1 and 4.2, when using either x64 or x86 libraries.
Does anyone have an idea what could be the problem?
Is it possible that local (aka shared) memory is automatically assumed to be WORK_GROUP_SIZE * sizeof(its_base_type)? That would explain why it works when the lines I mentioned above are removed.
Minimal host code (C99 compatible) for reproduction:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

#define RETURN_THROW(expression) do { cl_int ret = expression; if (ret) { printf(#expression " FAILED: %d\n", ret); exit(1); } } while (0)
#define REF_THROW(expression) do { cl_int ret; expression; if (ret) { printf(#expression " FAILED: %d\n", ret); exit(1); } } while (0)

int main(int argc, char **argv)
{
    // Load the kernel source code into the array source_str
    FILE *fp = fopen("testkernel.cl", "rb");
    if (!fp)
    {
        fprintf(stderr, "Failed to load kernel.\n");
        exit(1);
    }
    fseek(fp, 0, SEEK_END);
    int filesize = ftell(fp);
    rewind(fp);
    char *source_str = (char*)calloc(filesize + 1, sizeof(char)); // +1 for the terminator
    size_t bytes_read = fread(source_str, 1, filesize, fp);
    source_str[bytes_read] = 0;
    fclose(fp);

    // Get platform information
    cl_uint num_platforms;
    RETURN_THROW(clGetPlatformIDs(0, NULL, &num_platforms));
    cl_platform_id *platform_ids = (cl_platform_id *)calloc(num_platforms, sizeof(cl_platform_id));
    RETURN_THROW(clGetPlatformIDs(num_platforms, platform_ids, NULL));

    cl_device_id selected_device_id = NULL;
    printf("available platforms:\n");
    for (cl_uint i = 0; i < num_platforms; i++)
    {
        char platform_name[50];
        RETURN_THROW(clGetPlatformInfo(platform_ids[i], CL_PLATFORM_NAME, 50, platform_name, NULL));
        printf("%s\n", platform_name);

        // get devices for this platform
        cl_uint num_devices;
        RETURN_THROW(clGetDeviceIDs(platform_ids[i], CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices));
        cl_device_id *device_ids = (cl_device_id *)calloc(num_devices, sizeof(cl_device_id));
        RETURN_THROW(clGetDeviceIDs(platform_ids[i], CL_DEVICE_TYPE_GPU, num_devices, device_ids, NULL));

        // select first nvidia device
        if (strstr(platform_name, "NVIDIA")) // ADAPT THIS ACCORDINGLY
        {
            selected_device_id = device_ids[0];
        }
    }
    if (selected_device_id == NULL)
    {
        printf("No NVIDIA device found\n");
        exit(1);
    }

    // Create an OpenCL context
    cl_context context;
    REF_THROW(context = clCreateContext(NULL, 1, &selected_device_id, NULL, NULL, &ret));

    // Create a program from the kernel source
    cl_program program;
    REF_THROW(program = clCreateProgramWithSource(context, 1, (const char **)&source_str, NULL, &ret));

    // Build the program
    cl_int ret = clBuildProgram(program, 1, &selected_device_id, NULL, NULL, NULL);
    if (ret)
    {
        printf("BUILD ERROR\n");
        // build error - get the build log and display it
        size_t build_log_size;
        ret = clGetProgramBuildInfo(program, selected_device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &build_log_size);
        char *build_log = (char *)malloc(build_log_size); // malloc, not new: this is C99
        ret = clGetProgramBuildInfo(program, selected_device_id, CL_PROGRAM_BUILD_LOG, build_log_size, build_log, NULL);
        printf("%s\n", build_log);
        exit(1);
    }
    printf("build finished successfully\n");
    return 0;
}
In my experience the nvidia compiler isn't very robust when it comes to handling build errors, so you probably have a compile error somewhere.
I think your problem is indeed the return, or more to the point its combination with barrier. According to the OpenCL spec about barriers:

All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.
If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier.
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.

So I think your problem is probably that a lot of threads would return before getting to the barrier, making this code invalid. Maybe you should try something like this:
kernel void testkernel(local float *local_value)
{
    size_t thread_id = get_local_id(0);
    int remaining_items = 1024;
    while (remaining_items > 1)
    {
        remaining_items >>= 1; // throw away the right half of the threads
        if (thread_id <= remaining_items)
        {
            // look for a greater value in the right half of the memory space
            int right_index = thread_id + remaining_items;
            float right_value = local_value[right_index];
            if (right_value > local_value[thread_id])
                local_value[thread_id] = right_value;
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
Edit: Furthermore, as noted in the comments, it needs to be remaining_items >>= 1 instead of remaining_items >> 1 in order to avoid producing an infinite loop.

OpenCL: GPU Execution Time is always Zero

I am trying to print the execution time for some functions on the GPU, but the timing always comes out as 0. When I choose CL_DEVICE_TYPE_CPU in the following, it works fine.
errcode = clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_CPU, 1, &device_id, &ret_num_devices);
This works fine and shows a non-zero execution time, but if I choose CL_DEVICE_TYPE_GPU it always shows 0, irrespective of the total number of data points and threads. Please note that in both cases (CL_DEVICE_TYPE_CPU and CL_DEVICE_TYPE_GPU) I am printing the execution time the same way, and my host and kernel code are the same in both cases (that's the point of OpenCL!). Here are some of the code sections:
// OpenCL code to get platform and device ids
errcode = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
errcode = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_devices);

// to create the context
clGPUContext = clCreateContext(NULL, 1, &device_id, NULL, NULL, &errcode);

// Create a command queue
clCommandQue = clCreateCommandQueue(clGPUContext, device_id, CL_QUEUE_PROFILING_ENABLE, &errcode);

// Setup device memory
d_instances = clCreateBuffer(clGPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, mem_size_i, instances->data, &errcode);
d_centroids = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE, mem_size_c, NULL, &errcode);
d_distance = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE, mem_size_d, NULL, &errcode);
// d_dist_X = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE, mem_size4, NULL, &errcode);
// d_dist_Y = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE, mem_size4, NULL, &errcode);

// to build the program
clProgram = clCreateProgramWithSource(clGPUContext, 1, (const char **)&source_str, (const size_t*)&source_size, &errcode);
errcode = clBuildProgram(clProgram, 0, NULL, NULL, NULL, NULL);
if (errcode == CL_BUILD_PROGRAM_FAILURE)
{
    // Determine the size of the log
    size_t log_size;
    clGetProgramBuildInfo(clProgram, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    // Allocate memory for the log
    char *log = (char *) malloc(log_size);
    // Get the log
    clGetProgramBuildInfo(clProgram, device_id, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    // Print the log
    printf("%s\n", log);
}
clKernel = clCreateKernel(clProgram, "distance_finding", &errcode);
// Launch the OpenCL kernel
size_t localWorkSize[1], globalWorkSize[1];
if (num_instances >= 500)
{
    localWorkSize[0] = 500;
    float block1 = num_instances/localWorkSize[0];
    int block = (int)(ceil(block1));
    globalWorkSize[0] = block*localWorkSize[0];
}
else
{
    localWorkSize[0] = num_instances;
    globalWorkSize[0] = num_instances;
}

int iteration = 1;
while (iteration < MAX_ITERATIONS)
{
    errcode = clEnqueueWriteBuffer(clCommandQue, d_centroids, CL_TRUE, 0, mem_size_c, (void*)centroids->data, 0, NULL, NULL);
    errcode = clEnqueueWriteBuffer(clCommandQue, d_distance, CL_TRUE, 0, mem_size_d, (void*)distance->data, 0, NULL, NULL);

    // set kernel arguments
    errcode = clSetKernelArg(clKernel, 0, sizeof(cl_mem), (void *)&d_instances);
    errcode = clSetKernelArg(clKernel, 1, sizeof(cl_mem), (void *)&d_centroids);
    errcode = clSetKernelArg(clKernel, 2, sizeof(cl_mem), (void *)&d_distance);
    errcode = clSetKernelArg(clKernel, 3, sizeof(unsigned int), (void *)&num_instances);
    errcode = clSetKernelArg(clKernel, 4, sizeof(unsigned int), (void *)&clusters);
    errcode = clSetKernelArg(clKernel, 5, sizeof(unsigned int), (void *)&dimensions);

    errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, &myEvent);
    clFinish(clCommandQue); // wait for all events to finish

    clGetEventProfilingInfo(myEvent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &startTime, NULL);
    clGetEventProfilingInfo(myEvent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &endTime, NULL);
    kernelExecTimeNs = endTime - startTime;
    gpu_time += kernelExecTimeNs;

    // Retrieve the result from the device
    errcode = clEnqueueReadBuffer(clCommandQue, d_distance, CL_TRUE, 0, mem_size_d, distance->data, 0, NULL, NULL);
Printing the time in ms:
printf("\n\n Time taken by GPU is %llu ms",gpu_time/1000000);
If the way I am calculating the GPU timing is wrong, why would it work on a CPU (by changing to CL_DEVICE_TYPE_CPU)? What is wrong here?
Edited:
System Information
AMD APP SDK 2.4
AMD ATI FirePro GL 3D, having 800 cores
Kernel
#pragma OPENCL EXTENSION cl_khr_fp64:enable

double distance_cal(__local float* cent, float* data, int dimensions)
{
    float dist1 = 0.00;
    for (int i = 0; i < dimensions; i++)
        dist1 += ((data[i]-cent[i]) * (data[i]-cent[i]));
    double sq_dist = sqrt(dist1);
    return sq_dist;
}

void fetch_col(float* data, __constant float* x, int col, int dimension, int len)
{
    for (int i = 0; i < dimension; i++)
    {
        data[i] = x[col];
        col = col + len;
    }
}

void fetch_col_cen(__local float* data, __global float* x, int col, int dimension, int len)
{
    for (int i = 0; i < dimension; i++)
    {
        data[i] = x[col];
        col = col + len;
    }
}

__kernel void distance_finding(__constant float* data, __global float* cen, __global float* dist, int inst, int clus, const int dimensions)
{
    int idx = get_global_id(0);
    float data_col[4];
    fetch_col(data_col, data, idx, dimensions, inst);
    for (int i = 0; i < clus; i++)
    {
        int k = i*inst; // take each dimension value for each cluster data
        __local float cent[4];
        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
        fetch_col_cen(cent, cen, i, dimensions, clus);
        dist[idx+k] = distance_cal(cent, data_col, dimensions); // calculate distance wrt each data point and each centroid
    }
}
clEnqueueNDRangeKernel() is asynchronous when it targets a GPU, so you only see the time it took to enqueue the request, not the time it took to execute it.
That said, I could be wrong, but I usually write C++ code to do the timing, putting start_time before the instruction and end_time after clFinish(cmd_queue). If you're sure your GPU shouldn't be finishing in 0 seconds, timing it the same way you do with your C++ timing code would be a good test.
An easy way to check would be to introduce an abnormally long operation inside the kernel. If THAT shows up as zero when there is a perceptible lag in actual execution, then you have your answer.
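For instance, a hypothetical test kernel along these lines (name and argument are made up) should take long enough to register on any correct timer:

__kernel void busy_wait(__global float* out)
{
    float acc = 0.0f;
    // arbitrary heavy work; writing the result out keeps the compiler from dropping the loop
    for (int i = 0; i < 1000000; ++i)
        acc += sin((float)i);
    out[get_global_id(0)] = acc;
}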
That said, I believe (even though the indicated thread is for Linux, it probably holds water on Windows too) you might need to install the instrumented drivers to even have the system write to the performance counters. You can also use the CUDA profiler on nVidia's OpenCL implementation because it sits on top of CUDA.
Change to:
clFinish(clCommandQue); // wait for all events to finish
// add this after clFinish():
// ensure kernel execution is finished
clWaitForEvents(1, &myEvent);
..
double gpu_time = endTime - startTime;
..
printf("\n\n Time taken by GPU is %0.3f ms", gpu_time/1000000.0);
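As a further cross-check, here is a wall-clock measurement around the enqueue and clFinish(), sketched with the question's variables (clock_gettime is POSIX; substitute QueryPerformanceCounter on Windows):

#include <time.h>

struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel, 1, NULL,
                                 globalWorkSize, localWorkSize, 0, NULL, &myEvent);
clFinish(clCommandQue); // blocks until the kernel has really executed
clock_gettime(CLOCK_MONOTONIC, &t1);
double wall_ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
printf("wall-clock kernel time: %0.3f ms\n", wall_ms);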
