enqueueWriteImage fails on GPU - OpenCL

I am developing some kernels that work with image buffers. The problem is that when I create my Image2D by directly copying the data of the image, everything works well.
If I try to enqueue a write to my image buffer instead, it doesn't work on my GPU.
Here is a basic kernel:
__kernel void myKernel(__read_only image2d_t in, __write_only image2d_t out) {
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    uint4 pixel = read_imageui(in, sampler, (int2)(x, y));
    write_imageui(out, (int2)(x, y), pixel);
}
Well, that simple kernel gives me a black image on my GPU, but works well on my CPU.
To make it work, I have to release the image buffer and create a new one, passing the data directly with CL_MEM_COPY_HOST_PTR.
I use the correct data format (CL_RGBA, CL_UNSIGNED_INT8) and the size of my image is correct.
The problem occurs with both JOCL and the C++ binding of the API (I didn't test the C API).
Finally, it runs if I recreate the buffer, but is that a good idea? Is it normal? What can I do to avoid it?
By the way, I'm running on the Intel SDK for OpenCL (Intel Core i7) and the ATI AMD APP SDK (HD6800).
[edit]
Here is the code I use to write to my buffers.
First, the allocation part:
cl_image_format imageFormat = new cl_image_format();
imageFormat.image_channel_order = CL_RGBA;
imageFormat.image_channel_data_type = CL_UNSIGNED_INT8;
inputImageMem = clCreateImage2D(
    context, CL_MEM_READ_ONLY,
    new cl_image_format[]{imageFormat}, imageSizeX, imageSizeY,
    0, null, null);
And here, called for each frame, the part that doesn't work on the GPU:
clEnqueueWriteImage(commandQueue, inputImageMem, CL_TRUE, new long[]{0, 0, 0},
    new long[]{imageSizeX, imageSizeY, 1}, 0, 0,
    Pointer.to(data), 0, null, null);
The part that works on both GPU and CPU but forces me to recreate the buffer:
clReleaseMemObject(inputImageMem);
cl_image_format imageFormat = new cl_image_format();
imageFormat.image_channel_order = CL_RGBA;
imageFormat.image_channel_data_type = CL_UNSIGNED_INT8;
inputImageMem = clCreateImage2D(
    context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    new cl_image_format[]{imageFormat}, imageSizeX, imageSizeY,
    0, Pointer.to(data), null);
The data sent is an array of int of size imageSizeX*imageSizeY. I get it with this code:
DataBufferInt dataBuffer = (DataBufferInt)image.getRaster().getDataBuffer();
int data[] = dataBuffer.getData();
The code above is Java using JOCL; the same problem appears in another C++ program using the C++ OpenCL wrapper. The only differences are that in Java the virtual machine crashes (after 3~4 frames) and in C++ the result is a black image.

Well, I found the problem: it was my drivers acting weird.
I was using version 12.4 (the version I installed when I began working with OpenCL); I just installed version 12.6 and the problem disappeared.
So, keep your drivers up to date!

Related

CL_OUT_OF_RESOURCES error is returned by clEnqueueNDRangeKernel() with dynamic parallelism

Kernel code that produces the error:
__kernel void testDynamic(__global int *data)
{
    int id = get_global_id(0);
    atomic_add(&data[1], 2);
}
__kernel void test(__global int *data)
{
    int id = get_global_id(0);
    atomic_add(&data[0], 2);
    if (id == 0) {
        queue_t q = get_default_queue();
        ndrange_t ndrange = ndrange_1D(1, 1);
        void (^my_block_A)(void) = ^{ testDynamic(data); };
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange,
                       my_block_A);
    }
}
I tested the code below to be sure the OpenCL 2.0 compiler is working.
__kernel void test2(__global int *data)
{
    int id = get_global_id(0);
    data[id] = work_group_scan_inclusive_add(id);
}
The scan function gives 0, 1, 3, 6 as outputs, so the OpenCL 2.0 work-group scan/reduction functions are working.
Is dynamic parallelism an extension to OpenCL 2.0? If I remove the enqueue_kernel call, the results equal the expected values (omitting the child kernel).
Device: AMD RX 550, driver: 17.6.2
Is there a special command that needs to be run on the host side to run a child kernel on the get_default_queue queue? For now, the command queue is created the OpenCL 1.2 way, as below:
commandQueue = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
Does get_default_queue() have to be the same command queue that launches the parent kernel? I ask because I'm using the same command queue to upload data to the GPU and then download the results, in a single synchronization.
Moved solution from question to answer:
Edit: the API call below was the solution:
commandQueue = cl::CommandQueue(context, device,
    CL_QUEUE_ON_DEVICE |
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
    CL_QUEUE_ON_DEVICE_DEFAULT, &err);
After creating this queue (only one per device), I didn't use it for anything else, and the parent kernel is enqueued on an ordinary host queue, so it looks like get_default_queue() doesn't have to be the queue that launches the parent kernel.
The documentation says CL_INVALID_QUEUE_PROPERTIES will be returned if CL_QUEUE_ON_DEVICE is specified, but on my machine dynamic parallelism works with it and doesn't return that error (with the commandQueue constructor parameters above).
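For reference, here is a minimal sketch of the same setup through the plain C API, assuming OpenCL 2.0 headers; device-side queues are created with clCreateCommandQueueWithProperties (the OpenCL 1.2 clCreateCommandQueue cannot express these properties, which may be what the CL_INVALID_QUEUE_PROPERTIES note in the documentation refers to):
// Sketch: create the device-side default queue that get_default_queue()
// resolves to inside kernels (one per device).
cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES,
    (cl_queue_properties)(CL_QUEUE_ON_DEVICE |
                          CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                          CL_QUEUE_ON_DEVICE_DEFAULT),
    0
};
cl_int err;
cl_command_queue deviceQueue =
    clCreateCommandQueueWithProperties(context, device, props, &err);
// The parent kernel can still be enqueued on an ordinary host queue;
// this queue only needs to exist for get_default_queue() to work.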

Reading an external kernel in OpenCL

I use the following lines of code to first determine the size of the .cl file I am reading from (loading it into a buffer), and subsequently to build my program and kernel from the buffer. Assume calculate.cl contains a simple vector addition kernel.
//get size of kernel source
FILE *f = fopen("calculate.cl", "r");
fseek(f, 0, SEEK_END);
size_t programSize = ftell(f);
rewind(f);
//load kernel into buffer
char *programBuffer = (char*)malloc(programSize + 1);
programBuffer[programSize] = '\0';
fread(programBuffer, sizeof(char), programSize, f);
fclose(f);
//create program from buffer
cl_program program = clCreateProgramWithSource(context, 1, (const char**) &programBuffer, &programSize, &status);
//build program for devices
status = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
//create the kernel
cl_kernel calculate = clCreateKernel(program, "calculate", &status);
However, when I run my program, the output produced is zero instead of the intended vector addition results. I've verified that the problem is not with the kernel itself (I used a different method to load the external kernel, which worked and gave me the intended results), but I am still curious why this initial method did not work.
Any help?
The problem's been solved.
Following bl0z0's suggestion and looking up the error, I found the solution here:
OpenCL: Expected identifier in kernel
Thanks everyone :D I really appreciate it!
I believe this gives the program size as a number of chars:
size_t programSize = ftell(f);
and here you need to allocate in terms of bytes:
char *programBuffer = (char*)malloc(programSize + 1);
so I think the previous line should be:
char *programBuffer = (char*)malloc(programSize * sizeof(char) + 1);
Double check this by just printing the programBuffer.
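For completeness, a hedged sketch of what the linked answer boils down to, as I understand it: opening the file in text mode can make ftell report more bytes than fread actually delivers (newline translation on Windows), so the tail of the buffer contains garbage that the OpenCL compiler rejects. Opening in binary mode and using fread's return value avoids this:
// Sketch: read the kernel source in binary mode and trust fread's count
FILE *f = fopen("calculate.cl", "rb"); // "rb", not "r"
fseek(f, 0, SEEK_END);
long fileSize = ftell(f);
rewind(f);
char *programBuffer = (char *)malloc(fileSize + 1);
size_t programSize = fread(programBuffer, 1, fileSize, f); // bytes actually read
programBuffer[programSize] = '\0';
fclose(f);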

Memory object allocation in OpenCL for a dynamic array in a structure

I have created the following structure 'data' in C:
typedef struct data
{
    double *dattr;
    int d_id;
    int bestCent;
} Data;
'dattr' is an array in the above structure which is kept dynamic.
Suppose I have to create 10 objects of the above structure, i.e.
dataNode = (Data *)malloc (sizeof(Data) * 10);
and for every object of this structure I have to allocate memory in C for the array 'dattr' using:
for (i = 0; i < 10; i++)
    dataNode[i].dattr = (double *)malloc(sizeof(double) * 3);
What should I do to implement the same in OpenCL? How do I allocate memory for the array 'dattr' once I have allocated memory for the structure objects?
Memory allocation on OpenCL devices (for example, a GPU) must be performed in the host thread using clCreateBuffer (or clCreateImage2D/3D if you wish to use texture memory). These functions allow you to automatically copy host data (created with malloc, for example) to the device, but I usually prefer to explicitly use clEnqueueWriteBuffer/clEnqueueMapBuffer (or clEnqueueWriteImage/clEnqueueMapImage if using texture memory), so that I can profile the data transfers. Here's an example:
#define DATA_SIZE 1000
typedef struct data {
    cl_uint id;
    cl_uint x;
    cl_uint y;
} Data;
...
// Allocate data array on the host
size_t dataSizeInBytes = DATA_SIZE * sizeof(Data);
Data *dataArrayHost = (Data *) malloc(dataSizeInBytes);
// Initialize data
...
// Create data array on the device
cl_mem dataArrayDevice = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSizeInBytes, NULL, &status);
// Copy data array to the device (note: pass the pointer itself, not its address)
status = clEnqueueWriteBuffer(queue, dataArrayDevice, CL_TRUE, 0, dataSizeInBytes, dataArrayHost, 0, NULL, NULL);
// Make sure to pass dataArrayDevice as a kernel parameter
// Run kernel
...
What you need to consider is that the memory requirements of an OpenCL kernel must be known before you execute it. As such, memory allocation can be dynamic as long as it is performed before kernel execution (i.e., on the host). Nothing stops you from calling the kernel several times, adjusting (allocating) the kernel's memory requirements before each call.
Taking this into account, I advise you to rethink the way you're approaching the problem. To begin with, it is simpler (but not necessarily more efficient) to work with arrays of structures than with structures of arrays (in which case the arrays would have to have a fixed size anyway).
This is just to give you an idea of how OpenCL works. Take a look at the Khronos OpenCL resource page, which has plenty of OpenCL tutorials and examples, and the Khronos OpenCL page, which has the official OpenCL references, man pages and quick reference cards.
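To make this concrete, here is a hypothetical sketch (names invented for illustration) of the usual flattening: since a pointer inside a struct is meaningless on the device, keep the struct pointer-free and store every object's dattr values back to back in one contiguous buffer, so object i owns elements [i*ATTR_LEN, (i+1)*ATTR_LEN):
#define NUM_DATA 10
#define ATTR_LEN 3
// Device-friendly struct: no pointers, fixed layout
typedef struct data_flat {
    cl_int d_id;
    cl_int bestCent;
} DataFlat;
// One contiguous host array holds all dattr values
double *dattrAll = (double *)malloc(sizeof(double) * NUM_DATA * ATTR_LEN);
DataFlat *nodes = (DataFlat *)malloc(sizeof(DataFlat) * NUM_DATA);
// Two device buffers, passed as two kernel arguments; in the kernel,
// object i reads its attributes as dattrAll[i * ATTR_LEN + k]
cl_mem dattrBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                 sizeof(double) * NUM_DATA * ATTR_LEN, dattrAll, &status);
cl_mem nodeBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(DataFlat) * NUM_DATA, nodes, &status);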
As suggested by Faken, if you are concerned about dynamic memory allocation and are willing to change the algorithm a little, here is a hint:
The following code dynamically allocates local memory space and passes it as the 8th argument to the OpenCL kernel:
int N; // number of data points, which will keep changing as per your requirements
size_t localMemSize = N * sizeof(int);
...
// Dynamically allocate local memory (allocated per work group)
clSetKernelArg(kernel, 8, localMemSize, NULL);
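On the kernel side, the matching argument is simply declared as a __local pointer whose size is whatever the host passed to clSetKernelArg. A self-contained sketch with the local buffer as argument 1 (in the snippet above it would be argument 8):
// Sketch: the __local parameter receives a localMemSize-byte scratch buffer
__kernel void useScratch(__global int *data, __local int *scratch)
{
    int lid = get_local_id(0);
    scratch[lid] = data[get_global_id(0)]; // stage into local memory
    barrier(CLK_LOCAL_MEM_FENCE);          // make writes visible to the group
    // ... operate on scratch ...
    data[get_global_id(0)] = scratch[lid];
}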

Efficient conversion of AVFrame to QImage

I need to extract frames from a video in my Qt-based application. Using the ffmpeg libraries I am able to fetch frames as AVFrames, which I need to convert to QImage to use in other parts of my application. This conversion needs to be efficient. So far it seems sws_scale() is the right function to use, but I am not sure what source and destination pixel formats should be specified.
I came up with the following 2-step process that first converts a decoded AVFrame to another AVFrame in the RGB colorspace and then to a QImage. It works and is reasonably fast.
src_frame = get_decoded_frame();
AVFrame *pFrameRGB = avcodec_alloc_frame(); // intermediate pframe
if (pFrameRGB == NULL) {
    ; // Handle error
}
int numBytes = avpicture_get_size(PIX_FMT_RGB24,
    is->video_st->codec->width, is->video_st->codec->height);
uint8_t *buffer = (uint8_t *)malloc(numBytes);
avpicture_fill((AVPicture *)pFrameRGB, buffer, PIX_FMT_RGB24,
    is->video_st->codec->width, is->video_st->codec->height);
int dst_fmt = PIX_FMT_RGB24;
int dst_w = is->video_st->codec->width;
int dst_h = is->video_st->codec->height;
// TODO: cache the following conversion context for speedup,
// and recalculate only on dimension changes
SwsContext *img_convert_ctx_temp;
img_convert_ctx_temp = sws_getContext(
    is->video_st->codec->width, is->video_st->codec->height,
    is->video_st->codec->pix_fmt,
    dst_w, dst_h, (PixelFormat)dst_fmt,
    SWS_BICUBIC, NULL, NULL, NULL);
QImage *myImage = new QImage(dst_w, dst_h, QImage::Format_RGB32);
sws_scale(img_convert_ctx_temp,
    src_frame->data, src_frame->linesize, 0, is->video_st->codec->height,
    pFrameRGB->data,
    pFrameRGB->linesize);
uint8_t *src = (uint8_t *)(pFrameRGB->data[0]);
for (int y = 0; y < dst_h; y++)
{
    QRgb *scanLine = (QRgb *)myImage->scanLine(y);
    for (int x = 0; x < dst_w; x++)
    {
        scanLine[x] = qRgb(src[3*x], src[3*x+1], src[3*x+2]);
    }
    src += pFrameRGB->linesize[0];
}
If you find a more efficient approach, let me know in the comments.
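Regarding the TODO in the snippet above: libswscale already provides sws_getCachedContext(), which returns the existing context unchanged when the parameters match and recreates it otherwise, so the caching could plausibly be a one-line change:
// Sketch: reuse the conversion context across frames; it is rebuilt
// only when dimensions or pixel formats actually change
img_convert_ctx_temp = sws_getCachedContext(
    img_convert_ctx_temp, // NULL on the first call
    is->video_st->codec->width, is->video_st->codec->height,
    is->video_st->codec->pix_fmt,
    dst_w, dst_h, (PixelFormat)dst_fmt,
    SWS_BICUBIC, NULL, NULL, NULL);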
I know it's too late, but maybe someone will find this useful. From here I got the clue for doing the same conversion, which looks a bit shorter.
So I created a QImage which is reused for every decoded frame:
QImage img( width, height, QImage::Format_RGB888 );
Created frameRGB:
frameRGB = av_frame_alloc();
//Allocate memory for the pixels of a picture and setup the AVPicture fields for it.
avpicture_alloc( ( AVPicture *) frameRGB, AV_PIX_FMT_RGB24, width, height);
After the first frame is decoded, I create the SwsContext conversion context this way (it will be used for all subsequent frames):
mImgConvertCtx = sws_getContext( codecContext->width, codecContext->height, codecContext->pix_fmt, width, height, AV_PIX_FMT_RGB24, SWS_BICUBIC, NULL, NULL, NULL);
And finally, for every decoded frame, the conversion is performed:
if (1 == framesFinished && nullptr != imgConvertCtx)
{
    // convert frame to frameRGB
    sws_scale(imgConvertCtx, frame->data, frame->linesize, 0, codecContext->height, frameRGB->data, frameRGB->linesize);
    // set QImage from frameRGB
    for (int y = 0; y < height; ++y)
        memcpy(img.scanLine(y), frameRGB->data[0] + y * frameRGB->linesize[0], mWidth * 3);
}
See the link for the specifics.
A simpler approach, I think:
void takeSnapshot(AVCodecContext *dec_ctx, AVFrame *frame)
{
    SwsContext *img_convert_ctx;
    img_convert_ctx = sws_getContext(dec_ctx->width,
                                     dec_ctx->height,
                                     dec_ctx->pix_fmt,
                                     dec_ctx->width,
                                     dec_ctx->height,
                                     AV_PIX_FMT_RGB24,
                                     SWS_BICUBIC, NULL, NULL, NULL);
    AVFrame *frameRGB = av_frame_alloc();
    avpicture_alloc((AVPicture *)frameRGB,
                    AV_PIX_FMT_RGB24,
                    dec_ctx->width,
                    dec_ctx->height);
    sws_scale(img_convert_ctx,
              frame->data,
              frame->linesize, 0,
              dec_ctx->height,
              frameRGB->data,
              frameRGB->linesize);
    QImage image(frameRGB->data[0],
                 dec_ctx->width,
                 dec_ctx->height,
                 frameRGB->linesize[0],
                 QImage::Format_RGB888);
    image.save("capture.png");
}
Today I tested passing image->bits() directly to swscale and it finally works, so there is no need to copy to memory. For example:
/* 1. Get frame and QImage to show */
struct my_frame *frame = get_frame(source);
QImage *myImage = new QImage(dst_w, dst_h, QImage::Format_RGBA8888);
/* 2. Convert and write into image buffer */
uint8_t *dst[] = { myImage->bits() };
int linesizes[4];
av_image_fill_linesizes(linesizes, AV_PIX_FMT_RGBA, frame->width);
sws_scale(myswscontext, frame->data, (const int *)frame->linesize,
          0, frame->height, dst, linesizes);
I just discovered that scanLine is just seeking through the buffer. All you need is to use AV_PIX_FMT_RGB32 for the AVFrame and QImage::Format_RGB32 for the QImage.
Then, after decoding, just do a memcpy:
memcpy(img.scanLine(0), pFrameRGB->data[0], pFrameRGB->linesize[0] * pFrameRGB->height);
I had problems with the other proposed solutions because:
They did not mention freeing the AVFrame, the SwsContext, or the allocated buffers, which caused massive memory leaks (I had thousands of frames to handle). These problems couldn't all be solved easily, as QImage relies on the underlying data and does not copy it. If you free the buffer directly, the QImage points to freed data and breaks. This could be solved by using QImage's cleanupFunction to free the buffer once the image is no longer needed, but with the other problems it wasn't a good option anyway.
In some cases, the suggestion of passing QImage.bits directly to sws_scale would not work, as QImage scan lines are minimum 32-bit aligned. Therefore, for certain dimensions the stride would not match what sws_scale expects, and the output would have each line shifted a little.
A third problem is that they used deprecated AVPicture functions.
I listed the problems in another question, Converting an AVFrame to QImage with conversion of pixel format, and in the end found a solution using a temporary buffer, which could be copied to the QImage and then safely freed.
So see my answer for a fully working and efficient implementation with no deprecated function calls: https://stackoverflow.com/a/68212609/7360943
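For readers who don't want to follow the link, a rough sketch of the temporary-buffer idea described above (the linked answer differs in detail; av_image_alloc/av_freep stand in for the deprecated AVPicture calls):
extern "C" {
#include <libswscale/swscale.h>
#include <libavutil/imgutils.h>
}
#include <QImage>
#include <cstring>

// Sketch: scale into a temporary buffer, copy into the QImage, free the buffer
QImage frameToImage(SwsContext *ctx, const AVFrame *frame, int w, int h)
{
    uint8_t *dstData[4];
    int dstLinesize[4];
    av_image_alloc(dstData, dstLinesize, w, h, AV_PIX_FMT_RGB24, 1);
    sws_scale(ctx, frame->data, frame->linesize, 0, frame->height,
              dstData, dstLinesize);
    // QImage rows are 32-bit aligned, so copy row by row in case strides differ
    QImage img(w, h, QImage::Format_RGB888);
    for (int y = 0; y < h; ++y)
        memcpy(img.scanLine(y), dstData[0] + y * dstLinesize[0], w * 3);
    av_freep(&dstData[0]); // safe: the QImage owns its own pixel copy
    return img;
}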

OpenCL random kernel behaviour when a certain system size is exceeded

I am having a problem like this. Basically, I have a 2D grid allocated on the host:
double* grid = (double*)malloc(sizeof(double)*(ny*nx) * 9);
Following the normal OpenCL procedure to put it on the OpenCL device:
cl_mem cl_grid = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, sizeof(double) * (ny*nx) * 9, grid, &error);
Then enqueue and launch:
clEnqueueNDRangeKernel(queue, foo, 1, NULL, &global_ws, &local_ws, 0, NULL, NULL);
In the kernel function, simple arithmetic is performed on the first column of the grid:
__kernel void foo(__constant ocl_param *params, __global double *grid)
{
    const int ii = get_global_id(0);
    int jj;
    jj = 0;
    if (ii < params->ny) {
        grid[getIndexUsingMacro(ii, jj)] += params->someNumber;
    }
}
And finally, read back the buffer and check the values:
clEnqueueReadBuffer(queue, cl_grid, CL_TRUE, 0, sizeof(double) * 9 * nx * ny, checkGrid, 0, NULL, NULL);
The problem occurs when the grid size (i.e. nx * ny * 9 elements) exceeds 16384 * 9 * 8 bytes = 1152 KB (* 8 since double precision is used).
If using OpenCL on the CPU, a CL_OUT_OF_RESOURCES error is thrown when launching the kernel no matter what I set for global_ws and local_ws (I set them to 1 and the error is still thrown). The CPU is an Intel i5-2415M with 8 GB of RAM and 3 MB of cache.
If using OpenCL on the GPU (an NVIDIA Tesla M2050), no error is thrown. However, when reading back the values from the buffer, the grid is not changed at all: the returned grid has values exactly the same as before it was sent to the kernel function.
For example, when I set nx = 30 and ny = 546 (nx*ny = 16380), everything runs fine and the grid comes back with the values changed as expected. But when ny = 547 (nx*ny = 16410), the problem occurs on both CPU and GPU as described above. The problem is the same if I swap nx and ny; with nx = 547 and ny = 30, it still happens. Can you suggest what might be the problem here?
Many thanks
It looks like a synchronization issue. grid[index] += value with the same index value may be executed concurrently by several work items. This operation is not atomic, so these work items will each load grid[index], add their value, and store it back, possibly losing some increments in the process.
To solve this, you can synchronize the work items using a barrier if they are in a single work group, or by enqueuing more kernels otherwise.
Another possibility is to ensure that only one work item is able to modify a given element of the grid (usually the best solution).
If several work items need to work on a common subset of the grid, using local memory and local memory barriers may be useful, as in the sketch below.
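For the last point, a minimal sketch of the pattern (placeholder names, not the asker's kernel): each work group reduces its slice in local memory, and only work item 0 of each group writes to global memory, so no element is ever written by two work items:
__kernel void sumPerGroup(__global const double *grid,
                          __global double *partialSums,
                          __local double *scratch)
{
    int lid = get_local_id(0);
    int lsz = get_local_size(0);
    scratch[lid] = grid[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    // tree reduction inside the work group (assumes lsz is a power of two)
    for (int offset = lsz / 2; offset > 0; offset /= 2) {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // a single work item per group writes the result: no data race
    if (lid == 0)
        partialSums[get_group_id(0)] = scratch[0];
}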
