Can I have a boolean buffer in OpenCL and change its value during kernel execution, for example to break a while loop?

I want to run some experiments in OpenCL, and I would like to know whether it is possible to change state during kernel execution from the host code, using a buffer.
I attempted to alter the state of a while loop in the kernel code by modifying the buffer value from the host code, but the execution hangs.
void my_kernel(
    __global bool *in,
    __global int *out)
{
    int i = get_global_id(0);
    while(1) {
        if(1 == *in) {
            printf("while loop is finished");
            break;
        }
    }
    printf("out[0] = %d\n", out[0]);
}
I then call clEnqueueWriteBuffer() a second time to change the state of the input value.
input[0] = 1;
err = clEnqueueWriteBuffer(commands, input_buffer,
                           CL_TRUE, 0, sizeof(int), (void*)input,
                           0, NULL, NULL);

At least for OpenCL 1.x, this is not permitted, and any behaviour you may observe in one implementation cannot be relied upon.
See the NOTE in the OpenCL 1.2 specification, section 5.2.2, Reading, Writing and Copying Buffer Objects:
Calling clEnqueueWriteBuffer to update the latest bits in a region of the buffer object with the ptr argument value set to host_ptr + offset, where host_ptr is a pointer to the memory region specified when the buffer object being written is created with CL_MEM_USE_HOST_PTR, must meet the following requirements in order to avoid undefined behavior:
The host memory region given by (host_ptr + offset, cb) contains the latest bits when the enqueued write command begins execution.
The buffer object or memory objects created from this buffer object are not mapped.
The buffer object or memory objects created from this buffer object are not used by any command-queue until the write command has finished execution.
The final condition is not met by your code, and therefore its behaviour is undefined.
I am not certain whether the situation is different with OpenCL 2.x's Shared Virtual Memory (SVM) feature, as I have no practical experience with it; perhaps someone else can contribute an answer for that.
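That said, here is a minimal, untested sketch of how such host-to-kernel signalling might look with OpenCL 2.0 fine-grained SVM atomics. The flag name and kernel signature are hypothetical; this assumes the device reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER and CL_DEVICE_SVM_ATOMICS support, and that the kernel is built with -cl-std=CL2.0:

// Kernel side (OpenCL C 2.0): poll an atomic flag the host can update.
__kernel void wait_for_flag(__global atomic_int *flag, __global int *out)
{
    while (atomic_load_explicit(flag, memory_order_acquire,
                                memory_scope_all_svm_devices) == 0) {
        // spin until the host stores a non-zero value
    }
    out[get_global_id(0)] = 1;
}

// Host side (C11): the SVM pointer is visible to both host and device.
#include <stdatomic.h>
atomic_int *flag = (atomic_int *) clSVMAlloc(context,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS,
        sizeof(atomic_int), 0);
atomic_store_explicit(flag, 0, memory_order_release);
clSetKernelArgSVMPointer(kernel, 0, flag);
// ... enqueue the kernel, then later, signal it to stop:
atomic_store_explicit(flag, 1, memory_order_release);
// ... clFinish(queue); clSVMFree(context, flag);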

Related

Understanding the method for OpenCL reduction on float

Following this link, I am trying to understand how the kernel code works (there are two versions of this kernel, one with volatile local float *source and the other with volatile global float *source, i.e. local and global versions). Below I take the local version:
float sum = 0;

void atomic_add_local(volatile local float *source, const float operand) {
    union {
        unsigned int intVal;
        float floatVal;
    } newVal;
    union {
        unsigned int intVal;
        float floatVal;
    } prevVal;
    do {
        prevVal.floatVal = *source;
        newVal.floatVal = prevVal.floatVal + operand;
    } while (atomic_cmpxchg((volatile local unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
}
If I understand correctly, each work-item shares access to the source variable thanks to the "volatile" qualifier, doesn't it?
Afterwards, if I take a work-item, the code will add the operand value to the newVal.floatVal variable. Then, after this operation, atomic_cmpxchg is called, which checks whether the previous assignments (prevVal.floatVal = *source; and newVal.floatVal = prevVal.floatVal + operand;) still hold, i.e. by comparing the value stored at address source with prevVal.intVal.
During this atomic operation (which, by definition, cannot be interrupted), if the value stored at source is still equal to prevVal.intVal, the new value stored at source is newVal.intVal, which is actually a float (since it occupies 4 bytes, like an unsigned int).
Can we say that each work-item has mutex access (I mean a locked access) to the value located at the source address?
But for each work-item thread, is there only one iteration of the while loop?
I think there will be one iteration, because the comparison "*source == prevVal.intVal ? newVal.intVal : *source" will always assign the newVal.intVal value to the value stored at the source address, won't it?
I have not understood all the subtleties of this trick in this kernel code.
Update
Sorry, I now understand almost all the subtleties, especially in the while loop:
First case: for a given single thread, before the call to atomic_cmpxchg, if prevVal.floatVal is still equal to *source, then atomic_cmpxchg will change the value contained at the source pointer and return the old value, which is equal to prevVal.intVal, so we break out of the while loop.
Second case: if, between the prevVal.floatVal = *source; instruction and the call to atomic_cmpxchg, the value *source has changed (by another thread?), then atomic_cmpxchg returns the old value, which is no longer equal to prevVal.intVal, so the condition in the while loop is true and we stay in this loop until the previous condition no longer holds.
Is my interpretation correct?
If I understand correctly, each work-item shares access to the source variable thanks to the "volatile" qualifier, doesn't it?
volatile is a keyword of the C language that prevents the compiler from optimizing accesses to a specific location in memory (in other words, force a load/store at each read/write of said memory location). It has no impact on the ownership of the underlying storage. Here, it is used to force the compiler to re-read source from memory at each loop iteration (otherwise the compiler would be allowed to move that load outside the loop, which breaks the algorithm).
do {
    prevVal.floatVal = *source; // Force read, prevent hoisting outside loop.
    newVal.floatVal = prevVal.floatVal + operand;
} while (atomic_cmpxchg((volatile local unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
After removing qualifiers (for simplicity) and renaming parameters, the signature of atomic_cmpxchg is the following:
int atomic_cmpxchg(int *ptr, int expected, int new)
What it does is:
atomically {
    int old = *ptr;
    if (old == expected) {
        *ptr = new;
    }
    return old;
}
To summarize, each thread, individually, does:
1. Load the current value of *source from memory into prevVal.floatVal.
2. Compute the desired value of *source in newVal.floatVal.
3. Execute the atomic compare-exchange described above (using the type-punned values).
4. If the result of atomic_cmpxchg == prevVal.intVal, the compare-exchange was successful: break. Otherwise, the exchange didn't happen: go to 1 and try again.
The above loop eventually terminates, because eventually, each thread succeeds in doing their atomic_cmpxchg.
Can we say that each work-item has mutex access (I mean a locked access) to the value located at the source address?
Mutexes are locks, whereas this is a lock-free algorithm. OpenCL can simulate mutexes with spinlocks (also implemented with atomics), but this is not one.
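For contrast, a mutex-style spinlock in OpenCL C might look like the sketch below (hypothetical code, not from the original answer). Note that spinlocks between work-items of the same work-group can deadlock on GPUs because of lockstep execution, which is one reason the lock-free loop above is preferred:

// Spinlock sketch: 'lock' is a single int in global memory, initialized to 0.
void lock_acquire(volatile global int *lock) {
    while (atomic_cmpxchg(lock, 0, 1) != 0) {
        // busy-wait until we swap 0 -> 1
    }
}

void lock_release(volatile global int *lock) {
    atomic_xchg(lock, 0); // release by storing 0
}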

C++ pointer is initialized to null by the compiler

So I've been stuck on a memory problem for days now.
I have a multi-threaded program written in C++, in which I declare a double* pointer.
From what I've read and from previous programming experience, a pointer is initialized to garbage. It will be NULL if you initialize it to 0, or if an allocation requests more memory than the program can obtain. In my case, the pointer declaration, without any allocation, gives me a null pointer.
A parser function I wrote is supposed to return a pointer to the array of parsed information. When I call the function,
double* data;
data = Parser.ReadCoordinates(&storageFilename[0]);
the returned pointer to the array should be stored in data. Then I try to print something out from the array, and I get memory corruption errors. I've run it under gdb, and it gives me a memory corruption error:
*** glibc detected *** /home/user/kinect/openni/Platform/Linux/Bin/x64-Debug/Sample-NiHandTracker: free(): corrupted unsorted chunks: 0x0000000001387f90 ***
*** glibc detected *** /home/user/kinect/openni/Platform/Linux/Bin/x64-Debug/Sample-NiHandTracker: malloc(): memory corruption: 0x0000000001392670 ***
Can someone explain to me what is going on? I've tried initializing the pointer as a global but that doesn't work either. I've tried to allocate memory but I still get a memory corruption error. The parser works. I've tested it out with a simple program. So I don't understand why it won't work in my other program. What am I doing wrong? I can also provide more info if needed.
Parser code
double* csvParser::ReadCoordinates(char* filename){
    int x;      //counter
    int size = 0;
    char* data;
    int i = 0;  //counter
    FILE *fp = fopen(filename, "r");
    if (fp == NULL){
        perror("Error opening file");
    }
    while ((x = fgetc(fp)) != EOF) { //Returns the character currently pointed to by the internal file position indicator
        size++;                      //Number of characters in the csv file
    }
    rewind(fp);                      //Sets the position indicator to the beginning of the file
    printf("size is %d.\n", size);
    data = new char[23];             //Each line is 23 bytes (characters) long
    size = (size/23) * 2;            //number of x, y coordinates
    coord = new double[size];        //allocate memory for an array of coordinates, needs to be freed somewhere
    num_coord = size;                //num_coord is public
    //fgets(data, size, fp);
    //printf("data is %c.\n", *data);
    for(x = 0; x < size; x++){
        fgets(data, size, fp);
        coord[i] = atof(&data[0]);    //convert string to double
        coord[i+1] = atof(&data[11]); //convert string to double
        i = i + 2;
    }
    delete[] data;
    fclose(fp);
    return coord;
}
Memory corruption occurs when you write outside the bounds of an array or vector.
It's called a heap underrun or overrun (depending on which side it's on).
The heap's allocation metadata gets corrupted, so the symptom you see is an exception in free() or malloc()/new calls.
You usually don't get an access violation, because the memory is allocated and it belongs to you, but it's used by the heap's logic.
Find the place where you might be writing outside the bounds of an array.
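In the parser above, the likely culprit is the read loop: size has been redefined as the total number of coordinates (two per line), but the loop runs size times and writes two doubles per iteration, so it stores 2 * size doubles into an array of size elements. In addition, fgets is called with size rather than the 23-byte capacity of data, so it can overflow that buffer too. A hedged sketch of a fix, assuming each line is 23 characters (including the newline) and holds one x,y pair:

int nLines = size / 2;   // two coordinates per line
char line[32];           // room for 23 characters plus '\0'
for (int l = 0; l < nLines; l++) {
    if (fgets(line, sizeof line, fp) == NULL)
        break;                        // stop on short read
    coord[2*l]     = atof(&line[0]);  // x coordinate
    coord[2*l + 1] = atof(&line[11]); // y coordinate
}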

Effect of using page-able memory for asynchronous memory copy?

In CUDA C Best Practices Guide Version 5.0, Section 6.1.2, it is written that:
In contrast with cudaMemcpy(), the asynchronous transfer version
requires pinned host memory (see Pinned Memory), and it contains an
additional argument, a stream ID.
It means that cudaMemcpyAsync should fail if I use plain pageable memory.
But this is not what happened.
Just for testing purpose, I tried the following program:
Kernel:
__global__ void kernel_increment(float* src, float* dst, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        dst[tid] = src[tid] + 1.0f;
}
Main:
int main()
{
    float *hPtr1, *hPtr2, *dPtr1, *dPtr2;
    const int n = 1000;
    size_t bytes = n * sizeof(float);
    cudaStream_t str1, str2;

    hPtr1 = new float[n];
    hPtr2 = new float[n];
    for (int i = 0; i < n; i++)
        hPtr1[i] = static_cast<float>(i);

    cudaMalloc<float>(&dPtr1, bytes);
    cudaMalloc<float>(&dPtr2, bytes);

    dim3 block(16);
    dim3 grid((n + block.x - 1) / block.x);

    cudaStreamCreate(&str1);
    cudaStreamCreate(&str2);

    cudaMemcpyAsync(dPtr1, hPtr1, bytes, cudaMemcpyHostToDevice, str1);
    kernel_increment<<<grid, block, 0, str2>>>(dPtr1, dPtr2, n);
    cudaMemcpyAsync(hPtr2, dPtr2, bytes, cudaMemcpyDeviceToHost, str1);

    printf("Status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    printf("Status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(str1);
    cudaStreamDestroy(str2);
    cudaFree(dPtr1);
    cudaFree(dPtr2);

    for (int i = 0; i < n; i++)
        std::cout << hPtr2[i] << std::endl;

    delete[] hPtr1;
    delete[] hPtr2;
    return 0;
}
The program gave the correct output; the array was incremented successfully.
How did cudaMemcpyAsync execute without page locked memory?
Am I missing something here?
cudaMemcpyAsync is fundamentally an asynchronous version of cudaMemcpy. This means that it doesn't block the calling host thread when the copy call is issued. That is the basic behaviour of the call.
Optionally, if the call is launched into a non-default stream, and if the host memory is a pinned allocation, and the device has a free DMA copy engine, the copy operation can happen while the GPU simultaneously performs another operation: either kernel execution or another copy (in the case of a GPU with two DMA copy engines). If any of these conditions is not satisfied, the operation on the GPU is functionally identical to a standard cudaMemcpy call, i.e. it serialises operations on the GPU, and no simultaneous copy-kernel execution or simultaneous multiple copies can occur. The only difference is that the operation doesn't block the calling host thread.
In your example code, the host source and destination memory are not pinned. So the memory transfer cannot overlap with kernel execution (i.e. they serialise operations on the GPU). The calls are still asynchronous on the host, so what you have is functionally equivalent to:
cudaMemcpy(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice);
kernel_increment<<<grid,block>>>(dPtr1,dPtr2,n);
cudaMemcpy(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost);
with the exception that all the calls are asynchronous on the host, so the host thread blocks at the cudaDeviceSynchronize() call rather than at each of the memory transfer calls.
This is absolutely expected behaviour.
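For completeness, here is a minimal sketch of the pinned-memory variant that could actually overlap the transfers with other GPU work (illustrative, not from the original answer; it also issues the copies and the kernel into the same stream so they are correctly ordered with respect to each other):

float *hPtr1, *hPtr2;
cudaMallocHost((void**)&hPtr1, bytes); // page-locked (pinned) host allocations
cudaMallocHost((void**)&hPtr2, bytes);
// ... initialize hPtr1 as before ...
cudaMemcpyAsync(dPtr1, hPtr1, bytes, cudaMemcpyHostToDevice, str1);
kernel_increment<<<grid, block, 0, str1>>>(dPtr1, dPtr2, n);
cudaMemcpyAsync(hPtr2, dPtr2, bytes, cudaMemcpyDeviceToHost, str1);
cudaStreamSynchronize(str1);           // wait for the whole pipeline
// ...
cudaFreeHost(hPtr1);
cudaFreeHost(hPtr2);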

Memory object allocation in Opencl for dynamic array in structure

I have created the following structure 'data' in C:
typedef struct data
{
    double *dattr;
    int d_id;
    int bestCent;
} Data;
The 'dattr' member of the above structure is a dynamically allocated array.
Suppose I have to create 10 objects of the above structure, i.e.:
dataNode = (Data *)malloc (sizeof(Data) * 10);
and for every object of this structure I have to allocate memory in C for the array 'dattr' separately, using:
for (i = 0; i < 10; i++)
    dataNode[i].dattr = (double *)malloc(sizeof(double) * 3);
What should I do to implement the same in OpenCL? How do I allocate the memory for the array 'dattr' once I have allocated the memory for the structure objects?
Memory allocation on OpenCL devices (for example, a GPU) must be performed in the host thread using clCreateBuffer (or clCreateImage2D/3D if you wish to use texture memory). These functions allow you to automatically copy host data (created with malloc, for example) to the device, but I usually prefer to explicitly use clEnqueueWriteBuffer/clEnqueueMapBuffer (or clEnqueueWriteImage/clEnqueueMapImage if using texture memory), so that I can profile the data transfers. Here's an example:
#define DATA_SIZE 1000

typedef struct data {
    cl_uint id;
    cl_uint x;
    cl_uint y;
} Data;

...

// Allocate data array on host
size_t dataSizeInBytes = DATA_SIZE * sizeof(Data);
Data *dataArrayHost = (Data *) malloc(dataSizeInBytes);

// Initialize data
...

// Create data array on device
cl_mem dataArrayDevice = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSizeInBytes, NULL, &status);

// Copy data array to device
status = clEnqueueWriteBuffer(queue, dataArrayDevice, CL_TRUE, 0, dataSizeInBytes, dataArrayHost, 0, NULL, NULL);

// Make sure to pass dataArrayDevice as a kernel parameter
// Run kernel
...
What you need to consider is that you must know the memory requirements of an OpenCL kernel before you execute it. As such, memory allocation can be dynamic as long as it is performed before kernel execution (i.e. on the host). Nothing stops you from calling the kernel several times, and on each of those occasions adjusting (allocating) the kernel's memory requirements.
Taking this into account, I advise you to rethink the way you're approaching the problem. To begin with, it is simpler (but not necessarily more efficient) to work with arrays of structures than with structures containing arrays (in which case, the arrays would have to have a fixed size anyway); a flattened-layout sketch follows this paragraph.
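For illustration, here is a minimal sketch (hypothetical sizes and names, not from the original question) of flattening every object's dattr into one contiguous buffer, so a single clCreateBuffer call covers all the attributes:

#define NUM_OBJECTS 10
#define ATTR_LEN    3

// One flat host array holds all dattr values back to back:
// object i's attributes live at dattrHost[i * ATTR_LEN + j].
cl_double *dattrHost = (cl_double *) malloc(NUM_OBJECTS * ATTR_LEN * sizeof(cl_double));
// ... fill dattrHost ...

cl_mem dattrDevice = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                    NUM_OBJECTS * ATTR_LEN * sizeof(cl_double),
                                    NULL, &status);
status = clEnqueueWriteBuffer(queue, dattrDevice, CL_TRUE, 0,
                              NUM_OBJECTS * ATTR_LEN * sizeof(cl_double),
                              dattrHost, 0, NULL, NULL);
// The kernel indexes the buffer the same way: dattr[i * ATTR_LEN + j].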
This is just to give you an idea of how OpenCL works. Take a look at the Khronos OpenCL resource page, which has plenty of OpenCL tutorials and examples, and the Khronos OpenCL page, which has the official OpenCL references, man pages and quick reference cards.
As suggested by Faken, if you are concerned with dynamic memory allocation and are willing to change the algorithm a little, here is a hint:
The following code dynamically allocates local memory space and passes it as the 8th argument to the OpenCL kernel:
int N; // number of data points, which will keep on changing as per your requirement
size_t localMemSize = N * sizeof(int);
...
// Dynamically allocate local memory (allocated per work-group)
clSetKernelArg(kernel, 8, localMemSize, NULL);
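On the kernel side, an argument set this way shows up as a __local pointer parameter; a hypothetical kernel signature (not from the original answer) would be:

__kernel void my_kernel(/* arguments 0..7 ... */,
                        __local int *scratch) // argument index 8
{
    // scratch points to localMemSize bytes of per-work-group local
    // memory, sized by the clSetKernelArg call above.
}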

Where's the memory leak?

So I'm learning pointers and having a difficult time identifying the memory leak here. I confess I have never used malloc() before and am new to pointer arithmetic. Thanks in advance.
/* filename: p3.c */
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char *buffer;
    char *p;
    int n;

    /* allocate 10 bytes */
    buffer = (char *) malloc(10);

    p = buffer;
    for (n = 0; n <= 10; n++)
        *p++ = '*';

    p = buffer;
    for (n = 0; n <= 10; n++)
        printf("%c ", *p++);

    return 0;
}
The rule is rather simple: for every malloc there must be a free. If you have more mallocs than frees, you forgot to de-allocate memory, and so you have a memory leak. If you have more frees than mallocs, you're trying to de-allocate memory that has already been de-allocated, and that's not something you want.
You simply need to free your buffer using the free() function, when you don't need it anymore:

    /* ... */
    free( buffer );
    return 0;
}
Simply remember to balance each call to malloc with a call to free once the memory is no longer used.
The operations on your p variable won't affect buffer. They are two pointers pointing to the same area (at the start), but they're still two distinct variables, so incrementing p won't increment buffer.
So there is nothing wrong with the pointer operations on p, except for the fact that you are writing out of bounds, as stated by Daniel Fisher in the comments on your question.
Also note that you should always check for NULL after the malloc call, as malloc may fail. Failures are pretty rare nowadays, but if it does fail, your program will probably crash, as you will then dereference a NULL pointer:
buffer = malloc( 10 );
if( buffer == NULL )
{
    /* Error management - do not use buffer */
}
The cast to char * is not needed on malloc, unless you are dealing with C++. In C, it's valid to assign a void pointer to another pointer type.
It is not n<=10 you want, but n<10.
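Putting the earlier points together (free the buffer, check for NULL, fix the bounds), a corrected version of p3.c might look like this illustrative sketch:

/* filename: p3.c -- corrected sketch */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *buffer;
    char *p;
    int n;

    buffer = malloc(10);       /* no cast needed in C */
    if (buffer == NULL)
    {
        perror("malloc");
        return 1;
    }

    p = buffer;
    for (n = 0; n < 10; n++)   /* n < 10, not n <= 10 */
        *p++ = '*';

    p = buffer;
    for (n = 0; n < 10; n++)
        printf("%c ", *p++);

    free(buffer);              /* balance the malloc */
    return 0;
}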
You call malloc and never call free. Of course it leaks.
In principle, every single allocation you request from the alloc family of functions should be freed as soon as you are done with it.
Buffers that you continue to use up to the termination of the program are formally leaks, but they are not a problem as long as you are allocating a well-defined number of them. That includes what you are doing here.
