Memory object allocation in Opencl for dynamic array in structure - opencl

I have created following structure 'data' in C
typedef struct data
{
double *dattr;
int d_id;
int bestCent;
}Data;
The 'dattr' is an array in above structure which is kept dynamic.
Suppose I have to create 10 objects of above structure. i.e.
dataNode = (Data *)malloc (sizeof(Data) * 10);
and for every object of this structure I have to reallocate the memory in C for array 'dattr' using:
for(i=0; i<10; i++)
dataNode[i].dattr = (double *)malloc(sizeof(double) * 3);
What should do to implement the same in OpenCL? How to allocate the memory for array 'dattr' once I allocate the memory for structure objects?

Memory allocation in OpenCL devices (for example, a GPU) must be performed in the host thread using clCreateBuffer (or clCreateImage2D/3D if you wish to use texture memory). These functions allow you automatically copy host data (created with malloc for example) to the device, but I usually prefer to explicitly use clEnqueueWriteBuffer/clEnqueueMapBuffer (or clEnqueueWriteImage/clEnqueueMapImage if using texture memory), so that I can profile the data transfers. Here's an example:
#define DATA_SIZE 1000
typedef struct data {
cl_uint id;
cl_uint x;
cl_uint y;
} Data;
...
// Allocate data array in host
size_t dataSizeInBytes = DATA_SIZE * sizeof(Data);
DATA * dataArrayHost = (DATA *) malloc(dataSizeInBytes);
// Initialize data
...
// Create data array in device
cl_mem dataArrayDevice = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSizeInBytes, NULL, &status );
// Copy data array to device
status = clEnqueueWriteBuffer(queue, dataArrayDevice, CL_TRUE, 0, dataSizeInBytes, &dataArrayHost, 0, NULL, NULL );
// Make sure to pass dataArrayDevice as kernel parameter
// Run kernel
...
What you need to consider is that you need to know the memory requirements of an OpenCL kernel before you execute it. As such memory allocation can be dynamic if performed before kernel execution (i.e. in host). Nothing stops you from calling the kernel several times, and in each of those times adjusting (allocating) the kernel memory requirements.
Having this into account, I advise you to rethink the way your approaching the problem. To begin, it is simpler (but not necessarily more efficient) to work with arrays of structures, than with structures of arrays (in which case, the arrays would have to have a fixed size anyway).
This is just to give you an idea of how OpenCL works. Take a look at Khronos OpenCL resource page, it has plenty of OpenCL tutorials and examples, and Khronos OpenCL page, which has the official OpenCL references, man pages and quick references cards.

As suggested by Faken if you are concern with dynamic memory allocation and you are eager to change the algorithm a little bit, here is some hint:
The following code dynamically allocates local memory space and passes it as the 8th argument to the OpenCL kernel:
int N; //Number_of_data_points, which will keep on changing as per your requirement
size_t localMemSize = ( N* sizeof(int));
...
// Dynamically allocate local memory (allocated per workgroup)
clSetKernelArg(kernel, 8, localMemSize, NULL);

Related

Can I have boolean buffer in OpenCL and change its value during kernel execution, example to break while loop

I want to do some experiments in OpenCL and I want to know possibility to change states during kernel execution from host code using buffer.
I attempted to alter the state of a while loop in the kernel code by modifying the buffer value from within the host code, however the execution is hung.
void my_kernel(
__global bool *in,
__global int *out)
{
int i = get_global_id(0);
while(1) {
if(1 == *in) {
printf("while loop is finished");
break;
}
}
printf("out[0] = %d\n", out[0]);
}
I call second time the function clEnqueueWriteBuffer() to change state of input value.
input[0] = 1;
err = clEnqueueWriteBuffer(commands, input_buffer,
CL_TRUE, 0, sizeof(int), (void*)input,
0, NULL,NULL);
At least for OpenCL 1.x, this is not permitted, and any behaviour you may observe in one implementation cannot be relied upon.
See the NOTE in the OpenCL 1.2 specification, section 5.2.2, Reading, Writing and Copying Buffer Objects:
Calling clEnqueueWriteBuffer to update the latest bits in a region of the buffer object with the ptr argument value set to host_ptr + offset, where host_ptr is a pointer to the memory region specified when the buffer object being written is created with CL_MEM_USE_HOST_PTR, must meet the following requirements in order to avoid undefined behavior:
The host memory region given by (host_ptr + offset, cb) contains the latest bits when the enqueued write command begins execution.
The buffer object or memory objects created from this buffer object are not mapped.
The buffer object or memory objects created from this buffer object are not used by any command-queue until the write command has finished execution.
The final condition is not met by your code, therefore its behaviour is undefined.
I am not certain if the situation is different with OpenCL 2.x's Shared Virtual Memory (SVM) feature, as I have no practical experience using it, perhaps someone else can contribute an answer for that.

Is it defined to write to the same buffer from different kernels?

I have OpenCL 1.1, one device, out of order execution command queue,
and want that multiple kernels output their results into one buffer to different, not overlapped, arbitrary, regions.
Is it possible?
cl::CommandQueue commandQueue(context, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
cl::Buffer buf_as(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, &as[0]);
cl::Buffer buf_bs(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, &bs[0]);
cl::Buffer buf_rs(context, CL_MEM_WRITE_ONLY, data_size, NULL);
cl::Kernel kernel(program, "dist");
kernel.setArg(0, buf_as);
kernel.setArg(1, buf_bs);
int const N = 4;
int const d = data_size / N;
std::vector<cl::Event> events(N);
for(int i = 0; i != N; ++i) {
int const beg = d * i;
int const len = d;
kernel_leaf.setArg(2, beg);
kernel_leaf.setArg(3, len);
commandQueue.enqueueNDRangeKernel(kernel, NULL, cl::NDRange(block_size_x), cl::NDRange(block_size_x), NULL, &events[i]);
}
commandQueue.enqueueReadBuffer(buf_rs, CL_FALSE, 0, data_size, &rs[0], &events, NULL);
commandQueue.finish();
I wanted to give an official committee response to this. We realise the specification is ambiguous and have made modifications to rectify this.
This is not guaranteed under OpenCL 1.x or indeed 2.0 rules. cl_mem objects are only guaranteed to be consistent at synchronization points, even when processed only on a single device and even when used by OpenCL 2.0 kernels using memory_scope_device.
Multiple child kernels of an OpenCL 2.0 parent kernel can share the parent's cl_mem objects at device scope.
Coarse-grained SVM objects can be shared at device scope between multiple kernels, as long as the memory locations written to are not overlapping.
The writes should work fine if the global memory addresses are non-overlapping as you have described. Just make sure both kernels are finished before reading the results back to the host.
I don't think it is defined. Although you say you are writing to non-overlapping regions at the software level, it is not guaranteed that at the hardware level the accesses won't map onto same cache lines - in which case you'll have multiple modified versions flying around.

In CUDA, how to copy an array of device pointers to device memory?

For example, I allocate these following pointers:
float *data_1, *data_2, *data_3, *data_4;
//Use malloc to allocate memory and fill out some data to these pointers
......
//Filling complete
float *data_d1,*data_d2,*data_d3,*data_d4;
cudaMalloc((void **)&data_d1,size1);
cudaMalloc((void **)&data_d2,size2);
cudaMalloc((void **)&data_d3,size3);
cudaMalloc((void **)&data_d4,size4);
cudaMemcpy(data_d1,data_1,size1,cudaMemcpyHostToDevice);
cudaMemcpy(data_d2,data_2,size2,cudaMemcpyHostToDevice);
cudaMemcpy(data_d3,data_3,size3,cudaMemcpyHostToDevice);
cudaMemcpy(data_d4,data_4,size4,cudaMemcpyHostToDevice);
After this, I should already get 4 device pointers containing the exact data as host pointers do. Now I'd like to store these pointers into one array of pointers as following,
float *ptrs[4];
ptrs[0] = data_d1;
ptrs[1] = data_d2;
ptrs[2] = data_d3;
ptrs[3] = data_d4;
Now I'd like to transfer this array of pointers to CUDA kernel. However, I know that since ptrs[4] is actually on host memory, I need to allocate a new pointer on device. So I did this,
float **ptrs_d;
size_t size = 4 * sizeof(float*);
cudaMalloc((void ***)&ptrs_d,size);
cudaMemcpy(ptrs_d,ptrs,size,cudaMemcpyHostToDevice);
And then invoke the kernel:
kernel_test<<<dimGrid,dimBlock>>>(ptrs_d, ...);
//Declaration should be
//__global__ void kernel_test(float **ptrs_d, ...);
In the kernel_test, load data in the following syntax:
if (threadIdx.x < length_of_data_1d)
{
float element0 = (ptrs[0])[threadIdx.x];
}
Compiling is OKay, but when debugging, it gives an error of access violation.
Perhaps there're a lot of errors in my code. But I just want to figure out why I can't pass device pointers in this way and what is the proper way to access it if it is allowed in CUDA to pass array of device pointers to kernel function.
So how should I fix this issue? Any suggestions are appreciated. Thanks in advance.
One possibility is to allocate a void pointer, like CUDA expects as as standart, too. When passing it into your kernel, you can cast it to float**.
I did it in that way:
void* ptrs_d = 0;
cudaMalloc(&ptrs_d, 4*sizeof(float*));
cudaMemcpy(ptrs_d, ptrs, 4*sizeof(float*), cudaMemcpyHostToDevice);
kernel_test<<<dimGrid, dimBlock>>>((float**)ptrs_d);

Base Address of Memory Object OpenCL

I want to traverse a tree at GPU with OpenCL, so i assemble the tree in a contiguous block at host and i change the addresses of all pointers so as to be consistent at device as follows:
TreeAddressDevice = (size_t)BaseAddressDevice + ((size_t)TreeAddressHost - (size_t)BaseAddressHost);
I want the base address of the memory buffer:
At host i allocate memory for the buffer, as follows:
cl_mem tree_d = clCreateBuffer(...);
The problem is that cl_mems are objects that track an internal representation of the data. Technically they're pointers to an object, but they are not pointers to the data. The only way to access a cl_mem from within a kernel is to pass it in as an argument via setKernelArgs.
Here http://www.proxya.net/browse.php?u=%3A%2F%2Fwww.khronos.org%2Fmessage_boards%2Fviewtopic.php%3Ff%3D37%26amp%3Bt%3D2900&b=28 i found the following solution, but it doesnot work:
__kernel void getPtr( __global void *ptr, __global void *out )
{
*out = ptr;
}
that can be invoked as follows
Code:
...
cl_mem auxBuf = clCreateBuffer( context, CL_MEM_READ_WRITE, sizeof(void*), NULL, NULL );
void *gpuPtr;
clSetKernelArg( getterKernel, 0, sizeof(cl_mem), &myBuf );
clSetKernelArg( getterKernel, 1, sizeof(cl_mem), &auxBuf );
clEnqueueTask( commandQueue, getterKernel, 0, NULL, NULL );
clEnqueueReadBuffer( commandQueue, auxBuf, CL_TRUE, 0, sizeof(void*), &gpuPtr, 0, NULL, NULL );
clReleaseMemObject(auxBuf);
...
Now "gpuPtr" should contain the address of the beginning of "myBuf" in GPU memory space.
The solution is obvious and i can't find it? How can I get back a pointer to device memory when creating buffers?
It's because in the OpenCL model, host memory and device memory are disjoint. A pointer in device memory will have no meaning on the host.
You can map a device buffer to host memory using clEnqueueMapBuffer. The mapping will synchronize device to host, and unmapping will synchronize back host to device.
Update. As you explain in the comments, you want to send a tree structure to the GPU. One solution would be to store all tree nodes inside an array, replacing pointers to nodes with indices in the array.
As Eric pointed out, there are two sets of memory to consider: host memory and device memory. Basically, OpenCL tries to hide the gritty details of this interaction by introducing the buffer object for us to interact with in our program on the host side. Now, as you noted, the problem with this methodology is that it hides away the details of our device when we want to do something trickier than the OpenCL developers intended or allowed in their scope. The solution here is to remember that OpenCL kernels use C99 and that the language allows us to access pointers without any issue. With this in mind, we can just demand the pointer be stored in an unsigned integer variable to be referenced later.
Your implementation was on the right track, but it needed a little bit more C syntax to finish up the transfer.
OpenCL Kernel:
// Kernel used to obtain pointer from target buffer
__kernel void mem_ptr(__global char * buffer, __global ulong * ptr)
{
ptr[0] = &buffer[0];
}
// Kernel to demonstrate how to use that pointer again after we extract it.
__kernel void use_ptr(__global ulong * ptr)
{
char * print_me = (char *)ptr[0];
/* Code that uses all of our hard work */
/* ... */
}
Host Program:
// Create the buffer that we want the device pointer from (target_buffer)
// and a place to store it (ptr_buffer).
cl_mem target_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
MEM_SIZE * sizeof(char), NULL, &ret);
cl_mem ptr_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
1 * sizeof(cl_ulong), NULL, &ret);
/* Setup the rest of our OpenCL program */
/* .... */
// Setup our kernel arguments from the host...
ret = clSetKernelArg(kernel_mem_ptr, 0, sizeof(cl_mem), (void *)&target_buffer);
ret = clSetKernelArg(kernel_mem_ptr, 1, sizeof(cl_mem), (void *)&ptr_buffer);
ret = clEnqueueTask(command_queue, kernel_mem_ptr, 0, NULL, NULL);
// Now it's just a matter of storing the pointer where we want to use it for later.
ret = clEnqueueCopyBuffer(command_queue, ptr_buffer, dst_buffer, 0, 1 * sizeof(cl_ulong),
sizeof(cl_ulong), 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue, ptr_buffer, CL_TRUE, 0,
1 * sizeof(cl_ulong), buffer_ptrs, 0, NULL, NULL);
There you have it. Now, keep in mind that you don't have to use the char variables I used; it works for any type. However, I'd recommend using cl_ulong for the storing of pointers. This shouldn't matter for devices with less than 4GB of accessible memory. But for devices with a larger address space, you have to use cl_ulong. If you absolutely NEED to save space on your device but have a device whose memory > 4GB, then you might be able to create a struct that can store the lower 32 LSB of the address into a uint type, with the MSB's being stored in a small type.

Effect of using page-able memory for asynchronous memory copy?

In CUDA C Best Practices Guide Version 5.0, Section 6.1.2, it is written that:
In contrast with cudaMemcpy(), the asynchronous transfer version
requires pinned host memory (see Pinned Memory), and it contains an
additional argument, a stream ID.
It means the cudaMemcpyAsync function should fail if I use simple memory.
But this is not what happened.
Just for testing purpose, I tried the following program:
Kernel:
__global__ void kernel_increment(float* src, float* dst, int n)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if(tid<n)
dst[tid] = src[tid] + 1.0f;
}
Main:
int main()
{
float *hPtr1, *hPtr2, *dPtr1, *dPtr2;
const int n = 1000;
size_t bytes = n * sizeof(float);
cudaStream_t str1, str2;
hPtr1 = new float[n];
hPtr2 = new float[n];
for(int i=0; i<n; i++)
hPtr1[i] = static_cast<float>(i);
cudaMalloc<float>(&dPtr1,bytes);
cudaMalloc<float>(&dPtr2,bytes);
dim3 block(16);
dim3 grid((n + block.x - 1)/block.x);
cudaStreamCreate(&str1);
cudaStreamCreate(&str2);
cudaMemcpyAsync(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice,str1);
kernel_increment<<<grid,block,0,str2>>>(dPtr1,dPtr2,n);
cudaMemcpyAsync(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost,str1);
printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));
cudaDeviceSynchronize();
printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));
cudaStreamDestroy(str1);
cudaStreamDestroy(str2);
cudaFree(dPtr1);
cudaFree(dPtr2);
for(int i=0; i<n; i++)
std::cout<<hPtr2[i]<<std::endl;
delete[] hPtr1;
delete[] hPtr2;
return 0;
}
The program gave correct output. The array incremented successfully.
How did cudaMemcpyAsync execute without page locked memory?
Am I missing something here?
cudaMemcpyAsync is fundamentally an asynchronous version of cudaMemcpy. This means that it doesn't block the calling host thread when the copy call is issued. That is the basic behaviour of the call.
Optionally, if the call is launched into the non default stream, and if the host memory is a pinned allocation, and the device has a free DMA copy engine, the copy operation can happen while the GPU simultaneously performs another operation: either kernel execution or another copy (in the case of a GPU with two DMA copy engines). If any of these conditions are not satisfied, the operation on the GPU is functionally identical to a standard cudaMemcpy call, ie. it serialises operations on the GPU, and no simultaneous copy-kernel execution or simultaneous multiple copies can occur. The only difference is that the operation doesn't block the calling host thread.
In your example code, the host source and destination memory are not pinned. So the memory transfer cannot overlap with kernel execution (ie. they serialise operations on the GPU). The calls are still asynchronous on the host. So what you have is functionally equivalent to:
cudaMemcpy(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice);
kernel_increment<<<grid,block>>>(dPtr1,dPtr2,n);
cudaMemcpy(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost);
with the exception that all the calls are asynchronous on the host, so the host thread blocks at the cudaDeviceSynchronize() call rather than at each of the memory transfer calls.
This is absolutely expected behaviour.

Resources