Is it defined to write to the same buffer from different kernels?

I have OpenCL 1.1, one device, and an out-of-order execution command queue,
and I want multiple kernels to write their results into one buffer, each to a different, non-overlapping, arbitrary region.
Is it possible?
cl::CommandQueue commandQueue(context, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
cl::Buffer buf_as(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, &as[0]);
cl::Buffer buf_bs(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, &bs[0]);
cl::Buffer buf_rs(context, CL_MEM_WRITE_ONLY, data_size, NULL);

cl::Kernel kernel(program, "dist");
kernel.setArg(0, buf_as);
kernel.setArg(1, buf_bs);
kernel.setArg(4, buf_rs); // shared output buffer (argument index assumed)

int const N = 4;
int const d = data_size / N;
std::vector<cl::Event> events(N);
for (int i = 0; i != N; ++i) {
    int const beg = d * i;
    int const len = d;
    kernel.setArg(2, beg);
    kernel.setArg(3, len);
    commandQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(block_size_x), cl::NDRange(block_size_x), NULL, &events[i]);
}
commandQueue.enqueueReadBuffer(buf_rs, CL_FALSE, 0, data_size, &rs[0], &events, NULL);
commandQueue.finish();

I wanted to give an official committee response to this. We realise the specification is ambiguous and have made modifications to rectify this.
This is not guaranteed under OpenCL 1.x or indeed 2.0 rules. cl_mem objects are only guaranteed to be consistent at synchronization points, even when processed only on a single device and even when used by OpenCL 2.0 kernels using memory_scope_device.
Multiple child kernels of an OpenCL 2.0 parent kernel can share the parent's cl_mem objects at device scope.
Coarse-grained SVM objects can be shared at device scope between multiple kernels, as long as the memory locations written to are not overlapping.
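For illustration, a minimal coarse-grained SVM sketch under OpenCL 2.0. The kernels kernelA and kernelB are hypothetical placeholders, each assumed to take a single pointer argument:

// Allocate a coarse-grained SVM region and hand each kernel a disjoint half.
void *svm = clSVMAlloc(context, CL_MEM_READ_WRITE, data_size, 0);
clSetKernelArgSVMPointer(kernelA, 0, svm);                         // writes [0, data_size/2)
clSetKernelArgSVMPointer(kernelB, 0, (char *)svm + data_size / 2); // writes the other half
// Enqueue both kernels, then map before host access -- the blocking map is the
// synchronization point that makes the device writes visible to the host.
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_READ, svm, data_size, 0, NULL, NULL);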

The writes should work fine if the global memory addresses are non-overlapping as you have described. Just make sure both kernels are finished before reading the results back to the host.

I don't think it is defined. Although you say you are writing to non-overlapping regions at the software level, it is not guaranteed that at the hardware level the accesses won't map onto the same cache lines - in which case you'll have multiple modified versions flying around.


Send discontiguous memory to a buffer in OpenCL

Suppose I have an array A[200][200].
If I want to send A[0:100][0:200] to a GPU buffer,
I just call
clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, 100 * 200 * sizeof(float), A, 0, NULL, NULL);
But if I want to send A[0:200][0:100] to a GPU buffer, I cannot use the above call because A[0:200][0:100] is discontiguous.
Is there any wise way to send the above data?
You could use clEnqueueWriteBufferRect.
cl_int clEnqueueWriteBufferRect(
    cl_command_queue command_queue,
    cl_mem buffer,
    cl_bool blocking_write,
    const size_t buffer_origin[3],
    const size_t host_origin[3],
    const size_t region[3],
    size_t buffer_row_pitch,
    size_t buffer_slice_pitch,
    size_t host_row_pitch,
    size_t host_slice_pitch,
    const void *ptr,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event
)
In your case, the most relevant parameters are the host_origin, region and host_row_pitch.
host_row_pitch: the size in bytes of each row in host memory.
The array float A[200][200] describes a row-major 2D array with a row pitch of 200 * sizeof(float).
host_origin: where the data you wish to send to the device starts within the host array, given as an (x, y, z) offset. Your origin is simply size_t[3] {0, 0, 0}.
region: the (width, height, depth) region you wish to copy from the array. Note that the width is given in bytes, so your region is size_t[3] {100 * sizeof(float), 200, 1}.
I would advise you to read the documentation very carefully. It is easy to make small mistakes.
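Putting that together, a minimal sketch for this case might look like the following (assuming a command queue queue and a destination buffer holding at least 200 * 100 floats):

const size_t buffer_origin[3] = {0, 0, 0};
const size_t host_origin[3]   = {0, 0, 0};
const size_t region[3]        = {100 * sizeof(float), 200, 1}; // bytes, rows, slices

cl_int err = clEnqueueWriteBufferRect(queue, buffer, CL_TRUE,
    buffer_origin, host_origin, region,
    100 * sizeof(float), // buffer_row_pitch: rows packed tightly in the buffer
    0,                   // buffer_slice_pitch (0 lets OpenCL infer it)
    200 * sizeof(float), // host_row_pitch: one full row of A
    0,                   // host_slice_pitch
    A, 0, NULL, NULL);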
Also note that it might be more efficient to first arrange the host data into a contiguous array before sending it. clEnqueueWriteBuffer will probably initiate a DMA transfer, which is more efficient for large contiguous blocks of memory.
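For comparison, a sketch of that manual-packing approach (the staging array name is illustrative):

// Copy the first 100 floats of each of the 200 rows into a contiguous
// staging array, then send it to the device in a single transfer.
float *staging = (float *)malloc(200 * 100 * sizeof(float));
for (int row = 0; row < 200; ++row)
    memcpy(&staging[row * 100], &A[row][0], 100 * sizeof(float));
clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, 200 * 100 * sizeof(float),
                     staging, 0, NULL, NULL);
free(staging);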

OpenCL 'non-blocking' reads have higher cost than expected

Consider the following code, which enqueues between 1 and 100000 'non-blocking' random access buffer reads and measures the time:
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
#include <chrono>
#include <stdio.h>
static const int size = 100000;
int host_buf[size];
int main() {
    cl::Context ctx(CL_DEVICE_TYPE_DEFAULT, nullptr, nullptr, nullptr);
    std::vector<cl::Device> devices;
    ctx.getInfo(CL_CONTEXT_DEVICES, &devices);
    printf("Using OpenCL devices: \n");
    for (auto &dev : devices) {
        std::string dev_name = dev.getInfo<CL_DEVICE_NAME>();
        printf(" %s\n", dev_name.c_str());
    }
    cl::CommandQueue queue(ctx);
    cl::Buffer gpu_buf(ctx, CL_MEM_READ_WRITE, sizeof(int) * size, nullptr, nullptr);
    std::vector<int> values(size);
    // Warmup
    queue.enqueueReadBuffer(gpu_buf, false, 0, sizeof(int), &(host_buf[0]));
    queue.finish();
    // Run from 1 to 100000 sized chunks
    for (int k = 1; k <= size; k *= 10) {
        auto cstart = std::chrono::high_resolution_clock::now();
        for (int j = 0; j < k; j++)
            queue.enqueueReadBuffer(gpu_buf, false, sizeof(int) * (j * (size / k)), sizeof(int), &(host_buf[j]));
        queue.finish();
        auto cend = std::chrono::high_resolution_clock::now();
        double time = std::chrono::duration<double>(cend - cstart).count() * 1000000.0;
        printf("%8d: %8.02f us\n", k, time);
    }
    return 0;
}
As always, there is some random variation but the typical output for me is like this:
1: 10.03 us
10: 107.93 us
100: 794.54 us
1000: 8301.35 us
10000: 83741.06 us
100000: 981607.26 us
Whilst I did expect a relatively high latency for a single read, given the need for a PCIe round trip, I am surprised at the high cost of adding subsequent reads to the queue - as if there isn't really a 'queue' at all but each read adds the full latency penalty. This is on a GTX 960 with Linux and driver version 455.45.01.
Is this expected behavior?
Do other GPUs behave the same way?
Is there any workaround other than always doing random-access reads from inside a kernel?
You are using a single in-order command queue, so all enqueued reads are performed sequentially by the hardware/driver.
The 'non-blocking' aspect simply means that the call itself is asynchronous and will not block your host code while the GPU is working. In your code you then use clFinish, which blocks until all reads are done.
So yes, this is the expected behavior: you pay the full time penalty for each DMA transfer.
As long as you create an in-order command queue (the default), other GPUs will behave the same way.
If your hardware and driver support out-of-order queues, you could use them to potentially overlap DMA transfers. Alternatively, you could use multiple in-order queues. Either way, the performance is of course hardware and driver dependent. Using multiple queues or out-of-order queues is a bit more advanced: make sure to use events properly to avoid race conditions or undefined behavior.
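As a sketch of the multiple-queue idea, reusing the cl.hpp names from the question (whether the two transfers actually overlap is hardware and driver dependent):

cl::CommandQueue q1(ctx), q2(ctx);  // two independent in-order queues
cl::Event e1, e2;
const size_t half = sizeof(int) * (size / 2);
q1.enqueueReadBuffer(gpu_buf, false, 0, half, &host_buf[0], nullptr, &e1);
q2.enqueueReadBuffer(gpu_buf, false, half, half, &host_buf[size / 2], nullptr, &e2);
cl::Event::waitForEvents({e1, e2}); // block until both transfers complete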
To reduce the latency associated with GPU-host DMA transfers, it is recommended to use a pinned host buffer rather than a std::vector. Pinned host buffers are usually created via clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag.
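A minimal sketch of the pinned-buffer idea (whether CL_MEM_ALLOC_HOST_PTR actually yields pinned memory is implementation-dependent, though it usually does on the major vendors):

// Host-accessible staging buffer; most drivers back this with pinned memory.
cl::Buffer staging(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(int) * size);
int *pinned = static_cast<int *>(queue.enqueueMapBuffer(
    staging, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(int) * size));
// ... use `pinned` instead of host_buf as the destination of the reads ...
queue.enqueueUnmapMemObject(staging, pinned);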

Simple Vector Geometric Progression Design in OpenCL

I'm new to OpenCL and in order to get a better grasp of a few concepts I contrived a simple example of a geometric progression as follows (emphasis on contrived):
An array of N values and N coefficients (whose values could be anything, but in the example they are all the same) is allocated.
M steps are performed in sequence, where each value in the values array is multiplied by its corresponding coefficient in the coefficients array and assigned as the new value in the values array. Each step needs to fully complete before the next step can begin. I know this part is a bit contrived, but it is a requirement I want to enforce to help my understanding of OpenCL.
I'm only interested in the values in the values array after the final step has completed.
Here is the very simple OpenCL kernel (MultiplyVectors.cl):
__kernel void MultiplyVectors (__global float4* x, __global float4* y, __global float4* result)
{
    int i = get_global_id(0);
    result[i] = x[i] * y[i];
}
And here is the host program (main.cpp):
#include <CL/cl.hpp>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
int main ()
{
    auto context = cl::Context (CL_DEVICE_TYPE_GPU);

    auto *sourceFile = fopen("MultiplyVectors.cl", "r");
    if (sourceFile == nullptr)
    {
        perror("Couldn't open the source file");
        return 1;
    }
    fseek(sourceFile, 0, SEEK_END);
    const auto sourceSize = ftell(sourceFile);
    auto *sourceBuffer = new char [sourceSize + 1];
    sourceBuffer[sourceSize] = '\0';
    rewind(sourceFile);
    fread(sourceBuffer, sizeof(char), sourceSize, sourceFile);
    fclose(sourceFile);

    auto program = cl::Program (context, cl::Program::Sources {std::make_pair (sourceBuffer, sourceSize + 1)});
    delete[] sourceBuffer;
    const auto devices = context.getInfo<CL_CONTEXT_DEVICES> ();
    program.build (devices);
    auto kernel = cl::Kernel (program, "MultiplyVectors");

    const size_t vectorSize = 1024;
    float coeffs[vectorSize] {};
    for (size_t i = 0; i < vectorSize; ++i)
    {
        coeffs[i] = 1.000001;
    }
    auto coeffsBuffer = cl::Buffer (context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof (coeffs), coeffs);

    float values[vectorSize] {};
    for (size_t i = 0; i < vectorSize; ++i)
    {
        values[i] = static_cast<float> (i);
    }
    auto valuesBuffer = cl::Buffer (context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof (values), values);

    kernel.setArg (0, coeffsBuffer);
    kernel.setArg (1, valuesBuffer);
    kernel.setArg (2, valuesBuffer);

    auto commandQueue = cl::CommandQueue (context, devices[0]);
    for (size_t i = 0; i < 1000000; ++i)
    {
        commandQueue.enqueueNDRangeKernel (kernel, cl::NDRange (0), cl::NDRange (vectorSize / 4), cl::NullRange);
    }

    printf ("All kernels enqueued. Waiting to read buffer after last kernel...");
    commandQueue.enqueueReadBuffer (valuesBuffer, CL_TRUE, 0, sizeof (values), values);
    return 0;
}
What I'm basically asking for is advice on how to best optimize this OpenCL program to run on a GPU. I have the following questions based on my limited OpenCL experience to get the conversation going:

Could I be handling the buffers better? I'd like to minimize any unnecessary ferrying of data between the host and the GPU.

What's the optimal work group configuration (in general at least; I know this can vary by GPU)? I'm not actually sharing any data between work items, and it doesn't seem like I'd benefit much from work groups here, but just in case.

Should I be allocating and loading anything into local memory for a work group (if that would make sense at all)?

I'm currently enqueueing one kernel for each step, which creates a work item for every 4 floats to take advantage of a hypothetical GPU with a SIMD width of 128 bits. I'm attempting to enqueue all of this asynchronously (although I'm noticing the Nvidia implementation I have seems to block on each enqueue until the kernel is complete) and then wait on the final one to complete. Is there an altogether better approach that I'm missing?

Is there a design that would allow for only one call to enqueueNDRangeKernel (instead of one call per step) while maintaining the ability for each step to be efficiently processed in parallel?
Obviously I know that the example problem I'm solving can be done in much better ways, but I wanted as simple an example as possible that illustrates a vector of values being operated on in a series of steps, where each step has to complete fully before the next. Any help and pointers on how best to go about this would be greatly appreciated.
Thanks!

Base Address of Memory Object OpenCL

I want to traverse a tree on the GPU with OpenCL, so I assemble the tree in a contiguous block on the host and rewrite all of its pointers so that they are consistent on the device, as follows:
TreeAddressDevice = (size_t)BaseAddressDevice + ((size_t)TreeAddressHost - (size_t)BaseAddressHost);
I want the base address of the memory buffer.
On the host I allocate memory for the buffer as follows:
cl_mem tree_d = clCreateBuffer(...);
The problem is that cl_mems are objects that track an internal representation of the data. Technically they're pointers to an object, but they are not pointers to the data. The only way to access a cl_mem from within a kernel is to pass it in as an argument via clSetKernelArg.
Here http://www.khronos.org/message_boards/viewtopic.php?f=37&t=2900 I found the following solution, but it does not work:
__kernel void getPtr( __global void *ptr, __global void *out )
{
    *out = ptr;
}
which can be invoked as follows:
...
cl_mem auxBuf = clCreateBuffer( context, CL_MEM_READ_WRITE, sizeof(void*), NULL, NULL );
void *gpuPtr;
clSetKernelArg( getterKernel, 0, sizeof(cl_mem), &myBuf );
clSetKernelArg( getterKernel, 1, sizeof(cl_mem), &auxBuf );
clEnqueueTask( commandQueue, getterKernel, 0, NULL, NULL );
clEnqueueReadBuffer( commandQueue, auxBuf, CL_TRUE, 0, sizeof(void*), &gpuPtr, 0, NULL, NULL );
clReleaseMemObject(auxBuf);
...
Now "gpuPtr" should contain the address of the beginning of "myBuf" in GPU memory space.
Is the solution obvious and I just can't find it? How can I get a pointer to device memory back when creating buffers?
It's because in the OpenCL model, host memory and device memory are disjoint. A pointer in device memory will have no meaning on the host.
You can map a device buffer to host memory using clEnqueueMapBuffer. The mapping will synchronize device to host, and unmapping will synchronize back host to device.
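A short sketch of that, reusing the tree_d buffer from the question (treeSizeInBytes is a placeholder for its allocation size):

cl_int err;
// Map the device buffer into host-visible memory (blocking call).
void *hostView = clEnqueueMapBuffer(commandQueue, tree_d, CL_TRUE,
                                    CL_MAP_READ | CL_MAP_WRITE,
                                    0, treeSizeInBytes, 0, NULL, NULL, &err);
/* ... read or update the tree through hostView ... */
// Unmap to make the host-side changes visible to the device again.
clEnqueueUnmapMemObject(commandQueue, tree_d, hostView, 0, NULL, NULL);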
Update. As you explain in the comments, you want to send a tree structure to the GPU. One solution would be to store all tree nodes inside an array, replacing pointers to nodes with indices in the array.
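For example, a node layout along those lines (the field names are illustrative):

// The whole tree lives in one flat array; child links are indices into that
// array rather than pointers, so the layout is valid on both host and device.
typedef struct {
    float  value;
    cl_int left;   // index of the left child in the node array, or -1 if none
    cl_int right;  // index of the right child, or -1 if none
} Node;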
As Eric pointed out, there are two sets of memory to consider: host memory and device memory. Basically, OpenCL tries to hide the gritty details of this interaction by introducing the buffer object for us to interact with in our program on the host side. Now, as you noted, the problem with this methodology is that it hides away the details of our device when we want to do something trickier than the OpenCL developers intended or allowed in their scope. The solution here is to remember that OpenCL kernels use C99 and that the language allows us to access pointers without any issue. With this in mind, we can just demand the pointer be stored in an unsigned integer variable to be referenced later.
Your implementation was on the right track, but it needed a little bit more C syntax to finish up the transfer.
OpenCL Kernel:
// Kernel used to obtain the device pointer of the target buffer.
__kernel void mem_ptr(__global char *buffer, __global ulong *ptr)
{
    ptr[0] = (ulong)&buffer[0];
}

// Kernel to demonstrate how to use that pointer again after we extract it.
__kernel void use_ptr(__global ulong *ptr)
{
    __global char *print_me = (__global char *)ptr[0];
    /* Code that uses all of our hard work */
    /* ... */
}
Host Program:
// Create the buffer that we want the device pointer from (target_buffer)
// and a place to store it (ptr_buffer).
cl_mem target_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                      MEM_SIZE * sizeof(char), NULL, &ret);
cl_mem ptr_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                   1 * sizeof(cl_ulong), NULL, &ret);

/* Set up the rest of our OpenCL program */
/* .... */

// Set up our kernel arguments from the host...
ret = clSetKernelArg(kernel_mem_ptr, 0, sizeof(cl_mem), (void *)&target_buffer);
ret = clSetKernelArg(kernel_mem_ptr, 1, sizeof(cl_mem), (void *)&ptr_buffer);
ret = clEnqueueTask(command_queue, kernel_mem_ptr, 0, NULL, NULL);

// Now it's just a matter of storing the pointer where we want to use it later.
ret = clEnqueueCopyBuffer(command_queue, ptr_buffer, dst_buffer, 0, 1 * sizeof(cl_ulong),
                          sizeof(cl_ulong), 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue, ptr_buffer, CL_TRUE, 0,
                          1 * sizeof(cl_ulong), buffer_ptrs, 0, NULL, NULL);
There you have it. Keep in mind that you don't have to use the char type I used; this works for any type. However, I'd recommend using cl_ulong to store the pointers. For devices with less than 4 GB of accessible memory a uint would suffice, but for devices with a larger address space you have to use cl_ulong. If you absolutely need to save space on a device whose memory is larger than 4 GB, you might be able to create a struct that stores the lower 32 bits of the address in a uint, with the most significant bits stored in a smaller type.

Memory object allocation in OpenCL for dynamic array in structure

I have created the following structure 'data' in C:
typedef struct data
{
    double *dattr;
    int d_id;
    int bestCent;
} Data;
'dattr' is an array in the above structure which is kept dynamic.
Suppose I have to create 10 objects of the above structure, i.e.
dataNode = (Data *)malloc(sizeof(Data) * 10);
and for every object of this structure I have to allocate the memory for the array 'dattr' in C using:
for (i = 0; i < 10; i++)
    dataNode[i].dattr = (double *)malloc(sizeof(double) * 3);
What should I do to implement the same in OpenCL? How do I allocate the memory for the array 'dattr' once I have allocated the memory for the structure objects?
Memory allocation on OpenCL devices (for example, a GPU) must be performed from the host thread using clCreateBuffer (or clCreateImage2D/3D if you wish to use texture memory). These functions let you automatically copy host data (created with malloc, for example) to the device, but I usually prefer to explicitly use clEnqueueWriteBuffer/clEnqueueMapBuffer (or clEnqueueWriteImage/clEnqueueMapImage if using texture memory) so that I can profile the data transfers. Here's an example:
#define DATA_SIZE 1000

typedef struct data {
    cl_uint id;
    cl_uint x;
    cl_uint y;
} Data;
...
// Allocate data array on the host
size_t dataSizeInBytes = DATA_SIZE * sizeof(Data);
Data *dataArrayHost = (Data *) malloc(dataSizeInBytes);
// Initialize data
...
// Create data array on the device
cl_mem dataArrayDevice = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSizeInBytes, NULL, &status);
// Copy data array to the device
status = clEnqueueWriteBuffer(queue, dataArrayDevice, CL_TRUE, 0, dataSizeInBytes, dataArrayHost, 0, NULL, NULL);
// Make sure to pass dataArrayDevice as a kernel parameter
// Run kernel
...
What you need to consider is that you must know the memory requirements of an OpenCL kernel before you execute it. As such, memory allocation can be dynamic if performed before kernel execution (i.e. on the host). Nothing stops you from calling the kernel several times, adjusting (allocating) the kernel memory requirements each time.
Having this in mind, I advise you to rethink the way you're approaching the problem. To begin with, it is simpler (but not necessarily more efficient) to work with arrays of structures than with structures of arrays (in which case the arrays would have to have a fixed size anyway).
This is just to give you an idea of how OpenCL works. Take a look at the Khronos OpenCL resource page, which has plenty of OpenCL tutorials and examples, and the Khronos OpenCL page, which has the official OpenCL references, man pages and quick reference cards.
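As a concrete sketch of that advice for the structure in the question (sizes taken from the example: 10 objects, 3 doubles each; context and status as in the snippet above):

// Flatten the per-object dattr arrays into one contiguous buffer, since a
// struct copied to the device cannot carry host pointers. Object i's
// attributes live at indices [i * 3, i * 3 + 3).
size_t dattrSizeInBytes = 10 * 3 * sizeof(double);
double *dattrHost = (double *)malloc(dattrSizeInBytes);
/* fill dattrHost[i * 3 + j] instead of dataNode[i].dattr[j] */
cl_mem dattrDevice = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    dattrSizeInBytes, dattrHost, &status);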
As suggested by Faken, if you are concerned about dynamic memory allocation and are willing to change the algorithm a little bit, here is a hint:
The following code dynamically sizes a local memory allocation and passes it as the kernel argument at index 8:
int N; // number of data points, which will keep changing as per your requirement
size_t localMemSize = N * sizeof(int);
...
// Dynamically allocate local memory (allocated per work-group)
clSetKernelArg(kernel, 8, localMemSize, NULL);
