Using async_work_group_copy() with pointer? - opencl

__kernel void kmp(__global char pattern[1*4], __global char* string, __global int failure[1*4], __global int ret[1], int g_length, int l_length, int thread_num){
int pattern_num = 1;
int pattern_size = 4;
int gid = get_group_id(0);
int glid = get_global_id(0);
int lid = get_local_id(0);
int i, j, x = 0;
__local char *tmp_string;
event_t event;
if(l_length < pattern_size){
return;
}
event = async_work_group_copy(tmp_string, string+gid*g_length, g_length, 0);
wait_group_events(1, &event);
This is part of my code.
I want to find occurrences of a pattern in a text.
First, I initialize all my patterns and the string on the CPU side (I read the string from a text file and, for now, use only one pattern).
Second, I transfer them to a kernel named kmp.
(The parameters l_length and g_length are the sizes of the pieces of the string that will be copied for lid and glid, respectively.)
Finally, I want to copy the divided string to local memory.
But there is a problem: I cannot get any valid result when I copy the pieces using async_work_group_copy().
When I change __local char *tmp_string to an array, the problem remains.
What I want to do is 1) divide the string, 2) copy the pieces to each thread, and 3) count the matches.
I wonder what's wrong with this code. Thanks!

OpenCL spec has this:
The async copy is performed by all work-items in a work-group and this
built-in function must therefore be encountered by all work-items in a
work-group executing the kernel with the same argument values;
otherwise the results are undefined.
So you shouldn't return early for any work-item in a group. Early return is better suited to CPUs anyway. If this is a GPU, just compute the last, overflowing part using padded input/output buffers.
Otherwise, you can return early for the whole group (this should work, since no work-item reaches an async copy instruction) and do the remaining work on the CPU, unless the device uses a dedicated hidden pipeline rather than work-items for the async copy operation.
Maybe you can enqueue a second kernel (in another queue, concurrently) to compute the remaining items with a work-group size equal to the remaining size, instead of adding extra buffer size or control logic.
tmp_string needs to be allocated if you are going to copy something to or from it; an uninitialized __local pointer does not refer to any local memory. So you will probably need the array version.
async_work_group_copy is not a synchronization point, so once work-items have written to local memory, a barrier is needed before the copy so that the async copy to global memory sees the latest local data.
__kernel void foo(__global int *a, __global int *b)
{
int i=get_global_id(0);
int g=get_group_id(0);
int l=get_local_id(0);
int gs=get_local_size(0);
__local int tmp[256]; // assumes a work-group size of at most 256
event_t evt=async_work_group_copy(tmp,&a[g*gs],gs,0);
// compute foobar here in async to copies
wait_group_events(1,&evt);
tmp[l]=tmp[l]+3; // compute foobar2 using local memory
barrier(CLK_LOCAL_MEM_FENCE);
event_t evt2=async_work_group_copy(&b[g*gs],tmp,gs,0);
// compute foobar3 here in async to copies
wait_group_events(1,&evt2);
}
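The padded-buffer suggestion can be sketched on the host side. This is a minimal illustration in plain C, not the asker's code: make_padded_copy is a hypothetical helper, and chunk stands in for the per-group length g_length. Padding the text to a whole number of chunks lets every work-group issue an async_work_group_copy of the same size, so no work-item needs to return early.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Round n up to the next multiple of chunk. */
size_t round_up(size_t n, size_t chunk) {
    assert(chunk > 0);
    return (n + chunk - 1) / chunk * chunk;
}

/* Hypothetical helper: returns a zero-padded copy of text whose length is a
 * whole number of chunks. The caller frees the result. Returns NULL if the
 * allocation fails. */
char *make_padded_copy(const char *text, size_t n, size_t chunk,
                       size_t *padded_len) {
    assert(text != NULL && padded_len != NULL);
    size_t total = round_up(n, chunk);
    /* calloc zero-fills, so the padding bytes are '\0'. */
    char *buf = calloc(total > 0 ? total : 1, 1);
    if (buf == NULL) {
        return NULL;
    }
    memcpy(buf, text, n);
    *padded_len = total;
    return buf;
}
```

With a 10-character text and chunk = 4, the padded length is 12, so three work-groups would each copy exactly 4 bytes. The zero padding cannot match a pattern made of non-zero characters, but the kernel still has to ignore matches that would start past the original length.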

Related

Static variable in OpenCL C

I'm writing a renderer from scratch using OpenCL, and my kernel fails to compile with this error:
CL_BUILD_PROGRAM : error: program scope variable must reside in constant address space static float* objects;
The problem is that this program compiles on my desktop (with NVIDIA drivers) but not on my laptop (also with NVIDIA drivers). I also have the exact same kernel file in another project that works fine on both computers...
Does anyone have an idea what I could be doing wrong?
As a clarification, I'm coding a raymarcher whose kernel takes a list of objects "encoded" in a float array. The array is needed in many places in the program, which is why I want it accessible to the whole kernel.
Here is the simplified kernel code:
float* objects;
float4 getDistCol(float3 position) {
int arr_length = objects[0];
float4 distCol = {INFINITY, 0, 0, 0};
int index = 1;
while (index < arr_length) {
float objType = objects[index];
if (compare(objType, SPHERE)) {
// Treats the part of the buffer as a sphere
index += SPHERE_ATR_LENGTH;
} else if (compare(objType, PLANE)) {
//Treats the part of the buffer as a plane
index += PLANE_ATR_LENGTH;
} else {
float4 errCol = {500, 1, 0, 0};
return errCol;
}
}
}
__kernel void mkernel(__global int *image, __constant int *dimension,
__constant float *position, __constant float *aimDir, __global float *objs) {
objects = objs;
// Gets ray direction and stuff
// ...
// ...
float4 distCol = RayMarch(ro, rd);
float3 impact = rd*distCol.x + ro;
col = distCol.yzw * GetLight(impact);
image[dimension[0]*dimension[1] - idx*dimension[1]+idy] = toInt(col);
}
getDistCol(float3 position) is called often, by many functions, and I would like to avoid passing my float buffer to every function that needs to call getDistCol()...
There are no "static" variables in OpenCL C that you can declare outside of kernels and use across kernels. Some compilers might still tolerate this, others might not. Nvidia has recently changed their OpenCL compiler from LLVM 3.4 to NVVM 7 in a driver update, so you may have two different compilers on your desktop and laptop GPUs.
In your case, the solution is to hand the global kernel parameter pointer over to the function:
float4 getDistCol(float3 position, __global float *objects) {
int arr_length = objects[0]; // access objects normally, as you would in the kernel
// ...
}
kernel void mkernel(__global int *image, __constant int *dimension, __constant float *position, __constant float *aimDir, __global float *objs) {
// ...
getDistCol(position, objs); // hand global objs pointer over to function
// ...
}
Variables declared at program scope, outside any function, are only allowed in the constant address space, which is useful for large tables. They are cached in the L2 cache, so read-only access is potentially faster. Example:
constant float objects[1234] = {
1.0f, 2.0f, ...
};
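The pointer-passing approach from this answer can be mirrored in plain C to check the buffer-walking logic on the host. This is only a sketch: the tag values and record lengths below are hypothetical (the question does not give them), and count_objects and scene_size are illustrative names. As in the question, objects[0] is taken to hold the total length of the array in floats, and each record length includes its type tag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tags and record lengths; the original does not state them. */
enum { SPHERE = 1, PLANE = 2 };
enum { SPHERE_ATR_LENGTH = 5, PLANE_ATR_LENGTH = 7 };

/* Walks the encoded buffer that is passed in as a parameter instead of being
 * read from a program-scope variable. Returns the number of objects, or -1
 * if an unknown type tag is found. */
int count_objects(const float *objects) {
    assert(objects != NULL);
    int arr_length = (int)objects[0];
    int index = 1;
    int count = 0;
    while (index < arr_length) {
        int type = (int)objects[index];
        if (type == SPHERE) {
            index += SPHERE_ATR_LENGTH;
        } else if (type == PLANE) {
            index += PLANE_ATR_LENGTH;
        } else {
            return -1; /* unknown tag, mirrors the error colour in the kernel */
        }
        count++;
    }
    return count;
}

/* Caller that forwards the pointer, as the kernel would forward objs. */
int scene_size(const float *objs) {
    return count_objects(objs);
}
```

Every function between the kernel and getDistCol has to forward the pointer in the same way; grouping several such pointers in a struct is one way to keep the signatures short.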

Understanding Performance Behavior of Random Writes to Global Memory

I'm running experiments aiming to understand the behavior of random read and write access to global memory.
The following kernel reads from an input vector (groupColumn) with a coalesced access pattern and reads random entries from a hash table in global memory.
struct Entry {
uint group;
uint payload;
};
typedef struct Entry Entry;
__kernel void global_random_write_access(__global const uint* restrict groupColumn,
__global Entry* globalHashTable,
__const uint HASH_TABLE_SIZE,
__const uint HASH_TABLE_SIZE_BITS,
__const uint BATCH,
__const uint STRIDE) {
int global_id = get_global_id(0);
int local_id = get_local_id(0);
uint end = BATCH * STRIDE;
uint sum = 0;
for (int i = 0; i < end; i += STRIDE) {
uint idx = global_id + i;
// hash keys are pre-computed
uint hash_key = groupColumn[idx]; // coalesced read access
__global Entry* entry = &globalHashTable[hash_key]; // pointer arithmetic
sum += entry->payload; // random read
}
if (local_id < HASH_TABLE_SIZE) {
globalHashTable[local_id].payload = sum; // rare coalesced write
}
}
I ran this kernel on an NVIDIA V100 card with multiple iterations. The variance of the results is very low, so I plot only one dot per group configuration. The input data size is 1 GiB and each thread processes 128 entries (BATCH = 128). Here are the results:
So far so good. The V100 has a max memory bandwidth of roughly 840GiB/sec and the measurements are close enough, given that random memory reads are involved.
Now I'm testing random writes to global memory with the following kernel:
__kernel void global_random_write_access(__global const uint* restrict groupColumn,
__global Entry* globalHashTable,
__const uint HASH_TABLE_SIZE,
__const uint HASH_TABLE_SIZE_BITS,
__const uint BATCH,
__const uint STRIDE) {
int global_id = get_global_id(0);
int local_id = get_local_id(0);
uint end = BATCH * STRIDE;
uint sum = 0;
for (int i = 0; i < end; i += STRIDE) {
uint idx = global_id + i;
// hash keys are pre-computed
uint hash_key = groupColumn[idx]; // coalesced read access
__global Entry* entry = &globalHashTable[hash_key]; // pointer arithmetic
sum += i;
entry->payload = sum; // random write
}
if (local_id < HASH_TABLE_SIZE) {
globalHashTable[local_id].payload = sum; // rare coalesced write
}
}
Godbolt: OpenCL -> PTX
The performance drops significantly to a few GiB/sec for few groups.
I can't make any sense of the behavior. As soon as the hash table reaches the size of L1 the performance seems to be limited by L2. For fewer groups the performance is way lower. I don't really understand what the limiting factors are.
The CUDA documentation doesn't say much about how store instructions are handled internally. The only thing I could find is that the st.wb PTX instruction (Cache Operations) might cause a hit on a stale L1 cache line if another thread tried to read the same address via ld.ca. However, there are no reads from the hash table involved here.
Any hints or links to understanding the performance behavior are much appreciated.
Edit:
I actually found a bug in my code: the hash keys were not pre-computed. The access to global memory wasn't random, but actually coalesced, due to how I generated the values. I further simplified my experiments by removing the hash table. Now I only have one integer input column and one integer output column. Again, I want to see how writes to global memory behave for different memory ranges. Ultimately, I want to understand which hardware properties influence the performance of writes to global memory, and see whether I can predict from the code what performance to expect.
I tested this with two kernels that do the following:
(1) Read from input, write to output
(2) Read from input, read from output and write to output
I also applied two different access patterns by generating the values in the group column accordingly:
SEQUENTIAL: sequentially increasing numbers until the current group's size is reached. This pattern leads to coalesced memory access when reading from and writing to the output column.
RANDOM: uniformly distributed random numbers within the current group's size. This pattern leads to misaligned memory access when reading from and writing to the output column.
(1) Read & Write
__kernel void global_write_access(__global const uint* restrict groupColumn,
__global uint *restrict output,
__const uint BATCH,
__const uint STRIDE) {
int global_id = get_global_id(0);
int local_id = get_local_id(0);
uint end = BATCH * STRIDE;
uint sum = 0;
for (int i = 0; i < end; i += STRIDE) {
uint idx = global_id + i;
uint group = groupColumn[idx]; // coalesced read access
sum += i;
output[group] = sum; // write (coalesced | random)
}
}
PTX Code: https://godbolt.org/z/19nTdK
(2) Read, Read & Write
__kernel void global_read_write_access(__global const uint* restrict groupColumn,
__global uint *restrict output,
__const uint BATCH,
__const uint STRIDE) {
int global_id = get_global_id(0);
int local_id = get_local_id(0);
uint end = BATCH * STRIDE;
for (int i = 0; i < end; i += STRIDE) {
uint idx = global_id + i;
uint group = groupColumn[idx]; // coalesced read access
output[group] += 1; // read & write (coalesced | random)
}
}
PTX Code: https://godbolt.org/z/b647cz
As ProjectPhysX pointed out, the access pattern makes a huge difference. However, for small groups the performance is quite similar for both access patterns. In general, I would like to better understand the shape of the curves and which hardware properties, architectural features etc. influence this shape.
From the CUDA Programming Guide I learned that global memory accesses are conducted via 32-, 64-, or 128-byte transactions. Accesses to L2 are done via 32-byte transactions, so up to 8 integer words can be accessed in a single transaction. This might explain the plateau with a bump at 8 groups at the beginning of the curve. After that, more transactions are needed and performance drops.
One cache line is 128 bytes long (both in L1 and L2), hence 32 integers fit into a single cache line. For more groups, more cache lines are required, which can potentially be processed in parallel by more memory controllers. That might be why performance increases here. 8 controllers are available on the V100, so I would expect the performance to peak at 256 groups. It doesn't, though. Instead, performance increases steadily until reaching 4096 groups and plateaus there at roughly 750 GiB/sec.
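The transaction arithmetic above can be checked with a small host-side model. This is a rough sketch under stated assumptions: the output array starts on a segment boundary, each lane of a 32-wide warp accesses one 4-byte element, and caching and scheduling are ignored. count_segments is a hypothetical helper, not part of any CUDA or OpenCL API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WARP_SIZE 32

/* Counts how many distinct seg_bytes-sized memory segments are touched when
 * lane k of a warp accesses the 4-byte element at indices[k]. */
int count_segments(const uint32_t *indices, int n, uint32_t seg_bytes) {
    assert(indices != NULL && n > 0 && n <= WARP_SIZE);
    assert(seg_bytes >= sizeof(uint32_t));
    uint64_t seen[WARP_SIZE];
    int distinct = 0;
    for (int lane = 0; lane < n; lane++) {
        /* Byte offset of the element divided by the segment size. */
        uint64_t seg = (uint64_t)indices[lane] * sizeof(uint32_t) / seg_bytes;
        int found = 0;
        for (int k = 0; k < distinct; k++) {
            if (seen[k] == seg) {
                found = 1;
                break;
            }
        }
        if (!found) {
            seen[distinct++] = seg;
        }
    }
    return distinct;
}
```

In this model, 32 consecutive indices touch four 32-byte segments, or one 128-byte line. Indices confined to 8 groups touch a single 32-byte segment, and widely scattered indices touch 32 separate segments. That matches the reasoning about 8 groups per transaction and 32 integers per cache line, but it says nothing about how many memory controllers serve those segments.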
The plateau in your second performance plot is GPU saturation: for only a few work-groups, the GPU is partly idle and the latencies involved in launching the kernel significantly reduce performance. Above 8192 groups, the GPU fully saturates its memory bandwidth. The plateau is only at ~520GB/s because of the misaligned writes (which perform poorly on the V100) and also the "rare coalesced write" in the if-block, which happens at least once per group. Because of the branching within the group, all other threads have to wait for the single write operation to finish. Also, this write is not coalesced, because it does not happen for every thread in the group. On the V100, misaligned write performance is very poor, at most ~120GB/s; see the benchmark here.
Note that if you comment out the if-block, the compiler sees that you do not do anything with sum and optimizes everything away, leaving you with a blank kernel in PTX.
The first performance graph is a bit more confusing to me. The only difference between the first kernel and the second is that the random write in the loop is replaced by a random read. Generally, read performance on the V100 is much better (~840GB/s, regardless of coalesced/misaligned) than misaligned write performance, so performance is expected to be much better overall, and indeed it is. However, I can't make sense of the performance dropping for more groups, where saturation should theoretically be better. But the performance drop isn't really that significant, at ~760GB/s vs. ~730GB/s.
To summarize, you are observing that the performance penalty for misaligned writes (~120GB/s vs. ~900GB/s for coalesced writes) is much larger than for reads, where performance is about the same for coalesced and misaligned access at ~840GB/s. This is a common thing for GPUs, with some variance between microarchitectures of course. Typically there is at least some performance penalty for misaligned reads, but not as large as for misaligned writes.

fetch size of Kernel Opencl

I have two questions about the memory fetch size of an OpenCL kernel.
In the kernel, I have just a for-loop with a printf statement inside it.
Currently, I pass two arguments: a float array and an integer that sets the number of loop iterations.
However, when I pass 0 for the number of iterations (in other words, the kernel does nothing), the kernel still fetches 0.3125 kilobytes.
Why does the kernel fetch that amount of memory?
My other question is:
when a cache miss occurs, why does the kernel fetch less memory than the size of a cache line?
Thanks!
Edit
Here is my kernel
__kernel void sequence(__global float *input, const unsigned int N) {
size_t id = get_global_id(0);
for(int i = 0; i < N; i++) {
printf("Val: %f \n", input[i]);
}
}

Basic OpenCL Mutex Implementation (Currently Hanging)

I am trying to write a mutex for OpenCL. The idea is for every individual work item to be able to proceed atomically. Currently, I believe the problem may be that thread warps are unable to proceed when one thread in a warp gets the lock.
My current simple kernel, which sums numbers, is below. "numbers" is an array of floats as input, "sum" is a one-element array for the result, and "semaphore" is a one-element array holding the semaphore. I based it heavily on the example here.
void acquire(__global int* semaphore) {
int occupied;
do {
occupied = atom_xchg(semaphore, 1);
} while (occupied>0);
}
void release(__global int* semaphore) {
atom_xchg(semaphore, 0); //the previous value, which is returned, is ignored
}
__kernel void test_kernel(__global float* numbers, __global float* sum, __global int* semaphore) {
int i = get_global_id(0);
acquire(semaphore);
*sum += numbers[i];
release(semaphore);
}
I am calling the kernel effectively like:
int numof_dimensions = 1;
size_t offset_global[1] = {0};
size_t size_global[1] = {4000}; //the length of the numbers array
size_t* size_local = NULL;
clEnqueueNDRangeKernel(command_queue, kernel, numof_dimensions,offset_global,size_global,size_local, 0,NULL, NULL);
As above, when running, the graphics card hangs, and the driver restarts itself. How can I fix it so that it doesn't?
What you are trying to do is not possible because of the GPU execution model, where all threads on a "processor" share the instruction pointer, even in branches. Here is a post that explains the problem in detail: http://vansa.ic.cz/author/admin/.
BTW, the example code that you found has the exact same problem and would never work.
The answer to this might seem obvious in retrospect, but it's not unless you thought of it.
Basically, when size_local is NULL, the OpenCL implementation chooses the local work-group size itself. It picks a value greater than 1, so the threads of a warp lock each other up. To fix it, you just need to set the local size to 1 explicitly (i.e. "size_t size_local[1] = {1};"). Doing this produces a correct result.
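Both answers can be illustrated with a toy model in plain C. This is a deliberately simplified sketch, not a description of any particular GPU: it assumes that lanes of one warp share an instruction pointer, so a lane that has left the spin loop must wait at the end of the loop until every other lane of its warp has left it too. simulate_spinlock and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LANES 64

/* Returns true if every lane acquires and releases the lock within max_steps
 * lockstep passes, false if the model makes no further progress. */
bool simulate_spinlock(int num_lanes, int warp_size, int max_steps) {
    assert(num_lanes > 0 && num_lanes <= MAX_LANES);
    assert(warp_size > 0 && max_steps > 0);

    int semaphore = 0;
    bool holds[MAX_LANES] = { false }; /* lane currently owns the lock */
    bool done[MAX_LANES] = { false };  /* lane has released the lock */
    int finished = 0;

    for (int step = 0; step < max_steps && finished < num_lanes; step++) {
        for (int base = 0; base < num_lanes; base += warp_size) {
            int end = base + warp_size;
            if (end > num_lanes) {
                end = num_lanes;
            }
            /* One lockstep pass over the do/while loop of acquire(). */
            bool all_left_loop = true;
            for (int lane = base; lane < end; lane++) {
                if (done[lane] || holds[lane]) {
                    continue;
                }
                if (semaphore == 0) {
                    semaphore = 1;         /* models atom_xchg returning 0 */
                    holds[lane] = true;
                } else {
                    all_left_loop = false; /* models atom_xchg returning 1 */
                }
            }
            /* The warp only moves past the loop once no lane is spinning. */
            if (!all_left_loop) {
                continue;
            }
            for (int lane = base; lane < end; lane++) {
                if (holds[lane]) {
                    holds[lane] = false;
                    done[lane] = true;
                    semaphore = 0;         /* models release() */
                    finished++;
                }
            }
        }
    }
    return finished == num_lanes;
}
```

In this model, simulate_spinlock(4, 1, 1000) finishes, which corresponds to a local size of 1 where each warp has a single active lane. simulate_spinlock(4, 2, 1000) never finishes, because the lane that holds the lock waits for a neighbour that can never acquire it. Real schedulers differ in detail, so treat this as an explanation of the failure mode rather than a prediction for a specific device.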

OpenCL, is out of bound checks important in kernels

I have seen solutions like this:
kernel void dp_square (__global const float *a,
__global float *result)
{
int id = get_global_id(0);
result[id] = a[id] * a[id];
}
and
kernel void dp_square (__global const float *a,
__global float *result, const unsigned int count)
{
int id = get_global_id(0);
if(id < count)
result[id] = a[id] * a[id];
}
Is the check for id < count important? What happens if a kernel work-item tries to process an item that is not available?
Could the reason it is missing in the first example be that the programmer simply ensures that the global size equals the number of elements to be processed (is this normal)?
This is often done for two reasons:
1) To ensure that a developer error doesn't crash the code or read bad memory.
2) Because sometimes it is optimal to run more work-items than there are data points. For example, if the optimal work-group size for my device is 32 (not uncommon), and I have an array of 61 pieces of data, I'll run 64 work-items, and the last three will simply "play dead."
To leave out this check, you'd have to use a work-group size that divides the total number of work-items. In this case, since 61 is prime, that would leave you with a work-group size of 1 (or a single group of 61), which would be very slow!
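The 61-versus-64 example can be written out as a host-side sketch in plain C, where each loop iteration plays the role of one work-item. padded_global_size and run_guarded_square are hypothetical names used only for this illustration; they are not OpenCL API calls.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local_size that is greater than or equal to count. */
size_t padded_global_size(size_t count, size_t local_size) {
    assert(local_size > 0);
    return (count + local_size - 1) / local_size * local_size;
}

/* Host-side stand-in for the guarded kernel. Returns how many work-items
 * fell outside the data and therefore did nothing. */
size_t run_guarded_square(const float *a, float *result, size_t count,
                          size_t local_size) {
    assert(a != NULL && result != NULL);
    size_t global_size = padded_global_size(count, local_size);
    size_t idle = 0;
    for (size_t id = 0; id < global_size; id++) {
        if (id < count) {
            result[id] = a[id] * a[id];
        } else {
            idle++; /* out-of-range work-item: touches no memory */
        }
    }
    return idle;
}
```

For count = 61 and local_size = 32 the padded global size is 64 and three work-items stay idle. Without the guard, those three would read and write past the end of both buffers.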
