Why is this simple OpenCL code not vectorised?

The code below does not vectorise. With 'istart = n * 1;' instead of 'istart = n * niters;' it does. With 'istart = n * 2;' it again does not.
// Kernel for ERIAS_critical_code.py
__kernel void pi(
    int niters,
    __global float* A_d,
    __global float* S_d,
    __global float* B_d)
{
    int num_wrk_items = get_local_size(0);
    int local_id = get_local_id(0);  // work item id
    int group_id = get_group_id(0);  // work group id

    float accum = 0.0f;
    int i, istart, iend, n;

    n = group_id * num_wrk_items + local_id;
    istart = n * niters;
    iend = istart + niters;
    for (i = istart; i < iend; i++) {
        accum += A_d[i] * S_d[i];
    }
    B_d[n] = accum;
    barrier(CLK_LOCAL_MEM_FENCE); // test: result is correct without this statement
}
If the code cannot be vectorised I get:
Kernel was not vectorized
If it can be:
Kernel was successfully vectorized (8)
Any idea why it is not vectorised?

When niters is 1, the for loop cycles only once, so every workitem computes its own element in a coalesced access to memory.
Coalesced access is one of the conditions for mapping N neighboring threads/workitems onto SIMD hardware, such as a unit of width 8.
When niters is greater than 1, the memory accesses of neighboring workitems are strided by niters elements, so the SIMD hardware is useless: only one memory cell per workitem can be served at a time.
When niters is 2, only a 2-fold memory bank collision happens, but with a very big niters value memory bank collisions happen much more often, making it very slow. Whether the kernel is vectorized or not then hardly matters, since its performance will be locked to the serialized memory read/write latencies.
That for loop is doing a reduction serially; you should make it parallel. There are many examples out there, so pick one and apply it to your algorithm. For example, have each workitem compute a sum between id and id+niters/2, then reduce those results between id and id+niters/4, and continue like this until at last only one workitem does the final summation of the id and id+1 elements.
If the reduction is a global one, you can do a local reduction per workgroup, then apply their results the same way in another kernel to do the global reduction.
Since you are making only partial sums per workitem, you could also do a "strided sum per workitem", where each workitem uses the same for loop but leaps by M elements, with M chosen so that it does not disturb the SIMD mapping of the kernel's workitems. M could be, say, 1/100 of the global number of elements (N), so the for loop would cycle 100 times (or N/M times). Something like this:
             time 1     time 2     time 3     time 4
workitem 1        0         15         30         45
workitem 2        1         16         31         46
workitem 3        2         17         32         47
...
workitem 15      14         29         44         59
          coalesced  coalesced  coalesced  coalesced
This completes 15 partial sums over 60 elements using 15 workitems. If the SIMD width can fit these 15 workitems, it is good.
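A minimal sketch of that coalesced strided loop, applied to the kernel above (my assumptions: the element count is passed in as a parameter and the leap M is simply the total number of workitems; pi_strided and n_elements are illustrative names, not from the original code):

// Sketch: "strided sum per workitem" with coalesced accesses. On every
// loop iteration, neighboring workitems read neighboring elements.
__kernel void pi_strided(
    int n_elements,                   // total element count (illustrative)
    __global const float* A_d,
    __global const float* S_d,
    __global float* B_d)
{
    int n = get_global_id(0);
    int stride = get_global_size(0);  // the leap M: total number of workitems
    float accum = 0.0f;
    for (int i = n; i < n_elements; i += stride) {
        accum += A_d[i] * S_d[i];     // i = n, n+M, n+2M, ... : coalesced
    }
    B_d[n] = accum;                   // one partial sum per workitem
}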
Lastly, the barrier operation is not needed, since the end of a kernel is an implicit synchronization point globally for all workitems in it. A barrier is only needed when you have to use the written results from another workitem within the same kernel.

Related

Generating an item from an ordered sequence of exponentials

I am writing a solution for the following problem.
A is a list containing all elements 2^I * 3^Q, where I and Q are nonnegative integers, in ascending order.
Write a function f such that:
f(N) returns A[N]
The first few elements are:
A[0] = 1
A[1] = 2
A[2] = 3
A[3] = 4
A[4] = 6
A[5] = 8
A[6] = 9
A[7] = 12
My solution was to generate a list containing the first 225 elements by double looping 15 times each, then sorting this list and returning A[N].
Is there any way to generate the N'th element of this sequence without creating and sorting a list first?
Here are two approaches that solve your problem without creating such a large list. Each approach has its disadvantages.
First, you could set a counter to 0, then scan all the integers from 1 on up. For each integer, divide out all the factors of 2 and of 3. If 1 remains, increment the counter; otherwise, leave the counter unchanged. When the counter reaches N you have found the value of A[N]. For example, you increase the counter for the integers 1, 2, 3, and 4, but not for 5. This approach uses very little memory but takes a lot of time.
A second approach uses a min priority queue, such as Python's heapq. Again, set a counter to zero, but also initialize the priority queue to hold only the number 1, and note that the highest power of 3 seen so far is also 1. Increment the counter, then peek at the minimum number in the queue. If the counter is N you just got your value of A[N]. Otherwise, pop that min value and immediately push double its value. (The pop and push can be done in one operation in many priority queues.) If that value was the highest power of 3 seen so far, also push three times its value and note that this new value is now the highest power of 3.
This second approach uses a priority queue that takes some memory but the largest size will only be on the order of the square root of N. I would expect the time to be roughly equal to your sorting the large list, but I am not sure. This approach has the most complicated code and requires you to have a min priority queue.
Your algorithm has the advantage of simplicity and the disadvantage of a large list. In fact, given N it is not at all obvious what the maximum powers of 2 and of 3 are, so you would be required to make the list much larger than needed. For example, your case of calculating "the first 225 elements by double looping 15 times each" actually only works up to N=82.
Below I have Python code for all three approaches. Using timeit for N=200 I got these timings:
1.19 ms for sorting a long list (your approach) (powerof2=powerof3=30)
8.44 s for factoring increasing integers
88 µs for the min priority queue (maximum size of the queue was 17)
The priority queue wins by a large margin--much larger than I expected. Here is the Python 3.6.4 code for all three approaches:
"""A is a list containing all elements 2^I * 3^Q where I and Q are
integers in an ascending order. Write a function f such that
f(N) returns A[N].
Do this without actually building the list A.
Based on the question <https://stackoverflow.com/questions/49615681/
generating-an-item-from-an-ordered-sequence-of-exponentials>
"""
import heapq # min priority queue
def ordered_exponential_0(N, powerof2, powerof3):
"""Return the Nth (zero-based) product of powers of 2 and 3.
This uses the questioner's algorithm
"""
A = [2**p2 * 3**p3 for p2 in range(powerof2) for p3 in range(powerof3)]
A.sort()
return A[N]
def ordered_exponential_1(N):
"""Return the Nth (zero-based) product of powers of 2 and 3.
This uses the algorithm of factoring increasing integers.
"""
i = 0
result = 1
while i < N:
result += 1
num = result
while num % 2 == 0:
num //= 2
while num % 3 == 0:
num //= 3
if num == 1:
i += 1
return result
def ordered_exponential_2(N):
"""Return the Nth (zero-based) product of powers of 2 and 3.
This uses the algorithm using a priority queue.
"""
i = 0
powerproducts = [1] # initialize min priority queue to only 1
highestpowerof3 = 1
while i < N:
powerproduct = powerproducts[0] # next product of powers of 2 & 3
heapq.heapreplace(powerproducts, 2 * powerproduct)
if powerproduct == highestpowerof3:
highestpowerof3 *= 3
heapq.heappush(powerproducts, highestpowerof3)
i += 1
return powerproducts[0]

OpenCL: multiple work items saving results to the same global memory address

I'm trying to do a reduce-like cumulative calculation where 4 different values need to be stored depending on certain conditions. My kernel receives long arrays as input and needs to store only 4 values, which are "global sums" obtained from each data point on the input. For example, I need to store the sum of all the data values satisfying certain condition, and the number of data points that satisfy said condition. The kernel is below to make it clearer:
__kernel void photometry(__global float* stamp,
                         __constant float* dark,
                         __global float* output)
{
    int x = get_global_id(0);
    int s = n * n;
    if (x < s) {
        float2 curr_px = (float2)((x / n), (x % n));
        float2 center = (float2)(centerX, centerY);
        int dist = (int)fast_distance(center, curr_px);
        if (dist < aperture) {
            output[0] += stamp[x] - dark[x];
            output[1]++;
        } else if (dist > sky_inner && dist < sky_outer) {
            output[2] += stamp[x] - dark[x];
            output[3]++;
        }
    }
}
All the values not declared in the kernel are previously defined by macros. s is the length of the input arrays stamp and dark, which are nxn matrices flattened down to 1D.
I get results but they are different from those of my CPU version. Of course I am wondering: is this the right way to do what I'm trying to do? Can I be sure that each pixel's data is only being added once? I can't think of any other way to save the cumulative result values.
Atomic operations are needed in your case; otherwise data races will make the results unpredictable.
The problem is here:
output[0] += stamp[x]-dark[x];
output[1]++;
You can imagine that threads in the same wave execute in lockstep, so things might appear OK for threads inside the same wave: they read the same output[0] value using a global load instruction (a broadcast), and when they finish the computation and try to store data into the same memory address (output[0]), the write operations are serialized. Up to this point, you may still get correct results (for the work items inside the same wave).
However, your program very likely launches more than one wave (in most applications this is the case). Different waves may execute in an unknown order, and when they access the same memory address the behavior becomes more complicated. For example, wave0 and wave1 may both read output[0] in the beginning, before any other computation happens; that means they fetch the same value from output[0] and then start computing. After the computation, both save their accumulated results into output[0]; the result from one of the waves is overwritten by the other, as if only the wave that writes memory later was executed at all. Imagine how many more waves a real application has, and it is not strange to get a wrong result.
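A common workaround is to make those updates atomic. As a sketch (my assumptions: OpenCL 1.x, where the built-in atomic_add covers 32-bit integers but not floats, hence the well-known atomic_cmpxchg loop; atomic_add_float is an illustrative name, not a built-in):

// Hypothetical helper: atomically add a float in global memory via the
// classic compare-and-swap loop (32-bit global int atomics are core in
// OpenCL 1.1+; on 1.0 the corresponding extension pragma is required).
inline void atomic_add_float(volatile __global float* addr, float val)
{
    union { unsigned int u; float f; } expected, desired;
    do {
        expected.f = *addr;             // snapshot the current value
        desired.f  = expected.f + val;  // value we want to store
    } while (atomic_cmpxchg((volatile __global unsigned int*)addr,
                            expected.u, desired.u) != expected.u);
}

// The racy updates in the kernel would then become, for example:
//     atomic_add_float(&output[0], stamp[x] - dark[x]);
//     atomic_add_float(&output[1], 1.0f);  // output is a float buffer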
You can also do this concurrently in O(log2(n)). A concept example:
You have 16 inputs (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) and you want their sum computed concurrently.
You can concurrently sum 1 into 2, 3 into 4, 5 into 6, 7 into 8, 9 into 10, 11 into 12, 13 into 14, 15 into 16;
then concurrently sum 2 into 4, 6 into 8, 10 into 12, 14 into 16;
then concurrently 4 into 8 and 12 into 16;
and finally 8 into 16.
Everything is done in O(log2(n)), in our case in 4 passes.
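In OpenCL, a minimal sketch of those passes in local memory (my assumptions: each work item has already stored one input into a __local float array temp and a barrier has been issued, the work group size is a power of two, and the sketch accumulates into the lower half, the mirror image of the description above):

// Pairwise tree reduction: each pass halves the number of active work items.
for (uint offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
    if (get_local_id(0) < offset) {
        temp[get_local_id(0)] += temp[get_local_id(0) + offset];
    }
    barrier(CLK_LOCAL_MEM_FENCE);  // finish this pass before the next one
}
// temp[0] now holds the total: log2(n) passes, i.e. 4 for the 16 inputs above.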

How to get a sum array from array

I'm new to OpenCL.
I know how to sum a 1D array, but my question is how to get a sum array (running sums) from a 1D array in OpenCL.
int a[1000];
int b[1000];
....  // save data to a
for (int i = 0; i < 1000; i++) {
    int sum = 0;
    for (int j = 0; j < i; j++) {
        sum += a[j];
    }
    b[i] = sum;
}
Any suggestion is welcome.
As others have mentioned, what you want to do is use an inclusive parallel prefix sum. If you're allowed to use OpenCL 2, it has a workgroup function for it. They should have had it in there from the start, given how often it is used; as it stands, everybody implements it themselves, often poorly in one way or another.
See Parallel Prefix Sum (Scan) with CUDA for the typical algorithms for teaching this.
At the number of elements you mention it really makes no sense to use multiple compute units, meaning you will attack it with a single compute unit: just repeat the loop twice or so, and at a workgroup size of 64-256 you'll have the sums of that many elements very quickly. Building on the workgroup functions to get generic versions for any size is left as an exercise to the reader.
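A minimal sketch of the OpenCL 2.0 route (my assumptions: a single work group at least as large as the array, padded with zeros past n; work_group_scan_inclusive_add is the built-in workgroup function referred to above, and its exclusive sibling work_group_scan_exclusive_add would match the questioner's j < i loop exactly):

// Sketch: workgroup-level inclusive prefix sum for a small array.
__kernel void prefix_sum_small(__global const int* a, __global int* b, int n)
{
    int lid = get_local_id(0);
    int x = (lid < n) ? a[lid] : 0;            // pad with the identity past n
    int s = work_group_scan_inclusive_add(x);  // collective: all items call it
    if (lid < n) {
        b[lid] = s;
    }
}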
This is a sequential problem. Expressed in another way:
b[1] = a[0]
b[2] = b[1] + a[1]
b[3] = b[2] + a[2]
...
b[999] = b[998] + a[998]
Therefore, having multiple threads won't help you at all.
The most efficient way of doing that is using a single CPU, and not OpenCL/CUDA/OpenMP...
This problem is completely different from a reduction, where every step can be divided into 2 smaller steps that can be run in parallel.

Kernel attributes in the two-stage reduction example in OpenCL proposed by AMD

I have some problems understanding the two-stage reduction algorithm described here.
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

    int global_index = get_global_id(0);
    float accumulator = INFINITY;

    // Loop sequentially over chunks of input vector
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }

    // Perform parallel reduction
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int offset = get_local_size(0) / 2;
         offset > 0;
         offset = offset / 2) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}
I understand the basic idea, but I am not sure about the while loop. As far as I understand, the parameter length specifies the number of elements in the buffer, i.e. how many elements I want to process in total. But get_global_size returns the global number of work items. Aren't length and get_global_size equal then? That would mean the while-loop condition is satisfied only once. Shouldn't we use get_local_size instead of get_global_size?
Aren't length and get_global_size equal then?
Not necessarily. It is common to launch fewer work items than there are data elements, and have each work item process more than one element. This way, you can decouple your input data size from the number of work items.
In this case, the following:
// Loop sequentially over chunks of input vector
while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
}
Performs a min-reduction of an array that resides in global memory. Basically, the work group will "slide" over the input vector, and at every iteration, each work item will update its minimum.
Here's a fictitious numerical example where we launch 2 work groups of 4 work items over an array of 20 elements. xN represents the Nth element of the input array; aN and bN represent the Nth work item of work groups a and b, respectively. The while condition is therefore met 2 or 3 times, depending on the work item id:
length:            20
get_global_size(): 8
get_local_size():  4

x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19   Input array
---------------------------------------------------------------------   Iterations
a0 a1 a2 a3 b0 b1 b2 b3                                                  0
                        a0 a1 a2  a3  b0  b1  b2  b3                     1
                                                      a0  a1  a2  a3     2
When the while loop finishes, every work item will have computed a minimum over a subset of the input array. For example, a0 will have computed min(x0, x8, x16) and b0 will have computed min(x4, x12).
Then the work items write the minimum they computed to local memory, and the work groups proceed to do a min-reduction (with a reduction tree) in local memory. Their result is written back to global memory, and presumably the kernel is called again with result as the new array to min-reduce, until the final result is a single element.
The global size may be larger than the length because in OpenCL 1.x the global size must be a whole number multiple of the work group size. Therefore the global size might have been rounded up from the data size (length). For example, if length was 1000 but the work group size was 128 then the global size would be 1024.
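As a hypothetical illustration of that rounding on the host side (plain C; the variable names are illustrative):

/* Round the data size up to the next multiple of the work group size,
   as OpenCL 1.x requires. length = 1000 -> global_size = 1024. */
size_t local_size  = 128;
size_t global_size = ((length + local_size - 1) / local_size) * local_size;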
[FULL Description]
Overview:
This is a two-stage reduction, which outperforms a recursive multi-stage reduction by reducing the number of synchronizations/barriers and the overhead, and by keeping all the compute units as busy as possible. Before studying the kernel, it is important to understand the work item and work group configuration set by the host program and the parameters of the kernel. In this example, the task is to find the min value of N float numbers. The configuration is given below.
Setup:
The work group configuration: the host sets up K work items (K < N) and P work groups, each with Q work items, where K = P*Q. It is preferable that N % K == 0, but it is not necessary.
The parameters and dimensions of the kernel are: 1) the first argument is an array containing the N data elements (the candidates for finding the min); 2) the second argument is an empty local array of size Q; 3) the value of length is equal to N; and 4) result is an array of size P.
Workflow: Stage 1
The work flow is as given below:
If N % K == 0, each work item initially finds the minimum value among N/K data elements, where those elements are K items apart. The while loop does this task. If N % K != 0, some of the work items compute the min of ceil(N/K) elements and the rest compute the min of floor(N/K) elements (as explained in the above answer by Kretab Chabawenizc).
The finding of each work item is initially kept in the private variable accumulator and finally saved into the local array scratch. Once all the work items are done with this part of the work (ensured by barrier(CLK_LOCAL_MEM_FENCE)), the kernel starts acting as a recursive parallel reduction kernel. The work items of a specific work group treat scratch as the data array, and each pass of the for loop halves it (read the actual AMD documentation for more explanation).
Finally, the first P elements of result will contain the minimum value found by each of the P work groups.
Workflow: Stage 2
Now the second stage starts: this time the same kernel can be invoked with P work items and 1 work group. The result array will be the first argument of the kernel, and a one-element array will be the last argument, to receive the final result.
In this run the while loop does nothing significant; it just copies the values from the buffer to scratch. You could therefore come up with a more optimized kernel and use that for the second stage.
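A hypothetical host-side sketch of that second invocation (the handle and buffer names are illustrative; the four arguments mirror the kernel signature above):

/* Stage 2: reduce the P stage-1 partial results with P work items
   in a single work group. */
size_t global = P, local = P;
cl_int len = P;
clSetKernelArg(reduce_kernel, 0, sizeof(cl_mem), &stage1_result); /* buffer  */
clSetKernelArg(reduce_kernel, 1, P * sizeof(cl_float), NULL);     /* scratch */
clSetKernelArg(reduce_kernel, 2, sizeof(cl_int), &len);           /* length  */
clSetKernelArg(reduce_kernel, 3, sizeof(cl_mem), &final_result);  /* result  */
clEnqueueNDRangeKernel(queue, reduce_kernel, 1, NULL, &global, &local,
                       0, NULL, NULL);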

Do global_work_size and local_work_size have any effect on application logic?

I am trying to understand how all of the different parameters for dimensions fit together in OpenCL. If my question isn't clear that's partly because a well formed question requires bits of the answer which I don't have.
How do work_dim, global_work_size, and local_work_size work together to create the execution space that you use in a kernel? For example, if I make work_dim 2 then I can
get_global_id(0);
get_global_id(1);
I can divide those two dimensions up into n Work Groups using global_work_size, right? So if I make the global_work_size like so
size_t global_work_size[] = { 4 };
Then each dimension would have 4 work groups for a total of 8? But, as a beginner, I am only using global_id for my indices so only the global id's matter anyway. As you can tell I am pretty confused about all of this so any help you can offer would ...help.
Since you stated yourself that you are a bit confused about the concepts involved in the execution space, I'm going to try to summarize them before answering your question, and give some examples.
The threads/workitems are organized in an NDRange, which can be viewed as a grid of 1, 2, or 3 dimensions.
The NDRange is mainly used to map each thread to the piece of data it will have to manipulate. Therefore each thread must be uniquely identified, and a thread must know which one it is and where it stands in the NDRange. That is where the work-item built-in functions come in: they can be called by all threads to give them information about themselves and the NDRange where they stand.
The dimensions:
As already stated, an NDRange can have up to 3 dimensions. So if you set the dimensions this way:
size_t global_work_size[2] = { 4, 4 };
It doesn't mean that each dimension has 4 work groups for a total of 8, but that you'll have 4 * 4, i.e. 16 threads in your NDRange. These threads are arranged in a "square" with sides of 4 units. The workitems can find out how many dimensions the NDRange has using the uint get_work_dim() function.
The global size:
Threads can also query how big the NDRange is in a specific dimension with size_t get_global_size(uint D). Therefore they can know how big the "line/square/rectangle/cube" NDRange is.
The global unique identifiers:
Thanks to that organization, each thread can be uniquely identified by indexes corresponding to the specific dimensions. Hence the thread (2, 1) refers to the thread in the 3rd column and the 2nd row of a 2D range. The function size_t get_global_id(uint D) is used in the kernel to query a thread's id.
The workgroup (or local) size:
The NDRange can be split into smaller groups called workgroups. This is the local_work_size you were referring to, which also (and logically) has up to 3 dimensions. Note that for OpenCL versions below 2.0, the NDRange size in a given dimension must be a multiple of the workgroup size in that dimension. So, to keep your example, since we have 4 threads in dimension 0, the workgroup size in dimension 0 can be 1, 2, or 4, but not 3. Similarly to the global size, threads can query the local size with size_t get_local_size(uint D).
The local unique identifiers:
Sometimes it is important for a thread to be uniquely identified within its workgroup. Hence the function size_t get_local_id(uint D). Note the "within" in the previous sentence: a thread with the local id (1, 0) will be the only one with that id in its (2D) workgroup, but there will be as many threads with local id (1, 0) as there are workgroups in the NDRange.
The number of groups:
Speaking of groups, a thread might sometimes need to know how many groups there are. That's why the function size_t get_num_groups(uint D) exists. Note that again you have to pass as the parameter the dimension you are interested in.
Each group has also an id:
...that you can query within a kernel with the function size_t get_group_id(uint D). Note that the format of the group ids is similar to that of the threads: tuples of up to 3 elements.
Summary:
To wrap things up a bit, if you have a 2D NDRange of a global work size of (4, 6) and a local work size of (2, 2) it means that:
the global size in the dimension 0 will be 4
the global size in the dimension 1 will be 6
the local size (or workgroup size) in the dimension 0 will be 2
the local size (or workgroup size) in the dimension 1 will be 2
the thread global ids in the dimension 0 will range from 0 to 3
the thread global ids in the dimension 1 will range from 0 to 5
the thread local ids in the dimension 0 will range from 0 to 1
the thread local ids in the dimension 1 will range from 0 to 1
The total number of threads in the NDRange will be 4 * 6 = 24
The total number of threads in a workgroup will be 2 * 2 = 4
The total number of workgroups will be (4/2) * (6/2) = 6
the group ids in the dimension 0 will range from 0 to 1
the group ids in the dimension 1 will range from 0 to 2
there will be only one thread with the global id (0, 0), but there will be 6 threads with the local id (0, 0), because there are 6 groups.
Example:
Here is a dummy example to use all these concepts together (note that performance would be terrible, it's just a stupid example).
Let's say you have a 2D array of 6 rows and 4 columns of int. You want to group these elements in square of 2 by 2 elements and sum them up in such a way that for instance, the elements (0, 0), (0, 1), (1, 0), (1, 1) will be in one group (hope it's clear enough). Because you'll have 6 "squares" you'll have 6 results for the sums, so you'll need an array of 6 elements to store these results.
To solve this, you use the 2D NDRange detailed just above. Each thread fetches one element from global memory and stores it in local memory. Then, after a synchronization, only one thread per workgroup, let's say each local (0, 0) thread, sums the elements up (in local memory) and then stores the result at a specific place in a 6-element array (in global memory).
//in is a 24 int array, result is a 6 int array, temp is a 4 int array
kernel void foo(global int *in, global int *result, local int *temp){
    //use vectors for conciseness
    int2 globalId    = (int2)(get_global_id(0),   get_global_id(1));
    int2 localId     = (int2)(get_local_id(0),    get_local_id(1));
    int2 groupId     = (int2)(get_group_id(0),    get_group_id(1));
    int2 globalSize  = (int2)(get_global_size(0), get_global_size(1));
    int2 localSize   = (int2)(get_local_size(0),  get_local_size(1));
    int2 numberOfGrp = (int2)(get_num_groups(0),  get_num_groups(1));

    //Read from global and store to local
    temp[localId.x + localId.y * localSize.x] = in[globalId.x + globalId.y * globalSize.x];

    //Sync
    barrier(CLK_LOCAL_MEM_FENCE);

    //Only the threads with local id (0, 0) sum the elements up
    if(localId.x == 0 && localId.y == 0){
        int sum = 0;
        for(int i = 0; i < localSize.x * localSize.y; i++){
            sum += temp[i];
        }
        //store the result in global
        result[groupId.x + numberOfGrp.x * groupId.y] = sum;
    }
}
And finally, to answer your question: do global_work_size and local_work_size have any effect on application logic?
Usually yes, because it's part of the way you design your algorithm. Note that the size of the workgroup is not chosen randomly but matches my need (here, 2-by-2 squares).
Note also that if you decided to use a 1-dimensional NDRange of size 24 with a local size of 4 in that dimension, it would also screw things up, because the kernel was designed to use 2 dimensions.
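For completeness, a hypothetical sketch of the matching host-side launch (plain C; queue and kernel are illustrative handles, and the two buffer arguments are assumed to be set elsewhere):

/* 2D launch matching the kernel above: 4 columns x 6 rows of threads,
   split into 2-by-2 workgroups (6 groups in total). */
size_t global_work_size[2] = { 4, 6 };
size_t local_work_size[2]  = { 2, 2 };
clSetKernelArg(kernel, 2, 4 * sizeof(cl_int), NULL);  /* local temp: 4 ints */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                       global_work_size, local_work_size, 0, NULL, NULL);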
