OpenCL benchmark - advice about parameters to vary

I would like to run a runtime benchmark of the two-stage sum reduction in OpenCL (from this AMD link) on a Radeon HD 7970 Tahiti XT.
Initially, I took a first version of the code that did not use the first loop, which reduces an input array of size N down to an output array of size NworkItems. Here is this first loop in the kernel code:
int global_index = get_global_id(0);
float accumulator = 0;
// Loop sequentially over chunks of the input vector
while (global_index < length) {
    float element = buffer[global_index];
    accumulator += element;
    global_index += get_global_size(0);
}
With this first version, I measured the runtime as a function of the input array size (which is equal to the total number of threads) and for different work group sizes. Here are the results:
Now, I would like to run a benchmark that uses the initial loop above, but I don't know which parameters I should vary.
From this link, AMD recommends a work group size that is a multiple of 64 (32 for NVIDIA).
Moreover, the last comment on this other link recommends setting the work group size as: WorkGroup size = (total number of threads) / (number of compute units).
On my GPU card, I have 32 compute units.
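For reference, the number of compute units can be queried at runtime; a minimal host-side sketch, where device is assumed to be an already-selected cl_device_id:
cl_uint compute_units = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
// e.g. 32 on a Radeon HD 7970; the recommendation above divides the
// total number of work items by this value to get the work group size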
So I would like advice on which parameters would be interesting to vary in order to compare runtimes in this second version (with the first reduction loop). For example, I could take different values for the ratio (N, the input array size) / (total NworkItems) with a fixed work group size (see the expression above),
or do the contrary, i.e. should I vary the work group size and fix the ratio (N, the input array size) / (total NworkItems)?

You should make each work group sum data that is contiguous to it instead of data spread across the whole array, to aid memory transfer (coalesced data access). So use this instead:
int chunk_size = length/get_global_size(0) + (length%get_global_size(0) > 0); // How many items each work item needs to process
int global_index = get_group_id(0)*get_local_size(0)*chunk_size + get_local_id(0); // Start at this address for this work item
float accumulator = 0;
// Loop sequentially over this work group's chunk of the input vector
for (int i = 0; i < chunk_size; i++) {
    if (global_index < length) {
        float element = buffer[global_index];
        accumulator += element;
        global_index += get_local_size(0);
    }
}
Also you should use sizes that are powers of two, to help caching.
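For completeness, here is a minimal sketch of how this accumulation loop would typically feed the second (in-group tree reduction) stage described in the AMD article; the scratch buffer and result array names are illustrative, not taken from the question's code:
__kernel void reduce(__global const float* buffer,
                     __local float* scratch,
                     const int length,
                     __global float* result)
{
    // Stage 1: each work item accumulates its chunk (same indexing as above)
    int chunk_size = length/get_global_size(0) + (length%get_global_size(0) > 0);
    int global_index = get_group_id(0)*get_local_size(0)*chunk_size + get_local_id(0);
    float accumulator = 0;
    for (int i = 0; i < chunk_size; i++) {
        if (global_index < length) {
            accumulator += buffer[global_index];
            global_index += get_local_size(0);
        }
    }

    // Stage 2: tree reduction of the per-work-item sums within the work group
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (local_index < offset)
            scratch[local_index] += scratch[local_index + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_index == 0)
        result[get_group_id(0)] = scratch[0];   // one partial sum per work group
}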

Related

OpenCL global work-item operation priority

I want to know the priority of index counting for the following code snippet (a simple 2-dimensional matrix multiplication routine).
kernel void mmul(
    const int N,
    global float* A,
    global float* B,
    global float* C)
{
    int k;
    int i = get_global_id(0);
    int j = get_global_id(1);
    float tmp;
    if ((i < N) && (j < N))
    {
        tmp = 0.0f;
        for (k = 0; k < N; k++)
            tmp += A[i*N+k] * B[k*N+j];
        C[i*N+j] = tmp;
    }
}
If you look inside the for loop with the 'k' counter, you can see the global work-item indexes 'i' and 'j' used on the same line. I want to know which of them has priority in terms of counting the indexes (e.g. 1, 2, 3, 4, ..., n) of 'i' and 'j'. I don't understand how this works, as I am new to OpenCL; if I were using plain C or Python, I would use nested for loops for this type of operation, something like the sketch below.
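For reference, this is roughly the plain C nested-loop version I have in mind (just an illustration): every (i, j) pair of the two outer loops becomes one independent work-item in the OpenCL version.
for (int i = 0; i < N; i++) {          // replaced by get_global_id(0)
    for (int j = 0; j < N; j++) {      // replaced by get_global_id(1)
        float tmp = 0.0f;
        for (int k = 0; k < N; k++)
            tmp += A[i*N+k] * B[k*N+j];
        C[i*N+j] = tmp;
    }
}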
Can someone explain how global work-items work?
Thank you.
You should focus more on memory read/write ordering than on work-item issue order. To enforce an order on memory operations, use mem_fence (within a work-item), barrier (within a work-group), or even separate kernels (a sync point for all work-items). Deliberate empty for-loops or atomic functions cannot guarantee a memory read/write order; only memory fences, barriers and kernel boundaries can.
There is no priority for any work-item (to start or finish running), but they are grouped and executed on compute units, which have many threads to run them. There is no guarantee that work-item (i, j) will execute before (i+1, j+1), but there is a guarantee that they will be executed on the same compute unit (with cores sharing the L1 cache) if they are in the same work-group (of size 16x16, for example) on Nvidia and AMD GPUs.
Being executed on the same compute unit increases the chance of being issued at the same time, which is not a priority of course, but sharing resources like the L1 cache means high performance.
Even within the same work-group, there is no guarantee that one work-item is issued before another, but they are more likely to run at the same time if they are on the same SIMD unit (such as the 16-wide units in AMD GPUs).
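To illustrate the point about barriers, here is a minimal hypothetical kernel (not related to the matrix multiplication above): every work-item first writes its own slot of a local buffer, and the barrier guarantees those writes are visible before any work-item reads a neighbour's slot.
__kernel void neighbour_sum(__global const float* in,
                            __global float* out,
                            __local float* tile)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];              // every work item writes its own element
    barrier(CLK_LOCAL_MEM_FENCE);     // all local writes complete before any read below

    int next = (lid + 1) % get_local_size(0);
    out[gid] = tile[lid] + tile[next];   // safe: the neighbour's write is guaranteed visible
}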

OpenCL: 3D array processing - Global size limit

I'm working with a 3D array of dimensions xdim=49, ydim=1024 and zdim=64. My DEVICE_MAX_WORK_ITEM_SIZES is only 512/512/512. If I declare
size_t global_work_size[] = {xdim, ydim, zdim};
and launch a 3D kernel, I get wrong results, since my ydim > 512. If all my dimensions are below 512, I get the expected results. Please let me know if there's an alternative for this.
CL_DEVICE_MAX_WORK_ITEM_SIZES only limits the size of work groups, not the global work item size (yeah, it's a terrible name for the constant). You are much more tightly restricted by CL_DEVICE_MAX_WORK_GROUP_SIZE, which is the total number of items allowed in a work group (you'd typically hit this far sooner than CL_DEVICE_MAX_WORK_ITEM_SIZES because the per-dimension sizes multiply together).
So go ahead and launch your global work size of 49 x 1024 x 64. It should work. If it doesn't, you're using get_local_id instead of get_global_id or have some other bug. We regularly launch 2D kernels with a 4096 x 4096 global work size.
See also Questions about global and local work size
If you don't use shared local memory, you don't need to worry about local work group sizes. In fact, you can pass NULL instead of a pointer to an array of sizes for local_work_size and let the runtime pick something (it helps if your global dimensions are easily divisible by small numbers).
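For example, a minimal host-side sketch (queue and kernel are assumed to be an already-created command queue and kernel):
size_t global_work_size[3] = {49, 1024, 64};
cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    3,                  // work_dim
                                    NULL,               // no global offset
                                    global_work_size,
                                    NULL,               // let the runtime pick the local size
                                    0, NULL, NULL);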
Assuming the dimensions you provided are the size of your data, you can decrease the global work size by making each GPU thread calculate more data. What I mean is, every thread in your case currently does one calculation; if you change your kernel to do, let's say, 2 calculations in the y dimension, then you could cut the number of threads you are firing in half. The global_work_size decides how many threads you execute in each direction. Let me give you an example:
Let's assume you have an array you want to do some calculations on, and its size is 2048. If you write your kernel in the following way, you are going to need 2048 as the global_work_size:
__kernel void calc (__global int *A, __global int *B)
{
    int i = get_global_id(0);
    B[i] = A[i] * 5;
}
The global work size in this case will be:
size_t global_work_size = {2048, 1, 1};
However, if you change your kernel into the following kernel, you can lower your global work size as well:
__kernel void new_calc (__global int *A, __global int *B)
{
    int i = get_global_id(0);
    for (int ind = 0; ind < 8; ind++)
        B[i*8 + ind] = A[i*8 + ind] * 5;
}
This way, you can use a global size of:
size_t global_work_size = {256, 1, 1};
Also, with the second kernel, each of your threads executes more work, resulting in better utilisation.

OpenCL histogram with many bins

I am using the code presented in Chapter 14 of the OpenCL Programming Guide to calculate a histogram. It works fine for 256 bins, but unfortunately I need 65536 bins for my application. This leads to the problem that, with this approach, the local array gets too big.
local uint tmp_histogram[256 * 256];
As a result, the program is not built (CL_BUILD_PROGRAM_FAILURE).
Do you have any ideas how this issue can be solved? I thought of using multiple kernels to compute the values for the different bins (i.e. to split the histogram, so that I first compute the values for the bins 0-255, then for 256-511, etc.). However, in this case I will have to check if a value is within that range before incrementing, which means that I will need conditionals...
Using global memory would solve the problem, but would not result in a very fast kernel. I suggest creating multiple work groups, and using each group to count a range of values only.
#define RANGE_SIZE 8192
kernel void histo(__global const uint *data, const int dataSize){
    int wid = get_local_id(0);
    int wSize = get_local_size(0);
    int gid = get_group_id(0);
    // each work group counts one contiguous range of RANGE_SIZE values
    int rangeStart = gid * RANGE_SIZE;
    int rangeEnd = (gid+1) * RANGE_SIZE;
    local uint tmp_histogram[RANGE_SIZE];
    // zero the local histogram before counting
    for(int i = wid; i < RANGE_SIZE; i += wSize){
        tmp_histogram[i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    uint value;
    for(int i = wid; i < dataSize; i += wSize){
        value = data[i];
        if(value >= rangeStart && value < rangeEnd){
            atomic_inc(&tmp_histogram[value - rangeStart]);
        }
    }
    //barrier...
    //use the local data here
}
This assumes 32 kB of local memory is available. If you reduce RANGE_SIZE, it does not have to be a power of two, but you do need to make sure you are calling the kernel with enough work groups to hit all values up to 64k.
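For example, with RANGE_SIZE of 8192 a matching launch could look like this on the host (queue and kernel are placeholders):
size_t local_size  = 256;
size_t num_groups  = 65536 / 8192;               // 8 work groups, one per range of bin values
size_t global_size = num_groups * local_size;    // 2048 work items in total
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);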
Move your histogram to global storage.
A further solution could be to use unsigned short, if this size suits your application.
Lastly, you could run your code twice: first for the lower 32768 values, then for the upper half.
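For the global-memory variant mentioned above, a minimal sketch could look like this (the histogram buffer is assumed to hold 65536 uints zeroed by the host; it relies on global 32-bit atomics, which OpenCL 1.1+ devices provide):
__kernel void histo_global(__global const uint *data, const int dataSize,
                           __global uint *histogram)
{
    for (int i = get_global_id(0); i < dataSize; i += get_global_size(0)) {
        atomic_inc(&histogram[data[i]]);    // one atomic increment per input value
    }
}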

Speedup when using float4 in OpenCL

I have the following OpenCL kernel function to get the column sums of an image.
__kernel void columnSum(__global float* src, __global float* dst, int srcCols,
                        int srcRows, int srcStep, int dstStep)
{
    const int x = get_global_id(0);
    srcStep >>= 2;
    dstStep >>= 2;

    if (x < srcCols)
    {
        int srcIdx = x;
        int dstIdx = x;
        float sum = 0;
        for (int y = 0; y < srcRows; ++y)
        {
            sum += src[srcIdx];
            dst[dstIdx] = sum;
            srcIdx += srcStep;
            dstIdx += dstStep;
        }
    }
}
I assign each thread to process one column here, so that many threads can compute the column sums of all columns in parallel.
I also rewrote the above kernel using float4, so that each thread can read 4 elements in a row at a time from the source image, as shown below.
__kernel void columnSum(__global float* src, __global float* dst, int srcCols,
                        int srcRows, int srcStep, int dstStep)
{
    const int x = get_global_id(0);
    srcStep >>= 2;
    dstStep >>= 2;

    if (x < srcCols/4)
    {
        int srcIdx = x;
        int dstIdx = x;
        float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        for (int y = 0; y < srcRows; ++y)
        {
            float4 temp2;
            temp2 = vload4(0, &src[4 * srcIdx]);
            sum = sum + temp2;
            vstore4(sum, 0, &dst[4 * dstIdx]);
            srcIdx += (srcStep/4);
            dstIdx += (dstStep/4);
        }
    }
}
In this case, theoretically, I would expect the time taken by the second kernel to process an image to be 1/4 of the time taken by the first kernel. However, no matter how large the image is, the two kernels take almost the same time. I don't know why. Can you give me some ideas?
OpenCL vector data types like float4 were a better fit for older GPU architectures, especially AMD's GPUs. Modern GPUs don't have SIMD registers available for individual work-items; they are scalar in that respect. CL_DEVICE_PREFERRED_VECTOR_WIDTH_* equals 1 for the OpenCL driver on an NVIDIA Kepler GPU and on Intel HD integrated graphics. So adding float4 vectors on a modern GPU requires 4 operations. On the other hand, the OpenCL driver on an Intel Core CPU has CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT equal to 4, so these vectors can be added in a single step.
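You can check this on your own device with a standard clGetDeviceInfo query (device is assumed to be the cl_device_id you run on):
cl_uint width = 0;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                sizeof(width), &width, NULL);
// width is typically 1 on recent NVIDIA/AMD GPUs and 4 or more on CPU drivers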
You are directly reading the values from the "src" array (global memory), which is typically about 400 times slower than private memory. Your bottleneck is definitely the memory access, not the "add" operation itself.
When you move from float to float4, the vector operation (add/multiply/...) is more efficient thanks to the ability of the GPU to operate on vectors. However, the reads/writes to global memory remain the same.
And since that is the main bottleneck, you will not see any speedup at all.
If you want to speed up your algorithm, you should move to local memory. However, you have to manually handle the memory management and choose a proper block size.
Which architecture do you use?
Using float4 gives higher instruction-level parallelism (and therefore requires 4 times fewer threads), so it should theoretically be faster (see http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf).
However, did I understand correctly that in your kernel you are doing a prefix sum (you store the partial sum at every iteration of y)? If so, because of the stores, the bottleneck is the memory writes.
I think on the GPU float4 is not a SIMD operation in OpenCL. In other words, if you add two float4 values, the sum is done in four steps rather than all at once. floatn is really designed for the CPU. On the GPU, floatn serves only as convenient syntax, at least on Nvidia cards. Each thread on the GPU acts as if it were a scalar processor without SIMD. But the threads in a warp are not independent like they are on the CPU. The right way to think of the GPGPU model is Single Instruction Multiple Threads (SIMT).
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Have you tried running your code on the CPU? I think the float4 code should run quicker (potentially four times quicker) than the scalar code on the CPU. Also, if you have a CPU with AVX, you should try float8. If the float4 code is faster than the scalar code on the CPU, then float8 should be even faster on a CPU with AVX.
Try adding an __attribute__ to the kernel and see how the run time changes.
For example, try defining:
__kernel __attribute__((vec_type_hint(int)))
or
__kernel __attribute__((vec_type_hint(int4)))
or some floatN, as you want.
read more:
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/functionQualifiers.html
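For instance, a complete hypothetical kernel with the hint placed between __kernel and the return type, as the spec shows:
__kernel __attribute__((vec_type_hint(float4)))
void scale4(__global const float4 *src, __global float4 *dst)
{
    int i = get_global_id(0);
    dst[i] = src[i] * 2.0f;   // the hint tells the compiler the kernel is written in float4 terms
}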

OpenCL: are out-of-bounds checks important in kernels?

I have seen solutions like this:
kernel void dp_square (global const float *a,
                       global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * a[id];
}
and
kernel void dp_square (global const float *a,
                       global float *result, const unsigned int count)
{
    int id = get_global_id(0);
    if(id < count)
        result[id] = a[id] * a[id];
}
Is the check for id < count important? What happens if a kernel work item tries to process an item that is not available?
Can the reason for it not being there in the first example be that the programmer simply ensures that the global size is equal to the number of elements to be processed (is this normal)?
This is often done for two reasons:
To ensure that a developer error doesn't kill the code or read bad memory
Because sometimes it is optimal to run more work-items than there are data points. For example, if the optimal work-group size for my device is 32 (not uncommon) and I have an array of 61 pieces of data, I'll run 64 work-items, and the last three will simply "play dead."
In order to leave this check out, you'd have to use a work-group size that divides the total number of work-items (which would then have to equal the number of data elements). In this case, that would leave you with a work-group size of 1 (as 61 is prime), which would be very slow!
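On the host, the usual pattern looks roughly like this (queue and kernel are placeholders; the count argument is index 2 in the second kernel above): round the global size up to a multiple of the work-group size and let the in-kernel check discard the padding items.
size_t local_size  = 32;                      // optimal work-group size for the device
size_t count       = 61;                      // actual number of elements
size_t global_size = ((count + local_size - 1) / local_size) * local_size;   // = 64

cl_uint n = (cl_uint)count;
clSetKernelArg(kernel, 2, sizeof(cl_uint), &n);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);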
