OpenCL autocorrelation kernel

I have written a simple program that computes an autocorrelation, shown below. I used PGI Accelerator directives to move the computation to the GPU.
//autocorrelation
void autocorr(float *restrict A, float *restrict C, int N)
{
int i, j;
float sum;
#pragma acc region
{
for (i = 0; i < N; i++) {
sum = 0.0;
for (j = 0; j < N; j++) {
if ((i+j) < N)
sum += A[j] * A[i+j];
else
continue;
}
C[i] = sum;
}
}
}
I wrote a similar program in OpenCL, but I am not getting correct results. The kernel is shown below. I am new to GPU programming, so apart from hints that could fix my error, any other advice is welcome.
__kernel void autocorrel1D(__global double *Vol_IN, __global double *Vol_AUTOCORR, int size)
{
int j, gid = get_global_id(0);
double sum = 0.0;
for (j = 0; j < size; j++) {
if ((gid+j) < size)
{
sum += Vol_IN[j] * Vol_IN[gid+j];
}
else
continue;
}
barrier(CLK_GLOBAL_MEM_FENCE);
Vol_AUTOCORR[gid] = sum;
}
Since I launch the kernel with a work dimension of 1, I assume that get_global_id(0) gives me the index of the current work-item, which I use to index the 1D input array.
Thanks,
Sayan

The kernel is correct. As far as I can tell, it should run fine and give correct results.
barrier(CLK_GLOBAL_MEM_FENCE); is not needed, because each work-item only writes its own element of Vol_AUTOCORR. The kernel will run faster without that statement.
Your problem is most likely outside the kernel. Check that you are passing the input correctly and that you are reading the correct data back from the GPU.
BTW, I suppose you are using a GPU that supports double precision, since you are doing the calculations in double. On OpenCL 1.0 and 1.1, double is only available through the cl_khr_fp64 extension, which the kernel source has to enable.
Also check that the host buffers actually hold double values. Remember that you CAN'T point a float pointer at double data, or vice versa. That will give you wrong results.
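One way to narrow down whether the fault is in the kernel or in the host code is to compare the GPU output against a plain CPU reference. The following is only a sketch under assumptions: the names autocorr_reference, widen_to_double and max_abs_diff are made up for illustration, and the host data is assumed to live in std::vector.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference: C[i] = sum over j of A[j] * A[i + j], for i + j < N.
std::vector<double> autocorr_reference(const std::vector<double>& a) {
    const std::size_t n = a.size();
    std::vector<double> c(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        // The loop bound keeps i + j in range, so no branch is needed inside.
        for (std::size_t j = 0; i + j < n; ++j) {
            sum += a[j] * a[i + j];
        }
        c[i] = sum;
    }
    return c;
}

// Widen float samples to double before uploading them to a kernel that
// takes __global double*. Reinterpreting the float buffer would not work.
std::vector<double> widen_to_double(const std::vector<float>& samples) {
    return std::vector<double>(samples.begin(), samples.end());
}

// Largest absolute difference between two result vectors of equal length.
double max_abs_diff(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        worst = std::fmax(worst, std::fabs(x[i] - y[i]));
    }
    return worst;
}
```

If max_abs_diff between the reference and the buffer read back from the device is large, the buffer sizes and element types used on the host side are the first thing to check.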

Why is iteration so much more time-consuming than recursion?

Today, while computing Fibonacci numbers, I ran into something strange. The recursive version takes only 16 ms, but the iterative version takes 80 ms. I have tried to optimize the iterative version (for example, by using a std::vector as the underlying container of my stack), but it is still much slower than recursion. This doesn't make sense to me, because recursion still builds up a call stack, which I would expect to be more expensive than iteration.
Here is my iterative code:
class Solution {
public:
int fib(int n) {
std::stack<int, std::vector<int>> st;
st.push(n);
int result = 0;
int temp = 0;
while(!st.empty()) {
temp = st.top(); st.pop();
if(temp == 1) result++;
else if(temp == 0) continue;
else {
st.push(temp - 1);
st.push(temp - 2);
}
}
return result;
}
};
Here is my recursive code:
class Solution {
public:
int fib(int n) {
if(n == 0) return 0;
if(n == 1) return 1;
else return fib(n - 1) + fib(n - 2);
}
};
I have searched for the reason. According to "Is recursion ever faster than looping?", recursion is more time-consuming than iteration in an imperative language. C++ is an imperative language, yet that is not what I measure, so this explanation does not convince me.
I think I have found the reason. Could you check whether my analysis is correct?
My explanation for why recursion is faster here is that an STL container used as a stack stores its elements on the heap.
Every access to that stack may then cause a cache miss, which is very expensive for such a small problem.
The recursive Fibonacci function, on the other hand, is very short, so the program counter can cheaply jump back to the beginning of the function. If I use a fixed-size int array on the stack instead of the STL container, the result is satisfying.
Here is the code:
class Solution {
public:
int fib(int n) {
int arr[1000];
arr[0] = n;
int s = 1;
int result = 0;
int temp;
while (s) {
temp = arr[s-1];
s--;
switch (temp) {
case 1:
result++;
break;
case 0:
continue;
break;
default:
arr[s++] = temp - 1;
arr[s++] = temp - 2;
}
}
return result;
}
};
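To check an analysis like this, it helps to time both variants with the same clock in the same build. Below is a minimal sketch, not a rigorous benchmark: fib_recursive, fib_explicit_stack and time_ms are names chosen for illustration, and the reserve(64) call is an assumption that merely avoids reallocation for moderate n.

```cpp
#include <cassert>
#include <chrono>
#include <stack>
#include <utility>
#include <vector>

// Plain recursive version, same algorithm as in the question.
int fib_recursive(int n) {
    assert(n >= 0);
    if (n < 2) return n;
    return fib_recursive(n - 1) + fib_recursive(n - 2);
}

// Explicit-stack version, same algorithm, with the vector's capacity
// reserved up front so that pushes do not reallocate for moderate n.
int fib_explicit_stack(int n) {
    assert(n >= 0);
    std::vector<int> storage;
    storage.reserve(64);  // assumption: enough for the n used here
    std::stack<int, std::vector<int>> st(std::move(storage));
    st.push(n);
    int result = 0;
    while (!st.empty()) {
        const int t = st.top();
        st.pop();
        if (t == 1) {
            ++result;
        } else if (t > 1) {
            st.push(t - 1);
            st.push(t - 2);
        }
    }
    return result;
}

// Runs f once, stores its return value in result, returns elapsed milliseconds.
template <typename F>
double time_ms(F f, int& result) {
    const auto start = std::chrono::steady_clock::now();
    result = f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as time_ms([] { return fib_recursive(30); }, r) gives one measurement. Numbers from an unoptimized build are likely to penalize the std::stack wrapper heavily, so comparing with optimizations enabled (for example -O2) is probably the fairer test.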

Multiple read-write synchronization issues in opencl local and global memories

I have an OpenCL kernel that finds the maximum ASCII character in a string.
The problem is that I cannot synchronize the multiple reads and writes to global and local memory.
I am trying to update a local maximum character in local (shared) memory and then, in the last work-item of each work-group, update the global maximum character by comparing it with the local maximum. The work-items are overwriting one another's results, I guess.
E.g. input string: "pirates of the carribean".
Output: 'r' (but it should be 't').
Please have a look at the code and suggest what I can do to get everything synchronized. Optimization tips are welcome.
The code is below:
__kernel void find_highest_ascii( __global const char* data, __global char* result, unsigned int size, __local char* localMaxC )
{
//creating variables and initialising..
unsigned int i, localSize, globalSize, j;
char privateMaxC,temp,temp1;
i = get_global_id(0);
localSize = get_local_size(0);
globalSize = get_global_size(0);
privateMaxC = '\0';
if(i<size){
if(i == 0)
read_mem_fence( CLK_LOCAL_MEM_FENCE );
*localMaxC = '\0';
mem_fence( CLK_LOCAL_MEM_FENCE);
////////////////////////////////////////////////////
/////UPDATING PRIVATE MAX CHARACTER/////////////////
////////////////////////////////////////////////////
for( j = i; j<size; j+=globalSize )
{
if( data[j] > privateMaxC )
{
privateMaxC = data[j];
}
}
///////////////////////////////////////////////////
///////////////////////////////////////////////////
////UPDATING SHARED MAX CHARACTER//////////////////
///////////////////////////////////////////////////
temp = *localMaxC;
read_mem_fence( CLK_LOCAL_MEM_FENCE );
if(privateMaxC>temp)
{
*localMaxC = privateMaxC;
write_mem_fence( CLK_LOCAL_MEM_FENCE );
temp = privateMaxC;
}
//////////////////////////////////////////////////
//UPDATING GLOBAL MAX CHARACTER.
temp1 = *result;
if(( (i+1)%localSize == 0 || i==size-1) && (temp > temp1 ))
{
read_mem_fence( CLK_GLOBAL_MEM_FENCE );
*result = temp;
write_mem_fence( CLK_GLOBAL_MEM_FENCE );
}
}
}
You are correct that threads will be overwriting each other's values, since your code is riddled with race conditions. In OpenCL, there is no way to synchronise between work-items that are in different work-groups. Instead of trying to achieve this kind of synchronisation with explicit fences, you can make your code much simpler by using the built-in atomic functions instead. In particular, there is an atomic_max built-in which solves your problem perfectly.
So, instead of the code you currently have to update both your local and global memory maximum values, just do something like this:
kernel void ascii_max(global int *input, global int *output, int size,
                      local int *localMax)
{
    int i = get_global_id(0);
    int l = get_local_id(0);
    // Initialise the work-group maximum before any work-item updates it
    if (l == 0)
    {
        *localMax = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // Private reduction
    int privateMax = '\0';
    for (int idx = i; idx < size; idx+=get_global_size(0))
    {
        privateMax = max(privateMax, input[idx]);
    }
    // Local reduction
    atomic_max(localMax, privateMax);
    barrier(CLK_LOCAL_MEM_FENCE);
    // Global reduction (the host must initialise *output to 0 beforehand)
    if (l == 0)
    {
        atomic_max(output, *localMax);
    }
}
This will require you to update your local memory scratch space and final result to use 32-bit integer values, because atomic_max operates on 32-bit integers. As written, the kernel also reads the input as 32-bit integers. On the whole it is a significantly cleaner approach to solving this problem (not to mention it actually works).
NON-ATOMIC SOLUTION
If you really don't want to use atomics, then you can implement a bog-standard reduction using local memory and work-group barriers. Here's an example:
kernel void ascii_max(global int *input, global int *output, int size,
                      local int *localMax)
{
    int i = get_global_id(0);
    int l = get_local_id(0);
    // Private reduction
    int privateMax = '\0';
    for (int idx = i; idx < size; idx+=get_global_size(0))
    {
        privateMax = max(privateMax, input[idx]);
    }
    // Local reduction (assumes the work-group size is a power of two)
    localMax[l] = privateMax;
    for (int offset = get_local_size(0)/2; offset > 0; offset>>=1)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (l < offset)
        {
            localMax[l] = max(localMax[l], localMax[l+offset]);
        }
    }
    // Store work-group result in global memory. Work-item 0 wrote
    // localMax[0] itself in the last step, so no further barrier is needed.
    if (l == 0)
    {
        output[get_group_id(0)] = localMax[0];
    }
}
This compares pairs of elements at a time using local memory as a scratch space. Each work-group will produce a single result, which is stored in global memory. If your data-set is small, you could run this with a single work-group (i.e. make global and local sizes the same), and this will work just fine. If it is larger, you could run a two-stage reduction by running this kernel twice, e.g.:
size_t N = ...; // something big
size_t local = 128;
size_t global = local*local; // Must result in at most 'local' number of work-groups
cl_int numGroups = (cl_int)(global/local); // number of partial results from the first pass
// First pass - run many work-groups using temporary buffer as output
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_temp);
clEnqueueNDRangeKernel(..., &global, &local, ...);
// Second pass - run one work-group with temporary buffer as input
global = local;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_temp);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_output);
clSetKernelArg(kernel, 2, sizeof(cl_int), &numGroups); // only numGroups values left to reduce
clEnqueueNDRangeKernel(..., &global, &local, ...);
I'll leave it to you to run them and decide which approach would be best for your own data-set.
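For checking either kernel's output, a host-side model of the same reduction pattern can be handy. This is a rough sketch, not the kernel itself: tree_reduce_max and ascii_max_reference are hypothetical names, each "work-item" is simulated by a loop iteration, and the number of simulated work-items is assumed to be a power of two.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Models the local-memory tree reduction: at each step, slot l takes the
// max of itself and slot l + offset, and the offset is halved.
// Assumption: scratch.size() is a power of two, like the work-group size.
int tree_reduce_max(std::vector<int> scratch) {
    assert(!scratch.empty());
    assert((scratch.size() & (scratch.size() - 1)) == 0);
    for (std::size_t offset = scratch.size() / 2; offset > 0; offset >>= 1) {
        for (std::size_t l = 0; l < offset; ++l) {
            scratch[l] = std::max(scratch[l], scratch[l + offset]);
        }
    }
    return scratch[0];
}

// Strided private reduction per simulated work-item, then the tree reduction.
int ascii_max_reference(const std::string& text, std::size_t work_items) {
    std::vector<int> scratch(work_items, 0);
    for (std::size_t i = 0; i < work_items; ++i) {
        for (std::size_t idx = i; idx < text.size(); idx += work_items) {
            const int value = static_cast<unsigned char>(text[idx]);
            scratch[i] = std::max(scratch[i], value);
        }
    }
    return tree_reduce_max(scratch);
}
```

Calling ascii_max_reference with the input string and a work-item count such as 8 yields the character code that the device result can be compared against.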

Base case condition in quick sort algorithm

In the recursive quick sort algorithm, every call checks the condition if(p < r) before doing any work. Please correct me if I am wrong: as far as I know, every recursive algorithm tests a condition on entry to the routine, and this condition is what detects the base case. But I still cannot understand how to set and test this condition correctly.
void quickSort(int* arr, int p, int r)
{
if(p < r)
{
int q = partition(arr,p,r);
quickSort(arr,p,q-1);
quickSort(arr,q+1,r);
}
}
For my entire code, please refer to the following:
/*
filename : main.c
description: quickSort algorithm
*/
#include<iostream>
using namespace std;
void exchange(int* val1, int* val2)
{
int temp = *val1;
*val1 = *val2;
*val2 = temp;
}
int partition(int* arr, int p, int r)
{
int x = arr[r];
int j = p;
int i = j-1;
while(j<=r-1)
{
if(arr[j] <= x)
{
i++;
// exchange arr[i] with arr[j]
exchange(&arr[i],&arr[j]);
}
j++;
}
exchange(&arr[i+1],&arr[r]);
return i+1;
}
void quickSort(int* arr, int p, int r)
{
if(p < r)
{
int q = partition(arr,p,r);
quickSort(arr,p,q-1);
quickSort(arr,q+1,r);
}
}
// driver program to test the quick sort algorithm
int main(int argc, const char* argv[])
{
int arr1[] = {13,19,9,5,12,8,7,4,21,2,6,11};
cout <<"The original array is: ";
for(int i=0; i<12; i++)
{
cout << arr1[i] << " ";
}
cout << "\n";
quickSort(arr1,0,11);
//print out the sorted array
cout <<"The sorted array is: ";
for(int i=0; i<12; i++)
{
cout << arr1[i] << " ";
}
cout << "\n";
cin.get();
return 0;
}
Your question is not quite clear, but I will try to answer.
Quicksort works by sorting smaller and smaller subarrays. The base case is a subarray with fewer than two elements, because it needs no sorting.
At each step it finds a partition value (the pivot) and rearranges the subarray so that all the values to the left of the partition value are less than or equal to it and all values to the right of it are larger. In other words, it puts the partition value in its final place. Then it recursively sorts the subarray to the left of the partition value and the subarray to the right of it.
In your code, p is the index of the first element and r is the index of the last element, so the subarray holds r - p + 1 elements. The predicate p < r is only true for a subarray of at least two elements. In other words, if p >= r you have a subarray of one element (p == r) or an empty one (p > r, which happens when the partition value lands at either end, so that a recursive call receives q-1 < p or q+1 > r), and there is no work to do.
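The same logic can be written with the base case as an explicit early return, which some people find easier to read. This is a small sketch using std::vector instead of a raw pointer; the names partition_range and quick_sort are chosen here for illustration and are not from the code above.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Lomuto partition of arr[p..r] (inclusive); arr[r] is the partition value.
// Returns the final index of the partition value.
int partition_range(std::vector<int>& arr, int p, int r) {
    assert(p >= 0 && p <= r && r < static_cast<int>(arr.size()));
    const int pivot = arr[r];
    int i = p - 1;
    for (int j = p; j < r; ++j) {
        if (arr[j] <= pivot) {
            ++i;
            std::swap(arr[i], arr[j]);
        }
    }
    std::swap(arr[i + 1], arr[r]);
    return i + 1;
}

void quick_sort(std::vector<int>& arr, int p, int r) {
    // Base case: the range holds r - p + 1 < 2 elements, nothing to sort.
    if (p >= r) return;
    const int q = partition_range(arr, p, r);
    quick_sort(arr, p, q - 1);  // empty range when q == p
    quick_sort(arr, q + 1, r);  // empty range when q == r
}

// Convenience overload; for an empty vector this passes r == -1,
// which the base case handles.
void quick_sort(std::vector<int>& arr) {
    quick_sort(arr, 0, static_cast<int>(arr.size()) - 1);
}
```

Testing the condition then comes down to calling the function on an empty array, a one-element array and a two-element array, since those are exactly the sizes around the p < r boundary.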

Access vector type OpenCL

I have a variable within a kernel like:
int16 element;
I would like to know if there is a way to address the third int in element with something like
element[2], so that it would be the same as writing element.s2.
So how can I do something like:
int16 element;
int vector[100] = rand() % 16;
for ( int i=0; i<100; i++ )
element[ vector[i] ]++;
The way I did it was:
int temp[16] = {0};
int16 element;
int vector[100] = rand() % 16;
for ( int i=0; i<100; i++ )
temp[ vector[i] ]++;
element = (int16)(temp[0],temp[1],temp[2],temp[3],temp[4],temp[5],temp[6],temp[7],temp[8],temp[9],temp[10],temp[11],temp[12],temp[13],temp[14],temp[15]);
I know this is terrible, but it works ;-)
Well, there is a still dirtier way :). I hope OpenCL provides a better way of traversing vector elements.
Here is my way of doing it:
union
{
int elarray[16];
int16 elvector;
} element;
//traverse the elements
for ( i = 0; i < 16; i++)
element.elarray[i] = temp[vector[i]]++;
Btw, the rand() function is not available in an OpenCL kernel. How did you make it work?
Using pointers is a very easy solution:
float4 f4 = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
int gid = get_global_id(0);
float *p = (float *)&f4; // explicit cast: float4* does not convert to float* implicitly
result[gid] = p[3];
AMD recommends getting vector components this way:
Put the array of masks into an OpenCL constant buffer:
cl_uint const_masks[4][4] =
{
    {0xffffffff, 0, 0, 0},
    {0, 0xffffffff, 0, 0},
    {0, 0, 0xffffffff, 0},
    {0, 0, 0, 0xffffffff},
};
Inside the kernel write something like this:
uint getComponent(uint4 a, int index, __constant uint4 * const_masks)
{
uint b;
uint4 masked_a = a & const_masks[index];
b = masked_a.s0 + masked_a.s1 + masked_a.s2 + masked_a.s3;
return (b);
}
__kernel void foo(…, __constant uint4 * const_masks, …)
{
uint4 a = ….;
int index = …;
uint b = getComponent(a, index, const_masks);
}
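To see why the masking trick returns the right component, here is a plain C++ model of it. It is only an illustration under assumptions: Vec4 stands in for uint4, kMasks for the constant buffer, and get_component is a host-side analogue of the kernel helper, not OpenCL code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for OpenCL's uint4 on the host.
using Vec4 = std::array<std::uint32_t, 4>;

// Row k has all bits set in lane k and zeros in the other lanes.
const std::array<Vec4, 4> kMasks = {{
    {{0xffffffffu, 0, 0, 0}},
    {{0, 0xffffffffu, 0, 0}},
    {{0, 0, 0xffffffffu, 0}},
    {{0, 0, 0, 0xffffffffu}},
}};

// AND every lane with the selected mask row, then add the lanes up.
// Three lanes become zero, so the sum equals the selected component.
std::uint32_t get_component(const Vec4& a, std::size_t index) {
    assert(index < 4);
    const Vec4& mask = kMasks[index];
    std::uint32_t sum = 0;
    for (std::size_t lane = 0; lane < 4; ++lane) {
        sum += a[lane] & mask[lane];
    }
    return sum;
}
```

The point of the technique is that the index only selects which mask row is loaded, so the vector itself is never indexed dynamically.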
It is possible, but it is not as efficient as direct array access.
float index(float4 v, int i) {
    if (i==0) return v.x;
    if (i==1) return v.y;
    if (i==2) return v.z;
    return v.w; // i == 3; also ensures every path returns a value
}
But of course, if you need component-wise access this way, then chances are that you're better off not using vectors.
I use this workaround, hoping that compilers are smart enough to see what I mean (I think that element access is a serious omission from the standard):
int16 vec;
// access i-th element:
((int*)&vec)[i]=...;
No, that's not possible, at least not dynamically at runtime. But you can use a "compile-time" index to access a component:
float4 v;
v.s0 == v.x; // is true
v.s01 == v.xy // also true
See http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf Section 6.1.7

Runtime allocation of multidimensional array

Until now I thought that the following syntax was invalid:
int B[ydim][xdim];
But today I tried it and it worked! I ran it many times to make sure it did not work just by chance, and even valgrind didn't report any error or memory leak! I am very surprised. Is it a new feature introduced in g++? I have always used 1D arrays to store matrices, indexing them with the correct strides, as done with A in the program below. But this new method, as with B, is the simple and elegant syntax I have always wanted. Is it really safe to use? See the sample program.
PS. I am compiling with g++ 4.4.3, if that matters.
#include <cstdlib>
#include <iostream>
int test(int ydim, int xdim) {
// Allocate 1D array
int *A = new int[xdim*ydim](); // with C++ new operator
// int *A = (int *) malloc(xdim*ydim * sizeof(int)); // or with C style malloc
if (A == NULL)
return EXIT_FAILURE;
// Declare a 2D array of variable size
int B[ydim][xdim];
// populate matrices A and B
for(int y = 0; y < ydim; y++) {
for(int x = 0; x < xdim; x++) {
A[y*xdim + x] = y*xdim + x;
B[y][x] = y*xdim + x;
}
}
// read out matrix A
for(int y = 0; y < ydim; y++) {
for(int x = 0; x < xdim; x++)
std::cout << A[y*xdim + x] << " ";
std::cout << std::endl;
}
std::cout << std::endl;
// read out matrix B
for(int y = 0; y < ydim; y++) {
for(int x = 0; x < xdim; x++)
std::cout << B[y][x] << " ";
std::cout << std::endl;
}
delete []A;
// free(A); // or in C style
return EXIT_SUCCESS;
}
int main() {
return test(5, 8);
}
int B[ydim][xdim] declares a 2D array on the stack. new, on the other hand, allocates the array on the heap.
For any non-trivial array size, it's almost certainly better to have it on the heap, lest you run out of stack space. The same applies if you want to pass the array back to something outside the current scope.
This is a C99 'variable length array', or VLA. Standard C++ does not include VLAs; g++ accepts them as an extension, and compiling with -pedantic should produce a warning about it.
Nice, aren't they?
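If you want the B[y][x] style of access without relying on a compiler extension, one portable option is a thin wrapper around a heap-allocated 1D buffer. The following is a sketch only; Matrix2D is a hypothetical class written for this example, using m(y, x) in place of B[y][x].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major 2D view over a single std::vector, sized at run time.
class Matrix2D {
public:
    Matrix2D(std::size_t ydim, std::size_t xdim)
        : ydim_(ydim), xdim_(xdim), data_(ydim * xdim, 0) {}

    // Element access with the same stride arithmetic as A[y*xdim + x].
    int& operator()(std::size_t y, std::size_t x) {
        assert(y < ydim_ && x < xdim_);
        return data_[y * xdim_ + x];
    }
    int operator()(std::size_t y, std::size_t x) const {
        assert(y < ydim_ && x < xdim_);
        return data_[y * xdim_ + x];
    }

    std::size_t rows() const { return ydim_; }
    std::size_t cols() const { return xdim_; }

private:
    std::size_t ydim_;
    std::size_t xdim_;
    std::vector<int> data_;  // storage lives on the heap, freed automatically
};
```

Because the storage is a std::vector, large matrices do not consume stack space, and the object can be returned from a function, which a VLA cannot.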
