Access vector type OpenCL - vector

I have a variable within a kernel like:
int16 element;
I would like to know if there is a way to address the third int in element like
element[2], so that it would be the same as writing element.s2.
So how can I do something like:
int16 element;
int vector[100] = rand() % 16;
for ( int i=0; i<100; i++ )
element[ vector[i] ]++;
The way I did it was:
int temp[16] = {0};
int16 element;
int vector[100] = rand() % 16;
for ( int i=0; i<100; i++ )
temp[ vector[i] ]++;
element = (int16)(temp[0],temp[1],temp[2],temp[3],temp[4],temp[5],temp[6],temp[7],temp[8],temp[9],temp[10],temp[11],temp[12],temp[13],temp[14],temp[15]);
I know this is terrible, but it works, ;-)

Well, there is a still dirtier way :). I hope OpenCL provides a better way of traversing vector elements.
Here is my way of doing it.
union
{
int elarray[16];
int16 elvector;
} element;
//traverse the elements
for ( i = 0; i < 16; i++)
element.elarray[i] = temp[vector[i]]++;
By the way, the rand() function is not available in an OpenCL kernel; how did you make it work?
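For comparison, here is a minimal host-side sketch of the same union idea in plain C. The `lanes16` type and `count_indices` function are hypothetical stand-ins (there is no `int16` on the host without the OpenCL headers), and the sketch assumes the compiler adds no padding between the named members.

```c
#include <stddef.h>

/* Hypothetical stand-in for an OpenCL int16: the same sixteen ints can be
 * reached by name (like element.s2) or by index (like element[2]).
 * Assumes the compiler adds no padding between the struct's int members. */
typedef union {
    int lane[16];
    struct {
        int s0, s1, s2, s3, s4, s5, s6, s7;
        int s8, s9, sA, sB, sC, sD, sE, sF;
    } named;
} lanes16;

/* Count how often each index in [0, 16) occurs in `indices`. */
lanes16 count_indices(const int *indices, size_t n)
{
    lanes16 element = { {0} };
    for (size_t i = 0; i < n; i++) {
        if (indices[i] >= 0 && indices[i] < 16) {
            element.lane[indices[i]]++;   /* runtime-indexed access */
        }
    }
    return element;
}
```

After the call, `element.named.s2` and `element.lane[2]` refer to the same storage, which is the effect the question asks for.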

Using pointers is a very easy solution:
float4 f4 = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
int gid = get_global_id(0);
float *p = (float *)&f4; // explicit cast: &f4 has type float4 *
result[gid]=p[3];

AMD recommends getting vector components this way:
Put the array of masks into an OpenCL constant buffer:
cl_uint const_masks[4][4] =
{
{0xffffffff, 0, 0, 0},
{0, 0xffffffff, 0, 0},
{0, 0, 0xffffffff, 0},
{0, 0, 0, 0xffffffff},
};
Inside the kernel write something like this:
uint getComponent(uint4 a, int index, __constant uint4 * const_masks)
{
uint b;
uint4 masked_a = a & const_masks[index];
b = masked_a.s0 + masked_a.s1 + masked_a.s2 + masked_a.s3;
return (b);
}
__kernel void foo(…, __constant uint4 * const_masks, …)
{
uint4 a = ….;
int index = …;
uint b = getComponent(a, index, const_masks);
}
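To see why the masking trick works, here is a small sketch of the same idea in plain host C. The names `component_masks` and `get_component` are made up for illustration, and an ordinary array stands in for `uint4`.

```c
#include <stdint.h>

/* One all-ones word per row selects exactly one of the four components. */
static const uint32_t component_masks[4][4] = {
    {0xffffffffu, 0, 0, 0},
    {0, 0xffffffffu, 0, 0},
    {0, 0, 0xffffffffu, 0},
    {0, 0, 0, 0xffffffffu},
};

/* Return component `index` of `a` without indexing `a` by a runtime value:
 * mask every component, then add them up (three of the four are zero).
 * Assumes 0 <= index < 4. */
uint32_t get_component(const uint32_t a[4], int index)
{
    const uint32_t *mask = component_masks[index];
    return (a[0] & mask[0]) + (a[1] & mask[1]) +
           (a[2] & mask[2]) + (a[3] & mask[3]);
}
```

Only the mask table is indexed dynamically; the vector itself is always read with fixed component positions.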

It is possible, but it is not as efficient as direct array access.
float index(float4 v, int i) {
if (i==0) return v.x;
if (i==1) return v.y;
if (i==2) return v.z;
if (i==3) return v.w;
return 0.0f; // index out of range
}
But of course, if you need component-wise access this way, then chances are that you're better off not using vectors.

I use this workaround, hoping that compilers are smart enough to see what I mean (I think that element access is a serious omission from the standard):
int16 vec;
// access i-th element:
((int*)&vec)[i]=...;

No, that's not possible, at least not dynamically at runtime. But you can use a "compile-time" index to access a component:
float4 v;
v.s0 == v.x; // is true
v.s01 == v.xy // also true
See http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf Section 6.1.7

Related

Sizeof pointer of pointer in C [duplicate]

First off, here is some code:
int main()
{
int days[] = {1,2,3,4,5};
int *ptr = days;
printf("%zu\n", sizeof(days));
printf("%zu\n", sizeof(ptr));
return 0;
}
Is there a way to find out the size of the array that ptr is pointing to (instead of just giving its size, which is four bytes on a 32-bit system)?
No, you can't. The compiler doesn't know what the pointer is pointing to. There are tricks, like ending the array with a known out-of-band value and then counting the size up until that value, but that's not using sizeof().
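A minimal sketch of the out-of-band terminator trick, assuming a value (here -1, named `SENTINEL` purely for illustration) that never occurs as real data:

```c
#include <stddef.h>

/* Assumption: SENTINEL never appears as real data in the array. */
#define SENTINEL (-1)

/* Count the elements that come before the terminating SENTINEL. */
size_t count_until_sentinel(const int *ptr)
{
    size_t n = 0;
    while (ptr[n] != SENTINEL) {
        n++;
    }
    return n;
}
```

With `int days[] = {1, 2, 3, 4, 5, SENTINEL};` the function gives 5 for a pointer to the first element, at the cost of one extra slot and a linear scan.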
Another trick is the one mentioned by Zan, which is to stash the size somewhere. For example, if you're dynamically allocating the array, allocate a block one int bigger than the one you need, stash the size in the first int, and return ptr+1 as the pointer to the array. When you need the size, decrement the pointer and peek at the stashed value. Just remember to free the whole block starting from the beginning, and not just the array.
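Here is a rough sketch of that stashing trick. The function names are hypothetical, the sketch stores the count in a `size_t` header rather than an `int`, and it does not guard against overflow in the size calculation.

```c
#include <stdlib.h>

/* Allocate `count` ints with the count stored just before the array. */
int *int_array_new(size_t count)
{
    size_t *block = malloc(sizeof(size_t) + count * sizeof(int));
    if (block == NULL) {
        return NULL;
    }
    block[0] = count;            /* stash the length in the header */
    return (int *)(block + 1);   /* hand out the memory after the header */
}

/* Peek at the stashed length; only valid for pointers from int_array_new. */
size_t int_array_length(const int *arr)
{
    return ((const size_t *)arr)[-1];
}

/* Free the whole block, starting from the header, not from the array. */
void int_array_free(int *arr)
{
    if (arr != NULL) {
        free((size_t *)arr - 1);
    }
}
```

The caller must never pass such a pointer to plain `free()`, and must never call `int_array_length` on an ordinary array.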
The answer is, "No."
What C programmers do is store the size of the array somewhere. It can be part of a structure, or the programmer can cheat a bit and malloc() more memory than requested in order to store a length value before the start of the array.
For dynamic arrays (malloc or C++ new) you need to store the size of the array as mentioned by others or perhaps build an array manager structure which handles add, remove, count, etc. Unfortunately C doesn't do this nearly as well as C++ since you basically have to build it for each different array type you are storing which is cumbersome if you have multiple types of arrays that you need to manage.
For static arrays, such as the one in your example, there is a common macro used to get the size, but it is not recommended as it does not check if the parameter is really a static array. The macro is used in real code though, e.g. in the Linux kernel headers although it may be slightly different than the one below:
#if !defined(ARRAY_SIZE)
#define ARRAY_SIZE(x) (sizeof((x)) / sizeof((x)[0]))
#endif
int main()
{
int days[] = {1,2,3,4,5};
int *ptr = days;
printf("%zu\n", ARRAY_SIZE(days));
printf("%zu\n", sizeof(ptr));
return 0;
}
You can google for reasons to be wary of macros like this. Be careful.
If possible, use the C++ standard library, such as std::vector, which is much safer and easier to use.
There is a clean solution with C++ templates, without using sizeof(). The following getSize() function returns the size of any static array:
#include <cstddef>
template<typename T, size_t SIZE>
size_t getSize(T (&)[SIZE]) {
return SIZE;
}
Here is an example with a foo_t structure:
#include <cstddef>
#include <cstdio>
template<typename T, size_t SIZE>
size_t getSize(T (&)[SIZE]) {
return SIZE;
}
struct foo_t {
int ball;
};
int main()
{
foo_t foos3[] = {{1},{2},{3}};
foo_t foos5[] = {{1},{2},{3},{4},{5}};
printf("%zu\n", getSize(foos3));
printf("%zu\n", getSize(foos5));
return 0;
}
Output:
3
5
As all the correct answers have stated, you cannot get this information from the decayed pointer value of the array alone. If the decayed pointer is the argument received by the function, then the size of the originating array has to be provided in some other way for the function to come to know that size.
Here's a suggestion, different from what has been provided thus far, that will work: pass a pointer to the array instead. This suggestion is similar to the C++ style suggestions, except that C does not support templates or references:
#define ARRAY_SZ 10
void foo (int (*arr)[ARRAY_SZ]) {
printf("%u\n", (unsigned)(sizeof(*arr)/sizeof(**arr)));
}
But, this suggestion is kind of silly for your problem, since the function is defined to know exactly the size of the array that is passed in (hence, there is little need to use sizeof at all on the array). What it does do, though, is offer some type safety. It will prohibit you from passing in an array of an unwanted size.
int x[20];
int y[10];
foo(&x); /* error */
foo(&y); /* ok */
If the function is supposed to be able to operate on any size of array, then you will have to provide the size to the function as additional information.
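For the any-size case, a minimal sketch of passing the element count alongside the pointer (`sum_ints` is just an illustrative name):

```c
#include <stddef.h>

/* The function cannot recover the length from `arr`, so the caller
 * supplies it explicitly as `count`. */
long sum_ints(const int *arr, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        total += arr[i];
    }
    return total;
}
```

At the call site the array has not decayed yet, so the count can still be computed there, e.g. `sum_ints(y, sizeof y / sizeof y[0])`.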
For this specific example, yes, there is, IF you use typedefs (see below). Of course, if you do it this way, you're just as well off to use SIZEOF_DAYS, since you know what the pointer is pointing to.
If you have a (void *) pointer, as is returned by malloc() or the like, then, no, there is no way to determine what data structure the pointer is pointing to and thus, no way to determine its size.
#include <stdio.h>
#define NUM_DAYS 5
typedef int days_t[ NUM_DAYS ];
#define SIZEOF_DAYS ( sizeof( days_t ) )
int main() {
days_t days;
days_t *ptr = &days;
printf( "SIZEOF_DAYS: %zu\n", SIZEOF_DAYS );
printf( "sizeof(days): %zu\n", sizeof(days) );
printf( "sizeof(*ptr): %zu\n", sizeof(*ptr) );
printf( "sizeof(ptr): %zu\n", sizeof(ptr) );
return 0;
}
Output:
SIZEOF_DAYS: 20
sizeof(days): 20
sizeof(*ptr): 20
sizeof(ptr): 4
There is no magic solution. C is not a reflective language. Objects don't automatically know what they are.
But you have many choices:
Obviously, add a parameter
Wrap the call in a macro and automatically add a parameter
Use a more complex object. Define a structure which contains the dynamic array and also the size of the array. Then, pass the address of the structure.
You can do something like this:
int days[] = { /*length:*/5, /*values:*/ 1,2,3,4,5 };
int *ptr = days + 1;
printf("array length: %d\n", ptr[-1]);
return 0;
My solution to this problem is to save the length of the array into a struct Array as a meta-information about the array.
#include <stdio.h>
#include <stdlib.h>
struct Array
{
int length;
double *array;
};
typedef struct Array Array;
Array* NewArray(int length)
{
/* Allocate the memory for the struct Array */
Array *newArray = (Array*) malloc(sizeof(Array));
/* Store only non-negative lengths */
newArray->length = (length > 0) ? length : 0;
newArray->array = (double*) malloc(newArray->length*sizeof(double));
return newArray;
}
void SetArray(Array *structure,int length,double* array)
{
structure->length = length;
structure->array = array;
}
void PrintArray(Array *structure)
{
if(structure->length > 0)
{
int i;
printf("length: %d\n", structure->length);
for (i = 0; i < structure->length; i++)
printf("%g\n", structure->array[i]);
}
else
printf("Empty Array. Length 0\n");
}
int main()
{
int i;
Array *negativeTest, *days = NewArray(5);
double moreDays[] = {1,2,3,4,5,6,7,8,9,10};
for (i = 0; i < days->length; i++)
days->array[i] = i+1;
PrintArray(days);
SetArray(days,10,moreDays);
PrintArray(days);
negativeTest = NewArray(-5);
PrintArray(negativeTest);
return 0;
}
But you have to take care to set the right length for the array you want to store, because there is no way to check this length, as our friends have explained at length.
This is how I personally do it in my code. I like to keep it as simple as possible while still able to get values that I need.
typedef struct intArr {
int size;
int* arr;
} intArr_t;
int main() {
intArr_t arr;
arr.size = 6;
arr.arr = (int*)malloc(sizeof(int) * arr.size);
for (size_t i = 0; i < arr.size; i++) {
arr.arr[i] = i * 10;
}
return 0;
}
No, you can't use sizeof(ptr) to find the size of the array ptr is pointing to.
Allocating extra memory (more than the size of the array) can help, though, if you want to store the length in the extra space.
int main()
{
int days[] = {1,2,3,4,5};
int *ptr = days;
printf("%zu\n", sizeof(days));
printf("%zu\n", sizeof(ptr));
return 0;
}
The size of days[] is 20, which is the number of elements * the size of its data type.
The size of the pointer is 4 no matter what it is pointing to,
because a pointer refers to another element by storing its address.
In strings there is a '\0' character at the end, so the length of the string can be obtained with functions like strlen. The problem with an integer array, for example, is that no value can be reserved as an end value, so one possible solution is to store the addresses of the elements instead and use the NULL pointer as the end value.
#include <stdio.h>
/* the following function will produce the warning:
* ‘sizeof’ on array function parameter ‘a’ will
* return size of ‘int *’ [-Wsizeof-array-argument]
*/
void foo( int a[] )
{
printf( "%lu\n", sizeof a );
}
/* so we have to implement something else one possible
* idea is to use the NULL pointer as a control value
* the same way '\0' is used in strings but this way
* the pointer passed to a function should address pointers
* so the actual implementation of an array type will
* be a pointer to pointer
*/
typedef char * type_t; /* line 18 */
typedef type_t ** array_t;
int main( void )
{
array_t initialize( int, ... );
/* initialize an array with four values "foo", "bar", "baz", "foobar"
* if one wants to use integers rather than strings then in the typedef
* declaration at line 18 the char * type should be changed with int
* and in the format used for printing the array values
* at line 45 and 51 "%s" should be changed with "%i"
*/
array_t array = initialize( 4, "foo", "bar", "baz", "foobar" );
int size( array_t );
/* print array size */
printf( "size %i:\n", size( array ));
void aprint( char *, array_t );
/* print array values */
aprint( "%s\n", array ); /* line 45 */
type_t getval( array_t, int );
/* print an indexed value */
int i = 2;
type_t val = getval( array, i );
printf( "%i: %s\n", i, val ); /* line 51 */
void delete( array_t );
/* free some space */
delete( array );
return 0;
}
/* the output of the program should be:
* size 4:
* foo
* bar
* baz
* foobar
* 2: baz
*/
#include <stdarg.h>
#include <stdlib.h>
array_t initialize( int n, ... )
{
/* here we store the array values */
type_t *v = (type_t *) malloc( sizeof( type_t ) * n );
va_list ap;
va_start( ap, n );
int j;
for ( j = 0; j < n; j++ )
v[j] = va_arg( ap, type_t );
va_end( ap );
/* the actual array will hold the addresses of those
* values plus a NULL pointer
*/
array_t a = (array_t) malloc( sizeof( type_t *) * ( n + 1 ));
a[n] = NULL;
for ( j = 0; j < n; j++ )
a[j] = v + j;
return a;
}
int size( array_t a )
{
int n = 0;
while ( *a++ != NULL )
n++;
return n;
}
void aprint( char *fmt, array_t a )
{
while ( *a != NULL )
printf( fmt, **a++ );
}
type_t getval( array_t a, int i )
{
return *a[i];
}
void delete( array_t a )
{
free( *a );
free( a );
}
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <stdlib.h>
#define array(type) struct { size_t size; type elem[0]; }
void *array_new(int esize, int ecnt)
{
size_t *a = (size_t *)malloc(esize*ecnt+sizeof(size_t));
if (a) *a = ecnt;
return a;
}
#define array_new(type, count) array_new(sizeof(type),count)
#define array_delete free
#define array_foreach(type, e, arr) \
for (type *e = (arr)->elem; e < (arr)->size + (arr)->elem; ++e)
int main(int argc, char const *argv[])
{
array(int) *iarr = array_new(int, 10);
array(float) *farr = array_new(float, 10);
array(double) *darr = array_new(double, 10);
array(char) *carr = array_new(char, 11);
for (int i = 0; i < iarr->size; ++i) {
iarr->elem[i] = i;
farr->elem[i] = i*1.0f;
darr->elem[i] = i*1.0;
carr->elem[i] = i+'0';
}
array_foreach(int, e, iarr) {
printf("%d ", *e);
}
array_foreach(float, e, farr) {
printf("%.0f ", *e);
}
array_foreach(double, e, darr) {
printf("%.0lf ", *e);
}
carr->elem[carr->size-1] = '\0';
printf("%s\n", carr->elem);
return 0;
}
#define array_size 10
struct {
int16 size;
int16 array[array_size];
int16 property1[(array_size/16)+1];
int16 property2[(array_size/16)+1];
} array1 = {array_size, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
#undef array_size
array_size is passed to the size variable:
#define array_size 30
struct {
int16 size;
int16 array[array_size];
int16 property1[(array_size/16)+1];
int16 property2[(array_size/16)+1];
} array2 = {array_size};
#undef array_size
Usage is:
void main() {
int16 size = array1.size;
for (int i=0; i!=size; i++) {
array1.array[i] *= 2;
}
}
Most implementations will have a function that tells you the reserved size for objects allocated with malloc() or calloc(); for example, GNU has malloc_usable_size().
However, this will return the size of the reserved block, which can be larger than the value given to malloc()/realloc().
There is a popular macro, which you can define for finding number of elements in the array (Microsoft CRT even provides it OOB with name _countof):
#define countof(x) (sizeof(x)/sizeof((x)[0]))
Then you can write:
int my_array[] = { ... some elements ... };
printf("%zu", countof(my_array)); // 'z' is correct type specifier for size_t

Multiple read-write synchronization issues in opencl local and global memories

I have an opencl kernel that finds the maximum ASCII character in a string.
The problem is I cannot synchronize the multiple read-writes to global and local memories.
I am trying to update a local_maximum character in shared memory, and at the end of the workgroup (last thread), the global_maximum character, by comparing it with the local_maximum. The threads are writing one over another, I guess.
eg: Input string: "pirates of the carribean".
Output String: 'r' (but it should be 's').
Please have a look at the code and give a solution as to what I can do to get everything synchronized. I am sure people having sound knowledge can understand the code. Optimization tips are welcome.
The code is below:
__kernel void find_highest_ascii( __global const char* data, __global char* result, unsigned int size, __local char* localMaxC )
{
//creating variables and initialising..
unsigned int i, localSize, globalSize, j;
char privateMaxC,temp,temp1;
i = get_global_id(0);
localSize = get_local_size(0);
globalSize = get_global_size(0);
privateMaxC = '\0';
if(i<size){
if(i == 0)
read_mem_fence( CLK_LOCAL_MEM_FENCE );
*localMaxC = '\0';
mem_fence( CLK_LOCAL_MEM_FENCE);
////////////////////////////////////////////////////
/////UPDATING PRIVATE MAX CHARACTER/////////////////
////////////////////////////////////////////////////
for( j = i; j<size; j+=globalSize )
{
if( data[j] > privateMaxC )
{
privateMaxC = data[j];
}
}
///////////////////////////////////////////////////
///////////////////////////////////////////////////
////UPDATING SHARED MAX CHARACTER//////////////////
///////////////////////////////////////////////////
temp = *localMaxC;
read_mem_fence( CLK_LOCAL_MEM_FENCE );
if(privateMaxC>temp)
{
*localMaxC = privateMaxC;
write_mem_fence( CLK_LOCAL_MEM_FENCE );
temp = privateMaxC;
}
//////////////////////////////////////////////////
//UPDATING GLOBAL MAX CHARACTER.
temp1 = *result;
if(( (i+1)%localSize == 0 || i==size-1) && (temp > temp1 ))
{
read_mem_fence( CLK_GLOBAL_MEM_FENCE );
*result = temp;
write_mem_fence( CLK_GLOBAL_MEM_FENCE );
}
}
}
You are correct that threads will be overwriting each other's values, since your code is riddled with race conditions. In OpenCL, there is no way to synchronise between work-items that are in different work-groups. Instead of trying to achieve this kind of synchronisation with explicit fences, you can make your code much simpler by using the built-in atomic functions instead. In particular, there is an atomic_max built-in which solves your problem perfectly.
So, instead of the code you currently have to update both your local and global memory maximum values, just do something like this:
kernel void ascii_max(global int *input, global int *output, int size,
local int *localMax)
{
int i = get_global_id(0);
int l = get_local_id(0);
// Private reduction
int privateMax = '\0';
for (int idx = i; idx < size; idx+=get_global_size(0))
{
privateMax = max(privateMax, input[idx]);
}
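// Local reduction (the shared value must be initialised first)
if (l == 0)
{
*localMax = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);
atomic_max(localMax, privateMax);
barrier(CLK_LOCAL_MEM_FENCE);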
// Global reduction
if (l == 0)
{
atomic_max(output, *localMax);
}
}
This will require you to update your local memory scratch space and final result to use 32-bit integer values, but on the whole is a significantly cleaner approach to solving this problem (not to mention it actually works).
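If it helps to see what an atomic max does, here is a host-side sketch using C11 atomics. C11 has no built-in atomic max, so this emulates one with a compare-and-swap loop; `atomic_max_int` is a made-up name and this is not how any particular OpenCL implementation does it.

```c
#include <stdatomic.h>

/* Raise *target to `value` if `value` is larger, atomically.
 * Returns the value that was seen before any update, mirroring the
 * "returns the old value" convention of OpenCL's atomic_max. */
int atomic_max_int(atomic_int *target, int value)
{
    int old = atomic_load(target);
    /* On failure the compare-exchange reloads `old`, so the loop retries
     * only while the stored value is still smaller than `value`. */
    while (old < value &&
           !atomic_compare_exchange_weak(target, &old, value)) {
        /* retry */
    }
    return old;
}
```

Because the read, compare and write happen as one indivisible step, two threads can no longer overwrite each other's larger result, which is exactly the race in the original kernel.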
NON-ATOMIC SOLUTION
If you really don't want to use atomics, then you can implement a bog-standard reduction using local memory and work-group barriers. Here's an example:
kernel void ascii_max(global int *input, global int *output, int size,
local int *localMax)
{
int i = get_global_id(0);
int l = get_local_id(0);
// Private reduction
int privateMax = '\0';
for (int idx = i; idx < size; idx+=get_global_size(0))
{
privateMax = max(privateMax, input[idx]);
}
// Local reduction
localMax[l] = privateMax;
for (int offset = get_local_size(0)/2; offset > 1; offset>>=1)
{
barrier(CLK_LOCAL_MEM_FENCE);
if (l < offset)
{
localMax[l] = max(localMax[l], localMax[l+offset]);
}
}
// Make the last loop iteration's writes visible to work-item 0
barrier(CLK_LOCAL_MEM_FENCE);
// Store work-group result in global memory
if (l == 0)
{
output[get_group_id(0)] = max(localMax[0], localMax[1]);
}
}
This compares pairs of elements at a time using local memory as a scratch space. Each work-group will produce a single result, which is stored in global memory. If your data-set is small, you could run this with a single work-group (i.e. make global and local sizes the same), and this will work just fine. If it is larger, you could run a two-stage reduction by running this kernel twice, e.g.:
size_t N = ...; // something big
size_t local = 128;
size_t global = local*local; // Must result in at most 'local' number of work-groups
// First pass - run many work-groups using temporary buffer as output
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_temp);
clEnqueueNDRangeKernel(..., &global, &local, ...);
// Second pass - run one work-group with temporary buffer as input
global = local;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_temp);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_output);
clEnqueueNDRangeKernel(..., &global, &local, ...);
I'll leave it to you to run them and decide which approach would be best for your own data-set.
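To make the barrier-based reduction above easier to follow, here is a serial sketch in plain C. `tree_max` is a hypothetical name; the `scratch` array plays the role of local memory, each inner-loop iteration stands in for one work-item, and the loop runs all the way down to an offset of 1 instead of finishing with a separate max of two elements.

```c
#include <stddef.h>

/* Serial stand-in for the work-group tree reduction.
 * Assumes n is a power of two and n >= 1; the result ends up in scratch[0]. */
int tree_max(int *scratch, size_t n)
{
    for (size_t offset = n / 2; offset > 0; offset >>= 1) {
        /* On a GPU a barrier would separate consecutive values of offset. */
        for (size_t l = 0; l < offset; l++) {
            if (scratch[l + offset] > scratch[l]) {
                scratch[l] = scratch[l + offset];
            }
        }
    }
    return scratch[0];
}
```

Each pass halves the number of active elements, so n values are reduced in log2(n) steps.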

FFMpeg with X265

I am currently trying to encode raw RGB24 images via x265. I already successfully did this with the x264 library, but a few things have changed as compared to the x265 library.
Here the problem in short: I want to convert the image I have from RGB24 to YUV 4:2:0 via the sws_scale function of FFMPEG. The prototype of the function is:
int sws_scale(SwsContext *c, uint8_t* src[], int srcStride[], int srcSliceY, int srcSliceH, uint8_t* dst[], int dstStride[])
Assuming image contains my raw image, and srcstride and m_height are the corresponding RGB stride and height of my image, I made the following call with x264:
sws_scale(convertCtx, &image, &srcstride, 0, m_height, pic_in.img.plane, pic_in.img.i_stride);
pic_in is of type x264_picture_t, which looks (in brief) as follows:
typedef struct
{
...
x264_image_t img;
} x264_picture_t;
with x264_image_t
typedef struct
{
...
int i_stride[4];
uint8_t *plane[4];
} x264_image_t;
Now, in x265 the structures have slightly changed to
typedef struct x265_picture
{
...
void* planes[3];
int stride[3];
} x265_picture;
And I am now not quite sure how to call the same function
sws_scale(convertCtx, &image, &srcstride, 0, m_height, ????, pic_in.stride);
I tried creating a temporary array, and then copying back and recasting the array items, but it doesn't seem to work:
pic.planes[i] = reinterpret_cast<void*>(tmp[i]) ;
Can someone help me out?
Thanks a lot :)
Edit
I figured it out now
outputSlice = sws_scale(convertCtx, &image, &srcstride, 0, m_height, reinterpret_cast<uint8_t**>(pic_in.planes), pic_in.stride);
This seems to do the trick :)
And by the way, for other people who are experimenting with x265: in x264 there was an x264_picture_alloc function, which I didn't manage to find in x265. So here is a function that I used in my application and that does the trick.
void x265_picture_alloc_custom( x265_picture *pic, int csp, int width, int height, uint32_t depth) {
x265_picture_init(&mParam, pic);
pic->colorSpace = csp;
pic->bitDepth = depth;
pic->sliceType = X265_TYPE_AUTO;
uint32_t pixelbytes = depth > 8 ? 2 : 1;
uint32_t framesize = 0;
for (int i = 0; i < x265_cli_csps[csp].planes; i++)
{
uint32_t w = width >> x265_cli_csps[csp].width[i];
uint32_t h = height >> x265_cli_csps[csp].height[i];
framesize += w * h * pixelbytes;
}
pic->planes[0] = new char[framesize];
pic->planes[1] = (char*)(pic->planes[0]) + width * height * pixelbytes;
pic->planes[2] = (char*)(pic->planes[1]) + ((width * height * pixelbytes) >> 2);
pic->stride[0] = width;
pic->stride[1] = pic->stride[2] = pic->stride[0] >> 1;
}
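The plane arithmetic in that function can be checked in isolation. Below is a small sketch that only computes byte offsets and sizes for a planar 4:2:0 frame held in one buffer; the `yuv420_layout` type and `yuv420_compute_layout` function are hypothetical helpers, not part of x265, and the sketch assumes even dimensions and tightly packed rows.

```c
#include <stddef.h>

/* Hypothetical helper type, not part of x265. */
typedef struct {
    size_t offset[3]; /* byte offset of the Y, U and V planes in one buffer */
    size_t size[3];   /* byte size of each plane */
    size_t total;     /* bytes needed for the whole frame */
} yuv420_layout;

/* Assumes planar 4:2:0 with even width and height, tightly packed rows. */
yuv420_layout yuv420_compute_layout(size_t width, size_t height,
                                    size_t bytes_per_sample)
{
    yuv420_layout lay;
    size_t luma = width * height * bytes_per_sample;
    /* Each chroma plane is subsampled by two in both directions. */
    size_t chroma = (width / 2) * (height / 2) * bytes_per_sample;

    lay.size[0] = luma;
    lay.size[1] = chroma;
    lay.size[2] = chroma;
    lay.offset[0] = 0;
    lay.offset[1] = luma;
    lay.offset[2] = luma + chroma;
    lay.total = luma + 2 * chroma;
    return lay;
}
```

For a 640x480 frame with one byte per sample this gives a total of 460800 bytes, i.e. 1.5 bytes per pixel.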
And I am now not quite sure how to call the same function
sws_scale(convertCtx, &image, &srcstride, 0, m_height, ????,
pic_in.stride);
Have you tried this?
sws_scale(convertCtx, &image, &srcstride, 0, m_height, pic_in.planes,pic_in.stride);
What error do you get? Have you initialized the memory of the x265_picture?

Base case condition in quick sort algorithm

In the recursive quick sort algorithm, every call is guarded by the condition if(p < r). Please correct me if I am wrong: as far as I know, every recursive algorithm has a condition that is checked on entry to the routine, and this condition is used to detect the base case. But I still cannot understand how to set and test this condition correctly.
void quickSort(int* arr, int p, int r)
{
if(p < r)
{
int q = partition(arr,p,r);
quickSort(arr,p,q-1);
quickSort(arr,q+1,r);
}
}
For my entire code, please refer to the following:
/*
filename : main.c
description: quickSort algorithm
*/
#include<iostream>
using namespace std;
void exchange(int* val1, int* val2)
{
int temp = *val1;
*val1 = *val2;
*val2 = temp;
}
int partition(int* arr, int p, int r)
{
int x = arr[r];
int j = p;
int i = j-1;
while(j<=r-1)
{
if(arr[j] <= x)
{
i++;
// exchange arr[i] with arr[j]
exchange(&arr[i],&arr[j]);
}
j++;
}
exchange(&arr[i+1],&arr[r]);
return i+1;
}
void quickSort(int* arr, int p, int r)
{
if(p < r)
{
int q = partition(arr,p,r);
quickSort(arr,p,q-1);
quickSort(arr,q+1,r);
}
}
// driver program to test the quick sort algorithm
int main(int argc, const char* argv[])
{
int arr1[] = {13,19,9,5,12,8,7,4,21,2,6,11};
cout <<"The original array is: ";
for(int i=0; i<12; i++)
{
cout << arr1[i] << " ";
}
cout << "\n";
quickSort(arr1,0,11);
//print out the sorted array
cout <<"The sorted array is: ";
for(int i=0; i<12; i++)
{
cout << arr1[i] << " ";
}
cout << "\n";
cin.get();
return 0;
}
Your question is not quite clear, but I will try to answer.
Quicksort works by sorting smaller and smaller arrays. The base case is an array with fewer than 2 elements, because no sorting is required.
At each step it finds a partition value and makes it true that all the values to the left of the partition value are smaller and all values to the right of the partition value are larger. In other words, it puts the partition value in the correct place. Then it recursively sorts the array to the left of the partition and the array to right of the partition.
The base case of quicksort is an array with one element because a one element array requires no sorting. In your code, p is the index of the first element and r is the index of the last element. The predicate p < r is only true for an array of at least size 2. In other words, if p >= r then you have an array of size 1 (or zero, or nonsense) and there is no work to do.
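One way to convince yourself is to spell the base case out as a length check. The sketch below is a plain C version of the same scheme as in the question; the helper `range_length` and the other names are mine, added only to make the condition explicit.

```c
/* Number of elements in the inclusive index range [p, r];
 * zero when the range is empty (r < p). */
int range_length(int p, int r)
{
    return (r >= p) ? (r - p + 1) : 0;
}

static void swap_ints(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Partition around the last element, as in the question's code. */
static int partition_range(int *arr, int p, int r)
{
    int pivot = arr[r];
    int i = p - 1;
    for (int j = p; j < r; j++) {
        if (arr[j] <= pivot) {
            i++;
            swap_ints(&arr[i], &arr[j]);
        }
    }
    swap_ints(&arr[i + 1], &arr[r]);
    return i + 1;
}

void quick_sort(int *arr, int p, int r)
{
    /* Base case: fewer than two elements. Same test as !(p < r). */
    if (range_length(p, r) < 2) {
        return;
    }
    int q = partition_range(arr, p, r);
    quick_sort(arr, p, q - 1);   /* may be an empty range: hits the base case */
    quick_sort(arr, q + 1, r);
}
```

When the pivot lands at either end, one of the recursive calls gets an empty range such as [p, p-1], and the length check returns immediately.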

OpenCL autocorrelation kernel

I have written a simple program that does autocorrelation as follows...I've used pgi accelerator directives to move the computation to GPUs.
//autocorrelation
void autocorr(float *restrict A, float *restrict C, int N)
{
int i, j;
float sum;
#pragma acc region
{
for (i = 0; i < N; i++) {
sum = 0.0;
for (j = 0; j < N; j++) {
if ((i+j) < N)
sum += A[j] * A[i+j];
else
continue;
}
C[i] = sum;
}
}
}
I wrote a similar program in OpenCL, but I am not getting correct results. The program is as follows. I am new to GPU programming, so apart from hints that could fix my error, any other advice is welcome.
__kernel void autocorrel1D(__global double *Vol_IN, __global double *Vol_AUTOCORR, int size)
{
int j, gid = get_global_id(0);
double sum = 0.0;
for (j = 0; j < size; j++) {
if ((gid+j) < size)
{
sum += Vol_IN[j] * Vol_IN[gid+j];
}
else
continue;
}
barrier(CLK_GLOBAL_MEM_FENCE);
Vol_AUTOCORR[gid] = sum;
}
Since I have passed the dimension as 1, I am assuming that my get_global_id(0) call gives me the id of the current work-item, which is used to access the input 1D array.
Thanks,
Sayan
The code is correct. As far as I know, it should run fine and give correct results.
barrier(CLK_GLOBAL_MEM_FENCE); is not needed. You'll get more speed without that statement.
Your problem is probably outside the kernel: check that you are passing the input correctly and that you are reading the correct data back from the GPU.
BTW, I suppose you are using a GPU that supports double precision, as you are doing double calculations.
Check also that you are passing double values. Remember that you CAN'T point a float pointer at a double value, or vice versa. That will give you wrong results.
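A simple way to check the host side is to compare what comes back from the GPU against a CPU reference. Here is a minimal sketch in double precision; `autocorr_reference` is a name I made up, and it computes the same sums as the kernel above.

```c
#include <stddef.h>

/* CPU reference: out[i] = sum over j of in[j] * in[i + j], for i + j < n. */
void autocorr_reference(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; i + j < n; j++) {
            sum += in[j] * in[i + j];
        }
        out[i] = sum;
    }
}
```

If the buffer read back from the device differs from this by more than rounding error, the likely suspects are the buffer sizes, the element type (float vs. double) or the kernel arguments.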
