opencl : atomic function slow the execution time

opencl : atomic function slow the execution time - opencl

I am trying to detect a circle in binary image using hough transform.
the kernel code is very slow when execute. the wait time in atomic function
kernel void hough_circle(read_only image2d_t imageIn, global int* in,const int w_hough,__global int * circle)
{
sampler_t sampler=CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int gid0 = get_global_id(0);
int gid1 = get_global_id(1);
uint4 pixel;
int x0=0,y0=0,r;
int maxval=0;
pixel=read_imageui(imageIn,sampler,(int2)(gid0,gid1));
if(pixel.x==255)
{
for(int r=20;r<150;r+=2)
{
// int r=100;
for(int theta=0; theta<360;theta+=2)
{
x0=(int) round(gid0-r*cos( (float) radians( (float) theta) ));
y0=(int) round(gid1-r*sin( (float) radians( (float) theta) ));
if((x0>0) && (x0<get_global_size(0)) && (y0>0)&&(y0<get_global_size(1)))
atom_inc(&in[w_hough*y0+x0]);
}
if(maxval<in[w_hough*y0+x0])
{
maxval=in[w_hough*y0+x0];
circle[0]=gid0;
circle[1]=gid1;
circle[2]=r;
}
}
}
}
can any one help me what change i do to avoid that
the code main.cpp and kernel.cl compress in rar file http://www.files.com/set/527152684017e use opencv lib for read and display img

Atomic operations by their nature are very slow compared to non-atomic equivalents. You should not use them in an inner-loop.
Try to do as much as you can with local memory, and only use atomics to store final results.
Search for "parallel reduction" because this transform has a lot in common with a reduction (the output of each pixel is a sum of weights).

Related

OpenCL sum `cl_khr_fp64` double values into a single number

From this question and this question I managed to compile a minimal example of summing a vector into a single double inside OpenCL 1.2.
/* https://suhorukov.blogspot.com/2011/12/opencl-11-atomic-operations-on-floating.html */
inline void AtomicAdd(volatile __global double *source, const double operand) {
union { unsigned int intVal; double floatVal; } prevVal, newVal;
do {
prevVal.floatVal = *source;
newVal.floatVal = prevVal.floatVal + operand;
} while( atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal );
}
void kernel cost_function(__constant double* inputs, __global double* outputs){
int index = get_global_id(0);
if(0 == error_index){ outputs[0] = 0.0; }
barrier(CLK_GLOBAL_MEM_FENCE);
AtomicAdd(&outputs[0], inputs[index]); /* (1) */
//AtomicAdd(&outputs[0], 5.0); /* (2) */
}
As in fact this solution is incorrect because the result is always 0 when the buffer is accessed. What might the problem with this?
the code at /* (1) */ doesn't work, and neither does the code at /* (2) */, which is only there to test the logic independent of any inputs.
Is barrier(CLK_GLOBAL_MEM_FENCE); used correctly here to reset the output before any calculations are done to it?
According to the specs in OpenCL 1.2 single precision floating point numbers are supported by atomic operations, is this(AtomicAdd) a feasible method of extending the support to double precision numbers or am I missing something?
Of course the device I am testing with supports cl_khr_fp64˙of course.

Your AtomicAdd is incorrect. Namely, the 2 errors are:
In the union, intVal must be a 64-bit integer and not 32-bit integer.
Use the 64-bit atom_cmpxchg function and not the 32-bit atomic_cmpxchg function.
The correct implementation is:
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
inline void AtomicAdd(volatile __global double *source, const double operand) {
union { unsigned ulong u64; double f64; } prevVal, newVal;
do {
prevVal.f64 = *source;
newVal.f64 = prevVal.f64 + operand;
} while(atom_cmpxchg((volatile __global ulong*)source, prevVal.u64, newVal.u64) != prevVal.u64);
}
barrier(CLK_GLOBAL_MEM_FENCE); is used correctly here. Note that a barrier must not be in an if- or else-branch.
UPDATE: According to STREAMHPC, the original implementation you use is not guaranteed to produce correct results. There is an improved implementation:
void __attribute__((always_inline)) atomic_add_f(volatile global float* addr, const float val) {
union {
uint u32;
float f32;
} next, expected, current;
current.f32 = *addr;
do {
next.f32 = (expected.f32=current.f32)+val; // ...*val for atomic_mul_f()
current.u32 = atomic_cmpxchg((volatile global uint*)addr, expected.u32, next.u32);
} while(current.u32!=expected.u32);
}
#ifdef cl_khr_int64_base_atomics
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
void __attribute__((always_inline)) atomic_add_d(volatile global double* addr, const double val) {
union {
ulong u64;
double f64;
} next, expected, current;
current.f64 = *addr;
do {
next.f64 = (expected.f64=current.f64)+val; // ...*val for atomic_mul_d()
current.u64 = atom_cmpxchg((volatile global ulong*)addr, expected.u64, next.u64);
} while(current.u64!=expected.u64);
}
#endif

How can I write the memory pointer in CUDA [duplicate]

This question already has an answer here:
Summing the rows of a matrix (stored in either row-major or column-major order) in CUDA
(1 answer)
Closed 5 years ago.
I declared two GPU memory pointers, and allocated the GPU memory, transfer data and launch the kernel in the main:
// declare GPU memory pointers
char * gpuIn;
char * gpuOut;
// allocate GPU memory
cudaMalloc(&gpuIn, ARRAY_BYTES);
cudaMalloc(&gpuOut, ARRAY_BYTES);
// transfer the array to the GPU
cudaMemcpy(gpuIn, currIn, ARRAY_BYTES, cudaMemcpyHostToDevice);
// launch the kernel
role<<<dim3(1),dim3(40,20)>>>(gpuOut, gpuIn);
// copy back the result array to the CPU
cudaMemcpy(currOut, gpuOut, ARRAY_BYTES, cudaMemcpyDeviceToHost);
cudaFree(gpuIn);
cudaFree(gpuOut);
And this is my code inside the kernel:
__global__ void role(char * gpuOut, char * gpuIn){
int idx = threadIdx.x;
int idy = threadIdx.y;
char live = '0';
char dead = '.';
char f = gpuIn[idx][idy];
if(f==live){
gpuOut[idx][idy]=dead;
}
else{
gpuOut[idx][idy]=live;
}
}
But here are some errors, I think here are some errors on the pointers. Any body can give a help?

The key concept is the storage order of multidimensional arrays in memory -- this is well described here. A useful abstraction is to define a simple class which encapsulates a pointer to a multidimensional array stored in linear memory and provides an operator which gives something like the usual a[i][j] style access. Your code could be modified something like this:
template<typename T>
struct array2d
{
T* p;
size_t lda;
__device__ __host__
array2d(T* _p, size_t _lda) : p(_p), lda(_lda) {};
__device__ __host__
T& operator()(size_t i, size_t j) {
return p[j + i * lda];
}
__device__ __host__
const T& operator()(size_t i, size_t j) const {
return p[j + i * lda];
}
};
__global__ void role(array2d<char> gpuOut, array2d<char> gpuIn){
int idx = threadIdx.x;
int idy = threadIdx.y;
char live = '0';
char dead = '.';
char f = gpuIn(idx,idy);
if(f==live){
gpuOut(idx,idy)=dead;
}
else{
gpuOut(idx,idy)=live;
}
}
int main()
{
const int rows = 5, cols = 6;
const size_t ARRAY_BYTES = sizeof(char) * size_t(rows * cols);
// declare GPU memory pointers
char * gpuIn;
char * gpuOut;
char currIn[rows][cols], currOut[rows][cols];
// allocate GPU memory
cudaMalloc(&gpuIn, ARRAY_BYTES);
cudaMalloc(&gpuOut, ARRAY_BYTES);
// transfer the array to the GPU
cudaMemcpy(gpuIn, currIn, ARRAY_BYTES, cudaMemcpyHostToDevice);
// launch the kernel
role<<<dim3(1),dim3(rows,cols)>>>(array2d<char>(gpuOut, cols), array2d<char>(gpuIn, cols));
// copy back the result array to the CPU
cudaMemcpy(currOut, gpuOut, ARRAY_BYTES, cudaMemcpyDeviceToHost);
cudaFree(gpuIn);
cudaFree(gpuOut);
return 0;
}
The important point here is that a two dimensional C or C++ array stored in linear memory can be addressed as col + row * number of cols. The class in the code above is just a convenient way of expressing this.

Arduino Dynamic Two-dimensional array

I'm working on an Arduino project where I need to build (and work with) a two-dimensional array at runtime. I've been poking around looking for a solution, but I've had no luck. I found an example of a dynamic one-dimentional array helper here: http://playground.arduino.cc/Code/DynamicArrayHelper, so i've been trying to adopt that code for my use. I created a library using the following code:
My Header file:
#ifndef Dynamic2DArray_h
#define Dynamic2DArray_h
#include "Arduino.h"
class Dynamic2DArray
{
public:
Dynamic2DArray( bool sorted );
//Add an integer pair to the array
bool add( int v1, int v2);
//Clear out (empty) the array
bool clear();
//Get the array item in the specified row, column
int getValue(int row, int col);
//Get the number of rows in the array
int length();
private:
int _rows;
void * _slots;
bool _sorted;
void _sort();
};
#endif
The library's code:
#include "Arduino.h"
#include "Dynamic2DArray.h"
#define ARRAY_COLUMNS 2
int _rows;
void * _slots;
bool _sorted;
Dynamic2DArray::Dynamic2DArray(bool sorted) {
//Set our local value indicating where we're supposed to
//sort or not
_sorted = sorted;
//Initialize the row count so it starts at zero
_rows = 0;
}
bool Dynamic2DArray::add( int v1, int v2) {
//Add the values to the array
//implementation adapted from http://playground.arduino.cc/Code/DynamicArrayHelper
//Allocate memory based on the size of the current array rows plus one (the new row)
int elementSize = sizeof(int) * ARRAY_COLUMNS;
//calculate how much memory the current array is using
int currentBufferSize = elementSize * _rows;
//calculate how much memory the new array will use
int newBufferSize = elementSize * (_rows + 1);
//allocate memory for the new array (which should be bigger than the old one)
void * newArray = malloc ( newBufferSize );
//Does newArray not point to something (a memory address)?
if (newArray == 0) {
//Then malloc failed, so return false
return false;
}
// copy the data from the old array, to the new array
for (int idx = 0; idx < currentBufferSize ; idx++)
{
((byte*)newArray)[idx] = ((byte *)_slots)[idx];
}
// free the original array
if (_slots != NULL)
{
free(_slots);
}
// clear the newly allocated memory space (the new row)
for (int idx = currentBufferSize; idx < newBufferSize; idx++)
{
((byte *)newArray)[idx] = 0;
}
// Store the number of rows the memory is allocated for
_rows = ++_rows;
// set the array to the newly created array
_slots = newArray;
//Free up the memory used by the new array
free(newArray);
//If the array's supposed to be sorted,
//then sort it
if (_sorted) {
_sort();
}
// success
return true;
};
int Dynamic2DArray::length() {
return _rows;
};
bool Dynamic2DArray::clear() {
//Free up the memory allocated to the _slots array
free(_slots);
//And zero out the row count
_rows = 0;
};
int Dynamic2DArray::getValue(int row, int col) {
//do we have a valid row/col?
if ((row < _rows) && (col < ARRAY_COLUMNS)) {
//Return the array value at that row/col
return _slots[row][col];
} else {
//No? Then there's nothing we can do here
return -1;
}
};
//Sorted probably doesn't matter, I can probably ignore this one
void _sort() {
}
The initial assignment of the _slots value is giving me problems, I don't know how to define it so this code builds. The _slots variable is supposed to point to the dynamic array, but I've got it wrong.
When I try to compile the code into my project's code, I get the following:
Arduino: 1.8.0 (Windows 10), Board: "Pro Trinket 3V/12MHz (USB)"
sketch\Dynamic2DArray.cpp: In member function 'int Dynamic2DArray::getValue(int, int)':
sketch\Dynamic2DArray.cpp:83:22: warning: pointer of type 'void *' used in arithmetic [-Wpointer-arith]
return _slots[row][col];
^
Dynamic2DArray.cpp:83: error: 'void*' is not a pointer-to-object type
Can someone please help me fix this code? I've posted the files to https://github.com/johnwargo/Arduino-Dynamic-2D-Array-Lib.

The code you took was for a 1D dynamic array; the modifications for a 2D array are too tricky. Give up these horrors.
I think there is no reason you use dynamic array. You can assume that size max is ROW_MAX * COL_MAX, so you can define a static array int array[ROW_MAX][COL_MAX].
on one hand if you defined a dynamic array, you could free space when you dont use it anymore and take advantage of it for other work. I dont know if this is your case.
on the other hand if you define a static array (on UNO), you have 32kB available on program space, instead of 2kB available on RAM.
Because of the difference 32kB / 2kB, there are very few chances you can get bigger array with dynamic allocation.

Multiple read-write synchronization issues in opencl local and global memories

I have an opencl kernel that finds the maximum ASCII character in a string.
The problem is I cannot synchronize the multiple read-writes to global and local memories.
I am trying to update a local_maximum character in shared memory, and at the end of the workgroup (last thread), the global_maximum character, by comparing it with the local_maximum. The threads are writing one over another, I guess.
eg: Input string: "pirates of the carribean".
Output String: 'r' (but it should be 's').
Please have a look at the code and give a solution as to what I can do to get everything synchronized. I am sure people having sound knowledge can understand the code. Optimization tips are welcome.
The code is below:
__kernel void find_highest_ascii( __global const char* data, __global char* result, unsigned int size, __local char* localMaxC )
{
//creating variables and initialising..
unsigned int i, localSize, globalSize, j;
char privateMaxC,temp,temp1;
i = get_global_id(0);
localSize = get_local_size(0);
globalSize = get_global_size(0);
privateMaxC = '\0';
if(i<size){
if(i == 0)
read_mem_fence( CLK_LOCAL_MEM_FENCE );
*localMaxC = '\0';
mem_fence( CLK_LOCAL_MEM_FENCE);
////////////////////////////////////////////////////
/////UPDATING PRIVATE MAX CHARACTER/////////////////
////////////////////////////////////////////////////
for( j = i; j<size; j+=globalSize )
{
if( data[j] > privateMaxC )
{
privateMaxC = data[j];
}
}
///////////////////////////////////////////////////
///////////////////////////////////////////////////
////UPDATING SHARED MAX CHARACTER//////////////////
///////////////////////////////////////////////////
temp = *localMaxC;
read_mem_fence( CLK_LOCAL_MEM_FENCE );
if(privateMaxC>temp)
{
*localMaxC = privateMaxC;
write_mem_fence( CLK_LOCAL_MEM_FENCE );
temp = privateMaxC;
}
//////////////////////////////////////////////////
//UPDATING GLOBAL MAX CHARACTER.
temp1 = *result;
if(( (i+1)%localSize == 0 || i==size-1) && (temp > temp1 ))
{
read_mem_fence( CLK_GLOBAL_MEM_FENCE );
*result = temp;
write_mem_fence( CLK_GLOBAL_MEM_FENCE );
}
}
}

You are correct that threads will be overwriting each other's values, since your code is riddled with race conditions. In OpenCL, there is no way to synchronise between work-items that are in different work-groups. Instead of trying to achieve this kind of synchronisation with explicit fences, you can make your code much simpler by using the built-in atomic functions instead. In particular, there is an atomic_max built-in which solves your problem perfectly.
So, instead of the code you currently have to update both your local and global memory maximum values, just do something like this:
kernel void ascii_max(global int *input, global int *output, int size,
local int *localMax)
{
int i = get_global_id(0);
int l = get_local_id(0);
// Private reduction
int privateMax = '\0';
for (int idx = i; idx < size; idx+=get_global_size(0))
{
privateMax = max(privateMax, input[idx]);
}
// Local reduction
atomic_max(localMax, privateMax);
barrier(CLK_LOCAL_MEM_FENCE);
// Global reduction
if (l == 0)
{
atomic_max(output, *localMax);
}
}
This will require you to update your local memory scratch space and final result to use 32-bit integer values, but on the whole is a significantly cleaner approach to solving this problem (not to mention it actually works).
NON-ATOMIC SOLUTION
If you really don't want to use atomics, then you can implement a bog-standard reduction using local memory and work-group barriers. Here's an example:
kernel void ascii_max(global int *input, global int *output, int size,
local int *localMax)
{
int i = get_global_id(0);
int l = get_local_id(0);
// Private reduction
int privateMax = '\0';
for (int idx = i; idx < size; idx+=get_global_size(0))
{
privateMax = max(privateMax, input[idx]);
}
// Local reduction
localMax[l] = privateMax;
for (int offset = get_local_size(0)/2; offset > 1; offset>>=1)
{
barrier(CLK_LOCAL_MEM_FENCE);
if (l < offset)
{
localMax[l] = max(localMax[l], localMax[l+offset]);
}
}
// Store work-group result in global memory
if (l == 0)
{
output[get_group_id(0)] = max(localMax[0], localMax[1]);
}
}
This compares pairs of elements at a time using local memory as a scratch space. Each work-group will produce a single result, which is stored in global memory. If your data-set is small, you could run this with a single work-group (i.e. make global and local sizes the same), and this will work just fine. If it is larger, you could run a two-stage reduction by running this kernel twice, e.g.:
size_t N = ...; // something big
size_t local = 128;
size_t global = local*local; // Must result in at most 'local' number of work-groups
// First pass - run many work-groups using temporary buffer as output
clSetKernelArg(kernel, 1, sizeof(cl_mem), d_temp);
clEnqueueNDRangeKernel(..., &global, &local, ...);
// Second pass - run one work-group with temporary buffer as input
global = local;
clSetKernelArg(kernel, 0, sizeof(cl_mem), d_temp);
clSetKernelArg(kernel, 1, sizeof(cl_mem), d_output);
clEnqueueNDRangeKernel(..., &global, &local, ...);
I'll leave it to you to run them and decide which approach would be best for your own data-set.

OpenCL autocorrelation kernel

I have written a simple program that does autocorrelation as follows...I've used pgi accelerator directives to move the computation to GPUs.
//autocorrelation
void autocorr(float *restrict A, float *restrict C, int N)
{
int i, j;
float sum;
#pragma acc region
{
for (i = 0; i < N; i++) {
sum = 0.0;
for (j = 0; j < N; j++) {
if ((i+j) < N)
sum += A[j] * A[i+j];
else
continue;
}
C[i] = sum;
}
}
}
I wrote a similar program in OpenCL, but I am not getting correct results. The program is as follows...I am new to GPU programming, so apart from hints that could fix my error, any other advices are welcome.
__kernel void autocorrel1D(__global double *Vol_IN, __global double *Vol_AUTOCORR, int size)
{
int j, gid = get_global_id(0);
double sum = 0.0;
for (j = 0; j < size; j++) {
if ((gid+j) < size)
{
sum += Vol_IN[j] * Vol_IN[gid+j];
}
else
continue;
}
barrier(CLK_GLOBAL_MEM_FENCE);
Vol_AUTOCORR[gid] = sum;
}
Since I have passed the dimension to be 1, so I am considering my get_global_size(0) call would give me the id of the current block, which is used to access the input 1d array.
Thanks,
Sayan

The code is correct. As far as I know, that should run fine and give corret results.
barrier(CLK_GLOBAL_MEM_FENCE); is not needed. You'll get more speed without that sentence.
Your problem should be outside the kernel, check that you a re passing correctly the input, and you are taking out of GPU the correct data.
BTW, I supose you are using a double precision suported GPU as you are doing double calcs.
Check that you are passing also double values. Remember you CAN't point a float pointer to a double value, and viceversa. That will give you wrong results.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

opencl : atomic function slow the execution time - opencl

Related

OpenCL sum `cl_khr_fp64` double values into a single number

How can I write the memory pointer in CUDA [duplicate]

Arduino Dynamic Two-dimensional array

Multiple read-write synchronization issues in opencl local and global memories

OpenCL autocorrelation kernel

Categories

Resources