How can I write the memory pointer in CUDA [duplicate] - pointers

This question already has an answer here:
Summing the rows of a matrix (stored in either row-major or column-major order) in CUDA
(1 answer)
Closed 5 years ago.
I declared two GPU memory pointers, and allocated the GPU memory, transfer data and launch the kernel in the main:
// declare GPU memory pointers
char * gpuIn;
char * gpuOut;
// allocate GPU memory
cudaMalloc(&gpuIn, ARRAY_BYTES);
cudaMalloc(&gpuOut, ARRAY_BYTES);
// transfer the array to the GPU
cudaMemcpy(gpuIn, currIn, ARRAY_BYTES, cudaMemcpyHostToDevice);
// launch the kernel
role<<<dim3(1),dim3(40,20)>>>(gpuOut, gpuIn);
// copy back the result array to the CPU
cudaMemcpy(currOut, gpuOut, ARRAY_BYTES, cudaMemcpyDeviceToHost);
cudaFree(gpuIn);
cudaFree(gpuOut);
And this is my code inside the kernel:
__global__ void role(char * gpuOut, char * gpuIn){
int idx = threadIdx.x;
int idy = threadIdx.y;
char live = '0';
char dead = '.';
char f = gpuIn[idx][idy];
if(f==live){
gpuOut[idx][idy]=dead;
}
else{
gpuOut[idx][idy]=live;
}
}
But here are some errors, I think here are some errors on the pointers. Any body can give a help?

The key concept is the storage order of multidimensional arrays in memory -- this is well described here. A useful abstraction is to define a simple class which encapsulates a pointer to a multidimensional array stored in linear memory and provides an operator which gives something like the usual a[i][j] style access. Your code could be modified something like this:
template<typename T>
struct array2d
{
T* p;
size_t lda;
__device__ __host__
array2d(T* _p, size_t _lda) : p(_p), lda(_lda) {};
__device__ __host__
T& operator()(size_t i, size_t j) {
return p[j + i * lda];
}
__device__ __host__
const T& operator()(size_t i, size_t j) const {
return p[j + i * lda];
}
};
__global__ void role(array2d<char> gpuOut, array2d<char> gpuIn){
int idx = threadIdx.x;
int idy = threadIdx.y;
char live = '0';
char dead = '.';
char f = gpuIn(idx,idy);
if(f==live){
gpuOut(idx,idy)=dead;
}
else{
gpuOut(idx,idy)=live;
}
}
int main()
{
const int rows = 5, cols = 6;
const size_t ARRAY_BYTES = sizeof(char) * size_t(rows * cols);
// declare GPU memory pointers
char * gpuIn;
char * gpuOut;
char currIn[rows][cols], currOut[rows][cols];
// allocate GPU memory
cudaMalloc(&gpuIn, ARRAY_BYTES);
cudaMalloc(&gpuOut, ARRAY_BYTES);
// transfer the array to the GPU
cudaMemcpy(gpuIn, currIn, ARRAY_BYTES, cudaMemcpyHostToDevice);
// launch the kernel
role<<<dim3(1),dim3(rows,cols)>>>(array2d<char>(gpuOut, cols), array2d<char>(gpuIn, cols));
// copy back the result array to the CPU
cudaMemcpy(currOut, gpuOut, ARRAY_BYTES, cudaMemcpyDeviceToHost);
cudaFree(gpuIn);
cudaFree(gpuOut);
return 0;
}
The important point here is that a two dimensional C or C++ array stored in linear memory can be addressed as col + row * number of cols. The class in the code above is just a convenient way of expressing this.

Related

OpenCL sum `cl_khr_fp64` double values into a single number

From this question and this question I managed to compile a minimal example of summing a vector into a single double inside OpenCL 1.2.
/* https://suhorukov.blogspot.com/2011/12/opencl-11-atomic-operations-on-floating.html */
inline void AtomicAdd(volatile __global double *source, const double operand) {
union { unsigned int intVal; double floatVal; } prevVal, newVal;
do {
prevVal.floatVal = *source;
newVal.floatVal = prevVal.floatVal + operand;
} while( atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal );
}
void kernel cost_function(__constant double* inputs, __global double* outputs){
int index = get_global_id(0);
if(0 == error_index){ outputs[0] = 0.0; }
barrier(CLK_GLOBAL_MEM_FENCE);
AtomicAdd(&outputs[0], inputs[index]); /* (1) */
//AtomicAdd(&outputs[0], 5.0); /* (2) */
}
As in fact this solution is incorrect because the result is always 0 when the buffer is accessed. What might the problem with this?
the code at /* (1) */ doesn't work, and neither does the code at /* (2) */, which is only there to test the logic independent of any inputs.
Is barrier(CLK_GLOBAL_MEM_FENCE); used correctly here to reset the output before any calculations are done to it?
According to the specs in OpenCL 1.2 single precision floating point numbers are supported by atomic operations, is this(AtomicAdd) a feasible method of extending the support to double precision numbers or am I missing something?
Of course the device I am testing with supports cl_khr_fp64˙of course.
Your AtomicAdd is incorrect. Namely, the 2 errors are:
In the union, intVal must be a 64-bit integer and not 32-bit integer.
Use the 64-bit atom_cmpxchg function and not the 32-bit atomic_cmpxchg function.
The correct implementation is:
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
inline void AtomicAdd(volatile __global double *source, const double operand) {
union { unsigned ulong u64; double f64; } prevVal, newVal;
do {
prevVal.f64 = *source;
newVal.f64 = prevVal.f64 + operand;
} while(atom_cmpxchg((volatile __global ulong*)source, prevVal.u64, newVal.u64) != prevVal.u64);
}
barrier(CLK_GLOBAL_MEM_FENCE); is used correctly here. Note that a barrier must not be in an if- or else-branch.
UPDATE: According to STREAMHPC, the original implementation you use is not guaranteed to produce correct results. There is an improved implementation:
void __attribute__((always_inline)) atomic_add_f(volatile global float* addr, const float val) {
union {
uint u32;
float f32;
} next, expected, current;
current.f32 = *addr;
do {
next.f32 = (expected.f32=current.f32)+val; // ...*val for atomic_mul_f()
current.u32 = atomic_cmpxchg((volatile global uint*)addr, expected.u32, next.u32);
} while(current.u32!=expected.u32);
}
#ifdef cl_khr_int64_base_atomics
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
void __attribute__((always_inline)) atomic_add_d(volatile global double* addr, const double val) {
union {
ulong u64;
double f64;
} next, expected, current;
current.f64 = *addr;
do {
next.f64 = (expected.f64=current.f64)+val; // ...*val for atomic_mul_d()
current.u64 = atom_cmpxchg((volatile global ulong*)addr, expected.u64, next.u64);
} while(current.u64!=expected.u64);
}
#endif

How to retrieve an int array that is stored in a table using PROGMEM?

I'm new to Arduino and currently learn to use PROGMEM to store variables so that I can save dynamic memory. I have 13 variables including these three below that I store using PROGMEM.
Here are some of example of variables that I store and use it in my functions :-
const unsigned int raw_0[62] PROGMEM = {2600,850,400,500,400,500,450,850,450,850,1350,850,450,450,400,500,400,450,450,400,450,450,450,450,400,450,900,850,900,850,900,450,450,850,900,850,900,850,450,450,900,450,400,450,400,900,450,450,450,400,450,450,450,450,400,450,450,450,450,400,450,};
const unsigned int raw_1[60] PROGMEM = {2600,850,450,450,450,450,450,850,450,850,1350,850,500,400,450,400,450,450,450,450,400,450,450,450,400,450,900,850,900,900,850,450,450,850,850,900,900,900,400,450,900,450,450,400,450,850,450,450,450,450,400,450,450,450,450,400,450,450,850,};
const unsigned int raw_a[100] PROGMEM = {3500,1700,400,450,450,1250,450,400,450,400,450,400,500,400,450,400,450,400,450,400,450,450,400,400,500,400,450,400,450,1300,400,450,450,400,450,400,450,400,450,400,450,400,500,350,500,400,450,400,450,1300,400,400,500,400,450,400,450,400,450,450,400,450,450,400,450,400,450,400,450,400,450,450,400,450,450,400,450,1250,450,400,450,400,500,400,450,400,450,400,450,400,450,400,450,1300,450,400,450,1250,450,};
Here is the table that store the variables. I learn this approach from Arduino website; https://www.arduino.cc/en/Reference/PROGMEM .
const unsigned int* const myTable[13] PROGMEM = {
raw_0,
raw_1,
raw_2,
raw_3,
raw_4,
raw_5,
raw_6,
raw_7,
raw_8,
raw_9,
raw_a,
raw_b,
raw_c};
My problem is, how do I retrieve these variables using PROGMEM such as raw_1 and raw_a ?
This is what I did but it did not work :-
unsigned int * ptr = (unsigned int *) pgm_read_word (&myTable [1]);
irsend.sendRaw(ptr,62,38);
Most of examples that I found, they use String or char datatype but in my case, I use array integer.
The ptr is also pointer to PROGMEM, so you have to read the value (or values in this case) by pgm_read_word. The IR library doesn't support that at all (I hope it's the correct one).
Anyway sendRaw implementation is this:
void IRsend::sendRaw (const unsigned int buf[], unsigned int len, unsigned int hz)
{
// Set IR carrier frequency
enableIROut(hz);
for (unsigned int i = 0; i < len; i++) {
if (i & 1) space(buf[i]) ;
else mark (buf[i]) ;
}
space(0); // Always end with the LED off
}
And all used methods are public, so you can implement your own function to do the same:
void mySendRaw (IRsend & dev, const unsigned int buf[], unsigned int len, unsigned int khz)
{
// Set IR carrier frequency
dev.devenableIROut(khz);
for (unsigned int i = 0; i < len; i++) {
if (i & 1) dev.space(pgm_read_word(buf+i));
else dev.mark (pgm_read_word(buf+i));
}
dev.space(0); // Always end with the LED off
}
// And usage:
mySendRaw(irsend, (const uint16_t*)pgm_read_word(myTable+1), 62, 38);
However the size of arrays should be stored somewhere too, so you can use something like:
byte cmd = 1;
mySendRaw(irsend, (const uint16_t*)pgm_read_word(myTable+cmd), pgm_read_word(myTableLenghts+cmd), 38);

Sizeof pointer of pointer in C [duplicate]

First off, here is some code:
int main()
{
int days[] = {1,2,3,4,5};
int *ptr = days;
printf("%u\n", sizeof(days));
printf("%u\n", sizeof(ptr));
return 0;
}
Is there a way to find out the size of the array that ptr is pointing to (instead of just giving its size, which is four bytes on a 32-bit system)?
No, you can't. The compiler doesn't know what the pointer is pointing to. There are tricks, like ending the array with a known out-of-band value and then counting the size up until that value, but that's not using sizeof().
Another trick is the one mentioned by Zan, which is to stash the size somewhere. For example, if you're dynamically allocating the array, allocate a block one int bigger than the one you need, stash the size in the first int, and return ptr+1 as the pointer to the array. When you need the size, decrement the pointer and peek at the stashed value. Just remember to free the whole block starting from the beginning, and not just the array.
The answer is, "No."
What C programmers do is store the size of the array somewhere. It can be part of a structure, or the programmer can cheat a bit and malloc() more memory than requested in order to store a length value before the start of the array.
For dynamic arrays (malloc or C++ new) you need to store the size of the array as mentioned by others or perhaps build an array manager structure which handles add, remove, count, etc. Unfortunately C doesn't do this nearly as well as C++ since you basically have to build it for each different array type you are storing which is cumbersome if you have multiple types of arrays that you need to manage.
For static arrays, such as the one in your example, there is a common macro used to get the size, but it is not recommended as it does not check if the parameter is really a static array. The macro is used in real code though, e.g. in the Linux kernel headers although it may be slightly different than the one below:
#if !defined(ARRAY_SIZE)
#define ARRAY_SIZE(x) (sizeof((x)) / sizeof((x)[0]))
#endif
int main()
{
int days[] = {1,2,3,4,5};
int *ptr = days;
printf("%u\n", ARRAY_SIZE(days));
printf("%u\n", sizeof(ptr));
return 0;
}
You can google for reasons to be wary of macros like this. Be careful.
If possible, the C++ stdlib such as vector which is much safer and easier to use.
There is a clean solution with C++ templates, without using sizeof(). The following getSize() function returns the size of any static array:
#include <cstddef>
template<typename T, size_t SIZE>
size_t getSize(T (&)[SIZE]) {
return SIZE;
}
Here is an example with a foo_t structure:
#include <cstddef>
template<typename T, size_t SIZE>
size_t getSize(T (&)[SIZE]) {
return SIZE;
}
struct foo_t {
int ball;
};
int main()
{
foo_t foos3[] = {{1},{2},{3}};
foo_t foos5[] = {{1},{2},{3},{4},{5}};
printf("%u\n", getSize(foos3));
printf("%u\n", getSize(foos5));
return 0;
}
Output:
3
5
As all the correct answers have stated, you cannot get this information from the decayed pointer value of the array alone. If the decayed pointer is the argument received by the function, then the size of the originating array has to be provided in some other way for the function to come to know that size.
Here's a suggestion different from what has been provided thus far,that will work: Pass a pointer to the array instead. This suggestion is similar to the C++ style suggestions, except that C does not support templates or references:
#define ARRAY_SZ 10
void foo (int (*arr)[ARRAY_SZ]) {
printf("%u\n", (unsigned)sizeof(*arr)/sizeof(**arr));
}
But, this suggestion is kind of silly for your problem, since the function is defined to know exactly the size of the array that is passed in (hence, there is little need to use sizeof at all on the array). What it does do, though, is offer some type safety. It will prohibit you from passing in an array of an unwanted size.
int x[20];
int y[10];
foo(&x); /* error */
foo(&y); /* ok */
If the function is supposed to be able to operate on any size of array, then you will have to provide the size to the function as additional information.
For this specific example, yes, there is, IF you use typedefs (see below). Of course, if you do it this way, you're just as well off to use SIZEOF_DAYS, since you know what the pointer is pointing to.
If you have a (void *) pointer, as is returned by malloc() or the like, then, no, there is no way to determine what data structure the pointer is pointing to and thus, no way to determine its size.
#include <stdio.h>
#define NUM_DAYS 5
typedef int days_t[ NUM_DAYS ];
#define SIZEOF_DAYS ( sizeof( days_t ) )
int main() {
days_t days;
days_t *ptr = &days;
printf( "SIZEOF_DAYS: %u\n", SIZEOF_DAYS );
printf( "sizeof(days): %u\n", sizeof(days) );
printf( "sizeof(*ptr): %u\n", sizeof(*ptr) );
printf( "sizeof(ptr): %u\n", sizeof(ptr) );
return 0;
}
Output:
SIZEOF_DAYS: 20
sizeof(days): 20
sizeof(*ptr): 20
sizeof(ptr): 4
There is no magic solution. C is not a reflective language. Objects don't automatically know what they are.
But you have many choices:
Obviously, add a parameter
Wrap the call in a macro and automatically add a parameter
Use a more complex object. Define a structure which contains the dynamic array and also the size of the array. Then, pass the address of the structure.
You can do something like this:
int days[] = { /*length:*/5, /*values:*/ 1,2,3,4,5 };
int *ptr = days + 1;
printf("array length: %u\n", ptr[-1]);
return 0;
My solution to this problem is to save the length of the array into a struct Array as a meta-information about the array.
#include <stdio.h>
#include <stdlib.h>
struct Array
{
int length;
double *array;
};
typedef struct Array Array;
Array* NewArray(int length)
{
/* Allocate the memory for the struct Array */
Array *newArray = (Array*) malloc(sizeof(Array));
/* Insert only non-negative length's*/
newArray->length = (length > 0) ? length : 0;
newArray->array = (double*) malloc(length*sizeof(double));
return newArray;
}
void SetArray(Array *structure,int length,double* array)
{
structure->length = length;
structure->array = array;
}
void PrintArray(Array *structure)
{
if(structure->length > 0)
{
int i;
printf("length: %d\n", structure->length);
for (i = 0; i < structure->length; i++)
printf("%g\n", structure->array[i]);
}
else
printf("Empty Array. Length 0\n");
}
int main()
{
int i;
Array *negativeTest, *days = NewArray(5);
double moreDays[] = {1,2,3,4,5,6,7,8,9,10};
for (i = 0; i < days->length; i++)
days->array[i] = i+1;
PrintArray(days);
SetArray(days,10,moreDays);
PrintArray(days);
negativeTest = NewArray(-5);
PrintArray(negativeTest);
return 0;
}
But you have to care about set the right length of the array you want to store, because the is no way to check this length, like our friends massively explained.
This is how I personally do it in my code. I like to keep it as simple as possible while still able to get values that I need.
typedef struct intArr {
int size;
int* arr;
} intArr_t;
int main() {
intArr_t arr;
arr.size = 6;
arr.arr = (int*)malloc(sizeof(int) * arr.size);
for (size_t i = 0; i < arr.size; i++) {
arr.arr[i] = i * 10;
}
return 0;
}
No, you can't use sizeof(ptr) to find the size of array ptr is pointing to.
Though allocating extra memory(more than the size of array) will be helpful if you want to store the length in extra space.
int main()
{
int days[] = {1,2,3,4,5};
int *ptr = days;
printf("%u\n", sizeof(days));
printf("%u\n", sizeof(ptr));
return 0;
}
Size of days[] is 20 which is no of elements * size of it's data type.
While the size of pointer is 4 no matter what it is pointing to.
Because a pointer points to other element by storing it's address.
In strings there is a '\0' character at the end so the length of the string can be gotten using functions like strlen. The problem with an integer array, for example, is that you can't use any value as an end value so one possible solution is to address the array and use as an end value the NULL pointer.
#include <stdio.h>
/* the following function will produce the warning:
* ‘sizeof’ on array function parameter ‘a’ will
* return size of ‘int *’ [-Wsizeof-array-argument]
*/
void foo( int a[] )
{
printf( "%lu\n", sizeof a );
}
/* so we have to implement something else one possible
* idea is to use the NULL pointer as a control value
* the same way '\0' is used in strings but this way
* the pointer passed to a function should address pointers
* so the actual implementation of an array type will
* be a pointer to pointer
*/
typedef char * type_t; /* line 18 */
typedef type_t ** array_t;
int main( void )
{
array_t initialize( int, ... );
/* initialize an array with four values "foo", "bar", "baz", "foobar"
* if one wants to use integers rather than strings than in the typedef
* declaration at line 18 the char * type should be changed with int
* and in the format used for printing the array values
* at line 45 and 51 "%s" should be changed with "%i"
*/
array_t array = initialize( 4, "foo", "bar", "baz", "foobar" );
int size( array_t );
/* print array size */
printf( "size %i:\n", size( array ));
void aprint( char *, array_t );
/* print array values */
aprint( "%s\n", array ); /* line 45 */
type_t getval( array_t, int );
/* print an indexed value */
int i = 2;
type_t val = getval( array, i );
printf( "%i: %s\n", i, val ); /* line 51 */
void delete( array_t );
/* free some space */
delete( array );
return 0;
}
/* the output of the program should be:
* size 4:
* foo
* bar
* baz
* foobar
* 2: baz
*/
#include <stdarg.h>
#include <stdlib.h>
array_t initialize( int n, ... )
{
/* here we store the array values */
type_t *v = (type_t *) malloc( sizeof( type_t ) * n );
va_list ap;
va_start( ap, n );
int j;
for ( j = 0; j < n; j++ )
v[j] = va_arg( ap, type_t );
va_end( ap );
/* the actual array will hold the addresses of those
* values plus a NULL pointer
*/
array_t a = (array_t) malloc( sizeof( type_t *) * ( n + 1 ));
a[n] = NULL;
for ( j = 0; j < n; j++ )
a[j] = v + j;
return a;
}
int size( array_t a )
{
int n = 0;
while ( *a++ != NULL )
n++;
return n;
}
void aprint( char *fmt, array_t a )
{
while ( *a != NULL )
printf( fmt, **a++ );
}
type_t getval( array_t a, int i )
{
return *a[i];
}
void delete( array_t a )
{
free( *a );
free( a );
}
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <stdlib.h>
#define array(type) struct { size_t size; type elem[0]; }
void *array_new(int esize, int ecnt)
{
size_t *a = (size_t *)malloc(esize*ecnt+sizeof(size_t));
if (a) *a = ecnt;
return a;
}
#define array_new(type, count) array_new(sizeof(type),count)
#define array_delete free
#define array_foreach(type, e, arr) \
for (type *e = (arr)->elem; e < (arr)->size + (arr)->elem; ++e)
int main(int argc, char const *argv[])
{
array(int) *iarr = array_new(int, 10);
array(float) *farr = array_new(float, 10);
array(double) *darr = array_new(double, 10);
array(char) *carr = array_new(char, 11);
for (int i = 0; i < iarr->size; ++i) {
iarr->elem[i] = i;
farr->elem[i] = i*1.0f;
darr->elem[i] = i*1.0;
carr->elem[i] = i+'0';
}
array_foreach(int, e, iarr) {
printf("%d ", *e);
}
array_foreach(float, e, farr) {
printf("%.0f ", *e);
}
array_foreach(double, e, darr) {
printf("%.0lf ", *e);
}
carr->elem[carr->size-1] = '\0';
printf("%s\n", carr->elem);
return 0;
}
#define array_size 10
struct {
int16 size;
int16 array[array_size];
int16 property1[(array_size/16)+1]
int16 property2[(array_size/16)+1]
} array1 = {array_size, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
#undef array_size
array_size is passing to the size variable:
#define array_size 30
struct {
int16 size;
int16 array[array_size];
int16 property1[(array_size/16)+1]
int16 property2[(array_size/16)+1]
} array2 = {array_size};
#undef array_size
Usage is:
void main() {
int16 size = array1.size;
for (int i=0; i!=size; i++) {
array1.array[i] *= 2;
}
}
Most implementations will have a function that tells you the reserved size for objects allocated with malloc() or calloc(), for example GNU has malloc_usable_size()
However, this will return the size of the reversed block, which can be larger than the value given to malloc()/realloc().
There is a popular macro, which you can define for finding number of elements in the array (Microsoft CRT even provides it OOB with name _countof):
#define countof(x) (sizeof(x)/sizeof((x)[0]))
Then you can write:
int my_array[] = { ... some elements ... };
printf("%zu", countof(my_array)); // 'z' is correct type specifier for size_t

How to dynamically fill the structure which is a pointer to pointer of arrays in C++ implementing xfs

Structure 1:
typedef struct _wfs_cdm_cu_info
{
USHORT usTellerID;
USHORT usCount;
LPWFSCDMCASHUNIT * lppList;
} WFSCDMCUINFO, * LPWFSCDMCUINFO;
Structure 2:
typedef struct _wfs_cdm_cashunit
{
USHORT usNumber;
USHORT usType;
LPSTR lpszCashUnitName;
CHAR cUnitID[5];
CHAR cCurrencyID[3];
ULONG ulValues;
ULONG ulInitialCount;
ULONG ulCount;
ULONG ulRejectCount;
ULONG ulMinimum;
ULONG ulMaximum;
BOOL bAppLock;
USHORT usStatus;
USHORT usNumPhysicalCUs;
LPWFSCDMPHCU * lppPhysical;
} WFSCDMCASHUNIT, * LPWFSCDMCASHUNIT;
Structure 3:
typedef struct _wfs_cdm_physicalcu
{
LPSTR lpPhysicalPositionName;
CHAR cUnitID[5];
ULONG ulInitialCount;
ULONG ulCount;
ULONG ulRejectCount;
ULONG ulMaximum;
USHORT usPStatus;
BOOL bHardwareSensor;
} WFSCDMPHCU, * LPWFSCDMPHCU;
The code:
LPWFSCDMCUINFO lpWFSCDMCuinf = NULL;
LPWFSCDMCASHUNIT lpWFSCDMCashUnit = NULL;
LPWFSCDMPHCU lpWFSCDMPhcu = NULL;
int i=0;
try
{
hResult = WFMAllocateBuffer(sizeof(WFSCDMCUINFO),WFS_MEM_ZEROINIT|WFS_MEM_SHARE,(void**)&lpWFSCDMCuinf);
lpWFSCDMCuinf->usCount =7;
lpWFSCDMCuinf->usTellerID = 0;
hResult = WFMAllocateMore(7*sizeof(LPWFSCDMCASHUNIT),lpWFSCDMCuinf,(void**)&lpWFSCDMCuinf->lppList);
for(i=0;i<7;i++)
{
LPWFSCDMCASHUNIT lpWFSCDMCashUnit = NULL;
hResult = WFMAllocateMore(sizeof(WFSCDMCASHUNIT), lpWFSCDMCuinf, (void**)&lpWFSCDMCashUnit);
lpWFSCDMCuinf->lppList[i] = lpWFSCDMCashUnit;//store the pointer
//FILLING CASH UNIT
-----------------------------
lpWFSCDMCashUnit->ulValues =50;
-----------------------------
WFMAllocateMore(1* sizeof(LPWFSCDMPHCU), lpWFSCDMCuinf, (void**)&lpWFSCDMCashUnit->lppPhysical);// Allocate Physical Unit structure
for(int j=0;j<1;j++)
{
LPWFSCDMPHCU lpWFSCDMPhcu = NULL;
hResult = WFMAllocateMore(sizeof(WFSCDMPHCU), lpWFSCDMCuinf, (void**)&lpWFSCDMPhcu);
lpWFSCDMCashUnit->lppPhysical[j] = lpWFSCDMPhcu;
//FILLING Phy CASHUNIT
-------------------------------------------------------
lpWFSCDMPhcu->ulMaximum = 2000;
-----------------------------
}
}
//lpWFSCDMCuinf->lppList=&lpWFSCDMCashUnit;
hResult =WFSExecute (hService,WFS_CMD_CDM_END_EXCHANGE,(LPVOID)&lpWFSCDMCuinf,60000,&lppResult);
return (int)hResult;
I'm getting stuck while I retrieve all the values in structure 1.
I need to dynamically add the values into these structure and display Structure1 as output.An allocation of memory needs to be done for this.I have tried using the above code for allocating the memory but in spite of allocating the values are not properly stored in structure.
The value of usCount changes as per the denomination set. Based on this usNumPhysicalCUs is set.
Also when I send &lpWFSCDMCuinf within the WFSExecutemethod the lppPhysical seems to be empty.
I cant exactly figure out where I'm getting wrong.
First of all your must allocate memory for each block.
For pointers array you will allocate memory to store count of pointers, than for each pointer in allocated memory you must allocate memory for structure itself.
I rewrite your code in more short form. There is no error checking and this code is sample only.
LPWFSCDMCUINFO lpWFSCDMCuinf = NULL;
HRESULT hr = WFMAllocateBuffer(sizeof(WFSCDMCUINFO), WFS_MEM_ZEROINIT|WFS_MEM_SHARE, (void**)&lpWFSCDMCuinf);
// Allocate 7 times of WFSCDMCASHUNIT
const int cuCount = 7;
lpWFSCDMCuinf->usCount = cuCount;
hr = WFMAllocateMore(cuCount * sizeof(LPWFSCDMCASHUNIT), lpWFSCDMCuinf, (void**)&lpWFSCDMCuinf->lppList);
for (int i=0; i < cuCount; i++)
{
// for one entry
LPWFSCDMCASHUNIT currentCU = NULL;
hr = WFMAllocateMore(sizeof(WFSCDMCASHUNIT), lpWFSCDMCuinf, (void**)&currentCU);
// Store pinter
lpWFSCDMCuinf->lppList[i] = currentCU;
// Fill current CU data here
// ....
// Allocate Phisical Unit Pointers
const int phuCount = 1;
currentCU->usNumPhysicalCUs = phuCount;
WFMAllocateMore(phuCount * sizeof(LPWFSCDMPHCU), lpWFSCDMCuinf, (void**)&currentCU->lppPhysical);
// Allocate Phisical Unit structure
for (int j=0; j < phuCount; j++)
{
LPWFSCDMPHCU phuCurrent = NULL;
// Allocate Phisical Unit structure
WFMAllocateMore(sizeof(WFSCDMPHCU), lpWFSCDMCuinf, (void**)&phuCurrent);
currentCU->lppPhysical[j] = phuCurrent;
// Fill Phisical Unit here
// ..
// ..
}
}
In additional to this sample I recommend you to write some helper function to allocate XFS structures like WFSCDMCUINFO. In my own project I've used some macro to serialize XFS structure in memory with WFMAllocate and WFMAllocateMore functions.
XFS structures is so complex and different from class to class. I wrote some macros to serialize and deserialize structures in memory stream and XFS memory buffers. In application I use heap alloc to store XFS structures in memory, but when I need to return structures in another XFS message I need to transfer memory buffers to XFS memory with WFMAllocate and WFMAllocateMore.

Multiple read-write synchronization issues in opencl local and global memories

I have an opencl kernel that finds the maximum ASCII character in a string.
The problem is I cannot synchronize the multiple read-writes to global and local memories.
I am trying to update a local_maximum character in shared memory, and at the end of the workgroup (last thread), the global_maximum character, by comparing it with the local_maximum. The threads are writing one over another, I guess.
eg: Input string: "pirates of the carribean".
Output String: 'r' (but it should be 's').
Please have a look at the code and give a solution as to what I can do to get everything synchronized. I am sure people having sound knowledge can understand the code. Optimization tips are welcome.
The code is below:
__kernel void find_highest_ascii( __global const char* data, __global char* result, unsigned int size, __local char* localMaxC )
{
//creating variables and initialising..
unsigned int i, localSize, globalSize, j;
char privateMaxC,temp,temp1;
i = get_global_id(0);
localSize = get_local_size(0);
globalSize = get_global_size(0);
privateMaxC = '\0';
if(i<size){
if(i == 0)
read_mem_fence( CLK_LOCAL_MEM_FENCE );
*localMaxC = '\0';
mem_fence( CLK_LOCAL_MEM_FENCE);
////////////////////////////////////////////////////
/////UPDATING PRIVATE MAX CHARACTER/////////////////
////////////////////////////////////////////////////
for( j = i; j<size; j+=globalSize )
{
if( data[j] > privateMaxC )
{
privateMaxC = data[j];
}
}
///////////////////////////////////////////////////
///////////////////////////////////////////////////
////UPDATING SHARED MAX CHARACTER//////////////////
///////////////////////////////////////////////////
temp = *localMaxC;
read_mem_fence( CLK_LOCAL_MEM_FENCE );
if(privateMaxC>temp)
{
*localMaxC = privateMaxC;
write_mem_fence( CLK_LOCAL_MEM_FENCE );
temp = privateMaxC;
}
//////////////////////////////////////////////////
//UPDATING GLOBAL MAX CHARACTER.
temp1 = *result;
if(( (i+1)%localSize == 0 || i==size-1) && (temp > temp1 ))
{
read_mem_fence( CLK_GLOBAL_MEM_FENCE );
*result = temp;
write_mem_fence( CLK_GLOBAL_MEM_FENCE );
}
}
}
You are correct that threads will be overwriting each other's values, since your code is riddled with race conditions. In OpenCL, there is no way to synchronise between work-items that are in different work-groups. Instead of trying to achieve this kind of synchronisation with explicit fences, you can make your code much simpler by using the built-in atomic functions instead. In particular, there is an atomic_max built-in which solves your problem perfectly.
So, instead of the code you currently have to update both your local and global memory maximum values, just do something like this:
kernel void ascii_max(global int *input, global int *output, int size,
local int *localMax)
{
int i = get_global_id(0);
int l = get_local_id(0);
// Private reduction
int privateMax = '\0';
for (int idx = i; idx < size; idx+=get_global_size(0))
{
privateMax = max(privateMax, input[idx]);
}
// Local reduction
atomic_max(localMax, privateMax);
barrier(CLK_LOCAL_MEM_FENCE);
// Global reduction
if (l == 0)
{
atomic_max(output, *localMax);
}
}
This will require you to update your local memory scratch space and final result to use 32-bit integer values, but on the whole is a significantly cleaner approach to solving this problem (not to mention it actually works).
NON-ATOMIC SOLUTION
If you really don't want to use atomics, then you can implement a bog-standard reduction using local memory and work-group barriers. Here's an example:
kernel void ascii_max(global int *input, global int *output, int size,
local int *localMax)
{
int i = get_global_id(0);
int l = get_local_id(0);
// Private reduction
int privateMax = '\0';
for (int idx = i; idx < size; idx+=get_global_size(0))
{
privateMax = max(privateMax, input[idx]);
}
// Local reduction
localMax[l] = privateMax;
for (int offset = get_local_size(0)/2; offset > 1; offset>>=1)
{
barrier(CLK_LOCAL_MEM_FENCE);
if (l < offset)
{
localMax[l] = max(localMax[l], localMax[l+offset]);
}
}
// Store work-group result in global memory
if (l == 0)
{
output[get_group_id(0)] = max(localMax[0], localMax[1]);
}
}
This compares pairs of elements at a time using local memory as a scratch space. Each work-group will produce a single result, which is stored in global memory. If your data-set is small, you could run this with a single work-group (i.e. make global and local sizes the same), and this will work just fine. If it is larger, you could run a two-stage reduction by running this kernel twice, e.g.:
size_t N = ...; // something big
size_t local = 128;
size_t global = local*local; // Must result in at most 'local' number of work-groups
// First pass - run many work-groups using temporary buffer as output
clSetKernelArg(kernel, 1, sizeof(cl_mem), d_temp);
clEnqueueNDRangeKernel(..., &global, &local, ...);
// Second pass - run one work-group with temporary buffer as input
global = local;
clSetKernelArg(kernel, 0, sizeof(cl_mem), d_temp);
clSetKernelArg(kernel, 1, sizeof(cl_mem), d_output);
clEnqueueNDRangeKernel(..., &global, &local, ...);
I'll leave it to you to run them and decide which approach would be best for your own data-set.

Resources