OpenCL: multiple work items saving results to the same global memory address

I'm trying to do a reduce-like cumulative calculation where 4 different values need to be stored depending on certain conditions. My kernel receives long arrays as input and needs to store only 4 values, which are "global sums" obtained from each data point of the input. For example, I need to store the sum of all the data values satisfying a certain condition, and the number of data points that satisfy said condition. The kernel is below to make it clearer:
__kernel void photometry(__global float* stamp,
                         __constant float* dark,
                         __global float* output)
{
    int x = get_global_id(0);
    int s = n * n;
    if(x < s){
        float2 curr_px = (float2)((x / n), (x % n));
        float2 center = (float2)(centerX, centerY);
        int dist = (int)fast_distance(center, curr_px);
        if(dist < aperture){
            output[0] += stamp[x]-dark[x];
        }else if (dist > sky_inner && dist < sky_outer){
            output[2] += stamp[x]-dark[x];
        }
    }
}
All the values not declared in the kernel are previously defined by macros. s is the length of the input arrays stamp and dark, which are nxn matrices flattened down to 1D.
I get results, but they differ from my CPU version of this. So I am wondering: is this the right way to do what I'm trying to do? Can I be sure that each pixel's data is added only once? I can't think of any other way to save the cumulative result values.

You need atomic operations in your case; otherwise data races make the results unpredictable.
The problem is here:
output[0] += stamp[x]-dark[x];
Work items in the same wave execute in lockstep, so they all read the same output[0] value with a single global load (a broadcast). When they finish the computation and store to the same address (output[0]), the writes are serialized. Each lane, however, added its contribution to the same old value, so only the last store survives and the other lanes' contributions are lost, even inside a single wave.
The picture gets worse because your program almost certainly launches more than one wave. Different waves execute in an unknown order, and their accesses to the same address interleave unpredictably. For example, wave0 and wave1 may both read output[0] before either has written anything, so they fetch the same value and then start computing. When they store their accumulated results back to output[0], one wave's result overwrites the other's, as if only the wave that wrote last had executed. A real application runs many more waves, so a wrong result is not surprising.
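To make the "atomic operation" advice concrete, here is a minimal host-side C11 sketch of the compare-and-swap loop that is commonly used when no native floating-point atomic add is available. It is not OpenCL code. The names atomic_add_float and load_float are made up for this illustration, and carrying the same pattern over to a kernel (for example with atomic_cmpxchg on the float's bit pattern) is an assumption you would need to check against your OpenCL version.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: atomically performs "*target += value" on a float
   that is stored as its 32 raw bits inside an atomic integer. */
static void atomic_add_float(_Atomic uint32_t *target, float value)
{
    uint32_t old_bits = atomic_load(target);
    for (;;) {
        float old_val, new_val;
        uint32_t new_bits;
        memcpy(&old_val, &old_bits, sizeof old_val);   /* bits -> float */
        new_val = old_val + value;
        memcpy(&new_bits, &new_val, sizeof new_bits);  /* float -> bits */
        /* The store only succeeds if nobody changed *target since we read
           it; on failure old_bits is refreshed and the add is retried. */
        if (atomic_compare_exchange_weak(target, &old_bits, new_bits))
            return;
    }
}

/* Hypothetical helper: reads the accumulated float back. */
static float load_float(_Atomic uint32_t *target)
{
    uint32_t bits = atomic_load(target);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The retry loop is what prevents the lost updates described above: a writer that raced with another one notices the changed value and recomputes its sum instead of overwriting it.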

You can do this in O(log2(n)) parallel steps. The idea:
You have 16 inputs (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16) and you want their sum, computed concurrently.
First, concurrently add 1 into 2, 3 into 4, 5 into 6, 7 into 8, 9 into 10, 11 into 12, 13 into 14, 15 into 16.
Then concurrently add 2 into 4, 6 into 8, 10 into 12, 14 into 16.
Then concurrently add 4 into 8, 12 into 16.
Finally add 8 into 16.
Everything is done in O(log2(n)) steps, in our case 4 passes.
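The passes above can be sketched in plain C. This is a sequential emulation, assuming the element count is a power of two; tree_reduce is a name invented for the example. On a GPU each iteration of the inner loop would be handled by its own work item, with a barrier between passes.

```c
#include <stddef.h>

/* Sums data[0..n-1] in place with a pairwise tree; n is assumed to be a
   power of two. Returns the number of passes; the total ends in data[n-1]. */
static int tree_reduce(float *data, size_t n)
{
    int passes = 0;
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* All additions inside one pass touch disjoint elements, so they
           could run concurrently. */
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            data[i] += data[i - stride];
        passes++;
    }
    return passes;
}
```

With 16 inputs the outer loop runs for strides 1, 2, 4 and 8, which matches the four passes listed above.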


Most elegant way to determine how much one has been bitshifted

So let's say we bitshift 1 by some number x; e.g., in C:
unsigned char cNum= 1, x= 6;
cNum <<= x;
cNum will equal 01000000b (0x40).
Easy peasy. But without using a lookup table or while loop, is there a simple operation that will take cNum and give me x back?
AFAIK, no 'simple' formula is available.
One can, however, calculate the index of the most significant (or least significant) set bit:
a = 000010010, a_left = 4, a_right = 1
b = 001001000, b_left = 6, b_right = 3
The difference of the shifts is 2 (or -2).
One can then shift the smaller value by abs(shift) and check that a << 2 == b. (Some architectures have a shift by a signed amount, which works without taking the absolute value or checking which direction the shift goes.)
ARM has a count-leading-zeros instruction (CLZ, with a NEON vector form), and x86 has bit-scan instructions that search from either end (BSR and BSF).
log2(cNum) will yield x when cNum is a nonzero power of two, at least in GNU C.
The compiler does the casts automagically, which is probably bad form, but it gives me what I need.
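For the single-bit case in the question, one loop-free and table-free option is to test the value against three masks, each of which answers one bit of x. This is only a sketch for an 8-bit value that is known to be a power of two; shift_of is a hypothetical name.

```c
/* Returns x such that c == 1 << x, assuming c has exactly one bit set
   (1, 2, 4, ..., 128). Each mask selects the positions whose index has
   the corresponding bit set: 0xAA -> odd indices, 0xCC -> indices with
   bit 1 set, 0xF0 -> indices 4..7. */
static unsigned shift_of(unsigned char c)
{
    return  (unsigned)((c & 0xAAu) != 0)
         | ((unsigned)((c & 0xCCu) != 0) << 1)
         | ((unsigned)((c & 0xF0u) != 0) << 2);
}
```

For cNum = 0x40 the first mask misses and the other two hit, giving binary 110, i.e. 6.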

Why is this simple OpenCL code not vectorised?

The code below does not vectorise. With 'istart = n * 1;' instead of the 'istart = n * niters;' it does. With 'istart = n * 2;' it again does not.
// Kernel for
__kernel void pi(
    int niters,
    __global float* A_d,
    __global float* S_d,
    __global float* B_d)
{
    int num_wrk_items = get_local_size(0);
    int local_id = get_local_id(0); // work item id
    int group_id = get_group_id(0); // work group id
    float accum = 0.0f;
    int i, istart, iend, n;
    n= group_id * num_wrk_items + local_id;
    istart = n * niters;
    iend = istart + niters;
    for (i= istart; i< iend; i++){
        accum += A_d[i] * S_d[i];
    }
    B_d[n] = accum;
    barrier(CLK_LOCAL_MEM_FENCE); // test: result is correct without this statement
}
If the code cannot be vectorised I get:
Kernel was not vectorized
If it can be:
Kernel was successfully vectorized (8)
Any idea why it is not vectorised?
When niters is 1, the for loop runs only once. Every workitem then computes its own element, and neighboring workitems access neighboring memory cells (coalesced access).
Coalesced access is one of the conditions for mapping N neighboring threads/workitems onto SIMD hardware, for example of width 8.
When niters is greater than 1, neighboring workitems access memory with a stride of niters between them. The SIMD hardware is then useless, because only one memory cell per workitem is used at a time.
When niters is 2, you get at most a 2-fold memory bank collision. With a very large niters the collisions multiply and everything becomes very slow. Whether the kernel is vectorized or not then hardly matters, since performance is bound by the serialized memory read/write latencies.
That for loop performs a reduction serially. You should make it parallel. There are many examples out there; pick one and apply it to your algorithm. For example, have each workitem add the elements at id and id+niters/2, then reduce the results at id and id+niters/4, and continue until a single workitem does the final summation of elements id and id+1.
If the reduction is global, you can do a local reduction per workgroup and then combine the workgroup results the same way in another kernel.
Since you only compute partial sums per workitem, you could instead do a "strided sum per workitem": each workitem runs the same for loop but leaps by M elements, where M is chosen so that it does not disturb the SIMD mapping of the kernel's workitems. M could be, say, 1/100 of the global number of elements (N), so the for loop would run 100 times (N/M times). Something like this:
             time 1     time 2     time 3     time 4
workitem 1   0          15         30         45
workitem 2   1          16         31         46
workitem 3   2          17         32         47
...
workitem 15  14         29         44         59
             coalesced  coalesced  coalesced  coalesced
This completes 15 partial sums for 60 elements using 15 workitems. If the SIMD length can fit these 15 workitems, it is good.
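The access pattern in that table can be emulated sequentially in C. This is only a sketch of the indexing, not a kernel; strided_partial_sums and its parameters are names made up here, and the outer loop stands in for the work items that would run in parallel.

```c
#include <stddef.h>

/* Emulates "strided sum per workitem": work item w adds the elements
   w, w + num_items, w + 2*num_items, ... so that at every time step
   neighboring work items read neighboring elements. */
static void strided_partial_sums(const float *in, size_t n,
                                 float *partial, size_t num_items)
{
    for (size_t w = 0; w < num_items; w++) {   /* one iteration per work item */
        float acc = 0.0f;
        for (size_t i = w; i < n; i += num_items)
            acc += in[i];
        partial[w] = acc;
    }
}
```

With 60 elements and 15 work items, the first work item reads elements 0, 15, 30 and 45 and the last one reads 14, 29, 44 and 59, as in the table. The 15 partial sums still have to be combined afterwards.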
Lastly, the barrier is not needed, since the end of the kernel is an implicit global synchronization point for all its workitems. A barrier is only needed when a workitem has to use results written by another workitem in the same kernel.

Efficient program to check whether a number can be expressed as sum of two cubes

I am trying to write a program to check whether a number N can be expressed as the sum of two cubes i.e. N = a^3 + b^3
This is my code with complexity O(n):
#include <iostream>
#include <cmath> // for cbrtl
#define ll unsigned long long
using namespace std;
int main()
{
    bool flag=false;
    ll t,N;
    for(int i=1; i<=(ll)cbrtl(N/2); i++)
        if(!(cbrtl(N-i*i*i)-(ll)cbrtl(N-i*i*i))) {flag=true; break;}
    if(flag) cout<<"Yes\n"; else cout<<"No\n";
    return 0;
}
The time limit for the code is 2s, and this program gives TLE. Can anyone suggest a faster approach?
I also posted this on StackExchange, so sorry if you consider it a duplicate; I really don't know whether these are the same or different boards (Exchange and Overflow). My profile appears different here.
There is a faster algorithm to check whether a given integer is a sum (or difference) of two cubes, n = a^3 + b^3.
I don't know if this algorithm is already known (probably yes, but I can't find it in books or on the internet). I discovered it and use it to test integers up to n < 10^18.
The process uses a single trick:
4(a^3+b^3)/(a+b) = (a+b)^2 + 3(a-b)^2
We don't know in advance what "a" and "b" are, nor "(a+b)", but we do know that "(a+b)" must divide (a^3+b^3). So if you have a fast prime factorization routine, you can quickly compute every divisor of (a^3+b^3) and then check whether
(4(a^3+b^3)/divisor - divisor^2)/3 = square
When (and if) you find a square, you have divisor = (a+b) and sqrt(square) = (a-b), so you have a and b.
If no square is found, the number is not a sum of two cubes.
We know that divisor <= (4(a^3+b^3))^(1/3), and this limit speeds up the task: while assembling the divisors of (a^3+b^3), immediately discard those greater than the limit.
Now some comparisons with other algorithms. For n = 10^18, brute force has to test all numbers below 10^6 to know the answer. On the other hand, to build all divisors of 10^18 you need the primes up to 10^9.
The maximum number of different primes you could fit into 10^9 is 10 (2*3*5*7*11*13*17*19*23*29 is about 6.5*10^9), so in the worst case there are 2^10-1 different combinations of primes (which assemble the divisors) to check, many of them discarded because of the limit.
To compute prime factors I use a table of the first 60,000,000 primes, which works very well in this range.
Miguel Velilla
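A small C sketch of the divisor test described in this answer follows. It is not the author's program: to stay short it finds the divisors by plain trial division up to the cube-root limit instead of using a prime table, so it only illustrates the identity and gains nothing over brute force. It assumes 1 <= n <= 10^18 and non-negative a and b; isqrt64 and two_cubes_by_divisors are names invented here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Integer square root, built bit by bit (no floating point). */
static uint64_t isqrt64(uint64_t v)
{
    uint64_t r = 0;
    for (uint64_t bit = (uint64_t)1 << 31; bit != 0; bit >>= 1) {
        uint64_t t = r | bit;
        if (t * t <= v)
            r = t;
    }
    return r;
}

/* Tries every divisor d of n with d^3 <= 4n as a candidate for (a+b) and
   checks whether (4n/d - d^2)/3 is a perfect square, i.e. (a-b)^2. */
static bool two_cubes_by_divisors(uint64_t n, uint64_t *a, uint64_t *b)
{
    for (uint64_t d = 1; d * d * d <= 4 * n; d++) {
        if (n % d != 0)
            continue;
        uint64_t q = 4 * (n / d);          /* 4n/d */
        if (q < d * d)
            continue;
        uint64_t r = q - d * d;            /* should equal 3(a-b)^2 */
        if (r % 3 != 0)
            continue;
        uint64_t s = isqrt64(r / 3);       /* candidate for (a-b) */
        if (s * s != r / 3 || s > d)
            continue;
        *a = (d + s) / 2;                  /* d and s have the same parity here */
        *b = (d - s) / 2;
        return true;
    }
    return false;
}
```

For n = 1729 the divisor 13 gives (4*133 - 169)/3 = 121 = 11^2, so a = 12 and b = 1.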
To find all pairs of integers x and y whose cubes sum to n, set x to the largest integer not exceeding the cube root of n and set y to 0. Then repeat: add 1 to y if the sum of the cubes is less than n, subtract 1 from x if it is greater than n, and otherwise output the pair (and then move on, for example by adding 1 to y); stop when x and y cross. If you only want to know whether such a pair exists, you can stop as soon as you find one.
Let us know if you have trouble coding this algorithm.
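A minimal C sketch of this two-pointer search, for the existence-only variant. It follows the description literally, so y starts at 0 and a perfect cube counts as x^3 + 0^3; start y at 1 if both cubes must be positive. It assumes n <= 10^18 so that the cubes fit in 64 bits, and the function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t cube(uint64_t v) { return v * v * v; }

/* Returns true if n == x^3 + y^3 for some integers x >= y >= 0. */
static bool is_sum_of_two_cubes(uint64_t n)
{
    uint64_t x = 0, y = 0;
    while (cube(x + 1) <= n)     /* x = floor(cbrt(n)) without floating point */
        x++;
    while (y <= x) {             /* stop when x and y cross */
        uint64_t s = cube(x) + cube(y);
        if (s == n)
            return true;
        if (s < n)
            y++;                 /* sum too small: grow the smaller term */
        else
            x--;                 /* sum too large: shrink the larger term */
    }
    return false;
}
```

Each step moves one of the two pointers, so the search takes on the order of the cube root of n steps, with only integer arithmetic inside the loop.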

Do global_work_size and local_work_size have any effect on application logic?

I am trying to understand how all of the different parameters for dimensions fit together in OpenCL. If my question isn't clear that's partly because a well formed question requires bits of the answer which I don't have.
How do work_dim, global_work_size, and local_work_size work together to create the execution space that you use in a kernel? For example, if I make work_dim 2 then I can
I can divide those two dimensions up into n Work Groups using global_work_size, right? So if I make the global_work_size like so
size_t global_work_size[] = { 4 };
Then each dimension would have 4 work groups for a total of 8? But, as a beginner, I am only using global_id for my indices so only the global id's matter anyway. As you can tell I am pretty confused about all of this so any help you can offer would
Since you said yourself that you are a bit confused about the concepts involved in the execution space, I'll try to summarize them and give some examples before answering your question.
The threads/workitems are organized in an NDRange, which can be viewed as a grid of 1, 2, or 3 dimensions.
The NDRange is mainly used to map each thread to the piece of data it has to manipulate. Each thread therefore has to be uniquely identified, and a thread should know which one it is and where it stands in the NDRange. This is where the Work-Item Built-in Functions come in. Any thread can call these functions to get information about itself and about the NDRange it belongs to.
The dimensions:
As already stated, an NDRange can have up to 3 dimensions. So if you set the dimensions this way:
size_t global_work_size[2] = { 4, 4 };
It doesn't mean that each dimension would have 4 work groups for a total of 8, but that you'll have 4 * 4 i.e. 16 threads in your NDRange. These threads will be arranged in a "square" with sides of 4 units. The workitems can know how many dimensions the NDRange is made of, using the uint get_work_dim () function.
The global size:
Threads can also query how big is the NDRange for a specific dimension with size_t get_global_size (uint D). Therefore they can know how big is the "line/square/rectangle/cube" NDRange.
The global unique identifiers:
Thanks to that organization, each thread can be uniquely identified with indexes corresponding to the specific dimensions. Hence the thread (2, 1) refers to a thread that is in the 3rd column and the second row of a 2D range. The function size_t get_global_id (uint D) is used in the kernel to query the id of the threads.
The workgroup (or local) size:
The NDRange can be split into smaller groups called workgroups. This is the local_work_size you were referring to, which also (logically) has up to 3 dimensions. Note that for OpenCL versions below 2.0, the NDRange size in a given dimension must be a multiple of the workgroup size in that dimension. To keep your example: since we have 4 threads in dimension 0, the workgroup size in dimension 0 can be 1, 2, or 4, but not 3. As with the global size, threads can query the local size with size_t get_local_size (uint D).
The local unique identifiers:
Sometimes it is important that a thread can be uniquely identified within a workgroup. Hence the function size_t get_local_id (uint D). Note the "within" in the previous sentence: a thread with the local id (1, 0) is the only one with this id in its (2D) workgroup, but there are as many threads with the local id (1, 0) as there are workgroups in the NDRange.
The number of groups:
Speaking of groups, sometimes a thread needs to know how many groups there are. That's why the function size_t get_num_groups (uint D) exists. Note that again you have to pass the dimension you are interested in as a parameter.
Each group has also an id:
...that you can query within a kernel with the function size_t get_group_id (uint D). Note that the format of the group ids will be similar to those of the threads: tuples of up to 3 elements.
To wrap things up a bit, if you have a 2D NDRange of a global work size of (4, 6) and a local work size of (2, 2) it means that:
the global size in the dimension 0 will be 4
the global size in the dimension 1 will be 6
the local size (or workgroup size) in the dimension 0 will be 2
the local size (or workgroup size) in the dimension 1 will be 2
the thread global ids in the dimension 0 will range from 0 to 3
the thread global ids in the dimension 1 will range from 0 to 5
the thread local ids in the dimension 0 will range from 0 to 1
the thread local ids in the dimension 1 will range from 0 to 1
The total number of threads in the NDRange will be 4 * 6 = 24
The total number of threads in a workgroup will be 2 * 2 = 4
The total number of workgroups will be (4/2) * (6/2) = 6
the group ids in the dimension 0 will range from 0 to 1
the group ids in the dimension 1 will range from 0 to 2
There will be only one thread with the global id (0, 0), but there will be 6 threads with the local id (0, 0), because there are 6 groups.
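The numbers in this list follow from simple integer arithmetic per dimension. The C sketch below mirrors that arithmetic on the host, assuming a zero global offset and a global size that is a multiple of the local size; item_pos, num_groups and locate are names made up for the illustration, not OpenCL functions.

```c
#include <stddef.h>

/* Position of one work item along a single dimension. */
struct item_pos {
    size_t group_id;   /* what get_group_id(D) would return */
    size_t local_id;   /* what get_local_id(D) would return */
};

/* Number of work groups along one dimension. */
static size_t num_groups(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}

/* Splits a global id into group id and local id, so that
   global_id == group_id * local_size + local_id. */
static struct item_pos locate(size_t global_id, size_t local_size)
{
    struct item_pos p = { global_id / local_size, global_id % local_size };
    return p;
}
```

For the (4, 6) range with (2, 2) workgroups this gives 2 * 3 = 6 groups, and the thread with global id (3, 5) sits in group (1, 2) with local id (1, 1).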
Here is a dummy example that uses all these concepts together (note that performance would be terrible; it's just a toy example).
Let's say you have a 2D array of int with 6 rows and 4 columns. You want to group these elements into squares of 2 by 2 elements and sum each square, so that, for instance, the elements (0, 0), (0, 1), (1, 0), (1, 1) end up in one group (hope it's clear enough). Because you'll have 6 "squares", you'll have 6 sums, so you'll need an array of 6 elements to store the results.
To solve this, you use the 2D NDRange detailed just above. Each thread fetches one element from global memory and stores it in local memory. Then, after a synchronization, only one thread per workgroup, let's say the one with local id (0, 0), sums the elements up (in local memory) and stores the result at a specific place in a 6-element array (in global memory).
//in is a 24 int array, result is a 6 int array, temp is a 4 int array
kernel void foo(global int *in, global int *result, local int *temp){
    //use vectors for conciseness
    int2 globalId = (int2)(get_global_id(0), get_global_id(1));
    int2 localId = (int2)(get_local_id(0), get_local_id(1));
    int2 groupId = (int2)(get_group_id (0), get_group_id (1));
    int2 globalSize = (int2)(get_global_size(0), get_global_size(1));
    int2 localSize = (int2)(get_local_size(0), get_local_size(1));
    int2 numberOfGrp = (int2)(get_num_groups (0), get_num_groups (1));
    //Read from global and store to local
    temp[localId.x + localId.y * localSize.x] = in[globalId.x + globalId.y * globalSize.x];
    //Synchronize so that all the local stores are visible to the summing thread
    barrier(CLK_LOCAL_MEM_FENCE);
    //Only the threads with local id (0, 0) sum elements up
    if(localId.x == 0 && localId.y == 0){
        int sum = 0;
        for(int i = 0; i < localSize.x * localSize.y ; i++){
            sum += temp[i];
        }
        //store result in global
        result[groupId.x + numberOfGrp.x * groupId.y] = sum;
    }
}
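To check what such a kernel should produce, the same indexing can be emulated sequentially on the host. The C sketch below is only a reference for the expected sums, under the assumption that the width and height are multiples of the tile size; sum_tiles is a hypothetical name.

```c
#include <stddef.h>

/* Sums a width x height int array in tiles of lw x lh elements. The loops
   over (gx, gy) play the role of the work groups, the loops over (lx, ly)
   the role of the work items inside one group. */
static void sum_tiles(const int *in, size_t width, size_t height,
                      size_t lw, size_t lh, int *result)
{
    size_t groups_x = width / lw;
    size_t groups_y = height / lh;
    for (size_t gy = 0; gy < groups_y; gy++) {
        for (size_t gx = 0; gx < groups_x; gx++) {
            int sum = 0;
            for (size_t ly = 0; ly < lh; ly++) {
                for (size_t lx = 0; lx < lw; lx++) {
                    size_t x = gx * lw + lx;        /* global id, dimension 0 */
                    size_t y = gy * lh + ly;        /* global id, dimension 1 */
                    sum += in[x + y * width];
                }
            }
            result[gx + groups_x * gy] = sum;       /* same layout as the kernel */
        }
    }
}
```

Filling the 24-element input with 0..23 and calling sum_tiles(in, 4, 6, 2, 2, result) puts 0 + 1 + 4 + 5 = 10 into result[0].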
And finally, to answer your question: do global_work_size and local_work_size have any effect on application logic?
Usually yes, because they are part of the way you design your algorithm. Note that the size of the workgroup is not chosen at random but matches my need (here a 2 by 2 square).
Note also that if you decide to use a 1-dimensional NDRange with a size of 24 and a local size of 4, it will break things, because the kernel was designed to use 2 dimensions.

How to efficiently convert a few bytes into an integer between a range?

I'm writing something that reads bytes (just a List<int>) from a remote random number generation source that is extremely slow. For that and my personal requirements, I want to retrieve as few bytes from the source as possible.
Now I am trying to implement a method which signature looks like:
int getRandomInteger(int min, int max)
I have two theories how I can fetch bytes from my random source, and convert them to an integer.
Approach #1 is naive. Fetch (max - min) / 256 bytes and add them up. It works, but it fetches a lot of bytes from my slow random source. For example, for a random integer between zero and a million it fetches almost 4000 bytes, which is unacceptable.
Approach #2 sounds ideal to me, but I'm unable to come up with the algorithm. It goes like this:
Let's take min: 0, max: 1000 as an example.
Calculate ceil(rangeSize / 256) which in this case is ceil(1000 / 256) = 4. Now fetch one (1) byte from the source.
Scale this one byte from the 0-255 range to 0-3 range (or 1-4) and let it determine which group we use. E.g. if the byte was 250, we would choose the 4th group (which represents the last 250 numbers, 750-1000 in our range).
Now fetch another byte and scale from 0-255 to 0-250 and let that determine the position within the group we have. So if this second byte is e.g. 120, then our final integer is 750 + 120 = 870.
In that scenario we only needed to fetch 2 bytes in total. However, it gets much more complex: if our range is 0-1000000, we need several "groups".
How do I implement something like this? I'm okay with Java/C#/JavaScript code or pseudo code.
I'd also like the result not to lose entropy/randomness, so I'm slightly worried about scaling integers.
Unfortunately your Approach #1 is broken. For example, if min is 0 and max is 510, you'd add 2 bytes. There is only one way to get a result of 0: both bytes zero. The chance of this is (1/256)^2. However, there are many ways to get other values, say 100 = 100+0, 99+1, 98+2, ... So the chance of getting 100 is much larger: 101 * (1/256)^2.
The more-or-less standard way to do what you want is to:
Let R = max - min + 1 -- the number of possible random output values
Let N = 2^k >= mR, m>=1 -- a power of 2 at least as big as some multiple of R that you choose.
do
    b = a random integer in 0..N-1 formed from k random bits
while b >= mR -- reject b values that would bias the output
return min + floor(b/m)
This is called the method of rejection. It throws away randomly selected binary numbers that would bias the output. If max-min+1 happens to be a power of 2, then you'll have zero rejections.
If you have m=1 and max-min+1 is just one more than a biggish power of 2, then close to half of the draws will be rejected. In this case you'd definitely want a bigger m.
In general, bigger m values lead to fewer rejections, but of course they require slightly more bits per number. There is a probabilistically optimal algorithm to pick m.
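As an illustration of the rejection method with m = 1, here is a hedged C sketch. The byte source is passed in as a callback; byte_source, byte_script and scripted_byte are hypothetical stand-ins for the slow generator, and the scripted source exists only so that the example is deterministic. The sketch assumes min <= max and that the range does not cover all 2^32 values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface to the random source: returns one random byte. */
typedef uint8_t (*byte_source)(void *state);

/* Deterministic stand-in for the source: replays a fixed byte sequence. */
struct byte_script { const uint8_t *bytes; size_t pos; };

static uint8_t scripted_byte(void *state)
{
    struct byte_script *s = state;
    return s->bytes[s->pos++];
}

/* Rejection sampling with m = 1: draw k bits, retry while the value >= R. */
static uint32_t random_in_range(uint32_t min, uint32_t max,
                                byte_source next, void *state)
{
    uint32_t range = max - min + 1;                 /* R */
    assert(min <= max && range != 0);
    unsigned k = 0;                                 /* smallest k with 2^k >= R */
    while (k < 32 && ((uint32_t)1 << k) < range)
        k++;
    uint32_t mask = (k == 32) ? UINT32_MAX : (((uint32_t)1 << k) - 1);
    unsigned nbytes = (k + 7) / 8;                  /* whole bytes needed for k bits */
    for (;;) {
        uint32_t b = 0;
        for (unsigned i = 0; i < nbytes; i++)
            b = (b << 8) | next(state);
        b &= mask;                                  /* keep only k random bits */
        if (b < range)                              /* otherwise reject and redraw */
            return min + b;
    }
}
```

For the range 0..1000 this uses 10 bits, i.e. 2 bytes per attempt, and rejects the 23 values 1001..1023, so the large majority of attempts succeed on the first draw.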
Some of the other solutions presented here have problems, but I'm sorry right now I don't have time to comment. Maybe in a couple of days if there is interest.
3 bytes (together) give you a random integer in the range 0..16777215. You can use 20 bits of this value to get the range 0..1048575 and throw away values > 1000000.
range 1 to r
256^a >= r
first find 'a'
get 'a' number of bytes into array A[]
for i=0 to len(A)-1
random number = num mod range
Your random source gives you 8 random bits per call. For an integer in the range [min,max] you would need ceil(log2(max-min+1)) bits.
Assume that you can get random bytes from the source using some function:
bool RandomBuf(BYTE* pBuf , size_t nLen); // fill buffer with nLen random bytes
Now you can use the following function to generate a random value in a given range:
// --------------------------------------------------------------------------
// produce a uniformly-distributed integral value in range [nMin, nMax]
template <class T> T RandU(T nMin, T nMax)
{
    static_assert(std::numeric_limits<T>::is_integer, "RandU: integral type expected");
    if (nMin>nMax)
        std::swap(nMin, nMax);
    if (0 == (T)(nMax-nMin+1)) // all range of type T
    {
        T nR;
        return RandomBuf((BYTE*)&nR, sizeof(T)) ? *(T*)&nR : nMin;
    }
    ULONGLONG nRange = (ULONGLONG)nMax-(ULONGLONG)nMin+1 ; // number of discrete values
    UINT nRangeBits= (UINT)ceil(log((double)nRange) / log(2.)); // bits for storing nRange discrete values
    ULONGLONG nR;
    do
    {
        if (!RandomBuf((BYTE*)&nR, sizeof(nR)))
            return nMin;
        nR= nR>>((sizeof(nR)<<3) - nRangeBits); // keep nRangeBits random bits
    }
    while (nR >= nRange); // ensure value in range [0..nRange-1]
    return nMin + (T)nR; // [nMin..nMax]
}
Since you always get a multiple of 8 bits, you can save the extra bits between calls (for example, you may need only 9 bits out of 16). It requires some bit manipulation, and it is up to you to decide whether it is worth the effort.
You can save even more if you use 'half bits'. Let's assume that you want to generate numbers in the range [1..5]. You need log2(5) = 2.32 bits for each random value. Using 32 random bits you can actually generate floor(32/2.32) = 13 random values in this range, though it requires some additional effort.
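One way to realize the 'half bits' idea for the range [1..5] is to treat the 32 random bits as a number and read off its 13 base-5 digits, rejecting the top part of the 32-bit range so that the digits stay unbiased. The C sketch below is my own illustration, not code from the answer; unpack_1_to_5 is a hypothetical name. It accepts every value below 3 * 5^13, which covers three complete copies of the range 0..5^13-1, and reduces it modulo 5^13.

```c
#include <stdint.h>

#define DIGITS  13
#define POW5_13 1220703125u            /* 5^13 */

/* Turns 32 random bits into 13 values in [1..5]. Returns 1 on success and
   0 if the bits must be rejected (and new bits drawn) to avoid bias. */
static int unpack_1_to_5(uint32_t bits, int out[DIGITS])
{
    if (bits >= 3u * POW5_13)          /* 3 * 5^13 = 3662109375 < 2^32 */
        return 0;
    uint32_t v = bits % POW5_13;       /* uniform in 0 .. 5^13 - 1 */
    for (int i = 0; i < DIGITS; i++) {
        out[i] = (int)(v % 5u) + 1;    /* next base-5 digit, shifted to 1..5 */
        v /= 5u;
    }
    return 1;
}
```

About 85% of all 32-bit inputs are accepted (3662109375 out of 4294967296), so on average slightly more than 4 bytes are spent per batch of 13 values.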
