Get global thread ID in 2 dimensions in OpenCL

How can I get the global thread ID in 2 dimensions in OpenCL?
I know that for 1 dimension, the formula is:
int global_id = get_global_id(1) * get_global_size(0) + get_global_id(0);
But if I allocate like this:
size_t block_size[] = {2,2};
size_t grid_size[] = {35,20};
The above formula fails, giving indexes only from 0 to 35*20.
The indexes should go from 0 to 35*40*2*2.
Can you recommend any good documentation or writings that could give me the intuition to understand how all of this works?
Thanks!

If you're launching a 2D NDRange, then get_global_id(0) and get_global_id(1) will give you the Gx and Gy indices. You can also independently fetch the local ids using get_local_id(0/1).
There's no need to calculate it yourself.
Did you mean that you're launching a 2D thread block but want to map that thread to a position in a 1 dimensional buffer?
EDIT: After reading your comment, I thought an explanation is in order.
OpenCL launches as many kernel instances (work items) as get_global_size(0) * get_global_size(1) (which is 35 * 20), so you will have threads
(0 ,0) (0 ,1) ... (0,34)
(1 ,0) (1 ,1) ... (1,34)
.
.
.
(19,0) (19,1) ... (19,34)
Local worksize is simply a way of splitting up the total number of threads and distributing them across the compute units available. It is quite possible that only 2 * 2 = 4 threads are running at any point of time.
The clEnqueueNDRangeKernel documentation tells us that local_work_size can be null, in which case the implementation will determine the size to break up the total amount of work.
In no way does local work size increase the number of threads.
Perhaps this image explains it better than I can.
Note that the total number of kernel instances launched is still get_global_size(0) * get_global_size(1).
If you want your 1D indices to go from 0..(35*40*2*2 - 1), then launch the kernel so that get_global_size(0) * get_global_size(1) is 35*40*2*2 (perhaps 70 x 80?).
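If you do need a flat index, here is a minimal sketch (a hypothetical kernel, not from the question) of a 2D NDRange kernel writing into a 1D buffer:
__kernel void fill_linear(__global int* out)
{
    int gx = get_global_id(0);
    int gy = get_global_id(1);
    // Covers 0 .. get_global_size(0) * get_global_size(1) - 1
    int linear = gy * get_global_size(0) + gx;
    out[linear] = linear;
}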
Hope this helps.

Related

clEnqueueNDRangeKernel return CL_INVALID_WORK_GROUP_SIZE

I have 3 OpenCL devices on my MacBook Pro, so I am trying a slightly more complicated calculation with a small example.
I create a context containing the 3 devices, two GPUs and one CPU, then create 3 command queues, one for each of them.
Then I create a big global buffer, big but not bigger than the smallest maximum allocation among the devices. Then I create 3 sub-buffers from the input buffer, with carefully calculated sizes. A smaller output buffer is also created, with 3 small sub-buffers on it.
After setting up the kernel, setting arguments and so on, everything looks good. The first two devices accept the kernel and start to run, but the third one refuses it and returns CL_INVALID_WORK_GROUP_SIZE.
I don't want to put any source code here as there is nothing special in it and I am sure there is no bug.
I logged the following:
command queue 0
device: Iris Pro max work group size 512
local work size(32 * 16) = 512
global work size(160 * 48) = 7680
number of work groups = 15
command queue 1
device: GeForce GT 750M max work group size 1024
local work size(32 * 32) = 1024
global work size(160 * 96) = 15360
number of work groups = 15
command queue 2
device: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz max work group size 1024
local work size(32 * 32) = 1024
global work size(160 * 96) = 15360
number of work groups = 15
I checked the first two output are correct as expected, so the kernel and host code must be correct.
There is only one possibility I can think of, is there any limit when using CPU and GPU at the same time and share one buffer object?
Thanks in advance.
OK, I figured out the problem. The CPU supports a max work item size of (1024, 1, 1), so the local work size cannot be (32 x 32).
But I still have a problem when using a local work size bigger than (1, 1). Still trying.
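A minimal host-side sketch (variable names assumed, untested) of checking those limits before calling clEnqueueNDRangeKernel, so the mismatch is caught per device:
// Per-dimension work item limits of the device.
size_t maxItemSizes[3];
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                sizeof(maxItemSizes), maxItemSizes, NULL);

// Per-kernel work group limit on this device.
size_t maxKernelWorkGroupSize = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(maxKernelWorkGroupSize), &maxKernelWorkGroupSize, NULL);

// local_work_size[i] must not exceed maxItemSizes[i], and the product of the
// local_work_size entries must not exceed maxKernelWorkGroupSize.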
From Intel's OpenCL guide:
https://software.intel.com/en-us/node/540486
Querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE always returns 1, even with a very simple kernel without a barrier. In that case, the work group size can be 128 (it's a 1D work group), but cannot be 256.
Conclusion: better not to rely on it in some cases :(

Method to do final sum with reduction

This is a continuation of my first issue, explained at this link.
As a reminder, I would like to apply a method which is able to do multiple sum reductions with OpenCL (my GPU device only supports OpenCL 1.2). I need to compute the sum reduction of an array to check the convergence criterion for each iteration of the main loop.
Currently, I have a version for only one sum reduction (i.e. one iteration). In this version, for simplicity, I used a sequential CPU loop to compute the sum of the partial sums and get the final value.
Following the advice on my previous question, my issue is that I don't know how to perform the final sum by calling the NDRangeKernel function a second time (i.e. executing the kernel code a second time).
Indeed, with a second call, I will always face the same problem of getting the sum of the partial sums (themselves computed by the first call of NDRangeKernel): it seems to be a recursive issue.
Let's take an example from the above figure: if the input array size is 10240000 and the WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.
So after the first call, I get 640000 partial sums: how do I deal with the final summation of all these partial sums? If I call the kernel code again with, for example, WorkGroup size = 16 and global size = 640000, I will get nWorkGroups = 640000/16 = 40000 partial sums, so I would have to call the kernel code one more time and repeat this process until nWorkGroups < WorkGroup size.
Maybe I didn't understand the second stage very well, mostly this part of the kernel code from the "two-stage reduction" (at this link; I think this is the case of searching for the minimum of the input array):
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {
    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    // Loop sequentially over chunks of input vector
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }
    // Perform parallel reduction
    ...
Could someone explain what this code snippet does?
Is there a relation with the second stage of the reduction, i.e. the final summation?
Feel free to ask me for more details if I have not explained my issue clearly.
Thanks
As mentioned in the comment: The statement
if input array size is 10240000 and WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.
is not correct. You can choose an "arbitrary" work group size, and an "arbitrary" number of work groups. The numbers to choose here may be tailored for the target device. For example, the device may have a certain local memory size. This can be queried with clGetDeviceInfo:
cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(cl_ulong), &localMemSize, nullptr);
This may be used to compute the size of a local work group, considering the fact that each work group will require
sizeof(cl_float) * workGroupSize
bytes of local memory.
Similarly, the number of work groups may be derived from other device specific parameters.
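For instance, a rough sketch (desired size assumed) of capping the work group size by the local memory queried above:
// Each work item needs sizeof(cl_float) bytes of scratch space here.
size_t maxItemsByLocalMem = (size_t)(localMemSize / sizeof(cl_float));
size_t workGroupSize = 256; // desired work group size, assumed
if (workGroupSize > maxItemsByLocalMem) {
    workGroupSize = maxItemsByLocalMem;
}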
The key point regarding the reduction itself is that the work group size does not limit the size of the array that can be processed. I also had some difficulties with understanding the algorithm as a whole, so I tried to explain it here, hoping that a few images may be worth a thousand words:
As you can see, the number of work groups and the work group size are fixed and independent of the input array length: Even though I'm using 3 work groups with a size of 8 in the example (giving a global size of 24), an array of length 64 can be processed. This is mainly due to the first loop, which just walks through the input array, with a "step size" that is equal to the global work size (24 here). The result will be one accumulated value for each of the 24 threads. These are then reduced in parallel.
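To make the two stages concrete, here is a sketch of a sum version of that kernel (adapted from the min snippet quoted in the question; parameter names assumed, untested):
__kernel void reduce_sum(__global const float* buffer,
                         __local float* scratch,
                         const int length,
                         __global float* result)
{
    int global_index = get_global_id(0);
    float accumulator = 0.0f;

    // Stage 1: each work item strides over the input with step get_global_size(0),
    // so an input of any length can be handled by a fixed global size.
    while (global_index < length) {
        accumulator += buffer[global_index];
        global_index += get_global_size(0);
    }

    // Stage 2: parallel reduction of the per-work-item accumulators in local memory.
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
        if (local_index < offset) {
            scratch[local_index] += scratch[local_index + offset];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One partial sum per work group.
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}
Each launch then produces exactly one partial sum per work group, and because the number of work groups is fixed and small, the remaining partial sums can be added on the host or with one more small launch, with no open-ended recursion.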

Inaccurate results with OpenCL Reduction example

I am working with the OpenCL reduction example provided by Apple here
After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably in C++ (openFrameworks) and finds the largest number in the input set.
However, in doing so, a few questions have arisen as follows:
Why are multiple passes used? The most I have been able to make the reduction require is two; the latter pass only takes a very low number of elements and so is not well suited to an OpenCL process (i.e. wouldn't it be better to stick to a single pass and then process the results of that on the CPU?).
When I set the 'count' number of elements to a very high number (24M and up) and the type to float4, I get inaccurate (or totally wrong) results. Why is this?
In the OpenCL kernels, can anyone explain what is being done here:
while (i < n) {
    int a = LOAD_GLOBAL_I1(input, i);
    int b = LOAD_GLOBAL_I1(input, i + group_size);
    int s = LOAD_LOCAL_I1(shared, local_id);
    STORE_LOCAL_I1(shared, local_id, (a + b + s));
    i += local_stride;
}
as opposed to what is being done here?
#define ACCUM_LOCAL_I1(s, i, j) \
{ \
int x = ((__local int*)(s))[(size_t)(i)]; \
int y = ((__local int*)(s))[(size_t)(j)]; \
((__local int*)(s))[(size_t)(i)] = (x + y); \
}
Thanks!
S
To answer the first 2 questions:
why are multiple passes used?
Reducing millions of elements to a few thousand can be done in parallel with a device utilization of almost 100%. But the final step is quite tricky. So, instead of doing everything in one shot and leaving multiple threads idle, the Apple implementation does a first-pass reduction, then adapts the work items to the new, smaller reduction problem, and finally completes it.
It is a very specific optimization for OpenCL, but it may not be appropriate for C++.
when I set the 'count' number of elements to a very high number (24M
and up) and the type to a float4, I get inaccurate (or totally wrong)
results. Why is this?
A 32-bit float has 23 bits of mantissa precision. Values higher than 24M = 1.43 x 2^24 (in float representation) have a representation error in the range +/-(2^24/2^23)/2 ~= 1.
That means, if you do:
float A=24000000;
float B= A + 1; //~1 error here
The rounding error of the operation is on the order of the value being added, therefore... big errors if you repeat that in a loop!
This will not happen on 64-bit CPUs, because the 32-bit float math there internally uses 48 bits of precision, therefore avoiding these errors. However, if you get the float close to 2^48 the errors will happen as well. But that is not the typical case for normal "counting" integers.
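A minimal C demonstration of that effect (my own snippet, not from the answer):
#include <stdio.h>

int main(void)
{
    // Near 2^24 the spacing between adjacent 32-bit floats is 2,
    // so adding 1.0f is lost to rounding.
    float a = 16777216.0f; // 2^24
    float b = a + 1.0f;
    printf("%.1f\n", b - a); // prints 0.0, not 1.0
    return 0;
}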
The problem is with the precision of 32 bit floats. You're not the first person to ask about this either. OpenCL reduction result wrong with large floats

OpenCL - Are work-group axes exchangeable?

I was trying to find the best work-group size for a problem and I figured out something that I couldn't justify for myself.
These are my results:
GlobalWorkSize {6400 6400 1}, WorkGroupSize {64 4 1}, Time(Milliseconds) = 44.18
GlobalWorkSize {6400 6400 1}, WorkGroupSize {4 64 1}, Time(Milliseconds) = 24.39
Swapping the axes made execution twice as fast. Why?!
By the way, I was using an AMD GPU.
Thanks :-)
EDIT:
This is the kernel (a Simple Matrix Transposition):
__kernel void transpose(__global float *input, __global float *output, const int size){
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*size + j] = input[j*size + i];
}
I agree with @Thomas, it most probably depends on your kernel. Most probably, in the second case you access memory in a coalesced way and/or make full use of each memory transaction.
Coalescence: When threads need to access elements in memory, the hardware tries to serve them in as few transactions as possible, i.e. if thread 0 and thread 1 have to access contiguous elements, there will be only one transaction.
Full use of a memory transaction: Let's say you have a GPU that fetches 32 bytes in one transaction. If you have 4 threads that each need to fetch one int, you are using only half of the data fetched by the transaction; you waste the rest (assuming an int is 4 bytes).
To illustrate this, let's say that you have a n by n matrix to access. Your matrix is in row major, and you use n threads organized in one dimension. You have two possibilities:
Each workitem takes care of one column, looping through each column element one at a time.
Each workitem takes care of one line, looping through each line element one at a time.
It might be counter-intuitive, but the first solution will make coalesced accesses while the second won't. The reason is that when the first workitem accesses the first element of the first column, the second workitem accesses the first element of the second column, and so on. These elements are contiguous in memory. This is not the case for the second solution.
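As an illustration (hypothetical kernels, not from the question), for a row-major n x n matrix with one work item per column vs. one per row:
// Solution 1: one work item per column; in each loop iteration neighbouring
// work items touch adjacent addresses, so the reads coalesce.
__kernel void sum_columns(__global const float* m, __global float* out, const int n)
{
    int col = get_global_id(0);
    float acc = 0.0f;
    for (int row = 0; row < n; ++row)
        acc += m[row * n + col];
    out[col] = acc;
}

// Solution 2: one work item per row; neighbouring work items are n floats
// apart in memory, so each read ends up in its own transaction.
__kernel void sum_rows(__global const float* m, __global float* out, const int n)
{
    int row = get_global_id(0);
    float acc = 0.0f;
    for (int col = 0; col < n; ++col)
        acc += m[row * n + col];
    out[row] = acc;
}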
Now if you take the same example and apply solution 1, but this time with 4 workitems instead of n, on the same GPU I mentioned before, you'll most probably increase the time by a factor of 2 since you will waste half of your memory transactions.
EDIT: Now that you posted your kernel I see that I forgot to mention something else.
With your kernel, it seems that choosing a local size of (1, 256) or (256, 1) is always a bad choice. In the first case, 256 transactions will be necessary to read a column (each fetching 32 bytes of which only 4 will be used, keeping in mind the same GPU from my previous examples) from input, while 32 transactions will be necessary to write to output: you can write 8 floats in one transaction, hence 32 transactions to write the 256 elements.
This is the same problem with a workgroup size of (256, 1) but this time using 32 transactions to read, and 256 to write.
So why does the first size work better? It's because there is a cache system that can mitigate the bad accesses for the read part. Therefore the size (1, 256) is good for the write part, and the cache system handles the not-so-good read part, decreasing the number of necessary read transactions.
Note that the number of transactions decreases overall (taking into consideration all the workgroups within the NDRange). For example, the first workgroup issues 256 transactions to read the first 256 elements of the first column. The second workgroup might just go to the cache to retrieve the elements of the second column, because they were fetched by the (32-byte) transactions issued by the first workgroup.
Now, I'm almost sure that you can do better than (1, 256): try (8, 32).

How to get a "random" number in OpenCL

I'm looking to get a random number in OpenCL. It doesn't have to be real random or even that random. Just something simple and quick.
I see there are a ton of truly random, parallelized, fancy-pants random algorithms in OpenCL that are thousands and thousands of lines. I do NOT need anything like that. A simple 'random()' would be fine, even if it is easy to see patterns in it.
I see there is a Noise function? Any easy way to use that to get a random number?
I was working on this "no random" issue for the last few days and came up with three different approaches:
Xorshift - I created a generator based on this one. All you have to do is provide one uint2 number (seed) for the whole kernel, and every work item will compute its own random number:
// 'randoms' is uint2 passed to kernel
uint seed = randoms.x + globalID;
uint t = seed ^ (seed << 11);
uint result = randoms.y ^ (randoms.y >> 19) ^ (t ^ (t >> 8));
Java random - I used the code from the .next(int bits) method to generate a random number. This time you have to provide one ulong number as the seed:
// 'randoms' is ulong passed to kernel
ulong seed = randoms + globalID;
seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
uint result = seed >> 16;
Just generate everything on the CPU and pass it to the kernel in one big buffer.
I tested all three approaches (generators) in my evolution algorithm computing Minimum Dominating Set in graphs.
I like the generated numbers from the first one, but it looks like my evolution algorithm doesn't.
The second generator produces numbers that have some visible pattern, but my evolution algorithm likes them anyway, and the whole thing runs a little faster than with the first generator.
But the third approach shows that it's absolutely fine to just provide all numbers from the host (CPU). First I thought that generating (in my case) 1536 int32 numbers and passing them to the GPU on every kernel call would be too expensive (to compute and to transfer to the GPU). But it turns out it is as fast as my previous attempts, and the CPU load stays under 5%.
BTW, I also tried MWC64X Random, but after I installed a new GPU driver the function mul_hi started causing build failures (even the whole AMD Kernel Analyzer crashed).
The following is the algorithm used by the java.util.Random class according to the documentation:
(seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1)
See the documentation for its various implementations. Passing the worker's id in for the seed and looping a few times should produce decent randomness,
or another method would be to have some random operations occur that are fairly certain to overflow:
long rand= yid*xid*as_float(xid-yid*xid);
rand*=rand<<32^rand<<16|rand;
rand*=rand+as_double(rand);
with xid = get_global_id(0); and yid = get_global_id(1);
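A sketch of the first suggestion above (seed with the worker's id and iterate the java.util.Random step a few times); the kernel and names are my own assumptions, not from the answer:
__kernel void lcg_random(__global uint* out)
{
    int gid = get_global_id(0);
    ulong seed = (ulong)gid;
    // Iterate the LCG a few times so neighbouring ids diverge.
    for (int i = 0; i < 4; ++i)
        seed = (seed * 0x5DEECE66DUL + 0xBUL) & ((1UL << 48) - 1);
    out[gid] = (uint)(seed >> 16);
}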
I am currently implementing a Realtime Path Tracer. You might already know that Path Tracing requires many many random numbers.
Before generating random numbers on the GPU I simply generated them on the CPU (using rand(), which sucks) and passed them to the GPU.
That quickly became a bottleneck.
Now I am generating the random numbers on the GPU with the Park-Miller Pseudorandom Number Generator (PRNG).
It is extremely simple to implement and achieves very good results.
I took thousands of samples (in the range of 0.0 to 1.0) and averaged them together.
The resulting value was very close to 0.5 (which is what you would expect). Between different runs the divergence from 0.5 was around 0.002. Therefore it has a very uniform distribution.
Here's a paper describing the algorithm: http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/ufeen8-15-m/p1192-parkmiller.pdf
And here's a paper about the above algorithm optimized for CUDA (which can easily be ported to OpenCL): http://www0.cs.ucl.ac.uk/staff/ucacbbl/ftp/papers/langdon_2009_CIGPU.pdf
Here's an example of how I'm using it:
int rand(int* seed) // 1 <= *seed < m
{
    int const a = 16807;      // ie 7**5
    int const m = 2147483647; // ie 2**31-1
    // Widen to long before multiplying so the product does not overflow int.
    *seed = (int)(((long)(*seed) * a) % m);
    return *seed;
}
kernel void random_number_kernel(global int* seed_memory)
{
    int global_id = get_global_id(1) * get_global_size(0) + get_global_id(0); // Get the global id in 1D.
    // Since the Park-Miller PRNG generates a SEQUENCE of random numbers
    // we have to keep track of the previous random number, because the next
    // random number will be generated using the previous one.
    int seed = seed_memory[global_id];
    int random_number = rand(&seed); // Generate the next random number in the sequence.
    seed_memory[global_id] = seed; // Save the seed for the next time this kernel gets enqueued.
}
The code serves just as an example. I have not tested it.
The array "seed_memory" is being filled with rand() only once before the first execution of the kernel. After that, all random number generation is happening on the GPU. I think it's also possible to simply use the kernel id instead of initializing the array with rand().
It seems OpenCL does not provide such functionality. However, some people have done some research on that and provide BSD licensed code for producing good random numbers on GPU.
This is my version of OpenCL float pseudorandom noise, using a trigonometric function:
// noise values in range of 0.0 to 1.0
static float noise3D(float x, float y, float z) {
    float ptr = 0.0f;
    return fract(sin(x*112.9898f + y*179.233f + z*237.212f) * 43758.5453f, &ptr);
}
__kernel void fillRandom(float seed, __global float* buffer, int length) {
    int gi = get_global_id(0);
    float fgi = (float)gi / length;
    buffer[gi] = noise3D(fgi, 0.0f, seed);
}
You can generate 1D or 2D noise by passing normalized index coordinates to noise3D as the first parameters, and the random seed (generated on the CPU, for example) as the last parameter.
Here are some noise pictures generated with this kernel and different seeds:
GPUs don't have good sources of randomness, but this can be easily overcome by seeding a kernel with a random seed from the host. After that, you just need an algorithm that can work with a massive number of concurrent threads.
This link describes a Mersenne Twister implementation using OpenCL: Parallel Mersenne Twister. You can also find an implementation in the NVIDIA SDK.
I had the same problem. The Random123 paper describes an approach that works well on GPUs:
www.thesalmons.org/john/random123/papers/random123sc11.pdf
You can find the documentation here:
http://www.thesalmons.org/john/random123/releases/latest/docs/index.html
You can download the library here:
http://www.deshawresearch.com/resources_random123.html
Why not? You could just write a kernel that generates random numbers, though that would need more kernel calls, and eventually pass the random numbers as an argument to your other kernel which needs them.
You can't generate random numbers in a kernel; the best option is to generate the random numbers on the host (CPU) and then transfer them to the GPU through buffers and use them in the kernel.
