OpenCL clEnqueueNDRangeKernel work_dim VS global_work array elements

OpenCL clEnqueueNDRangeKernel work_dim VS global_work array elements - opencl

I'm new in OpenCL and I'm trying to understand this piece of code:
size_t global_work1[3] = {BLOCK_SIZE, 1, 1};
size_t local_work1[3] = {BLOCK_SIZE, 1, 1};
err = clEnqueueNDRangeKernel(cmd_queue, diag, 2, NULL, global_work1, local_work1, 0, 0, 0);
So, in the clEnqueueNDRangeKernel 2 dimension for the kernel are specified (work_dim field), this means that:
the dimension 0 kernel got a number of threads equal to BLOCK_SIZE and only one group (I guess the number of groups can be calculated in this way => ( global_work1[0] ) / ( local_work1[0] ) ).
the dimension 1 Kernel got a number of threads equal to 1 and only one group.
and also why a dimension of 2 is specified in the queue instruction when three are the elements in global_work1 and local_work1.

You are telling CL:
"Run this kernel, in this queue, with 2D and these global/local sizes"
CL is just getting the first 2 dimensions of the argument, and ignoring the 3rd one.
About the difference between 1D and 2D. There is none. Since OpenCL kernels launched as 1D do not fail on get_global_id(1) and get_global_id(2) calls. They will just return 0. So there is no difference at all, apart from the hint that the kernel will probably support bigger sizes for the 2nd dimension argument (ie: 512x128)

Related

Most elegant way to determine how much one has been bitshifted

So let's say we bitshift 1 by some number x; eg, in c:
unsigned char cNum= 1, x= 6;
cNum <<= x;
cNum will equal 01000000b (0x40).
Easy peasy. But without using a lookup table or while loop, is there a simple operation that will take cNum and give me x back?

AFAIK, no 'simple' formula is available.
One can, however, calculate the index of most significant (or least significant) set bit:
a = 000010010, a_left = 3, a_right = 1
b = 001001000, b_left = 5, b_right = 3
The difference of the shifts is 2 (or -2).
One can then shift the smaller by abs(shift) to compare that a << 2 == b. (In some architectures there exists a shift by signed value, which works without absolute value or checking which way the shift needs to be carried.)
In ARM Neon there exists an instruction for counting the MSB bit and in Intel there exists an instruction to scan both from left and right.

log2(cNum)+ 1; will yield x where cNum != 0, at least in GNU c.
And the compiler does the casts automagically which is probably bad form, but it gives me what I need.

OpenCL: multiple work items saving results to the same global memory address

I'm trying to do a reduce-like cumulative calculation where 4 different values need to be stored depending on certain conditions. My kernel receives long arrays as input and needs to store only 4 values, which are "global sums" obtained from each data point on the input. For example, I need to store the sum of all the data values satisfying certain condition, and the number of data points that satisfy said condition. The kernel is below to make it clearer:
__kernel void photometry(__global float* stamp,
__constant float* dark,
__global float* output)
{
int x = get_global_id(0);
int s = n * n;
if(x < s){
float2 curr_px = (float2)((x / n), (x % n));
float2 center = (float2)(centerX, centerY);
int dist = (int)fast_distance(center, curr_px);
if(dist < aperture){
output[0] += stamp[x]-dark[x];
output[1]++;
}else if (dist > sky_inner && dist < sky_outer){
output[2] += stamp[x]-dark[x];
output[3]++;
}
}
}
All the values not declared in the kernel are previously defined by macros. s is the length of the input arrays stamp and dark, which are nxn matrices flattened down to 1D.
I get results but they are different from my CPU version of this. Of course I am wondering: is this the right way to do what I'm trying to do? Can I be sure that each pixel data is only being added once? I can't think of any other way to save the cumulative result values.

Atomic operation is needed in your case, otherwise data races will cause the results unpredictable.
The problem is here:
output[0] += stamp[x]-dark[x];
output[1]++;
You can imagine that threads in the same wave might still follow the same step, therefore, it might be OK for threads inside the same wave. Since they read the same output[0] value using a global load instruction (broadcasting). Then, when they finish the computation and try to store data into the same memory address (output[0]), the writing operations will be serialized. To this point, you may still get the correct results (for the work items inside the same wave).
However, since it is highly likely that your program launches more than one wave (in most applications, this is the case). Different waves may execute in an unknown order; then, when they access the same memory address, the behavior becomes more complicated. For example, wave0 and wave1 may access output[0] in the beginning before any other computation happens, that means they fetch the same value from output[0]; then they start the computation. After computation, they save their accumulative results into output[0]; apparently, result from one of the waves will be overwritten by another one, as if only the one who writes memory later got executed. Just imagine that you have much more waves in a real application, so it is not strange to have a wrong result.

You can do this in O(log2(n)) concurrently. a concept idea:
You have 16 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16) inputs and you want to have the sum of these inputs concurrently.
you can concurrently sum 1 in 2, 3 in 4, 5 in 6, 7 in 8, 9 in 10, 11 in 12, 13 in 14, 15 in 16
then you sum concurrently 2 in 4, 6 in 8, 10 in 12, 14 in 16
then always concurrently 4 in 8, 10 in 16
and finally 8 in 16
everything done in O(log2(n)) in our case in 4 passages.

Kernel attributes in the two-stage reduction example in OpenCL proposed by AMD

I have some problems understanding the two-stage reduction algorithm described here.
__kernel
void reduce(__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
float accumulator = INFINITY;
// Loop sequentially over chunks of input vector
while (global_index < length) {
float element = buffer[global_index];
accumulator = (accumulator < element) ? accumulator : element;
global_index += get_global_size(0);
}
// Perform parallel reduction
int local_index = get_local_id(0);
scratch[local_index] = accumulator;
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2;
offset > 0;
offset = offset / 2) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = (mine < other) ? mine : other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_index == 0) {
result[get_group_id(0)] = scratch[0];
}
}
I understand the basic idea, but I am not sure about the while-loop. As far as I inderstand, the attribute length specifies the number of elements in the buffer, i.e. how many elements do I want to process at all. But get_global_size returns the global number of work-items. Aren't length and get_global_size equal then? This would mean that the while-loop condition wil be satisfied only once. Shouldn't we use get_local_size instead of get_global_size?

Aren't length and get_global_size equal then?
Not necessarily. It is common to launch less work items than there are data elements, and have each work item process more than one element. This way, you can decouple your input data size from the number of work items.
In this case, the following:
// Loop sequentially over chunks of input vector
while (global_index < length) {
float element = buffer[global_index];
accumulator = (accumulator < element) ? accumulator : element;
global_index += get_global_size(0);
}
Performs a min-reduction of an array that resides in global memory. Basically, the work group will "slide" over the input vector, and at every iteration, each work item will update its minimum.
Here's a fictitious numerical example where we launch 2 work groups of 4 work-items over an array of 20 elements. xN represents the Nth element from the input array, aN and bN represent the Nth work item from work group a and b, respectively. Therefore the while condition is met between 2 to 3 times, depending on the work item id:
length: 20
get_global_size(): 8
get_local_size(): 4
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 Input array
--------------------------------------------------------------------- Iterations
a0 a1 a2 a3 b0 b1 b2 b3 0
a0 a1 a2 a3 b0 b1 b2 b3 1
a0 a1 a2 a3 2
When the while loop finishes, every work item will have computed a minimum over a subset of the input array. For example, a0 will have computed min(x0, x8, x16) and b0 will have computed min(x4, x12).
Then the work items write the minimum they computed to local memory, and the work groups proceed to do a min-reduction (with a reduction tree) in local memory. Their result is written back to global memory, and presumably the kernel is called again with result as the new array to min-reduce, until the final result is a single element.

The global size may be larger than the length because in OpenCL 1.x the global size must be a whole number multiple of the work group size. Therefore the global size might have been rounded up from the data size (length). For example, if length was 1000 but the work group size was 128 then the global size would be 1024.

[FULL Description]
Overview:
This is a Two Stage reduction, which outperform the recursive multistage reduction by reducing synchronizations/barrier and overheads, and keeping all the computing unit as busy as possible. Before understand the kernel, it is important to understand work items and work groups configuration set by the host program and the parameters of the kernel. In this example, the task was to find the min value of N float numbers. The configurations are given below.
Setup:
The work group configuration are, the host sets up K number of work items (K < N) and P work groups. Each work group will have Q work items where K=P*Q. It is preferable that N%K==0, but not necessary.
The parameters and dimensions of the kernel are: 1) The first argument is an N size array contains N data elements (candidate data for finding min); 2) The second argument is an empty array of size Q; 3) The value of length is equal to N;and 4) the result is an array of size P.
Workflow: Stage 1
The work flow is as given below:
If N%K== 0, each work item initially find the minimum value among N/K data elements, where the data elements are apart from each other by K items. The while loop does this task. If N%K != 0, some of the work item calculate min of ceil(N/K) elements and the rest of the work items find min of floor(N/K) elements.(as explained in the above answer by Kretab Chabawenizc).
The findings of each of these work items are initially stored in the local variable accumulator and then finally saved into the local array scratch. Once all the work items are done with this part of work (ensured by the barrier(CLK_LOCAL_MEM_FENCE)) the kernel start acting as a recursive parallel reduction kernel. Work items from a specific work group consider scratchpad as the data items array and each of the work items then reduce it by iteration (the for loop does this. Read the actual AMD documentation to get more explanation).
Finally the first P elements of result will contain the minimum value find by each of the P work groups.
Workflow: Stage 2
Now the second stage starts; and in this stage the same kernel can be invoked for P work items and 1 work group. The result array will be the first argument of the kernel this time and an one element array will be the last argument of the kernel to receive the final result.
In this run, the while loop will not do anything significant but just copy the values from the buffer to scratch. Thus you can come up with a more optimized kernel and use that for the second stage.

Do global_work_size and local_work_size have any effect on application logic?

I am trying to understand how all of the different parameters for dimensions fit together in OpenCL. If my question isn't clear that's partly because a well formed question requires bits of the answer which I don't have.
How do work_dim, global_work_size, and local_work_size work together to create the execution space that you use in a kernel? For example, if I make work_dim 2 then I can
get_global_id(0);
get_global_id(1);
I can divide those two dimensions up into n Work Groups using global_work_size, right? So if I make the global_work_size like so
size_t global_work_size[] = { 4 };
Then each dimension would have 4 work groups for a total of 8? But, as a beginner, I am only using global_id for my indices so only the global id's matter anyway. As you can tell I am pretty confused about all of this so any help you can offer would ...help.
image i made to try to understand this question
image decribing work groups i found on google

Since you stated yourself that you are a bit confused about the concepts involved in the execution space, I'm gonna try to summary them before answering your question and give some examples.
The threads/workitems are organized in a NDRange which can be viewed as a grid of 1, 2, 3 dims.
The NDRange is mainly used to map each thread to the piece of data each of them will have to manipulate. Therefore each thread should be uniquely identified and a thread should know which one it is and where it stands in the NDRange. And there come the Work-Item Built-in Functions. These functions can be called by all threads to give them info about themself and the NDRange where they stand.
The dimensions:
As already stated, an NDRange can have up to 3 dimensions. So if you set the dimensions this way:
size_t global_work_size[2] = { 4, 4 };
It doesn't mean that each dimension would have 4 work groups for a total of 8, but that you'll have 4 * 4 i.e. 16 threads in your NDRange. These threads will be arranged in a "square" with sides of 4 units. The workitems can know how many dimensions the NDRange is made of, using the uint get_work_dim () function.
The global size:
Threads can also query how big is the NDRange for a specific dimension with size_t get_global_size (uint D). Therefore they can know how big is the "line/square/rectangle/cube" NDRange.
The global unique identifiers:
Thanks to that organization, each thread can be uniquely identified with indexes corresponding to the specific dimensions. Hence the thread (2, 1) refers to a thread that is in the 3rd column and the second row of a 2D range. The function size_t get_global_id (uint D) is used in the kernel to query the id of the threads.
The workgroup (or local) size:
The NDRange can be split in smaller groups called workgroups. This is the local_work_size you were referring to which has also (and logically) up to 3 dimensions. Note that for OpenCL version below 2.0, the NDRange size in a given dimension must be a multiple of the workgroup size in that dimension. so to keep your example, since in the dimension 0 we have 4 threads, the workgroup size in the dimension 0 can be 1, 2, 4 but not 3. Similarly to the global size, threads can query the local size with size_t get_local_size (uint D).
The local unique identifiers:
Sometime it is important that a thread can be uniquely identified within a workgroup. Hence the function size_t get_local_id (uint D). Note the "within" in the previous sentence. a thread with a local id (1, 0) will be the only one to have this id in its workgroup (of 2D). But there will be as many threads with a local id (1, 0) as there will be workgroups in the NDRange.
The number of groups:
Speaking of groups sometime a thread might need to know how many groups there are. That's why the function size_t get_num_groups (uint D) exists. Note that again you have to pass as parameter the dimension you are interested in.
Each group has also an id:
...that you can query within a kernel with the function size_t get_group_id (uint D). Note that the format of the group ids will be similar to those of the threads: tuples of up to 3 elements.
Summary:
To wrap things up a bit, if you have a 2D NDRange of a global work size of (4, 6) and a local work size of (2, 2) it means that:
the global size in the dimension 0 will be 4
the global size in the dimension 1 will be 6
the local size (or workgroup size) in the dimension 0 will be 2
the local size (or workgroup size) in the dimension 1 will be 2
the thread global ids in the dimension 0 will range from 0 to 3
the thread global ids in the dimension 1 will range from 0 to 5
the thread local ids in the dimension 0 will range from 0 to 1
the thread local ids in the dimension 1 will range from 0 to 1
The total number of threads in the NDRange will be 4 * 6 = 24
The total number of threads in a workgroup will be 2 * 2 = 4
The total number of workgroups will be (4/2) * (6/2) = 6
the group ids in the dimension 0 will range from 0 to 1
the group ids in the dimension 1 will range from 0 to 2
there will be only one thread will the global id (0, 0) but there will be 6 threads with the local id (0, 0) because there are 6 groups.
Example:
Here is a dummy example to use all these concepts together (note that performance would be terrible, it's just a stupid example).
Let's say you have a 2D array of 6 rows and 4 columns of int. You want to group these elements in square of 2 by 2 elements and sum them up in such a way that for instance, the elements (0, 0), (0, 1), (1, 0), (1, 1) will be in one group (hope it's clear enough). Because you'll have 6 "squares" you'll have 6 results for the sums, so you'll need an array of 6 elements to store these results.
To solve this, you use our 2D NDRange detailed just above. Each thread will fetch from the global memory one element, and will store it in the local memory. Then after a synchronization, only one thread per workgroup, let say each local(0, 0) threads will sum the elements (in local) up and then store the result at a specific place in a 6 elements array (in global).
//in is a 24 int array, result is a 6 int array, temp is a 4 int array
kernel void foo(global int *in, global int *result, local int *temp){
//use vectors for conciseness
int2 globalId = (int2)(get_global_id(0), get_global_id(1));
int2 localId = (int2)(get_local_id(0), get_local_id(1));
int2 groupId = (int2)(get_group_id (0), get_group_id (1));
int2 globalSize = (int2)(get_global_size(0), get_global_size(1));
int2 locallSize = (int2)(get_local_size(0), get_local_size(1));
int2 numberOfGrp = (int2)(get_num_groups (0), get_num_groups (1));
//Read from global and store to local
temp[localId.x + localId.y * localSize.x] = in[globalId.x + globalId.y * globalSize.x];
//Sync
barrier(CLK_LOCAL_MEM_FENCE);
//Only the threads with local id (0, 0) sum elements up
if(localId.x == 0 && localId.y == 0){
int sum = 0;
for(int i = 0; i < locallSize.x * locallSize.y ; i++){
sum += temp[i];
}
//store result in global
result[groupId.x + numberOfGrp.x * groupId.y] = sum;
}
}
And finally to answer to your question: Do global_work_size and local_work_size have any effect on application logic?
Usually yes because it's part of the way you design you algo. Note that the size of the workgroup is not taken randomly but matches my need (here 2 by 2 square).
Note also that if you decide to use a NDRange of 1 dimension with a size of 24 and a local size of 4 in 1 dim, it'll screw things up too because the kernel was designed to use 2 dimensions.

MPI several broadcast at the same time

I have a 2D processor grid (3*3):
P00, P01, P02 are in R0, P10, P11, P12, are in R1, P20, P21, P22 are in R2.
P*0 are in the same computer. So the same to P*1 and P*2.
Now I would like to let R0, R1, R2 call MPI_Bcast at the same time to broadcast from P*0 to p*1 and P*2.
I find that when I use MPI_Bcast, it takes three times the time I need to broadcast in only one row.
For example, if I only call MPI_Bcast in R0, it takes 1.00 s.
But if I call three MPI_Bcast in all R[0, 1, 2], it takes 3.00 s in total.
It means the MPI_Bcast cannot work parallel.
Is there any methods to make the MPI_Bcast broadcast at the same time?
(ONE node broadcast with three channels at the same time.)
Thanks.

If I understand your question right, you would like to have simultaneous row-wise broadcasts:
P00 -> P01 & P02
P10 -> P11 & P12
P20 -> P21 & P22
This could be done using subcommunicators, e.g. one that only has processes from row 0 in it, another one that only has processes from row 1 in it and so on. Then you can issue simultaneous broadcasts in each subcommunicator by calling MPI_Bcast with the appropriate communicator argument.
Creating row-wise subcommunicators is extreamly easy if you use Cartesian communicator in first place. MPI provides the MPI_CART_SUB operation for that. It works like that:
// Create a 3x3 non-periodic Cartesian communicator from MPI_COMM_WORLD
int dims[2] = { 3, 3 };
int periods[2] = { 0, 0 };
MPI_Comm comm_cart;
// We do not want MPI to reorder our processes
// That's why we set reorder = 0
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_cart);
// Split the Cartesian communicator row-wise
int remaindims[2] = { 0, 1 };
MPI_Comm comm_row;
MPI_Cart_sub(comm_cart, remaindims, &comm_row);
Now comm_row will contain handle to a new subcommunicator that will only span the same row that the calling process is in. It only takes a single call to MPI_Bcast now to perform three simultaneous row-wise broadcasts:
MPI_Bcast(&data, data_count, MPI_DATATYPE, 0, comm_row);
This works because comm_row as returned by MPI_Cart_sub will be different in processes located at different rows. 0 here is the rank of the first process in comm_row subcommunicator which will correspond to P*0 because of the way the topology was constructed.
If you do not use Cartesian communicator but operate on MPI_COMM_WORLD instead, you can use MPI_COMM_SPLIT to split the world communicator into three row-wise subcommunicators. MPI_COMM_SPLIT takes a color that is used to group processes into new subcommunicators - processes with the same color end up in the same subcommunicator. In your case color should equal to the number of the row that the calling process is in. The splitting operation also takes a key that is used to order processes in the new subcommunicator. It should equal the number of the column that the calling process is in, e.g.:
// Compute grid coordinates based on the rank
int proc_row = rank / 3;
int proc_col = rank % 3;
MPI_Comm comm_row;
MPI_Comm_split(MPI_COMM_WORLD, proc_row, proc_col, &comm_row);
Once again comm_row will contain the handle of a subcommunicator that only spans the same row as the calling process.

The MPI-3.0 draft includes a non-blocking MPI_Ibcast collective. While the non-blocking collectives aren't officially part of the standard yet, they are already available in MPICH2 and (I think) in OpenMPI.
Alternatively, you could start the blocking MPI_Bcast calls from separate threads (I'm assuming R0, R1 and R2 are different communicators).
A third possibility (which may or may not be possible) is to restructure the data so that only one broadcast is needed.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex