I have a set of operations running in a loop.
for(int i = 0; i < row; i++)
{
    sum += arr1[0] - arr2[0];
    sum += arr1[1] - arr2[1];
    sum += arr1[2] - arr2[2];
    sum += arr1[3] - arr2[3];
    arr1 += offset1;
    arr2 += offset2;
}
Now I'm trying to vectorize the operations like this
for(int i = 0; i < row; i++)
{
    convert_int4(vload4(0, arr1) - vload4(0, arr2));
    arr1 += offset1;
    arr2 += offset2;
}
But how do I accumulate the resulting vector in the scalar sum without using a loop?
I'm using OpenCL 2.0.
The operation is called "reduction" and there seems to be some information on it here.
In OpenCL special functions seem to be implemented, one being work_group_reduce() that might aid you: link.
And a presentation including some code: link.
For float2, float4 and similar types, the easiest version could be a dot product. (Converting from int to float could be expensive, though.)
float4 v1=(float4 )(1,2,3,4);
float4 v2=(float4 )(5,6,7,8);
float sum=dot(v1-v2,(float4)(1,1,1,1));
this is equal to
(v1.x-v2.x)*1 + (v1.y-v2.y)*1+(v1.z-v2.z)*1+(v1.w-v2.w)*1
If the hardware has any support for it, leaving this to the compiler should be okay. For larger vectors, and especially for arrays, J.H.Bonarius's answer is the way to go. As far as I know, only CPUs have such horizontal (across-lane) sum instructions and GPUs do not, but for the sake of portability, dot product and work_group_reduce are the easiest ways to get readable code and often good performance.
The dot product performs extra multiplications, so it may not always be the best choice.
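The identity behind the dot-product trick can be checked on the host. The following C++ sketch is only an illustration: Float4, dot4 and sum_of_differences are hypothetical names, and a plain array stands in for OpenCL's float4.

```cpp
#include <array>

using Float4 = std::array<float, 4>;  // stand-in for OpenCL's float4 (assumption)

// Plain 4-component dot product, written out term by term.
float dot4(const Float4& a, const Float4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Sum of the component-wise differences, expressed as a dot product with ones.
float sum_of_differences(const Float4& v1, const Float4& v2) {
    Float4 diff = {v1[0] - v2[0], v1[1] - v2[1], v1[2] - v2[2], v1[3] - v2[3]};
    return dot4(diff, Float4{1.0f, 1.0f, 1.0f, 1.0f});
}
```

With v1 = (1,2,3,4) and v2 = (5,6,7,8), every difference is -4, so the result is -16.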
I have found a solution which seems to be the closest way I could have expected to solve my problem.
uint sum = 0;
uint4 S = (uint4)(0); // the accumulator must start at zero
for(int i = 0; i < row; i++)
{
    S += convert_uint4(vload4(0, arr1) - vload4(0, arr2));
    arr1 += offset1;
    arr2 += offset2;
}
S.s01 = S.s01 + S.s23;
sum = S.s0 + S.s1;
OpenCL 2.0 supports this through vector component selectors (such as .s01 and .s23): one half of the vector is added to the other half, halving the width at each step, as shown above. This works for vectors of up to 16 elements. Larger reductions can be split into several smaller ones. For example, to add the absolute values of the differences between two vectors of size 32, we can do the following:
uint sum = 0;
uint16 S0 = (uint16)(0), S1 = (uint16)(0); // accumulators must start at zero
for(int i = 0; i < row; i++)
{
    S0 += convert_uint16(abs(vload16(0, arr1) - vload16(0, arr2)));
    S1 += convert_uint16(abs(vload16(1, arr1) - vload16(1, arr2)));
    arr1 += offset1;
    arr2 += offset2;
}
S0 = S0 + S1;
S0.s01234567 = S0.s01234567 + S0.s89abcdef;
S0.s0123 = S0.s0123 + S0.s4567;
S0.s01 = S0.s01 + S0.s23;
sum = S0.s0 + S0.s1;
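The halving pattern above can be mimicked on the host to see why four steps are enough for 16 lanes. This is a minimal C++ sketch, not OpenCL: UInt16Vec and horizontal_sum are hypothetical names, and a plain array stands in for the uint16 vector type.

```cpp
#include <array>
#include <cstdint>

using UInt16Vec = std::array<std::uint32_t, 16>;  // stand-in for OpenCL's uint16 (assumption)

// Adds the upper half onto the lower half, then repeats on the lower half:
// 16 -> 8 -> 4 -> 2 -> 1 active lanes, i.e. four steps for 16 lanes.
std::uint32_t horizontal_sum(UInt16Vec v) {
    for (unsigned half = 8; half >= 1; half /= 2) {
        for (unsigned i = 0; i < half; ++i) {
            v[i] += v[i + half];
        }
    }
    return v[0];
}
```

In the OpenCL code each inner loop corresponds to a single vector addition, so only the outer steps remain.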
Related
I have the following problem to solve: given a number N and 1<=k<=N, count the number of ways to write N as a sum of numbers from (1,...,k). Terms may repeat (e.g. if N=3 and k=2, (1,1,1) is a valid sum), but permutations must not be counted separately (e.g. if N=3 and k=2, count (1,2) and (2,1) as a single solution). I have implemented the recursive Python code below, but I'd like to find a better solution (maybe with dynamic programming?). It seems similar to the triple step problem, but with the extra constraint of not counting permutations.
def find_num_sums_aux(n, min_k, max_k):
    # base case
    if n == 0:
        return 1
    count = 0
    # due to lower bound min_k, we evaluate only ordered solutions and prevent permutations
    for i in range(min_k, max_k+1):
        if n-i>=0:
            count += find_num_sums_aux(n-i, i, max_k)
    return count

def find_num_sums(n, k):
    count = find_num_sums_aux(n,1,k)
    return count
This is a standard dynamic programming problem (a variant of the subset sum problem).
Let's define the function f(i,j), which gives the number of ways you can get the sum j using a subset of the numbers (1...i). The answer to your problem is then f(k,n).
For each number x in the range (1...i), x either is part of the sum j or is not, so we need to count both possibilities.
Note: f(i,0) = 1 for any i, which means there is exactly one way to get the sum 0: by not taking any number from the range (1...i).
Here is the code written in C++:
int n = 10;
int k = 7;
int f[8][11];
//initializing the array with zeroes
for (int i = 0; i <= k; i++)
    for (int j = 0; j <= n; j++)
        f[i][j] = 0;
f[0][0] = 1;
for (int i = 1; i <= k; i++) {
    for (int j = 0; j <= n; j++) {
        if (j == 0)
            f[i][j] = 1;
        else {
            f[i][j] = f[i - 1][j];//without adding i to the sum j
            if (j - i >= 0)
                f[i][j] = f[i][j] + f[i - 1][j - i];//adding i to the sum j
        }
    }
}
cout << f[k][n] << endl;//print f(k,n)
Update
To handle the case where elements can repeat, e.g. (1,1,1) giving the sum 3, you just need to allow picking the same element multiple times by changing the following line of code:
f[i][j] = f[i][j] + f[i - 1][j - i];//adding i to the sum
To this:
f[i][j] = f[i][j] + f[i][j - i];
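Putting the updated recurrence into a function gives something like the following C++ sketch. The name count_sums is hypothetical, and the function assumes n >= 0 and k >= 0; it uses vectors instead of the fixed-size array so that it works for any n and k.

```cpp
#include <vector>

// Hypothetical helper: number of ways to write n as a sum of parts from 1..k,
// where parts may repeat and the order of the parts does not matter.
long long count_sums(int n, int k) {
    // f[i][j] = number of ways to reach the sum j using parts from 1..i
    std::vector<std::vector<long long>> f(k + 1, std::vector<long long>(n + 1, 0));
    for (int i = 0; i <= k; ++i) {
        f[i][0] = 1;  // the empty sum is the only way to reach 0
    }
    for (int i = 1; i <= k; ++i) {
        for (int j = 1; j <= n; ++j) {
            f[i][j] = f[i - 1][j];        // part i is not used
            if (j >= i) {
                f[i][j] += f[i][j - i];   // part i is used at least once
            }
        }
    }
    return f[k][n];
}
```

For the example in the question, count_sums(3, 2) counts (1,1,1) and (1,2), so it returns 2.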
I'm trying to write a trimmed mean kernel that takes as input a set of frames (~100). I'm thinking of using an insertion sort (of size ~8). This means that I'll need to read one float/uint/ushort at a time from the input images and compare it against an 8-wide vector, shifting the elements up and inserting the new value at the correct spot (if necessary), with the largest value added to the mean.
I'm having difficulties finding a portable way of shifting the elements in the vector and inserting the new one at the correct spot. I know that AMD GPUs have ds_permute for example, but those are not portable, and I can't figure out a clever way of using arithmetic and relational operators to do it (since those operate only on their lane and AFAIK unaligned vector accesses are UB in OpenCL).
If you only have 8 items in your list, then you could add some indirection and have an index table uchar[8]. You assign the unsorted elements the indices 0-7. As you perform the sort you don't rearrange those items; instead you insert their indices into the table.
To get the speedup you then need to store each index using 4 bits, so that all 8 fit into a 32-bit word. Honestly, I don't think this will be faster in your case though.
float elements[8];
uint index_table = 0; // eight 4-bit indices; slot s lives in bits 4*s .. 4*s+3
uint sorted_size = 0;
// insert elements[i]
void insert(uint i)
{
    uint temp = index_table;
    for (uint j = 0; j < sorted_size; ++j)
    {
        if (elements[i] < elements[temp & 0xf])
        {
            // Insert i
            temp = (temp << 4) | i;
            // keep slots 0..j-1, then place i followed by the shifted remainder
            index_table = (index_table & ((1u << (4 * j)) - 1)) | (temp << (4 * j));
            return;
        }
        temp >>= 4;
    }
    // Insert at end
    index_table |= i << (4 * sorted_size);
}
void insertion_sort()
{
    // We can skip the first iteration since the 1st element is always inserted at the start
    for (sorted_size = 1; sorted_size < 8; ++sorted_size)
    {
        insert(sorted_size);
    }
}
float ith_smallest(uint i)
{
    return elements[(index_table >> (4 * i)) & 0xf];
}
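To try the packed index table outside a kernel, here is a host-side C++ sketch of the same idea. PackedSorter is a hypothetical type introduced only for illustration. It assumes exactly 8 elements and that slot s of the table occupies bits 4*s to 4*s+3; unlike the snippet above, insert updates sorted_size itself.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper type: eight values plus a packed table of 4-bit indices.
struct PackedSorter {
    std::array<float, 8> elements{};
    std::uint32_t index_table = 0;  // slot s occupies bits [4*s, 4*s + 3]
    unsigned sorted_size = 0;

    // Insert the index i so that the first sorted_size slots stay ordered by value.
    void insert(unsigned i) {
        std::uint32_t rest = index_table;  // slots j.. of the table, shifted down
        for (unsigned j = 0; j < sorted_size; ++j) {
            if (elements[i] < elements[rest & 0xFu]) {
                std::uint32_t low_mask = (1u << (4 * j)) - 1u;         // slots 0..j-1
                std::uint32_t shifted = ((rest << 4) | i) << (4 * j);  // i, then old slots j..
                index_table = (index_table & low_mask) | shifted;
                ++sorted_size;
                return;
            }
            rest >>= 4;
        }
        // Larger than (or equal to) everything sorted so far: append at the end.
        index_table |= static_cast<std::uint32_t>(i) << (4 * sorted_size);
        ++sorted_size;
    }

    void sort() {
        index_table = 0;
        sorted_size = 0;
        for (unsigned i = 0; i < 8; ++i) {
            insert(i);
        }
    }

    float ith_smallest(unsigned i) const {
        return elements[(index_table >> (4 * i)) & 0xFu];
    }
};
```

The values in elements never move; only the 32-bit table changes, which is the point of the indirection.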
I am very new to OpenCL and am going through the Altera OpenCL examples.
In their matrix multiplication example, they have used the concept of blocks, where dimensions of the input matrices are multiple of block size. Here's the code:
void matrixMult( // Input and output matrices
                 __global float *restrict C,
                 __global float *A,
                 __global float *B,
                 // Widths of matrices.
                 int A_width, int B_width)
{
    // Local storage for a block of input matrices A and B
    __local float A_local[BLOCK_SIZE][BLOCK_SIZE];
    __local float B_local[BLOCK_SIZE][BLOCK_SIZE];
    // Block index
    int block_x = get_group_id(0);
    int block_y = get_group_id(1);
    // Local ID index (offset within a block)
    int local_x = get_local_id(0);
    int local_y = get_local_id(1);
    // Compute loop bounds
    int a_start = A_width * BLOCK_SIZE * block_y;
    int a_end = a_start + A_width - 1;
    int b_start = BLOCK_SIZE * block_x;
    float running_sum = 0.0f;
    for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
    {
        A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
        B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            running_sum += A_local[local_y][k] * B_local[local_x][k];
        }
    }
    // Store result in matrix C
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = running_sum;
}
Assume block size is 2, then: block_x and block_y are both 0; and local_x and local_y are both 0.
Then A_local[0][0] would be A[0] and B_local[0][0] would be B[0].
Sizes of A_local and B_local are 4 elements each.
In that case, how would A_local and B_local access other elements of the block in that iteration?
Also would separate threads/cores be assigned for each local_x and local_y?
There is definitely a barrier missing in your code sample. The outer for loop as you have it will only produce correct results if all work items execute their instructions in lockstep, which would guarantee that the local memory is populated before the inner k loop runs.
Maybe this is the case for Altera and other FPGAs, but it is not guaranteed on CPUs and GPUs.
You should add barrier(CLK_LOCAL_MEM_FENCE); if you are getting unexpected results, or if you want to be compatible with other types of hardware.
float running_sum = 0.0f;
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
{
    A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
    B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
    barrier(CLK_LOCAL_MEM_FENCE);
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        running_sum += A_local[local_y][k] * B_local[local_x][k];
    }
}
A_local and B_local are both shared by all work items of the work group, so all their elements are loaded in parallel (by all work items of the work group) at each step of the encompassing for loop.
Then each work item uses some of the loaded values (not necessarily the values the work item loaded itself) to do its share of the computation.
And finally, the work item stores its individual result into the global output matrix.
It is a classical tiled implementation of a matrix-matrix multiplication. However, I'm really surprised not to see any sort of call to a memory synchronisation function, such as work_group_barrier(CLK_LOCAL_MEM_FENCE) between the load of A_local and B_local and their use in the k loop... But I might very well have overlooked something here.
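To see how the whole tile gets filled even though each work item loads only one element, the kernel can be emulated sequentially on the host. The C++ sketch below is an assumption-laden illustration, not the Altera code: tiled_matmul is a hypothetical name, the loops over block_y/block_x stand in for work groups, and the loops over local_y/local_x stand in for the work items of one group. The load phase finishes for all work items before the compute phase starts, which is what the barrier enforces in the kernel.

```cpp
#include <vector>

constexpr int BLOCK_SIZE = 2;  // assumed tile width; all dimensions must be multiples of it

// Row-major C (A_height x B_width) = A (A_height x A_width) * B (A_width x B_width).
std::vector<float> tiled_matmul(const std::vector<float>& A, const std::vector<float>& B,
                                int A_height, int A_width, int B_width) {
    std::vector<float> C(A_height * B_width, 0.0f);
    for (int block_y = 0; block_y < A_height / BLOCK_SIZE; ++block_y) {
        for (int block_x = 0; block_x < B_width / BLOCK_SIZE; ++block_x) {
            float A_local[BLOCK_SIZE][BLOCK_SIZE];
            float B_local[BLOCK_SIZE][BLOCK_SIZE];
            float running_sum[BLOCK_SIZE][BLOCK_SIZE] = {};  // one sum per work item
            const int a_start = A_width * BLOCK_SIZE * block_y;
            const int b_start = BLOCK_SIZE * block_x;
            for (int a = a_start, b = b_start; a < a_start + A_width;
                 a += BLOCK_SIZE, b += BLOCK_SIZE * B_width) {
                // Phase 1: every work item loads exactly one element of each tile.
                for (int local_y = 0; local_y < BLOCK_SIZE; ++local_y) {
                    for (int local_x = 0; local_x < BLOCK_SIZE; ++local_x) {
                        A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
                        B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
                    }
                }
                // (In the kernel, a barrier separates the two phases.)
                // Phase 2: every work item reads a full row of each tile.
                for (int local_y = 0; local_y < BLOCK_SIZE; ++local_y) {
                    for (int local_x = 0; local_x < BLOCK_SIZE; ++local_x) {
                        for (int k = 0; k < BLOCK_SIZE; ++k) {
                            running_sum[local_y][local_x] += A_local[local_y][k] * B_local[local_x][k];
                        }
                    }
                }
            }
            // Each work item stores its own result.
            for (int local_y = 0; local_y < BLOCK_SIZE; ++local_y) {
                for (int local_x = 0; local_x < BLOCK_SIZE; ++local_x) {
                    C[(block_y * BLOCK_SIZE + local_y) * B_width + block_x * BLOCK_SIZE + local_x] =
                        running_sum[local_y][local_x];
                }
            }
        }
    }
    return C;
}
```

In a real kernel the tiles are overwritten on the next iteration of the outer loop, so a second barrier after the k loop is normally needed as well; the sequential emulation hides that hazard.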
What is the Big-O time complexity of the following recursive code?
public static int abc(int n) {
    if (n <= 2) {
        return n;
    }
    int sum = 0;
    for (int j = 1; j < n; j *= 2) {
        sum += j;
    }
    for (int k = n; k > 1; k /= 2) {
        sum += k;
    }
    return abc(n - 1) + sum;
}
My answer is O(n log(n)). Is it correct?
From where I'm sitting, I think the runtime is O(n log n). Here's why.
The function is called about n times, once for each value from n down to 2. Within each call, the number of loop iterations depends on the current value of n:
The two loops together perform up to 2*log(n) iterations to increment sum.
The loops run the same number of times for every input of a given size, so a very large n does not change the analysis. The best case is n <= 2, where only one operation is done (the loops do not run).
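The argument can be checked empirically by counting loop iterations instead of summing values. This is a rough C++ sketch (the question's code is Java); count_loop_iterations is a hypothetical name, and the recursion abc(n) -> abc(n - 1) -> ... is unrolled into an outer loop so that large n does not overflow the stack.

```cpp
// Total number of inner-loop iterations performed by abc(n) and all of its
// recursive calls. Calls with n <= 2 return immediately and contribute nothing.
long long count_loop_iterations(int n) {
    long long total = 0;
    for (int m = n; m > 2; --m) {                 // one pass per recursive call
        for (long long j = 1; j < m; j *= 2) {    // first loop of abc: about log2(m) steps
            ++total;
        }
        for (int k = m; k > 1; k /= 2) {          // second loop of abc: about log2(m) steps
            ++total;
        }
    }
    return total;
}
```

Each pass adds at most about 2*log2(m) iterations and there are fewer than n passes, which matches the O(n log n) estimate.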
Based on How to determine if a list of polygon points are in clockwise order?
I've come up with the following code:
bool PointsClockwise(const std::vector<MyPoint>& points)
{
    double sum = 0.0;
    for(size_t i = 0; i < points.size() - 1; ++i)
        sum += (points[i+1].x()-points[i].x()) * (points[i+1].y()+points[i].y());
    return sum > 0.0;
}
However, this seems to have wrong result in certain cases. Take for example the following ring:
LINESTRING(0 119,0 60,694 70,704 72,712 77,719 83,723 92,725 102,723 111,719 120,712 126,703 130)
It is in counter-clockwise order, but the function returns true.
Thanks!
You missed one of the line segments in your summation, namely the one connecting the last point to the first.
Try this:
bool PointsClockwise(const std::vector<MyPoint>& points)
{
    double sum = 0.0;
    for(size_t i = 0; i < points.size() - 1; ++i)
        sum += (points[i+1].x()-points[i].x()) * (points[i+1].y()+points[i].y());
    // closing segment from the last point back to the first
    sum += (points[0].x()-points[points.size()-1].x()) * (points[0].y()+points[points.size()-1].y());
    return sum > 0.0;
}
You need to include the case i == points.size() - 1, but to do that, you need to do some modular arithmetic in the loop, or else separate out the last iteration. Actually, just initialize sum to the last iteration:
double sum = (points[0].x() - points[points.size() - 1].x())
* (points[0].y() + points[points.size() - 1].y());
and end the iteration at i < points.size() - 1
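The modular-arithmetic variant mentioned above could look like the following sketch. Point and points_clockwise are hypothetical names (a plain struct stands in for MyPoint), and returning false for fewer than three points is an assumption made here, not something the original code specifies.

```cpp
#include <cstddef>
#include <vector>

struct Point {  // hypothetical stand-in for MyPoint
    double x;
    double y;
};

// Sums (x2 - x1) * (y2 + y1) over every edge, including the closing edge
// from the last point back to the first. A positive sum means clockwise.
bool points_clockwise(const std::vector<Point>& points) {
    const std::size_t n = points.size();
    if (n < 3) {
        return false;  // assumption: degenerate input is reported as not clockwise
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const Point& p = points[i];
        const Point& q = points[(i + 1) % n];  // wraps around on the last edge
        sum += (q.x - p.x) * (q.y + p.y);
    }
    return sum > 0.0;
}
```

For the ring in the question, the closing edge from (703,130) to (0,119) contributes a large negative term, so the total becomes negative and the function reports counter-clockwise.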