OpenCL find max in array

I am trying to find the max value in a 1-D array using reduction operators. I referred to the method in: OpenCL™ Optimization Case Study: Simple Reductions
The following is my code:
__kernel void Normallize(__global float* input, __global float* output, __global float* cmax, int rows, int cols){
    int g_idx = get_global_id(0);
    for(int i=0 ; i< get_global_size(0) ; i++) cmax[i] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);
    for(int offset = get_global_size(0)/2 ; offset >0 ; offset--){
        if(g_idx < offset){
            float pre = input[g_idx];
            float next = input[g_idx + offset];
            cmax[g_idx] = (pre > next) ? pre:next;
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
    output[g_idx] = cmax[0];
}
After doing some research, I still can't figure out the problem in my code.

Did you mean this (60% VALU utilization on an AMD GPU)?:
__kernel void maxping(__global const float * a, __global float *b){
    int threadId=get_global_id(0);
    int localThreadId=get_local_id(0);
    int localSize=get_local_size(0);
    __local float fastMem[256];
    fastMem[localThreadId]=a[threadId];
    barrier(CLK_GLOBAL_MEM_FENCE|CLK_LOCAL_MEM_FENCE);
    for(int i=localSize/2;i>=1;i/=2)
    {
        if(localThreadId<i)
        {
            if(fastMem[localThreadId]<fastMem[localThreadId+i])
                fastMem[localThreadId]=fastMem[localThreadId+i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(localThreadId==0)
        b[threadId]=fastMem[localThreadId];
}
where each group (of 256 threads) reduces in local memory, setting each group's first element to the maximum of that group. This example works on 4096 elements, indexed 0 to 4095.
For the kernel above, VALU usage looks something like this:
x: idle thread
o: thread computing, m: thread doing a memory operation
** : m m m m m m m m m m m m m m m m
i=0 : o o o o o o o o x x x x x x x x
i=1 : o o o o x x x x x x x x x x x x
i=2 : o o x x x x x x x x x x x x x x
i=3 : o x x x x x x x x x x x x x x x
** : m m m m m m m m m m m m m m m m
but i counts through more steps than shown, and each row really spans 256 work items; the diagram is just a scaled-down illustration.
__kernel void maxpong(__global float * a, __global const float *b){
    int threadId=get_global_id(0);
    int localSize=get_local_size(0);
    int maxGroups=4096/localSize;
    if(threadId==0)
    {
        // -FLT_MAX, not FLT_MIN: FLT_MIN is the smallest positive float,
        // so it would give wrong results for all-negative inputs
        float maxv=-FLT_MAX;
        for(int i=0;i<maxGroups;i++)
        {
            if(maxv<b[i*localSize])
                maxv=b[i*localSize];
        }
        a[0]=maxv;
    }
}
where only the first thread (this part is better done on a CPU) scans the per-group maxima b[0], b[256], ..., b[M*256] and writes the overall maximum to the first element of a.
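For completeness, here is a minimal host-side sketch for dispatching the two kernels. This is a sketch only: it assumes a context, queue and built program already exist, that input points to 4096 floats, and it omits all error checking.

cl_int err;
cl_mem a = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                          4096 * sizeof(float), input, &err);
cl_mem b = clCreateBuffer(context, CL_MEM_READ_WRITE,
                          4096 * sizeof(float), NULL, &err);

cl_kernel ping = clCreateKernel(program, "maxping", &err);
clSetKernelArg(ping, 0, sizeof(cl_mem), &a);
clSetKernelArg(ping, 1, sizeof(cl_mem), &b);
cl_kernel pong = clCreateKernel(program, "maxpong", &err);
clSetKernelArg(pong, 0, sizeof(cl_mem), &a);
clSetKernelArg(pong, 1, sizeof(cl_mem), &b);

size_t global = 4096, local = 256;
// first pass: per-group maxima land in b[0], b[256], b[512], ...
clEnqueueNDRangeKernel(queue, ping, 1, NULL, &global, &local, 0, NULL, NULL);
// second pass: a single work item folds the group maxima into a[0]
clEnqueueNDRangeKernel(queue, pong, 1, NULL, &global, &local, 0, NULL, NULL);

float result;
clEnqueueReadBuffer(queue, a, CL_TRUE, 0, sizeof(float), &result, 0, NULL, NULL);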
The first kernel does 255/256 of the total computation, but it leaves half of the cores of each compute unit untouched. So you can reduce something else on that other half of the cores: another array to be max()'ed, the same array min()'ed, or even the same max of the same array, working on one half of it while the other cores work on the other half.
73% VALU utilization for max(a) with a different initial kernel:
__kernel void maxping(__global const float * a, __global float *b){
    int threadId=get_global_id(0);
    int localThreadId=get_local_id(0);
    int localSize=get_local_size(0);
    __local float fastMem[256];
    __local float fastMem2[256];
    fastMem[localThreadId]=a[threadId];
    fastMem2[localThreadId]=a[threadId+2048];
    barrier(CLK_GLOBAL_MEM_FENCE|CLK_LOCAL_MEM_FENCE);
    for(int i=localSize/2;i>=1;i/=2)
    {
        if(localThreadId<i)
        {
            // reducing the first half of the data
            if(fastMem[localThreadId]<fastMem[localThreadId+i])
                fastMem[localThreadId]=fastMem[localThreadId+i];
        }
        else if(localThreadId>=localSize-i) // >=, not >, so no element is skipped
        {
            // reducing the second half of the data, from the top end down
            if(fastMem2[localThreadId]<fastMem2[localThreadId-i])
                fastMem2[localThreadId]=fastMem2[localThreadId-i];
        }
        else
        {
            // idle thread. Free compute slot.
            // could squeeze in some geometry computing
            // or the up-sweep of another reduction
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(localThreadId==0)
        b[threadId]=(fastMem[localThreadId]>fastMem2[255]?fastMem[localThreadId]:fastMem2[255]);
}
This uses 2048 threads for a 4096-element array. It sets the 0th, 256th, 512th, ... elements of b to their respective group maxima; then you can easily check which one is the largest on the host side.
There are still unused cores.
For the kernel above, VALU usage looks something like this:
x: idle thread
o: thread computing, m: thread doing a memory operation
** : m m m m m m m m m m m m m m m m
i=0 : o o o o o o o o o o o o o o o o
i=1 : o o o o x x x x x x x x o o o o
i=2 : o o x x x x x x x x x x x x o o
i=3 : o x x x x x x x x x x x x x x o
** : m m m m m m m m m m m m m m m m
but i steps log2(256) times, so there are more "i" steps than shown, and AMD hardware has 64 cores per compute unit, which stay fully served even when a step has only 64 active threads. Summing all thread usage for this loop alone does not give 73%, but when other "warps" (40 of them) stream to the same compute unit, more holes are filled, so the vector arithmetic logic units are used more often. Even the local memory load/store part is important, because it keeps all cores' memory operation units busy (global to local, local to global) while other warps keep the comparison units busy.
Edit: if your global size is not a multiple of 256, you can add a global-id check after the local memory load so it doesn't invoke undefined behaviour. Alternatively, you could pad the array with extra -FLT_MAX values (so the padding can never win the max), as sketched below.
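For example, a minimal host-side padding sketch (hypothetical helper name, assuming a work-group size of 256):

#include <algorithm>
#include <cfloat>
#include <vector>

// Rounds the element count up to a multiple of the work-group size and fills
// the tail with -FLT_MAX, so the padded slots can never win the max.
std::vector<float> pad_for_max_reduction(const std::vector<float>& src) {
    size_t padded = ((src.size() + 255) / 256) * 256;
    std::vector<float> out(padded, -FLT_MAX);
    std::copy(src.begin(), src.end(), out.begin());
    return out;
}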


Maximum cost without cycles

Given an undirected graph with positive edge costs, choose a subset of edges such that there are no cycles and the sum of the cost is maximal.
The input consists of several graphs, each defined by a number of vertices n, a number of edges m, and m triples x, y, c indicating an edge between x and y of cost c. The vertices are numbered from 0 to n - 1. It is assumed that 1 ≤ n ≤ 10^4, 0 ≤ m ≤ 5n, and 1 ≤ c ≤ 10^5. There may be more than one edge between two vertices, and even edges with x = y.
#include <iostream>
#include <vector>
using namespace std;

using P = pair<int,int>;
using VE = vector<int>;
using VP = vector<P>;
using VVE = vector<VP>;

int n,m;
VVE G;
VE cost;
VE vist;
VE pare;

int maxim(int x){
    if(cost[x] != -1) return cost[x];
    cost[x] = 0;
    for(P y: G[x]){
        if(cost[x] <= y.second + maxim(y.first)){
            cost[x] = y.second + maxim(y.first);
        }
    }
    return cost[x];
}

int main() {
    while(cin >> n >> m){
        G = VVE(n);
        cost = VE(n,-1);
        pare = VE(n,-1);
        for(int i = 0; i < m; ++i){
            int x,y,c; cin >> x >> y >> c;
            G[x].push_back(P(y,c));
            G[y].push_back(P(x,c));
        }
        int mx = -1;
        for(int i = 0; i < n; ++i){
            if(mx <= maxim(i)){
                mx = maxim(i);
            }
        }
        cout << mx << endl;
    }
}
This is my code, and I don't know how to solve the problem; I would appreciate help. As you can see, the graph is read as a vector of vectors, in which each pair indicates that node x connects to node y with cost c.
As a commenter pointed out, this is the maximum spanning tree problem (which is the same as the minimum spanning tree problem, just negate the costs). That problem can be solved with the greedy algorithm (Kruskal's): initially place every node in a disjoint set of its own. Then, in a loop, consider the edges in decreasing cost order. If the two endpoints of the considered edge are in the same set, discard the edge; otherwise select it and merge the two sets. When you have only one set left, you can stop, and the selected edges form your solution. A sketch follows below.
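For illustration, a minimal C++ sketch of that approach under the input format described above (the union-find structure and all identifier names are mine, not from the original answer):

#include <algorithm>
#include <array>
#include <iostream>
#include <numeric>
#include <vector>
using namespace std;

// Disjoint-set (union-find) with path compression.
struct DSU {
    vector<int> parent;
    DSU(int n) : parent(n) { iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    bool unite(int a, int b) {   // returns false if a and b were already joined
        a = find(a); b = find(b);
        if (a == b) return false;
        parent[a] = b;
        return true;
    }
};

int main() {
    int n, m;
    while (cin >> n >> m) {
        vector<array<int,3>> edges(m);        // {cost, x, y}
        for (auto& e : edges) cin >> e[1] >> e[2] >> e[0];
        sort(edges.rbegin(), edges.rend());   // decreasing cost order
        DSU dsu(n);
        long long best = 0;
        for (auto& e : edges)
            if (dsu.unite(e[1], e[2]))        // skips cycle-forming edges,
                best += e[0];                 // including self-loops (x == y)
        cout << best << endl;
    }
}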

Compute product of large 3-D arrays in R

I am working on an optimization problem, and to supply the analytic gradient to the routine, I need to compute the gradient of large 3-D arrays with respect to parameters. The largest of these arrays, s, has dimensions [L,N,J] where L,J ~ 2000 and N = 15. L and N index nodes over which the arrays are then aggregated, with some fixed weights w, into vectors of length J. Computing the gradient naively generates an [L,N,J,J] array x whose elements are x(l,n,j,k) = -s(l,n,j)s(l,n,k) if j ≠ k, and x(l,n,j,j) = s(l,n,j)(1-s(l,n,j)).
Several functions in the procedure would use x as input, but as of right now I cannot keep x in memory due to its size. My approach so far has been to compute and directly aggregate up x over L and N to only ever store JxJ matrices, but the downside is that I cannot reuse x in other functions. This is what the following code does:
arma::mat agg_dsnode_ddelta_v3(arma::cube s_lnj,
                               arma::mat w_ln,
                               arma::vec w_l){
    // Normal Matrix dimensions
    unsigned int L = s_lnj.n_rows;
    unsigned int N = s_lnj.n_cols;
    unsigned int J = s_lnj.n_slices;
    // resulting matrix
    arma::mat ds_ddelta_jj = arma::mat(J,J, arma::fill::zeros);
    for (unsigned int l = 0; l < L; l++) {
        for (unsigned int n = 0; n < N; n++) {
            arma::vec s_j = s_lnj.subcube(arma::span(l), arma::span(n), arma::span());
            ds_ddelta_jj += - arma::kron(w_l(l) * w_ln(l,n) * s_j, s_j.as_row()) + arma::diagmat(w_l(l) * w_ln(l,n) * s_j);
        }
    }
    return ds_ddelta_jj;
}
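Since the loop body is just a weighted rank-1 update, ds += w * (diagmat(s) - s sᵀ), the kron of a column and a row can be written as a plain outer product. Here is a sketch of the same aggregation (same math as the function above; untested, names are mine):

// Same inputs and intended result as agg_dsnode_ddelta_v3 above.
arma::mat agg_dsnode_ddelta_outer(const arma::cube& s_lnj,
                                  const arma::mat& w_ln,
                                  const arma::vec& w_l) {
    unsigned int L = s_lnj.n_rows;
    unsigned int N = s_lnj.n_cols;
    unsigned int J = s_lnj.n_slices;
    arma::mat ds_ddelta_jj(J, J, arma::fill::zeros);
    for (unsigned int l = 0; l < L; l++) {
        for (unsigned int n = 0; n < N; n++) {
            arma::vec s_j = s_lnj.subcube(arma::span(l), arma::span(n), arma::span());
            double w = w_l(l) * w_ln(l, n);
            // w * (diag(s) - s s^T): one diagmat and one outer product per node
            ds_ddelta_jj += w * (arma::diagmat(s_j) - s_j * s_j.t());
        }
    }
    return ds_ddelta_jj;
}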
Alternatively, the 4-D array x could, for instance, be computed with sparseMatrix, but this approach does not scale when L and J increase:
library(Matrix)
L = 2
N = 3
J = 4
s_lnj <- array(rnorm(L*N*J), dim=c(L,N,J))
## create sparse matrix with s(l,n,:) vertically on the diagonal
As_lnj = sparseMatrix(i=c(1:(L*N*J)), j=rep(1:(L*N), each=J), x=as.vector(aperm(s_lnj, c(3, 1, 2))))
## create sparse matrix with s(l,n,:) horizontally on the diagonal
Bs_lnj = sparseMatrix(i=rep(1:(L*N), each=J), j=c(1:(L*N*J)), x=as.vector(aperm(s_lnj, c(3, 1, 2))))
## create sparse matrix with s(l,n,:) on the diagonal
Cs_lnj = sparseMatrix(i=c(1:(L*N*J)), j=c(1:(L*N*J)), x=as.vector(aperm(s_lnj, c(3, 1, 2))))
## compute the 4-D array x as a sparseMatrix product
x = -(As_lnj %*% Bs_lnj) + Cs_lnj
I was wondering if you knew of a faster way to implement the first approach, or alternatively of an approach that would make the second one scalable.
Thank you in advance.

behaviour of atomic_add in opencl

I'm playing around with an example on opencl:
__kernel void atomic(__global int* x) {
    __local int a, b;
    a = 0; b = 0;
    a++;
    atomic_inc(&b);
    x[0] = a;
    x[1] = b;
    x[2]++;
    atomic_inc(x+3);
}
Running this code with global_size = 1024 and workgroup_size = 8 produces the following output:
[1 8 1 1024]
I can understand what is happening for all cases except the value given for x[1]. Why is the value of x[1] not 1024 but 8?
x[1] stores the value of b, a variable residing in the __local address space, which means it is shared by all work items within a work group (each work group gets its own copy). Each work group has its b initialized to 0 and atomically incremented up to 8, because the work group size is 8 (each work item increments it by 1). Whichever work group happens to write x[1] last writes the value 8.

Is it safe to replace "a/(b*c)" with "a/b/c" when using integer-division?

Is it safe to replace a/(b*c) with a/b/c when using integer division on positive integers a, b, c, or am I at risk of losing information?
I did some random tests and couldn't find an example of a/(b*c) != a/b/c, so I'm pretty sure it's safe but not quite sure how to prove it.
Thank you.
Mathematics
As mathematical expressions, ⌊a/(bc)⌋ and ⌊⌊a/b⌋/c⌋ are equivalent whenever b is nonzero and c is a positive integer (and in particular for positive integers a, b, c). The standard reference for these sorts of things is the delightful book Concrete Mathematics: A Foundation for Computer Science by Graham, Knuth and Patashnik. In it, Chapter 3 is mostly on floors and ceilings, and this is proved on page 71 as part of a far more general result, their equation (3.10): ⌊f(x)⌋ = ⌊f(⌊x⌋)⌋ for any continuous, monotonically increasing function f with the property that f(x) being an integer implies x is an integer.
In the result (3.10) above, you can define x = a/b (mathematical, i.e. real division) and f(x) = x/c (again exact division), and plug those in (after verifying that the conditions on f hold here) to get ⌊a/(bc)⌋ on the LHS equal to ⌊⌊a/b⌋/c⌋ on the RHS.
If we don't want to rely on a reference in a book, we can prove ⌊a/(bc)⌋ = ⌊⌊a/b⌋/c⌋ directly using their methods. Note that with x = a/b (the real number), what we're trying to prove is that ⌊x/c⌋ = ⌊⌊x⌋/c⌋. So:
if x is an integer, then there is nothing to prove, as x = ⌊x⌋.
Otherwise, ⌊x⌋ < x, so ⌊x⌋/c < x/c which means that ⌊⌊x⌋/c⌋ ≤ ⌊x/c⌋. (We want to show it's equal.) Suppose, for the sake of contradiction, that ⌊⌊x⌋/c⌋ < ⌊x/c⌋ then there must be a number y such that ⌊x⌋ < y ≤ x and y/c = ⌊x/c⌋. (As we increase a number from ⌊x⌋ to x and consider division by c, somewhere we must hit the exact value ⌊x/c⌋.) But this means that y = c*⌊x/c⌋ is an integer between ⌊x⌋ and x, which is a contradiction!
This proves the result.
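If it helps, the same argument can be typeset in LaTeX (with x = a/b real and c a positive integer, as above):

If $x \in \mathbb{Z}$ there is nothing to prove, as $x = \lfloor x \rfloor$.
Otherwise $\lfloor x \rfloor < x$, so $\lfloor x \rfloor / c < x/c$ and hence
$\lfloor \lfloor x \rfloor / c \rfloor \le \lfloor x/c \rfloor$.
If the inequality were strict, then $\lfloor x \rfloor / c < \lfloor x/c \rfloor$,
so $y = c\,\lfloor x/c \rfloor$ would satisfy $\lfloor x \rfloor < y \le x$;
but $y$ is an integer, and no integer lies in $(\lfloor x \rfloor,\, x]$.
Hence
$$\left\lfloor \frac{\lfloor a/b \rfloor}{c} \right\rfloor = \left\lfloor \frac{a}{bc} \right\rfloor.$$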
Programming
#include <stdio.h>

int main() {
    unsigned int a = 142857;
    unsigned int b = 65537;
    unsigned int c = 65537;
    printf("a/(b*c) = %u\n", a/(b*c));
    printf("a/b/c = %u\n", a/b/c);
}
prints (with 32-bit integers),
a/(b*c) = 1
a/b/c = 0
(I used unsigned integers as overflow behaviour for them is well-defined, so the above output is guaranteed. With signed integers, overflow is undefined behaviour, so the program can in fact print (or do) anything, which only reinforces the point that the results can be different.)
But if you don't have overflow, then the values you get in your program are equal to their mathematical values (that is, a/(b*c) in your code is equal to the mathematical value ⌊a/(bc)⌋, and a/b/c in code is equal to the mathematical value ⌊⌊a/b⌋/c⌋), which we've proved are equal. So it is safe to replace a/(b*c) in code by a/b/c when b*c is small enough not to overflow.
While b*c could overflow (in C) for the original computation, a/b/c can't overflow, so we don't need to worry about overflow for the forward replacement a/(b*c) -> a/b/c. We would need to worry about it the other way around, though.
Let x = a/b/c. Then a/b == x*c + y for some y with 0 <= y < c, and a == (x*c + y)*b + z for some z with 0 <= z < b.
Thus, a == x*b*c + y*b + z. Since y*b + z is at most b*c - 1, we get x*b*c <= a < (x+1)*b*c, and so a/(b*c) == x.
Thus, a/b/c == a/(b*c), and replacing a/(b*c) by a/b/c is safe.
Nested floor division can be reordered as long as you keep track of your divisors and dividends.
#python3.x
x // m // n = x // (m * n)
#python2.x
x / m / n = x / (m * n)
Proof (sucks without LaTeX :( ) in python3.x:
Let k = x // m
then k <= x / m < k + 1
Since k <= x / m, we have k / n <= x / (m * n)
and because floor is monotone, k // n <= x // (m * n)
Now let q = x // (m * n)
then q * m * n <= x, so q * n <= x / m
q * n is an integer, so q * n <= x // m = k
hence q <= k / n, and since q is an integer, q <= k // n
So x // (m * n) <= k // n and k // n <= x // (m * n)
therefore k // n = x // (m * n)
and (x // m) // n = x // (m * n)
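For readability, here is the same argument typeset in LaTeX ($m$, $n$ positive integers):

Let $k = \lfloor x/m \rfloor$, so $k \le x/m < k+1$.
From $k \le x/m$ we get $k/n \le x/(mn)$, and since the floor function is
monotone, $\lfloor k/n \rfloor \le \lfloor x/(mn) \rfloor$.
Conversely, let $q = \lfloor x/(mn) \rfloor$. Then $qmn \le x$, so $qn \le x/m$;
as $qn$ is an integer not exceeding $x/m$, $qn \le \lfloor x/m \rfloor = k$.
Dividing by $n$ gives $q \le k/n$, and since $q$ is an integer,
$q \le \lfloor k/n \rfloor$. Combining the two inequalities,
$$\left\lfloor \frac{\lfloor x/m \rfloor}{n} \right\rfloor = \left\lfloor \frac{x}{mn} \right\rfloor.$$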
https://en.wikipedia.org/wiki/Floor_and_ceiling_functions#Nested_divisions

How to solve this hard combinatoric?

This is a contest problem (ACM ICPC South America 2015), it was the hardest in the problem set.
Summary: Given integers N and K, count the number of sequences a of length N consisting of integers 1 ≤ a_i ≤ K, subject to the condition that for any value x > 1 in the sequence there has to be a pair i, j satisfying i < j, a_i = x − 1 and a_j = x, i.e. the last occurrence of x is preceded by an x − 1 at some point.
Example: for N = 1000 and K = 100 the solution should be congruent to 265428620 modulo (10^9 + 7). Other examples and details can be found in the problem description.
I tried everything I could think of, but I need pointers on how to do it. I even printed some lists with brute force to find the pattern, but I didn't succeed.
I'm looking for an algorithm or formula that allows me to get to the right solution for this problem. It can be in any language.
EDIT:
I solved the problem using a formula I found on the internet (from someone who explained this problem). However, just because I programmed it doesn't mean I understand it, so the question remains open. My code is below (the online judge returns Accepted):
#include <bits/stdc++.h>
using namespace std;
typedef long long int ll;

ll mod = 1e9+7;
ll memo[5001][5001];

ll dp(int n, int k){
    // K can't be greater than N
    k = min(n, k);
    // if N or K is 1, it means there's only one possible list
    if(n <= 1 || k <= 1) return 1;
    if(memo[n][k] != -1) return memo[n][k];
    ll ans1 = (n-k) * dp(n-1, k-1);
    ll ans2 = k * dp(n-1, k);
    memo[n][k] = ((ans1 % mod) + (ans2 % mod)) % mod;
    return memo[n][k];
}

int main(){
    int n, q;
    for(int i=0; i<5001; i++)
        fill(memo[i], memo[i]+5001, -1);
    while(scanf("%d %d", &n, &q) == 2){
        for(int i=0; i<q; i++){
            int k;
            scanf("%d", &k);
            printf("%s%lld", i==0? "" : " ", dp(n, k));
        }
        printf("\n");
    }
    return 0;
}
The most important part is the recursive call, in particular these lines:
ll ans1 = (n-k) * dp(n-1, k-1);
ll ans2 = k * dp(n-1, k);
memo[n][k] = ((ans1 % mod) + (ans2 % mod)) % mod;
Here I show a brute-force algorithm for the problem in Python. It works for small numbers, but for very big numbers it takes far too much time. For N=1000 and K=5 it is already infeasible (it would need more than 100 years to finish; in C it would also be infeasible, as C is only about 100 times faster than Python). So the problem actually forces you to find a shortcut.
import itertools

def checkArr(a, K):
    for i in range(2, min(K+1, max(a)+1)):
        if i-1 not in a:
            return False
        if i not in a:
            return False
        if a.index(i-1) > len(a)-1-a[::-1].index(i):
            return False
    return True

def num_sorted(N, K):
    result = 0
    for a in itertools.product(range(1, K+1), repeat=N):
        if checkArr(a, K):
            result += 1
    return result

num_sorted(3, 10)
It returns 6 as expected.
