Using local memory to speed up calculation - OpenCL

Should be an easy one but my OpenCL skills are completely rusty. :)
I have a simple kernel that does the sum of two arrays:
__kernel void sum(__global float* a, __global float* b, __global float* c)
{
    __private size_t gid = get_global_id(0);
    c[gid] = log(sqrt(exp(cos(sin(a[gid]))))) + log(sqrt(exp(cos(sin(b[gid])))));
}
It's working fine.
Now I'm trying to use local memory hoping it could speed things up:
__kernel void sum_with_local_copy(__global float* a, __global float* b, __global float* c,
                                  __local float* tmpa, __local float* tmpb, __local float* tmpc)
{
    __private size_t gid = get_global_id(0);
    __private size_t lid = get_local_id(0);
    __private size_t grid = get_group_id(0);
    __private size_t lsz = get_local_size(0);

    event_t evta = async_work_group_copy(tmpa, a + grid * lsz, lsz, 0);
    wait_group_events(1, &evta);

    event_t evtb = async_work_group_copy(tmpb, b + grid * lsz, lsz, 0);
    wait_group_events(1, &evtb);

    tmpc[lid] = log(sqrt(exp(cos(sin(tmpa[lid]))))) + log(sqrt(exp(cos(sin(tmpb[lid])))));

    event_t evt = async_work_group_copy(c + grid * lsz, tmpc, lsz, 0);
    wait_group_events(1, &evt);
}
But there are two issues with this kernel:
it's something like 3 times slower than the naive implementation
the results are wrong starting at index 64
My local-size is the max workgroup size.
So my questions are:
1) Am I missing something obvious or is there really a subtlety?
2) How to use local memory to speed up the computation?
3) Should I loop inside the kernel so that each work-item does more than one operation?
Thanks in advance.

Your simple kernel is already optimal w.r.t. work-group performance.
Local memory will only improve performance in cases where multiple work-items in a work-group read from the same address in local memory. As there is no shared data in your kernel, there is no gain to be had by transferring data from global to local memory; hence the slow-down.
As for point 3, you may see a gain by processing multiple values per thread (depending on how expensive your computation is and what hardware you have).
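As a rough sketch of that idea (untested here; ITEMS_PER_THREAD is a hypothetical tuning constant, and the launch would need global size = n / ITEMS_PER_THREAD with n divisible by it):
#define ITEMS_PER_THREAD 4
__kernel void sum_multi(__global const float* a, __global const float* b, __global float* c)
{
    /* each work-item handles ITEMS_PER_THREAD consecutive elements */
    size_t base = get_global_id(0) * ITEMS_PER_THREAD;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        size_t gid = base + i;
        c[gid] = log(sqrt(exp(cos(sin(a[gid]))))) + log(sqrt(exp(cos(sin(b[gid])))));
    }
}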

As you probably know you can explicitly set the local work group size (LWS) when executing your kernel using:
clEnqueueNDRangeKernel( ... bunch of args include Local Work Size ...);
as discussed here. But as already mentioned by Kyle, you don't really have to do this, because OpenCL tries to pick the best value for the LWS when you pass in NULL for the LWS argument.
Indeed the specification says: "local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to break the global work-items into appropriate work-group instances."
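That is, a call like the following (using the same variables as the timing code below) leaves the work-group size up to the runtime:
/* passing NULL for local_work_size lets the implementation choose */
clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL, &global_work_size, NULL, 0, NULL, NULL);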
I was curious to see how this played out in your case, so I set up your calculation to verify the performance against the default value chosen by OpenCL on my device.
In case you're interested, I set up some arbitrary data:
#include <stdlib.h>
#include <math.h>

int n = (int)powl(2, 20);   /* 2^20 elements */
float* a = (float*)malloc(sizeof(float)*n);
float* b = (float*)malloc(sizeof(float)*n);
float* results = (float*)malloc(sizeof(float)*n);
for (int i = 0; i < n; i++) {
    a[i] = (float)i;
    b[i] = (float)(n-i);
    results[i] = 0.f;
}
and then, after defining all of the other OpenCL structures, I varied lws = VALUE from 2 to 256 (the max allowed on my device for this kernel) in powers of 2 and measured the wall-clock time (note: you could also use OpenCL events):
#include <sys/time.h>

struct timeval timer;
int trials = 100;
gettimeofday(&timer, NULL);
double t0 = timer.tv_sec + (timer.tv_usec/1000000.0);
// ---------- Execution ----------
size_t global_work_size = n;
size_t lws[] = {VALUE}; // VALUE was varied from 2 to 256 in powers of 2.
for (int trial = 0; trial < trials; trial++) {
    clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL, &global_work_size, lws, 0, NULL, NULL);
}
clFinish(cmd_queue);
gettimeofday(&timer, NULL);
double t1 = timer.tv_sec + (timer.tv_usec/1000000.0);
double avgTime = (t1 - t0)/trials; // average seconds per kernel launch
I then plotted the total execution time as a function of the LWS and, as expected, the performance varies by quite a bit until the best value, LWS = 256, is reached. For LWS > 256, the memory on my device is exceeded with this kernel.
FYI, for these tests I am running a laptop GPU: AMD ATI Radeon HD 6750M, with Max compute units = 6 and CL_DEVICE_LOCAL_MEM_SIZE = 32768 (so no big screamer compared to other GPUs).
Here are the raw numbers:
LWS    time (sec)
  2    14.004
  4     6.850
  8     3.431
 16     1.722
 32     0.866
 64     0.438
128     0.436
256     0.436
Next, I checked the default value chosen by OpenCL (passing NULL for the LWS) and this corresponds to the best value that I found by profiling, i.e., LWS = 256.
So in the code you set up, you found one of the suboptimal cases; as mentioned before, it's best to let OpenCL pick the values for the local work groups, especially when there is no data in your kernel shared between multiple work-items in a work-group.
As to the error you got, you probably violated a constraint (from the spec):
The total number of work-items in the work-group must be less than or equal to the CL_DEVICE_MAX_WORK_GROUP_SIZE
Did you check that in detail, by querying the CL_DEVICE_MAX_WORK_GROUP_SIZE for your device?
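If not, a quick check might look like this (a sketch; device is your cl_device_id, and error handling is omitted):
size_t dev_max = 0, krn_max = 0;
/* device-wide upper bound on the work-group size */
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(dev_max), &dev_max, NULL);
/* kernel-specific bound, which can be smaller (register/local memory usage) */
clGetKernelWorkGroupInfo(kernel[0], device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(krn_max), &krn_max, NULL);
printf("device max: %zu, kernel max: %zu\n", dev_max, krn_max);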

Adding to what Kyle has written: It has to be multiple work items reading from the same address; if it's just each work item itself reading multiple times from the same address - then again local memory won't help you any; just use the work item's private memory, i.e. variables you define within your kernel.
Also, some points not related to the use of local memory:
log(sqrt(exp(x))) = log(exp(x)) / 2 = x / 2 ... assuming it's the natural logarithm.
log(sqrt(exp(x))) = log(exp(x)) / 2 = x / (2 ln(2)) ... assuming it's the base-2 logarithm. Compute ln(2) in advance of course.
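Applied to your kernel (OpenCL's log is the natural logarithm), the whole chain collapses; a minimal sketch:
/* log(sqrt(exp(x))) == x/2 for the natural log, so the original kernel reduces to: */
__kernel void sum_simplified(__global const float* a, __global const float* b, __global float* c)
{
    size_t gid = get_global_id(0);
    c[gid] = 0.5f * (cos(sin(a[gid])) + cos(sin(b[gid])));
}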
If you really did have some complex function-of-a-function-of-a-function, you might be better off using a Taylor series expansion. For example, your function expands to 1/2-x^2/4+(5 x^4)/48+O(x^6) (order 5).
The last term is an error term, which you can bound from above to choose the appropriate order for the expansion; the error term should not be that high for 'well-behaved' functions. The Taylor expansion calculation might even benefit from further parallelization (but then again, it might not).
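If you went that route, Horner's scheme keeps the polynomial evaluation cheap; a sketch (only valid for small |x|, per the error term above):
/* order-5 Taylor approximation of log(sqrt(exp(cos(sin(x))))),
   i.e. 1/2 - x^2/4 + 5*x^4/48, via Horner's scheme */
float f_taylor(float x)
{
    float x2 = x * x;
    return 0.5f + x2 * (-0.25f + x2 * (5.0f / 48.0f));
}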

Related

Constraining the solutions for linear equations

I'm searching for a way to solve a system of linear equations. Specifically, 8 equations with a total of 16 unknown values.
Each unknown value (w[0...15]) is a 32-bit binary value which corresponds to 4 ASCII characters written over 8 bits each. [example image omitted]
I've tried writing this system of linear equations as a single matrix equation: [matrix image omitted]
Right now, using the Eigen linear algebra library, I get my 16 solutions (w[0...15]), but all of them are either decimal or null values, which is not what I need. All 16 solutions need to be the equivalent of 4 ASCII characters under their binary representation, meaning integers between 48 and 57 (ASCII for '0' to '9'), 65 and 90 (ASCII for 'A' to 'Z'), or 97 and 122 (ASCII for 'a' to 'z').
Current 16 solutions: [image omitted]
I've found a solution to this problem using something called box constraints. An example is shown here using Python's lsq_linear function, which allows the user to specify bounds. It seems Eigen does not let the user specify bounds in its decomposition methods.
Therefore, my question is: how do you get a similar result in C++ using a linear algebra library? Or is there a better way to solve such systems of equations without writing them as a single matrix equation?
Thanks in advance.
Since you're working with linear equations over Z/2^32 Z, integer linear programming (as you tagged the question) may be a solution, and algorithms that are inherently floating-point are not appropriate. Box constraints are not enough; they won't force the variables to take on integer values. Also, the model shown in the question does not take into account that multiplying and adding in Z/2^32 Z can wrap, which excludes many potential solutions (or perhaps that is intended?) and may make the instance accidentally infeasible when it was intended to be solvable.
ILP can model equations over Z/2^32 Z relatively directly (using integer variables between 0 and 2^32 and some unconstrained additional variables scaled by 2^32 to "absorb" the wraparound), but it tends to struggle badly with that kind of formulation - I would say it's one of the worst cases for an ILP solver without getting into the "intentionally difficult" cases. A more indirect model with 32x as many boolean variables is also possible, but this leads to constraints with very large constants, and ILP solvers tend to struggle with those too. Overall I do not recommend using ILP for this problem.
What I would recommend for this is an SMT solver that offers the bitvector theory, or alternatively a pseudo-boolean solver or plain SAT solver (which would leave the grunt work of implementing boolean circuits and converting them to CNF to you, instead of having it built into the solver).
If you have more unknowns than equations, your system is certain to be underdetermined: the rank of an 8 x 16 matrix is at most 8, so you have at least 8 degrees of freedom.
Furthermore, if you have bounds on your variables, i.e. mixed equalities and inequalities, then your problem is better posed as a linear program. You can set a dummy objective function c[i] = 0. You could use GLPK, but that is a very generic solution; if you want a small code snippet, you can probably find a toy implementation of the Simplex method that will satisfy your needs.
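For what it's worth, a minimal GLPK skeleton of that idea might look like the following (a sketch: the right-hand sides and matrix entries are placeholders for your data, and keep in mind the caveat above that box constraints alone won't force integer values):
#include <stdio.h>
#include <glpk.h>

int main(void) {
    double rhs[8] = {0};                 /* placeholder right-hand sides */
    glp_prob *lp = glp_create_prob();
    glp_add_rows(lp, 8);                 /* 8 equations */
    glp_add_cols(lp, 16);                /* 16 unknowns w[0..15] */
    for (int i = 1; i <= 8; ++i)
        glp_set_row_bnds(lp, i, GLP_FX, rhs[i-1], rhs[i-1]);  /* equality row */
    for (int j = 1; j <= 16; ++j) {
        glp_set_col_bnds(lp, j, GLP_DB, 48.0, 122.0);  /* crude ASCII box bound */
        glp_set_obj_coef(lp, j, 0.0);                  /* dummy objective c[j] = 0 */
    }
    /* fill the coefficient matrix in coordinate form (1-based arrays), then:
       glp_load_matrix(lp, ne, ia, ja, ar); */
    glp_simplex(lp, NULL);
    for (int j = 1; j <= 16; ++j)
        printf("w[%d] = %g\n", j - 1, glp_get_col_prim(lp, j));
    glp_delete_prob(lp);
    return 0;
}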
I went for an SMT solver, as suggested by harold. Specifically the CVC4 SMT solver. Here is the code I've written in C++ answering my question about finding the 16 solutions (w[0...15]) for a system of 8 equations, constrained to be ASCII characters. I have one last question though: what are pushing and popping for? (slv.push() and slv.pop())
#include <iostream>
#include <string>
#include <vector>
#include <cvc4/cvc4.h>

using namespace std;
using namespace CVC4;

int main() {
    const unsigned size_8 = 8, size_32 = 32;  // bit-vector widths used below

    // 1. initialize a CVC4 BitVector SMT solver
    ExprManager em;
    SmtEngine slv(&em);
    slv.setOption("incremental", true);    // enable incremental solving
    slv.setOption("produce-models", true); // enable models
    slv.setLogic("QF_BV");                 // set the bitvector theory logic
    Type bitvector8 = em.mkBitVectorType(size_8); // an 8-bit bit-vector type (4 x 8-bit = 32-bit)

    // 2. create the SMT solver variables
    Expr w[16][4]; // w[0...15] where each w corresponds to 4 ascii characters
    for (int i = 0; i < 16; ++i) {
        for (int j = 0; j < 4; ++j) {
            // a. define w[i] (four ascii characters per w[i])
            w[i][j] = em.mkVar("w" + to_string(i) + to_string(j), bitvector8);

            // b. constrain w[i][0...3] to be an ascii character
            // - digit (0-9) constraint
            // ascii lower bound digit constraint (bit-vector unsigned greater than or equal)
            Expr digit_lower = em.mkExpr(kind::BITVECTOR_UGE, w[i][j], em.mkConst(BitVector(size_8, Integer(48))));
            // ascii upper bound digit constraint (bit-vector unsigned less than or equal); 57 = '9'
            Expr digit_upper = em.mkExpr(kind::BITVECTOR_ULE, w[i][j], em.mkConst(BitVector(size_8, Integer(57))));
            Expr digit_constraint = em.mkExpr(kind::AND, digit_lower, digit_upper);
            // - lower alphanumeric character (a-z) constraint
            // ascii lower bound alpha constraint (bit-vector unsigned greater than or equal)
            Expr alpha_lower = em.mkExpr(kind::BITVECTOR_UGE, w[i][j], em.mkConst(BitVector(size_8, Integer(97))));
            // ascii upper bound alpha constraint (bit-vector unsigned less than or equal)
            Expr alpha_upper = em.mkExpr(kind::BITVECTOR_ULE, w[i][j], em.mkConst(BitVector(size_8, Integer(122))));
            Expr alpha_constraint = em.mkExpr(kind::AND, alpha_lower, alpha_upper);
            Expr ascii_constraint = em.mkExpr(kind::OR, digit_constraint, alpha_constraint);
            slv.assertFormula(ascii_constraint);
        }
    }

    // 3. encode the 8 equations
    for (int i = 0; i < 8; ++i) {
        // a. build the multiplication part (index * w[i])
        vector<Expr> left_mult_hand;
        for (int j = 0; j < 16; ++j) {
            vector<Expr> inner_wj;
            for (int k = 0; k < 4; ++k) inner_wj.push_back(w[j][k]);
            Expr wj = em.mkExpr(kind::BITVECTOR_CONCAT, inner_wj);
            // m_unknowns holds the coefficients of the equations (defined elsewhere)
            Expr index = em.mkConst(BitVector(size_32, Integer(m_unknowns[j])));
            left_mult_hand.push_back(em.mkExpr(kind::BITVECTOR_MULT, index, wj));
        }

        // b. sum each index * w[i]
        slv.push();
        Expr left_hand = em.mkExpr(kind::BITVECTOR_PLUS, left_mult_hand);
        // globalSums holds the right-hand side of the equation (defined elsewhere)
        Expr result = em.mkConst(BitVector(size_32, Integer(globalSums.to_ulong())));
        Expr assumption = em.mkExpr(kind::EQUAL, left_hand, result);
        slv.assertFormula(assumption);

        // c. check for satisfiability
        cout << "Result from CVC4 is: " << slv.checkSat(em.mkConst(true)) << endl << endl;
        slv.pop();
    }
    return 0;
}

K nearest neighbor search for caret R implementation

[EDIT: I understand that it is faster also because the function is written in C, but I want to know if it does a brute-force search on all the training instances or something more sophisticated.]
I'm implementing the KNN algorithm in R, for study purposes.
I'm also checking the code's correctness by comparison with the caret implementation.
The problem lies in the execution time of the two versions. My version seems to take a lot of time, whereas the caret implementation is very fast (even with 10-fold cross-validation).
Why? I'm calculating every Euclidean distance of my test instances from the training ones. Which means that I'm doing N x M distance calculations (where N is the number of test instances, and M the number of training instances):
for (i in 1:nrow(test)) {
  distances <- c()
  classes <- c()
  for (j in 1:nrow(training)) {
    d <- calculateDistance(test[i,], training[j,])
    distances <- c(distances, d)
    classes <- c(classes, training[j,][[15]])
  }
}
Is the caret implementation using some approximate search? Or an exact search, for example with a kd-tree? How can I speed up the search? I have 14 features for the problem, but I've been reading that the kd-tree is suggested for problems with 1 to 5 features.
EDIT:
I've found the C function called by R (VR_knn), which is pretty complex for me to understand; maybe someone can help.
Anyway, I've written on the fly a brute-force search in C++, which seems to be faster than my previous R version (but not as fast as the caret C version):
#include <Rcpp.h>
using namespace Rcpp;

// Euclidean distance over the feature columns; the last column holds the
// class label, so it is skipped (hence vectorLen - 1).
double distance(NumericVector x1, NumericVector x2){
    int vectorLen = x1.size();
    double sum = 0;
    for (int i = 0; i < vectorLen - 1; i++){
        sum += pow(x1[i] - x2[i], 2);
    }
    return sqrt(sum);
}

// [[Rcpp::export]]
void searchCpp(NumericMatrix training, NumericMatrix test) {
    int numRowTr = training.rows();
    int numRowTe = test.rows();
    for (int i = 0; i < numRowTe; i++){
        NumericVector test_i = test.row(i);
        NumericVector distances(numRowTr);  // one distance per training row
        for (int j = 0; j < numRowTr; j++){
            NumericVector train_j = training.row(j);
            distances[j] = distance(test_i, train_j);
        }
    }
}
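Once the distances for a test row are filled in, the k nearest training rows can be picked out with std::partial_sort; a sketch (k is assumed given, and only the indices are returned here):
#include <algorithm>
#include <numeric>
#include <vector>

// return the indices of the k smallest entries of distances
std::vector<int> kNearest(const Rcpp::NumericVector& distances, int k) {
    std::vector<int> idx(distances.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](int a, int b) { return distances[a] < distances[b]; });
    idx.resize(k);                         // keep only the k nearest
    return idx;
}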

A numerical library which uses a parallelized algorithm to do one-dimensional integration?

Is there a numerical library which can use a parallelized algorithm to do one-dimensional integration (global adaptive method)? The structure of my code means that I cannot run multiple numerical integrations in parallel, so I have to use a parallelized algorithm to speed things up.
Thanks!
The NAG C numerical library does have a parallel version of adaptive quadrature (link here). Their trick is to require the user to supply a function with the following signature:
void (*f)(const double x[], Integer nx, double fv[], Integer *iflag, Nag_Comm *comm)
Here the function "f" evaluates the integrand at nx abscise points given by the vector x[]. This is where parallelization comes along, because you can use parallel_for (implemented in openmp for example) to evaluate f at those points concurrently. The integrator itself is single threaded.
NAG is a very expensive library, but if you code the integrator yourself using, for example, Numerical Recipes, it is not difficult to modify a serial implementation to create a parallel adaptive integrator using NAG's idea.
I can't reproduce the Numerical Recipes book here to show where the modifications are necessary, due to license restrictions. So let's take the simplest example, the trapezoidal rule, where the implementation is quite simple and well known. The simplest way to create an adaptive method from the trapezoidal rule is to calculate the integral on a grid of points, then double the number of abscissa points and compare the results. If the result changes by less than the requested accuracy, there is convergence.
At each step, the trapezoidal rule can be computed using the following generic implementation
double trapezoidal( double (*f)(double x), double a, double b, int n )
{
    double h = (b - a)/n;
    double s = 0.5 * h * (f(a) + f(b));
    for( int i = 1; i < n; ++i ) s += h * f(a + i*h);
    return s;
}
Now you can make the following changes to implement the NAG idea:
double trapezoidal( void (*f)( double x[], int nx, double fv[] ), double a, double b, int n )
{
    double h = (b - a)/n;
    double x[n+1];
    double fv[n+1];
    for( int i = 0; i < n; ++i ) x[i] = a + i*h;  /* abscissae x[0..n-1] */
    x[n] = b;
    f(x, n+1, fv); /* inside f, use parallel_for to evaluate the integrand at x[i], i=0..n */
    double s = 0.5 * h * ( fv[0] + fv[n] );
    for( int i = 1; i < n; ++i ) s += h * fv[i];
    return s;
}
This procedure, however, will only speed up your code if the integrand is very expensive to compute. Otherwise, you should parallelize your code at higher-level loops and not inside the integrator.
Why not simply implement a wrapper around a single-threaded algorithm that dispatches integrals of subdivisions of the bounds to different threads and then adds them together at the end? e.g.
thread 0: i0 = integral(x0, (x0+x1)/2)
thread 1: i1 = integral((x0+x1)/2, x1)
i = i0 + i1
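A minimal sketch of that wrapper with OpenMP sections (assuming integral is the single-threaded routine, with the signature used in the pseudo-code above):
/* integrate the two halves of [x0, x1] concurrently */
double parallel_integral( double x0, double x1 )
{
    double mid = 0.5 * (x0 + x1);
    double i0 = 0.0, i1 = 0.0;
    #pragma omp parallel sections
    {
        #pragma omp section
        i0 = integral(x0, mid);  /* thread 0 */
        #pragma omp section
        i1 = integral(mid, x1);  /* thread 1 */
    }
    return i0 + i1;
}
One caveat: with an adaptive integrator the two halves may take unequal amounts of work, so a finer subdivision (with dynamic scheduling) balances the load better.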

Dividing by 2 and taking the ceiling until 1 remains

Given the following algorithm, defined only for natural numbers:
rounds(n) = { 1, if n = 1; 1 + rounds(ceil(n/2)), otherwise }
written in a programming language this would be:
int rounds(int n){
    if (n == 1)
        return 1;
    return 1 + rounds((n + 1) / 2);  /* (n + 1) / 2 == ceil(n/2) for positive ints */
}
I think this has time complexity O(log n).
Is there a better complexity?
Start by listing the results from 1 upward,
rounds(1) = 1
rounds(2) = 1 + rounds(2/2) = 1 + 1 = 2
Next, when ceil(n/2) is 2, rounds(n) will be 3. That's for n = 3 and n = 4.
rounds(3) = rounds(4) = 3
Then, when ceil(n/2) is 3 or 4, the result will be 4. 3 <= ceil(n/2) <= 4 happens if and only if 2*3-1 <= n <= 2*4, so
round(5) = ... = rounds(8) = 4
Continuing, you can see that
rounds(n) = k+2 if 2^k < n <= 2^(k+1)
by induction.
You can rewrite that to
rounds(n) = 2 + floor(log_2(n-1)) if n > 1 [and rounds(1) = 1]
and mathematically, you can also treat n = 1 uniformly by rewriting it to
rounds(n) = 1 + floor(log_2(2*n-1))
The last formula has the potential for overflow if you're using fixed-width types, though.
So the question is
how fast can you compare a number to 1,
how fast can you subtract 1 from a number,
how fast can you compute the (floor of the) base-2 logarithm of a positive integer?
For a fixed-width type, thus a bounded range, all these are of course O(1) operations, but then you're probably still interested in making it as efficient as possible, even though computational complexity doesn't enter the game.
For native machine types - which int and long usually are - comparing and subtracting integers are very fast machine instructions, so the only possibly problematic one is the base-2 logarithm.
Many processors have a machine instruction to count the leading 0-bits in a value of the machine types, and if that is made accessible by the compiler, you will get a very fast implementation of the base-2 logarithm. If not, you can get a faster version than the recursion using one of the classic bit-hacks.
For example, sufficiently recent versions of gcc and clang have a __builtin_clz (resp. __builtin_clzl for 64-bit types) that maps to the bsr* instruction if that is present on the processor, and presumably a good implementation using some bit-twiddling if it isn't provided by the processor.
The version
#include <limits.h>  /* for CHAR_BIT */

unsigned rounds(unsigned long n) {
    if (n <= 1) return n;
    return sizeof n * CHAR_BIT + 1 - __builtin_clzl(n - 1);
}
using the bsrq instruction takes (on my box) 0.165 seconds to compute rounds for 1 to 100,000,000, the bit-hack
unsigned rounds(unsigned n) {
    if (n <= 1) return n;
    --n;
    /* smear the highest set bit downward: n becomes 2^bitlength - 1 */
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    /* SWAR popcount of that mask yields the bit length of the original n-1 */
    n -= (n >> 1) & 0x55555555;
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
    n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F);
    return ((n * 0x01010101) >> 24) + 1;
}
takes 0.626 seconds, and the naive loop
unsigned rounds(unsigned n) {
    unsigned r = 1;
    while (n > 1) {
        ++r;
        n = (n + 1)/2;
    }
    return r;
}
takes 1.865 seconds.
If you don't use a fixed-width type, but arbitrary precision integers, things change a bit. The naive loop (or recursion) still uses Θ(log n) steps, but the steps take Θ(log n) time (or worse) on average, so overall you have a Θ(log² n) algorithm (or worse). Then using the formula above can not only offer an implementation with lower constant factors, but one with lower algorithmic complexity.
Comparing to 1 can be done in constant time for suitable representations; O(log n) is the worst case for reasonable representations.
Subtracting 1 from a positive integer takes O(log n) for reasonable representations.
Computing the (floor of the) base-2 logarithm can be done in constant time for some representations, and in O(log n) for other reasonable representations [if they use a power-of-2 base, which all arbitrary precision libraries I'm semi-familiar with do; if they used a power-of-10 base, that would be different].
If you think of the algorithm as iterative and the numbers as binary, then this function shifts out the lowest bit and increases the number by 1 if it was a 1 that was shifted out. Thus, except for the increment, it counts the number of bits in the number (that is, the position of the highest 1). The increment will eventually increase the result by one, except when the number is of the form 1000.... Thus, you get the number of bits plus one, or the number of bits if the number is a power of two. Depending on your machine model, this might be faster to calculate than the O(log n) recursion.

Using a global_work_offset in clEnqueueNDRangeKernel

I have a global work size of 1000, but I want to execute the kernel only for items 200 to 1000.
size_t global_work_size = 1000;
size_t global_work_offset = 200;
clEnqueueNDRangeKernel(cpu_queue, kernel [0], 1, &global_work_offset, &global_work_size, NULL, 0, NULL, NULL);
The problem is that it computes the whole 0-1000 range even if I specify an offset. I also tried:
size_t global_work_offset[1] = {200};
but still no luck.
You should notice the difference between that parameter in CL 1.0 and 1.1:
CL 1.0:
global_work_offset: Must currently be a NULL value. In a future revision of OpenCL, global_work_offset can be used to specify an array of work_dim unsigned values that describe the offset used to calculate the global ID of a work-item instead of having the global IDs always start at offset (0, 0, ... 0).
CL 1.1:
global_work_offset: global_work_offset can be used to specify an array of work_dim unsigned values that describe the offset used to calculate the global ID of a work-item. If global_work_offset is NULL, the global IDs start at offset (0, 0, ... 0).
So, check that you have a CL 1.1 device and drivers.
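If you're stuck on a CL 1.0 platform, a common workaround is to pass the offset as a kernel argument and add it yourself; a sketch (my_kernel and its arguments are placeholders):
__kernel void my_kernel(__global float* data, const uint offset)
{
    size_t gid = get_global_id(0) + offset;  /* behaves like a global work offset */
    /* ... operate on data[gid] ... */
}
Then enqueue with a global work size of 800 and set the offset argument to 200.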
