Using a global_work_offset in clEnqueueNDRangeKernel - opencl

I have a global work size of 1000, but I want to execute the kernel only over the range 200 to 1000.
size_t global_work_size = 1000;
size_t global_work_offset = 200;
clEnqueueNDRangeKernel(cpu_queue, kernel[0], 1, &global_work_offset, &global_work_size, NULL, 0, NULL, NULL);
The problem is that it computes the whole 0-1000 range even when I specify an offset. I tried using:
size_t global_work_offset[1] = {200};
but still no luck.

You should note the difference in this parameter between CL 1.0 and CL 1.1:
CL 1.0:
global_work_offset
Must currently be a NULL value. In a future revision of OpenCL,
global_work_offset can be used to specify an array of work_dim
unsigned values that describe the offset used to calculate the global
ID of a work-item instead of having the global IDs always start at
offset (0, 0,... 0).
CL 1.1:
global_work_offset
global_work_offset can be used to specify an array of work_dim
unsigned values that describe the offset used to calculate the global
ID of a work-item. If global_work_offset is NULL, the global IDs start
at offset (0, 0, ... 0).
So, check that you have a CL 1.1 device and drivers.
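One more detail worth checking, as an aside to this answer: global_work_size is the number of work-items launched, and the offset only shifts their IDs. So to cover exactly the global IDs 200..999 you would pass a size of 800 together with the offset of 200; a size of 1000 with offset 200 runs IDs 200..1199. A small host-side sketch of the ID arithmetic (launched_ids is a hypothetical helper for illustration, not an OpenCL API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Models the CL 1.1 rule: get_global_id(0) == global_work_offset + i
// for each of the global_work_size work-items i = 0 .. size-1.
std::vector<std::size_t> launched_ids(std::size_t offset, std::size_t size) {
    std::vector<std::size_t> ids;
    ids.reserve(size);
    for (std::size_t i = 0; i < size; ++i)
        ids.push_back(offset + i);
    return ids;
}
```

With offset 200 and size 800, the first ID is 200 and the last is 999, which matches the range the question wants.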

Related

Construct a bijective function to map arbitrary integer from [1, n] to [1, n] randomly

I want to construct a bijective function f(k, n, seed) from [1,n] to [1,n], where 1<=k<=n and 1<=f(k, n, seed)<=n for each given seed and n. The function should return a value from a random permutation of 1,2,...,n. The randomness is decided by the seed; different seeds may correspond to different permutations. I want f(k, n, seed)'s time complexity to be O(1) for each 1<=k<=n and any given seed.
Anyone knows how can I construct such a function? The randomness is allowed to be pseudo-randomness. n can be very large (e.g. >= 1e8).
No matter how you do it, you will always have to store a list of the numbers still available or the numbers already used ... A simple possibility would be the following:
const avail = [1, 2, 3, ..., n];
let random = new Random(seed);
function f(k, n) {
    let index = random.next(n - k);
    let result = avail[index];
    avail[index] = avail[n - 1 - k]; // move the last still-available value into the hole
    return result;
}
The assumptions for this are the following
the array avail is 0-indexed
random.next(x) creates a random integer i with 0 <= i < x
the first k to call the function f with is 0
f is called for contiguous k = 0, 1, 2, 3, ..., n-1
The principle works as follows:
avail holds all numbers still available for the permutation. When you take a random index, the element at that index is the next element of the permutation. Then, instead of splicing that element out of the array, which is quite expensive, you just replace the currently selected element with the last element in the avail array. In the next iteration you (virtually) decrease the size of the avail array by one by decreasing the upper limit for the random index by one.
I'm not sure how good this random permutation is in terms of distribution of the values; for instance, it may happen that a certain range of numbers is more likely to appear at the beginning of the permutation or at the end of the permutation.
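For concreteness, here is a compilable C++ sketch of that scheme (the name LazyPermutation is mine, and std::mt19937 stands in for the answer's random object):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Lazy Fisher-Yates as described above: next() picks a random index into the
// still-available values, returns that value, and moves the last live value
// into the hole so no O(n) splice is needed.
struct LazyPermutation {
    std::vector<int> avail;
    std::mt19937 rng;
    std::size_t k = 0; // how many values have been handed out so far

    LazyPermutation(int n, unsigned seed) : avail(n), rng(seed) {
        std::iota(avail.begin(), avail.end(), 1); // values 1..n
    }

    int next() { // must be called for k = 0, 1, 2, ..., n-1 in order
        std::size_t n = avail.size();
        std::uniform_int_distribution<std::size_t> dist(0, n - k - 1);
        std::size_t index = dist(rng);
        int result = avail[index];
        avail[index] = avail[n - 1 - k]; // overwrite the hole with the last live value
        ++k;
        return result;
    }
};
```

Note that, per the assumptions above, next() must be called for k = 0, 1, 2, ... in order, so this gives O(1) per call but not O(1) random access for an arbitrary k.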
A simple, but not very 'random', approach would be to use the fact that, if a is relatively prime to n (i.e. they have no common factors), then
x-> (a*x + b)%n
is a permutation of {0,..n-1} to {0,..n-1}. To find the inverse of this, you can use the extended euclidean algorithm to find k and l so that
1 = gcd(a,n) = k*a+l*n
for then the inverse of the map above is
y -> (k*y + c) mod n
where c = -k*b mod n
So you could choose a to be a 'random' number in {0,..n-1} that is relatively prime to n, and b to be any number in {0,..n-1}.
Note that you'll need to do this in 64 bit arithmetic to avoid overflow in computing a*x.
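A sketch of this affine construction, with the inverse computed by the extended Euclidean algorithm as described (function names are illustrative; everything is done in 64-bit arithmetic, which avoids overflow for n up to about 3e9):

```cpp
#include <cassert>
#include <cstdint>

// Extended Euclid: finds k with k*a + l*n = gcd(a, n) and returns k mod n.
// Requires a relatively prime to n, so k is the modular inverse of a.
int64_t mod_inverse(int64_t a, int64_t n) {
    int64_t old_r = a, r = n, old_s = 1, s = 0;
    while (r != 0) {
        int64_t q = old_r / r;
        int64_t tmp = old_r - q * r; old_r = r; r = tmp;
        tmp = old_s - q * s; old_s = s; s = tmp;
    }
    assert(old_r == 1); // gcd(a, n) must be 1
    return ((old_s % n) + n) % n;
}

// The forward permutation x -> (a*x + b) mod n over {0,..,n-1}.
int64_t f(int64_t x, int64_t a, int64_t b, int64_t n) {
    return (a * x + b) % n;
}

// Its inverse y -> (k*y + c) mod n, with k = a^-1 mod n and c = -k*b mod n.
int64_t f_inv(int64_t y, int64_t a, int64_t b, int64_t n) {
    int64_t k = mod_inverse(a, n);
    int64_t c = ((-k * b) % n + n) % n;
    return (k * y + c) % n;
}
```

Checking that f_inv(f(x)) == x for a sample of x values confirms the map is a bijection for a given coprime pair (a, n).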

Constraining the solutions for linear equations

I'm searching for a way to solve a system of linear equations. Specifically 8 equations with a total of 16 unknown values.
Each unknown value (w[0...15]) is a 32-bit binary value which corresponds to 4 ascii characters written over 8 bits. For example:
For :
I've tried writing this system of linear equations as a single matrix equation. Which gives:
Right now, using the Eigen linear algebra library, I get my 16 solutions (w[0...15]), but all of them are either decimal or null values, which is not what I need. All 16 solutions need to be the equivalent of 4 ASCII characters in their binary representation, meaning integers between 48 and 57 (ASCII for '0' to '9'), 65 and 90 (ASCII for 'A' to 'Z'), or 97 and 122 (ASCII for 'a' to 'z').
Current 16 solutions:
I've found a solution to this problem using something called box-constraints. An example is shown here using python's lsq_linear function which allows the user to specify bounds. It seems Eigen does not let the user specify bounds in its decomposition methods.
Therefore, my question is, how do you get a similar result in C++ using a linear algebra library? Or is there a better way to solve such systems of equations without writing it under a single matrix equation?
Thanks in advance.
Since you're working with linear equations over Z/2^32Z, integer linear programming (as you tagged the question) may be a solution, and algorithms that are inherently floating-point are not appropriate. Box constraints are not enough; they won't force the variables to take on integer values. Also, the model shown in the question does not take into account that multiplying and adding in Z/2^32Z can wrap, which excludes many potential solutions (or perhaps that is intended?) and may make the instance accidentally infeasible when it was intended to be solvable.
ILP can model equations over Z/2^32Z relatively directly (using integer variables between 0 and 2^32 and some unconstrained additional variables scaled by 2^32 to "absorb" the wraparound), but it tends to struggle badly with that kind of formulation - I would say it's one of the worst cases for an ILP solver without getting into the "intentionally difficult" cases. A more indirect model with 32x boolean variables is also possible, but this leads to constraints with very large constants, and ILP solvers tend to struggle with them too. Overall I do not recommend using ILP for this problem.
What I would recommend for this is an SMT solver that offers the bitvector theory, or, alternatively, a pseudo-boolean solver or plain SAT solver (which would leave the grunt work of implementing boolean circuits and converting them to CNF to you, instead of having it built into the solver).
If you have more unknowns than equations, your system is certain to be underdetermined: the rank of an 8 x 16 matrix is at most 8, so you have at least 8 degrees of freedom.
Furthermore, if you have bounds on your variables, i.e. mixed equalities and inequalities, then your problem is better posed as a linear program. You can set a dummy objective function c[i] = 0. You could use GLPK, but that is a very generic solution; if you want a small code snippet, you can probably find a toy implementation of the Simplex method that satisfies your needs.
I went for an SMT solver as suggested by @harold, specifically the CVC4 SMT solver. Here is the code I've written in C++, answering my question about finding the 16 solutions (w[0...15]) for a system of 8 equations, constrained to be ASCII characters. I have one last question though: what are pushing and popping for? (slv.push() and slv.pop())
#include <iostream>
#include <cvc4/cvc4.h>
using namespace std;
using namespace CVC4;
int main() {
    // bit-widths used below (these constants were not shown in the original snippet)
    const unsigned size_8 = 8;
    const unsigned size_32 = 32;
    // 1. initialize a CVC4 BitVector SMT solver
    ExprManager em;
    SmtEngine slv(&em);
    slv.setOption("incremental", true); // enable incremental solving
    slv.setOption("produce-models", true); // enable models
    slv.setLogic("QF_BV"); // set the bitvector theory logic
    Type bitvector8 = em.mkBitVectorType(size_8); // create an 8-bit wide bit-vector type (4 x 8-bit = 32-bit)
    // 2. create the SMT solver variables
    Expr w[16][4]; // w[0...15] where each w corresponds to 4 ascii characters
    for (int i = 0; i < 16; ++i) {
        for (int j = 0; j < 4; ++j) {
            // a. define w[i] (four ascii characters per w[i])
            w[i][j] = em.mkVar("w" + to_string(i) + to_string(j), bitvector8);
            // b. constrain w[i][0...3] to be an ascii character
            // - digit (0-9) constraint
            // ascii lower bound digit constraint (bit-vector unsigned greater than or equal)
            Expr digit_lower = em.mkExpr(kind::BITVECTOR_UGE, w[i][j], em.mkConst(BitVector(size_8, Integer(48))));
            // ascii upper bound digit constraint (bit-vector unsigned less than or equal; '9' is 57)
            Expr digit_upper = em.mkExpr(kind::BITVECTOR_ULE, w[i][j], em.mkConst(BitVector(size_8, Integer(57))));
            Expr digit_constraint = em.mkExpr(kind::AND, digit_lower, digit_upper);
            // - uppercase alphabetic character (A-Z) constraint, matching the ranges listed in the question
            Expr upper_lower = em.mkExpr(kind::BITVECTOR_UGE, w[i][j], em.mkConst(BitVector(size_8, Integer(65))));
            Expr upper_upper = em.mkExpr(kind::BITVECTOR_ULE, w[i][j], em.mkConst(BitVector(size_8, Integer(90))));
            Expr upper_constraint = em.mkExpr(kind::AND, upper_lower, upper_upper);
            // - lowercase alphabetic character (a-z) constraint
            // ascii lower bound alpha constraint (bit-vector unsigned greater than or equal)
            Expr alpha_lower = em.mkExpr(kind::BITVECTOR_UGE, w[i][j], em.mkConst(BitVector(size_8, Integer(97))));
            // ascii upper bound alpha constraint (bit-vector unsigned less than or equal)
            Expr alpha_upper = em.mkExpr(kind::BITVECTOR_ULE, w[i][j], em.mkConst(BitVector(size_8, Integer(122))));
            Expr alpha_constraint = em.mkExpr(kind::AND, alpha_lower, alpha_upper);
            Expr ascii_constraint = em.mkExpr(kind::OR, digit_constraint, upper_constraint, alpha_constraint);
            slv.assertFormula(ascii_constraint);
        }
    }
    // 3. encode the 8 equations
    for (int i = 0; i < 8; ++i) {
        // a. build the multiplication part (index * w[i])
        // m_unknowns (the coefficients) and globalSums (the right-hand sides)
        // come from the surrounding code and are not shown here
        vector<Expr> left_mult_hand;
        for (int j = 0; j < 16; ++j) {
            vector<Expr> inner_wj;
            for (int k = 0; k < 4; ++k) inner_wj.push_back(w[j][k]);
            Expr wj = em.mkExpr(kind::BITVECTOR_CONCAT, inner_wj);
            Expr index = em.mkConst(BitVector(size_32, Integer(m_unknowns[j])));
            left_mult_hand.push_back(em.mkExpr(kind::BITVECTOR_MULT, index, wj));
        }
        // b. sum each index * w[i]
        slv.push();
        Expr left_hand = em.mkExpr(kind::BITVECTOR_PLUS, left_mult_hand);
        Expr result = em.mkConst(BitVector(size_32, Integer(globalSums.to_ulong())));
        Expr assumption = em.mkExpr(kind::EQUAL, left_hand, result);
        slv.assertFormula(assumption);
        // c. check for satisfiability
        cout << "Result from CVC4 is: " << slv.checkSat(em.mkConst(true)) << endl << endl;
        slv.pop();
    }
    return 0;
}

BigInt calculations on the GPU in Julia

I need to perform calculations on random batches of very large integers. I have a function that compares the numbers for certain properties and returns a value based on those properties. Since the batches and the numbers themselves can be very large, I want to speed up the process by utilizing the GPU.
Here is a short version of what I have running purely on the CPU now.
using Statistics
function check(M)
    val = 0
    # some code that calculates val based on M, e.g. the mean
    val = mean(M)
    return val
end

function distribution(N, n, exp) # N = batch size, n = number of batches, exp = exponent of the upper limit of the integers
    avg = 0
    M = zeros(BigInt, N)
    for i = 1:n
        M = rand(1:BigInt(10)^exp, N)
        avg += check(M)
    end
    avg /= n
    println(avg, ":", N)
end
#example
distribution(10 ^ 3, 10 ^ 6, 100)
I have briefly used CUDAnative in Julia but I don't know how to implement the BigInt calculations. That package would be preferred but others are fine as well. Any help is appreciated.
BigInts are CPU-only, since they are not implemented in pure Julia (they wrap the GMP C library); see 1.

Using local memory to speed calculation

Should be an easy one but my OpenCL skills are completely rusty. :)
I have a simple kernel that does the sum of two arrays:
__kernel void sum(__global float* a, __global float* b, __global float* c)
{
    __private size_t gid = get_global_id(0);
    c[gid] = log(sqrt(exp(cos(sin(a[gid]))))) + log(sqrt(exp(cos(sin(b[gid])))));
}
It's working fine.
Now I'm trying to use local memory hoping it could speed things up:
__kernel void sum_with_local_copy(__global float* a, __global float* b, __global float* c, __local float* tmpa, __local float* tmpb, __local float* tmpc)
{
    __private size_t gid = get_global_id(0);
    __private size_t lid = get_local_id(0);
    __private size_t grid = get_group_id(0);
    __private size_t lsz = get_local_size(0);
    event_t evta = async_work_group_copy(tmpa, a + grid * lsz, lsz, 0);
    wait_group_events(1, &evta);
    event_t evtb = async_work_group_copy(tmpb, b + grid * lsz, lsz, 0);
    wait_group_events(1, &evtb);
    tmpc[lid] = log(sqrt(exp(cos(sin(tmpa[lid]))))) + log(sqrt(exp(cos(sin(tmpb[lid])))));
    event_t evt = async_work_group_copy(c + grid * lsz, tmpc, lsz, 0);
    wait_group_events(1, &evt);
}
But there are two issues with this kernel:
it's something like 3 times slower than the naive implementation
the results are wrong starting at index 64
My local-size is the max workgroup size.
So my questions are:
1) Am I missing something obvious or is there really a subtlety?
2) How to use local memory to speed up the computation?
3) Should I loop inside the kernel so that each work-item does more than one operation?
Thanks in advance.
Your simple kernel is already optimal w.r.t work-group performance.
Local memory will only improve performance in cases where multiple work-items in a work-group read from the same address in local memory. As there is no shared data in your kernel there is no gain to be had by transferring data from global to local memory, thus the slow-down.
As for point 3, you may see a gain by processing multiple values per thread (depending on how expensive your computation is and what hardware you have).
As you probably know you can explicitly set the local work group size (LWS) when executing your kernel using:
clEnqueueNDRangeKernel( ... bunch of args include Local Work Size ...);
as discussed here. But as already mentioned by Kyle, you don't really have to do this, because OpenCL tries to pick the best value for the LWS when you pass NULL for the LWS argument.
Indeed, the specification says: "local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to break the global work-items into appropriate work-group instances."
I was curious to see how this played out in your case so I setup your calculation to verify the performance against the default value chosen by OpenCL on my device.
In case you're interested, I set up some arbitrary data:
int n = powl(2, 20);
float* a = (float*)malloc(sizeof(float)*n);
float* b = (float*)malloc(sizeof(float)*n);
float* results = (float*)malloc(sizeof(float)*n);
for (int i = 0; i < n; i++) {
    a[i] = (float)i;
    b[i] = (float)(n - i);
    results[i] = 0.f;
}
and then, after defining all of the other OpenCL structures, I varied lws = VALUE from 2 to 256 (the max allowed on my device for this kernel) in powers of 2 and measured the wall-clock time (note: one can also use OpenCL events):
struct timeval timer;
int trials = 100;
gettimeofday(&timer, NULL);
double t0 = timer.tv_sec+(timer.tv_usec/1000000.0);
// ---------- Execution ---------
size_t global_work_size = n;
size_t lws[] = {VALUE}; // VALUE was varied from 2 to 256 in powers of 2.
for (int trial = 0; trial < trials; trial++) {
    clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL, &global_work_size, lws, 0, NULL, NULL);
}
clFinish(cmd_queue);
gettimeofday(&timer, NULL);
double t1 = timer.tv_sec+(timer.tv_usec/1000000.0);
double avgTime = (double)(t1-t0)/trials/1.0f;
I then plotted the total execution time as a function of the LWS and, as expected, the performance varies by quite a bit until the best value, LWS = 256, is reached. For LWS > 256, the memory on my device is exceeded with this kernel.
FYI, for these tests I am running a laptop GPU: AMD ATI Radeon HD 6750M, max compute units = 6 and CL_DEVICE_LOCAL_MEM_SIZE = 32768 (so no big screamer compared to other GPUs).
Here are the raw numbers:
LWS time(sec)
2 14.004
4 6.850
8 3.431
16 1.722
32 0.866
64 0.438
128 0.436
256 0.436
Next, I checked the default value chosen by OpenCL (passing NULL for the LWS) and this corresponds to the best value that I found by profiling, i.e., LWS = 256.
So with the code you set up, you found one of the suboptimal cases; as mentioned before, it's best to let OpenCL pick the best values for the local work groups, especially when there is no data in your kernel shared between multiple work-items in a work-group.
As to the error you got, you probably violated a constraint (from the spec):
The total number of work-items in the work-group must be less than or equal to the CL_DEVICE_MAX_WORK_GROUP_SIZE
Did you check that in detail, by querying the CL_DEVICE_MAX_WORK_GROUP_SIZE for your device?
Adding to what Kyle has written: it has to be multiple work-items reading from the same address; if it's just each work-item itself reading multiple times from the same address, then again local memory won't help you any; just use the work-item's private memory, i.e. variables you define within your kernel.
Also, some points not related to the use of local memory:
log(sqrt(exp(x))) = log(exp(x)) / 2 = x / 2 ... assuming it's the natural logarithm.
log(sqrt(exp(x))) = log(exp(x)) / 2 = x / (2 ln(2)) ... assuming it's the base-2 logarithm. Compute ln(2) in advance, of course.
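To make the simplification concrete: in the kernel above x = cos(sin(a[gid])), so each term log(sqrt(exp(cos(sin(v))))) collapses to cos(sin(v)) / 2 with the natural log. A quick host-side sanity check of the identity (function names are mine):

```cpp
#include <cassert>
#include <cmath>

// expensive: the expression exactly as written in the kernel (natural log)
double expensive(double v) { return std::log(std::sqrt(std::exp(std::cos(std::sin(v))))); }

// simplified: the algebraically reduced form
double simplified(double v) { return std::cos(std::sin(v)) * 0.5; }
```

Asserting that the two agree to within rounding error over a range of inputs confirms the reduction before moving it into the kernel.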
If you really did have some complex function-of-a-function-of-a-function, you might be better off using a Taylor series expansion. For example, your function expands to 1/2-x^2/4+(5 x^4)/48+O(x^6) (order 5).
The last term is an error term, which you can bound from above to choose the appropriate order for the expansion; the error term should not be that high for 'well-behaving' functions. The Taylor expansion calculation might even benefit from further parallelization (but then again, it might not).

Interview: random3 function implementation using random2

In a recent interview I was asked the following question: there is a function random2(), which returns 0 or 1 with equal probability (0.5). Write implementations of random4() and random3() using random2().
It was easy to implement random4() like this
if (random2())
    return random2();
return random2() + 2;
But I had difficulties with random3(). The only realization I could come up with:
uint32_t sum = 0;
for (uint32_t i = 0; i != N; ++i)
    sum += random2();
return sum % 3;
This implementation of random3() is based on my intuition only. I'm not sure it is actually correct, because I can't mathematically prove its correctness. Can somebody help me with this question, please?
random3:
Not sure if this is the most efficient way, but here's my take:
x = random2() + 2*random2()
What can happen:
0 + 0 = 0
0 + 2 = 2
1 + 0 = 1
1 + 2 = 3
The above are all the possibilities of what can happen, thus each has equal probability, so...
(p(x=c) is the probability that x = c)
p(x=0) = 0.25
p(x=1) = 0.25
p(x=2) = 0.25
p(x=3) = 0.25
Now while x = 3, we just keep generating another number, thus giving equal probability to 0,1,2. More technically, you would distribute the probability from x=3 across all of them repeatedly such that p(x=3) tends to 0, thus the probability of the others will tend to 0.33 each.
Code:
do
    val = random2() + 2*random2();
while (val == 3);
return val;
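Here is a compilable sketch of this rejection method (random2 is a stub built on std::mt19937 purely so the snippet runs; any fair coin works in its place):

```cpp
#include <cassert>
#include <random>

std::mt19937 coin_rng(12345); // stub source of fair bits (assumption for the demo)

// fair coin: returns 0 or 1, each with probability 0.5
int random2() { return static_cast<int>(coin_rng() & 1u); }

// Two flips give 0..3 uniformly; reject 3 and retry, leaving 0..2 uniform.
int random3() {
    int val;
    do {
        val = random2() + 2 * random2();
    } while (val == 3);
    return val;
}
```

Counting outcomes over many calls shows each of 0, 1, 2 appearing about a third of the time.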
random4:
Let's run through your code:
if (random2())
    return random2();
return random2() + 2;
First call has 50% chance of 1 (true) => returns either 0 or 1 with 50% * 50% probability, thus 25% each
First call has 50% chance of 0 (false) => returns either 2 or 3 with 50% * 50% probability, thus 25% each
Thus your code generates 0,1,2,3 with equal probability.
Update inspired by e4e5f4's answer:
For a more deterministic answer than the one I provided above...
Generate some large number by calling random2 a bunch of times and mod the result by the desired number.
This won't be exactly the right probability for each, but it will be close.
So, for a 32-bit integer built by calling random2 32 times, with target = 3:
Total numbers: 4294967296
Number of x's such that x%3 = 1 or 2: 1431655765
Number of x's such that x%3 = 0: 1431655766
Probability of 1 or 2 (each): 0.33333333325572311878204345703125
Probability of 0: 0.3333333334885537624359130859375
So within 0.00000002% of the correct probability, seems pretty close.
Code:
sum = 0;
for (int i = 0; i < 32; i++)
    sum = 2*sum + random2();
return sum % N;
Note:
As pjr pointed out, this is, in general, far less efficient than the rejection method above. The probability of the rejection method reaching the same number of random2 calls (i.e. 32), assuming this is the slowest operation, is 0.25^(32/2) = 0.0000000002 = 0.00000002%. This, together with the fact that this method isn't exact, gives a strong preference to the rejection method. Lowering the number of calls decreases the running time but increases the error, and it would probably need to be lowered quite a bit (thus reaching a high error) to approach the average running time of the rejection method.
It is useful to note the above algorithm has a maximum running time. The rejection method does not. If your random number generator is totally broken for some reason, it could keep generating the rejected number and run for quite a while or forever with the rejection method, but the for-loop above will run 32 times, regardless of what happens.
Using modulo (%) is not recommended because it introduces bias. The mapping is unbiased only if n is a power of 2; otherwise some kind of rejection is involved, as suggested by the other answer.
Another generic approach would be to emulate built-in PRNGs by -
Generate 32 random2() bits and map them to a 32-bit integer
Get a random number in [0,1) by dividing it by 2^32
Simply multiply this number by n (= 3, 4, ... 73, and so on) and floor it to get the desired output
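The three steps above can be sketched as follows (again with a stub random2 so the snippet runs; dividing by 2^32 gives u in [0,1), and flooring u*n maps it into 0..n-1, with the same slight bias discussed earlier):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

std::mt19937 bit_rng(2023); // stub source of fair bits (assumption for the demo)
int random2() { return static_cast<int>(bit_rng() & 1u); }

// Steps: 32 coin flips -> 32-bit integer -> u in [0,1) -> floor(u * n) in 0..n-1.
int random_n(int n) {
    std::uint64_t bits = 0;
    for (int i = 0; i < 32; ++i)
        bits = (bits << 1) | static_cast<std::uint64_t>(random2());
    double u = static_cast<double>(bits) / 4294967296.0; // divide by 2^32
    return static_cast<int>(u * n); // floor to an integer in [0, n)
}
```

Since bits is at most 2^32 - 1, u stays strictly below 1, so the result is always in range.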
