BigInt calculations on the GPU in Julia - julia

I need to perform calculations on random batches of very large integers. I have a function that checks the numbers for certain properties and returns a value based on those properties. Since both the batches and the numbers themselves can be very large, I want to speed up the process by using the GPU.
Here is a short version of what I have running purely on the CPU now.
using Statistics
function check(M)
val = 0
#some code that calculates val based on M, e.g. the mean
val = mean(M)
return val
end
function distribution(N, n, exp) # N=batchsize, n=# of batches, exp=exponent of the upper limit of the integers
avg = 0
M = zeros(BigInt, N)
for i = 1 : n
M = rand(1 : BigInt(10) ^ exp, N)
avg += check(M)
end
avg /= n
println(avg, ":", N)
end
#example
distribution(10 ^ 3, 10 ^ 6, 100)
I have briefly used CUDAnative in Julia but I don't know how to implement the BigInt calculations. That package would be preferred but others are fine as well. Any help is appreciated.

BigInts are CPU-only: they are not implemented in Julia itself but wrap the GMP C library, so they cannot be used in GPU kernels; see 1.

Related

Construct a bijective function to map arbitrary integer from [1, n] to [1, n] randomly

I want to construct a bijective function f(k, n, seed) from [1,n] to [1,n], where 1<=k<=n and 1<=f(k, n, seed)<=n for each given seed and n. In effect, the function should return the k-th value of a random permutation of 1,2,...,n. The randomness is decided by the seed; different seeds may correspond to different permutations. I want the time complexity of f(k, n, seed) to be O(1) for each 1<=k<=n and any given seed.
Does anyone know how I can construct such a function? Pseudo-randomness is fine. n can be very large (e.g. >= 1e8).
No matter how you do it, you will always have to store a list of numbers still available or numbers already used ... A simple possibility would be the following
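const avail = [1,2,3, ..., n]; // pseudocode: all n values, 0-indexed
let random = new Random(seed) // seeded generator, see the assumptions below
function f(k,n) {
let index = random.next(n - k); // pick one of the n - k values still available
let result = avail[index]
avail[index] = avail[n-k-1]; // move the last available value into the freed slot
return result;
}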
The assumptions for this are the following
the array avail is 0-indexed
random.next(x) creates an random integer i with 0 <= i < x
the first k to call the function f with is 0
f is called for contiguous k 0, 1, 2, 3, ..., n-1
The principle works as follows:
avail holds all numbers still available for the permutation. When you take a random index, the element at that index is the next element of the permutation. Then, instead of slicing that element out of the array, which is quite expensive, you just replace the currently selected element with the last element in the avail array. In the next iteration you (virtually) decrease the size of the avail array by 1 by decreasing the upper limit for the random index by one.
I'm not sure how good this random permutation is in terms of the distribution of the values, i.e. it might happen that a certain range of numbers is more likely to end up at the beginning or at the end of the permutation.
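The swap-with-last scheme described above can be sketched in Python as follows. The class name LazyPermutation and its method next_value are made up for this illustration, and Python's random.Random stands in for the seeded generator. The scheme is an incremental Fisher-Yates shuffle, so if the generator is uniform, every permutation should be equally likely. Note that it still needs O(n) memory and must be called for consecutive k.

```python
import random

class LazyPermutation:
    """Hands out a seeded random permutation of 1..n, one value per call."""

    def __init__(self, n, seed):
        self.avail = list(range(1, n + 1))  # values not yet handed out
        self.remaining = n                  # virtual size of avail
        self.rng = random.Random(seed)      # seeded PRNG (stand-in for Random(seed))

    def next_value(self):
        # Pick a uniformly random slot among the values still available.
        index = self.rng.randrange(self.remaining)
        result = self.avail[index]
        # Overwrite the chosen slot with the last available value, then shrink.
        self.avail[index] = self.avail[self.remaining - 1]
        self.remaining -= 1
        return result

perm = LazyPermutation(10, seed=42)
values = [perm.next_value() for _ in range(10)]
print(values)  # some ordering of 1..10, fixed by the seed
```

Each call does a constant amount of work, and the same seed reproduces the same permutation.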
A simple, but not very 'random', approach would be to use the fact that, if a is relatively prime to n (ie they have no common factors), then
x-> (a*x + b)%n
is a permutation of {0,..n-1} to {0,..n-1}. To find the inverse of this, you can use the extended euclidean algorithm to find k and l so that
1 = gcd(a,n) = k*a+l*n
for then the inverse of the map above is
y -> (k*y + c) mod n
where c = -k*b mod n
So you could choose a to be a 'random' number in {0,..n-1} that is relatively prime to n, and b to be any number in {0,..n-1}
Note that you'll need to do this in 64 bit arithmetic to avoid overflow in computing a*x.
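As a rough illustration of the affine map and its inverse, here is a small Python sketch. The helper names extended_gcd and make_affine_permutation are invented for this example; Python integers do not overflow, so the 64-bit concern above does not arise here.

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y

def make_affine_permutation(a, b, n):
    """Build x -> (a*x + b) % n and its inverse; requires gcd(a, n) == 1."""
    g, k, _ = extended_gcd(a, n)      # 1 == k*a + l*n
    if g != 1:
        raise ValueError("a must be relatively prime to n")
    c = (-k * b) % n

    def forward(x):
        return (a * x + b) % n

    def inverse(y):
        return (k * y + c) % n

    return forward, inverse

forward, inverse = make_affine_permutation(a=3, b=7, n=10)
image = [forward(x) for x in range(10)]
print(image)  # [7, 0, 3, 6, 9, 2, 5, 8, 1, 4]
```

Both directions cost O(1) per call and need no table, at the price of a very regular, easily predictable permutation.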

How to calculate the Theta order of the following module as the function of n?

function rec (n:integer);
begin
if n<=1 then
return (1)
else
return(rec(n-1)+rec(n-1)+rec(n-1))
end
My recurrence is as follows; I am confused about how to express it as a function of n.
I think the equation is something like T(n) = 3T(n-1)+2.
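Consider a slightly more general version of this function:
T(n) = a*T(n-b) + c
Re-substitute this into itself multiple times to spot a pattern emerging:
T(n) = a^2*T(n-2b) + c*(a + 1)
T(n) = a^3*T(n-3b) + c*(a^2 + a + 1)
After repeating the process m times:
T(n) = a^m*T(n-mb) + c*(a^m - 1)/(a - 1)
When do we stop? The stopping condition for this recurrence is n <= 1, so:
n - mb = 1, i.e. m = (n-1)/b
Therefore the expression for T becomes:
T(n) = a^((n-1)/b)*T(1) + c*(a^((n-1)/b) - 1)/(a - 1)
Substitute in the numbers, a = 3, b = 1, c = 2:
T(n) = 3^(n-1)*T(1) + 3^(n-1) - 1 = Θ(3^n)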
Note that we ignored any rounding for the max value of m, since integer rounding errors have maximum magnitude 0.5, and thus only give a constant factor difference.
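A quick numerical check can make the growth rate concrete. The sketch below takes the asker's recurrence T(n) = 3T(n-1) + 2 and assumes a base cost T(1) = 1 (the base cost is an assumption; any constant gives the same Θ order). Unrolling under that assumption gives 2*3^(n-1) - 1, which the code compares against the recurrence evaluated directly.

```python
def rec_cost(n):
    """Cost of rec(n) under T(n) = 3*T(n-1) + 2 with assumed T(1) = 1."""
    if n <= 1:
        return 1
    return 3 * rec_cost(n - 1) + 2

def closed_form(n):
    """Unrolled solution of the same recurrence: 2*3^(n-1) - 1."""
    return 2 * 3 ** (n - 1) - 1

for n in range(1, 8):
    print(n, rec_cost(n), closed_form(n))
```

The two columns agree, and each step in n multiplies the cost by roughly 3, consistent with Θ(3^n).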

Julia challenge - FitzHugh–Nagumo model PDE Runge-Kutta solver

I am a newbie in the Julia programming language, so I don't know much about how to optimize code. I have heard that Julia should be faster than Python, but I've written a simple Julia program for solving the FitzHugh–Nagumo model, and it doesn't seem to be faster than Python.
The FitzHugh–Nagumo model equations are:
function FHN_equation(u,v,a0,a1,d,eps,dx)
u_t = u - u.^3 - v + laplacian(u,dx)
v_t = eps.*(u - a1 * v - a0) + d*laplacian(v,dx)
return u_t, v_t
end
where u and v are the variables, which are 2D fields (that is, 2-dimensional arrays), and a0, a1, d, eps are the model's parameters. Both the parameters and the variables are of type Float. dx is the parameter that controls the separation between grid points; it is used by the laplacian function, which is a finite-difference implementation with periodic boundary conditions.
If one of you expert Julia coders can give me a hint on how to do things better in Julia, I will be happy to hear it.
The Runge-Kutta step function is:
function uv_rk4_step(Vs,Ps, dt)
u = Vs.u
v = Vs.v
a0=Ps.a0
a1=Ps.a1
d=Ps.d
eps=Ps.eps
dx=Ps.dx
du_k1, dv_k1 = FHN_equation(u,v,a0,a1,d,eps,dx)
u_k1 = dt*du_k1
v_k1 = dt*dv_k1
du_k2, dv_k2 = FHN_equation((u+(1/2)*u_k1),(v+(1/2)*v_k1),a0,a1,d,eps,dx)
u_k2 = dt*du_k2
v_k2 = dt*dv_k2
du_k3, dv_k3 = FHN_equation((u+(1/2)*u_k2),(v+(1/2)*v_k2),a0,a1,d,eps,dx)
u_k3 = dt*du_k3
v_k3 = dt*dv_k3
du_k4, dv_k4 = FHN_equation((u+u_k3),(v+v_k3),a0,a1,d,eps,dx)
u_k4 = dt*du_k4
v_k4 = dt*dv_k4
u_next = u+(1/6)*u_k1+(1/3)*u_k2+(1/3)*u_k3+(1/6)*u_k4
v_next = v+(1/6)*v_k1+(1/3)*v_k2+(1/3)*v_k3+(1/6)*v_k4
return u_next, v_next
end
And I've used imshow() from PyPlot package to plot the u field.
This is not a complete answer, but a taste of an optimization attempt on the laplacian function. Timing the original laplacian on a 10x10 matrix with @time gave me:
0.000038 seconds (51 allocations: 12.531 KB)
While this version:
function laplacian2(a,dx)
# Computes Laplacian of a matrix
# Usage: al=laplacian(a,dx)
# where dx is the grid interval
ns=size(a,1)
ns != size(a,2) && error("Input matrix must be square")
aa=zeros(ns+2,ns+2)
for i=1:ns
aa[i+1,1]=a[i,end]
aa[i+1,end]=a[i,1]
aa[1,i+1]=a[end,i]
aa[end,i+1]=a[1,i]
end
for i=1:ns,j=1:ns
aa[i+1,j+1]=a[i,j]
end
lap = Array{eltype(a),2}(ns,ns)
scale = inv(dx*dx)
for i=1:ns,j=1:ns
lap[i,j]=(aa[i,j+1]+aa[i+2,j+1]+aa[i+1,j]+aa[i+1,j+2]-4*aa[i+1,j+1])*scale
end
return lap
end
gives, with @time:
0.000010 seconds (6 allocations: 2.250 KB)
Notice the reduction in allocations. Extra allocations usually indicate the potential for optimization.
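For comparison, the same five-point stencil with periodic boundaries can be written without the padded copy by wrapping the indices. The following is a minimal pure-Python sketch of the stencil only (not Julia, and not tuned for speed); the name laplacian_periodic is made up for this example.

```python
def laplacian_periodic(a, dx):
    """Five-point Laplacian of a square grid with periodic boundaries."""
    ns = len(a)
    if any(len(row) != ns for row in a):
        raise ValueError("Input matrix must be square")
    scale = 1.0 / (dx * dx)
    lap = [[0.0] * ns for _ in range(ns)]
    for i in range(ns):
        up, down = (i - 1) % ns, (i + 1) % ns          # wrap row neighbours
        for j in range(ns):
            left, right = (j - 1) % ns, (j + 1) % ns   # wrap column neighbours
            lap[i][j] = (a[up][j] + a[down][j] + a[i][left] + a[i][right]
                         - 4 * a[i][j]) * scale
    return lap

spike = [[1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]
for row in laplacian_periodic(spike, 1.0):
    print(row)
```

Only the output array is allocated here; whether index wrapping or the padded copy is faster in Julia would have to be measured.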

Julia on Float versus Octave on Float

Version: v"0.5.0-dev+1259"
Context: The goal is to calculate the Rademacher penalty bound for a given number of data points n, with respect to the VC dimension dvc and the probability expressed by delta.
Please consider Julia code:
#Growth function on any n points with respect to VC-dimension
function mh(n, dvc)
if n <= dvc
2^n #A
else
n^dvc #B
end
end
#Rademacher penalty bound
function rademacher_penalty_bound(n::Int, dvc::Int, delta::Float64)
sqrt((2.0*log(2.0*n*mh(n,dvc)))/n) + sqrt((2.0/n)*log(1.0/delta)) + 1.0/n
end
and the equivalent code in Octave/Matlab:
%Growth function on n points for a given VC dimension (dvc)
function md = mh(n, dvc)
if n <= dvc
md= 2^n;
else
md = n^dvc;
end
end
%Rademacher penalty bound
function epsilon = rademacher_penalty_bound (n, dvc, delta)
epsilon = sqrt ((2*log(2*n*mh(n,dvc)))/n) + sqrt((2/n)*log(1/delta)) + 1/n;
end
Problem:
When I start testing it I receive the following results:
Julia first:
julia> rademacher_penalty_bound(50, 50, 0.05) #50 points
1.619360057204432
julia> rademacher_penalty_bound(500, 50, 0.05) #500 points
ERROR: DomainError:
[inlined code] from math.jl:137
in rademacher_penalty_bound at none:2
in eval at ./boot.jl:264
Now Octave:
octave:17> rademacher_penalty_bound(50, 50, 0.05)
ans = 1.6194
octave:18> rademacher_penalty_bound(500, 50, 0.05)
ans = 1.2387
Question: According to Noteworthy differences from MATLAB, I think I followed the rule of thumb ("literal numbers without a decimal point (such as 42) create integers instead of floating point numbers..."). The code crashes when the number of points exceeds 51 (line #B in mh). Could someone with more experience look at the code and say what I should improve or change?
While BigInt and BigFloat will work here, they're serious overkill. The real issue is that you're doing integer exponentiation in Julia and floating-point exponentiation in Octave/Matlab. So you just need to change mh to use floats instead of integers for exponents:
mh(n, dvc) = n <= dvc ? 2^float(n) : n^float(dvc)
rademacher_penalty_bound(n, dvc, δ) =
√((2log(2n*mh(n,dvc)))/n) + √(2log(1/δ)/n) + 1/n
With these definitions, you get the same results as Octave/Matlab:
julia> rademacher_penalty_bound(50, 50, 0.05)
1.619360057204432
julia> rademacher_penalty_bound(500, 50, 0.05)
1.2386545010981596
In Octave/Matlab, even when you input a literal without a decimal point, you still get a float – you have to do an explicit cast to int type. Also, exponentiation in Octave/Matlab always converts to float first. In Julia, x^2 is equivalent to x*x which prohibits conversion to floating-point.
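To see why the integer version fails, it can help to simulate 64-bit wraparound. The Python sketch below assumes that native integer arithmetic wraps modulo 2^64 in two's complement; wrap_int64 is a helper invented for this illustration. Since 500 = 2^2 * 5^3, the exact value of 500^50 contains the factor 2^100, so it wraps to 0 and the argument of the logarithm would become zero.

```python
import math

def wrap_int64(x):
    """Reduce x to the signed 64-bit range, as two's-complement overflow would."""
    return (x + 2**63) % 2**64 - 2**63

exact = 500**50              # Python integers are arbitrary precision
wrapped = wrap_int64(exact)  # what a wrapping 64-bit integer would hold
as_float = 500.0**50         # floating-point exponentiation stays in range

print(wrapped)   # 0
print(as_float)  # about 8.88e134
```

The float result is only approximate, but it is more than accurate enough for a value that is immediately passed to log.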
Although BigInt and BigFloat are excellent tools when they are necessary, they should usually be avoided, since they are overkill and slow.
In this case, the problem is indeed the difference between Octave, that treats everything as a floating-point number, and Julia, that treats e.g. 2 as an integer.
So the first thing to do is to use floating-point numbers in Julia too:
function mh(n, dvc)
if n <= dvc
2.0 ^ n
else
Float64(n) ^ dvc
end
end
This already helps, e.g. mh(50, 50) works.
However, the correct solution for this problem is to look at the code more carefully, and realise that the function mh only occurs inside a log:
log(2.0*n*mh(n,dvc))
We can use the laws of logarithms to rewrite this as
log(2.0*n) + log_mh(n, dvc)
where log_mh is a new function, which returns the logarithm of the result of mh. Of course, this should not be written directly as log(mh(n, dvc)), but is rather a new function:
function log_mh(n, dvc)
if n <= dvc
n * log(2.0)
else
dvc * log(n)
end
end
In this way, you will be able to use huge numbers without overflow.
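The log-domain rewrite can be sketched in Python like this, mirroring the log_mh function above. The checks use the two values reported earlier in the thread; the final call with a much larger n is only meant to show that nothing overflows.

```python
import math

def log_mh(n, dvc):
    """Logarithm of the growth function, computed without forming mh itself."""
    if n <= dvc:
        return n * math.log(2.0)
    return dvc * math.log(n)

def rademacher_penalty_bound(n, dvc, delta):
    # log(2*n*mh(n, dvc)) split into log(2*n) + log(mh(n, dvc))
    log_term = math.log(2.0 * n) + log_mh(n, dvc)
    return (math.sqrt(2.0 * log_term / n)
            + math.sqrt((2.0 / n) * math.log(1.0 / delta))
            + 1.0 / n)

print(rademacher_penalty_bound(50, 50, 0.05))
print(rademacher_penalty_bound(500, 50, 0.05))
print(rademacher_penalty_bound(10**6, 10**4, 0.05))
```

Because only logarithms of n are ever formed, the intermediate values stay small even when n^dvc itself would be astronomically large.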
I don't know whether getting BigFloat results is acceptable, but in any case you can use BigInt in the Julia part:
#Growth function on any n points with respect to VC-dimension
function mh(n, dvc)
if n <= dvc
(BigInt(2))^n #A
else
n^dvc #B
end
end
#Rademacher penalty bound
function rademacher_penalty_bound(n::BigInt, dvc::BigInt, delta::Float64)
sqrt((2.0*log(2.0*n*mh(n,dvc)))/n) + sqrt((2.0/n)*log(1.0/delta)) + 1.0/n
end
rademacher_penalty_bound(BigInt(500), BigInt(500), 0.05)
# => 1.30055251010957621105182244420.....
This happens because, by default, a Julia Int is a "machine-size" integer (64 bits on the common x86-64 platform), whereas Octave uses floating point. So in Julia mh(500,50) overflows. You can fix it by replacing mh() as follows:
function mh(n, dvc)
n2 = BigInt(n) # Or n2 = Float64(n)
if n <= dvc
2^n2 #A
else
n2^dvc #B
end
end

dividing by 2 and ceiling until remains 1

having the following algorithm only for natural numbers:
rounds(n)={1, if n=1; 1+rounds(ceil(n/2)), else}
so writing in a programming language this will be
int rounds(int n){
if(n==1)
return 1;
return 1+rounds((n+1)/2); // integer ceil(n/2); n/2 alone would truncate before ceil is applied
}
I think this has time complexity O(log n).
Is there a better complexity?
Start by listing the results from 1 upward,
rounds(1) = 1
rounds(2) = 1 + rounds(2/2) = 1 + 1 = 2
Next, when ceil(n/2) is 2, rounds(n) will be 3. That's for n = 3 and n = 4.
rounds(3) = rounds(4) = 3
then, when ceil(n/2) is 3 or 4, the result will be 4. 3 <= ceil(n/2) <= 4 happens if and only if 2*3-1 <= n <= 2*4, so
rounds(5) = ... = rounds(8) = 4
Continuing, you can see that
rounds(n) = k+2 if 2^k < n <= 2^(k+1)
by induction.
You can rewrite that to
rounds(n) = 2 + floor(log_2(n-1)) if n > 1 [and rounds(1) = 1]
and mathematically, you can also treat n = 1 uniformly by rewriting it to
rounds(n) = 1 + floor(log_2(2*n-1))
The last formula has the potential for overflow if you're using fixed-width types, though.
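The closed form can be checked against the direct iteration with a short Python sketch. It uses the fact that for a positive integer m, floor(log_2(m)) equals m.bit_length() - 1, so 2 + floor(log_2(n-1)) becomes 1 + (n-1).bit_length(). The function names are chosen for this example only.

```python
def rounds_naive(n):
    """Direct iteration: replace n by ceil(n/2) until it reaches 1."""
    r = 1
    while n > 1:
        r += 1
        n = (n + 1) // 2
    return r

def rounds_closed(n):
    """Closed form: rounds(n) = 2 + floor(log_2(n-1)) for n > 1."""
    if n <= 1:
        return 1
    return 1 + (n - 1).bit_length()

for n in (1, 2, 3, 4, 5, 8, 9):
    print(n, rounds_naive(n), rounds_closed(n))
```

Python integers are arbitrary precision, so the same check also works for values far beyond 64 bits.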
So the question is
how fast can you compare a number to 1,
how fast can you subtract 1 from a number,
how fast can you compute the (floor of the) base-2 logarithm of a positive integer?
For a fixed-width type, thus a bounded range, all these are of course O(1) operations, but then you're probably still interested in making it as efficient as possible, even though computational complexity doesn't enter the game.
For native machine types - which int and long usually are - comparing and subtracting integers are very fast machine instructions, so the only possibly problematic one is the base-2 logarithm.
Many processors have a machine instruction to count the leading 0-bits in a value of the machine types, and if that is made accessible by the compiler, you will get a very fast implementation of the base-2 logarithm. If not, you can get a faster version than the recursion using one of the classic bit-hacks.
For example, sufficiently recent versions of gcc and clang have a __builtin_clz (resp. __builtin_clzl for 64-bit types) that maps to the bsr* instruction if that is present on the processor, and presumably a good implementation using some bit-twiddling if it isn't provided by the processor.
The version
unsigned rounds(unsigned long n) {
if (n <= 1) return n;
return sizeof n * CHAR_BIT + 1 - __builtin_clzl(n-1);
}
using the bsrq instruction takes (on my box) 0.165 seconds to compute rounds for 1 to 100,000,000, the bit-hack
unsigned rounds(unsigned n) {
if (n <= 1) return n;
--n;
n |= n >> 1;
n |= n >> 2;
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n -= (n >> 1) & 0x55555555;
n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F);
return ((n * 0x01010101) >> 24)+1;
}
takes 0.626 seconds, and the naive loop
unsigned rounds(unsigned n) {
unsigned r = 1;
while(n > 1) {
++r;
n = (n+1)/2;
}
return r;
}
takes 1.865 seconds.
If you don't use a fixed-width type, but arbitrary precision integers, things change a bit. The naive loop (or recursion) still uses Θ(log n) steps, but the steps take Θ(log n) time (or worse) on average, so overall you have a Θ(log² n) algorithm (or worse). Then using the formula above can not only offer an implementation with lower constant factors, but one with lower algorithmic complexity.
Comparing to 1 can be done in constant time for suitable representations; O(log n) is the worst case for reasonable representations.
Subtracting 1 from a positive integer takes O(log n) for reasonable representations.
Computing the (floor of the) base-2 logarithm can be done in constant time for some representations, and in O(log n) for other reasonable representations [if they use a power-of-2 base, which all arbitrary precision libraries I'm semi-familiar with do; if they used a power-of-10 base, that would be different].
If you think of the algorithm as iterative and the numbers as binary, then this function shifts out the lowest bit and increases the number by 1 if it was a 1 that was shifted out. Thus, except for the increment, it counts the number of bits in the number (that is, the position of the highest 1). The increment will eventually increase the result by one, except when the number is of the form 1000.... Thus, you get the number of bits plus one, or the number of bits if the number is a power of two. Depending on your machine model, this might be faster to calculate than O(log n).
