I'm trying to speed up the dot product in Julia, but I can't find a BLAS function for the dot product.
My current solution is:
X = rand(5,1);
Y = rand(5,1);
res = BLAS.gemm('T','N', X, Y);
res[1]
I was wondering whether there is a simpler function for the dot product in BLAS in Julia, something like BLAS.dot(X, Y).
LinearAlgebra.BLAS.dotu is the BLAS1 dot product, but it won't be faster than the Julia built-in. The generic Julia implementations of BLAS1 and BLAS2 routines pretty much match OpenBLAS in performance. BLAS3 routines (matrix multiplication) involve much more work per element, and there OpenBLAS is faster.
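For comparison, here is a minimal sketch calling the BLAS level-1 routine directly and timing it against the built-in dot (for real vectors the routine is BLAS.dot; dotu/dotc are the complex variants):
using LinearAlgebra
using BenchmarkTools
x = rand(10_000);
y = rand(10_000);
@btime BLAS.dot($x, $y)  # BLAS level-1 dot product for real vectors
@btime dot($x, $y)       # generic Julia dot product; typically just as fast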
There is the dot function:
For any iterable containers x and y (including arrays of any
dimension) of numbers (or any element type for which dot is defined),
compute the dot product (or inner product or scalar product), i.e. the
sum of dot(x[i],y[i]), as if they were vectors.
x ⋅ y (where ⋅ can be typed by tab-completing \cdot in the REPL) is a
synonym for dot(x, y).
It seems faster than the gemm BLAS call:
using LinearAlgebra
using BenchmarkTools
n = 10000
x = rand(n, 1);
y = rand(n, 1);
@btime(BLAS.gemm('T','N', x, y))
19.212 μs (1 allocation: 96 bytes)
@btime(x ⋅ y)
1.536 μs (1 allocation: 16 bytes)
versioninfo()
Julia Version 1.0.3
Commit 099e826241* (2018-12-18 01:34 UTC)
Platform Info:
OS: Linux (x86_64-suse-linux)
CPU: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
I have a matrix $X$ and I would like to find its first principal component and the corresponding loadings. I would like to do this without computing the covariance matrix of $X$. How can I do so?
This is the standard version, which uses the eigendecomposition of the covariance matrix.
using LinearAlgebra: eigen
using Statistics: mean
function find_principal_component(X)
n = size(X, 1)
B = X .- mapslices(mean, X, dims=[1]) # Center columns of X
evalues, V = eigen(B'B / (n - 1)) # EigenDecomposition of Covariance Matrix
PC = V[:, argmax(evalues)] # Grab principal component and compute loading
return B * PC, PC
end
Alternatively, one could use the power method, which still uses the covariance matrix:
function power_method(X, niter=50)
pc = randn(size(X, 2))
pc /= norm(pc)
M = X'X
for i in 1:niter
pc = M * pc
pc /= norm(pc)
end
return X * pc, pc
end
I would like something like the power method, but without needing to compute the covariance matrix, which can be quite costly.
Possible solution
I noticed something interesting. Let $r_t$ be the principal component vector at time $t$. The idea of the power method is to start with a random $r_t$ and multiply it by $X^\top X$ many times to stretch it towards the principal component; in other words, $r_{t+1} = X^\top X r_t$.
Once we have the principal component $r_t$, the loadings are simply $\ell_t = X r_t$. This means we can write $r_{t+1} = X^\top \ell_t$.
One could therefore start with $r_t$ and $\ell_t$ initialized randomly and then iterate
$$r_{t+1} = \operatorname{normalize}(X^\top \ell_t), \qquad \ell_{t+1} = X r_{t+1}.$$
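A minimal sketch of that alternating iteration (the function name is illustrative; X is assumed to be centered already):
using LinearAlgebra: normalize

function power_method_no_cov(X, niter=50)
    pc = normalize(randn(size(X, 2)))  # random starting direction r_t
    loadings = X * pc                  # ℓ_t = X r_t
    for _ in 1:niter
        pc = normalize(X' * loadings)  # r_{t+1} = normalize(Xᵀ ℓ_t)
        loadings = X * pc              # ℓ_{t+1} = X r_{t+1}
    end
    return loadings, pc
end
Each iteration only needs two matrix–vector products with X, so the covariance matrix is never formed.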
In general, you may find singular value decompositions more useful for this.
The definition of the singular value decomposition is
B = U Σ V'
This means that
B'B = V Σ² V'
As a result, your code can avoid the computation of B'B. More importantly, the singular values are always real, so you don't have to worry about whether B'B will be exactly symmetric.
Even better, Arpack.svds allows you to compute just the largest few singular values.
Here is a version of your code that uses SVD instead of eigen decomposition:
using LinearAlgebra: eigen
using Statistics: mean
using Arpack: svds
function find_principal_component(X)
n = size(X, 1)
# Center columns of X
B = X .- mapslices(mean, X, dims=[1])
# Decomposition of Covariance Matrix
svd,_ = svds(B / (n - 1), nsv=1)
# Grab principal component and compute loading
PC = svd.V[:, 1]
return B * PC, PC
end
Running this on a large sparse matrix (100k x 1k, 1M non-zeros) gives this speed:
julia> @time find_principal_component(sprandn(100_000, 1_000, 0.01))
25.529426 seconds (18.45 k allocations: 3.015 GiB, 0.02% gc time)
([0.014242904195824286, 0.10635817357717596, -0.010142643763158442, ...])
and on a large non-sparse example (1M x 100 entries):
julia> @time find_principal_component(randn(1_000_000, 100))
4.922949 seconds (1.31 k allocations: 2.280 GiB, 0.02% gc time)
([-0.06629858174095249, 0.6996443876327108, -1.1783642870384952, ...])
Try using KrylovKit.jl. Specifically, eigsolve(X, 1, :LM) (howmany = 1, which = :LM) will give you the eigenvalue with the largest magnitude and the associated eigenvector. Docs are at https://jutho.github.io/KrylovKit.jl/stable/man/eig/
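A minimal sketch of how that might look for the PCA setting above; the data and names are illustrative, and eigsolve also accepts a function, so B'B never has to be materialized:
using KrylovKit: eigsolve
using Statistics: mean

X = randn(1_000, 50)                 # example data
B = X .- mean(X, dims=1)             # centered data matrix
n = size(B, 1)

# Apply the covariance operator v -> B'(B v) / (n - 1) without forming B'B.
vals, vecs, info = eigsolve(v -> (B' * (B * v)) / (n - 1), randn(size(B, 2)), 1, :LM)
pc = vecs[1]        # leading eigenvector ≈ first principal component
loadings = B * pc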
I am working on an optimization problem with a huge number of variables (upwards of hundreds of millions). Each of them should be a 0-1 binary variable.
I can write it in the form (maximize x'Qx) where Q is positive semi-definite, and I am using Julia, so the package COSMO.jl seems like a great fit. However, there is a ton of sparsity in my problem. Q is 0 except on approximately sqrt(|Q|) entries, and for the constraints there are approximately sqrt(|Q|) linear constraints on the variables.
I can describe this system pretty easily using SparseArrays, but it appears the most natural way to input problems into COSMO uses standard arrays. Is there a way I can take advantage of the sparsity in this massive problem?
While there is no sample code in your question, perhaps this can help:
JuMP works with sparse arrays, so perhaps the easiest thing is to just use them in the construction of the objective function:
julia> using JuMP, SparseArrays, COSMO
julia> m = Model(with_optimizer(COSMO.Optimizer));
julia> q = sprand(Bool, 20, 20, 0.05) # for readability I use a binary q
20×20 SparseMatrixCSC{Bool, Int64} with 21 stored entries:
⠀⠀⠀⡔⠀⠀⠀⠀⡀⠀
⠀⠀⠂⠀⠠⠀⠀⠈⠑⠀
⠀⠀⠀⠀⠀⠤⠀⠀⠀⠀
⠀⢠⢀⠄⠆⠀⠂⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠄⠀⠀⠌
julia> @variable(m, x[1:20], Bin);
julia> x'*q*x
x[1]*x[14] + x[14]*x[3] + x[15]*x[8] + x[16]*x[5] + x[18]*x[4] + x[18]*x[13] + x[19]*x[14] + x[20]*x[11]
You can see that the expression gets correctly reduced.
Indeed you could check the performance with a very sparse q having 100M elements:
julia> q = sprand(10000, 10000,0.000001)
10000×10000 SparseMatrixCSC{Float64, Int64} with 98 stored entries:
...
julia> @variable(m, z[1:10000], Bin);
julia> @btime $z'*$q*$z
1.276 ms (51105 allocations: 3.95 MiB)
You can see that you are getting the expected performance when constructing the objective function.
For scalars, the \ (left division) operator is equivalent to the division operator / with the operands swapped. Is the performance similar?
I ask because currently my code has a line like
x = (1 / alpha) * averylongfunctionname(input1, input2, input3)
Visually, it is important that the division by alpha happens on the "left," so I am considering replacing this with
x = alpha \ averylongfunctionname(input1, input2, input3)
What is the best practice in this situation, from the standpoint of style and the standpoint of performance?
Here are some perplexing benchmarking results:
julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]
julia> @btime x[1]\sum(x) setup=(x=rand(100))
15.014 ns (0 allocations: 0 bytes)
56.23358979466163
julia> @btime (1/x[1]) * sum(x) setup=(x=rand(100))
13.312 ns (0 allocations: 0 bytes)
257.4552413802698
julia> @btime sum(x)/x[1] setup=(x=rand(100))
14.929 ns (0 allocations: 0 bytes)
46.25209548841374
They are all about the same, but I'm surprised that the (1 / x) * foo approach has the best performance.
Scalar / and \ really should have the same meaning and performance. Let's define these two test functions:
f(a, b) = a / b
g(a, b) = b \ a
We can then see that they produce identical LLVM code:
julia> @code_llvm f(1.5, 2.5)
; @ REPL[29]:1 within `f'
define double @julia_f_380(double %0, double %1) {
top:
; ┌ @ float.jl:335 within `/'
%2 = fdiv double %0, %1
; └
ret double %2
}
julia> @code_llvm g(1.5, 2.5)
; @ REPL[30]:1 within `g'
define double @julia_g_382(double %0, double %1) {
top:
; ┌ @ operators.jl:579 within `\'
; │┌ @ float.jl:335 within `/'
%2 = fdiv double %0, %1
; └└
ret double %2
}
And the same machine code too. I'm not sure what is causing the differences in @btime results, but I'm pretty sure that the difference between / and \ is an illusion and not real.
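If you want to check the machine-code claim yourself, you can compare the native code for the same f and g defined above:
julia> @code_native f(1.5, 2.5)
julia> @code_native g(1.5, 2.5)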
As to x*(1/y), that does not compute the same thing as x/y: it will be potentially less accurate since there is rounding done when computing 1/y and then that rounded value is multiplied by x, which also rounds. For example:
julia> 17/0.7
24.28571428571429
julia> 17*(1/0.7)
24.285714285714285
Since floating-point division is guaranteed to be correctly rounded, doing the division directly is always going to be more accurate. If the divisor is shared by a lot of loop iterations, however, you can get a speedup by rewriting the computation this way, since floating-point multiplication is usually faster than division (although timing on my current computer does not show this). Be aware that this comes at a loss of accuracy; and if the divisor is not shared, there would still be a loss of accuracy and no performance gain.
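A minimal sketch of that rewrite when the divisor is shared across a loop (the function and names are purely illustrative):
function scale_all(xs, alpha)
    inv_alpha = 1 / alpha                 # one (rounded) division up front
    return [x * inv_alpha for x in xs]    # multiplications instead of x / alpha each time
end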
I don't know, but I can suggest trying the BenchmarkTools package: it can help you evaluate the performance of the two different statements. Here you can find more details. Bye!
I think that the best choice is (1/x)*foo, for two reasons:
it has the best performance (although not by much compared to the other ones);
it is clearer for another person reading the code.
I'm experimenting with DSP.jl - the conv() method, in particular. I'm using CUDAnative and CuArrays to create arrays to be arguments to conv(), so that the CUDA versions of fft(), etc. will be used. I'm using BenchmarkTools to get performance data. I find that the Julia runtime complains about running out of CPU or GPU memory under odd circumstances. Here's my test setup:
using CUDAdrv, CUDAnative, CuArrays
using DSP
using FFTW
using BenchmarkTools
N = 120
A = rand(Float32, N, N, N);
B = rand(Float32, N, N, N);
A_d = cu(A);
B_d = cu(B);
function doConv(A, B)
C = conv(A, B)
finalize(C)
C = []
end
t = @benchmark doConv($A_d, $B_d)
display(t)
Here's an example of the odd behavior I mentioned. If I set N to 120, my script runs to completion. If I set N to 64, I get the "out of memory" error: ERROR: LoadError: CUFFTError(code 2, cuFFT failed to allocate GPU or CPU memory). I can run the smaller case first, get the error, then bump N to the larger value and have the script complete successfully.
Is there something I should be doing differently to prevent this from happening?
I am a newbie in the Julia programming language, so I don't know much about how to optimize code. I have heard that Julia should be faster than Python, but I've written a simple Julia code for solving the FitzHugh–Nagumo model, and it doesn't seem to be faster than Python.
The FitzHugh–Nagumo model equations are:
function FHN_equation(u,v,a0,a1,d,eps,dx)
u_t = u - u.^3 - v + laplacian(u,dx)
v_t = eps.*(u - a1 * v - a0) + d*laplacian(v,dx)
return u_t, v_t
end
where u and v are the variables, which are 2D fields (that is, 2 dimensional arrays), and a0,a1,d,eps are the model's parameters. Both parameters and the variables are of type Float. dx is the parameter that control the separation between grid point, for the use of the laplacian function, which is an implementation of finite differences with periodic boundary conditions.
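For reference, here is a minimal sketch of a periodic-boundary Laplacian of the kind described (the question's actual implementation is not shown; this version uses circshift purely for illustration):
function laplacian(a, dx)
    # Five-point stencil with periodic boundary conditions via circular shifts
    return (circshift(a, (1, 0)) .+ circshift(a, (-1, 0)) .+
            circshift(a, (0, 1)) .+ circshift(a, (0, -1)) .- 4 .* a) ./ dx^2
end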
If one of you expert Julia coders can give me a hint of how to do things better in Julia I will be happy to hear.
The Runge–Kutta step function is:
function uv_rk4_step(Vs,Ps, dt)
u = Vs.u
v = Vs.v
a0=Ps.a0
a1=Ps.a1
d=Ps.d
eps=Ps.eps
dx=Ps.dx
du_k1, dv_k1 = FHN_equation(u,v,a0,a1,d,eps,dx)
u_k1 = dt*du_k1
v_k1 = dt*dv_k1
du_k2, dv_k2 = FHN_equation((u+(1/2)*u_k1),(v+(1/2)*v_k1),a0,a1,d,eps,dx)
u_k2 = dt*du_k2
v_k2 = dt*dv_k2
du_k3, dv_k3 = FHN_equation((u+(1/2)*u_k2),(v+(1/2)*v_k2),a0,a1,d,eps,dx)
u_k3 = dt*du_k3
v_k3 = dt*dv_k3
du_k4, dv_k4 = FHN_equation((u+u_k3),(v+v_k3),a0,a1,d,eps,dx)
u_k4 = dt*du_k4
v_k4 = dt*dv_k4
u_next = u+(1/6)*u_k1+(1/3)*u_k2+(1/3)*u_k3+(1/6)*u_k4
v_next = v+(1/6)*v_k1+(1/3)*v_k2+(1/3)*v_k3+(1/6)*v_k4
return u_next, v_next
end
And I've used imshow() from the PyPlot package to plot the u field.
This is not a complete answer, but a taste of an optimization attempt on the laplacian function. The original laplacian on a 10x10 matrix gave me the @time:
0.000038 seconds (51 allocations: 12.531 KB)
While this version:
function laplacian2(a,dx)
# Computes Laplacian of a matrix
# Usage: al=laplacian(a,dx)
# where dx is the grid interval
ns=size(a,1)
ns != size(a,2) && error("Input matrix must be square")
aa=zeros(ns+2,ns+2)
for i=1:ns
aa[i+1,1]=a[i,end]
aa[i+1,end]=a[i,1]
aa[1,i+1]=a[end,i]
aa[end,i+1]=a[1,i]
end
for i=1:ns,j=1:ns
aa[i+1,j+1]=a[i,j]
end
lap = Matrix{eltype(a)}(undef, ns, ns)
scale = inv(dx*dx)
for i=1:ns,j=1:ns
lap[i,j]=(aa[i,j+1]+aa[i+2,j+1]+aa[i+1,j]+aa[i+1,j+2]-4*aa[i+1,j+1])*scale
end
return lap
end
Gives @time:
0.000010 seconds (6 allocations: 2.250 KB)
Notice the reduction in allocations. Extra allocations usually indicate the potential for optimization.