Compute sum_i f(i) x(i) x(i)' fast? - julia

I'm trying to compute the summation of f(i) * x(i) * x(i)'
where x(i) is a column vector, x(i)' is the transpose, and f(i) is a scalar. So it's a weighted sum of outer products.
In MATLAB, this can be achieved pretty fast by using bsxfun.
The following code runs in 260 ms on my laptop (MacBook Air 2010)
N = 1e5;
d = 100;
f = randn(N, 1);
x = randn(N, d);
% H = zeros(d, d);
tic;
H = x' * bsxfun(@times, f, x);
toc
I've been trying to make Julia do the same job, but I can't get it to run as fast.
N = int(1e5);
d = 100;
f = randn(N);
x = randn(N, d);
function hess1(x, f)
N, d = size(x);
temp = zeros(N, d);
@simd for kk = 1:N
@inbounds temp[kk, :] = f[kk] * x[kk, :];
end
H = x' * temp;
end
function hess2(x, f)
N, d = size(x);
H2 = zeros(d,d);
@simd for k = 1:N
@inbounds H2 += f[k] * x[k, :]' * x[k, :];
end
return H2
end
function hess3(x, f)
N, d = size(x);
H3 = zeros(d,d);
for k = 1:N
for k1 = 1:d
@simd for k2 = 1:d
@inbounds H3[k1, k2] += x[k, k1] * x[k, k2] * f[k];
end
end
end
return H3
end
The results are
@time H1 = hess1(x, f);
@time H2 = hess2(x, f);
@time H3 = hess3(x, f);
elapsed time: 0.776116469 seconds (262480224 bytes allocated, 26.49% gc time)
elapsed time: 30.496472345 seconds (16385442496 bytes allocated, 56.07% gc time)
elapsed time: 2.769934563 seconds (80128 bytes allocated)
hess1 is like MATLAB's bsxfun but slower, and hess3 uses no temporary memory but is significantly slower. My best Julia code is 3x slower than MATLAB.
How can I make this julia code faster?
IJulia gist: http://nbviewer.ipython.org/gist/memming/669fb8e78af3338ebf6f
Julia version: 0.3.0-rc1
EDIT:
I tested on a more powerful computer (3.5 GHz Intel i7, 4 cores, 256 kB L2, 8 MB L3)
MATLAB R2014a without -singleCompThread: 0.053 s
MATLAB R2014a with -singleCompThread: 0.080 s (@tholy's suggestion)
Julia 0.3.0-rc1
hess1 elapsed time: 0.215406904 seconds (262498648 bytes allocated, 32.74% gc time)
hess2 elapsed time: 10.722578699 seconds (16384080176 bytes allocated, 62.20% gc time)
hess3 elapsed time: 1.065504355 seconds (80176 bytes allocated)
bsxfunstyle elapsed time: 0.063540168 seconds (80081072 bytes allocated, 25.04% gc time) (@IainDunning's solution)
Indeed, using broadcast is much faster and comparable to MATLAB's bsxfun.
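For reference, since f is just a per-row weight, the whole sum is the matrix identity H = X' diag(f) X. A minimal sketch on modern Julia (1.x); hess_ref is a hypothetical name:
using LinearAlgebra
# H = sum_i f(i) * x(i) * x(i)'  ==  X' * Diagonal(f) * X
hess_ref(x, f) = x' * Diagonal(vec(f)) * x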

You are looking for the broadcast function. Here is the relevant issue discussing the functionality and naming.
I implemented your version as well as a broadcast version, here is what I found:
srand(1988)
N = 100_000
d = 100
f = randn(N, 1)
x = randn(N, d)
function hess1(x, f)
N, d = size(x);
temp = zeros(N, d);
@simd for kk = 1:N
@inbounds temp[kk, :] = f[kk] * x[kk, :];
end
H = x' * temp;
end
function bsxfunstyle(x, f)
x' * broadcast(*,f,x)
end
# Warmup
hess1(x,f)
bsxfunstyle(x, f)
# For real
println("Hess1")
@time H1 = hess1(x, f)
println("Broadcast")
@time H2 = bsxfunstyle(x, f)
# Check solutions are identical
println(sum(abs(H1-H2)))
with output
Hess1
elapsed time: 0.324256216 seconds (262498648 bytes allocated, 33.95% gc time)
Broadcast
elapsed time: 0.126647594 seconds (80080696 bytes allocated, 20.22% gc time)
0.0

There are several performance issues with your functions:
you're creating temporary arrays with x[kk, :].
you're traversing the matrix by rows while it is stored in column-major order.
you're using x' (which first transposes the matrix) rather than At_mul_B(x, ...).
A simple modification gives better performance:
N = 100_000
d = 100
f = randn(N)
x = randn(N, d)
function hess(x, f)
N, d = size(x);
temp = zeros(N, d);
@inbounds for k1 = 1:d
@simd for kk = 1:N
temp[kk, k1] = f[kk] * x[kk, k1]
end
end
H = At_mul_B(x, temp)
end
@time hess(x, f)
# 0.067636 seconds (9 allocations: 76.371 MB, 11.24% gc time)
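Note that At_mul_B was removed in Julia 1.0: x' now builds a lazy Adjoint, so x' * temp dispatches to the same BLAS routine without copying. On modern Julia the whole function collapses to a one-liner; a sketch (hess_modern is a hypothetical name):
# f .* x fuses into a single broadcast (one N×d temporary);
# x' is lazy, so the product goes straight to BLAS.
hess_modern(x, f) = x' * (f .* x)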

Related

broadcasting arrays and summation without temporary matrix allocations in Julia

I need to perform a calculation of the form:
A = reshape(big_mat,m1,1,1,n1,n2,n3)
B = reshape(big_mat,1,m1,n1,1,n2,n3)
C = reshape(another_mat,m1,m1,n1,n1,1,1)
D = sum(A.*B.*C,dims=(5,6))
A.*B.*C is creating a temporary big matrix of size(m1,m1,n1,n1,n2,n3). Given that D is only of size(m1,m1,n1,n1), is there a more efficient procedure of doing this summation without invoking for-loops?
You can ask the computer to write the loops for you. @einsum D[a,b,c,d] := mat[a,d,e,f] * mat[b,c,e,f] * another[a,b,c,d] will write 6 nested for loops, summing over e,f which do not appear on the left.
@tullio will do the same, and add tiled memory access (and multi-threading), which should be a bit faster.
julia> using Einsum, Tullio, BenchmarkTools
julia> let
n = 25
big_mat = rand(n,n,n,n)
another_mat = rand(n,n,n,n)
D1 = @btime let n = $n
A = reshape($big_mat, n,1,1,n, n,n)
B = reshape($big_mat, 1,n,n,1, n,n)
C = reshape($another_mat, n,n,n,n, 1,1)
D = sum(A.*B.*C,dims=(5,6))
end
D2 = @btime @einsum D[a,b,c,d] := $big_mat[a,d,e,f] * $big_mat[b,c,e,f] * $another_mat[a,b,c,d]
D3 = @btime @tullio D[a,b,c,d] := $big_mat[a,d,e,f] * $big_mat[b,c,e,f] * $another_mat[a,b,c,d]
D1 ≈ D2 ≈ D3
end
min 462.545 ms, mean 494.505 ms (20 allocations, 1.82 GiB)
min 213.126 ms, mean 214.412 ms (3 allocations, 2.98 MiB)
min 80.585 ms, mean 80.785 ms (53 allocations, 2.98 MiB)
true
julia> 2.98 * 25^2 # memory 2.98 MiB -> 1.82 GiB
1862.5
julia> @macroexpand1 @einsum D[a,b,c,d] := mat[a,d,e,f] * mat[b,c,e,f] * another[a,b,c,d]
quote # this will print the loops
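For intuition, the loops @einsum generates are roughly equivalent to this hand-written sketch (not the literal macro output; acc is a hypothetical accumulator):
D = zeros(n, n, n, n)
for a in 1:n, b in 1:n, c in 1:n, d in 1:n
    acc = 0.0
    for e in 1:n, f in 1:n   # e and f are summed out: they don't appear on the left
        acc += big_mat[a,d,e,f] * big_mat[b,c,e,f]
    end
    D[a,b,c,d] = acc * another_mat[a,b,c,d]   # this factor doesn't depend on e,f
end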

Julia: marker_z takes much time

I'm trying to make a gradation (color-mapped) plot.
using Plots
using LinearAlgebra
L = 60 #size of a matrix
N = 10000 #number of loops
E = zeros(Complex{Float64},N,L) #set of eigenvalues
IPR = zeros(Complex{Float64},N,L) #indicator for marker_z
Preparing E & IPR
function main()
cnt = 0
for i = 1:N
cnt += 1
H = rand(Complex{Float64},L,L)
eigenvalue,eigenvector = eigen(H)
for j = 1:L
E[cnt,j] = eigenvalue[j]
IPR[cnt,j] = abs2(norm(abs2.(eigenvector[:,j])))/(abs2(norm(eigenvector[:,j])))
end
end
end
Plotting
function main1()
plot(real.(E),imag.(E),marker_z = real.(IPR),st = :scatter,markercolor=:cool,markerstrokewidth=0,markersize=1,dpi=300)
plot!(legend=false,xlabel="ReE",ylabel="ImE")
savefig("test.png")
end
@time main1()
358.794885 seconds (94.30 M allocations: 129.882 GiB, 2.05% gc time)
Compared with a uniform (single-color) plot, the gradation plot takes far more time.
function main2()
plot(real.(E),imag.(E),st = :scatter,markercolor=:blue,markerstrokewidth=0,markersize=1,dpi=300)
plot!(legend=false,xlabel="ReE",ylabel="ImE")
savefig("test1.png")
end
@time main2()
8.100609 seconds (10.85 M allocations: 508.054 MiB, 0.47% gc time)
Is there a way to make gradation plotting as fast as uniform plotting?
I solved the problem by myself.
After updating from Julia 1.3.1 to Julia 1.6.3, main1 became much faster, as Bill's comments suggested.
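For anyone reproducing this: marker_z just attaches one scalar per point, which the backend maps through the color gradient. A minimal self-contained sketch with hypothetical data, assuming the :cool gradient used above:
using Plots
xs, ys = randn(1000), randn(1000)
scatter(xs, ys; marker_z = xs .+ ys, color = :cool,
        markerstrokewidth = 0, markersize = 2, legend = false)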

Conditional closures in Julia

In many applications of map(f,X), it helps to create closures that, depending on parameters, apply different functions f to data X.
I can think of at least the following three ways to do this (note that the second for some reason does not work; a bug?)
f0(x,y) = x+y
f1(x,y,p) = x+y^p
function g0(power::Bool,X,y)
if power
f = x -> f1(x,y,2.0)
else
f = x -> f0(x,y)
end
map(f,X)
end
function g1(power::Bool,X,y)
if power
f(x) = f1(x,y,2.0)
else
f(x) = f0(x,y)
end
map(f,X)
end
abstract FunType
abstract PowerFun <: FunType
abstract NoPowerFun <: FunType
function g2{S<:FunType}(T::Type{S},X,y)
f(::Type{PowerFun},x) = f1(x,y,2.0)
f(::Type{NoPowerFun},x) = f0(x,y)
map(x -> f(T,x),X)
end
X = 1.0:1000000.0
burnin0 = g0(true,X,4.0) + g0(false,X,4.0);
burnin1 = g1(true,X,4.0) + g1(false,X,4.0);
burnin2 = g2(PowerFun,X,4.0) + g2(NoPowerFun,X,4.0);
@time r0true = g0(true,X,4.0); # 0.019515 seconds (12 allocations: 7.630 MB)
@time r0false = g0(false,X,4.0); # 0.002984 seconds (12 allocations: 7.630 MB)
@time r1true = g1(true,X,4.0); # 0.004517 seconds (8 allocations: 7.630 MB, 26.28% gc time)
@time r1false = g1(false,X,4.0); # UndefVarError: f not defined
@time r2true = g2(PowerFun,X,4.0); # 0.085673 seconds (2.00 M allocations: 38.147 MB, 3.90% gc time)
@time r2false = g2(NoPowerFun,X,4.0); # 0.234087 seconds (2.00 M allocations: 38.147 MB, 60.61% gc time)
What is the optimal way to do this in Julia?
There's no need to use map here at all. Using a closure doesn't make things simpler or faster. Just use "dot-broadcasting" to apply the functions directly:
function g3(X,y,power=1)
if power != 1
return f1.(X, y, power) # or simply X .+ y^power
else
return f0.(X, y) # or simply X .+ y
end
end
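A quick usage sketch (values hypothetical):
X = 1.0:1_000_000.0
r_nopower = g3(X, 4.0)       # broadcasts f0: X .+ 4.0
r_power   = g3(X, 4.0, 2.0)  # broadcasts f1: X .+ 4.0^2.0, i.e. X .+ 16.0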

Fast tensor initialisation in Julia

I would like to initialize a 3d tensor (multi-dimensional array) with the values of the "diagonal Gaussian"
exp(-32*(u^2 + 16*(v^2 + w^2)))
where u = 1/sqrt(3)*(x+y+z) and v,w are any two coordinates orthogonal to u, discretised on a uniform mesh on [-1,1]^3. The following code achieves this:
function gaussian3d(n)
q = qr(ones(3,1), thin=false)[1]
x = linspace(-1.,1., n)
p = Array(Float64,(n,n,n))
square(x) = x*x
Base.@nloops 3 i p begin
@inbounds p[i_1,i_2,i_3] =
exp(
-32*(
square(q[1,1]*x[i_1] + q[2,1]*x[i_2] + q[3,1]*x[i_3])
+ 16*(
square(q[1,2]*x[i_1] + q[2,2]*x[i_2] + q[3,2]*x[i_3]) +
square(q[1,3]*x[i_1] + q[2,3]*x[i_2] + q[3,3]*x[i_3])
)
)
)
end
return p
end
It seems to be quite slow, however. For example, if I replace the defining function with exp(x*y*z), the code runs 50x faster. Also, the @time macro reports ~20% GC time for the above code, and I do not understand where it comes from. (These numeric values were obtained with n = 128.) My questions therefore are:
How can I speed up this piece of code?
Where is the memory allocation hidden which causes the GC overhead?
Knowing nothing of 3D tensors with values of the "diagonal Gaussian", using the square comment from the original post, "typing" q (@code_warntype helps here: big performance jump!), and further specializing the @nloops, this works much faster on the platforms I tried it on.
julia> square(x::Float64) = x * x
square (generic function with 1 method)
julia> function my_gaussian3d(n)
q::Array{Float64,2} = qr(ones(3,1), thin=false)[1]
x = linspace(-1.,1., n)
p = Array(Float64,(n,n,n))
Base.@nloops 3 i p d->x_d=x[i_d] begin
@inbounds p[i_1,i_2,i_3] =
exp(
-32*(
square(q[1,1]*x_1 + q[2,1]*x_2 + q[3,1]*x_3)
+ 16*(
square(q[1,2]*x_1 + q[2,2]*x_2 + q[3,2]*x_3) +
square(q[1,3]*x_1 + q[2,3]*x_2 + q[3,3]*x_3)
)
)
)
end
return p
end
my_gaussian3d (generic function with 1 method)
julia> @time gaussian3d(128);
elapsed time: 3.952389337 seconds (1264 MB allocated, 4.50% gc time in 57 pauses with 0 full sweep)
julia> @time gaussian3d(128);
elapsed time: 3.527316699 seconds (1264 MB allocated, 4.42% gc time in 58 pauses with 0 full sweep)
julia> @time my_gaussian3d(128);
elapsed time: 0.285837566 seconds (16 MB allocated)
julia> @time my_gaussian3d(128);
elapsed time: 0.28476448 seconds (16 MB allocated, 1.22% gc time in 0 pauses with 0 full sweep)
julia> my_gaussian3d(128) == gaussian3d(128)
true
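On modern Julia (1.x), where broadcasting fuses, the same tensor can be built without explicit loops. A sketch under the assumption that materializing qr(ones(3,1)).Q against the identity reproduces the full orthogonal factor; gaussian3d_bc is a hypothetical name:
using LinearAlgebra
function gaussian3d_bc(n)
    # full 3×3 orthogonal factor; its first column is proportional to (1,1,1)/sqrt(3)
    q = qr(ones(3, 1)).Q * Matrix{Float64}(I, 3, 3)
    x  = range(-1.0, 1.0, length = n)
    x1 = reshape(x, n, 1, 1)   # varies along dim 1
    x2 = reshape(x, 1, n, 1)   # varies along dim 2
    x3 = reshape(x, 1, 1, n)   # varies along dim 3
    u = q[1,1] .* x1 .+ q[2,1] .* x2 .+ q[3,1] .* x3
    v = q[1,2] .* x1 .+ q[2,2] .* x2 .+ q[3,2] .* x3
    w = q[1,3] .* x1 .+ q[2,3] .* x2 .+ q[3,3] .* x3
    return exp.(-32 .* (u.^2 .+ 16 .* (v.^2 .+ w.^2)))
end
This allocates a few n^3 temporaries (u, v, w) but stays type-stable.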

How can I do a bitwise-or reduction along an axis of a boolean array in Julia?

I'm trying to find the best way to do a bitwise-or reduction of a 3D boolean array of masks to 2D in Julia.
I can always write a for loop, of course:
x = randbool(3,3,3)
out = copy(x[:,:,1])
for i = 1:3
for j = 1:3
for k = 2:3
out[i,j] |= x[i,j,k]
end
end
end
But I'm wondering if there is a better way to do the reduction.
A simple answer would be
out = x[:,:,1] | x[:,:,2] | x[:,:,3]
but I did some benchmarking:
function simple(n,x)
out = x[:,:,1] | x[:,:,2]
for k = 3:n
@inbounds out |= x[:,:,k]
end
return out
end
function forloops(n,x)
out = copy(x[:,:,1])
for i = 1:n
for j = 1:n
for k = 2:n
@inbounds out[i,j] |= x[i,j,k]
end
end
end
return out
end
function forloopscolfirst(n,x)
out = copy(x[:,:,1])
for j = 1:n
for i = 1:n
for k = 2:n
@inbounds out[i,j] |= x[i,j,k]
end
end
end
return out
end
shorty(n,x) = |([x[:,:,i] for i in 1:n]...)
timholy(n,x) = any(x,3)
function runtest(n)
x = randbool(n,n,n)
@time out1 = simple(n,x)
@time out2 = forloops(n,x)
@time out3 = forloopscolfirst(n,x)
@time out4 = shorty(n,x)
@time out5 = timholy(n,x)
println(all(out1 .== out2))
println(all(out1 .== out3))
println(all(out1 .== out4))
println(all(out1 .== out5))
end
runtest(3)
runtest(500)
which gave the following results
# For 500
simple: 0.039403016 seconds (39716840 bytes allocated)
forloops: 6.259421683 seconds (77504 bytes allocated)
forloopscolfirst 1.809124505 seconds (77504 bytes allocated)
shorty: 0.050384062 seconds (39464608 bytes allocated)
timholy: 2.396887396 seconds (31784 bytes allocated)
So I'd go with simple or shorty
Try any(x, 3). Just typing a little more here so StackOverflow doesn't nix this response.
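For readers on modern Julia (1.x), the reduction API now takes a dims keyword; a hedged sketch of the equivalents (x a BitArray or Array{Bool,3}):
out = dropdims(any(x; dims = 3); dims = 3)         # logical any along dim 3
out = dropdims(reduce(|, x; dims = 3); dims = 3)   # explicit bitwise-or reduction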
There are various standard optimization tricks and hints that can be applied, but the critical observation to make here is that Julia organizes arrays in column-major rather than row-major order. For small arrays this is not easily seen, but when the arrays grow large it becomes telling. There is a function reduce provided that is optimized to apply a function repeatedly over a collection (in this case OR), but it comes at a cost. If the number of combining steps is relatively small, it's better to simply loop. In all cases, minimizing the number of memory accesses is better overall. Below are various attempts at optimization with these two things in mind.
Various Attempts and Observations
Initial function
Here's a function that takes your example and generalizes it.
function boolReduce1(x)
out = copy(x[:,:,1])
for i = 1:size(x,1)
for j = 1:size(x,2)
for k = 2:size(x,3)
out[i,j] |= x[i,j,k]
end
end
end
out
end
Creating a fairly large array (b = randbool(2000,2000,20), as in the benchmarks below), we can time its performance:
julia> @time boolReduce1(b);
elapsed time: 42.372058096 seconds (1056704 bytes allocated)
Applying optimizations
Here's another similar version but with the standard type hints, use of #inbounds and inverting the loops.
function boolReduce2(b::BitArray{3})
a = BitArray{2}(size(b)[1:2]...)
for j = 1:size(b,2)
for i = 1:size(b,1)
@inbounds a[i,j] = b[i,j,1]
for k = 2:size(b,3)
@inbounds a[i,j] |= b[i,j,k]
end
end
end
a
end
And take the time
julia> @time boolReduce2(b);
elapsed time: 12.892392891 seconds (500520 bytes allocated)
The insight
The 2nd function is a lot faster, and also less memory is allocated because a temporary array wasn't created. But what if we simply take the first function and invert the array indexing?
function boolReduce3(x)
out = copy(x[:,:,1])
for j = 1:size(x,2)
for i = 1:size(x,1)
for k = 2:size(x,3)
out[i,j] |= x[i,j,k]
end
end
end
out
end
and take the time now
julia> @time boolReduce3(b);
elapsed time: 12.451501749 seconds (1056704 bytes allocated)
That's just as fast as the 2nd function.
Using reduce
There is a function called reduce that we can use to eliminate the 3rd loop. It repeatedly applies an operation to the elements of a collection, feeding each result into the next application. This is exactly what we want.
function boolReduce4(b)
a = BitArray{2}(size(b)[1:2]...)
for j = 1:size(b,2)
for i = 1:size(b,1)
@inbounds a[i,j] = reduce(|,b[i,j,:])
end
end
a
end
Now take its time:
julia> @time boolReduce4(b);
elapsed time: 15.828273008 seconds (1503092520 bytes allocated, 4.07% gc time)
That's OK, but not even as fast as the simple optimized original. The reason shows in all of the extra memory that was allocated: data has to be copied from all over to produce input for reduce.
Combining things
But what if we push the insight as far as we can: instead of reducing over the last index, reduce over the first?
function boolReduceX(b)
a = BitArray{2}(size(b)[2:3]...)
for j = 1:size(b,3)
for i = 1:size(b,2)
@inbounds a[i,j] = reduce(|,b[:,i,j])
end
end
a
end
And now create a similar array and time it.
julia> c = randbool(200,2000,2000);
julia> @time boolReduceX(c);
elapsed time: 1.877547669 seconds (927092520 bytes allocated, 21.66% gc time)
Resulting in a function 20x faster than the original version for large arrays. Pretty good.
But what about medium sizes?
If the array is very large then the function above appears best, but if the data set is smaller, the use of reduce doesn't pay back enough and the following is faster. Including a temporary variable speeds things up relative to version 2. (Another version of boolReduceX using a loop instead of reduce, not shown here, was even faster.)
function boolReduce5(b)
a = BitArray{2}(size(b)[1:2]...)
for j = 1:size(b,2)
for i = 1:size(b,1)
@inbounds t = b[i,j,1]
for k = 2:size(b,3)
@inbounds t |= b[i,j,k]
end
@inbounds a[i,j] = t
end
end
a
end
julia> b = randbool(2000,2000,20);
julia> c = randbool(20,2000,2000);
julia> @time boolReduceX(c);
elapsed time: 1.535334322 seconds (799092520 bytes allocated, 23.79% gc time)
julia> @time boolReduce5(b);
elapsed time: 0.491410981 seconds (500520 bytes allocated)
It is faster to devectorize. It's just a matter of how much work you want to put in. The naïve devectorized approach is slow because it's a BitArray: extracting contiguous regions and bitwise OR can both be done a 64-bit chunk at a time, but the naïve devectorized approach operates an element at a time. On top of that, indexing BitArrays is slow, both because there is a sequence of bit operations involved and because it can't presently be inlined due to the bounds check. Here's a strategy that is devectorized but exploits the structure of the BitArray. Most of the code is copy-pasted from copy_chunks! in bitarray.jl and I didn't try to prettify it (sorry!).
function devec(n::Int, x::BitArray)
src = x.chunks
out = falses(n, n)
dest = out.chunks
numbits = n*n
kd0 = 1
ld0 = 0
for j = 1:n
pos_s = (n*n)*(j-1)+1
kd1, ld1 = Base.get_chunks_id(numbits - 1)
ks0, ls0 = Base.get_chunks_id(pos_s)
ks1, ls1 = Base.get_chunks_id(pos_s + numbits - 1)
delta_kd = kd1 - kd0
delta_ks = ks1 - ks0
u = Base._msk64
if delta_kd == 0
msk_d0 = ~(u << ld0) | (u << (ld1+1))
else
msk_d0 = ~(u << ld0)
msk_d1 = (u << (ld1+1))
end
if delta_ks == 0
msk_s0 = (u << ls0) & ~(u << (ls1+1))
else
msk_s0 = (u << ls0)
end
chunk_s0 = Base.glue_src_bitchunks(src, ks0, ks1, msk_s0, ls0)
dest[kd0] |= (dest[kd0] & msk_d0) | ((chunk_s0 << ld0) & ~msk_d0)
delta_kd == 0 && continue
for i = 1 : kd1 - kd0
chunk_s1 = Base.glue_src_bitchunks(src, ks0 + i, ks1, msk_s0, ls0)
chunk_s = (chunk_s0 >>> (64 - ld0)) | (chunk_s1 << ld0)
dest[kd0 + i] |= chunk_s
chunk_s0 = chunk_s1
end
end
out
end
With Iain's benchmarks, this gives me:
simple: 0.051321131 seconds (46356000 bytes allocated, 30.03% gc time)
forloops: 6.226652258 seconds (92976 bytes allocated)
forloopscolfirst: 2.099381939 seconds (89472 bytes allocated)
shorty: 0.060194226 seconds (46387760 bytes allocated, 36.27% gc time)
timholy: 2.464298752 seconds (31784 bytes allocated)
devec: 0.008734413 seconds (31472 bytes allocated)
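For intuition about why the chunk-level approach wins: a BitArray packs 64 elements into each UInt64 of its internal .chunks vector, so a single | combines 64 booleans at once. A minimal illustration on modern Julia (.chunks is an internal field, not public API):
using Random
x, y = bitrand(128), bitrand(128)   # bitrand replaces the old randbool
z = falses(128)
z.chunks .= x.chunks .| y.chunks    # two UInt64 ORs cover all 128 elements
@assert z == (x .| y)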
