Broadcasting arrays and summation without temporary matrix allocations in Julia

I need to perform a calculation of the form:
A = reshape(big_mat,m1,1,1,n1,n2,n3)
B = reshape(big_mat,1,m1,n1,1,n2,n3)
C = reshape(another_mat,m1,m1,n1,n1,1,1)
D = sum(A.*B.*C,dims=(5,6))
A.*B.*C creates a large temporary array of size (m1,m1,n1,n1,n2,n3). Given that D is only of size (m1,m1,n1,n1), is there a more efficient way to do this summation without writing for-loops?

You can ask the computer to write the loops for you. @einsum D[a,b,c,d] := mat[a,d,e,f] * mat[b,c,e,f] * another[a,b,c,d] will write 6 nested for loops, summing over e,f, which do not appear on the left.
@tullio will do the same, and add tiled memory access (and multi-threading), which should be a bit faster.
julia> using Einsum, Tullio, BenchmarkTools

julia> let
           n = 25
           big_mat = rand(n,n,n,n)
           another_mat = rand(n,n,n,n)
           D1 = @btime let n = $n
               A = reshape($big_mat, n,1,1,n, n,n)
               B = reshape($big_mat, 1,n,n,1, n,n)
               C = reshape($another_mat, n,n,n,n, 1,1)
               D = sum(A.*B.*C, dims=(5,6))
           end
           D2 = @btime @einsum D[a,b,c,d] := $big_mat[a,d,e,f] * $big_mat[b,c,e,f] * $another_mat[a,b,c,d]
           D3 = @btime @tullio D[a,b,c,d] := $big_mat[a,d,e,f] * $big_mat[b,c,e,f] * $another_mat[a,b,c,d]
           D1 ≈ D2 ≈ D3
       end
min 462.545 ms, mean 494.505 ms (20 allocations, 1.82 GiB)
min 213.126 ms, mean 214.412 ms (3 allocations, 2.98 MiB)
min 80.585 ms, mean 80.785 ms (53 allocations, 2.98 MiB)
true
julia> 2.98 * 25^2 # memory 2.98 MiB -> 1.82 GiB
1862.5
julia> @macroexpand1 @einsum D[a,b,c,d] := mat[a,d,e,f] * mat[b,c,e,f] * another[a,b,c,d]
quote  # this will print the loops
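The expansion is truncated here, but the generated code is roughly equivalent to hand-written loops like the following (a simplified sketch: the real macro output also adds bounds checks, and it does not hoist the another factor out of the inner loops; mat and another are assumed to be n×n×n×n, as in the benchmark above):

function contract_loops(mat, another)
    n = size(mat, 1)
    D = zeros(n, n, n, n)
    for d in 1:n, c in 1:n, b in 1:n, a in 1:n
        acc = 0.0
        for f in 1:n, e in 1:n                  # e, f are summed over
            acc += mat[a,d,e,f] * mat[b,c,e,f]
        end
        D[a,b,c,d] = acc * another[a,b,c,d]     # factor independent of e, f
    end
    return D
end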

Related

Julia multiplication of two arrays

Is there a way to speed up / write more elegantly this array multiplication (which, with numpy arrays, I would write as A*B)?
A = rand(8,15,10)
B = rand(10,5)
C = zeros(8,15,5)
for i in 1:8
    for j in 1:15
        for k in 1:10
            for l in 1:5
                C[i,j,l] = A[i,j,:]⋅B[:,l]
            end
        end
    end
end
There are a bunch of Julia packages which allow you to write your contraction in one simple line. Here are a few examples based on Einsum.jl, OMEinsum.jl, and TensorOperations.jl:
using OMEinsum
f_omeinsum(A,B) = ein"ijk,km->ijm"(A,B)

using Einsum
f_einsum(A,B) = @einsum C[i,j,l] := A[i,j,k] * B[k,l]

using TensorOperations
f_tensor(A,B) = @tensor C[i,j,l] := A[i,j,k] * B[k,l]
Apart from these elegant (and fast, see below) versions, you can improve your loop code quite a bit. Here is your code, wrapped in a function, and an improved version with comments:
function f(A,B)
    C = zeros(8,15,5)
    for i in 1:8
        for j in 1:15
            for k in 1:10
                for l in 1:5
                    C[i,j,l] = A[i,j,:]⋅B[:,l]
                end
            end
        end
    end
    return C
end
function f_fast(A,B)
    # check dimensions
    n1,n2,n3 = size(A)
    m1,m2 = size(B)
    @assert m1 == n3
    C = zeros(n1,n2,m2)
    # * @inbounds to skip bounds checks inside the loop
    # * different order of the loops to account for Julia's column-major order
    # * the k-loop (dot product) written out explicitly to avoid temporary allocations
    @inbounds for l in 1:m2
        for k in 1:m1
            for j in 1:n2
                for i in 1:n1
                    C[i,j,l] += A[i,j,k]*B[k,l]
                end
            end
        end
    end
    return C
end
Let's compare all approaches. First we check for correctness:
using Test
@test f(A,B) ≈ f_omeinsum(A,B) # Test passed
@test f(A,B) ≈ f_einsum(A,B)   # Test passed
@test f(A,B) ≈ f_tensor(A,B)   # Test passed
@test f(A,B) ≈ f_fast(A,B)     # Test passed
Now, let's benchmark using BenchmarkTools.jl. I put the timings on my machine as comments.
using BenchmarkTools
@btime f($A,$B);          # 663.500 μs (12001 allocations: 1.84 MiB)
@btime f_omeinsum($A,$B); # 33.799 μs (242 allocations: 20.20 KiB)
@btime f_einsum($A,$B);   # 4.200 μs (1 allocation: 4.81 KiB)
@btime f_tensor($A,$B);   # 2.367 μs (3 allocations: 4.94 KiB)
@btime f_fast($A,$B);     # 7.375 μs (1 allocation: 4.81 KiB)
As we can see, all the einsum/tensor-notation approaches are much faster than your original loop implementation, and they are one-liners! The performance of our f_fast is in the same ballpark, but still quite a bit behind f_tensor, which is the fastest.
Finally, let's go all out for performance, because we can. Utilizing the wizardry from LoopVectorization.jl, we replace the @inbounds in f_fast with @avx (we call this new version f_avx below) and automagically get another 2x speedup relative to the f_tensor performance above:
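A sketch of f_avx under that description (assuming LoopVectorization.jl is installed; in newer releases of the package the macro is called @turbo):

using LoopVectorization

function f_avx(A,B)
    # same as f_fast, but the loop nest is handed to @avx instead of @inbounds
    n1,n2,n3 = size(A)
    m1,m2 = size(B)
    @assert m1 == n3
    C = zeros(n1,n2,m2)
    @avx for l in 1:m2, k in 1:m1, j in 1:n2, i in 1:n1
        C[i,j,l] += A[i,j,k]*B[k,l]
    end
    return C
end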
@test f(A,B) ≈ f_avx(A,B) # Test passed
@btime f_avx($A,$B);      # 930.769 ns (1 allocation: 4.81 KiB)
However, because of its simplicity I'd still prefer f_tensor unless every microsecond counts in your application.

Order of linear algebra operations in Julia

If I have a command y = A*B*x, where A & B are large matrices and x & y are vectors, will Julia perform y = ((A*B)*x) or y = (A*(B*x))?
The second option should be the best as it only has to allocate an extra vector rather than a large matrix.
The best way to verify this kind of thing is to dump the lowered code via the @code_lowered macro:

julia> @code_lowered A * B * x
CodeInfo(:(begin
        nothing
        return (Core._apply)(Base.afoldl, (Core.tuple)(Base.*, (a * b) * c), xs)
    end))
Like many other languages, Julia does y = (A*B)*x instead of y = A*(B*x), so it's up to you to explicitly use parens to reduce the allocation.
julia> using BenchmarkTools

julia> @btime $A * ($B * $x);
  6.800 μs (2 allocations: 1.75 KiB)

julia> @btime $A * $B * $x;
  45.453 μs (3 allocations: 79.08 KiB)
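To see why the grouping matters: for n×n matrices and a length-n vector, (A*B)*x performs an O(n^3) matrix-matrix product followed by an O(n^2) matrix-vector product, whereas A*(B*x) is just two O(n^2) matrix-vector products. If you do this often, a small helper can encode the cheap right-to-left ordering (chain_times_vector is a hypothetical name, not from any package):

# evaluates A1*(A2*(...*(Ak*x))) right-to-left,
# so every intermediate result is a vector, never a matrix
chain_times_vector(x, Ms...) = foldr(*, Ms; init = x)

# usage: y = chain_times_vector(x, A, B)   # computes A * (B * x)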

What is the fastest way to compute the sum of outer products [Julia]

In Julia, what is the fastest way to compute A = y_1*y_1' + y_2*y_2' + ... + y_T*y_T', where y_t is an n-dimensional column vector of variables at time t?
In Julia code, one option is:
A = zeros(n,n);
for j = 1:T
    A = A + Y[j,:]'*Y[j,:];
end
where

Y = [y_1'
     ...
     y_T']

is a (T×n) matrix.
But is there a faster way? Thanks.
For comparison, I have tried several implementations for computing the A matrix (which I hope is what the OP wants...), including built-in matrix multiplication, BLAS.ger!, and explicit loops:
print_(x) = print(rpad(x,12))

# built-in vector * vector'
function perf0v( n, T, Y )
    print_("perf0v")
    out = zeros(n,n)
    for t = 1 : T
        out += slice( Y, :,t ) * slice( Y, :,t )'
    end
    return out
end

# built-in matrix * matrix'
function perf0m( n, T, Y )
    print_("perf0m")
    out = Y * Y'
    return out
end

# BLAS.ger!
function perf1( n, T, Y )
    print_("perf1")
    out = zeros(n,n)
    for t = 1 : T
        BLAS.ger!( 1.0, Y[ :,t ], Y[ :,t ], out )
    end
    return out
end

# BLAS.ger! with sub
function perf1sub( n, T, Y )
    print_("perf1sub")
    out = zeros(n,n)
    for t = 1 : T
        BLAS.ger!( 1.0, sub( Y, :,t ), sub( Y, :,t ), out )
    end
    return out
end

# explicit loop
function perf2( n, T, Y )
    print_("perf2")
    out = zeros(n,n)
    for t = 1 : T,
        i2 = 1 : n,
        i1 = 1 : n
        out[ i1, i2 ] += Y[ i1, t ] * Y[ i2, t ]
    end
    return out
end

# explicit loop with simd
function perf2simd( n, T, Y )
    print_("perf2simd")
    out = zeros(n,n)
    for i2 = 1 : n,
        i1 = 1 : n
        @simd for t = 1 : T
            out[ i1, i2 ] += Y[ i1, t ] * Y[ i2, t ]
        end
    end
    return out
end

# transposed perf2
function perf2tr( n, T, Yt )
    print_("perf2tr")
    out = zeros(n,n)
    for t = 1 : T,
        i2 = 1 : n,
        i1 = 1 : n
        out[ i1, i2 ] += Yt[ t, i1 ] * Yt[ t, i2 ]
    end
    return out
end

# transposed perf2simd
function perf2simdtr( n, T, Yt )
    print_("perf2simdtr")
    out = zeros(n,n)
    for i2 = 1 : n,
        i1 = 1 : n
        @simd for t = 1 : T
            out[ i1, i2 ] += Yt[ t, i1 ] * Yt[ t, i2 ]
        end
    end
    return out
end
#.........................................................
n = 100
T = 1000
@show n, T
Y = rand( n, T )
Yt = copy( Y' )
out = Dict()
for loop = 1:2
    println("loop = ", loop)
    for fn in [ perf0v, perf0m, perf1, perf1sub, perf2, perf2simd ]
        @time out[ fn ] = fn( n, T, Y )
    end
    for fn in [ perf2tr, perf2simdtr ]
        @time out[ fn ] = fn( n, T, Yt )
    end
end
# Check
error = 0.0
for k1 in keys( out ),
    k2 in keys( out )
    @assert sumabs( out[ k1 ] ) > 0.0
    @assert sumabs( out[ k2 ] ) > 0.0
    error += sumabs( out[ k1 ] - out[ k2 ] )
end
@show error
The result obtained with julia -O --check-bounds=no test.jl (v0.4.5) is:
(n,T) = (100,1000)
loop = 2
perf0v 0.056345 seconds (15.04 k allocations: 154.803 MB, 31.66% gc time)
perf0m 0.000785 seconds (7 allocations: 78.406 KB)
perf1 0.155182 seconds (5.96 k allocations: 1.846 MB)
perf1sub 0.155089 seconds (8.01 k allocations: 359.625 KB)
perf2 0.011192 seconds (6 allocations: 78.375 KB)
perf2simd 0.016677 seconds (6 allocations: 78.375 KB)
perf2tr 0.011698 seconds (6 allocations: 78.375 KB)
perf2simdtr 0.009682 seconds (6 allocations: 78.375 KB)
and for some different values of n & T:
(n,T) = (1000,100)
loop = 2
perf0v 0.610885 seconds (2.01 k allocations: 1.499 GB, 25.11% gc time)
perf0m 0.008866 seconds (9 allocations: 7.630 MB)
perf1 0.182409 seconds (606 allocations: 9.177 MB)
perf1sub 0.180720 seconds (806 allocations: 7.657 MB, 0.67% gc time)
perf2 0.104961 seconds (6 allocations: 7.630 MB)
perf2simd 0.119964 seconds (6 allocations: 7.630 MB)
perf2tr 0.137186 seconds (6 allocations: 7.630 MB)
perf2simdtr 0.103878 seconds (6 allocations: 7.630 MB)
(n,T) = (2000,100)
loop = 2
perf0v 2.514622 seconds (2.01 k allocations: 5.993 GB, 24.38% gc time)
perf0m 0.035801 seconds (9 allocations: 30.518 MB)
perf1 0.473479 seconds (606 allocations: 33.591 MB, 0.04% gc time)
perf1sub 0.475796 seconds (806 allocations: 30.545 MB, 0.95% gc time)
perf2 0.422808 seconds (6 allocations: 30.518 MB)
perf2simd 0.488539 seconds (6 allocations: 30.518 MB)
perf2tr 0.554685 seconds (6 allocations: 30.518 MB)
perf2simdtr 0.400741 seconds (6 allocations: 30.518 MB)
(n,T) = (3000,100)
loop = 2
perf0v 5.444797 seconds (2.21 k allocations: 13.483 GB, 20.77% gc time)
perf0m 0.080458 seconds (9 allocations: 68.665 MB)
perf1 0.927325 seconds (806 allocations: 73.261 MB, 0.02% gc time)
perf1sub 0.926690 seconds (806 allocations: 68.692 MB, 0.51% gc time)
perf2 0.958189 seconds (6 allocations: 68.665 MB)
perf2simd 1.067098 seconds (6 allocations: 68.665 MB)
perf2tr 1.765001 seconds (6 allocations: 68.665 MB)
perf2simdtr 0.902838 seconds (6 allocations: 68.665 MB)
Hmm, so the built-in matrix * matrix (Y * Y') was fastest. It seems that BLAS gemm is ultimately called (from the output of @less Y * Y').
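As a side note: since Y * Y' is symmetric, one could in principle halve the work by computing only one triangle with the symmetric rank-k BLAS update. A hedged sketch (on current Julia, where the BLAS wrappers live in LinearAlgebra):

using LinearAlgebra

# upper triangle of 1.0 * Y * Y' via BLAS syrk; Symmetric makes it
# usable as a full matrix without copying the lower half
A = Symmetric(BLAS.syrk('U', 'N', 1.0, Y), :U)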
If you know the components of y_t in advance, the simplest, easiest, and likely fastest way is simply:
A = Y*Y'
where the different values of y_t are stored as columns in the matrix Y.
If you don't know the component of y_t in advance, you can use BLAS:
n = 100;
t = 1000;
Y = rand(n,t);
outer_sum = zeros(n,n);
for tdx = 1:t
    BLAS.ger!(1.0, Y[:,tdx], Y[:,tdx], outer_sum)
end
See this post (for a similar example) in case you're new to BLAS and would like help interpreting the arguments in this function here.
One of the key things here is to store the y_t vectors as columns rather than as rows of Y because accessing columns is much faster than accessing rows. See the Julia performance tips for more info on this.
Update: for the second option (where it isn't known in advance what the components of Y will be), BLAS will sometimes, but not always, be fastest. The big determining factor is the size of the vectors you're using. Calling BLAS incurs a certain overhead and so is only worthwhile in certain settings. Julia's native matrix multiplication will automatically choose whether to use BLAS, and will generally do a good job with it. But if you know ahead of time that you're dealing with a situation where BLAS will be optimal, then you can save the Julia optimizer some work (and thus speed up your code) by specifying it ahead of time.
See the great response by roygvib above. It presents a LOT of creative and instructive ways to compute this sum of outer products. Many will be faster than BLAS in certain situations. From the time trials that roygvib presents, it looks like the break-even point is around n = 3000.
For the sake of completion, here is another, vectorised approach:
Assume Y is as follows:
julia> Y = rand(1:10, 10,5)
10×5 Array{Int64,2}:
2 1 6 2 10
8 2 6 8 2
2 10 10 4 6
5 9 8 5 1
5 4 9 9 4
4 6 3 4 8
2 9 2 8 1
6 8 5 10 2
1 7 10 6 9
8 7 10 10 8
julia> Y = reshape(Y, 10,5,1); # add a singleton 3rd dimension, so we are
                               # allowed to shuffle the dimensions
The idea is that you create one array which is defined in dimensions 1 and 3 and only has one column, and array-multiply it by an array which is defined in dimensions 2 and 3 and only has one row. Your 'time' variable varies along dimension 3. This essentially results in the individual Kronecker products from each timestep, concatenated along the time (i.e. 3rd) dimension.
julia> KroneckerProducts = permutedims(Y, [2,3,1]) .* permutedims(Y, [3,2,1]);
Now it wasn't clear to me if your end result was meant to be an n×n matrix, resulting from the sum over all timesteps at each 'Kronecker' position
julia> sum(KroneckerProducts, 3)
5×5×1 Array{Int64,3}:
[:, :, 1] =
243 256 301 324 192
256 481 442 427 291
301 442 555 459 382
324 427 459 506 295
192 291 382 295 371
or simply the sum of all the elements in that massive 3D array
julia> sum(KroneckerProducts)
8894
Choose your preferred poison :p
I'm not sure this will be faster than Michael's approach above, since the permutedims step is presumably expensive, and for very large arrays it may actually be a bottleneck (but I don't know how it's implemented in Julia ... maybe it's not!), so it may not necessarily perform better than a simple loop iterating over each timestep, even though it's "vectorised code". You can try both approaches and see for yourself which is fastest for your particular arrays!

Updating a dense vector by a sparse vector in Julia is slow

I am using Julia version 0.4.5 and I am experiencing the following issue:
As far as I know, taking the inner product of a sparse vector with a dense vector should be about as fast as updating the dense vector by a sparse vector. Yet the latter is much slower:
A = sprand(100000,100000,0.01)
w = rand(100000)

@time for i = 1:100000
    w += A[:,i]
end
26.304380 seconds (1.30 M allocations: 150.556 GB, 8.16% gc time)

@time for i = 1:100000
    A[:,i]'*w
end
0.815443 seconds (921.91 k allocations: 1.540 GB, 5.58% gc time)
I created a simple sparse matrix type of my own, and there the addition code performed about the same as the inner product.
Am I doing something wrong? I feel like there should be a special function for the operation w += A[:,i], but I couldn't find it.
Any help is appreciated.
I asked the same question on GitHub and we came to the following conclusion. The type SparseVector was added in Julia 0.4, and with it the BLAS function LinAlg.axpy!, which updates a (possibly dense) vector x in place by a sparse vector y multiplied by a scalar a, i.e. performs x += a*y efficiently. However, in Julia 0.4 it is not implemented properly. It works only in Julia 0.5:
@time for i = 1:100000
    LinAlg.axpy!(1, A[:,i], w)
end
1.041587 seconds (799.49 k allocations: 1.530 GB, 8.01% gc time)
However, this code is still sub-optimal, as it creates the SparseVector A[:,i]. One can get an even faster version with the following function:
function upd!(w, A, i, c)
    rowval = A.rowval                 # row indices of the stored entries
    nzval = A.nzval                   # the stored values
    @inbounds for j in nzrange(A, i)  # range of entries stored in column i
        w[rowval[j]] += c * nzval[j]
    end
    return w
end

@time for i = 1:100000
    upd!(w, A, i, 1)
end
0.500323 seconds (99.49 k allocations: 1.518 MB)
This is exactly what I needed to achieve; after some research we managed to get there. Thanks everyone!
Assuming you want to compute w += c * A[:, i], there is an easy way to vectorize it:
julia> A = sprand(100000, 100000, 0.01);

julia> c = rand(100000);

julia> r1 = zeros(100000);

julia> @time for i = 1:100000
           r1 += A[:, i] * c[i]
       end
29.997412 seconds (1.90 M allocations: 152.077 GB, 12.73% gc time)

julia> @time r2 = sum(A .* c', 2);
1.191850 seconds (50 allocations: 1.493 GB, 0.14% gc time)

julia> all(r1 == r2)
true
First, create a vector c of the constants to multiply with. Then multiply the columns of A element-wise by the values of c (A .* c' broadcasts internally). Last, reduce over the columns of A (the sum(..., 2) part).
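For completeness, the same weighted column sum can also be written as an explicit loop over only the stored entries, in the spirit of upd! above. A sketch (weighted_colsum is a hypothetical name; on current Julia, SparseMatrixCSC lives in SparseArrays):

using SparseArrays

# r = c[1]*A[:,1] + ... + c[end]*A[:,end], touching only stored entries
function weighted_colsum(A::SparseMatrixCSC, c::AbstractVector)
    rv, nz = rowvals(A), nonzeros(A)
    r = zeros(size(A, 1))
    @inbounds for i in 1:size(A, 2)
        ci = c[i]
        for j in nzrange(A, i)   # stored entries of column i
            r[rv[j]] += ci * nz[j]
        end
    end
    return r
end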

Compute sum_i f(i) x(i) x(i)' fast?

I'm trying to compute the summation of f(i) * x(i) * x(i)', where x(i) is a column vector, x(i)' is its transpose, and f(i) is a scalar. So it's a weighted sum of outer products.
In MATLAB, this can be achieved pretty fast by using bsxfun.
The following code runs in 260 ms on my laptop (MacBook Air 2010)
N = 1e5;
d = 100;
f = randn(N, 1);
x = randn(N, d);
% H = zeros(d, d);
tic;
H = x' * bsxfun(@times, f, x);
toc
I've been trying to make Julia do the same job, but I can't get it to go faster.
N = int(1e5);
d = 100;
f = randn(N);
x = randn(N, d);

function hess1(x, f)
    N, d = size(x);
    temp = zeros(N, d);
    @simd for kk = 1:N
        @inbounds temp[kk, :] = f[kk] * x[kk, :];
    end
    H = x' * temp;
end

function hess2(x, f)
    N, d = size(x);
    H2 = zeros(d,d);
    @simd for k = 1:N
        @inbounds H2 += f[k] * x[k, :]' * x[k, :];
    end
    return H2
end

function hess3(x, f)
    N, d = size(x);
    H3 = zeros(d,d);
    for k = 1:N
        for k1 = 1:d
            @simd for k2 = 1:d
                @inbounds H3[k1, k2] += x[k, k1] * x[k, k2] * f[k];
            end
        end
    end
    return H3
end
The results are
@time H1 = hess1(x, f);
@time H2 = hess2(x, f);
@time H3 = hess3(x, f);
elapsed time: 0.776116469 seconds (262480224 bytes allocated, 26.49% gc time)
elapsed time: 30.496472345 seconds (16385442496 bytes allocated, 56.07% gc time)
elapsed time: 2.769934563 seconds (80128 bytes allocated)
hess1 is like MATLAB's bsxfun but slower, and hess3 uses no temporary memory but is significantly slower. My best Julia code is 3 times slower than MATLAB.
How can I make this Julia code faster?
IJulia gist: http://nbviewer.ipython.org/gist/memming/669fb8e78af3338ebf6f
Julia version: 0.3.0-rc1
EDIT:
I tested on a more powerful computer (3.5 GHz Intel i7, 4 cores, 256 kB L2, 8 MB L3):
MATLAB R2014a without -singleCompThread: 0.053 s
MATLAB R2014a with -singleCompThread: 0.080 s (@tholy's suggestion)
Julia 0.3.0-rc1
hess1 elapsed time: 0.215406904 seconds (262498648 bytes allocated, 32.74% gc time)
hess2 elapsed time: 10.722578699 seconds (16384080176 bytes allocated, 62.20% gc time)
hess3 elapsed time: 1.065504355 seconds (80176 bytes allocated)
bsxfunstyle elapsed time: 0.063540168 seconds (80081072 bytes allocated, 25.04% gc time) (@IainDunning's solution)
Indeed, using broadcast is much faster and comparable to MATLAB's bsxfun.
You are looking for the broadcast function. Here is the relevant issue discussing the functionality and naming.
I implemented your version as well as a broadcast version, here is what I found:
srand(1988)
N = 100_000
d = 100
f = randn(N, 1)
x = randn(N, d)

function hess1(x, f)
    N, d = size(x);
    temp = zeros(N, d);
    @simd for kk = 1:N
        @inbounds temp[kk, :] = f[kk] * x[kk, :];
    end
    H = x' * temp;
end

function bsxfunstyle(x, f)
    x' * broadcast(*, f, x)
end

# Warmup
hess1(x, f)
bsxfunstyle(x, f)

# For real
println("Hess1")
@time H1 = hess1(x, f)
println("Broadcast")
@time H2 = bsxfunstyle(x, f)

# Check solutions are identical
println(sum(abs(H1-H2)))
with output
Hess1
elapsed time: 0.324256216 seconds (262498648 bytes allocated, 33.95% gc time)
Broadcast
elapsed time: 0.126647594 seconds (80080696 bytes allocated, 20.22% gc time)
0.0
There are several performance issues with your functions:
- you're creating temporary arrays with x[kk, :];
- you are traversing the matrix by rows, while it is stored in column-major order;
- you are using x' (which first transposes the matrix) rather than At_mul_B(x, ...).
A simple modification gives better performance:
N = 100_000
d = 100
f = randn(N)
x = randn(N, d)

function hess(x, f)
    N, d = size(x);
    temp = zeros(N, d);
    @inbounds for k1 = 1:d
        @simd for kk = 1:N
            temp[kk, k1] = f[kk] * x[kk, k1]
        end
    end
    H = At_mul_B(x, temp)
end

@time hess(x, f)
# 0.067636 seconds (9 allocations: 76.371 MB, 11.24% gc time)
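As a side note for readers on current Julia (1.x): broadcast(*, f, x) is now written f .* x, and At_mul_B(x, temp) is x' * temp with a lazy transpose, so the whole bsxfun-style computation collapses to a one-liner:

# modern equivalent of the bsxfun-style solution above;
# f .* x scales row k of x by f[k], then x' * (...) contracts over the rows
hess_broadcast(x, f) = x' * (f .* x)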
