Strange Memory Allocations with DifferentialEquations.jl

I'm observing very strange and large memory allocations while solving a standard ODE and evaluating it at given times. However, after the first benchmark, if (and only if!) I recompile the integrated function, these allocations disappear.
using BenchmarkTools
import DifferentialEquations: Vern9, ODEProblem, solve
The function I'm integrating is:
@inbounds function dyn_twobody!(du::Vector{Float64}, u::Vector{Float64},
                                GM::Float64, t::Float64)
    r = (u[1]*u[1] + u[2]*u[2] + u[3]*u[3])^1.5  # note: this is |r|^3, used in the denominators below
    du[1] = u[4]
    du[2] = u[5]
    du[3] = u[6]
    du[4] = -GM*u[1]/r
    du[5] = -GM*u[2]/r
    du[6] = -GM*u[3]/r
    nothing
end
while test is the function that evaluates the solution at the given times:
function test(x::Vector{Float64}, tms::Vector{Float64})
    GM_MOON = 4.902780137400001e3
    user_prob = ODEProblem(dyn_twobody!, x, [0., 86400], GM_MOON)
    user_sol = solve(user_prob, Vern9(), abstol=1e-10, reltol=1e-9, saveat=tms, save_everystep=false).u
    return user_sol
end
Calling it with:
x = [50500., 232., 32321., -1.2, 0.01, 0.3];
tms = collect(LinRange(0., 86400., 86400));
@benchmark test($x, $tms)
yields:
BenchmarkTools.Trial: 30 samples with 1 evaluation.
Range (min … max): 129.553 ms … 535.830 ms ┊ GC (min … max): 11.32% … 74.42%
Time (median): 151.338 ms ┊ GC (median): 14.82%
Time (mean ± σ): 167.691 ms ± 71.469 ms ┊ GC (mean ± σ): 21.46% ± 11.46%
██
▆▇██▆▁▆▁▆▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
130 ms Histogram: frequency by time 536 ms <
Memory estimate: 304.61 MiB, allocs estimate: 432505.
If I recompile dyn_twobody!, the new benchmark is:
BenchmarkTools.Trial: 81 samples with 1 evaluation.
Range (min … max): 49.294 ms … 106.972 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 59.441 ms ┊ GC (median): 0.00%
Time (mean ± σ): 62.334 ms ± 12.016 ms ┊ GC (mean ± σ): 2.41% ± 4.19%
█ ▁ ▁▁ ▃
█▆█▇▄▇▇██▇▆▇▇█▁▇▆▆▆▆▄▆▆▆▄▄▁▁▄▁▁▆▆▁▁▁▄▁▁▁▁▁▁▁▆▁▁▄▁▄▄▁▁▁▁▁▁▁▁▄ ▁
49.3 ms Histogram: frequency by time 98.8 ms <
Memory estimate: 17.22 MiB, allocs estimate: 173311.
Any idea what's happening with my code? I see the garbage collector is taking quite a lot of time.
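One way to make the measurement cleaner (a sketch, not a diagnosis of the recompile behavior; it assumes remake, which DifferentialEquations.jl re-exports): build the ODEProblem once outside the benchmarked call and only swap in the initial condition, so each sample measures the solve alone. test2 is a hypothetical variant of test above:
import DifferentialEquations: remake
prob = ODEProblem(dyn_twobody!, x, (0., 86400.), 4.902780137400001e3)
# remake swaps u0 without reconstructing the problem each call
test2(prob, x, tms) = solve(remake(prob; u0 = x), Vern9(),
                            abstol=1e-10, reltol=1e-9,
                            saveat=tms, save_everystep=false).u
@benchmark test2($prob, $x, $tms)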


Extract value from DataFrame as Float64 value not as Vector

Below is a sample DataFrame; I want to extract the bw value of Rat as a Float64, not as a Vector{Float64}.
using DataFrames
df = DataFrame(id=["Mouse","Rat"],
time=[1,1],
bw=[25.0,100.45])
I get a Vector{Float64} when I use the below code.
df[in.(df.id, Ref(["Rat"])), :bw]
Can you please tell me how I can get only the value bw=100.45 from the DataFrame? I don't want to use the code below; instead I want to reference id=="Rat" and look it up that way, to be sure I am getting the correct value from a larger dataset.
df[2, :bw]
Thanks again...!
You can do:
julia> only(df[in.(df.id, Ref(["Rat"])), :bw])
100.45
The reason why you are getting a vector back is that you are indexing with a vector. This is consistent with base Julia:
julia> x = ["a", "b"]
2-element Vector{String}:
"a"
"b"
julia> x[[true, false]]
1-element Vector{String}:
"a"
So another option is to index with a scalar:
julia> df[findfirst(in.(df.id, Ref(["Rat"]))), :bw]
100.45
Here findfirst returns the index of the first true element of your indexing vector, which is a scalar.
If you don't want to use only as others have suggested, you can do:
julia> df[in.(df.id, Ref(["Rat"])), :bw][1]
100.45
I think this is a bit more legible, although the performance looks almost identical:
julia> @benchmark df[in.(df.id, Ref(["Rat"])), :bw][1]
BenchmarkTools.Trial: 10000 samples with 210 evaluations.
Range (min … max): 358.929 ns … 23.962 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 369.643 ns ┊ GC (median): 0.00%
Time (mean ± σ): 396.594 ns ± 523.519 ns ┊ GC (mean ± σ): 4.35% ± 3.61%
█
▂██▇▄▄▃▄▄▄▃▃▃▄▆▆▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂ ▃
359 ns Histogram: frequency by time 472 ns <
Memory estimate: 320 bytes, allocs estimate: 8.
julia> @benchmark only(df[in.(df.id, Ref(["Rat"])), :bw])
BenchmarkTools.Trial: 10000 samples with 212 evaluations.
Range (min … max): 354.557 ns … 17.193 μs ┊ GC (min … max): 0.00% … 97.69%
Time (median): 366.354 ns ┊ GC (median): 0.00%
Time (mean ± σ): 395.770 ns ± 593.797 ns ┊ GC (mean ± σ): 5.57% ± 3.64%
█▅
▄██▆▃▃▄▃▃▃▃▃▆▆▄▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
355 ns Histogram: frequency by time 481 ns <
Memory estimate: 320 bytes, allocs estimate: 8.
If the :id field is the 'primary key' of df, it might be worth sorting df by this field and using binary search for quick access. For example:
julia> sort!(df, :id)
2×3 DataFrame
Row │ id time bw
│ String Int64 Float64
─────┼────────────────────────
1 │ Mouse 1 25.0
2 │ Rat 1 100.45
julia> df[searchsortedfirst(df.id, "Rat"), :bw]
100.45
It seems to me like the simplest and most direct solution is to use findfirst:
julia> findfirst(==("Rat"), df.id)
2
Then the complete solution is:
julia> df[findfirst(==("Rat"), df.id), :bw]
100.45
It is also the fastest solution so far, unless the keys are pre-sorted.
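If the same lookup will be repeated many times on a large table, it may also pay to build an index once. A minimal sketch using a plain Dict from id to row number (assuming the ids are unique; lookup is a hypothetical name):
julia> lookup = Dict(df.id .=> 1:nrow(df));  # id => row number
julia> df[lookup["Rat"], :bw]
100.45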

Suitable data structure to gain advantage in performance for broad data in Julia

I've come from Python to Julia seeking better performance, and then I fell in love with this beautiful language for its flexibility even before its speed. After a while, I'm porting my significant projects from Python to Julia. But in some cases there are gaps in how I can get Julia's best performance for tasks with broad data (say, data with thousands of rows and a thousand columns).
In most discussions, Julia's built-in Arrays come off relatively badly for fast computation, since their size isn't inferable by the compiler, and this can hurt performance in some cases. But array-like structures are the only real option for scientific computing. Can I do those computations with Tuples? No way: there is at least (and optimistically) a phase of updating elements in the procedure, which Tuples can't handle since they are immutable.
I know Julia has StaticArrays.jl, which provides immutable data structures whose length is known at compile time (since they are static, I guess), and this leads to a speedup at runtime. But when I have data with 5000 rows and 1000 columns, creating an SMatrix isn't possible:
julia> rnd = SMatrix{5000, 1000, Float64}(rand(5000, 1000));
[process exited with code 3221225725 (0xc00000fd)]
So I've seen that some developers avoid SMatrix in their source code and instead create a Vector of SVectors, since this is feasible (in contrast with the SMatrix approach) and relatively fast:
julia> @benchmark [SVector{1000, Float64}(rand(1000)) for _∈1:5000]
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 40.097 ms … 76.436 ms ┊ GC (min … max): 0.00% … 30.38%
Time (median): 49.359 ms ┊ GC (median): 19.95%
Time (mean ± σ): 50.296 ms ± 6.551 ms ┊ GC (mean ± σ): 17.96% ± 8.91%
▃ ▃██▆▆▄▁▃ ▃▁
█▄▆▁▆▆▄▁▄▆▄▇▇████████▆▄██▆▁▄▁▆▄▁▁▄▄▁▁▆▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▄
40.1 ms Histogram: frequency by time 72.5 ms <
Memory estimate: 76.90 MiB, allocs estimate: 5002.
But you can't perform fast operations on it:
julia> rnd_sv1 = [SVector{1000, Float64}(rand(1000)) for _∈1:5000];
julia> rnd_sv2 = [SVector{1000, Float64}(rand(1000)) for _∈1:5000];
julia> function dot(sv1, sv2)
           result = Vector{SVector{1000, Float64}}(undef, 5000)
           for idx ∈ eachindex(sv1)
               result[idx] = SVector{1000, Float64}([sv1[idx][i]*sv2[idx][i] for i ∈ 1:1000])
           end
           result
       end;
julia> @benchmark dot($rnd_sv1, $rnd_sv2)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 3.211 s … 3.232 s ┊ GC (min … max): 0.27% … 0.00%
Time (median): 3.221 s ┊ GC (median): 0.13%
Time (mean ± σ): 3.221 s ± 14.793 ms ┊ GC (mean ± σ): 0.13% ± 0.19%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
3.21 s Histogram: frequency by time 3.23 s <
Memory estimate: 76.90 MiB, allocs estimate: 5002.
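Part of that slowness is avoidable even with this layout: the inner comprehension allocates a plain Vector on the heap and only then converts it to an SVector. Broadcasting the stored SVectors directly keeps everything static-sized (a sketch; note that StaticArrays.jl itself recommends static arrays only up to roughly 100 elements, so length-1000 SVectors are hard on the compiler either way):
julia> function dot2(sv1, sv2)
           result = Vector{SVector{1000, Float64}}(undef, 5000)
           for idx ∈ eachindex(sv1)
               result[idx] = sv1[idx] .* sv2[idx]  # elementwise product, stays an SVector
           end
           result
       end;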
While I can do it much faster using regular matrices and the matrix multiplication concept:
julia> rnd_mat1 = rand(5000, 1000);
julia> rnd_mat2 = rand(1000, 5000);
julia> @benchmark $rnd_mat1 * $rnd_mat2
BenchmarkTools.Trial: 7 samples with 1 evaluation.
Range (min … max): 706.033 ms … 857.549 ms ┊ GC (min … max): 0.06% … 12.98%
Time (median): 770.682 ms ┊ GC (median): 5.71%
Time (mean ± σ): 775.272 ms ± 61.905 ms ┊ GC (mean ± σ): 6.11% ± 6.41%
█ █ █ █ █ █ █
█▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ ▁
706 ms Histogram: frequency by time 858 ms <
Memory estimate: 190.73 MiB, allocs estimate: 2.
I know, this result was expected, because with SVectors I couldn't take advantage of matrix multiplication (however, this is not my fault :D).
Maybe you ask why I didn't create SMatrixes with the same dimensions as rnd_mat1 and rnd_mat2: because it's not possible to create SMatrixes of this size! So choosing the best data structure in similar cases is crucial, and I want to know the best data structure in Julia for data with a massive number of rows and columns for scientific computing (real-world problems may reach millions of rows, and perhaps even higher dimensions).
Writing Julia code is easy, but I find it hard to write the kind of optimal Julia code that dramatically distinguishes it from other languages like Python. So I want help and pointers on choosing the best data structure for scientific computation on data of massive size, if that's possible.
I wanted to say that Julia is fast, like really fast, especially when carefully written. I'll give some examples below to show how Julia code compares to some of the most optimized code out there.
Here is matrix multiplication written in Julia code with the help of Julia packages. I used Tullio.jl, which is similar to TensorCast.jl, along with LoopVectorization.jl.
using Tullio, LoopVectorization
function matgen(n)
    tmp = 1 / n^2
    [tmp * (i-j) * (i+j-2) for i = 1:n, j = 1:n]
end

function matmul(a, b)
    # transpose for cache-friendliness
    aT = transpose(a)
    @tullio out[i,j] := aT[k,i] * b[k,j]
end
n = 1500
a = matgen(n)
b = matgen(n)
Now let's benchmark my code against OpenBLAS (spoiler: the Julia code matches OpenBLAS, or even slightly outperforms it on average):
@benchmark matmul($a, $b)
BenchmarkTools.Trial: 91 samples with 1 evaluation.
Range (min … max): 50.620 ms … 80.234 ms ┊ GC (min … max): 0.00% … 15.08%
Time (median): 52.428 ms ┊ GC (median): 0.00%
Time (mean ± σ): 55.428 ms ± 5.976 ms ┊ GC (mean ± σ): 2.83% ± 5.58%
▆█▄
████▃▆▃▆▅▃▅▄▄▁▁▄▁▃▁▃▁▃▄▄▃▅▅▄▄▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃ ▁
50.6 ms Histogram: frequency by time 77 ms <
Memory estimate: 17.17 MiB, allocs estimate: 117.
@benchmark $a * $b
BenchmarkTools.Trial: 90 samples with 1 evaluation.
Range (min … max): 40.158 ms … 75.403 ms ┊ GC (min … max): 0.00% … 8.52%
Time (median): 55.752 ms ┊ GC (median): 0.00%
Time (mean ± σ): 55.890 ms ± 8.547 ms ┊ GC (mean ± σ): 2.50% ± 4.67%
▂ █ ▂ ▅▂ ▂ ▂▂▅ ▂ ▂ ▂
█▅▁█▅▅▁▁▅▁▅█████▁████▁███▅▁██▅▅▅▅█████▅█▅▅▅▅▅▅█▅███▁█▁▁▁▁▁▅ ▁
40.2 ms Histogram: frequency by time 73.1 ms <
Memory estimate: 17.17 MiB, allocs estimate: 2.
For totally random matrices, the results are even more pronounced:
a = rand(1500, 1500)
b = rand(1500, 1500)
@benchmark matmul($a, $b)
BenchmarkTools.Trial: 94 samples with 1 evaluation.
Range (min … max): 49.783 ms … 79.087 ms ┊ GC (min … max): 0.00% … 14.77%
Time (median): 50.926 ms ┊ GC (median): 0.00%
Time (mean ± σ): 53.926 ms ± 6.676 ms ┊ GC (mean ± σ): 2.95% ± 5.54%
██▄
███▅▆▄▃▃▃▃▃▃▁▁▃▁▁▁▅▃▃▃▁▁▁▁▃▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▃▁▁▁▁▁▁▁▃ ▁
49.8 ms Histogram: frequency by time 79 ms <
Memory estimate: 17.17 MiB, allocs estimate: 118.
@benchmark $a * $b
BenchmarkTools.Trial: 83 samples with 1 evaluation.
Range (min … max): 39.850 ms … 76.194 ms ┊ GC (min … max): 0.00% … 9.49%
Time (median): 62.015 ms ┊ GC (median): 0.00%
Time (mean ± σ): 60.859 ms ± 8.666 ms ┊ GC (mean ± σ): 2.11% ± 3.51%
▁ ▁ ▁▃ ▁ ▁█ ▆ ▁ ▁ ▁▁
█▄▁▁▁▁▁▁▁▁▁▇▁▄▁▇▁▄█▁▁▁██▁▇█▇▇▄▁▄▄▁▇▄██▁█▇█▇▇▄█▇▄██▇▄▄▇▁▁▇▇▄ ▁
39.9 ms Histogram: frequency by time 74.9 ms <
Memory estimate: 17.17 MiB, allocs estimate: 2.
And here is another comparison against the king of performance, aka Intel Fortran, for the elementwise matrix product ("matrix dot") like your example above.
function dot_sv(result, sv1, sv2)
    for j ∈ axes(sv1,2)
        for i ∈ axes(sv1,1)
            result[i,j] = sv1[i,j] * sv2[i,j]
        end
    end
end
rnd_sv1 = rand(5000,1000) # [rand(1000) for _ ∈ 1:5000]
rnd_sv2 = rand(5000,1000) # [rand(1000) for _ ∈ 1:5000]
result = similar(rnd_sv1) # [Vector{Float64}(undef,1000) for _ ∈ 1:5000]
@btime dot_sv($result, $rnd_sv1, $rnd_sv2) # 7.987 ms (0 allocations: 0 bytes)
@btime $result .= $rnd_sv1 .* $rnd_sv2     # 7.990 ms (0 allocations: 0 bytes)
println(sum(result))
1249368.5894
And this is the Fortran code:
program dot_svs
    implicit none
    integer, parameter :: m = 5000, n = 1000
    real(8), allocatable :: rnd_sv1(:,:), rnd_sv2(:,:), res(:,:)
    integer :: t0, t1, count_max, count_rate
    allocate ( rnd_sv1(m,n), rnd_sv2(m,n), res(m,n) )
    call random_number(rnd_sv1)
    call random_number(rnd_sv2)
    call system_clock(t0, count_rate, count_max)
    call dot_sv(res, rnd_sv1, rnd_sv2)
    call system_clock(t1)
    print *, 'Elapsed Time :', real(t1 - t0) / count_rate
    print *, sum(res)
contains
    subroutine dot_sv(res, sv1, sv2)
        real(8) :: res(:,:), sv1(:,:), sv2(:,:)
        integer :: i, j
        do j = 1, size(res,2)
            do i = 1, size(res,1)
                res(i,j) = sv1(i,j) * sv2(i,j)
            end do
        end do
    end
end program dot_svs
Elapsed Time : 1.1000000E-02
1249866.98320309
For massive data that doesn't fit into memory, I imagine there are other methods, like reading data from disk or streaming it from the cloud. I personally don't have such workloads, but since Julia is fast at moderate workloads, it should be capable of performing well at bigger ones. There are of course some corner cases/rough edges in Julia arrays, but I think these will greatly improve in the future. I'm waiting for fixed-size arrays, automatic SArrays for small array literals, better views, etc.
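For the bigger-than-memory case, one option that ships with the standard library is memory-mapping, so the OS pages data in on demand. A minimal sketch (the file name and sizes are illustrative; it assumes the file holds raw Float64s in column-major order):
using Mmap
# write a matrix as raw bytes, then map it back without loading it into RAM
open("big.bin", "w+") do io
    write(io, rand(5000, 1000))
end
A = open(io -> Mmap.mmap(io, Matrix{Float64}, (5000, 1000)), "big.bin")
sum(A)  # pages are faulted in lazily as they are touched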

Binding multiple variables to copies in BenchmarkTools.jl setup

I am trying to use the @benchmarkable macro from BenchmarkTools.jl. In the package documentation they explain how to pass setup expressions to @benchmark and @benchmarkable. They also explain that this can be used for in-place/mutating functions in order to bind copies of the input variables.
I am not sure, however, how to use the setup expression to copy multiple variables at the same time.
For example, imagine I want to benchmark the following function (the actual function is irrelevant):
function my_function!(x, y)
    deleteat!(x, y .== 0)
    deleteat!(y, y .== 0)
    x .= x .* 2
end
With the following inputs:
using BenchmarkTools
a = collect(1:30)
b = rand(0:5, 30)
I would like to perform the benchmark by binding copies of a and b to the variables m and n respectively.
t = @benchmarkable my_function!(m, n) setup=(m = copy($a), n = copy($b)) evals = 30
run(t)
However, running the previous code returns the following error:
ERROR: LoadError: UndefVarError: m not defined
setup requires a block of code, so:
t = @benchmarkable my_function!(m, n) setup=begin; m = copy($a); n = copy($b); end evals = 30
Now you can run it:
julia> run(t)
BenchmarkTools.Trial: 10000 samples with 30 evaluations.
Range (min … max): 173.333 ns … 131.320 μs ┊ GC (min … max): 0.00% … 99.33%
Time (median): 210.000 ns ┊ GC (median): 0.00%
Time (mean ± σ): 251.470 ns ± 1.315 μs ┊ GC (mean ± σ): 5.19% ± 0.99%
▂█ ▁
██▇█▅▇█▅▆▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
173 ns Histogram: frequency by time 807 ns <
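As a side note, begin...end is not the only way to form the block: a parenthesized sequence of expressions separated by semicolons works too, which reads closer to the original attempt (the comma was the problem, not the parentheses):
t = @benchmarkable my_function!(m, n) setup=(m = copy($a); n = copy($b)) evals = 30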

Julia - Linear combination of row-wise outer products

I have a matrix A of dimension (n, m) and a matrix B of dimension (n, p). For each of the n rows, I would like to compute the outer product between the row of A and the row of B, each of which is an (m, p) matrix. I then have a vector x of size n; I would like to multiply each of these matrices by the corresponding entry of x and sum everything up. How can I do that?
# Parameters
n, m, p = 100, 10, 3
# Matrices & Vectors
A, B, x = randn(n, m), randn(n, p), randn(n)
# Slow method
result = zeros(m, p)
for i in 1:n
    result += x[i] * (A[i, :] * B[i, :]')
end
Here's another version:
# Parameters
n, m, p = 100, 10, 3
# Matrices & Vectors
A, B, x = randn(n, m), randn(n, p), randn(n)
function old_way(A, B, x)
    # Slow method
    result = zeros(m, p)
    for i in 1:n
        result += x[i] * (A[i, :] * B[i, :]')
    end
    result
end
function another_way(A, B, x)
    sum(xi * (Arow * Brow') for (xi, Arow, Brow) in zip(x, eachrow(A), eachrow(B)))
end
And benchmarking:
julia> using BenchmarkTools
julia> @benchmark old_way(A, B, x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 83.495 μs … 2.653 ms ┊ GC (min … max): 0.00% … 95.79%
Time (median): 87.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 101.496 μs ± 115.196 μs ┊ GC (mean ± σ): 6.46% ± 5.53%
▃█▇▆▂ ▁ ▁ ▁ ▁
███████████████▇█▇██████▆▇▇▇▆▇▇▇▇▇▇▇▇▆▇▆▆▇▆▇▅▆▆▆▄▅▅▅▅▆▅▆▅▅▅▄▅ █
83.5 μs Histogram: log(frequency) by time 200 μs <
Memory estimate: 153.48 KiB, allocs estimate: 1802.
julia> @benchmark another_way(A, B, x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 25.850 μs … 923.032 μs ┊ GC (min … max): 0.00% … 95.94%
Time (median): 27.477 μs ┊ GC (median): 0.00%
Time (mean ± σ): 31.851 μs ± 35.440 μs ┊ GC (mean ± σ): 5.03% ± 4.49%
▇█▇▅▃▁ ▁▁▁ ▁
███████▇▇▆▇████████████▇█▇▇▇▆▇▆▅▆▆▆▅▆▆▅▆▆▇▆▅▅▅▆▅▆▅▅▅▄▃▅▅▅▄▄▄ █
25.8 μs Histogram: log(frequency) by time 77.4 μs <
Memory estimate: 98.31 KiB, allocs estimate: 304.
So it's a little faster, and uses a little less memory.
Does this answer your question?
General tips to save time and memory:
Put code in a method instead of the global scope, and make sure every variable in that function comes from the arguments, not global variables. That way, Julia's compiler can infer the types of variables and optimize.
Reduce allocations where possible, and you have many opportunities here. The changes below distinguish the old_way and new_way methods, and they cause a 5-6x speedup and a reduction to 1 allocation.
When slicing an array, use @view to avoid the default behavior of allocating a copy.
You can change result in-place with .+=. += allocates a new array and reassigns the variable result to it.
For elementwise operations like x[i] * ..., chaining dotted operators fuses the underlying elementwise loops and reduces allocations of intermediate arrays.
A matrix multiplication of a column (M×1) vector by a row (1×N) vector can be replaced with broadcasted elementwise multiplication, which fuses with the rest of the expression.
n, m, p = 100, 10, 3
A, B, x = randn(n, m), randn(n, p), randn(n)
# Methods below do not use the above global variables
function old_way(A, B, x, n, m, p)
    result = zeros(m, p)
    for i in 1:n
        result += x[i] * (A[i, :] * B[i, :]')
    end
    result
end
function new_way(A, B, x, n, m, p)
    result = zeros(m, p)
    for i in 1:n
        result .+= x[i] .* ( @view(A[i, :]) .* @view(B[i, :])' )
    end
    result
end
using BenchmarkTools
@btime old_way(A, B, x, n, m, p);
# 36.753 μs (501 allocations: 125.33 KiB)
@btime new_way(A, B, x, n, m, p);
# 6.542 μs (1 allocation: 336 bytes)
old_way(A, B, x, n, m, p) == new_way(A, B, x, n, m, p)
# true
The examples above avoided global variables, and the example below shows why. Even if you put your code in a method but still use global variables, not only is performance generally worse, but trying to reduce allocations backfires:
# Methods below use n, m, p as global inputs
function old_oops(A, B, x)
    # same code as old_way(A, B, x, n, m, p)
end
function new_oops(A, B, x)
    # same code as new_way(A, B, x, n, m, p)
end
@btime old_oops(A, B, x);
# 95.317 μs (1802 allocations: 153.48 KiB)
@btime new_oops(A, B, x);
# 235.191 μs (1302 allocations: 81.61 KiB)
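If some values really must live in global scope, declaring them const lets the compiler infer their types again and avoids the penalty above. A sketch, assuming a fresh session (since n, m, p were already defined as non-const) and the same old_oops/new_oops definitions:
const n, m, p = 100, 10, 3  # const globals have a fixed, inferable type
A, B, x = randn(n, m), randn(n, p), randn(n)
@btime new_oops(A, B, x);  # timings should drop back toward new_way's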
If your setup has the same structure as your MWE, using LinearAlgebra:
faster(A,B,x) = (diagm(x)*A)'*B
runs 4x faster:
using LinearAlgebra, BenchmarkTools
# Parameters
n, m, p = 100, 10, 3
# Matrices & Vectors
A, B, x = randn(n, m), randn(n, p), randn(n)
# Slow method
function slow(A,B,x)
    result = zeros(m, p)
    for i in 1:n
        result += x[i] * (A[i, :] * B[i, :]')
    end
    result
end
faster(A,B,x) = (diagm(x)*A)'*B
@assert(slow(A,B,x) ≈ faster(A,B,x))
@btime slow($A,$B,$x)   # 113.400 μs (1802 allocations: 139.39 KiB)
@btime faster($A,$B,$x) # 28.100 μs (4 allocations: 86.41 KiB)
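If the n×n temporary built by diagm(x) bothers you, the same row-scaling can be written with broadcasting, since diagm(x)*A just scales row i of A by x[i]. A sketch (fastest is a hypothetical name):
fastest(A,B,x) = (x .* A)' * B  # row-scale A by x, then a single matrix product
@assert(faster(A,B,x) ≈ fastest(A,B,x))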

Julia - vector of matrices or 3D array

I want to multiply several matrices (all of the same dimensions) by a vector beta.
I have tried two ways of storing the matrices: as a vector of matrices, and as a three-dimensional array.
Is it normal that the version with the vector of matrices runs faster?
using BenchmarkTools
nid =10000
npar = 5
x1 = reshape(repeat([1], nid * npar), nid,:)
x2 = reshape(repeat([2], nid * npar), nid,:)
x3 = reshape(repeat([3], nid * npar), nid,:)
X = reshape([x1 x2 x3], nid, npar, :);
X1 = [x1, x2, x3]
beta = rand(npar)
function f(X::Array{Int,3}, beta::Vector{Float64})::Array{Float64}
    hcat([X[:,:,i] * beta for i = 1:size(X, 3)]...)
end
function g(X::Array{Array{Int,2},1}, beta::Vector{Float64})::Array{Float64}
    hcat([X[i] * beta for i = 1:size(X)[1]]...)
end
f(X,beta);
g(X1,beta);
@benchmark f(X, beta)
@benchmark g(X1, beta)
Results indicate that f takes almost 2x the time of g.
Is this a normal pattern, or am I not using the 3D array properly?
That's because the slicing operator copies each matrix, hence the extra memory allocation.
Notice the last line of the benchmark:
julia> @benchmark f(X, beta)
BenchmarkTools.Trial: 8011 samples with 1 evaluation.
Range (min … max): 356.013 μs … 4.076 ms ┊ GC (min … max): 0.00% … 87.21%
Time (median): 457.231 μs ┊ GC (median): 0.00%
Time (mean ± σ): 615.235 μs ± 351.236 μs ┊ GC (mean ± σ): 6.54% ± 11.69%
▃▇██▆▅▄▄▄▃▂ ▂▃▄▄▄▃▁▁▁ ▁ ▁ ▂
████████████▆▇▅▅▄▄▇███████████████▇█▇▇█▇▆▇█▆▆▆▆▅▆▅▅▄▃▃▂▂▄▂▄▄▄ █
356 μs Histogram: log(frequency) by time 1.96 ms <
Memory estimate: 1.60 MiB, allocs estimate: 17.
julia> @benchmark g(X1, beta)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 174.348 μs … 2.493 ms ┊ GC (min … max): 0.00% … 83.85%
Time (median): 219.383 μs ┊ GC (median): 0.00%
Time (mean ± σ): 245.192 μs ± 119.612 μs ┊ GC (mean ± σ): 3.54% ± 7.68%
▃▇█▂
▅████▆▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
174 μs Histogram: frequency by time 974 μs <
Memory estimate: 469.25 KiB, allocs estimate: 11.
To avoid it, just use the @views macro, which takes references instead of making copies. The two implementations then take the same time, up to random noise:
function fbis(X::Array{Int,3}, beta::Vector{Float64})::Array{Float64}
    @views hcat([X[:,:,i] * beta for i = 1:size(X, 3)]...)
end
julia> @benchmark fbis(X, beta)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 175.984 μs … 2.710 ms ┊ GC (min … max): 0.00% … 79.70%
Time (median): 225.990 μs ┊ GC (median): 0.00%
Time (mean ± σ): 274.611 μs ± 166.015 μs ┊ GC (mean ± σ): 4.17% ± 7.78%
▅█▃
▆███▆▄▃▄▄▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
176 μs Histogram: frequency by time 1.15 ms <
Memory estimate: 469.25 KiB, allocs estimate: 11.
While using references improves the benchmarks in this case, take care not to abuse them. If you are creating a matrix that you are going to use over and over again, the one-time allocation cost of a copy may become negligible compared to the cost of repeatedly accessing scattered memory through a view.
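As a side note, since Julia 1.1 the same no-copy slicing can be written with eachslice, which iterates views along the chosen dimension; reduce(hcat, ...) also avoids splatting a long argument list. A sketch equivalent to fbis (fslices is a hypothetical name):
fslices(X, beta) = reduce(hcat, [Xi * beta for Xi in eachslice(X; dims=3)])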
