I am trying to understand memory allocations in Julia and was playing around with the following test code:
function f()
    function test(x, y)
        return x - 1, y + 1
    end
    x1 = 5
    y1 = 6
    function stest(num_runs, x, y)
        for i in 1:num_runs
            x, y = test(x, y)
        end
        return x, y
    end
    x1, y1 = stest(10, x1, y1)
    println(x1, ' ', y1)
end
@time begin
    f()
end
@time begin
    f()
end
When I run it, I get the following outputs:
-5 16
0.027611 seconds (20.59 k allocations: 1.039 MiB, 92.11% compilation time)
-5 16
0.000077 seconds (18 allocations: 496 bytes)
Why is there any memory allocation at all? And why is it so much larger the first time around? I have been reading through the docs but am having trouble figuring it out.
The first time, almost all the allocations are just from compilation. The second time, the allocations are all from printing.
Try using BenchmarkTools. That way you avoid measuring the time and memory used to compile the code (usually one is interested in the runtime, not the compilation time). This tool, however, runs the benchmark many times, so you will need to comment out the println line.
julia> using BenchmarkTools
julia> @benchmark f()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.700 ns … 48.500 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.000 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.971 ns ± 0.520 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ █
▄▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂ ▂
1.7 ns Histogram: frequency by time 2.4 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Printing allocates memory. If you comment out the line where you print, you'll see what you expect:
0.000001 seconds
0.000001 seconds
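For reference, a minimal sketch of the function with the print removed (f_noprint is just an illustrative name, not from the original post):
function f_noprint()
    test(x, y) = (x - 1, y + 1)
    function stest(num_runs, x, y)
        for i in 1:num_runs
            x, y = test(x, y)
        end
        return x, y
    end
    return stest(10, 5, 6)   # no println, so no I/O-related allocations
end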
Illustration of the problem: the row norms of a matrix
Consider this toy example where I compute the norms of all the columns of a random matrix M:
julia> using LinearAlgebra

julia> M = rand(Float64, 10000, 10000);
julia> @time map(x -> norm(x), M[:,j] for j in 1:size(M)[2]);
0.363795 seconds (166.70 k allocations: 770.086 MiB, 27.78% gc time)
Then compute the row norms
julia> @time map(x -> norm(x), M[:,i] for i in 1:size(M)[1]);
1.288872 seconds (176.19 k allocations: 770.232 MiB, 0.37% gc time)
The factor between the two executions is due (I think) to the memory layout of the matrix (column-major). Indeed, computing the row norms loops over non-contiguous data, which leads to non-vectorized code and cache misses.
I would like to have the same execution time for both norm computations.
Is it possible to convert the layout of M to row-major to get the same speed when calculating the norms of the rows?
What did I try
I tried transpose and permutedims without success; it seems that when using these functions the memory is now row-major (i.e., column-major with respect to the original matrix).
julia> Mt = copy(transpose(M));
julia> @time map(x -> norm(x), Mt[j,:] for j in 1:size(M)[2]);
1.581778 seconds (176.19 k allocations: 770.230 MiB)
julia> Mt = copy(permutedims(M,[2,1]));
julia> @time map(x -> norm(x), Mt[j,:] for j in 1:size(M)[2]);
1.454153 seconds (176.19 k allocations: 770.236 MiB, 9.98% gc time)
I used copy here to try to force the new layout.
How can I force a column-major layout of the transposed matrix, or a row-major layout of the original matrix?
EDIT
As pointed out by @mcabbott and @przemyslaw-szufel, there was an error in the last code I showed: I computed the norms of the rows of Mt instead of the norms of the columns.
The test on the norms of the columns of Mt gives instead:
julia> Mt = transpose(M);
julia> @time map(x -> norm(x), Mt[:,j] for j in 1:size(M)[2]);
1.307777 seconds (204.52 k allocations: 772.032 MiB, 0.45% gc time)
julia> Mt = permutedims(M);
julia> @time map(x -> norm(x), Mt[:,j] for j in 1:size(M)[2]);
0.334047 seconds (166.53 k allocations: 770.079 MiB, 1.42% gc time)
So in the end it seems that permutedims stores its result in column-major order, as would be expected. In fact, Julia arrays are always stored in column-major order; transpose is kind of an exception because it is a lazy row-major view of a column-major-stored matrix.
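The type difference makes this visible. A quick check (output as printed on recent Julia versions): transpose returns a lazy wrapper over the original column-major data, while permutedims materializes a new column-major copy:
julia> typeof(transpose(M))
Transpose{Float64, Matrix{Float64}}
julia> typeof(permutedims(M))
Matrix{Float64}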
There are several problems here:
you are incorrectly benchmarking your code - most likely you are measuring uncompiled code (and hence compile times) in the first run and compiled code in later runs. You should always run @time twice, or use BenchmarkTools instead
your code is inefficient - it does unnecessary memory copying
M is a non-constant global, so its type is not known to the compiler and the measurement includes the time needed to resolve it, which is not the case when you are normally running a Julia function
you do not need a lambda - you can just pass the function directly
as mentioned by @mcabbott, your code contains a bug and you are measuring the same thing twice
After cleaning up, your code looks like this:
julia> using LinearAlgebra, BenchmarkTools
julia> const M = rand(10000, 10000);
julia> @btime map(norm, @view M[:,j] for j in 1:size(M)[2]);
49.378 ms (2 allocations: 78.20 KiB)
julia> @btime map(norm, @view M[i, :] for i in 1:size(M)[1]);
1.013 s (2 allocations: 78.20 KiB)
Now to the question about data layout.
Julia is using a column-major memory layout. Hence the operations that work on columns will be faster than those that work on rows.
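A quick way to see this (M2 is just a toy matrix for illustration; output as printed on recent Julia versions): linear memory order walks down the columns, so vec stacks the columns of a matrix:
julia> M2 = [1 2; 3 4];
julia> vec(M2)
4-element Vector{Int64}:
 1
 3
 2
 4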
One possible workaround would be to have a transposed copy of M:
const Mᵀ = collect(M')
This requires some time for copying but allows you later to match the performance:
julia> @btime map(norm, @view Mᵀ[:,j] for j in 1:size(M)[2]);
48.455 ms (2 allocations: 78.20 KiB)
julia> map(norm, Mᵀ[:,j] for j in 1:size(M)[2]) == map(norm, M[i,:] for i in 1:size(M)[1])
true
You are wasting a lot of time on creating copies of each column/row when calculating the norms. Use views instead, or better yet, eachcol/eachrow, which also do not allocate:
julia> M = rand(1000, 1000);
julia> @btime map(norm, $M[:,j] for j in 1:size($M, 2)); # slow and ugly
946.301 μs (1001 allocations: 7.76 MiB)
julia> @btime map(norm, eachcol($M)); # fast and nice
223.199 μs (1 allocation: 7.94 KiB)
julia> @btime norm.(eachcol($M)); # even nicer, but allocates more for some reason.
227.701 μs (3 allocations: 47.08 KiB)
Why is garbage collection so much slower when a large number of mutable structs are loaded in memory as compared with non-mutable structs? The object tree should have the same size in both cases.
julia> struct Foo
           a::Float64
           b::Float64
           c::Float64
       end

julia> mutable struct Bar
           a::Float64
           b::Float64
           c::Float64
       end
julia> @time dat1 = [Foo(0.0, 0.0, 0.0) for i in 1:1e9];
9.706709 seconds (371.88 k allocations: 22.371 GiB, 0.14% gc time)
julia> @time GC.gc(true)
0.104186 seconds, 100.00% gc time
julia> @time GC.gc(true)
0.124675 seconds, 100.00% gc time
julia> @time dat2 = [Bar(0.0, 0.0, 0.0) for i in 1:1e9];
71.870870 seconds (1.00 G allocations: 37.256 GiB, 73.88% gc time)
julia> @time GC.gc(true)
47.695473 seconds, 100.00% gc time
julia> @time GC.gc(true)
41.809898 seconds, 100.00% gc time
Non-mutable structs may be stored directly inside an Array. This will never happen for mutable structs. In your case, the Foo objects are all stored directly in dat1, so there is effectively just a single (albeit very large) allocation reachable after creating the Array.
In the case of dat2, every Bar object will have its own piece of memory allocated for it and the Array will contain references to these objects. So with dat2, you end up with 1G + 1 reachable allocations.
You can also see this using Base.sizeof:
julia> sizeof(dat1)
24000000000
julia> sizeof(dat2)
8000000000
You'll see that dat1 is 3 times as large, as every array entry contains the 3 Float64s directly, while the entries in dat2 take up just the space for a pointer each.
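You can also ask Julia directly whether a type can be stored inline (a small aside, not from the original post):
julia> isbitstype(Foo)   # plain immutable data: stored inline in the Array
true
julia> isbitstype(Bar)   # mutable: stored behind a reference
false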
As a side note: for these kinds of tests, it is a good idea to use BenchmarkTools.@btime instead of the built-in @time, as it removes the compilation overhead from the result and also runs your code multiple times in order to give you a more representative result:
@btime dat1 = [Foo(0.0, 0.0, 0.0) for i in 1:1e6]
2.237 ms (2 allocations: 22.89 MiB)
@btime dat2 = [Bar(0.0, 0.0, 0.0) for i in 1:1e6]
6.878 ms (1000002 allocations: 38.15 MiB)
As seen above, this is particularly useful to debug allocations. For dat1 we get 2 allocations (one for the Array instance itself and one for the chunk of memory where the array stores its data), while with dat2 we have an additional allocation per element.
Following on from my question, I would like to calculate the Q matrix in a memory-efficient way from the output of the spqr procedure. So far, only Matrix() seems to be implemented. However, I need the Q matrix in a sparse format; there is not enough memory to build it densely and convert it to a sparse matrix later:
using LinearAlgebra, SparseArrays
N = 500
ns = 3
d = 0.0001
A = sprand(N,N-ns,d)
H = A*A'
println("eigen")
@time eigen(Matrix(H))
println("qr")
@time F = qr(H)
println("Matrix(F.Q)")
@time Q = Matrix(F.Q)
println("sparse(Q)")
@time sparse(Q)
println("sparse(F.Q)")
@time sparse(F.Q)
Output:
eigen
0.046383 seconds (22 allocations: 7.810 MiB)
qr
0.000479 seconds (649 allocations: 125.533 KiB)
Matrix(F.Q)
0.000454 seconds (508 allocations: 1.931 MiB)
sparse(Q)
0.000371 seconds (9 allocations: 12.406 KiB)
sparse(F.Q)
1.355230 seconds (1.50 M allocations: 1.982 GiB, 33.47% gc time)
Unfortunately, I could not find the routine in the standard library that performs Matrix(F.Q); otherwise I could replace it with a sparse version myself.
Generally, Q won't be sparse, so I don't think either we or SuiteSparse provide such a function. You might be able to write one based on the sparse reflectors in the Q struct.
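If you want to try anyway, here is a minimal sketch of the idea, with one assumption stated up front: sparse_Q is a hypothetical helper, not a library routine. It applies the implicit Q to unit vectors one at a time and sparsifies each resulting column, so the full dense Q never has to exist in memory at once (though if Q is genuinely dense, the result will still be large):
using LinearAlgebra, SparseArrays

# Hypothetical helper: materialize Q column by column, sparsifying as we go.
function sparse_Q(Q, m; droptol = 1e-14)
    e = zeros(m)
    cols = SparseVector{Float64,Int}[]
    for i in 1:m
        e[i] = 1.0
        q = Q * e                          # i-th column of Q as a dense vector
        push!(cols, droptol!(sparsevec(q), droptol))
        e[i] = 0.0
    end
    return reduce(hcat, cols)              # hcat of sparse vectors is sparse
end

# Usage sketch: Qs = sparse_Q(F.Q, size(F.Q, 1))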
I have a vector of 100 functions which I want to compose together. I need to run these 100 functions in sequence many times, so I figured composing them would be faster than making a nested loop; however, I was sadly mistaken. I tried reduce(∘, reverse(instructions))(input) and it was taking quite some time. I started timing it and was shocked to discover that composing any significant number of functions together is quite a bit slower than simply looping through the list of functions and applying them in sequence. All 100 functions I have are constant-time operations, yet here's what I get when I time running the composition of any of these.
julia> @time reduce(∘, reverse(instructions[1:2]))(2019)
0.000015 seconds (9 allocations: 448 bytes)
2041
julia> @time reduce(∘, reverse(instructions[1:5]))(2019)
0.006597 seconds (4.43 k allocations: 212.517 KiB)
6951
julia> @time reduce(∘, reverse(instructions[1:10]))(2019)
0.022688 seconds (31.01 k allocations: 1.405 MiB)
4935
julia> @time reduce(∘, reverse(instructions[1:20]))(2019)
0.951510 seconds (47.97 k allocations: 2.167 MiB)
3894
julia> @time reduce(∘, reverse(instructions[1:21]))(2019)
1.894370 seconds (60.45 k allocations: 2.715 MiB)
6242
julia> @time reduce(∘, reverse(instructions[1:22]))(2019)
3.748505 seconds (50.59 k allocations: 2.289 MiB)
1669
julia> @time reduce(∘, reverse(instructions[1:23]))(2019)
6.638699 seconds (65.98 k allocations: 2.982 MiB, 0.12% gc time)
8337
julia> @time reduce(∘, reverse(instructions[1:24]))(2019)
12.456682 seconds (68.45 k allocations: 3.096 MiB)
6563
julia> @time reduce(∘, reverse(instructions[1:25]))(2019)
31.712616 seconds (73.44 k allocations: 3.296 MiB)
8178
Just adding one more composed function seems to double the time it takes to run. Rerunning all this code results in it being much faster:
julia> @time reduce(∘, reverse(instructions[1:2]))(2019)
0.000019 seconds (9 allocations: 448 bytes)
2041
julia> @time reduce(∘, reverse(instructions[1:5]))(2019)
0.000021 seconds (12 allocations: 752 bytes)
6951
julia> @time reduce(∘, reverse(instructions[1:10]))(2019)
0.000020 seconds (17 allocations: 1.359 KiB)
4935
julia> @time reduce(∘, reverse(instructions[1:20]))(2019)
0.000027 seconds (27 allocations: 4.141 KiB)
3894
julia> @time reduce(∘, reverse(instructions[1:25]))(2019)
0.000028 seconds (32 allocations: 6.109 KiB)
8178
But then if I add one more again then it takes double whatever the last one took
julia> @time reduce(∘, reverse(instructions[1:26]))(2019)
60.287693 seconds (68.03 k allocations: 3.079 MiB)
3553
So it seems like all the time it's taking is in compiling the functions together, and for 100 functions it'd take way more time than I have. This is confirmed by the following results:
julia> @time reduce(∘, reverse(instructions[1:27]))
0.000041 seconds (99 allocations: 10.859 KiB)
#52 (generic function with 1 method)
julia> @time precompile(ans, (Int,))
117.783710 seconds (79.01 k allocations: 3.650 MiB)
true
What's the deal here? I'll just run the functions in sequence, I suppose, but why does this reduction take so long to compile? It seems like a failing of the ∘ function itself that deeply nested compositions take so long to compile; that was quite surprising to me, since this seems like a pretty basic use case of ∘. Basically, it appears that compile time is O(2^n), where n is the number of functions being composed. That seems like a big problem.
I realized I was using an old version of Julia. On the latest version (1.3) it goes much quicker. It still starts getting slow if you get up into the high thousands (compiling 3000 functions composed together takes a few seconds), but it seems it's no longer O(2^n).
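For reference, a minimal sketch of the loop alternative mentioned in the question (apply_all and the three sample functions are illustrative stand-ins, not the real instructions): applying the functions in sequence never builds a nested composition type, so there is essentially nothing extra to compile:
# Illustrative loop alternative: no deeply nested ComposedFunction to compile.
function apply_all(fs, x)
    for f in fs
        x = f(x)   # one dynamic dispatch per call, constant compile cost
    end
    return x
end

instructions = Function[x -> x + 1, x -> 2x, x -> x - 3]   # stand-ins
apply_all(instructions, 2019)   # ((2019 + 1) * 2) - 3 == 4037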
The Julia 1.0.0 documentation provides general performance tips.
It also suggests that, instead of using the @time macro:
For more serious benchmarking, consider the BenchmarkTools.jl package which among other things evaluates the function multiple times in order to reduce noise.
How do they compare in use and is it worth the trouble to use something not in "base" Julia?
From a statistical point of view, @benchmark is much better than @time.
TL;DR: The BenchmarkTools @benchmark macro is a great micro-benchmark tool.
Use the @time macro with caution and don't take the first run seriously.
This simple example illustrates use and differences:
julia> # Fresh Julia 1.0.0 REPL
julia> # Add BenchmarkTools package using ] key package manager
(v1.0) pkg> add BenchmarkTools
julia> # Press backspace key to get back to Julia REPL
# Load BenchmarkTools package into current REPL
julia> using BenchmarkTools
julia> # Define a function with a known elapsed time
julia> f(n) = sleep(n) # n is in seconds
f (generic function with 1 method)
# Expect just over 500 ms for elapsed time
julia> @benchmark f(0.5)
BenchmarkTools.Trial:
memory estimate: 192 bytes
allocs estimate: 5
--------------
minimum time: 501.825 ms (0.00% GC)
median time: 507.386 ms (0.00% GC)
mean time: 508.069 ms (0.00% GC)
maximum time: 514.496 ms (0.00% GC)
--------------
samples: 10
evals/sample: 1
julia> # Try second run to compare consistency
julia> # Note the very close consistency in ms for both median and mean times
julia> @benchmark f(0.5)
BenchmarkTools.Trial:
memory estimate: 192 bytes
allocs estimate: 5
--------------
minimum time: 502.603 ms (0.00% GC)
median time: 508.716 ms (0.00% GC)
mean time: 508.619 ms (0.00% GC)
maximum time: 515.602 ms (0.00% GC)
--------------
samples: 10
evals/sample: 1
julia> # Define the same function with a new name for @time macro tests
julia> g(n) = sleep(n)
g (generic function with 1 method)
# First run suffers from compilation time, so 518 ms
julia> @time g(0.5)
0.517897 seconds (83 allocations: 5.813 KiB)
# Second run drops to 502 ms, 16 ms drop
julia> @time g(0.5)
0.502038 seconds (9 allocations: 352 bytes)
# Third run similar to second
julia> @time g(0.5)
0.503606 seconds (9 allocations: 352 bytes)
# Fourth run increases over second by about 13 ms
julia> @time g(0.5)
0.514629 seconds (9 allocations: 352 bytes)
This simple example illustrates how easy it is to use the @benchmark macro and the caution with which the @time macro results should be taken.
Yes, it is worth the trouble to use the #benchmark macro.
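One related BenchmarkTools detail worth knowing: when the benchmarked expression references a non-constant global, interpolate it with $ (as done with $M earlier in this page) so untyped-global overhead does not pollute the measurement. A minimal example:
using BenchmarkTools
x = rand(1000)
@btime sum($x)   # $x is spliced in as if it were a local variable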