I am trying to understand memory management a little better. I have the following example code:
begin
    mutable struct SimplestStruct
        a::Float64
    end

    function SimplestLoop!(a::Array{SimplestStruct, 1}, x::Float64)
        for i in 1:length(a)
            @inbounds a[i].a = x
        end
    end

    simples = fill(SimplestStruct(rand()), 100)
    @time SimplestLoop!(simples, 6.0)
end
As far as I can work out from the docs and various good posts about in-place operations, SimplestLoop! should operate on its first argument without allocating any extra memory. However, @time reports 17k allocations.
Is there an obvious reason why this is happening?
Thank you in advance!
If you perform the @time measurement several times, you'll see that the first measurement is different from the others. This is because you're actually mostly measuring (just-ahead-of-time) compilation time and memory allocations.
When the objective is to better understand runtime performance, it is generally recommended to use BenchmarkTools to perform the benchmarks:
julia> using BenchmarkTools
julia> @btime SimplestLoop!($simples, 6.0)
  82.930 ns (0 allocations: 0 bytes)
BenchmarkTools' @btime macro takes care of discarding compilation time, and it averages runtime measurements over a sufficiently large number of samples to get accurate estimates. With this, we see that there are indeed no allocations in your code, as expected.
Related
I am reading Julia performance tips,
https://docs.julialang.org/en/v1/manual/performance-tips/
At the beginning, it mentions two examples.
Example 1,
julia> x = rand(1000);
julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;
julia> @time sum_global()
0.009639 seconds (7.36 k allocations: 300.310 KiB, 98.32% compilation time)
496.84883432553846
julia> @time sum_global()
0.000140 seconds (3.49 k allocations: 70.313 KiB)
496.84883432553846
We see a lot of memory allocations.
Now example 2,
julia> x = rand(1000);
julia> function sum_arg(x)
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;
julia> @time sum_arg(x)
0.006202 seconds (4.18 k allocations: 217.860 KiB, 99.72% compilation time)
496.84883432553846
julia> @time sum_arg(x)
0.000005 seconds (1 allocation: 16 bytes)
496.84883432553846
We see that by passing x as an argument of the function, memory allocations almost disappear and the code runs much faster.
My questions are: can anyone explain,
why does example 1 need so many allocations, and why does example 2 not need as many allocations as example 1?
I am a little confused.
in the two examples, we see that the second run is always faster than the first.
Does that mean we need to run everything twice? If Julia is only fast on the second run, then what is the point? Why doesn't Julia just compile first and then run, like Fortran?
Is there any general rule for preventing memory allocations? Or do we always have to run @time to identify the issue?
Thanks!
why does example 1 need so many allocations, and why does example 2 not need as many allocations as example 1?
Example 1 needs so many allocations because x is a global variable (defined outside the scope of the function sum_global). Therefore the type of x can potentially change at any time, i.e. it is possible that:
you define x and sum_global
you compile sum_global
you redefine x (change its type) and run sum_global
In particular, as Julia supports multithreading, both actions in step 3 could in general even happen in parallel (i.e. you could change the type of x in one thread while sum_global is running in another thread).
So, because the type of x can change after sum_global has been compiled, Julia has to ensure that the compiled code does not rely on the type x had when the compilation took place. Instead, in such cases, Julia allows the type of x to change dynamically. However, this means that the type of x has to be checked at run time (not compile time), and this dynamic checking is what causes the performance degradation and the memory allocations.
You could have fixed this by declaring x as const (since const ensures that the type of x cannot change):
julia> const x = rand(1000);

julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_global() # this is now fast
  0.000002 seconds
498.9290555615045
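As a side note (not part of the original examples): in a fresh session on Julia 1.8 or newer, a type-annotated global is another option. It fixes the type of the binding, so the compiler can rely on it just as with const, while still letting you reassign the value. A minimal sketch:

```julia
# Julia >= 1.8: a typed global has a fixed type, so the compiler
# does not need run-time type checks when accessing it.
x::Vector{Float64} = rand(1000)

function sum_global()
    s = 0.0
    for i in x
        s += i
    end
    return s
end

sum_global()           # no per-element allocations
x = rand(2000)         # reassigning is allowed; changing the type is not
```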
Why doesn't Julia just compile first and then run, like Fortran?
This is exactly what Julia does. The benefit of Julia, however, is that it performs the compilation automatically when needed, which allows for a smooth interactive development process.
If you wanted, you could compile a function before running it with the precompile function and then call it separately. Normally, however, people just call the function and let Julia compile it implicitly.
The consequence is that if you use @time:
The first time you run a function, it reports both execution time and compilation time (and, as you can see in the examples you pasted, what percentage of the time was spent on compilation).
In consecutive runs, the function is already compiled, so only the execution time is reported.
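For illustration, a minimal sketch of explicit precompilation, repeating the sum_arg definition from the examples above so the snippet is self-contained:

```julia
function sum_arg(x)
    s = 0.0
    for i in x
        s += i
    end
    return s
end

# Ask Julia to compile a specialization of sum_arg for Vector{Float64}
# up front; returns true if the specialization was compiled.
precompile(sum_arg, (Vector{Float64},))

x = rand(1000)
sum_arg(x)  # runs the already-compiled specialization
```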
Is there any general rule for preventing memory allocations?
These rules are given exactly in the Performance Tips section of the manual that you quote in your question. The tip on using @time there is a diagnostic tip; all the other tips are recommendations for writing fast code. I understand that the list is long, however, so a shorter list that in my experience is good enough to start with is:
Avoid global variables
Avoid containers with abstract type parameters
Write type stable functions
Avoid changing the type of a variable
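As a small illustration of the last two points, here is a hypothetical sketch (not from the question) of a type-unstable function and its stable counterpart:

```julia
# Type-unstable: `s` starts as an Int and may become a Float64,
# so the compiler cannot assign `s` a single concrete type.
function unstable(xs)
    s = 0            # Int
    for x in xs
        s += x       # becomes Float64 when xs contains floats
    end
    return s
end

# Type-stable: `s` has one concrete type throughout the loop.
function stable(xs)
    s = zero(eltype(xs))
    for x in xs
        s += x
    end
    return s
end
```

Running `@code_warntype unstable(rand(10))` highlights the problematic `Union{Int64, Float64}` type of `s`, while `@code_warntype stable(rand(10))` is clean.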
I have a simple question. I have defined a struct, and I need to instantiate a lot of them (on the order of millions) and loop over them.
I am initiating one at a time and going through the loop as follows:
using Distributions

mutable struct help_me{Z<:Bool}
    can_you_help_me::Z
    millions_of_thanks::Z
end

for i in 1:max_iter
    tmp_help = help_me(rand(Bernoulli(0.5),1)[1], rand(Bernoulli(0.99),1)[1])
    # many follow-up processes
end
The memory allocation scales with max_iter. For my purpose, I do not need to save each struct. Is there a way to "re-use" the memory allocated for the struct?
Your main problem lies here:
rand(Bernoulli(0.5),1)[1], rand(Bernoulli(0.99),1)[1]
You are creating a length-1 array and then reading the first element from that array. This allocates unnecessary memory and takes time. Don't create an array here. Instead, write
rand(Bernoulli(0.5)), rand(Bernoulli(0.99))
This will just create random scalar numbers, no array.
Compare timings here:
julia> using BenchmarkTools
julia> @btime rand(Bernoulli(0.5),1)[1]
  36.290 ns (1 allocation: 96 bytes)
false

julia> @btime rand(Bernoulli(0.5))
  6.708 ns (0 allocations: 0 bytes)
false
6 times as fast, and no memory allocation.
This seems to be a general issue. Very often I see people writing rand(1)[1], when they should be using just rand().
Also, consider whether you actually need to make the struct mutable, as others have mentioned.
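For instance, an immutable version of the struct from the question typically avoids per-object heap allocation entirely. A sketch (using rand() < p from the standard library in place of rand(Bernoulli(p)) from Distributions, so the snippet is self-contained; the struct name is hypothetical):

```julia
# Immutable struct: instances are plain bits values, so they can live
# on the stack or in registers with no per-object heap allocation.
struct HelpMe
    can_you_help_me::Bool
    millions_of_thanks::Bool
end

for i in 1:1_000_000
    # rand() < p stands in for rand(Bernoulli(p)) from Distributions
    tmp_help = HelpMe(rand() < 0.5, rand() < 0.99)
    # many follow-up processes
end
```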
If the structure is not needed anymore (i.e. not referenced anywhere outside the current loop iteration), the Garbage Collector will free up its memory automatically if required.
Otherwise, I agree with the suggestions of Oscar Smith: memory allocation and garbage collection take time, avoid it for performance reasons if possible.
I am working with big data arrays, something on the order of 10^10 elements. I fill the entries of these arrays by calling a certain function. All entries are independent, so I would like to exploit this and fill the array by a parallel for loop that runs over a set of indices and calls the function. I know about SharedArrays, and this is how I usually implement such things, but because my arrays are huge, I don't want to share them over all the workers. I want to keep the array only on the main worker, execute a parallel for loop, and transfer the result of each iteration back to the main worker to be stored in the array.
For example, this is what I normally do for small arrays.
H = SharedArray{ComplexF64}(n,n) # creates a shared array of size n*n
@sync @distributed for i in 1:q
    H[i] = f(i) # f is a function defined on every worker
end
The problem with this construction is that if the size n of the array is too big, sharing it with all the workers is not very efficient. Is there a way of getting around this? I realize my question might be very naive, and I apologize for that.
A SharedArray is not copied among workers! It simply allows the same memory area to be accessible by all processes. This is indeed very fast because there is no communication overhead between the workers. The master process can simply look at the memory area filled by workers and that's it.
The only disadvantage of SharedArrays is that all workers need to be on the same host. If you use DistributedArrays instead, each worker holds only its own part of the array, so collecting the results only adds unnecessary allocations due to inter-process communication.
Let us have a look (these are two equivalent codes for shared and distributed arrays):
using Distributed
using BenchmarkTools
addprocs(4)

using SharedArrays
function f1()
    h = SharedArray{Float64}(10_000) # creates a shared vector of length 10_000
    @sync @distributed for i in 1:10_000
        h[i] = sum(rand(1_000))
    end
    h
end
using DistributedArrays
@everywhere using DistributedArrays

function f2()
    d = dzeros(10_000) # creates a distributed vector of length 10_000
    @sync @distributed for i in 1:10_000
        p = localpart(d)
        p[((i-1) % 2500)+1] = sum(rand(1_000))
    end
    d
end
Now the benchmarks:
julia> @btime f1();
  7.151 ms (1032 allocations: 42.97 KiB)

julia> @btime sum(f1());
  7.168 ms (1022 allocations: 42.81 KiB)

julia> @btime f2();
  7.110 ms (1057 allocations: 42.14 KiB)

julia> @btime sum(f2());
  7.405 ms (1407 allocations: 55.95 KiB)
Conclusion:
On a single machine the execution times are approximately equal, but collecting the data on the master node adds a significant number of memory allocations when DistributedArrays are used. Hence, on a single machine you always want to go for SharedArrays (moreover, the API is simpler as well).
I'm developing different discretization schemes, and to find out which one is the most efficient I would like to determine the maximum RAM consumption and the time it takes to do a specific task, such as solving a system of equations, overwriting a matrix, or writing the data to a file.
Is there any kind of code or something for doing what I need?
I'm using Julia in Ubuntu by the way, but I could do it in Windows as well.
Thanks a lot
I love using the built-in @time for this kind of thing. See "Measure performance with @time and pay attention to memory allocation". Example:
julia> @time myAwesomeFunction(tmp);
  1.293542 seconds (22.08 M allocations: 893.866 MiB, 6.62% gc time)
This prints out time, the number of memory allocations, the size of memory allocations, and the percent time spent garbage collecting ("gc"). Always run this at least twice—the first run will be dominated by compile times!
Also consider BenchmarkTools.jl. This will run the code multiple times, with some cool variable interpolation tricks, and give you better runtime/memory estimates:
julia> using BenchmarkTools, Compat

julia> @btime myAwesomeFunction($tmp);
  1.311 s (22080097 allocations: 893.87 MiB)
(My other favorite performance-related thing is the @code_* family of macros, like @code_warntype.)
I think that BenchmarkTools.jl measures total memory use, not peak. I haven't found pure Julia code to measure this, but perhaps this thread is relevant.
This came up in other, more complex, code but I've written what I think is a minimum working example.
I found this behaviour surprising:
function byvecdot!(a,b,c)
    for i in eachindex(a)
        a[i] = vecdot(b[:,i],c[:,i])
    end
    return
end

function byiteration!(a,b,c)
    for i in eachindex(a)
        a[i] = 0.0
        for j in 1:size(b,1)
            a[i] += b[j,i]*c[j,i]
        end
    end
    return
end
a = zeros(Float64,1000)
b = rand(Float64,1000,1000)
c = rand(Float64,1000,1000)

@time byvecdot!(a,b,c)
fill!(a,0.0) # just so we have exactly the same environment
@time byiteration!(a,b,c)
Results (after warming up the JIT):
0.089517 seconds (4.98 k allocations: 15.549 MB, 88.70% gc time)
0.003165 seconds (4 allocations: 160 bytes)
I'm more surprised by the number of allocations than the time (the former is surely causing the latter, particularly given all the gc time).
I expected vecdot to be more or less the same as doing it by iteration (with a few extra allocations for length checks etc).
Making this more confusing: when I use vecdot by itself (even on slices/views/subarrays/whatever-they-are-called like b[:,i]), without inserting the result into an array element, it does behave basically the same as by iteration. I looked at the source code in Julia base and, no surprise, vecdot is just iterating over and accumulating up the result.
My question is: Can someone explain to me why vecdot generates so many (unnecessary) allocations when I try to insert it into an array element? What mechanics am I failing to grasp here?
b[:,i] allocates a new Array object so there is a big difference between the two versions. The first version creates many temporaries that the GC will have to track and free. An alternative solution is
function byvecdot2!(a,b,c)
    for i in eachindex(a)
        a[i] = vecdot(view(b,:,i),view(c,:,i))
    end
    return
end
The views also allocate, but much less than the full copy that b[:,i] creates, so the GC has much less work to do.
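On current Julia versions, where vecdot has become dot in the LinearAlgebra standard library, the same idea can be written more compactly with the @views macro, which rewrites every slice in an expression into a non-allocating view. A sketch:

```julia
using LinearAlgebra

# @views turns each slice like b[:, i] into a view, so no column
# copies are allocated inside the loop.
function bydot!(a, b, c)
    for i in eachindex(a)
        @views a[i] = dot(b[:, i], c[:, i])
    end
    return
end

a = zeros(3)
b = [1.0 0.0 0.0;
     2.0 1.0 0.0;
     3.0 0.0 2.0]
c = ones(3, 3)
bydot!(a, b, c)  # a == [6.0, 1.0, 2.0]: each entry is a column dot product
```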