I have a simple question. I have defined a struct, and I need to inititate a lot (in the order of millions) of them and loop over them.
I am initiating one at a time and going through the loop as follows:
using Distributions
mutable struct help_me{Z<:Bool}
can_you_help_me::Z
millions_of_thanks::Z
end
for i in 1:max_iter
tmp_help = help_me(rand(Bernoulli(0.5),1)[1],rand(Bernoulli(0.99),1)[1])
# many follow-up processes
end
The memory allocation scales up in max_iter. For my purpose, I do not need to save each struct. Is there a way to "re-use" the memory allocation used by the struct?
Your main problem lies here:
rand(Bernoulli(0.5),1)[1], rand(Bernoulli(0.99),1)[1]
You are creating a length-1 array and then reading the first element from that array. This allocates unnecessary memory and takes time. Don't create an array here. Instead, write
rand(Bernoulli(0.5)), rand(Bernoulli(0.99))
This will just create random scalar numbers, no array.
Compare timings here:
julia> using BenchmarkTools
julia> #btime rand(Bernoulli(0.5),1)[1]
36.290 ns (1 allocation: 96 bytes)
false
julia> #btime rand(Bernoulli(0.5))
6.708 ns (0 allocations: 0 bytes)
false
6 times as fast, and no memory allocation.
This seems to be a general issue. Very often I see people writing rand(1)[1], when they should be using just rand().
Also, consider whether you actually need to make the struct mutable, as others have mentioned.
If the structure is not needed anymore (i.e. not referenced anywhere outside the current loop iteration), the Garbage Collector will free up its memory automatically if required.
Otherwise, I agree with the suggestions of Oscar Smith: memory allocation and garbage collection take time, avoid it for performance reasons if possible.
Related
I am reading Julia performance tips,
https://docs.julialang.org/en/v1/manual/performance-tips/
At the beginning, it mentions two examples.
Example 1,
julia> x = rand(1000);
julia> function sum_global()
s = 0.0
for i in x
s += i
end
return s
end;
julia> #time sum_global()
0.009639 seconds (7.36 k allocations: 300.310 KiB, 98.32% compilation time)
496.84883432553846
julia> #time sum_global()
0.000140 seconds (3.49 k allocations: 70.313 KiB)
496.84883432553846
We see a lot of memory allocations.
Now example 2,
julia> x = rand(1000);
julia> function sum_arg(x)
s = 0.0
for i in x
s += i
end
return s
end;
julia> #time sum_arg(x)
0.006202 seconds (4.18 k allocations: 217.860 KiB, 99.72% compilation time)
496.84883432553846
julia> #time sum_arg(x)
0.000005 seconds (1 allocation: 16 bytes)
496.84883432553846
We see that by putting x into into the argument of the function, memory allocations almost disappeared and the speed is much faster.
My question are, can anyone explain,
why example 1 needs so many allocation, and why example 2 does not need as many allocations as example 1?
I am a little confused.
in the two examples, we see that the second time we run Julia, it is always faster than the first time.
Does that mean we need to run Julia twice? If Julia is only fast at the second run, then what is point? Why not Julia just do a compiling first, then do a run, just like Fortran?
Is there any general rule to preventing memory allocations? Or do we just always have to do a #time to identify the issue?
Thanks!
why example 1 needs so many allocation, and why example 2 does not need as many allocations as example 1?
Example 1 needs so many allocations, because x is a global variable (defined out of scope of the function sum_arg). Therefore the type of variable x can potentially change at any time, i.e. it is possible that:
you define x and sum_arg
you compile sum_arg
you redefine x (change its type) and run sum_arg
In particular, as Julia supports multiple threading, both actions in step 3 in general could happen even in parallel (i.e. you could have changed the type of x in one thread while sum_arg would be running in another thread).
So because after compilation of sum_arg the type of x can change Julia, when compiling sum_arg has to ensure that the compiled code does not rely on the type of x that was present when the compilation took place. Instead Julia, in such cases, allows the type of x to be changed dynamically. However, this dynamic nature of allowed x means that it has to be checked in run-time (not compile time). And this dynamic checking of x causes performance degradation and memory allocations.
You could have fixed this by declaring x to be a const (as const ensures that the type of x may not change):
julia> const x = rand(1000);
julia> function sum_global()
s = 0.0
for i in x
s += i
end
return s
end;
julia> #time sum_global() # this is now fast
0.000002 seconds
498.9290555615045
Why not Julia just do a compiling first, then do a run, just like Fortran?
This is exactly what Julia does. However, the benefit of Julia is that it does compilation automatically when needed. This allows you for a smooth interactive development process.
If you wanted you could compile the function before it is run with the precompile function, and then run it separately. However, normally people just run the function without doing it explicitly.
The consequence is that if you use #time:
The first time you run a function it returns you both execution time and compilation time (and as you can see in examples you have pasted - you get information what percentage of time was spent on compilation).
In the consecutive runs the function is already compiled so only execution time is returned.
Is there any general rule to preventing memory allocations?
These rules are exactly given in the Performance Tips section of the manual that you are quoting in your question. The tip on using #time is a diagnostic tip there. All other tips are rules that are recommended to get a fast code. However, I understand that the list is long so a shorter list that is good enough to start with in my experience is:
Avoid global variables
Avoid containers with abstract type parameters
Write type stable functions
Avoid changing the type of a variable
I am trying to understand memory management a little better. I have the following example code:
begin
mutable struct SimplestStruct
a::Float64
end
function SimplestLoop!(a::Array{SimplestStruct, 1}, x::Float64)
for i in 1:length(a)
#inbounds a[i].a = x
end
end
simples = fill(SimplestStruct(rand()), 100)
#time SimplestLoop!(simples, 6.0)
end
As far as I can work out from the docs + various good posts around about in-place operations, SimplestLoop! should operate on its first argument without allocating any extra memory. However, #time reports 17k allocations.
Is there an obvious reason why this is happening?
Thank you in advance!
If you perform the #time measurement several times, you'll see that the first measurement is different from the others. This is because you're actually mostly measuring (just-ahead-of-time) compilation time and memory allocations.
When the objective is to better understand runtime performance, it is generally recommended to use BenchmarkTools to perform the benchmarks:
julia> using BenchmarkTools
julia> #btime SimplestLoop!($simples, 6.0)
82.930 ns (0 allocations: 0 bytes)
BenchmarkTools's #btime macro takes care of handling compilation times, as well as averaging runtime measurements over a sufficiently large number of samples to get accurate estimations. With this, we see that there are indeed no allocations in your code, as expected.
I'm new in using Julia and after some courses about numeric analysis programming became a hobby of mine.
I ran some tests with all my cores and did the same with threads to compare. I noticed that doing heavier computation went better with the threaded loop than with the process, But it was about the same when it came to addition. (operations were randomly selected for example)
After some research its all kinda vague and I ultimately want some perspective from someone that is using the same language if it matters at all.
Some technical info: 8 physical cores, julia added vector of 16 after addprocs() and nthreads() is 16
using Distributed
addprocs()
#everywhere using SharedArrays;
#everywhere using BenchmarkTools;
function test(lim)
r = zeros(Int64(lim / 16),Threads.nthreads())
Threads.#threads for i in eachindex(r)
r[Threads.threadid()] = (BigInt(i)^7 +5)%7;
end
return sum(r)
end
#btime test(10^4) # 1.178 ms (240079 allocations: 3.98 MiB)
#everywhere function test2(lim)
a = SharedArray{Int64}(lim);
#sync #distributed for i=1:lim
a[i] = (BigInt(i)^7 +5)%7;
end
return sum(a)
end
#btime test2(10^4) # 3.796 ms (4413 allocations: 189.02 KiB)
Note that your loops do very different things.
Int the first loop each thread keeps updating the same single cell the Array. Most likely since only a single memory cell is update in a single thread, the processor caching mechanism can be used to speed up things.
On the other hand the second loop each process is updating several different memory cells and such caching is not possible.
The first Array holds Float64 values while the second holds Int64 values
After correcting those things the difference gets smaller (this is on my laptop, I have only 8 threads):
julia> #btime test(10^4)
2.781 ms (220037 allocations: 3.59 MiB)
29997
julia> #btime test2(10^4)
4.867 ms (2145 allocations: 90.14 KiB)
29997
Now the other issue is that when Distributed is used you are doing inter-process communication which does not occur when using Threads.
Basically, the inter-process processing does not make sense to be used for jobs lasting few milliseconds. When you try to increase the processing volumes the difference might start to diminish.
So when to use what - it depends.. General guidelines (somewhat subjective) are following:
Processes are more robust (threads are still experimental)
Threads are easier as long as you do not need to use locking or atomic values
When the parallelism level is beyond 16 threads become inefficient and Distributed should be used (this is my personal observation)
When writing utility packages for others use threads - do not distribute code inside a package. Explanation: If you add multi-threading to a package it's behavior can be transparent to the user. On the other hand Julia's multiprocessing (Distributed package) abstraction does not distinguish between parallel and distributed - that is your workers can be either local or remote. This makes fundamental difference how code is designed (e.g. SharedArrays vs DistributedArrays), moreover the design of code might also depend on e.g. number of servers or possibilities of limiting inter-node communication. Hence normally, Distributed-related package logic should be separated from from standard utility package while the multi-threaded functionality can just be made transparent to the package user. There are of course some exceptions to this rule such as providing some distributed data processing server tools etc. but this is a general rule of thumb.
For huge scale computations I always use processes because you can easily go onto a computer cluster with them and distribute the workload across hundreds of machines.
How can we reserve memory (or allocate memory without initialization) in Julia? In C++, a common pattern is to call reserve before calling push_back several times to avoid having to call on malloc more than once. Is there an equivalent in Julia?
I think you are looking for sizehint!
help?> sizehint!
search: sizehint!
sizehint!(s, n)
Suggest that collection s reserve capacity for at least n elements. This can
improve performance.
I'm wondering about immutable types and performances in Julia.
In which case does making a composite type immutable improve perfomances? The documentation says
They are more efficient in some cases. Types like the Complex example
above can be packed efficiently into arrays, and in some cases the
compiler is able to avoid allocating immutable objects entirely.
I don't really understand the second part.
Are there cases where making a composite type immutable reduce performance (beyond the case where a field needs to be changed by reference)? I thought one example could be when an object of an immutable type is used repeatedly as an argument, since
An object with an immutable type is passed around (both in assignment statements and in function calls) by copying, whereas a mutable type is passed around by reference.
However, I can't find any difference in a simple example:
abstract MyType
type MyType1 <: MyType
v::Vector{Int}
end
immutable MyType2 <: MyType
v::Vector{Int}
end
g(x::MyType) = sum(x.v)
function f(x::MyType)
a = zero(Int)
for i in 1:10_000
a += g(x)
end
return a
end
x = fill(one(Int), 10_000)
x1 = MyType1(x)
#time f(x1)
# elapsed time: 0.030698826 seconds (96 bytes allocated)
x2 = MyType2(x)
#time f(x2)
# elapsed time: 0.031835494 seconds (96 bytes allocated)
So why isn't f slower with an immutable type? Are there cases where using immutable types makes a code slower?
Immutable types are especially fast when they are small and consist entirely of immediate data, with no references (pointers) to heap-allocated objects. For example, an immutable type that consists of two Ints can potentially be stored in registers and never exist in memory at all.
Knowing that a value won't change also helps us optimize code. For example you access x.v inside a loop, and since x.v will always refer to the same vector we can hoist the load for it outside the loop instead of re-loading on every iteration. However whether you get any benefit from that depends on whether that load was taking a significant fraction of the time in the loop.
It is rare in practice for immutables to slow down code, but there are two cases where it might happen. First, if you have a large immutable type (say 100 Ints) and do something like sorting an array of them where you need to move them around a lot, the extra copying might be slower than pointing to objects with references. Second, immutable objects are usually not allocated on the heap initially. If you need to store a heap reference to one (e.g. in an Any array), we need to move the object to the heap. From there the compiler is often not smart enough to re-use the heap-allocated version of the object, and so might copy it repeatedly. In such a case it would have been faster to just heap-allocate a single mutable object up front.
This test includes a special cases, so is not extendable and could not reject better performance of immutable types.
check following test and look at different allocation times,when create a vector of immutables compare to a vector of mutables
abstract MyType
type MyType1 <: MyType
i::Int
b::Bool
f::Float64
end
immutable MyType2 <: MyType
i::Int
b::Bool
f::Float64
end
#time x=[MyType2(i,1,1) for i=1:100_000];
# => 0.001396 seconds (2 allocations: 1.526 MB)
#time x=[MyType1(i,1,1) for i=1:100_000];
# => 0.003683 seconds (100.00 k allocations: 3.433 MB)