Avoid memory allocation when indexing an array in Julia

UPDATE: Note that the relevant function in Julia v1+ is view
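On Julia 1.x the experiment below can be reproduced with view; a minimal sketch (mean now lives in the Statistics standard library, and the key point is to keep 1:N as a range rather than collecting it into a vector):
using Statistics

N = 10_000_000
x = randn(N)

@time mean(x)               # allocates essentially nothing
@time mean(view(x, 1:N))    # small constant-size wrapper, no copy
@time mean(@view x[1:N])    # @view: the same thing in indexing syntax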
Question: I would like to index into an array without triggering memory allocation, especially when passing the indexed elements into a function. From reading the Julia docs, I suspect the answer revolves around using the sub function, but I can't quite see how...
Working Example: I build a large vector of Float64 (x) and then an index to every observation in x.
N = 10000000
x = randn(N)
inds = [1:N]
Now I time the mean function over x and x[inds] (I run mean(randn(2)) first so that compilation isn't included in the timings):
@time mean(x)
@time mean(x[inds])
It's an identical calculation, but as expected the results of the timings are:
elapsed time: 0.007029772 seconds (96 bytes allocated)
elapsed time: 0.067880112 seconds (80000208 bytes allocated, 35.38% gc time)
So, is there a way around the memory allocation problem for arbitrary choices of inds (and arbitrary choice of array and function)?

Just use xs = sub(x, 1:N). Note that this is different from xs = sub(x, [1:N]); on julia 0.3 the latter will fail, and on julia 0.4-pre the latter will be considerably slower than the former. On julia 0.4-pre, sub(x, 1:N) is just as fast as view:
julia> N = 10000000;
julia> x = randn(N);
julia> xs = sub(x, 1:N);
julia> using ArrayViews
julia> xv = view(x, 1:N);
julia> mean(x)
-0.0002491126429772525
julia> mean(xs)
-0.0002491126429772525
julia> mean(xv)
-0.0002491126429772525
julia> @time mean(x);
elapsed time: 0.015345806 seconds (27 kB allocated)
julia> @time mean(xs);
elapsed time: 0.013815785 seconds (96 bytes allocated)
julia> @time mean(xv);
elapsed time: 0.015871052 seconds (96 bytes allocated)
There are several reasons why sub(x, inds) is slower than sub(x, 1:N):
Each access xs[i] corresponds to x[inds[i]]; we have to look up two memory locations rather than one
If inds is not in order, you will get poor cache behavior in accessing the elements of x
It destroys the ability to use SIMD vectorization
In this case, the last is probably the most important effect. This is not a Julia limitation; the same thing would happen were you to write the equivalent code in C, Fortran, or assembly.
Note that it's still faster to say sum(sub(x, inds)) than sum(x[inds]) (until the latter becomes the former, which it should by the time julia 0.4 is officially out). But if you have to do many operations with xs = sub(x, inds), in some circumstances it will be worth your while to make a copy, even though it allocates memory, just so you can take advantage of the optimizations possible when values are stored in contiguous memory. A sketch of that copy-first strategy follows.
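A minimal sketch of the copy-first strategy on Julia 1.x (Random and Statistics are standard libraries; the sizes here are illustrative):
using Random, Statistics

x    = randn(10_000_000)
inds = shuffle!(collect(1:10_000_000))[1:1_000_000]   # out-of-order indices
buf  = similar(x, length(inds))                       # allocate once, reuse across passes

copyto!(buf, view(x, inds))   # one gather into contiguous memory
mean(buf)                     # subsequent passes are contiguous and SIMD-friendly
mean(view(x, inds))           # no copy, but gathered access on every pass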

EDIT: Read tholy's answer too to get a full picture!
When using an array of indices, the situation is not great right now on Julia 0.4-pre (start of Feb 2015):
julia> N = 10000000;
julia> x = randn(N);
julia> inds = [1:N];
julia> @time mean(x)
elapsed time: 0.010702729 seconds (96 bytes allocated)
elapsed time: 0.012167155 seconds (96 bytes allocated)
julia> @time mean(x[inds])
elapsed time: 0.088312275 seconds (76 MB allocated, 17.87% gc time in 1 pauses with 0 full sweep)
elapsed time: 0.073672734 seconds (76 MB allocated, 3.27% gc time in 1 pauses with 0 full sweep)
elapsed time: 0.071646757 seconds (76 MB allocated, 1.08% gc time in 1 pauses with 0 full sweep)
julia> xs = sub(x,inds); # Only works on 0.4
julia> @time mean(xs)
elapsed time: 0.057446177 seconds (96 bytes allocated)
elapsed time: 0.096983673 seconds (96 bytes allocated)
elapsed time: 0.096711312 seconds (96 bytes allocated)
julia> using ArrayViews
julia> xv = view(x, 1:N) # Note use of a range, not [1:N]!
julia> @time mean(xv)
elapsed time: 0.012919509 seconds (96 bytes allocated)
elapsed time: 0.013010655 seconds (96 bytes allocated)
elapsed time: 0.01288134 seconds (96 bytes allocated)
julia> xs = sub(x,1:N) # Works on 0.3 and 0.4
julia> @time mean(xs)
elapsed time: 0.014191482 seconds (96 bytes allocated)
elapsed time: 0.014023089 seconds (96 bytes allocated)
elapsed time: 0.01257188 seconds (96 bytes allocated)
So while we can avoid the memory allocation, we are actually still slower(!).
The issue is indexing by an array, as opposed to a range. You can't use sub for this on 0.3, but you can on 0.4.
If we can index by a range, then we can use ArrayViews.jl on 0.3, and it's built in on 0.4. This case is pretty much as good as the original mean.
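The same distinction survives on Julia 1.x, where view covers both cases: a range view keeps contiguous, SIMD-friendly access, while an index-vector view has to gather. A minimal sketch (x and N as above):
xr = view(x, 1:N)            # range view: contiguous, fast
xi = view(x, collect(1:N))   # index-vector view: gathered access, slower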
I noticed that with a smaller number of indices used (instead of the whole range), the gap is much smaller and the memory allocation is low, so sub might be worthwhile:
N = 100000000
x = randn(N)
inds = [1:div(N,10)]
@time mean(x)
@time mean(x)
@time mean(x)
@time mean(x[inds])
@time mean(x[inds])
@time mean(x[inds])
xi = sub(x,inds)
@time mean(xi)
@time mean(xi)
@time mean(xi)
gives
elapsed time: 0.092831612 seconds (985 kB allocated)
elapsed time: 0.067694917 seconds (96 bytes allocated)
elapsed time: 0.066209038 seconds (96 bytes allocated)
elapsed time: 0.066816927 seconds (76 MB allocated, 20.62% gc time in 1 pauses with 1 full sweep)
elapsed time: 0.057211528 seconds (76 MB allocated, 19.57% gc time in 1 pauses with 0 full sweep)
elapsed time: 0.046782848 seconds (76 MB allocated, 1.81% gc time in 1 pauses with 0 full sweep)
elapsed time: 0.186084807 seconds (4 MB allocated)
elapsed time: 0.057476269 seconds (96 bytes allocated)
elapsed time: 0.05733602 seconds (96 bytes allocated)

Related

Optimize looping over a large string to reduce allocations

I am trying to loop over a string in Julia to parse it. I have a DefaultDict inside a struct, containing the number of times I have seen a particular character.
using Parameters, DataStructures   # @with_kw is from Parameters.jl, DefaultDict from DataStructures.jl
@with_kw mutable struct Metrics
...
nucleotides = DefaultDict{Char, Int64}(0)
...
end
I have written a function to loop over a string and increment the value of each character in the DefaultDict.
function compute_base_composition(sequence::String, metrics::Metrics)
for i in 1:sizeof(sequence)
metrics.nucleotides[sequence[i]] += 1
end
end
This function is called in a for loop because I need to do this for multiple strings (which can be up to 2 billion characters long). When I run the @time macro, I get this result:
@time compute_base_composition(sequence, metrics)
0.167172 seconds (606.20 k allocations: 15.559 MiB, 78.00% compilation time)
0.099403 seconds (1.63 M allocations: 24.816 MiB)
0.032346 seconds (633.24 k allocations: 9.663 MiB)
0.171382 seconds (3.06 M allocations: 46.751 MiB, 4.64% gc time)
As you can see, there are a lot of memory allocations for such a simple function. I have tried to change the for loop to something like for c in sequence but that didn't change much. Would there be a way to reduce them and make the function faster?
Work on bytes, not on Unicode chars
Use Vectors, not Dicts
Avoid untyped fields in containers
@with_kw struct MetricsB
nucleotides::Vector{Int} = zeros(Int, 256)   # one counter per possible byte value
end
function compute_base_composition(sequence::String, metrics::MetricsB)
bs = Vector{UInt8}(sequence)   # copies the raw bytes once
for i in 1:length(bs)
@inbounds metrics.nucleotides[bs[i] + 1] += 1   # shift the 0-based byte to a 1-based index
end
end
And a benchmark with a nice speedup of about 90x:
julia> using Random   # randstring lives in Random on Julia 1.x
julia> st = randstring(10_000_000);
julia> @time compute_base_composition(st, Metrics())
1.793991 seconds (19.94 M allocations: 304.213 MiB, 3.33% gc time)
julia> @time compute_base_composition(st, MetricsB())
0.019398 seconds (3 allocations: 9.539 MiB)
Actually you can almost totally avoid allocations with the following code:
function compute_base_composition2(sequence::String, metrics::MetricsB)
pp = pointer(sequence)
for i in 1:sizeof(sequence)   # sizeof counts bytes; length would count characters
@inbounds metrics.nucleotides[Base.pointerref(pp, i, 1) + 1] += 1   # shift the 0-based byte to a 1-based index
end
end
and now:
julia> @time compute_base_composition2(st, MetricsB())
0.021161 seconds (1 allocation: 2.125 KiB)
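A hedged alternative sketch for Julia 1.x: codeunits gives a zero-copy byte view of a String, which avoids both the Vector copy and the raw-pointer tricks (compute_base_composition3 is a name invented here):
function compute_base_composition3(sequence::String, metrics::MetricsB)
for b in codeunits(sequence)                  # zero-copy AbstractVector{UInt8}
@inbounds metrics.nucleotides[b + 1] += 1     # shift the 0-based byte to a 1-based index
end
end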

Converting SuiteSparse.SPQR.QRSparseQ to SparseMatrixCSC?

I have this problem that converting the native sparse format for the QR decomposition of a sparse Matrix takes forever. However, I need it in the CSC format to use it for further computations.
using LinearAlgebra, SparseArrays
N = 1000
A = sprand(N,N,1e-4)
@time F = qr(A)
@time F.Q
@time Q_sparse = sparse(F.Q)
0.000420 seconds (1.15 k allocations: 241.017 KiB)
0.000008 seconds (6 allocations: 208 bytes)
6.067351 seconds (2.00 M allocations: 15.140 GiB, 36.25% gc time)
Any suggestions?
Okay, I found the problem. For other people trying to do it: F.Q stores Q implicitly as a sequence of Householder reflectors (the columns of factors, with coefficients τ), so you can construct the sparse Q yourself as the product of those reflectors, Q = prod_i (I - τ[i] * v_i * v_i'):
factors = F.Q.factors
τ = F.Q.τ
Nτ = size(factors)[2]
Isp = sparse(I(N));   # sparse N×N identity (I comes from LinearAlgebra)
@time Q_constr = prod(Isp - factors[:,i]*τ[i]*factors[:,i]' for i in 1:Nτ)
Q_constr ≈ Q_sparse
0.084461 seconds (62.64 k allocations: 3.321 MiB, 18.28% gc time)
true
You see that the method sparse(F.Q) goes through a much less efficient code path for this conversion. If you construct Q from the stored reflectors as above, it is considerably faster.

How to obtain the execution time of a function in Julia?

I want to obtain the execution time of a function in Julia. Here is a minimum working example:
function raise_to(n)
for i in 1:n
y = (1/7)^n
end
end
How do I obtain the time it took to execute raise_to(10)?
The recommended way to benchmark a function is to use BenchmarkTools:
julia> function raise_to(n)
y = (1/7)^n
end
raise_to (generic function with 1 method)
julia> using BenchmarkTools
julia> @btime raise_to(10)
1.815 ns (0 allocations: 0 bytes)
Note that repeating the computation numerous times (like you did in your example) is a good idea to get more accurate measurements. But BenchmarkTools does it for you.
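One hedged tip when benchmarking with global variables: interpolate them into the expression with $ so BenchmarkTools treats them as locals rather than untyped globals. A minimal sketch:
julia> n = 10;
julia> @btime raise_to($n)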
Also note that BenchmarkTools avoids many pitfalls of merely using @time. Most notably with @time, you're likely to measure compilation time in addition to run time. This is why the first invocation of @time often displays larger times/allocations:
# First invocation: the method gets compiled
# Large resource consumption
julia> @time raise_to(10)
0.007901 seconds (7.70 k allocations: 475.745 KiB)
3.5401331746414338e-9
# Subsequent invocations: stable and low timings
julia> @time raise_to(10)
0.000003 seconds (5 allocations: 176 bytes)
3.5401331746414338e-9
julia> @time raise_to(10)
0.000002 seconds (5 allocations: 176 bytes)
3.5401331746414338e-9
julia> @time raise_to(10)
0.000001 seconds (5 allocations: 176 bytes)
3.5401331746414338e-9
@time
@time works as mentioned in previous answers, but it will include compilation time if it is the first time you call the function in your Julia session.
https://docs.julialang.org/en/v1/manual/performance-tips/#Measure-performance-with-%5B%40time%5D%28%40ref%29-and-pay-attention-to-memory-allocation-1
@btime
You can also use @btime if you put using BenchmarkTools in your code.
https://github.com/JuliaCI/BenchmarkTools.jl
This will rerun your function many times after an initial compile run, and then report the minimum time observed, which is usually the most stable measure.
julia> using BenchmarkTools
julia> @btime sin(x) setup=(x=rand())
4.361 ns (0 allocations: 0 bytes)
0.49587200950472454
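If you want the full distribution rather than a single number, the same package's @benchmark macro reports minimum, median, mean, and maximum times plus an allocation summary, with the same setup syntax:
julia> @benchmark sin(x) setup=(x=rand())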
@timeit
Another super useful library for profiling is TimerOutputs.jl
https://github.com/KristofferC/TimerOutputs.jl
using TimerOutputs
to = TimerOutput()   # create the timer object that the calls below accumulate into
# Time a section of code with the label "sleep" in the `TimerOutput` named "to"
@timeit to "sleep" sleep(0.02)
# ... several more calls to @timeit
print_timer(to)
──────────────────────────────────────────────────────────────────────
                               Time                   Allocations
                       ──────────────────────  ───────────────────────
  Tot / % measured:        5.09s / 56.0%           106MiB / 74.6%

Section        ncalls     time   %tot      avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────────
sleep             101    1.17s  41.2%   11.6ms   1.48MiB  1.88%   15.0KiB
nest 2              1    703ms  24.6%    703ms   2.38KiB  0.00%   2.38KiB
  level 2.2         1    402ms  14.1%    402ms      368B  0.00%    368.0B
  level 2.1         1    301ms  10.6%    301ms      368B  0.00%    368.0B
throwing            1    502ms  17.6%    502ms      384B  0.00%    384.0B
nest 1              1    396ms  13.9%    396ms   5.11KiB  0.01%   5.11KiB
  level 2.2         1    201ms  7.06%    201ms      368B  0.00%    368.0B
  level 2.1         3   93.5ms  3.28%   31.2ms   1.08KiB  0.00%    368.0B
randoms             1   77.5ms  2.72%   77.5ms   77.3MiB  98.1%   77.3MiB
funcdef             1   2.66μs  0.00%   2.66μs         -  0.00%         -
──────────────────────────────────────────────────────────────────────
Macros can take begin ... end blocks
As the docs for these macros show, they can cover multiple statements or whole functions:
@my_macro begin
statement1
statement2
# ...
statement3
end
Hope that helps.
The @time macro can be used to tell you how long the function took to evaluate. It also shows how much memory was allocated.
julia> function raise_to(n)
for i in 1:n
y = (1/7)^n
end
end
raise_to (generic function with 1 method)
julia> @time raise_to(10)
0.093018 seconds (26.00 k allocations: 1.461 MiB)
It would be nice to add that if you want to find the run time of a code block, you can do as follows:
@time begin
# your code
end
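Relatedly, if you need the elapsed time as a value rather than as printed output, @elapsed returns the seconds; a small sketch:
t = @elapsed begin
# your code
end
println("took ", t, " seconds")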

Dot syntax causes 200x slowdown with vector addition

I was experimenting with the speed of vector addition and component-wise exponentiation, when I came across a strange result with the dot vectorization syntax.
The non-vectorized version,
julia> @time exp(randn(1000) + randn(1000))
takes about 0.001 seconds after a few runs. It also gives a deprecation warning as of 0.6.
If I vectorize the exponential function,
julia> @time exp.(randn(1000) + randn(1000))
I get a 4x speedup, to around 0.00025 seconds.
However, if I vectorize both the exponential function and addition of the vectors,
julia> @time exp.(randn(1000) .+ randn(1000))
I get a large slowdown to around 0.05 seconds. Why does this occur? When should the dot syntax be avoided to maximize performance?
.+ creates an anonymous function. In the REPL, this function is created every time, which will blow up your timing results. In addition, the use of global variables (dynamically typed, i.e. uninferrable) slows down all of your examples. In any real case your code will be in a function. When it's in a function, it's only compiled the first time the function is called. Example:
> x = randn(1000); y = randn(1000);
> @time exp(x + y);
WARNING: exp(x::AbstractArray{T}) where T <: Number is deprecated, use exp.(x) instead.
Stacktrace:
[1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
[2] exp(::Array{Float64,1}) at .\deprecated.jl:57
[3] eval(::Module, ::Any) at .\boot.jl:235
[4] eval_user_input(::Any, ::Base.REPL.REPLBackend) at .\REPL.jl:66
[5] macro expansion at C:\Users\Chris\.julia\v0.6\Revise\src\Revise.jl:775 [inlined]
[6] (::Revise.##17#18{Base.REPL.REPLBackend})() at .\event.jl:73
while loading no file, in expression starting on line 237
0.620712 seconds (290.34 k allocations: 15.150 MiB)
> @time exp(x + y);
0.023072 seconds (27.09 k allocations: 1.417 MiB)
> @time exp(x + y);
0.000334 seconds (95 allocations: 27.938 KiB)
>
> @time exp.(x .+ y);
1.764459 seconds (735.52 k allocations: 39.169 MiB, 0.80% gc time)
> @time exp.(x .+ y);
0.017914 seconds (5.92 k allocations: 328.978 KiB)
> @time exp.(x .+ y);
0.017853 seconds (5.92 k allocations: 328.509 KiB)
>
> f(x,y) = exp.(x .+ y);
> @time f(x,y);
0.022357 seconds (21.59 k allocations: 959.157 KiB)
> @time f(x,y);
0.000020 seconds (5 allocations: 8.094 KiB)
> @time f(x,y);
0.000021 seconds (5 allocations: 8.094 KiB)
Notice that putting it into a function lets it compile and optimize. This is one of the main things mentioned in the Julia Performance Tips.

Memory-efficient sortperm on a column of a matrix

I have a large-ish matrix and I'd like to apply sortperm to each column of that matrix. The naive thing to do is
order = sortperm(X[:,j])
which makes a copy. That seems like a shame, so I thought I'd try a SubArray:
order = sortperm(sub(X,1:n,j))
but that was even slower. For a laugh I tried
order = sortperm(1:n,by=i->X[i,j])
but of course that was terrible. What is the fastest way to do this?
Here is some benchmark code:
getperm1(X,n,j) = sortperm(X[:,j])
getperm2(X,n,j) = sortperm(sub(X,1:n,j))
getperm3(X,n) = mapslices(sortperm, X, 1)
n = 1000000
X = rand(n, 10)
for f in [getperm1, getperm2]
println(f)
for it in 1:5
gc()
@time f(X,n,5)
end
end
for f in [getperm3]
println(f)
for it in 1:5
gc()
@time getperm3(X,n)
end
end
results:
getperm1
elapsed time: 0.258576164 seconds (23247944 bytes allocated)
elapsed time: 0.141448346 seconds (16000208 bytes allocated)
elapsed time: 0.137306078 seconds (16000208 bytes allocated)
elapsed time: 0.137385171 seconds (16000208 bytes allocated)
elapsed time: 0.139137529 seconds (16000208 bytes allocated)
getperm2
elapsed time: 0.433251141 seconds (11832620 bytes allocated)
elapsed time: 0.33970986 seconds (8000624 bytes allocated)
elapsed time: 0.339840795 seconds (8000624 bytes allocated)
elapsed time: 0.342436716 seconds (8000624 bytes allocated)
elapsed time: 0.342867431 seconds (8000624 bytes allocated)
getperm3
elapsed time: 1.766020534 seconds (257397404 bytes allocated, 1.55% gc time)
elapsed time: 1.43763525 seconds (240007488 bytes allocated, 1.85% gc time)
elapsed time: 1.41373546 seconds (240007488 bytes allocated, 1.82% gc time)
elapsed time: 1.42215519 seconds (240007488 bytes allocated, 1.83% gc time)
elapsed time: 1.419174037 seconds (240007488 bytes allocated, 1.83% gc time)
The mapslices version is about 10x the getperm1 version, as you'd expect, since it computes sortperm for all 10 columns.
It's worth pointing out that, on my machine at least, the copy+sortperm option is not much slower than a plain sortperm on a vector of the same length, but the allocation isn't necessary, so it would be nice to avoid it.
You can beat the performance of SubArray in a few very specific cases (like taking a continuous view of an Array) with pointer magic:
function colview(X::Matrix,j::Int)
n = size(X,1)
offset = 1+n*(j-1) # The linear start position
checkbounds(X, offset+n-1)
pointer_to_array(pointer(X, offset), (n,))
end
getperm4(X,n,j) = sortperm(colview(X,j))
The function colview will return a full-fledged Array that shares its data with the original X. Note that this is a terrible idea, because the returned array references data that Julia only keeps track of through X. If X gets garbage-collected while the column "view" is still in use, data access will crash with a segfault. (On Julia 1.x the analogous tool is unsafe_wrap, with the same caveat; GC.@preserve can guard the lifetime.)
With results:
getperm1
elapsed time: 0.317923176 seconds (15 MB allocated)
elapsed time: 0.252215996 seconds (15 MB allocated)
elapsed time: 0.215124686 seconds (15 MB allocated)
elapsed time: 0.210062109 seconds (15 MB allocated)
elapsed time: 0.213339974 seconds (15 MB allocated)
getperm2
elapsed time: 0.509172302 seconds (7 MB allocated)
elapsed time: 0.509961218 seconds (7 MB allocated)
elapsed time: 0.506399583 seconds (7 MB allocated)
elapsed time: 0.512562736 seconds (7 MB allocated)
elapsed time: 0.506199265 seconds (7 MB allocated)
getperm4
elapsed time: 0.225968056 seconds (7 MB allocated)
elapsed time: 0.220587707 seconds (7 MB allocated)
elapsed time: 0.219854355 seconds (7 MB allocated)
elapsed time: 0.226289377 seconds (7 MB allocated)
elapsed time: 0.220391515 seconds (7 MB allocated)
I've not looked into why the performance is worse with SubArray, but it may simply be from an extra pointer dereference on every memory access. It's remarkable how little the allocation actually costs you in terms of time: getperm1's timings are more variable, but it still occasionally beats getperm4! I think this is due to some extra pointer math in Array's internal implementation with shared data. There's also some curious caching behavior: getperm1 gets significantly faster on repeated runs.
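A hedged modern footnote (Julia 1.x): sub is gone, views are created with view, and sortperm! fills a preallocated index vector, so the per-column permutation can be computed without allocating at all. A minimal sketch:
X = rand(1_000_000, 10)
order = Vector{Int}(undef, size(X, 1))   # allocate the index vector once
for j in 1:size(X, 2)
    sortperm!(order, view(X, :, j))      # no column copy, no output allocation
    # ... use order before the next column overwrites it ...
end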
