Optimize looping over a large string to reduce allocations - julia

I am trying to loop over a string in Julia to parse it. I have a DefaultDict inside a struct, containing the number of times I have seen a particular character.
#with_kw mutable struct Metrics
...
nucleotides = DefaultDict{Char, Int64}(0)
...
end
I have written a function to loop over a string and increment the value of each character in the DefaultDict.
function compute_base_composition(sequence::String, metrics::Metrics)
for i in 1:sizeof(sequence)
metrics.nucleotides[sequence[i]] += 1
end
end
This function is called in a for loop because I need to do this for multiple strings (which can be up to 2 billions characters long). When I run the #time macro, I get this result:
#time compute_base_composition(sequence, metrics)
0.167172 seconds (606.20 k allocations: 15.559 MiB, 78.00% compilation time)
0.099403 seconds (1.63 M allocations: 24.816 MiB)
0.032346 seconds (633.24 k allocations: 9.663 MiB)
0.171382 seconds (3.06 M allocations: 46.751 MiB, 4.64% gc time)
As you can see, there are a lot of memory allocations for such a simple function. I have tried to change the for loop to something like for c in sequence but that didn't change much. Would there be a way to reduce them and make the function faster?

Work on bytes no on unicode chars
Use Vectors not Dicts
Avoid untyped fields in containers
#with_kw struct MetricsB
nucleotides::Vector{Int}=zeros(Int, 256)
end
function compute_base_composition(sequence::String, metrics::MetricsB)
bs = Vector{UInt8}(sequence)
for i in 1:length(bs)
#inbounds metrics.nucleotides[bs[i]] += 1
end
end
And a benchmark with a nice speedup of 90x :
julia> st = randstring(10_000_000);
julia> #time compute_base_composition(st, Metrics())
1.793991 seconds (19.94 M allocations: 304.213 MiB, 3.33% gc time)
julia> #time compute_base_composition(st, MetricsB())
0.019398 seconds (3 allocations: 9.539 MiB)
Actually you can almost totally avoid allocations with the following code:
function compute_base_composition2(sequence::String, metrics::MetricsB)
pp = pointer(sequence)
for i in 1:length(sequence)
#inbounds metrics.nucleotides[Base.pointerref(pp, i, 1)] += 1
end
end
and now:
julia> #time compute_base_composition2(st, MetricsB())
0.021161 seconds (1 allocation: 2.125 KiB)

Related

Converting SuiteSparse.SPQR.QRSparseQ to SparseMatrixCSC?

I have this problem that converting the native sparse format for the QR decomposition of a sparse Matrix takes forever. However, I need it in the CSC format to use it for further computations.
using LinearAlgebra, SparseArrays
N = 1000
A = sprand(N,N,1e-4)
#time F = qr(A)
#time F.Q
#time Q_sparse = sparse(F.Q)
0.000420 seconds (1.15 k allocations: 241.017 KiB)
0.000008 seconds (6 allocations: 208 bytes)
6.067351 seconds (2.00 M allocations: 15.140 GiB, 36.25% gc time)
Any suggestions?
Okay, I found the problem. For other people trying to do it:
factors = F.Q.factors
τ = F.Q.τ
Nτ = size(factors)[2]
Isp = sparse(I(N));
#time Q_constr = prod(Isp - factors[:,i]*τ[i]*factors[:,i]' for i in 1:Nτ)
Q_constr ≈ Q_sparse
0.084461 seconds (62.64 k allocations: 3.321 MiB, 18.28% gc time)
true
You see that the method sparse(F.Q) is somehow using the wrong representation. If you construct Q as I did above, it will be considerably faster.

How to obtain the execution time of a function in Julia?

I want to obtain the execution time of a function in Julia. Here is a minimum working example:
function raise_to(n)
for i in 1:n
y = (1/7)^n
end
end
How to obtain the time it took to execute raise_to(10) ?
The recommended way to benchmark a function is to use BenchmarkTools:
julia> function raise_to(n)
y = (1/7)^n
end
raise_to (generic function with 1 method)
julia> using BenchmarkTools
julia> #btime raise_to(10)
1.815 ns (0 allocations: 0 bytes)
Note that repeating the computation numerous times (like you did in your example) is a good idea to get more accurate measurements. But BenchmarTools does it for you.
Also note that BenchmarkTools avoids many pitfalls of merely using #time. Most notably with #time, you're likely to measure compilation time in addition to run time. This is why the first invocation of #time often displays larger times/allocations:
# First invocation: the method gets compiled
# Large resource consumption
julia> #time raise_to(10)
0.007901 seconds (7.70 k allocations: 475.745 KiB)
3.5401331746414338e-9
# Subsequent invocations: stable and low timings
julia> #time raise_to(10)
0.000003 seconds (5 allocations: 176 bytes)
3.5401331746414338e-9
julia> #time raise_to(10)
0.000002 seconds (5 allocations: 176 bytes)
3.5401331746414338e-9
julia> #time raise_to(10)
0.000001 seconds (5 allocations: 176 bytes)
3.5401331746414338e-9
#time
#time works as mentioned in previous answers, but it will include compile time if it is the first time you call the function in your julia session.
https://docs.julialang.org/en/v1/manual/performance-tips/#Measure-performance-with-%5B%40time%5D%28%40ref%29-and-pay-attention-to-memory-allocation-1
#btime
You can also use #btime if you put using BenchmarkTools in your code.
https://github.com/JuliaCI/BenchmarkTools.jl
This will rerun your function many times after an initial compile run, and then average the time.
julia> using BenchmarkTools
julia> #btime sin(x) setup=(x=rand())
4.361 ns (0 allocations: 0 bytes)
0.49587200950472454
#timeit
Another super useful library for Profiling is TimerOutputs.jl
https://github.com/KristofferC/TimerOutputs.jl
using TimerOutputs
# Time a section code with the label "sleep" to the `TimerOutput` named "to"
#timeit to "sleep" sleep(0.02)
# ... several more calls to #timeit
print_timer(to::TimerOutput)
──────────────────────────────────────────────────────────────────────
Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 5.09s / 56.0% 106MiB / 74.6%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────
sleep 101 1.17s 41.2% 11.6ms 1.48MiB 1.88% 15.0KiB
nest 2 1 703ms 24.6% 703ms 2.38KiB 0.00% 2.38KiB
level 2.2 1 402ms 14.1% 402ms 368B 0.00% 368.0B
level 2.1 1 301ms 10.6% 301ms 368B 0.00% 368.0B
throwing 1 502ms 17.6% 502ms 384B 0.00% 384.0B
nest 1 1 396ms 13.9% 396ms 5.11KiB 0.01% 5.11KiB
level 2.2 1 201ms 7.06% 201ms 368B 0.00% 368.0B
level 2.1 3 93.5ms 3.28% 31.2ms 1.08KiB 0.00% 368.0B
randoms 1 77.5ms 2.72% 77.5ms 77.3MiB 98.1% 77.3MiB
funcdef 1 2.66μs 0.00% 2.66μs - 0.00% -
──────────────────────────────────────────────────────────────────────
Macros can have begin ... end
As seen in the docs for these functions they can cover multiple statements or functions.
#my_macro begin
statement1
statement2
# ...
statement3
end
Hope that helps.
The #time macro can be used to tell you how long the function took to evaluate. It also gives how the memory was allocated.
julia> function raise_to(n)
for i in 1:n
y = (1/7)^n
end
end
raise_to (generic function with 1 method)
julia> #time raise_to(10)
0.093018 seconds (26.00 k allocations: 1.461 MiB)
It would be nice to add that if you want to find the run time of a code block, you can do as follow:
#time begin
# your code
end

Profiling/Memory allocation in Julia

I am running an empty double loop in Julia
Ngal = 16000000
function get_vinz()
for i in 1:5
print(i, " ")
for j in i:Ngal
end
end
end
and the outcome of #time vinz() gives me
1 2 3 4 5 5.332660 seconds (248.94 M allocations: 4.946 GiB, 7.12% gc time)
What is the 5GB of memory allocated for?
the culprit is the use of global variables. your function calls the global variable, and with each call, a Int64 is allocated (64 bits). 64*16000000*5/1024/1024 = 4882.8125 MiB, that seems like the culprit your function doesn't know the size of the inner loop, and does a lookup on the global scope to check Ngal. It does that every single loop. compare that with this implementation:
function get_vinz(Ngal)
for i in 1:5
print(i, " ")
for j in i:Ngal
end
end
end
julia> #time get_vinz(Ngal)
1 2 3 4 5 0.043481 seconds (53.67 k allocations: 2.776 MiB)
also, the first time a function is called in julia, is compiled to machine code, so the subsecuent runs are fast. measuring time again:
julia> #time get_vinz(Ngal)
1 2 3 4 5 0.000639 seconds (50 allocations: 1.578 KiB)
The use of global variables is a bad practice in general. the recommended way is to pass those values to the function

Dot syntax causes 200x slowdown with vector addition

I was experimenting with the speed of vector addition and component-wise exponentiation, when I came across a strange result with the dot vectorization syntax.
The non-vectorized version,
julia> #time exp(randn(1000) + randn(1000))
takes about 0.001 seconds after a few runs. It also gives a deprecation warning as of 0.6.
If I vectorize the exponential function,
julia> #time exp.(randn(1000) + randn(1000))
I get a 4x speedup, to around 0.00025 seconds.
However, if I vectorize both the exponential function and addition of the vectors,
julia> #time exp.(randn(1000) .+ randn(1000))
I get a large slowdown to around 0.05 seconds. Why does this occur? When should the dot syntax be avoided to maximize performance?
.+ creates an anonymous function. In the REPL, this function is created every time and will blow up your timing results. In addition, the use of global (dynamically typed, i.e. uninferrable) slow down all of your examples. In any real case your code will be in a function. When it's in a function, it's only compiled the first time the function is called. Example:
> x = randn(1000); y = randn(1000);
> #time exp(x + y);
WARNING: exp(x::AbstractArray{T}) where T <: Number is deprecated, use exp.(x) instead.
Stacktrace:
[1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
[2] exp(::Array{Float64,1}) at .\deprecated.jl:57
[3] eval(::Module, ::Any) at .\boot.jl:235
[4] eval_user_input(::Any, ::Base.REPL.REPLBackend) at .\REPL.jl:66
[5] macro expansion at C:\Users\Chris\.julia\v0.6\Revise\src\Revise.jl:775 [inlined]
[6] (::Revise.##17#18{Base.REPL.REPLBackend})() at .\event.jl:73
while loading no file, in expression starting on line 237
0.620712 seconds (290.34 k allocations: 15.150 MiB)
> #time exp(x + y);
0.023072 seconds (27.09 k allocations: 1.417 MiB)
> #time exp(x + y);
0.000334 seconds (95 allocations: 27.938 KiB)
>
> #time exp.(x .+ y);
1.764459 seconds (735.52 k allocations: 39.169 MiB, 0.80% gc time)
> #time exp.(x .+ y);
0.017914 seconds (5.92 k allocations: 328.978 KiB)
> #time exp.(x .+ y);
0.017853 seconds (5.92 k allocations: 328.509 KiB)
>
> f(x,y) = exp.(x .+ y);
> #time f(x,y);
0.022357 seconds (21.59 k allocations: 959.157 KiB)
> #time f(x,y);
0.000020 seconds (5 allocations: 8.094 KiB)
> #time f(x,y);
0.000021 seconds (5 allocations: 8.094 KiB)
Notice that by putting it into a function it compiles and optimizes. This is one of the main things mentioned in the Julia Performance Tips.

How to find the index of the last maximum in julialang?

I have an array that contains repeated nonnegative integers, e.g., A=[5,5,5,0,1,1,0,0,0,3,3,0,0]. I would like to find the position of the last maximum in A. That is the largest index i such that A[i]>=A[j] for all j. In my example, i=3.
I tried to find the indices of all maximum of A then find the maximum of these indices:
A = [5,5,5,0,1,1,0,0,0,3,3,0,0];
Amax = maximum(A);
i = maximum(find(x -> x == Amax, A));
Is there any better way?
length(A) - indmax(#view A[end:-1:1]) + 1
should be pretty fast, but I didn't benchmark it.
EDIT: I should note that by definition #crstnbr 's solution (to write the algorithm from scratch) is faster (how much faster is shown in Xiaodai's response). This is an attempt to do it using julia's inbuilt array functions.
What about findlast(A.==maximum(A)) (which of course is conceptually similar to your approach)?
The fastest thing would probably be explicit loop implementation like this:
function lastindmax(x)
k = 1
m = x[1]
#inbounds for i in eachindex(x)
if x[i]>=m
k = i
m = x[i]
end
end
return k
end
I tried #Michael's solution and #crstnbr's solution and I found the latter much faster
a = rand(Int8(1):Int8(5),1_000_000_000)
#time length(a) - indmax(#view a[end:-1:1]) + 1 # 19 seconds
#time length(a) - indmax(#view a[end:-1:1]) + 1 # 18 seconds
function lastindmax(x)
k = 1
m = x[1]
#inbounds for i in eachindex(x)
if x[i]>=m
k = i
m = x[i]
end
end
return k
end
#time lastindmax(a) # 3 seconds
#time lastindmax(a) # 2.8 seconds
Michael's solution doesn't support Strings (ERROR: MethodError: no method matching view(::String, ::StepRange{Int64,Int64})) or sequences so I add another solution:
julia> lastimax(x) = maximum((j,i) for (i,j) in enumerate(x))[2]
julia> A="abžcdž"; lastimax(A) # unicode is OK
6
julia> lastimax(i^2 for i in -10:7)
1
If you more like don't catch exception for empty Sequence:
julia> lastimax(x) = !isempty(x) ? maximum((j,i) for (i,j) in enumerate(x))[2] : 0;
julia> lastimax(i for i in 1:3 if i>4)
0
Simple(!) benchmarks:
This is up to 10 times slower than Michael's solution for Float64:
julia> mlastimax(A) = length(A) - indmax(#view A[end:-1:1]) + 1;
julia> julia> A = rand(Float64, 1_000_000); #time lastimax(A); #time mlastimax(A)
0.166389 seconds (4.00 M allocations: 91.553 MiB, 4.63% gc time)
0.019560 seconds (6 allocations: 240 bytes)
80346
(I am surprised) it is 2 times faster for Int64!
julia> A = rand(Int64, 1_000_000); #time lastimax(A); #time mlastimax(A)
0.015453 seconds (10 allocations: 304 bytes)
0.031197 seconds (6 allocations: 240 bytes)
423400
it is 2-3 times slower for Strings
julia> A = ["A$i" for i in 1:1_000_000]; #time lastimax(A); #time mlastimax(A)
0.175117 seconds (2.00 M allocations: 61.035 MiB, 41.29% gc time)
0.077098 seconds (7 allocations: 272 bytes)
999999
EDIT2:
#crstnbr solution is faster and works with Strings too (doesn't work with generators). There difference between lastindmax and lastimax - first return byte index, second return character index:
julia> S = "1š3456789ž"
julia> length(S)
10
julia> lastindmax(S) # return value is bigger than length
11
julia> lastimax(S) # return character index (which is not byte index to String) of last max character
10
julia> S[chr2ind(S, lastimax(S))]
'ž': Unicode U+017e (category Ll: Letter, lowercase)
julia> S[chr2ind(S, lastimax(S))]==S[lastindmax(S)]
true

Resources