Dot syntax causes 200x slowdown with vector addition - julia

I was experimenting with the speed of vector addition and component-wise exponentiation, when I came across a strange result with the dot vectorization syntax.
The non-vectorized version,
julia> @time exp(randn(1000) + randn(1000))
takes about 0.001 seconds after a few runs. It also gives a deprecation warning as of 0.6.
If I vectorize the exponential function,
julia> @time exp.(randn(1000) + randn(1000))
I get a 4x speedup, to around 0.00025 seconds.
However, if I vectorize both the exponential function and addition of the vectors,
julia> @time exp.(randn(1000) .+ randn(1000))
I get a large slowdown to around 0.05 seconds. Why does this occur? When should the dot syntax be avoided to maximize performance?

.+ creates an anonymous function. In the REPL, this function is created every time and will blow up your timing results. In addition, the use of globals (dynamically typed, i.e. uninferrable) slows down all of your examples. In any real case your code will be in a function, and a function is compiled only the first time it is called. Example:
> x = randn(1000); y = randn(1000);
> @time exp(x + y);
WARNING: exp(x::AbstractArray{T}) where T <: Number is deprecated, use exp.(x) instead.
Stacktrace:
[1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
[2] exp(::Array{Float64,1}) at .\deprecated.jl:57
[3] eval(::Module, ::Any) at .\boot.jl:235
[4] eval_user_input(::Any, ::Base.REPL.REPLBackend) at .\REPL.jl:66
[5] macro expansion at C:\Users\Chris\.julia\v0.6\Revise\src\Revise.jl:775 [inlined]
[6] (::Revise.##17#18{Base.REPL.REPLBackend})() at .\event.jl:73
while loading no file, in expression starting on line 237
0.620712 seconds (290.34 k allocations: 15.150 MiB)
> @time exp(x + y);
0.023072 seconds (27.09 k allocations: 1.417 MiB)
> @time exp(x + y);
0.000334 seconds (95 allocations: 27.938 KiB)
>
> @time exp.(x .+ y);
1.764459 seconds (735.52 k allocations: 39.169 MiB, 0.80% gc time)
> @time exp.(x .+ y);
0.017914 seconds (5.92 k allocations: 328.978 KiB)
> @time exp.(x .+ y);
0.017853 seconds (5.92 k allocations: 328.509 KiB)
>
> f(x,y) = exp.(x .+ y);
> @time f(x,y);
0.022357 seconds (21.59 k allocations: 959.157 KiB)
> @time f(x,y);
0.000020 seconds (5 allocations: 8.094 KiB)
> @time f(x,y);
0.000021 seconds (5 allocations: 8.094 KiB)
Notice that by putting it into a function it compiles and optimizes. This is one of the main things mentioned in the Julia Performance Tips.
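As a minimal sketch of the same idea (the names fused and fused! are illustrative and not from the answer above), wrapping the dotted expression in a function lets Julia compile the fused loop once, and an in-place form can reuse a preallocated output:

```julia
# Fused broadcast inside a function: exp.(x .+ y) runs as a single loop
# with no temporary array for x .+ y.
fused(x, y) = exp.(x .+ y)

# In-place variant (illustrative): writes the result into a preallocated `out`.
function fused!(out, x, y)
    out .= exp.(x .+ y)
    return out
end

x = randn(1000); y = randn(1000)
out = similar(x)

fused(x, y)                      # first call compiles; time the later calls
fused!(out, x, y)

t = @elapsed fused!(out, x, y)   # seconds for one already-compiled call
```

Timing only the later calls keeps compilation out of the measurement, which is the effect the f(x,y) timings above show.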

Related

Optimize looping over a large string to reduce allocations

I am trying to loop over a string in Julia to parse it. I have a DefaultDict inside a struct, containing the number of times I have seen a particular character.
@with_kw mutable struct Metrics
    ...
    nucleotides = DefaultDict{Char, Int64}(0)
    ...
end
I have written a function to loop over a string and increment the value of each character in the DefaultDict.
function compute_base_composition(sequence::String, metrics::Metrics)
    for i in 1:sizeof(sequence)
        metrics.nucleotides[sequence[i]] += 1
    end
end
This function is called in a for loop because I need to do this for multiple strings (which can be up to 2 billion characters long). When I run the @time macro, I get this result:
@time compute_base_composition(sequence, metrics)
0.167172 seconds (606.20 k allocations: 15.559 MiB, 78.00% compilation time)
0.099403 seconds (1.63 M allocations: 24.816 MiB)
0.032346 seconds (633.24 k allocations: 9.663 MiB)
0.171382 seconds (3.06 M allocations: 46.751 MiB, 4.64% gc time)
As you can see, there are a lot of memory allocations for such a simple function. I have tried to change the for loop to something like for c in sequence but that didn't change much. Would there be a way to reduce them and make the function faster?
Work on bytes, not on Unicode chars
Use Vectors not Dicts
Avoid untyped fields in containers
@with_kw struct MetricsB
    nucleotides::Vector{Int}=zeros(Int, 256)
end
function compute_base_composition(sequence::String, metrics::MetricsB)
    bs = Vector{UInt8}(sequence)
    for i in 1:length(bs)
        @inbounds metrics.nucleotides[bs[i]] += 1
    end
end
And a benchmark with a nice speedup of 90x:
julia> st = randstring(10_000_000);
julia> @time compute_base_composition(st, Metrics())
1.793991 seconds (19.94 M allocations: 304.213 MiB, 3.33% gc time)
julia> @time compute_base_composition(st, MetricsB())
0.019398 seconds (3 allocations: 9.539 MiB)
Actually you can almost totally avoid allocations with the following code:
function compute_base_composition2(sequence::String, metrics::MetricsB)
    pp = pointer(sequence)
    for i in 1:length(sequence)
        @inbounds metrics.nucleotides[Base.pointerref(pp, i, 1)] += 1
    end
end
and now:
julia> @time compute_base_composition2(st, MetricsB())
0.021161 seconds (1 allocation: 2.125 KiB)
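The versions above index the counts vector by the raw byte value, so a zero byte would address index 0. A hedged alternative, assuming a plain counts vector instead of the MetricsB struct (count_bytes! is a hypothetical helper name), iterates the string's code units directly and shifts the index by one:

```julia
# Count byte occurrences without copying the string.
# counts[b + 1] holds how often byte value b (0-255) was seen; the +1 offset
# is needed because Julia arrays are 1-based while byte values start at 0.
function count_bytes!(counts::Vector{Int}, sequence::String)
    for b in codeunits(sequence)   # byte view of the string, no copy
        counts[b + 1] += 1
    end
    return counts
end

counts = count_bytes!(zeros(Int, 256), "ACGTACGGA")
counts[UInt8('A') + 1]   # number of 'A' bytes in the input
```

Bounds checks are left on here; whether dropping them with @inbounds is worth it depends on how much the input can be trusted.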

Generic maximum/minimum function for complex numbers

In julia, one can find (supposedly) efficient implementations of the min/minimum and max/maximum over collections of real numbers.
As these concepts are not uniquely defined for complex numbers, I was wondering if a parametrized version of these functions was already implemented somewhere.
I am currently sorting elements of the array of interest, then taking the last value, which is as far as I know, much more costly than finding the value with the maximum absolute value (or something else).
This is mostly to reproduce the Matlab behavior of the max function over complex arrays.
Here is my current code
a = rand(ComplexF64,4)
b = sort(a,by = (x) -> abs(x))
c = b[end]
The probable function call would look like
c = maximum/minimum(a,by=real/imag/abs/phase)
EDIT Some performance tests in Julia 1.5.3 with the provided solutions
function maxby0(f,iter)
    b = sort(iter,by = (x) -> f(x))
    c = b[end]
end
function maxby1(f, iter)
    reduce(iter) do x, y
        f(x) > f(y) ? x : y
    end
end
function maxby2(f, iter; default = zero(eltype(iter)))
    isempty(iter) && return default
    res, rest = Iterators.peel(iter)
    fa = f(res)
    for x in rest
        fx = f(x)
        if fx > fa
            res = x
            fa = fx
        end
    end
    return res
end
compmax(CArray) = CArray[ (abs.(CArray) .== maximum(abs.(CArray))) .& (angle.(CArray) .== maximum( angle.(CArray))) ][1]
Main.isless(u1::ComplexF64, u2::ComplexF64) = abs2(u1) < abs2(u2)
function maxby5(arr)
    arr_max = arr[argmax(map(abs, arr))]
end
a = rand(ComplexF64,10)
using BenchmarkTools
@btime maxby0(abs,$a)
@btime maxby1(abs,$a)
@btime maxby2(abs,$a)
@btime compmax($a)
@btime maximum($a)
@btime maxby5($a)
Output for a vector of length 10:
>841.653 ns (1 allocation: 240 bytes)
>214.797 ns (0 allocations: 0 bytes)
>118.961 ns (0 allocations: 0 bytes)
>Execution fails
>20.340 ns (0 allocations: 0 bytes)
>144.444 ns (1 allocation: 160 bytes)
Output for a vector of length 1000:
>315.100 μs (1 allocation: 15.75 KiB)
>25.299 μs (0 allocations: 0 bytes)
>12.899 μs (0 allocations: 0 bytes)
>Execution fails
>1.520 μs (0 allocations: 0 bytes)
>14.199 μs (1 allocation: 7.94 KiB)
Output for a vector of length 1000 (with all comparisons made with abs2):
>35.399 μs (1 allocation: 15.75 KiB)
>3.075 μs (0 allocations: 0 bytes)
>1.460 μs (0 allocations: 0 bytes)
>Execution fails
>1.520 μs (0 allocations: 0 bytes)
>2.211 μs (1 allocation: 7.94 KiB)
Some remarks:
Sorting clearly (and as expected) slows the operations
Using abs2 saves a lot of time (expected as well)
To conclude:
As a built-in function will provide this in 1.7, I will avoid the additional Main.isless method, even though it is, all things considered, the best-performing one, so as not to modify the core of my Julia
maxby1 and maxby2 allocate nothing
maxby1 feels more readable
The winner is therefore Andrej Oskin!
EDIT n°2 a new benchmark using the corrected compmax implementation
julia> @btime maxby0(abs2,$a)
36.799 μs (1 allocation: 15.75 KiB)
julia> @btime maxby1(abs2,$a)
3.062 μs (0 allocations: 0 bytes)
julia> @btime maxby2(abs2,$a)
1.160 μs (0 allocations: 0 bytes)
julia> @btime compmax($a)
26.899 μs (9 allocations: 12.77 KiB)
julia> @btime maximum($a)
1.520 μs (0 allocations: 0 bytes)
julia> @btime maxby5(abs2,$a)
2.500 μs (4 allocations: 8.00 KiB)
In Julia 1.7 you can use argmax
julia> a = rand(ComplexF64,4)
4-element Vector{ComplexF64}:
0.3443509906876845 + 0.49984979589871426im
0.1658370274750809 + 0.47815764287341156im
0.4084798173736195 + 0.9688268736875587im
0.47476987432458806 + 0.13651720575229853im
julia> argmax(abs2, a)
0.4084798173736195 + 0.9688268736875587im
Since it will take some time to get to 1.7, you can use the following analog
maxby(f, iter) = reduce(iter) do x, y
    f(x) > f(y) ? x : y
end
julia> maxby(abs2, a)
0.4084798173736195 + 0.9688268736875587im
UPD: a slightly more efficient way to find such a maximum is to use something like
function maxby(f, iter; default = zero(eltype(iter)))
    isempty(iter) && return default
    res, rest = Iterators.peel(iter)
    fa = f(res)
    for x in rest
        fx = f(x)
        if fx > fa
            res = x
            fa = fx
        end
    end
    return res
end
According to octave's documentation (which presumably mimics matlab's behaviour):
For complex arguments, the magnitude of the elements are used for
comparison. If the magnitudes are identical, then the results are
ordered by phase angle in the range (-pi, pi]. Hence,
max ([-1 i 1 -i])
=> -1
because all entries have magnitude 1, but -1 has the largest phase
angle with value pi.
Therefore, if you'd like to mimic matlab/octave functionality exactly, then based on this logic, I'd construct a 'max' function for complex numbers as:
function compmax( CArray )
    Absmax = CArray[ abs.(CArray) .== maximum( abs.(CArray)) ]
    Totalmax = Absmax[ angle.(Absmax) .== maximum(angle.(Absmax)) ]
    return Totalmax[1]
end
(adding appropriate typing as desired).
Examples:
Nums0 = [ 1, 2, 3 + 4im, 3 - 4im, 5 ]; compmax( Nums0 )
# 1-element Array{Complex{Int64},1}:
# 3 + 4im
Nums1 = [ -1, im, 1, -im ]; compmax( Nums1 )
# 1-element Array{Complex{Int64},1}:
# -1 + 0im
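The same ordering (magnitude first, then phase angle) can also be sketched as a single pass that builds no temporary arrays. The names octave_key and octave_max are hypothetical, and this is an illustration of the rule quoted from the Octave documentation rather than a drop-in replacement for compmax:

```julia
# Octave-style key: compare by magnitude first, then by phase angle in (-pi, pi].
octave_key(z) = (abs(z), angle(z))

# Single pass over the collection; tuples compare lexicographically,
# so the angle only matters when the magnitudes are equal.
octave_max(a) = reduce((x, y) -> octave_key(y) > octave_key(x) ? y : x, a)

octave_max([-1, im, 1, -im])              # -1 + 0im, as in the Octave example
octave_max([1, 2, 3 + 4im, 3 - 4im, 5])   # 3 + 4im
```

Like any exact floating-point comparison, the tie-break only triggers when two magnitudes are bitwise equal.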
If this were code for my own computations, I would have made my life much simpler with:
julia> Main.isless(u1::ComplexF64, u2::ComplexF64) = abs2(u1) < abs2(u2)
julia> maximum(rand(ComplexF64, 10))
0.9876138798492835 + 0.9267321874614858im
This adds a new method to an existing function from within Main. For library code it is therefore not an elegant idea, but it will get you where you need to be with the least effort.
The "size" of a complex number is determined by the size of its modulus. You can use abs for that. Or get 1.7 as Andrej Oskin said.
julia> arr = rand(ComplexF64, 10)
10-element Array{Complex{Float64},1}:
0.12749588414783353 + 0.09918182087026373im
0.7486501790575264 + 0.5577981676269863im
0.9399200789666509 + 0.28339836191094747im
0.9695470502095325 + 0.9978696209350371im
0.6599207157942191 + 0.0999992072342546im
0.30059521996405425 + 0.6840859625686171im
0.22746651600614132 + 0.33739559003514885im
0.9212471084010287 + 0.2590484924393446im
0.74848598947588 + 0.41129443181449554im
0.8304447441317468 + 0.8014240389454632im
julia> arr_max = arr[argmax(map(abs, arr))]
0.9695470502095325 + 0.9978696209350371im
julia> arr_min = arr[argmin(map(abs, arr))]
0.12749588414783353 + 0.09918182087026373im

Converting SuiteSparse.SPQR.QRSparseQ to SparseMatrixCSC?

I have this problem that converting the native sparse format for the QR decomposition of a sparse Matrix takes forever. However, I need it in the CSC format to use it for further computations.
using LinearAlgebra, SparseArrays
N = 1000
A = sprand(N,N,1e-4)
@time F = qr(A)
@time F.Q
@time Q_sparse = sparse(F.Q)
0.000420 seconds (1.15 k allocations: 241.017 KiB)
0.000008 seconds (6 allocations: 208 bytes)
6.067351 seconds (2.00 M allocations: 15.140 GiB, 36.25% gc time)
Any suggestions?
Okay, I found the problem. For other people trying to do it:
factors = F.Q.factors
τ = F.Q.τ
Nτ = size(factors)[2]
Isp = sparse(I(N));
@time Q_constr = prod(Isp - factors[:,i]*τ[i]*factors[:,i]' for i in 1:Nτ)
Q_constr ≈ Q_sparse
0.084461 seconds (62.64 k allocations: 3.321 MiB, 18.28% gc time)
true
You see that the method sparse(F.Q) is somehow using the wrong representation. If you construct Q as I did above, it will be considerably faster.
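The construction above multiplies Householder reflectors of the form I - τ_i v_i v_i'. A small dense stand-in on random data shows the formula in isolation. The function name reflector_product and the choice of τ are assumptions made for illustration, and nothing here touches the SPQR internals:

```julia
using LinearAlgebra

# Product of Householder reflectors H_i = I - τ_i * v_i * v_i'.
# V holds the vectors v_i as columns; τ holds the scalars τ_i.
function reflector_product(V::AbstractMatrix, τ::AbstractVector)
    n = size(V, 1)
    Q = Matrix{Float64}(I, n, n)
    for i in 1:size(V, 2)
        v = V[:, i]
        Q -= (Q * v) * (τ[i] * v')   # Q * (I - τ v v') without forming the reflector
    end
    return Q
end

V = randn(5, 3)
τ = [2 / dot(V[:, i], V[:, i]) for i in 1:3]   # assumption: this τ makes each reflector orthogonal
Q = reflector_product(V, τ)
```

Applying each reflector as a rank-one update avoids building an n-by-n matrix per factor, which is where a product of explicit matrices spends most of its memory.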

How to obtain the execution time of a function in Julia?

I want to obtain the execution time of a function in Julia. Here is a minimum working example:
function raise_to(n)
    for i in 1:n
        y = (1/7)^n
    end
end
How to obtain the time it took to execute raise_to(10)?
The recommended way to benchmark a function is to use BenchmarkTools:
julia> function raise_to(n)
           y = (1/7)^n
       end
raise_to (generic function with 1 method)
julia> using BenchmarkTools
julia> @btime raise_to(10)
1.815 ns (0 allocations: 0 bytes)
Note that repeating the computation numerous times (like you did in your example) is a good idea to get more accurate measurements. But BenchmarkTools does it for you.
Also note that BenchmarkTools avoids many pitfalls of merely using @time. Most notably with @time, you're likely to measure compilation time in addition to run time. This is why the first invocation of @time often displays larger times/allocations:
# First invocation: the method gets compiled
# Large resource consumption
julia> @time raise_to(10)
0.007901 seconds (7.70 k allocations: 475.745 KiB)
3.5401331746414338e-9
# Subsequent invocations: stable and low timings
julia> @time raise_to(10)
0.000003 seconds (5 allocations: 176 bytes)
3.5401331746414338e-9
julia> @time raise_to(10)
0.000002 seconds (5 allocations: 176 bytes)
3.5401331746414338e-9
julia> @time raise_to(10)
0.000001 seconds (5 allocations: 176 bytes)
3.5401331746414338e-9
@time
@time works as mentioned in previous answers, but it will include compile time if it is the first time you call the function in your julia session.
https://docs.julialang.org/en/v1/manual/performance-tips/#Measure-performance-with-%5B%40time%5D%28%40ref%29-and-pay-attention-to-memory-allocation-1
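A rough way to keep compilation out of a measurement using only Base is to call the function once before timing it. The helper mean_time below is a hypothetical name for this sketch and is not a library function:

```julia
# Toy workload from the question.
raise_to(n) = (1 / 7)^n

raise_to(10)                  # warm-up call: triggers compilation
t = @elapsed raise_to(10)     # seconds for one already-compiled call

# Crude average over many calls (illustrative only).
function mean_time(f, arg; reps = 10_000)
    f(arg)                    # make sure f is compiled for this argument type
    total = @elapsed for _ in 1:reps
        f(arg)
    end
    return total / reps
end

mean_time(raise_to, 10)
```

For a workload this small the compiler may simplify the loop and timer resolution dominates, so the numbers are only indicative.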
@btime
You can also use @btime if you put using BenchmarkTools in your code.
https://github.com/JuliaCI/BenchmarkTools.jl
This will rerun your function many times after an initial compile run, and then report the minimum time.
julia> using BenchmarkTools
julia> @btime sin(x) setup=(x=rand())
4.361 ns (0 allocations: 0 bytes)
0.49587200950472454
@timeit
Another super useful library for profiling is TimerOutputs.jl
https://github.com/KristofferC/TimerOutputs.jl
using TimerOutputs
# Create the `TimerOutput` named "to" that collects the timings
const to = TimerOutput()
# Time a section code with the label "sleep" to the `TimerOutput` named "to"
@timeit to "sleep" sleep(0.02)
# ... several more calls to @timeit
print_timer(to)
 ──────────────────────────────────────────────────────────────────────
                                Time                   Allocations
                        ──────────────────────   ───────────────────────
    Tot / % measured:        5.09s / 56.0%            106MiB / 74.6%

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────
 sleep             101    1.17s  41.2%  11.6ms   1.48MiB  1.88%  15.0KiB
 nest 2              1    703ms  24.6%   703ms   2.38KiB  0.00%  2.38KiB
   level 2.2         1    402ms  14.1%   402ms      368B  0.00%   368.0B
   level 2.1         1    301ms  10.6%   301ms      368B  0.00%   368.0B
 throwing            1    502ms  17.6%   502ms      384B  0.00%   384.0B
 nest 1              1    396ms  13.9%   396ms   5.11KiB  0.01%  5.11KiB
   level 2.2         1    201ms  7.06%   201ms      368B  0.00%   368.0B
   level 2.1         3   93.5ms  3.28%  31.2ms   1.08KiB  0.00%   368.0B
 randoms             1   77.5ms  2.72%  77.5ms   77.3MiB  98.1%  77.3MiB
 funcdef             1   2.66μs  0.00%  2.66μs         -  0.00%        -
 ──────────────────────────────────────────────────────────────────────
Macros can have begin ... end
As seen in the docs for these macros, they can cover multiple statements or function calls.
@my_macro begin
    statement1
    statement2
    # ...
    statement3
end
Hope that helps.
The @time macro can be used to tell you how long the function took to evaluate. It also reports how much memory was allocated.
julia> function raise_to(n)
           for i in 1:n
               y = (1/7)^n
           end
       end
raise_to (generic function with 1 method)
julia> @time raise_to(10)
0.093018 seconds (26.00 k allocations: 1.461 MiB)
It would be nice to add that if you want to find the run time of a code block, you can do as follows:
@time begin
    # your code
end

How to find the index of the last maximum in julialang?

I have an array that contains repeated nonnegative integers, e.g., A=[5,5,5,0,1,1,0,0,0,3,3,0,0]. I would like to find the position of the last maximum in A. That is the largest index i such that A[i]>=A[j] for all j. In my example, i=3.
I tried to find the indices of all maximum of A then find the maximum of these indices:
A = [5,5,5,0,1,1,0,0,0,3,3,0,0];
Amax = maximum(A);
i = maximum(find(x -> x == Amax, A));
Is there any better way?
length(A) - indmax(@view A[end:-1:1]) + 1
should be pretty fast, but I didn't benchmark it.
EDIT: I should note that by definition @crstnbr's solution (writing the algorithm from scratch) is faster (how much faster is shown in Xiaodai's response). This is an attempt to do it using Julia's built-in array functions.
What about findlast(A.==maximum(A)) (which of course is conceptually similar to your approach)?
The fastest thing would probably be explicit loop implementation like this:
function lastindmax(x)
    k = 1
    m = x[1]
    @inbounds for i in eachindex(x)
        if x[i]>=m
            k = i
            m = x[i]
        end
    end
    return k
end
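The answers in this thread use Julia 0.6 names. On Julia 1.x, indmax is argmax and find is findall, so the three approaches can be sketched as follows (last_argmax is an illustrative name, not a Base function):

```julia
A = [5, 5, 5, 0, 1, 1, 0, 0, 0, 3, 3, 0, 0]

# One-liner: last position whose value equals the maximum (two passes over A).
i1 = findlast(==(maximum(A)), A)

# Reversed-view variant with the Julia 1.x name argmax (formerly indmax).
i2 = length(A) - argmax(@view A[end:-1:1]) + 1

# Single-pass loop; `>=` makes later ties win.
function last_argmax(x)
    k = firstindex(x)
    m = x[k]
    for i in eachindex(x)
        if x[i] >= m
            k, m = i, x[i]
        end
    end
    return k
end

(i1, i2, last_argmax(A))   # (3, 3, 3)
```

The one-liner is the easiest to read, while the loop touches each element only once.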
I tried @Michael's solution and @crstnbr's solution and found the latter much faster:
a = rand(Int8(1):Int8(5),1_000_000_000)
@time length(a) - indmax(@view a[end:-1:1]) + 1 # 19 seconds
@time length(a) - indmax(@view a[end:-1:1]) + 1 # 18 seconds
function lastindmax(x)
    k = 1
    m = x[1]
    @inbounds for i in eachindex(x)
        if x[i]>=m
            k = i
            m = x[i]
        end
    end
    return k
end
@time lastindmax(a) # 3 seconds
@time lastindmax(a) # 2.8 seconds
Michael's solution doesn't support Strings (ERROR: MethodError: no method matching view(::String, ::StepRange{Int64,Int64})) or sequences so I add another solution:
julia> lastimax(x) = maximum((j,i) for (i,j) in enumerate(x))[2]
julia> A="abžcdž"; lastimax(A) # unicode is OK
6
julia> lastimax(i^2 for i in -10:7)
1
If you would rather not get an exception for an empty sequence:
julia> lastimax(x) = !isempty(x) ? maximum((j,i) for (i,j) in enumerate(x))[2] : 0;
julia> lastimax(i for i in 1:3 if i>4)
0
Simple(!) benchmarks:
This is up to 10 times slower than Michael's solution for Float64:
julia> mlastimax(A) = length(A) - indmax(@view A[end:-1:1]) + 1;
julia> A = rand(Float64, 1_000_000); @time lastimax(A); @time mlastimax(A)
0.166389 seconds (4.00 M allocations: 91.553 MiB, 4.63% gc time)
0.019560 seconds (6 allocations: 240 bytes)
80346
(I am surprised) it is 2 times faster for Int64!
julia> A = rand(Int64, 1_000_000); @time lastimax(A); @time mlastimax(A)
0.015453 seconds (10 allocations: 304 bytes)
0.031197 seconds (6 allocations: 240 bytes)
423400
it is 2-3 times slower for Strings
julia> A = ["A$i" for i in 1:1_000_000]; @time lastimax(A); @time mlastimax(A)
0.175117 seconds (2.00 M allocations: 61.035 MiB, 41.29% gc time)
0.077098 seconds (7 allocations: 272 bytes)
999999
EDIT2:
@crstnbr's solution is faster and works with Strings too (it doesn't work with generators). There is a difference between lastindmax and lastimax: the first returns the byte index, the second returns the character index:
julia> S = "1š3456789ž"
julia> length(S)
10
julia> lastindmax(S) # return value is bigger than length
11
julia> lastimax(S) # return character index (which is not byte index to String) of last max character
10
julia> S[chr2ind(S, lastimax(S))]
'ž': Unicode U+017e (category Ll: Letter, lowercase)
julia> S[chr2ind(S, lastimax(S))]==S[lastindmax(S)]
true
