Order of linear algebra operations in Julia

If I have a command y = A*B*x where A and B are large matrices and x and y are vectors, will Julia perform y = ((A*B)*x) or y = (A*(B*x))?
The second option should be the best as it only has to allocate an extra vector rather than a large matrix.

The best way to verify this kind of thing is to dump the lowered code via the @code_lowered macro:
julia> @code_lowered A * B * x
return (Core._apply)(Base.afoldl, (Core.tuple)(Base.*, (a * b) * c), xs)
Like many other languages, Julia does y = (A*B)*x instead of y = A*(B*x), so it's up to you to explicitly use parens to reduce the allocation.
julia> using BenchmarkTools
julia> @btime $A * ($B * $x);
6.800 μs (2 allocations: 1.75 KiB)
julia> @btime $A * $B * $x;
45.453 μs (3 allocations: 79.08 KiB)
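To see why the parenthesised form is cheaper, it helps to compare the intermediates directly. This is a minimal sketch, assuming square n×n matrices; the size n = 200 is an arbitrary choice for illustration and not taken from the benchmark above.

```julia
n = 200
A = rand(n, n); B = rand(n, n); x = rand(n)

y_left  = (A * B) * x   # forms an n×n intermediate matrix first
y_right = A * (B * x)   # only ever forms length-n vectors

# Both orderings give the same result up to floating-point rounding
println(y_left ≈ y_right)

# Size in bytes of the intermediate each ordering has to allocate
println(sizeof(A * B))  # n^2 * 8 bytes
println(sizeof(B * x))  # n * 8 bytes
```

Besides the allocation, the matrix-matrix product costs on the order of n^3 operations, while two matrix-vector products cost on the order of n^2 each.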


Generic maximum/minimum function for complex numbers

In julia, one can find (supposedly) efficient implementations of the min/minimum and max/maximum over collections of real numbers.
As these concepts are not uniquely defined for complex numbers, I was wondering if a parametrized version of these functions was already implemented somewhere.
I am currently sorting the elements of the array of interest and then taking the last value, which is, as far as I know, much more costly than finding the value with the maximum absolute value (or some other key).
This is mostly to reproduce the Matlab behavior of the max function over complex arrays.
Here is my current code
a = rand(ComplexF64,4)
b = sort(a,by = (x) -> abs(x))
c = b[end]
The probable function call would look like
c = maximum/minimum(a,by=real/imag/abs/phase)
EDIT Some performance tests in Julia 1.5.3 with the provided solutions
function maxby0(f, iter)
    b = sort(iter, by = (x) -> f(x))
    c = b[end]
end
function maxby1(f, iter)
    reduce(iter) do x, y
        f(x) > f(y) ? x : y
    end
end
function maxby2(f, iter; default = zero(eltype(iter)))
    isempty(iter) && return default
    res, rest = Iterators.peel(iter)
    fa = f(res)
    for x in rest
        fx = f(x)
        if fx > fa
            res = x
            fa = fx
        end
    end
    return res
end
compmax(CArray) = CArray[ (abs.(CArray) .== maximum(abs.(CArray))) .& (angle.(CArray) .== maximum( angle.(CArray))) ][1]
Main.isless(u1::ComplexF64, u2::ComplexF64) = abs2(u1) < abs2(u2)
function maxby5(arr)
    arr_max = arr[argmax(map(abs, arr))]
end
a = rand(ComplexF64,10)
using BenchmarkTools
@btime maxby0(abs,$a)
@btime maxby1(abs,$a)
@btime maxby2(abs,$a)
@btime compmax($a)
@btime maximum($a)
@btime maxby5($a)
Output for a vector of length 10:
>841.653 ns (1 allocation: 240 bytes)
>214.797 ns (0 allocations: 0 bytes)
>118.961 ns (0 allocations: 0 bytes)
>Execution fails
>20.340 ns (0 allocations: 0 bytes)
>144.444 ns (1 allocation: 160 bytes)
Output for a vector of length 1000:
>315.100 μs (1 allocation: 15.75 KiB)
>25.299 μs (0 allocations: 0 bytes)
>12.899 μs (0 allocations: 0 bytes)
>Execution fails
>1.520 μs (0 allocations: 0 bytes)
>14.199 μs (1 allocation: 7.94 KiB)
Output for a vector of length 1000 (with all comparisons made with abs2):
>35.399 μs (1 allocation: 15.75 KiB)
>3.075 μs (0 allocations: 0 bytes)
>1.460 μs (0 allocations: 0 bytes)
>Execution fails
>1.520 μs (0 allocations: 0 bytes)
>2.211 μs (1 allocation: 7.94 KiB)
Some remarks :
Sorting clearly (and as expected) slows the operations
Using abs2 saves a lot of performance (expected as well)
To conclude :
As a built-in function will provide this in 1.7, I will avoid the additional Main.isless method, even though it is, all things considered, the best-performing one, so as not to modify the core of my Julia
The maxby1 and maxby2 allocate nothing
The maxby1 feels more readable
The winner is therefore Andrej Oskin!
EDIT n°2 a new benchmark using the corrected compmax implementation
julia> @btime maxby0(abs2,$a)
36.799 μs (1 allocation: 15.75 KiB)
julia> @btime maxby1(abs2,$a)
3.062 μs (0 allocations: 0 bytes)
julia> @btime maxby2(abs2,$a)
1.160 μs (0 allocations: 0 bytes)
julia> @btime compmax($a)
26.899 μs (9 allocations: 12.77 KiB)
julia> @btime maximum($a)
1.520 μs (0 allocations: 0 bytes)
julia> @btime maxby5(abs2,$a)
2.500 μs (4 allocations: 8.00 KiB)
In Julia 1.7 you can use argmax
julia> a = rand(ComplexF64,4)
4-element Vector{ComplexF64}:
0.3443509906876845 + 0.49984979589871426im
0.1658370274750809 + 0.47815764287341156im
0.4084798173736195 + 0.9688268736875587im
0.47476987432458806 + 0.13651720575229853im
julia> argmax(abs2, a)
0.4084798173736195 + 0.9688268736875587im
Since it will take some time to get to 1.7, you can use the following analog
maxby(f, iter) = reduce(iter) do x, y
    f(x) > f(y) ? x : y
end
julia> maxby(abs2, a)
0.4084798173736195 + 0.9688268736875587im
UPD: a slightly more efficient way to find such a maximum is to use something like
function maxby(f, iter; default = zero(eltype(iter)))
    isempty(iter) && return default
    res, rest = Iterators.peel(iter)
    fa = f(res)
    for x in rest
        fx = f(x)
        if fx > fa
            res = x
            fa = fx
        end
    end
    return res
end
According to octave's documentation (which presumably mimics matlab's behaviour):
For complex arguments, the magnitude of the elements are used for
comparison. If the magnitudes are identical, then the results are
ordered by phase angle in the range (-pi, pi]. Hence,
max ([-1 i 1 -i])
=> -1
because all entries have magnitude 1, but -1 has the largest phase
angle with value pi.
Therefore, if you'd like to mimic matlab/octave functionality exactly, then based on this logic, I'd construct a 'max' function for complex numbers as:
function compmax( CArray )
    Absmax = CArray[ abs.(CArray) .== maximum( abs.(CArray)) ]
    Totalmax = Absmax[ angle.(Absmax) .== maximum(angle.(Absmax)) ]
    return Totalmax[1]
end
(adding appropriate typing as desired).
Nums0 = [ 1, 2, 3 + 4im, 3 - 4im, 5 ]; compmax( Nums0 )
# 1-element Array{Complex{Int64},1}:
# 3 + 4im
Nums1 = [ -1, im, 1, -im ]; compmax( Nums1 )
# 1-element Array{Complex{Int64},1}:
# -1 + 0im
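The same ordering rule (magnitude first, phase angle as tie-breaker) can also be sketched in a single pass by comparing tuples, which Julia orders lexicographically. The name compmax_lex is hypothetical and not part of the answer above, and the one-argument-function form of argmax assumes Julia 1.7 or later.

```julia
# Hypothetical helper: pick the element whose (magnitude, phase) tuple is
# largest. Tuples compare lexicographically, so the phase only matters
# when the magnitudes are equal. Assumes Julia 1.7+ for argmax(f, itr).
compmax_lex(a) = argmax(z -> (abs(z), angle(z)), a)

println(compmax_lex([-1, im, 1, -im]))             # -1 + 0im
println(compmax_lex([1, 2, 3 + 4im, 3 - 4im, 5]))  # 3 + 4im
```

This avoids the temporary arrays created by the broadcasted comparisons, at the cost of evaluating angle for every element.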
If this were code for my own computations, I would make my life much simpler with:
julia> Main.isless(u1::ComplexF64, u2::ComplexF64) = abs2(u1) < abs2(u2)
julia> maximum(rand(ComplexF64, 10))
0.9876138798492835 + 0.9267321874614858im
This adds a new implementation for an existing method in Main. Therefore it is not an elegant idea for library code, but it will get you where you need to be with the least effort.
The "size" of a complex number is determined by the size of its modulus. You can use abs for that. Or get 1.7 as Andrej Oskin said.
julia> arr = rand(ComplexF64, 10)
10-element Array{Complex{Float64},1}:
0.12749588414783353 + 0.09918182087026373im
0.7486501790575264 + 0.5577981676269863im
0.9399200789666509 + 0.28339836191094747im
0.9695470502095325 + 0.9978696209350371im
0.6599207157942191 + 0.0999992072342546im
0.30059521996405425 + 0.6840859625686171im
0.22746651600614132 + 0.33739559003514885im
0.9212471084010287 + 0.2590484924393446im
0.74848598947588 + 0.41129443181449554im
0.8304447441317468 + 0.8014240389454632im
julia> arr_max = arr[argmax(map(abs, arr))]
0.9695470502095325 + 0.9978696209350371im
julia> arr_min = arr[argmin(map(abs, arr))]
0.12749588414783353 + 0.09918182087026373im

julia multiplication of two arrays

Is there a way to speed up, or write more elegantly, this array multiplication (which, with numpy arrays, I would write as A*B)?
using LinearAlgebra  # provides the ⋅ (dot) operator
A = rand(8,15,10)
B = rand(10,5)
C = zeros(8,15,5)
for i in 1:8
    for j in 1:15
        for k in 1:10
            for l in 1:5
                C[i,j,l] = A[i,j,:]⋅B[:,l]
            end
        end
    end
end
There are a bunch of Julia packages which allow you to write your contraction in one simple line. Here are a few examples based on Einsum.jl, OMEinsum.jl, and TensorOperations.jl:
using OMEinsum
f_omeinsum(A,B) = ein"ijk,km->ijm"(A,B)
using Einsum
f_einsum(A,B) = @einsum C[i,j,l] := A[i,j,k] * B[k,l]
using TensorOperations
f_tensor(A,B) = @tensor C[i,j,l] := A[i,j,k] * B[k,l]
Apart from these elegant (and fast, see below) versions, you can improve your loop code quite a bit. Here is your code, wrapped into a function, and an improved version with comments:
function f(A,B)
    C = zeros(8,15,5)
    for i in 1:8
        for j in 1:15
            for k in 1:10
                for l in 1:5
                    C[i,j,l] = A[i,j,:]⋅B[:,l]
                end
            end
        end
    end
    return C
end
function f_fast(A,B)
    # check bounds
    n1,n2,n3 = size(A)
    m1, m2 = size(B)
    @assert m1 == n3
    C = zeros(n1,n2,m2)
    # * @inbounds to skip boundchecks inside the loop
    # * different order of the loops to account for Julia's column major order
    # * written out the k-loop (dot product) explicitly to avoid temporary allocations
    @inbounds for l in 1:m2
        for k in 1:m1
            for j in 1:n2
                for i in 1:n1
                    C[i,j,l] += A[i,j,k]*B[k,l]
                end
            end
        end
    end
    return C
end
Let's compare all approaches. First we check for correctness:
using Test
@test f(A,B) ≈ f_omeinsum(A,B) # Test passed
@test f(A,B) ≈ f_einsum(A,B) # Test passed
@test f(A,B) ≈ f_tensor(A,B) # Test passed
@test f(A,B) ≈ f_fast(A,B) # Test passed
Now, let's benchmark using BenchmarkTools.jl. I put the timings on my machine as comments.
using BenchmarkTools
@btime f($A,$B); # 663.500 μs (12001 allocations: 1.84 MiB)
@btime f_omeinsum($A,$B); # 33.799 μs (242 allocations: 20.20 KiB)
@btime f_einsum($A,$B); # 4.200 μs (1 allocation: 4.81 KiB)
@btime f_tensor($A,$B); # 2.367 μs (3 allocations: 4.94 KiB)
@btime f_fast($A,$B); # 7.375 μs (1 allocation: 4.81 KiB)
As we can see, all the einsum/tensor notation based approaches are much faster than your original loop implementation, and they are only one-liners! The performance of our f_fast is in the same ballpark but still quite a bit behind f_tensor, which is the fastest.
Finally, let's go all out for performance, because we can. Utilizing the wizardry from LoopVectorization.jl, we replace the @inbounds in f_fast with @avx (we call this new version f_avx below) and automagically get another 2x speed up relative to the f_tensor performance above:
@test f(A,B) ≈ f_avx(A,B) # Test passed
@btime f_avx($A,$B); # 930.769 ns (1 allocation: 4.81 KiB)
However, because of its simplicity I'd still prefer f_tensor unless every microsecond counts in your application.
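If adding a package is not an option, the same contraction can also be expressed with Base alone, by viewing A as a matrix and using an ordinary matrix product. This is only a sketch: the name f_reshape is hypothetical, it is not one of the versions benchmarked above, and I have not timed it against them.

```julia
# Hypothetical helper using only Base. A is n1×n2×n3 and B is n3×m2.
function f_reshape(A, B)
    n1, n2, n3 = size(A)
    m1, m2 = size(B)
    @assert m1 == n3
    # View A as an (n1*n2)×n3 matrix, multiply, then restore the 3-D shape.
    # Column-major storage makes row i + (j-1)*n1 correspond to A[i,j,:].
    return reshape(reshape(A, n1 * n2, n3) * B, n1, n2, m2)
end

A = rand(8, 15, 10)
B = rand(10, 5)
C = f_reshape(A, B)
println(size(C))  # (8, 15, 5)
```

The matrix product is handed to the BLAS routine behind *, so the only extra work is the two reshape calls.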

List of vectors slower in Julia than R?

I tried to speed up an R function by porting it to Julia, but to my surprise Julia was slower. The function sequentially updates a list of vectors (array of arrays in Julia). Beforehand the index of the list element to be updated is unknown and the length of the new vector is unknown.
I have written a test function that demonstrates the behavior.
# Julia
function MyTest(n)
    a = [[0.0] for i in 1:n]
    for i in 1:n
        a[i] = cumsum(ones(i))
    end
end
# R
MyTest <- function(n){
    a <- as.list(rep(0, n))
    for (i in 1:n)
        a[[i]] <- cumsum(rep(1, i))
}
By setting n to 5000, 10000 and 20000, typical computing times are (median of 21 tests):
R: 0.14, 0.45, and 1.28 seconds
Julia: 0.31, 3.38, and 27.03 seconds
I used a windows-laptop with 64 bit Julia-1.3.1 and 64 bit R-3.6.1.
Both these functions use 64 bit floating-point types. My real problem involves integers and then R is even more favorable. But integer comparison isn’t fair since R uses 32 bit integers and Julia 64 bit.
Is there something I can do to speed up Julia, or is Julia really much slower than R in this case?
I don't quite see how you get your test results. Assuming you want 32 bit integers, as you said, then we have
julia> function mytest(n)
           a = Vector{Vector{Int32}}(undef, n)
           for i in 1:n
               a[i] = cumsum(ones(i))
           end
           return a
       end
mytest (generic function with 1 method)
julia> @btime mytest(20000);
1.108 s (111810 allocations: 3.73 GiB)
When we only get rid of those allocations, we already get down to the following:
julia> function mytest(n)
           a = Vector{Vector{Int32}}(undef, n)
           @inbounds for i in 1:n
               a[i] = collect(UnitRange{Int32}(1, i))
           end
           return a
       end
mytest (generic function with 1 method)
julia> @btime mytest(20000);
115.702 ms (35906 allocations: 765.40 MiB)
Further devectorization does not even help:
julia> function mytest(n)
           a = Vector{Vector{Int32}}(undef, n)
           @inbounds for i in 1:n
               v = Vector{Int32}(undef, i)
               v[1] = 1
               @inbounds for j = 2:i
                   v[j] = v[j-1] + 1
               end
               a[i] = v
           end
           return a
       end
mytest (generic function with 1 method)
julia> @btime mytest(20000);
188.856 ms (35906 allocations: 765.40 MiB)
But with a couple of threads (I assume the inner arrays are independent), we get 2x speed-up again:
julia> Threads.nthreads()
julia> function mytest(n)
           a = Vector{Vector{Int32}}(undef, n)
           Threads.@threads for i in 1:n
               v = Vector{Int32}(undef, i)
               v[1] = 1
               @inbounds for j = 2:i
                   v[j] = v[j-1] + 1
               end
               a[i] = v
           end
           return a
       end
mytest (generic function with 1 method)
julia> @btime mytest(20000);
99.718 ms (35891 allocations: 763.13 MiB)
But this is only about as fast as the second variant above.
That is, for the specific case of cumsum. Other inner functions are slower, of course, but can be equally threaded, and optimized in the same ways, with possibly different results.
(This is on Julia 1.2, 12 GiB RAM, and an older i7.)
Perhaps R is doing some type of buffering for such simple functions?
Here is the Julia version with buffering:
using Memoize
@memoize function cumsum_ones(i)
    cumsum(ones(i))
end
function MyTest2(n)
    a = Vector{Vector{Float64}}(undef, n)
    for i in 1:n
        a[i] = cumsum_ones(i)
    end
end
In a warmed-up function, the timings look like the following:
julia> @btime MyTest2(5000);
442.500 μs (10002 allocations: 195.39 KiB)
julia> @btime MyTest2(10000);
939.499 μs (20002 allocations: 390.70 KiB)
julia> @btime MyTest2(20000);
3.554 ms (40002 allocations: 781.33 KiB)

How to efficiently initialize huge sparse arrays in Julia?

There are two ways one can initialize an N×N sparse matrix whose entries are to be read from one or multiple text files. Which one is faster? I need the more efficient one, as N is large, typically 10^6.
1). I could store the (x,y) indices in arrays x, y, the entries in an array v and declare
K = sparse(x,y,v);
2). I could declare
K = spzeros(N,N)
then read off the (i,j) coordinates and values v and insert them as
K[i,j] = v;
as they are being read.
I found no tips about this on Julia’s page on sparse arrays.
Don’t insert values one by one: that will be tremendously inefficient since the storage in the sparse matrix needs to be reallocated over and over again.
You can also use BenchmarkTools.jl to verify this:
julia> using SparseArrays
julia> using BenchmarkTools
julia> I = rand(1:1000, 1000); J = rand(1:1000, 1000); X = rand(1000);
julia> function fill_spzeros(I, J, X)
           x = spzeros(1000, 1000)
           @assert axes(I) == axes(J) == axes(X)
           @inbounds for i in eachindex(I)
               x[I[i], J[i]] = X[i]
           end
       end
fill_spzeros (generic function with 1 method)
julia> @btime sparse($I, $J, $X);
10.713 μs (12 allocations: 55.80 KiB)
julia> @btime fill_spzeros($I, $J, $X);
96.068 μs (22 allocations: 40.83 KiB)
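Applied to the question's setting, the advice amounts to collecting the coordinates and values while reading and calling sparse once at the end. This is a minimal sketch, assuming each line of the text file holds "i j v" separated by whitespace; the function name read_sparse and that file layout are assumptions for illustration, and an IOBuffer stands in for the file.

```julia
using SparseArrays

# Hypothetical reader: accumulate triplets, then build the matrix in one go.
function read_sparse(io::IO, N::Integer)
    rows = Int[]; cols = Int[]; vals = Float64[]
    for line in eachline(io)
        isempty(strip(line)) && continue   # skip blank lines
        i, j, v = split(line)              # assumed layout: "i j v"
        push!(rows, parse(Int, i))
        push!(cols, parse(Int, j))
        push!(vals, parse(Float64, v))
    end
    # Passing N, N fixes the size even if the last rows/columns are empty
    return sparse(rows, cols, vals, N, N)
end

K = read_sparse(IOBuffer("1 1 2.5\n3 2 -1.0\n4 4 7.0\n"), 4)
println(nnz(K))  # 3
```

Note that sparse adds up the values of repeated (i,j) pairs by default, which differs from the overwrite semantics of K[i,j] = v.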

Convert type Array{Union{Missing, Float64},1} to Array{Float64,1} in Julia

I have an array of floats having some missing values, hence its type is Array{Union{Missing, Float64},1}. Is there a command to convert the non-missing part into Array{Float64,1}?
Here are three solutions, in order of preference (thanks to @BogumilKaminski for the first one):
f1(x) = collect(skipmissing(x))
f2(x) = Float64[ a for a in x if !ismissing(a) ]
f3(x) = x[.!ismissing.(x)]
f1 wraps the array in a lazy skipmissing iterator (useful for e.g. iteration) and then builds the array via collect.
f2 uses a comprehension with a filter but is likely to be slower than f1 since the final array length is not computed ahead of time.
f3 uses broadcasting, and allocates temporaries in the process, and so is likely to be the slowest of the three.
We can verify the above with a simple benchmark:
using BenchmarkTools
x = Array{Union{Missing,Float64}}(undef, 100);
inds = unique(rand(1:100, 50));
x[inds] = randn(length(inds));
@btime f1($x);
@btime f2($x);
@btime f3($x);
Resulting in:
julia> @btime f1($x);
377.186 ns (7 allocations: 1.22 KiB)
julia> @btime f2($x);
471.204 ns (8 allocations: 1.23 KiB)
julia> @btime f3($x);
732.726 ns (6 allocations: 4.80 KiB)
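For a quick check of what the preferred solution returns, here is a small sketch on a hand-written vector (the values are made up for illustration):

```julia
x = Union{Missing,Float64}[1.0, missing, 2.5, missing, 4.0]

# collect materialises the lazy skipmissing iterator into a plain array
y = collect(skipmissing(x))
println(y)          # [1.0, 2.5, 4.0]
println(eltype(y))  # Float64

# When only a reduction is needed, the iterator can be used directly
println(sum(skipmissing(x)))  # 7.5
```

The element type of the result no longer includes Missing, so y can be passed to functions that require an array of Float64.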
