List of vectors slower in Julia than R? - r

I tried to speed up an R function by porting it to Julia, but to my surprise Julia was slower. The function sequentially updates a list of vectors (array of arrays in Julia). Beforehand the index of the list element to be updated is unknown and the length of the new vector is unknown.
I have written a test function that demonstrates the behavior.
Julia
function MyTest(n)
a = [[0.0] for i in 1:n]
for i in 1:n
a[i] = cumsum(ones(i))
end
a
end
R
MyTest <- function(n){
a <- as.list(rep(0, n))
for (i in 1:n)
a[[i]] <- cumsum(rep(1, i))
a
}
By setting n to 5000, 10000 and 20000, typical computing times are (median of 21 tests):
R: 0.14, 0.45, and 1.28 seconds
Julia: 0.31, 3.38, and 27.03 seconds
I used a windows-laptop with 64 bit Julia-1.3.1 and 64 bit R-3.6.1.
Both these functions use 64 bit floating-point types. My real problem involves integers and then R is even more favorable. But integer comparison isn’t fair since R uses 32 bit integers and Julia 64 bit.
Is it something I can do to speed up Julia or is really Julia much slower than R in this case?

I don't quite see how you get your test results. Assuming you want 32 bit integers, as you said, then we have
julia> function mytest(n)
a = Vector{Vector{Int32}}(undef, n)
for i in 1:n
a[i] = cumsum(ones(i))
end
return a
end
mytest (generic function with 1 method)
julia> #btime mytest(20000);
1.108 s (111810 allocations: 3.73 GiB)
When we only get rid of those allocations, we already get down to the following:
julia> function mytest(n)
a = Vector{Vector{Int32}}(undef, n)
#inbounds for i in 1:n
a[i] = collect(UnitRange{Int32}(1, i))
end
return a
end
mytest (generic function with 1 method)
julia> #btime mytest(20000);
115.702 ms (35906 allocations: 765.40 MiB)
Further devectorization does not even help:
julia> function mytest(n)
a = Vector{Vector{Int32}}(undef, n)
#inbounds for i in 1:n
v = Vector{Int32}(undef, i)
v[1] = 1
#inbounds for j = 2:i
v[j] = v[j-1] + 1
end
a[i] = v
end
return a
end
mytest (generic function with 1 method)
julia> #btime mytest(20000);
188.856 ms (35906 allocations: 765.40 MiB)
But with a couple of threads (I assume the inner arrays are independent), we get 2x speed-up again:
julia> Threads.nthreads()
4
julia> function mytest(n)
a = Vector{Vector{Int32}}(undef, n)
Threads.#threads for i in 1:n
v = Vector{Int32}(undef, i)
v[1] = 1
#inbounds for j = 2:i
v[j] = v[j-1] + 1
end
a[i] = v
end
return a
end
mytest (generic function with 1 method)
julia> #btime mytest(20000);
99.718 ms (35891 allocations: 763.13 MiB)
But this is only about as fast as the second variant above.
That is, for the specific case of cumsum. Other inner functions are slower, of course, but can be equally threaded, and optimized in the same ways, with possibly different results.
(This is on Julia 1.2, 12 GiB RAM, and an older i7.)

Perhaps R is doing some type of buffering for such simple functions?
Here is the Julia version with buffering:
using Memoize
#memoize function cumsum_ones(i)
cumsum(ones(i))
end
function MyTest2(n)
a = Vector{Vector{Float64}}(undef, n)
for i in 1:n
a[i] = cumsum_ones(i)
end
a
end
In a warmed-up function, the timings look the following:
julia> #btime MyTest2(5000);
442.500 μs (10002 allocations: 195.39 KiB)
julia> #btime MyTest2(10000);
939.499 μs (20002 allocations: 390.70 KiB)
julia> #btime MyTest2(20000);
3.554 ms (40002 allocations: 781.33 KiB)

Related

julia multiplication of two arrays

Is there a way to speed-up/ write more elegantly this array multiplication (which, in numpy arrays, I would write as A*B)?
A = rand(8,15,10)
B = rand(10,5)
C = zeros(8,15,5)
for i in 1:8
for j in 1:15
for k in 1:10
for l in 1:5
C[i,j,l] = A[i,j,:]⋅B[:,l]
end
end
end
end
There are a bunch of Julia packages which allow you to write your contraction in one simple line. Here a few examples based on Einsum.jl, OMEinsum.jl, and TensorOperations.jl:
using OMEinsum
f_omeinsum(A,B) = ein"ijk,km->ijm"(A,B)
using Einsum
f_einsum(A,B) = #einsum C[i,j,l] := A[i,j,k] * B[k,l]
using TensorOperations
f_tensor(A,B) = #tensor C[i,j,l] := A[i,j,k] * B[k,l]
Apart from these elegant (and fast, see below) versions, you can improve your loop code quite a bit. Here your code, wrapped into a function, and an improved version with comments:
function f(A,B)
C = zeros(8,15,5)
for i in 1:8
for j in 1:15
for k in 1:10
for l in 1:5
C[i,j,l] = A[i,j,:]⋅B[:,l]
end
end
end
end
return C
end
function f_fast(A,B)
# check bounds
n1,n2,n3 = size(A)
m1, m2 = size(B)
#assert m1 == n3
C = zeros(n1,n2,m2)
# * #inbounds to skip boundchecks inside the loop
# * different order of the loops to account for Julia's column major order
# * written out the k-loop (dot product) explicitly to avoid temporary allocations
#inbounds for l in 1:m2
for k in 1:m1
for j in 1:n2
for i in 1:n1
C[i,j,l] += A[i,j,k]*B[k,l]
end
end
end
end
return C
end
Let's compare all approaches. First we check for correctness:
using Test
#test f(A,B) ≈ f_omeinsum(A,B) # Test passed
#test f(A,B) ≈ f_einsum(A,B) # Test passed
#test f(A,B) ≈ f_tensor(A,B) # Test passed
#test f(A,B) ≈ f_fast(A,B) # Test passed
Now, let's benchmark using BenchmarkTools.jl. I put the timings on my machine as comments.
using BenchmarkTools
#btime f($A,$B); # 663.500 μs (12001 allocations: 1.84 MiB)
#btime f_omeinsum($A,$B); # 33.799 μs (242 allocations: 20.20 KiB)
#btime f_einsum($A,$B); # 4.200 μs (1 allocation: 4.81 KiB)
#btime f_tensor($A,$B); # 2.367 μs (3 allocations: 4.94 KiB)
#btime f_fast($A,$B); # 7.375 μs (1 allocation: 4.81 KiB)
As we can see, all the einsum/tensor notation based approaches are much faster than your original loop implementation - and only one liners! The performance of our f_fast is in the same ballpark but still quite a bit behind f_tensor, which is the fastest.
Finally, let's go all for performance, because we can. Utilizing the wizardry from LoopVectorization.jl, we replace the #inbounds in f_fast with #avx (we call this new version f_avx below) and automagically get another 2x speed up relative to the f_tensor performance above:
#test f(A,B) ≈ f_avx(A,B) # Test passed
#btime f_avx($A,$B); # 930.769 ns (1 allocation: 4.81 KiB)
However, because of its simplicity I'd still prefer f_tensor unless every microsecond counts in your application.

How to efficiently initialize huge sparse arrays in Julia?

There are two ways one can initialize a NXN sparse matrix, whose entries are to be read from one/multiple text files. Which one is faster? I need the more efficient one, as N is large, typically 10^6.
1). I could store the (x,y) indices in arrays x, y, the entries in an array v and declare
K = sparse(x,y,value);
2). I could declare
K = spzeros(N)
then read of the (i,j) coordinates and values v and insert them as
K[i,j]=v;
as they are being read.
I found no tips about this on Julia’s page on sparse arrays.
Don’t insert values one by one: that will be tremendously inefficient since the storage in the sparse matrix needs to be reallocated over and over again.
You can also use BenchmarkTools.jl to verify this:
julia> using SparseArrays
julia> using BenchmarkTools
julia> I = rand(1:1000, 1000); J = rand(1:1000, 1000); X = rand(1000);
julia> function fill_spzeros(I, J, X)
x = spzeros(1000, 1000)
#assert axes(I) == axes(J) == axes(X)
#inbounds for i in eachindex(I)
x[I[i], J[i]] = X[i]
end
x
end
fill_spzeros (generic function with 1 method)
julia> #btime sparse($I, $J, $X);
10.713 μs (12 allocations: 55.80 KiB)
julia> #btime fill_spzeros($I, $J, $X);
96.068 μs (22 allocations: 40.83 KiB)
Original post can be found here

Convert type Array{Union{Missing, Float64},1} to Array{Float64,1} in Julia

I have an array of floats having some missing values, hence its type is Array{Union{Missing, Float64},1}. Is there a command to convert the non-missing part into Array{Float64,1}?
Here are three solutions, in order of preference (thanks to #BogumilKaminski for the first one):
f1(x) = collect(skipmissing(x))
f2(x) = Float64[ a for a in x if !ismissing(a) ]
f3(x) = x[.!ismissing.(x)]
f1 lazy-loads the array with skipmissing (useful for e.g. iteration) and then builds the array via collect.
f2 uses a for loop but is likely to be slower than f1 since the final array length is not computed ahead of time.
f3 uses broadcasting, and allocates temporaries in the process, and so is likely to be the slowest of the three.
We can verify the above with a simple benchmark:
using BenchmarkTools
x = Array{Union{Missing,Float64}}(undef, 100);
inds = unique(rand(1:100, 50));
x[inds] = randn(length(inds));
#btime f1($x);
#btime f2($x);
#btime f3($x);
Resulting in:
julia> #btime f1($x);
377.186 ns (7 allocations: 1.22 KiB)
julia> #btime f2($x);
471.204 ns (8 allocations: 1.23 KiB)
julia> #btime f3($x);
732.726 ns (6 allocations: 4.80 KiB)

How to find the index of the last maximum in julialang?

I have an array that contains repeated nonnegative integers, e.g., A=[5,5,5,0,1,1,0,0,0,3,3,0,0]. I would like to find the position of the last maximum in A. That is the largest index i such that A[i]>=A[j] for all j. In my example, i=3.
I tried to find the indices of all maximum of A then find the maximum of these indices:
A = [5,5,5,0,1,1,0,0,0,3,3,0,0];
Amax = maximum(A);
i = maximum(find(x -> x == Amax, A));
Is there any better way?
length(A) - indmax(#view A[end:-1:1]) + 1
should be pretty fast, but I didn't benchmark it.
EDIT: I should note that by definition #crstnbr 's solution (to write the algorithm from scratch) is faster (how much faster is shown in Xiaodai's response). This is an attempt to do it using julia's inbuilt array functions.
What about findlast(A.==maximum(A)) (which of course is conceptually similar to your approach)?
The fastest thing would probably be explicit loop implementation like this:
function lastindmax(x)
k = 1
m = x[1]
#inbounds for i in eachindex(x)
if x[i]>=m
k = i
m = x[i]
end
end
return k
end
I tried #Michael's solution and #crstnbr's solution and I found the latter much faster
a = rand(Int8(1):Int8(5),1_000_000_000)
#time length(a) - indmax(#view a[end:-1:1]) + 1 # 19 seconds
#time length(a) - indmax(#view a[end:-1:1]) + 1 # 18 seconds
function lastindmax(x)
k = 1
m = x[1]
#inbounds for i in eachindex(x)
if x[i]>=m
k = i
m = x[i]
end
end
return k
end
#time lastindmax(a) # 3 seconds
#time lastindmax(a) # 2.8 seconds
Michael's solution doesn't support Strings (ERROR: MethodError: no method matching view(::String, ::StepRange{Int64,Int64})) or sequences so I add another solution:
julia> lastimax(x) = maximum((j,i) for (i,j) in enumerate(x))[2]
julia> A="abžcdž"; lastimax(A) # unicode is OK
6
julia> lastimax(i^2 for i in -10:7)
1
If you more like don't catch exception for empty Sequence:
julia> lastimax(x) = !isempty(x) ? maximum((j,i) for (i,j) in enumerate(x))[2] : 0;
julia> lastimax(i for i in 1:3 if i>4)
0
Simple(!) benchmarks:
This is up to 10 times slower than Michael's solution for Float64:
julia> mlastimax(A) = length(A) - indmax(#view A[end:-1:1]) + 1;
julia> julia> A = rand(Float64, 1_000_000); #time lastimax(A); #time mlastimax(A)
0.166389 seconds (4.00 M allocations: 91.553 MiB, 4.63% gc time)
0.019560 seconds (6 allocations: 240 bytes)
80346
(I am surprised) it is 2 times faster for Int64!
julia> A = rand(Int64, 1_000_000); #time lastimax(A); #time mlastimax(A)
0.015453 seconds (10 allocations: 304 bytes)
0.031197 seconds (6 allocations: 240 bytes)
423400
it is 2-3 times slower for Strings
julia> A = ["A$i" for i in 1:1_000_000]; #time lastimax(A); #time mlastimax(A)
0.175117 seconds (2.00 M allocations: 61.035 MiB, 41.29% gc time)
0.077098 seconds (7 allocations: 272 bytes)
999999
EDIT2:
#crstnbr solution is faster and works with Strings too (doesn't work with generators). There difference between lastindmax and lastimax - first return byte index, second return character index:
julia> S = "1š3456789ž"
julia> length(S)
10
julia> lastindmax(S) # return value is bigger than length
11
julia> lastimax(S) # return character index (which is not byte index to String) of last max character
10
julia> S[chr2ind(S, lastimax(S))]
'ž': Unicode U+017e (category Ll: Letter, lowercase)
julia> S[chr2ind(S, lastimax(S))]==S[lastindmax(S)]
true

Order of linear algebra operations in Julia

If I have a command y = A*B*x where A & B are large matrices and x & y are vectors, will Julia preform y = ((A*B)*x) or y = (A*(B*x))?
The second option should be the best as it only has to allocate an extra vector rather than a large matrix.
The best way to verify this kind of thing is to dump the lowered code via #code_lowered macro:
julia> #code_lowered A * B * x
CodeInfo(:(begin
nothing
return (Core._apply)(Base.afoldl, (Core.tuple)(Base.*, (a * b) * c), xs)
end))
Like many other languages, Julia does y = (A*B)*x instead of y = A*(B*x), so it's up to you to explicitly use parens to reduce the allocation.
julia> using BenchmarkTools
julia> #btime $A * ($B * $x);
6.800 μs (2 allocations: 1.75 KiB)
julia> #btime $A * $B * $x;
45.453 μs (3 allocations: 79.08 KiB)

Resources