Julia: nested loops for pairwise distances is really slow - julia

I have some code which loads a csv file of 2000 2D coordinates, then a function called collision_count counts the number of pairs of coordinates that are closer than a distance d of each other:
using BenchmarkTools
using CSV
using LinearAlgebra
function load_csv()::Array{Float64,2}
df = CSV.read("pos.csv", header=0)
return Matrix(df)'
end
function collision_count(pos::Array{Float64,2}, d::Float64)::Int64
count::Int64 = 0
N::Int64 = size(pos, 2)
for i in 1:N
for j in (i+1):N
@views dist = norm(pos[:,i] - pos[:,j])
count += dist < d
end
end
return count
end
Here are the results:
pos = load_csv()
@benchmark collision_count($pos, 2.0)
BenchmarkTools.Trial:
memory estimate: 366.03 MiB
allocs estimate: 5997000
--------------
minimum time: 152.070 ms (18.80% GC)
median time: 158.915 ms (20.60% GC)
mean time: 158.751 ms (20.61% GC)
maximum time: 181.726 ms (21.98% GC)
--------------
samples: 32
evals/sample: 1
This is about 30x slower than this Python code:
import numpy as np
import scipy.spatial.distance
pos = np.loadtxt('pos.csv',delimiter=',')
def collision_count(pos, d):
pdist = scipy.spatial.distance.pdist(pos)
return np.count_nonzero(pdist < d)
%timeit collision_count(pos, 2)
5.41 ms ± 63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Any way to make it faster? And what's up with all the allocations?

The fastest I can get trivially is the following:
using Distances
using StaticArrays
using LinearAlgebra
pos = [@SVector rand(2) for _ in 1:2000]
function collision_count(pos::Vector{<:AbstractVector}, d)
count = 0
@inbounds for i in eachindex(pos)
for j in (i+1):lastindex(pos)
dist = sqeuclidean(pos[i], pos[j])
count += dist < d*d
end
end
return count
end
There are a variety of changes here, some stylistic, some structural. Starting with style, you may note that I don't type anything more restrictively than I need to. Tighter type annotations would bring no performance benefit, since Julia is smart enough to infer types for your code.
The biggest structural change is switching from a matrix to a vector of SVectors (static vectors from StaticArrays). The reason for this change is that since points are your scalar type, it makes more sense to have a vector of elements where each element is a point; because each SVector has a fixed size known to the compiler, indexing and subtracting points does not allocate. The next change I made is to compare squared distances (sqeuclidean against d*d), since sqrt operations are expensive. The results speak for themselves:
@benchmark collision_count($pos, .1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.182 ms (0.00% GC)
median time: 1.214 ms (0.00% GC)
mean time: 1.218 ms (0.00% GC)
maximum time: 2.160 ms (0.00% GC)
--------------
samples: 4101
evals/sample: 1
Note that there are n log(n) algorithms that may be faster, but this should be pretty close to optimal for a naive implementation.
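One common alternative to the all-pairs loop is a uniform grid (cell list): bucket the points into square cells of side d, then compare each point only with points in its own and the eight neighbouring cells. The sketch below is not from the answer above; the name collision_count_grid is hypothetical, it assumes the 2×N matrix layout from the question, and it only pays off when d is small relative to the spread of the points.

```julia
# Hypothetical helper: count pairs closer than d using a uniform grid of cell size d.
# Assumes pos is a 2×N matrix with one point per column, as in the question.
function collision_count_grid(pos::AbstractMatrix, d)
    # Map each cell (integer grid coordinates) to the indices of the points inside it.
    cells = Dict{Tuple{Int,Int},Vector{Int}}()
    for i in axes(pos, 2)
        key = (floor(Int, pos[1, i] / d), floor(Int, pos[2, i] / d))
        push!(get!(() -> Int[], cells, key), i)
    end

    count = 0
    d2 = d^2
    for ((cx, cy), members) in cells
        # Two points closer than d always lie in the same or in adjacent cells.
        for dx in -1:1, dy in -1:1
            neighbours = get(cells, (cx + dx, cy + dy), nothing)
            neighbours === nothing && continue
            for i in members, j in neighbours
                i < j || continue   # count every unordered pair exactly once
                dist2 = abs2(pos[1, i] - pos[1, j]) + abs2(pos[2, i] - pos[2, j])
                count += dist2 < d2
            end
        end
    end
    return count
end

pos = rand(2, 2000) .* 100
collision_count_grid(pos, 2.0)
```

For roughly uniformly spread points the work is proportional to the number of points times the average occupancy of a 3×3 block of cells; whether it beats the simple loop for a given data set would need to be benchmarked.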

Here's a solution that doesn't rely on specific knowledge about the dimensionality of the points:
(Edit: I updated the function to make it more robust with respect to indexing. Some AbstractArrays have indices that do not start at 1, so now I use axes and lastindex instead of size.)
function collision_count2(pos::AbstractMatrix, d)
count = 0
@inbounds for i in axes(pos, 2)
for j in (i+1):lastindex(pos, 2)
dist2 = sum(abs2(pos[k, i] - pos[k, j]) for k in axes(pos, 1))
count += dist2 < d^2
end
end
return count
end
Benchmarks:
julia> using BenchmarkTools
julia> @btime collision_count(pos, 0.7) setup=(pos=rand(2, 2000));
533.055 ms (13991005 allocations: 488.01 MiB)
julia> @btime collision_count2(pos, 0.7) setup=(pos=rand(2, 2000));
4.700 ms (0 allocations: 0 bytes)
The speed is actually close to the SVector solution. On the upcoming Julia version 1.5, the difference compared to the OP's code should be much smaller, since views become more efficient.
BTW: drop the type annotations, like these
count::Int64 = 0
N::Int64 = size(pos, 2)
it's just adding visual noise.
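On the question of where the allocations come from: 2000 points give 2000·1999/2 = 1999000 pairs, and the reported 5997000 allocations are exactly 3 per pair, which is consistent with three temporary objects per iteration (two column slices or views, plus their difference). A minimal sketch for checking this on a single pair follows; the helper names are hypothetical and the exact byte counts depend on the Julia version.

```julia
using LinearAlgebra

# Slicing as in the question (without @views): each call builds temporary vectors.
dist_slices(pos, i, j) = norm(pos[:, i] - pos[:, j])

# Scalar loop over the coordinates: no temporary arrays are needed.
function dist_scalar(pos, i, j)
    s = zero(eltype(pos))
    for k in axes(pos, 1)
        s += abs2(pos[k, i] - pos[k, j])
    end
    return sqrt(s)
end

# Measure inside a function so that global-variable overhead is not counted.
function bytes_allocated(f, pos)
    f(pos, 1, 2)                    # warm-up call, so compilation is not measured
    return @allocated f(pos, 1, 2)
end

pos = rand(2, 2000)
println(bytes_allocated(dist_slices, pos))  # positive: temporaries are created
println(bytes_allocated(dist_scalar, pos))  # expected to be 0 on recent Julia versions
```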

Related

What went wrong with my Julia loops/devectorized code

I'm using Julia 1.0. Please consider the following code:
using LinearAlgebra
using Distributions
## create random data
const data = rand(Uniform(-1,2), 100000, 2)
function test_function_1(data)
theta = [1 2]
coefs = theta * data[:,1:2]'
res = coefs' .* data[:,1:2]
return sum(res, dims = 1)'
end
function test_function_2(data)
theta = [1 2]
sum_all = zeros(2)
for i = 1:size(data)[1]
sum_all .= sum_all + (theta * data[i,1:2])[1] * data[i,1:2]
end
return sum_all
end
After running it for the first time, I timed it
julia> @time test_function_1(data)
0.006292 seconds (16 allocations: 5.341 MiB)
2×1 Adjoint{Float64,Array{Float64,2}}:
150958.47189289227
225224.0374366073
julia> @time test_function_2(data)
0.038112 seconds (500.00 k allocations: 45.777 MiB, 15.61% gc time)
2-element Array{Float64,1}:
150958.4718928927
225224.03743660534
test_function_1 is significantly superior, both in allocations and speed, but test_function_1 is not devectorized. I would expect test_function_2 to perform better. Note that both functions do the same.
I have a hunch that it's because in test_function_2, I use sum_all .= sum_all + ..., but I'm not sure why that's a problem. Can I get a hint?
So first let me comment how I would write your function if I wanted to use a loop:
function test_function_3(data)
theta = (1, 2)
sum_all = zeros(2)
for row in eachrow(data)
sum_all .+= dot(theta, row) .* row
end
return sum_all
end
Next, here is a benchmark comparison of the three options:
julia> @benchmark test_function_1($data)
BenchmarkTools.Trial:
memory estimate: 5.34 MiB
allocs estimate: 16
--------------
minimum time: 1.953 ms (0.00% GC)
median time: 1.986 ms (0.00% GC)
mean time: 2.122 ms (2.29% GC)
maximum time: 4.347 ms (8.00% GC)
--------------
samples: 2356
evals/sample: 1
julia> @benchmark test_function_2($data)
BenchmarkTools.Trial:
memory estimate: 45.78 MiB
allocs estimate: 500002
--------------
minimum time: 16.316 ms (7.44% GC)
median time: 16.597 ms (7.63% GC)
mean time: 16.845 ms (8.01% GC)
maximum time: 34.050 ms (4.45% GC)
--------------
samples: 297
evals/sample: 1
julia> @benchmark test_function_3($data)
BenchmarkTools.Trial:
memory estimate: 96 bytes
allocs estimate: 1
--------------
minimum time: 777.204 μs (0.00% GC)
median time: 791.458 μs (0.00% GC)
mean time: 799.505 μs (0.00% GC)
maximum time: 1.262 ms (0.00% GC)
--------------
samples: 6253
evals/sample: 1
Next you can go a bit faster if you explicitly implement the dot in the loop:
julia> function test_function_4(data)
theta = (1, 2)
sum_all = zeros(2)
for row in eachrow(data)
@inbounds sum_all .+= (theta[1]*row[1]+theta[2]*row[2]) .* row
end
return sum_all
end
test_function_4 (generic function with 1 method)
julia> @benchmark test_function_4($data)
BenchmarkTools.Trial:
memory estimate: 96 bytes
allocs estimate: 1
--------------
minimum time: 502.367 μs (0.00% GC)
median time: 502.547 μs (0.00% GC)
mean time: 505.446 μs (0.00% GC)
maximum time: 806.631 μs (0.00% GC)
--------------
samples: 9888
evals/sample: 1
To understand the differences let us have a look at this line of your code:
sum_all .= sum_all + (theta * data[i,1:2])[1] * data[i,1:2]
Let us count the memory allocations you do in this expression:
sum_all .=
sum_all
+ # allocation of a new vector as a result of addition
(theta
* # allocation of a new vector as a result of multiplication
data[i,1:2] # allocation of a new vector via getindex
)[1]
* # allocation of a new vector as a result of multiplication
data[i,1:2] # allocation of a new vector via getindex
So you can see that in each iteration of the loop you allocate five times.
Allocations are expensive, and you can see in the benchmarks that you have 500002 allocations in the process:
1 allocation of sum_all
1 allocation of theta
500000 allocations in the loop (5 * 100000)
Additionally you perform indexing like data[i,1:2] which performs
bounds checking, which is also a small cost (but marginal in comparison to allocations).
Now in function test_function_3 I use eachrow(data). This time I also get rows of the data matrix, but they are returned as views (not new matrices), so no allocation happens inside the loop. Next I use the dot function, again to avoid the allocation that was earlier caused by a matrix multiplication (I have changed theta to a Tuple from a Matrix as then dot is a bit faster, but this is secondary). Finally I write sum_all .+= dot(theta, row) .* row, and in this case all operations are broadcasted, so Julia can do broadcast fusion (again, no allocations happen).
In test_function_4 I just replace dot by an unrolled loop, as we know we have two elements to calculate the dot product for. Actually, if you fully unroll everything and use @simd it gets even faster:
julia> function test_function_5(data)
theta = (1, 2)
s1 = 0.0
s2 = 0.0
@inbounds @simd for i in axes(data, 1)
r1 = data[i, 1]
r2 = data[i, 2]
mul = theta[1]*r1 + theta[2]*r2
s1 += mul * r1
s2 += mul * r2
end
return [s1, s2]
end
test_function_5 (generic function with 1 method)
julia> @benchmark test_function_5($data)
BenchmarkTools.Trial:
memory estimate: 96 bytes
allocs estimate: 1
--------------
minimum time: 22.721 μs (0.00% GC)
median time: 23.146 μs (0.00% GC)
mean time: 24.306 μs (0.00% GC)
maximum time: 100.109 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
So you can see that this way you are roughly 85x faster than with test_function_1 (22.7 μs vs. 1.95 ms minimum time). Still, test_function_3 is already relatively fast and fully generic, so normally I would write something like test_function_3 unless I really needed maximum speed and knew that the dimensions of my data are fixed and small.
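For reference, every variant in this thread computes the same quantity: with the rows of data written as x_i, the result is the sum over i of (theta · x_i) x_i, which equals Xᵀ(Xθ) in matrix form. The sketch below checks that identity against the loop; the function names are hypothetical, and theta is taken as a plain column vector, which is an assumption (the original code uses a 1×2 matrix or a tuple).

```julia
using LinearAlgebra

# Hypothetical name: the same sum written as two matrix-vector products, X' * (X * theta).
# Assumes theta is a Vector whose length equals the number of columns of data.
matrix_form(data, theta) = data' * (data * theta)

# Loop in the style of test_function_3, used here only as a reference result.
function loop_reference(data, theta)
    acc = zeros(size(data, 2))
    for row in eachrow(data)
        acc .+= dot(theta, row) .* row
    end
    return acc
end

data = rand(1000, 2)
theta = [1.0, 2.0]
matrix_form(data, theta) ≈ loop_reference(data, theta)  # true
```

The matrix form allocates one temporary vector with one entry per row of data, so it is not a replacement for the unrolled loop; how it compares in speed would need to be measured.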

Julia: How to count efficiently the number of missings in a `Vector{Union{T, Missing}}`

Consider
x = rand([missing, rand(Int, 100)...], 1_000_000)
which yields typeof(x) = Array{Union{Missing, Int64},1}.
What's the most efficient way to count the number of missings in x?
The cleanest way is probably just
count(ismissing, x)
Simple, easy to remember, and fast
Since you're asking for the "most efficient" way, let me give some benchmark results. It is slightly faster than @xiaodai's answer, and as fast as a simple loop implementation:
julia> @btime count($ismissing,$x);
278.499 μs (0 allocations: 0 bytes)
julia> @btime mapreduce($ismissing, $+, $x);
293.901 μs (0 allocations: 0 bytes)
julia> @btime count_missing($x)
278.499 μs (0 allocations: 0 bytes)
where
julia> function count_missing(x)
c = 0
@inbounds for i in eachindex(x)
if ismissing(x[i])
c += 1
end
end
return c
end
Abstraction for no cost, just the way you'd want it to be.
If you know that the number of missings is less than 4 billion (or less than 65k), you can be several times faster than @crstnbr's answer with the following code:
function count_missing(x, T)
c = zero(T)
for i in 1:length(x)
c += @inbounds ismissing(x[i])
end
return Int(c) #we want to have stable result type
# this could be further combined with a barrier function
# that could check the size of `x` at the runtime
end
Now the benchmarks.
This is the original time on my laptop:
julia> @btime count_missing($x, Int)
227.799 μs (0 allocations: 0 bytes)
9971
Slash the time by half if you know there are fewer than 4 billion matching elements:
julia> @btime count_missing($x, UInt32)
113.899 μs (0 allocations: 0 bytes)
9971
Slash the time by 8x if you know there are fewer than 65k matching elements:
julia> @btime count_missing($x, UInt16)
29.200 μs (0 allocations: 0 bytes)
9971
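The barrier function mentioned in the code comments could look roughly like the sketch below: it picks the narrowest counter type that cannot overflow for the given length and then dispatches to a type-stable kernel. The names count_missing_typed and count_missing_auto are hypothetical and not part of the answer above.

```julia
# Kernel: same loop as above, with the counter type passed as a type parameter.
function count_missing_typed(x, ::Type{T}) where {T}
    c = zero(T)
    for i in eachindex(x)
        @inbounds c += ismissing(x[i])
    end
    return Int(c)   # stable result type regardless of T
end

# Barrier: the count can never exceed length(x), so a counter type whose
# typemax is at least length(x) cannot overflow.
function count_missing_auto(x)
    n = length(x)
    if n <= typemax(UInt16)
        return count_missing_typed(x, UInt16)
    elseif n <= typemax(UInt32)
        return count_missing_typed(x, UInt32)
    else
        return count_missing_typed(x, Int)
    end
end

x = rand([missing, rand(Int, 100)...], 1_000_000)
count_missing_auto(x) == count(ismissing, x)  # true
```

Note that this check is conservative: it looks at the length of x, not at the actual number of missings, so a long vector with few missings still takes the wider counter.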
This is an unsafe answer and is not guaranteed to work in future if Julia changes the memory layout but it's fun
x = Vector{Union{Missing, Float64}}(missing, 100_000_000)
x[rand(1:100_000_000, 90_000_000)] .= rand.()
using BenchmarkTools
@benchmark count($ismissing, $x)
# BenchmarkTools.Trial:
# memory estimate: 0 bytes
# allocs estimate: 0
# --------------
# minimum time: 48.468 ms (0.00% GC)
# median time: 51.755 ms (0.00% GC)
# mean time: 66.863 ms (0.00% GC)
# maximum time: 91.449 ms (0.00% GC)
# --------------
# samples: 76
# evals/sample: 1
function unsafe_count_missing(x::Vector{Union{Missing, T}}) where T
@assert isbitstype(T)
l = length(x)
GC.@preserve x begin
y = unsafe_wrap(Vector{UInt8}, Ptr{UInt8}(pointer(x) + sizeof(T)*l), l)
res = reduce(-, y; init = l)
end
res
end
@time count(ismissing, x) == unsafe_count_missing(x)
@benchmark unsafe_count_missing($x)
# BenchmarkTools.Trial:
# memory estimate: 80 bytes
# allocs estimate: 1
# --------------
# minimum time: 9.190 ms (0.00% GC)
# median time: 9.718 ms (0.00% GC)
# mean time: 9.845 ms (0.00% GC)
# maximum time: 15.691 ms (0.00% GC)
# --------------
# samples: 508
# evals/sample: 1

Julia multiply each matrix along dim

I have a 3 dimensional array
x = rand(6,6,2^10)
I want to multiply each matrix along the third dimension by a vector. Is there a more clean way to do this than:
y = rand(6,1)
z = zeros(6,1,2^10)
for i in 1:2^10
z[:,:,i] = x[:,:,i] * y
end
If you are working with matrices, it may be appropriate to consider x as a vector of matrices instead of a 3D array. Then you could do
x = [rand(6,6) for _ in 1:2^10]
y = [rand(6)]
z = x .* y
z is now a vector of vectors.
And if z is preallocated, that would be
z .= x .* y
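Wrapping y in a one-element vector works because broadcasting then pairs that single element with every matrix in x. A sketch of an equivalent spelling that keeps y as a plain vector and uses Ref to mark it as a single value for broadcasting:

```julia
x = [rand(6, 6) for _ in 1:2^10]   # vector of matrices
y = rand(6)                        # a plain vector this time

z = x .* Ref(y)                    # Ref(y) makes broadcasting treat y as one value
length(z), length(z[1])            # 1024 result vectors, each of length 6
```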
And, if you want it really fast, use vectors of StaticArrays
using StaticArrays
x = [@SMatrix rand(6, 6) for _ in 1:2^10]
y = [@SVector rand(6)]
z = x .* y
That's showing a 10x speedup on my computer, running in 12 μs.
mapslices(i->i*y, x, (1,2)) is maybe "cleaner" but it will be slower.
Read as: apply the function "times by y" to each slice of the first two dimensions.
function tst(x,y)
z = zeros(6,1,2^10)
for i in 1:2^10
z[:,:,i] = x[:,:,i] * y
end
return z
end
tst2(x,y) = mapslices(i->i*y, x, (1,2))
@time tst(x,y);
0.002152 seconds (4.10 k allocations: 624.266 KB)
@time tst2(x,y);
0.005720 seconds (13.36 k allocations: 466.969 KB)
sum(x.*y',2) is a clean short solution.
It also has good speed and memory properties. The trick is to view matrix-vector multiplication as a linear combination of matrix's columns scaled by the vector elements. Instead of doing each linear combination for matrix x[:,:,i], we use the same scale y[i] for x[:,i,:]. In code:
const x = rand(6,6,2^10);
const y = rand(6,1);
function tst(x,y)
z = zeros(6,1,2^10)
for i in 1:2^10
z[:,:,i] = x[:,:,i]*y
end
return z
end
tst2(x,y) = mapslices(i->i*y,x,(1,2))
tst3(x,y) = sum(x.*y',2)
Benchmarking gives:
julia> using BenchmarkTools
julia> z = tst(x,y); z2 = tst2(x,y); z3 = tst3(x,y);
julia> @benchmark tst(x,y)
BenchmarkTools.Trial:
memory estimate: 688.11 KiB
allocs estimate: 8196
--------------
median time: 759.545 μs (0.00% GC)
samples: 6068
julia> @benchmark tst2(x,y)
BenchmarkTools.Trial:
memory estimate: 426.81 KiB
allocs estimate: 10798
--------------
median time: 1.634 ms (0.00% GC)
samples: 2869
julia> @benchmark tst3(x,y)
BenchmarkTools.Trial:
memory estimate: 336.41 KiB
allocs estimate: 12
--------------
median time: 114.060 μs (0.00% GC)
samples: 10000
So tst3 using sum has better performance (~7x over tst and ~15x over tst2).
Using StaticArrays as suggested by @DNF is also an option, and it would be nice to compare it to the solutions here.
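The positional forms mapslices(f, x, (1,2)) and sum(A, 2) used in this thread come from Julia versions before 1.0; from 1.0 onward the dimension is passed through the dims keyword. A sketch of the three variants in that syntax, with a check that they agree (the timings quoted above were not re-measured):

```julia
x = rand(6, 6, 2^10)
y = rand(6, 1)

# Explicit loop over the third dimension.
function tst(x, y)
    z = zeros(size(x, 1), 1, size(x, 3))
    for i in axes(x, 3)
        z[:, :, i] = x[:, :, i] * y
    end
    return z
end

# Apply "times y" to every 6×6 slice spanned by dimensions 1 and 2.
tst2(x, y) = mapslices(m -> m * y, x; dims=(1, 2))

# Scale column j of every slice by y[j], then sum over the column dimension.
tst3(x, y) = sum(x .* y'; dims=2)

tst(x, y) ≈ tst2(x, y) ≈ tst3(x, y)  # true
```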

Unzip an array of tuples in julia

Suppose I have an array of tuples:
arr = [(1,2), (3,4), (5,6)]
With python I can do zip(*arr) == [(1, 3, 5), (2, 4, 6)]
What is the equivalent of this in julia?
As an alternative to splatting (since that's pretty slow), you could do something like:
unzip(a) = map(x->getfield.(a, x), fieldnames(eltype(a)))
This is pretty quick.
julia> using BenchmarkTools
julia> a = collect(zip(1:10000, 10000:-1:1));
julia> @benchmark unzip(a)
BenchmarkTools.Trial:
memory estimate: 156.45 KiB
allocs estimate: 6
--------------
minimum time: 25.260 μs (0.00% GC)
median time: 31.997 μs (0.00% GC)
mean time: 48.429 μs (25.03% GC)
maximum time: 36.130 ms (98.67% GC)
--------------
samples: 10000
evals/sample: 1
By comparison, I have yet to see this complete:
@time collect(zip(a...))
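Applied to the small array from the question, the getfield-based unzip returns a tuple of vectors rather than a list of tuples. A short usage sketch (it relies on the element type being a concrete tuple type, so that fieldnames can enumerate the positions):

```julia
unzip(a) = map(x -> getfield.(a, x), fieldnames(eltype(a)))

arr = [(1, 2), (3, 4), (5, 6)]
firsts, seconds = unzip(arr)
# firsts  is [1, 3, 5]
# seconds is [2, 4, 6]
```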
For larger arrays use @ivirshup's solution below.
For smaller arrays, you can use zip and splatting.
You can achieve the same thing in Julia by using the zip() function. zip() expects the tuples as separate arguments, so you have to use the splatting operator ... to supply them. Also, in Julia you have to use the collect() function to turn the resulting iterator into an array (if you want to).
Here are these functions in action:
arr = [(1,2), (3,4), (5,6)]
# without splatting
collect(zip((1,2), (3,4), (5,6)))
# Output is a vector of tuples:
> [(1, 3, 5), (2, 4, 6)]
# same results with splatting
collect(zip(arr...))
> [(1, 3, 5), (2, 4, 6)]
In Julia, use splatting (...):
for r in zip(arr...)
println(r)
end
There is also the Unzip.jl package:
julia> using Unzip
julia> unzip([(1,2), (3,4), (5,6)])
([1, 3, 5], [2, 4, 6])
which seems to work a bit faster than the selected answer:
julia> using Unzip, BenchmarkTools
julia> a = collect(zip(1:10000, 10000:-1:1));
julia> unzip_ivirshup(a) = map(x->getfield.(a, x), fieldnames(eltype(a))) ;
julia> @btime unzip_ivirshup($a);
18.439 μs (4 allocations: 156.41 KiB)
julia> @btime unzip($a); # unzip from Unzip.jl is faster
12.798 μs (4 allocations: 156.41 KiB)
julia> unzip(a) == unzip_ivirshup(a) # check output is the same
true
Following up on @ivirshup's answer, I would like to add a version that is still an iterator
unzip(a) = (getfield.(a, x) for x in fieldnames(eltype(a)))
which keeps the result unevaluated until used. It even gives a (very slight) speed improvement when comparing
@benchmark a1, b1 = unzip(a)
BenchmarkTools.Trial:
memory estimate: 156.52 KiB
allocs estimate: 8
--------------
minimum time: 33.185 μs (0.00% GC)
median time: 76.581 μs (0.00% GC)
mean time: 83.808 μs (18.35% GC)
maximum time: 7.679 ms (97.82% GC)
--------------
samples: 10000
evals/sample: 1
vs.
BenchmarkTools.Trial:
memory estimate: 156.52 KiB
allocs estimate: 8
--------------
minimum time: 33.914 μs (0.00% GC)
median time: 39.020 μs (0.00% GC)
mean time: 64.788 μs (16.52% GC)
maximum time: 7.853 ms (98.18% GC)
--------------
samples: 10000
evals/sample: 1
I will add a solution based on the following simple macro
"""
@unzip xs, ys, ... = us
will expand the assignment into the following code
xs, ys, ... = map(x -> x[1], us), map(x -> x[2], us), ...
"""
macro unzip(args)
args.head != :(=) && error("Expression needs to be of form `xs, ys, ... = us`")
lhs, rhs = args.args
items = isa(lhs, Symbol) ? [lhs] : lhs.args
rhs_items = [:(map(x -> x[$i], $rhs)) for i in 1:length(items)]
rhs_expand = Expr(:tuple, rhs_items...)
esc(Expr(:(=), lhs, rhs_expand))
end
Since it's just a syntactic expansion, there shouldn't be any performance or type instability issue. Compared to other solutions based on fieldnames, this has the advantage of also working when the array element type is abstract. For example, while
julia> unzip_get_field(a) = map(x->getfield.(a, x), fieldnames(eltype(a)));
julia> unzip_get_field(Any[("a", 3), ("b", 4)])
ERROR: ArgumentError: type does not have a definite number of fields
the macro version still works:
julia> @unzip xs, ys = Any[("a", 3), ("b",4)]
(["a", "b"], [3, 4])

Multi-dimensional diff/gradient in Julia

I am looking for an efficient way to compute the derivatives of a multidimensional array in Julia. To be precise, I would like to have an equivalent of numpy.gradient in Julia. However, the Julia function diff :
works only for 2-dimensional arrays
reduces the size of the array by one along the differentiated dimension
It is straightforward to extend the definition of diff of Julia so it can work on 3-dimensional arrays, e.g. with
function diff3D(A::Array, dim::Integer)
if dim == 1
[A[i+1,j,k] - A[i,j,k] for i=1:size(A,1)-1, j=1:size(A,2), k=1:size(A,3)]
elseif dim == 2
[A[i,j+1,k] - A[i,j,k] for i=1:size(A,1), j=1:size(A,2)-1, k=1:size(A,3)]
elseif dim == 3
[A[i,j,k+1] - A[i,j,k] for i=1:size(A,1), j=1:size(A,2), k=1:size(A,3)-1]
else
throw(ArgumentError("dimension dim must be 1, 2, or 3 got $dim"))
end
end
which would work with e.g.
a = [i*j*k for i in 1:10, j in 1:10, k in 1:20]
However, this approach cannot be extended to an arbitrary number of dimensions, and the boundaries are not handled, so the result cannot have the same size as the original array.
I have some ideas to implement an analogue of numpy's gradient in Julia, but I fear they would be extremely slow and ugly, hence my questions: is there a canonical way to do this in Julia that I missed? And if there is none, what would be optimal?
Thanks.
I'm not too familiar with diff, but from what I understand about what it's doing, I've made an n-dimensional implementation that uses Julia features like parametric types and splatting:
function mydiff{T,N}(A::Array{T,N}, dim::Int)
@assert dim <= N
idxs_1 = [1:size(A,i) for i in 1:N]
idxs_2 = copy(idxs_1)
idxs_1[dim] = 1:(size(A,dim)-1)
idxs_2[dim] = 2:size(A,dim)
return A[idxs_2...] - A[idxs_1...]
end
with some sanity checks:
A = rand(3,3)
@assert diff(A,1) == mydiff(A,1) # Base diff vs my impl.
@assert diff(A,2) == mydiff(A,2) # Base diff vs my impl.
A = rand(3,3,3)
@assert diff3D(A,3) == mydiff(A,3) # Your impl. vs my impl.
Note that there are more magical ways to do this, like using code generation to make specialized methods up to a finite dimension, but I think that's probably not needed to get good-enough performance.
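The function above is written in pre-1.0 syntax (function mydiff{T,N}(...)). A sketch of the same idea for Julia 1.x follows, using selectdim to take the two shifted views instead of building index lists by hand; this is a rewrite under that assumption, not the original author's code. On Julia 1.1 and later, Base's diff(A; dims=dim) should also accept arrays of any dimensionality.

```julia
# Forward difference along dimension dim for an array of any dimensionality.
function mydiff(A::AbstractArray, dim::Integer)
    1 <= dim <= ndims(A) ||
        throw(ArgumentError("dim must be between 1 and $(ndims(A)), got $dim"))
    n = size(A, dim)
    # View of entries 2:n minus view of entries 1:n-1 along the chosen dimension.
    return selectdim(A, dim, 2:n) .- selectdim(A, dim, 1:n-1)
end

A = rand(3, 4, 5)
size(mydiff(A, 3))   # (3, 4, 4)
```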
Even simpler way to do it:
mydiff(A::AbstractArray,dim) = mapslices(diff, A, dim)
Not sure how this would compare in terms of speed though.
Edit: Maybe slightly slower, but this is a more general solution to extending functions to higher-order arrays:
julia> using BenchmarkTools
julia> function mydiff{T,N}(A::Array{T,N}, dim::Int)
@assert dim <= N
idxs_1 = [1:size(A,i) for i in 1:N]
idxs_2 = copy(idxs_1)
idxs_1[dim] = 1:(size(A,dim)-1)
idxs_2[dim] = 2:size(A,dim)
return A[idxs_2...] - A[idxs_1...]
end
mydiff (generic function with 1 method)
julia> X = randn(500,500,500);
julia> @benchmark mydiff($X,3)
BenchmarkTools.Trial:
samples: 3
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 2.79 gb
allocs estimate: 22
minimum time: 2.05 s (15.64% GC)
median time: 2.15 s (14.62% GC)
mean time: 2.16 s (11.05% GC)
maximum time: 2.29 s (3.61% GC)
julia> @benchmark mapslices(diff,$X,3)
BenchmarkTools.Trial:
samples: 2
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.99 gb
allocs estimate: 3750056
minimum time: 2.52 s (7.90% GC)
median time: 2.61 s (9.17% GC)
mean time: 2.61 s (9.17% GC)
maximum time: 2.70 s (10.37% GC)
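Neither answer addresses the second half of the question, a result with the same size as the input. A minimal sketch of a numpy.gradient-like function for one dimension at a time is given below, assuming unit spacing, central differences in the interior and one-sided differences at the two boundaries. The name gradient_along is hypothetical, and the sketch has not been benchmarked.

```julia
# Hypothetical helper: derivative estimate along dim, same size as A (unit spacing assumed).
function gradient_along(A::AbstractArray, dim::Integer)
    1 <= dim <= ndims(A) || throw(ArgumentError("invalid dimension $dim"))
    n = size(A, dim)
    n >= 2 || throw(ArgumentError("need at least 2 points along dimension $dim"))
    G = similar(A, float(eltype(A)))
    # One-sided differences at the first and last index along dim.
    selectdim(G, dim, 1) .= selectdim(A, dim, 2) .- selectdim(A, dim, 1)
    selectdim(G, dim, n) .= selectdim(A, dim, n) .- selectdim(A, dim, n - 1)
    # Central differences everywhere in between.
    if n > 2
        selectdim(G, dim, 2:n-1) .=
            (selectdim(A, dim, 3:n) .- selectdim(A, dim, 1:n-2)) ./ 2
    end
    return G
end

gradient_along([1.0, 4.0, 9.0, 16.0, 25.0], 1)   # [3.0, 4.0, 6.0, 8.0, 9.0]

a = [i*j*k for i in 1:10, j in 1:10, k in 1:20]
size(gradient_along(a, 2)) == size(a)            # true
```

Calling it once per dimension, for example map(d -> gradient_along(a, d), 1:ndims(a)), gives one array per axis.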
