Improving the speed of a for loop in Julia - julia

Here is my code in Julia platform and I like to speed it up. Is there anyway that I can make this faster? It takes 0.5 seconds for a dataset of 50k*50k. I was expecting Julia to be a lot faster than this or I am not sure if I am doing a silly implementation.
ar = [[1,2,3,4,5], [2,3,4,5,6,7,8], [4,7,8,9], [9,10], [2,3,4,5]]
SV = rand(10,5)
function h_score_0(ar ,SV)
m = length(ar)
SC = Array{Float64,2}(undef, size(SV, 2), m)
for iter = 1:m
nodes = ar[iter]
for jj = 1:size(SV, 2)
mx = maximum(SV[nodes, jj])
mn = minimum(SV[nodes, jj])
term1 = (mx - mn)^2;
SC[jj, iter] = (term1);
end
end
return score = sum(SC, dims = 1)
end

You have some unnecessary allocations in your code:
mx = maximum(SV[nodes, jj])
mn = minimum(SV[nodes, jj])
Slices allocate, so each line makes a copy of the data here, you're actually copying the data twice, once on each line. You can either make sure to copy only once, or even better: use view, so there is no copy at all (note that view is much faster on Julia v1.5, in case you are using an older version).
SC = Array{Float64,2}(undef, size(SV, 2), m)
And no reason to create a matrix here, and sum over it afterwards, just accumulate while you are iterating:
score[i] += (mx - mn)^2
Here's a function that is >5x as fast on my laptop for the input data you specified:
function h_score_1(ar, SV)
score = zeros(eltype(SV), length(ar))
#inbounds for i in eachindex(ar)
nodes = ar[i]
for j in axes(SV, 2)
SVview = view(SV, nodes, j)
mx = maximum(SVview)
mn = minimum(SVview)
score[i] += (mx - mn)^2
end
end
return score
end
This function outputs a one-dimensional vector instead of a 1xN matrix in your original function.
In principle, this could be even faster if we replace
mx = maximum(SVview)
mn = minimum(SVview)
with
(mn, mx) = extrema(SVview)
which only traverses the vector once, instead of twice. Unfortunately, there is a performance issue with extrema, so it is currently not as fast as separate maximum/minimum calls: https://github.com/JuliaLang/julia/issues/31442
Finally, for absolutely getting the best performance at the cost of brevity, we can avoid creating a view at all and turn the calls to maximum and minimum into a single explicit loop traversal:
function h_score_2(ar, SV)
score = zeros(eltype(SV), length(ar))
#inbounds for i in eachindex(ar)
nodes = ar[i]
for j in axes(SV, 2)
mx, mn = -Inf, +Inf
for node in nodes
x = SV[node, j]
mx = ifelse(x > mx, x, mx)
mn = ifelse(x < mn, x, mn)
end
score[i] += (mx - mn)^2
end
end
return score
end
This also avoids the performance issue that extrema suffers, and looks up the SV element once per node. Although this version is annoying to write, it's substantially faster, even on Julia 1.5 where views are free. Here are some benchmark timings with your test data:
julia> using BenchmarkTools
julia> #btime h_score_0($ar, $SV)
2.344 μs (52 allocations: 6.19 KiB)
1×5 Matrix{Float64}:
1.95458 2.94592 2.79438 0.709745 1.85877
julia> #btime h_score_1($ar, $SV)
392.035 ns (1 allocation: 128 bytes)
5-element Vector{Float64}:
1.9545848011260765
2.9459235098820167
2.794383144368953
0.7097448590904598
1.8587691646610984
julia> #btime h_score_2($ar, $SV)
118.243 ns (1 allocation: 128 bytes)
5-element Vector{Float64}:
1.9545848011260765
2.9459235098820167
2.794383144368953
0.7097448590904598
1.8587691646610984
So explicitly writing out the innermost loop is worth it here, reducing time by another 3x or so. It's annoying that the Julia compiler isn't yet able to generate code this efficient, but it does get smarter with every version. On the other hand, the explicit loop version will be fast forever, so if this code is really performance critical, it's probably worth writing it out like this.

Related

Fast summation of symmetric matrix in Julia

I want to sum all elements in a matrix A with dimension n times n. The matrix is symmetric and has 0s on the diagonal. The fastest way to do so that I have found is simply
sum(A). However this seems wasteful since it doesn't use the fact that I only need to calculate the lower triangle of the matrix. However, sum(tril(A, -1)) is significantly slower, and sum(A[i, j] for i = 1:n-1 for j = i+1:n) even more so. Is there a more efficient way to sum the matrix?
Edit: The solution by #AboAmmar performs well. Here is code (with summing the diagonal separately, something that can be removed if there is only zeros on the diagonal) to compare:
using BenchmarkTools
using LinearAlgebra
function sum_triu(A)
m, n = size(A)
#assert m == n
s = zero(eltype(A))
for j = 2:n
#simd for i = 1:j-1
s += #inbounds A[i,j]
end
end
s *= 2
for i = 1:n
s += A[i, i]
end
return s
end
N = 1000
A = Symmetric(rand(0:9,N,N))
A -= diagm(diag(A))
#btime sum(A)
#btime 2 * sum(tril(A))
#btime sum_triu(A)
This is 2.7X faster than sum for n = 1000 matrix. Make sure to add a #simd before the loop and use #inbounds. Also, use the correct loop order for fast memory access.
function sum_triu(A)
m, n = size(A)
#assert m == n
s = zero(eltype(A))
for j = 1:n
#simd for i = 1:j
s += #inbounds A[i,j]
end
end
return 2 * s
end
Example run on my PC:
sum_triu(A) = 499268.7328022966
sum(A) = 499268.73280229873
93.000 μs (0 allocations: 0 bytes)
249.900 μs (0 allocations: 0 bytes)
How about
2 * sum(LowerTriangular(A))
help?> LA.LowerTriangular
LowerTriangular(A::AbstractMatrix)
Construct a LowerTriangular view of the matrix A.
tril creates a new matrix, which allocates memory. Since a LowerTriangular is a view into the existing matrix, there's no memory allocation.

Parallelizing code for solving simultaneous ODEs (DifferentialEquations.jl) - Julia

I have the following coupled system of ODEs (that come from discretizing an integrodifferential PDE):
The xi's are points on an x-grid that I control. I can solve this with the following simple piece of code:
using DifferentialEquations
function ode_syst(du,u,p, t)
N = Int64(p[1])
beta= p[2]
deltax = 1/(N+1)
xs = [deltax*i for i in 1:N]
for j in 1:N
du[j] = -xs[j]^(beta)*u[j]+deltax*sum([u[i]*xs[i]^(beta) for i in 1:N])
end
end
N = 1000
u0 = ones(N)
beta = 2.0
p = [N, beta]
tspan = (0.0, 10^3);
prob = ODEProblem(ode_syst,u0,tspan,p);
sol = solve(prob);
However, as I make my grid finer, i.e. increase N, the computation time grows rapidly (I guess the scaling is quadratic in N). Is there any suggestion on how to implement this using either distributed parallelism or multithreading?
Additional information: I attach the profiling diagram that might be useful to understand where the program spends most of the time
I've looked into your code and found a few problems such as an accidentally introduced O(N^2) behavior due to recalculating the sum term.
My improved version uses the Tullio package to get further speed up from vectorization. Tullio also has tuneable parameters which would allow for multi threading if your system becomes big enough. See here what parameters you can tune in the options section. You might also see GPU support there, i didn't test that but it might yield further speed up or break hooribly. I also choose get the length from the acutal array which should make the use more economical and less error prone.
using Tullio
function ode_syst_t(du,u,p, t)
N = length(du)
beta = p[1]
deltax = 1/(N+1)
#tullio s := deltax*(u[i]*(i*deltax)^(beta))
#tullio du[j] = -(j*deltax)^(beta)*u[j] + s
return nothing
end
Your code:
#btime sol = solve(prob);
80.592 s (1349001 allocations: 10.22 GiB)
My code:
prob2 = ODEProblem(ode_syst_t,u0,tspan,[2.0]);
#btime sol2 = solve(prob2);
1.171 s (696 allocations: 18.50 MiB)
and the result more or less agree:
julia> sum(abs2, sol2(1000.0) .- sol(1000.0))
1.079046922815598e-14
Lutz Lehmanns solution i also benchmarked:
prob3 = ODEProblem(ode_syst_lehm,u0,tspan,p);
#btime sol3 = solve(prob3);
1.338 s (3348 allocations: 39.38 MiB)
However as we scale N to 1000000 with a tspan of (0.0, 10.0)
prob2 = ODEProblem(ode_syst_t,u0,tspan,[2.0]);
#time solve(prob2);
2.429239 seconds (280 allocations: 869.768 MiB, 13.60% gc time)
prob3 = ODEProblem(ode_syst_lehm,u0,tspan,p);
#time solve(prob3);
5.961889 seconds (580 allocations: 1.967 GiB, 11.08% gc time)
My code becomes more than twice as fast due to using the 2 cores in my old and rusty machine.
Analyze the formula. Obviously, the atomic terms repeat. So they should only be computed once.
function ode_syst(du,u,p, t)
N = Int64(p[1])
beta= p[2]
deltax = 1/(N+1)
xs = [deltax*i for i in 1:N]
term = [ xs[i]^(beta)*u[i] for i in 1:N]
term_sum = deltax*sum(term)
for j in 1:N
du[j] = -term[j]+term_sum
end
end
This should only linearly increase in N.

Julia: Searching for a column in a sorted matrix

I have a matrix that is sorted like the one shown below
1 1 2 2 3
1 2 3 4 1
2 1 2 1 1
It's a bit hard for me to describe the ordering, but hopefully it's clear from the example. The rough idea is that we first sort on the first row, then the second, etc.
I would like to find a specific column in the matrix, and that column may or may not exist in it.
I tried the following code:
index = searchsortedfirst(1:total_cols, col, lt=(index,x) -> (matrix[: index] < x))
The above code works, but it is slow. I profiled the code, and it spends a lot of time in "_get_index". I then tried the following
#views index = searchsortedfirst(1:total_cols, col, lt=(index,x) -> (matrix[: index] < x))
As expected this helped a lot, likely due to the slices I'm taking. However, is there a better way to go about this? There still seems to be a lot of overhead, and I feel like there might be a cleaner way to write this, which would be easier to optimize.
However, I absolutely value speed over clarity.
Here is some code I wrote to compare binary vs. linear search.
using Profile
function test_search()
max_val = 20
rows = 4
matrix = rand(1:max_val, rows, 10^5)
matrix = Array{Int64,2}(sortslices(matrix, dims=2))
indices = #time #profile lin_search(matrix, rows, max_val, 10^3)
indices = #time #profile bin_search(matrix, rows, max_val, 10^3)
end
function bin_search(matrix, rows, max_val, repeats)
indices = zeros(repeats)
x = zeros(Int64, rows)
cols = size(matrix)[2]
for i = 1:repeats
x = rand(1:max_val, rows)
#inbounds #views index = searchsortedfirst(1:cols, x, lt=(index,x)->(matrix[:,index] < x))
indices[i] = index
end
return indices
end
function array_eq(matrix, index, y, rows)
for i=1:rows
#inbounds if view(matrix, i, index) != y[i]
return false
end
end
return true
end
function lin_search(matrix, rows, max_val, repeats)
indices = zeros(repeats)
x = zeros(Int64, rows)
cols = size(matrix)[2]
for i = 1:repeats
index = cols + 1
x = rand(1:max_val, rows)
for j=1:cols
if array_eq(matrix, j, x, rows)
index = j;
break
end
end
indices[i] = index
end
return indices
end
Profile.clear()
test_search()
Here is some sample output
0.041356 seconds (68.90 k allocations: 3.431 MiB)
0.070224 seconds (110.45 k allocations: 5.418 MiB)
After adding some more #inbounds, it looks like a linear search is faster than binary. Seems strange when there are 10^5 columns.
If speed is most important, why not simply use the fact that Julia allows you to write fast loops?
julia> function findcol(M, col)
#inbounds #views for c in axes(M, 2)
M[:,c] == col && return c
end
return nothing
end
findcol (generic function with 1 method)
julia> col = [2,3,2];
julia> M = [1 1 2 2 3;
1 2 3 4 1;
2 1 2 1 1];
julia> #btime findcol($M, $col)
32.854 ns (3 allocations: 144 bytes)
3
This should probably be fast enough and does not even take into account any ordering.
I discovered two issues, that when fixed result in both linear and binary searches being much faster. And the binary search becomes faster than linear.
First, there was some type instability. I changed on one of the lines to
matrix::Array{Int64,2} = Array{Int64,2}(sortslices(matrix, dims=2))
This resulted in an order of magnitude speedup. Also it turns out that using #views does not do anything in the following code
#inbounds #views index = searchsortedfirst(1:cols, x, lt=(index,x)->(matrix[:,index] < x))
I am new to Julia, but my hunch is that since matrix[:,index] is copied no matter what in the anonymous function. This would make sense, since it allows for closures.
If I write a separate non-anonymous function, then that copy goes away. Linear search didn't copy the slices, so this also really sped up the binary search.

Julia significantly slower with #parallel

I have this code(primitive heat transfer):
function heat(first, second, m)
#sync #parallel for d = 2:m - 1
for c = 2:m - 1
#inbounds second[c,d] = (first[c,d] + first[c+1, d] + first[c-1, d] + first[c, d+1] + first[c, d-1]) / 5.0;
end
end
end
m = parse(Int,ARGS[1]) #size of matrix
firstm = SharedArray(Float64, (m,m))
secondm = SharedArray(Float64, (m,m))
for c = 1:m
for d = 1:m
if c == m || d == 1
firstm[c,d] = 100.0
secondm[c,d] = 100.0
else
firstm[c,d] = 0.0
secondm[c,d] = 0.0
end
end
end
#time for i = 0:opak
heat(firstm, secondm, m)
firstm, secondm = secondm, firstm
end
This code give good times when run sequentially, but when I add #parallel it slow down even if I run on one thread. I just need explanation why this is happening? Code only if it doesn't change algorithm of heat function.
Have a look at http://docs.julialang.org/en/release-0.4/manual/performance-tips/ . Contrary to advised, you make use of global variables a lot. They are considered to change types anytime so they have to be boxed and unboxed everytime they are referenced. This question also Julia pi aproximation slow suffers from the same. In order to make your function faster, have global variables as input arguments to the function.
There are some points to consider. One of them is the size of m. If it is small, parallelism would give much overhead for not a big gain:
julia 36967257.jl 4
# Parallel:
0.040434 seconds (4.44 k allocations: 241.606 KB)
# Normal:
0.042141 seconds (29.13 k allocations: 1.308 MB)
For bigger m you could have better results:
julia 36967257.jl 4000
# Parallel:
0.054848 seconds (4.46 k allocations: 241.935 KB)
# Normal:
3.779843 seconds (29.13 k allocations: 1.308 MB)
Plus two remarks:
1/ initialisation could be simplified to:
for c = 1:m, d = 1:m
if c == m || d == 1
firstm[c,d] = 100.0
secondm[c,d] = 100.0
else
firstm[c,d] = 0.0
secondm[c,d] = 0.0
end
end
2/ your finite difference schema does not look stable. Please take a look at Linear multistep method or ADI/Crank Nicolson.

Updating a dense vector by a sparse vector in Julia is slow

I am using Julia version 0.4.5 and I am experiencing the following issue:
As far as I know, taking inner product between a sparse vector and a dense vector should be as fast as updating the dense vector by a sparse vector. The latter one is much slower.
A = sprand(100000,100000,0.01)
w = rand(100000)
#time for i=1:100000
w += A[:,i]
end
26.304380 seconds (1.30 M allocations: 150.556 GB, 8.16% gc time)
#time for i=1:100000
A[:,i]'*w
end
0.815443 seconds (921.91 k allocations: 1.540 GB, 5.58% gc time)
I created a simple sparse matrix type of my own, and the addition code was ~ the same as the inner product.
Am I doing something wrong? I feel like there should be a special function doing the operation w += A[:,i], but I couldn't find it.
Any help is appreciated.
I asked the same question on GitHub and we came to the following conclusion. The type SparseVector was added as of Julia 0.4 and with it the BLAS function LinAlg.axpy!, which updates in-place a (possibly dense) vector x by a sparse vector y multiplied by a scalar a, i.e. performs x += a*y efficiently. However, in Julia 0.4 it is not implemented properly. It works only in Julia 0.5
#time for i=1:100000
LinAlg.axpy!(1,A[:,i],w)
end
1.041587 seconds (799.49 k allocations: 1.530 GB, 8.01% gc time)
However, this code is still sub-optimal, as it creates the SparseVector A[:,i]. One can get an even faster version with the following function:
function upd!(w,A,i,c)
rowval = A.rowval
nzval = A.nzval
#inbounds for j = nzrange(A,i)
w[rowval[j]] += c* nzval[j]
end
return w
end
#time for i=1:100000
upd!(w,A,i,1)
end
0.500323 seconds (99.49 k allocations: 1.518 MB)
This is exactly what I needed to achieve, after some research we managed to get there, thanks everyone!
Assuming you want to compute w += c * A[:, i], there is an easy way to vectorize it:
>>> A = sprand(100000, 100000, 0.01)
>>> c = rand(100000)
>>> r1 = zeros(100000)
>>> #time for i = 1:100000
>>> r1 += A[:, i] * c[i]
>>> end
29.997412 seconds (1.90 M allocations: 152.077 GB, 12.73% gc time)
>>> #time r2 = sum(A .* c', 2);
1.191850 seconds (50 allocations: 1.493 GB, 0.14% gc time)
>>> all(r1 == r2)
true
First, create a vector c of the constants to multiply with. Then multiplay de columns of A element-wise by the values of c (A .* c', it does broadcasting inside). Last, reduce over the columns of A (the part sum(.., 2)).

Resources