Parallelizing code for solving simultaneous ODEs (DifferentialEquations.jl) - Julia

I have the following coupled system of ODEs (that comes from discretizing an integrodifferential PDE):

du_j/dt = -x_j^beta * u_j + deltax * sum_{i=1}^{N} x_i^beta * u_i,   with x_i = i*deltax and deltax = 1/(N+1)

The x_i's are points on an x-grid that I control. I can solve this with the following simple piece of code:
using DifferentialEquations
function ode_syst(du, u, p, t)
    N = Int64(p[1])
    beta = p[2]
    deltax = 1/(N+1)
    xs = [deltax*i for i in 1:N]
    for j in 1:N
        du[j] = -xs[j]^beta * u[j] + deltax * sum([u[i]*xs[i]^beta for i in 1:N])
    end
end
N = 1000
u0 = ones(N)
beta = 2.0
p = [N, beta]
tspan = (0.0, 10^3);
prob = ODEProblem(ode_syst,u0,tspan,p);
sol = solve(prob);
However, as I make my grid finer, i.e. increase N, the computation time grows rapidly (I guess the scaling is quadratic in N). Are there any suggestions on how to implement this using either distributed parallelism or multithreading?
Additional information: I attached a profiling diagram (not reproduced here) that might be useful for understanding where the program spends most of its time.

I've looked into your code and found a few problems, such as accidentally introduced O(N^2) behavior due to recalculating the sum term on every iteration.
My improved version uses the Tullio package to get a further speedup from vectorization. Tullio also has tunable parameters which would allow for multithreading if your system becomes big enough; see the options section of its documentation for the parameters you can tune. You might also see GPU support there. I didn't test that, but it might yield a further speedup or break horribly. I also chose to get the length from the actual array, which should make the function more economical and less error-prone.
using Tullio
function ode_syst_t(du, u, p, t)
    N = length(du)
    beta = p[1]
    deltax = 1/(N+1)
    @tullio s := deltax * (u[i] * (i*deltax)^beta)
    @tullio du[j] = -(j*deltax)^beta * u[j] + s
    return nothing
end
Your code:
@btime sol = solve(prob);
80.592 s (1349001 allocations: 10.22 GiB)
My code:
prob2 = ODEProblem(ode_syst_t,u0,tspan,[2.0]);
@btime sol2 = solve(prob2);
1.171 s (696 allocations: 18.50 MiB)
and the results more or less agree:
julia> sum(abs2, sol2(1000.0) .- sol(1000.0))
1.079046922815598e-14
I also benchmarked Lutz Lehmann's solution:
prob3 = ODEProblem(ode_syst_lehm,u0,tspan,p);
@btime sol3 = solve(prob3);
1.338 s (3348 allocations: 39.38 MiB)
However, as we scale N to 1000000 with a tspan of (0.0, 10.0):
prob2 = ODEProblem(ode_syst_t,u0,tspan,[2.0]);
@time solve(prob2);
2.429239 seconds (280 allocations: 869.768 MiB, 13.60% gc time)
prob3 = ODEProblem(ode_syst_lehm,u0,tspan,p);
@time solve(prob3);
5.961889 seconds (580 allocations: 1.967 GiB, 11.08% gc time)
My code becomes more than twice as fast, thanks to using the two cores in my old and rusty machine.

Analyze the formula. Obviously, the atomic terms repeat. So they should only be computed once.
function ode_syst(du, u, p, t)
    N = Int64(p[1])
    beta = p[2]
    deltax = 1/(N+1)
    xs = [deltax*i for i in 1:N]
    term = [xs[i]^beta * u[i] for i in 1:N]
    term_sum = deltax * sum(term)
    for j in 1:N
        du[j] = -term[j] + term_sum
    end
end
This should increase only linearly in N.
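(An addition, not part of the original answer: since the question also asks about multithreading, here is a minimal sketch of how this O(N) loop could be threaded with base Julia, assuming Julia is started with multiple threads, e.g. julia -t auto. For a loop this cheap the benefit may be modest.)
using Base.Threads
function ode_syst_threaded(du, u, p, t)
    N = Int64(p[1])
    beta = p[2]
    deltax = 1/(N+1)
    term = [(deltax*i)^beta * u[i] for i in 1:N]
    term_sum = deltax * sum(term)
    @threads for j in 1:N   # the writes to du are independent
        du[j] = -term[j] + term_sum
    end
end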

Related

Compute eigenvalues of complex-hermitian sparsematrix in Julia

I'm working with some roughly 100000x100000 Hermitian complex sparse matrices, with roughly 5% of entries populated, and want to calculate the eigenvalues/eigenvectors.
So far I've been using Arpack.jl's eigs(A).
But this is not working well as soon as I crank the size to higher than 5000.
For the benchmarks I've been using the following code to generate some test matrices:
using Arpack
using SparseArrays
using ProgressMeter
pop = 0.05
n = 3000 # for example
A = spzeros(Complex{Float64}, n, n)
@showprogress for _ in 1:round(Int, pop * n^2)
    A[rand(1:n), rand(1:n)] = rand(Complex{Float64})
end
# make A Hermitian (note: A + conj(A) actually yields 2*real(A);
# A + A' would be the proper Hermitian symmetrization)
A = A + conj(A)
t = @elapsed eigs(A, maxiter=1500) # ends up being ~ 13 seconds
For n ~ 3000 the eigs() call already takes 13 seconds on my machine, and for bigger n it doesn't finish in any 'reasonable' time or outright quits.
Is there a specialized package/method for this?
Any help is appreciated
https://github.com/JuliaLinearAlgebra/ArnoldiMethod.jl seems to be what you want:
julia> let pop = 0.05, n = 3000
           A = sprand(Complex{Float64}, n, n, 0.05)
           A = A + conj(A)
           @time eigs(A; maxiter=1500)
           @time decomp, history = partialschur(A, nev=10, tol=1e-6, which=LM());
       end;
10.521786 seconds (50.73 k allocations: 3.458 MiB, 0.04% gc time)
2.244129 seconds (19 allocations: 1.892 MiB)
sanity check:
julia> a, (b, c) = let pop = 0.05, n = 300
           A = sprand(Complex{Float64}, n, n, 0.05)
           A = A + conj(A)
           eigs(A; maxiter=2500), partialschur(A, nev=6, tol=1e-6, which=LM());
       end;
julia> a[1]
6-element Vector{ComplexF64}:
14.5707071003175 + 8.218901803015509e-16im
4.493079744504954 - 0.8390429567118733im
4.493079744504933 + 0.8390429567118641im
-0.3415176925293196 + 4.254184281244591im
-0.3415176925293088 - 4.25418428124452im
0.49406553681541177 - 4.229680489599233im
julia> b
PartialSchur decomposition (ComplexF64) of dimension 6
eigenvalues:
6-element Vector{ComplexF64}:
14.570707100307926 + 7.10698633463049e-12im
4.493079906269516 + 0.8390429076809746im
4.493079701528448 - 0.8390430155670777im
-0.3415174262177961 + 4.254183175902487im
-0.34151626930774975 - 4.25418321627979im
0.49406543866702 + 4.229680079205066im
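(An addition, not in the original answer: if you also need the eigenvectors, ArnoldiMethod.jl can turn the partial Schur decomposition into an eigendecomposition with partialeigen.)
using ArnoldiMethod, SparseArrays
A = sprand(Complex{Float64}, 3000, 3000, 0.05)
A = A + conj(A)
decomp, history = partialschur(A, nev=10, tol=1e-6, which=LM())
vals, vecs = partialeigen(decomp)  # eigenvalues and the corresponding eigenvectors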

Improving the speed of a for loop in Julia

Here is my code on the Julia platform, and I would like to speed it up. Is there any way to make this faster? It takes 0.5 seconds for a dataset of 50k*50k. I was expecting Julia to be a lot faster than this, or else I am doing a silly implementation.
ar = [[1,2,3,4,5], [2,3,4,5,6,7,8], [4,7,8,9], [9,10], [2,3,4,5]]
SV = rand(10,5)
function h_score_0(ar, SV)
    m = length(ar)
    SC = Array{Float64,2}(undef, size(SV, 2), m)
    for iter = 1:m
        nodes = ar[iter]
        for jj = 1:size(SV, 2)
            mx = maximum(SV[nodes, jj])
            mn = minimum(SV[nodes, jj])
            term1 = (mx - mn)^2
            SC[jj, iter] = term1
        end
    end
    return score = sum(SC, dims = 1)
end
You have some unnecessary allocations in your code:
mx = maximum(SV[nodes, jj])
mn = minimum(SV[nodes, jj])
Slices allocate, so each of these lines makes a copy of the data; you are actually copying the data twice, once on each line. You can either make sure to copy only once, or even better: use view, so there is no copy at all (note that view is much faster on Julia v1.5, in case you are using an older version).
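(A micro-illustration, not in the original answer, with hypothetical sizes and indices:)
SV = rand(10, 5); nodes = [1, 3, 5]
maximum(SV[nodes, 2])        # the slice allocates a copy before taking the maximum
maximum(view(SV, nodes, 2))  # the view reads the original data directly, no copy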
SC = Array{Float64,2}(undef, size(SV, 2), m)
There is also no reason to create a matrix here and sum over it afterwards; just accumulate while you are iterating:
score[i] += (mx - mn)^2
Here's a function that is >5x as fast on my laptop for the input data you specified:
function h_score_1(ar, SV)
    score = zeros(eltype(SV), length(ar))
    @inbounds for i in eachindex(ar)
        nodes = ar[i]
        for j in axes(SV, 2)
            SVview = view(SV, nodes, j)
            mx = maximum(SVview)
            mn = minimum(SVview)
            score[i] += (mx - mn)^2
        end
    end
    return score
end
This function outputs a one-dimensional vector instead of the 1xN matrix returned by your original function.
In principle, this could be even faster if we replace
mx = maximum(SVview)
mn = minimum(SVview)
with
(mn, mx) = extrema(SVview)
which only traverses the vector once, instead of twice. Unfortunately, there is a performance issue with extrema, so it is currently not as fast as separate maximum/minimum calls: https://github.com/JuliaLang/julia/issues/31442
Finally, for absolutely getting the best performance at the cost of brevity, we can avoid creating a view at all and turn the calls to maximum and minimum into a single explicit loop traversal:
function h_score_2(ar, SV)
    score = zeros(eltype(SV), length(ar))
    @inbounds for i in eachindex(ar)
        nodes = ar[i]
        for j in axes(SV, 2)
            mx, mn = -Inf, +Inf
            for node in nodes
                x = SV[node, j]
                mx = ifelse(x > mx, x, mx)
                mn = ifelse(x < mn, x, mn)
            end
            score[i] += (mx - mn)^2
        end
    end
    return score
end
This also avoids the performance issue that extrema suffers from, and looks up each SV element only once per node. Although this version is annoying to write, it's substantially faster, even on Julia 1.5 where views are free. Here are some benchmark timings with your test data:
julia> using BenchmarkTools
julia> @btime h_score_0($ar, $SV)
2.344 μs (52 allocations: 6.19 KiB)
1×5 Matrix{Float64}:
1.95458 2.94592 2.79438 0.709745 1.85877
julia> @btime h_score_1($ar, $SV)
392.035 ns (1 allocation: 128 bytes)
5-element Vector{Float64}:
1.9545848011260765
2.9459235098820167
2.794383144368953
0.7097448590904598
1.8587691646610984
julia> @btime h_score_2($ar, $SV)
118.243 ns (1 allocation: 128 bytes)
5-element Vector{Float64}:
1.9545848011260765
2.9459235098820167
2.794383144368953
0.7097448590904598
1.8587691646610984
So explicitly writing out the innermost loop is worth it here, reducing time by another 3x or so. It's annoying that the Julia compiler isn't yet able to generate code this efficient, but it does get smarter with every version. On the other hand, the explicit loop version will be fast forever, so if this code is really performance critical, it's probably worth writing it out like this.

Julia vs Mathematica: Numerical integration performance

Julia absolute newcomer here with a question for you.
I'm teaching myself some Julia by porting some of my Mathematica and Python code (mostly scientific computations in physics etc.), and seeing what's what.
Until now things have been pretty smooth. And fast. Until now.
Now, I'm simulating an elementary lock-in amplifier, which, in essence, takes a - possibly very complicated - time-dependent signal, Uin(t), and produces an output, Uout(t), phase-locked at some reference frequency fref (that is, it highlights the component of Uin(t) which has a certain phase relationship with a reference sine wave). Little does the description matter; what matters is that it basically does that by calculating the integral (I'm actually omitting the phase here for clarity):

Uout(t) = (1/T) * integral from t-T to t of Uin(s) * sin(2*pi*fref*s) ds
So, I set out and tested this in Mathematica and Julia:
I define a mockup Uin(t), pass some parameter values, and then build an array of Uout(t), at time t = 0, for a range of fref.
Julia: I used the QuadGK package for numerical integration.
using QuadGK
T = 0.1
f = 100.
Uin(t) = sin(2pi * f * t) + sin(2pi * 2f * t)
Uout(t, fref)::Float64 = quadgk(s -> Uin(s) * sin(2pi * fref * s), t-T, t, rtol=1E-3)[1]/T
frng = 80.:1.:120.
print(@time broadcast(Uout, 0., frng))
Mathematica
T = 0.1;
f = 100.;
Uin[t_] := Sin[2 π f t] + Sin[2 π 2 f t]
Uout[t_, fref_] := NIntegrate[Sin[2 π fref s] Uin[s], {s, t - T, t}]/T
frng = Table[i, {i, 80, 120, 1}];
Timing[Table[Uout[0, fr], {fr, frng}]]
Results:
Julia timed the operation anywhere between 45 and 55 seconds, on an i7-5xxx laptop on battery power, which is a lot, while Mathematica did it in ~2 seconds. The difference is abysmal and, honestly, hard to believe. I know Mathematica has some pretty sweet and refined algorithms in its kernel, but Julia is Julia. So, question is: what gives?
P.S.: setting f and T as const does reduce Julia's time to ~8-10 seconds, but f and T cannot be consts in the actual program. Other than that, is there something obvious I'm missing?
EDIT Feb 2, 2020:
The slowdown seems to be due to the algorithm trying to hunt down precision when the value is near zero, e.g. see below: for fref = 95 the calculation takes a whole second(!), while for adjacent frequency values it computes instantly (the returned result is a tuple of (res, error); it seems the quadgk function stalls at very small values):
0.000124 seconds (320 allocations: 7.047 KiB)
fref = 94.0 (-0.08637214864144352, 9.21712218998258e-6)
1.016830 seconds (6.67 M allocations: 139.071 MiB, 14.35% gc time)
fref = 95.0 (-6.088184966010742e-16, 1.046186419361636e-16)
0.000124 seconds (280 allocations: 6.297 KiB)
fref = 96.0 (0.1254003757465191, 0.00010132083518769636)
Notes: this happens regardless of what tolerance I ask for. Also, Mathematica generally hits machine-precision tolerances by default, while somewhat slowing down at near-zeros, and numpy/scipy just fly through the whole thing but produce less precise results than Mathematica (at default settings; I didn't look much into this).
Your problem is related to the choice of error tolerance. Relative error of 1e-3 doesn't sound so bad, but it actually is when the integral is close to zero. In particular, this happens when fref = 80.0 (and 85, 90, 95, not 100, 105, etc.):
julia> Uout(0.0, 80.0, f, T)
1.2104987553880609e-16
To quote from the docstring of quadgk:
(Note that it is useful to specify a positive atol in cases where
norm(I) may be zero.)
Let's try to set an absolute tolerance, for example 1e-6, and compare. First the code (using the code from @ARamirez):
using QuadGK
Uin(t, f) = sin(2π * f * t) + sin(4π * f * t)
function Uout(t, fref, f, T)
    quadgk(s -> Uin(s, f) * sin(2π * fref * s), t-T, t, rtol=1e-3)[1]/T
end
function Uout_new(t, fref, f, T) # with atol
    quadgk(s -> Uin(s, f) * sin(2π * fref * s), t-T, t, rtol=1e-3, atol=1e-6)[1]/T
end
Then the benchmarking (use BenchmarkTools for this)
using BenchmarkTools
T = 0.1
f = 100.0
freqs = 80.0:1.0:120.0
@btime Uout.(0.0, $freqs, $f, $T);
6.302 s (53344283 allocations: 1.09 GiB)
@btime Uout_new.(0.0, $freqs, $f, $T);
1.270 ms (11725 allocations: 262.08 KiB)
OK, that's 5000 times faster. Is that ok?
The first problem I see with your code is that it is type-unstable. This is because you are using global variables (see Performance Tip number one in the Julia Performance Tips):
The compiler cannot know the types of f and T, which you are using inside your functions, so it cannot compile them efficiently. This is also why the performance improves when you mark them as const: the compiler now has the guarantee that they will not change their type, so it can compile your two functions efficiently.
How to see that your code is unstable
If you run your first function with the macro @code_warntype like this:
@code_warntype Uin(0.1)
You will see an output like this:
julia> #code_warntype Uin(0.1)
Variables
#self#::Core.Compiler.Const(Uin, false)
t::Float64
Body::Any
1 ─ %1 = (2.0 * Main.pi)::Core.Compiler.Const(6.283185307179586, false)
│ %2 = (%1 * Main.f * t)::Any
│ %3 = Main.sin(%2)::Any
│ %4 = (2.0 * Main.pi)::Core.Compiler.Const(6.283185307179586, false)
│ %5 = (2.0 * Main.f)::Any
│ %6 = (%4 * %5 * t)::Any
│ %7 = Main.sin(%6)::Any
│ %8 = (%3 + %7)::Any
└── return %8
All those Anys tell you that the compiler doesn't know the type of the output at any step.
How to fix
You can redefine your functions to take in f and T as variables:
Uin(t, f) = sin(2.0pi * f * t) + sin(2.0pi * 2.0f * t)
Uout(t, fref, f, T)::Float64 = quadgk(s -> Uin(s, f) * sin(2pi * fref * s), t-T, t, rtol=1E-3)[1]/T
With these redefinitions, your code runs much faster. If you check them with @code_warntype, you will see that the compiler now correctly infers the type of everything.
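(An alternative sketch, not from the original answer: instead of threading f and T through every call, you can build a closure inside a function; values captured from function arguments are concretely typed, so the result is also type-stable.)
using QuadGK
function make_uout(f, T)
    Uin(t) = sin(2pi * f * t) + sin(2pi * 2f * t)
    return (t, fref) -> quadgk(s -> Uin(s) * sin(2pi * fref * s), t - T, t, rtol=1e-3)[1] / T
end
uout = make_uout(100.0, 0.1)  # f and T are now fixed, concretely typed captures
uout(0.0, 80.0)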
For further performance improvements, you can check out the Julia Performance Tips
In particular, the generally advised way to measure performance, instead of using @time, is @btime from the BenchmarkTools package. This is because @time also measures compilation time (another option is to run @time twice - the second measurement will be correct, since all functions have had a chance to compile).
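(A minimal usage sketch: interpolate globals with $ so that @btime measures the function itself rather than the global-variable overhead.)
using BenchmarkTools
@btime Uout(0.0, 80.0, $f, $T)  # $f and $T interpolate the globals into the benchmark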
There are various things you can do to speed it up further. Changing the order of the integration helped a bit, using Float32 instead of Float64 made a small improvement, and using @fastmath made a further small improvement. One can also make use of SLEEFPirates.sin_fast:
using QuadGK, ChangePrecision
@changeprecision Float32 begin
    T = 0.1
    f = 100.
    @inline @fastmath Uin(t, f) = sin(2pi * f * t) + sin(2pi * 2f * t)
    @fastmath Uout(t, fref, f, T) = first(quadgk(s -> Uin(s, f) * sin(2pi * fref * s), t-T, t, rtol=1e-2, order=10))/T
    frng = 80.:1.:120.
    @time Uout.(0., frng, f, T)
end

Julia significantly slower with #parallel

I have this code (primitive heat transfer):
function heat(first, second, m)
    @sync @parallel for d = 2:m-1
        for c = 2:m-1
            @inbounds second[c,d] = (first[c,d] + first[c+1,d] + first[c-1,d] + first[c,d+1] + first[c,d-1]) / 5.0
        end
    end
end
m = parse(Int, ARGS[1]) # size of matrix
firstm = SharedArray(Float64, (m,m))
secondm = SharedArray(Float64, (m,m))
for c = 1:m
    for d = 1:m
        if c == m || d == 1
            firstm[c,d] = 100.0
            secondm[c,d] = 100.0
        else
            firstm[c,d] = 0.0
            secondm[c,d] = 0.0
        end
    end
end
@time for i = 0:opak # opak (the number of iterations) is defined elsewhere in the program
    heat(firstm, secondm, m)
    firstm, secondm = secondm, firstm
end
This code gives good times when run sequentially, but when I add @parallel it slows down, even if I run on one thread. I just need an explanation of why this is happening. Change the code only if it doesn't alter the algorithm of the heat function.
Have a look at http://docs.julialang.org/en/release-0.4/manual/performance-tips/ . Contrary to what is advised there, you make heavy use of global variables. They are assumed to possibly change type at any time, so they have to be boxed and unboxed every time they are referenced. The question Julia pi approximation slow suffers from the same problem. To make your function faster, pass the global variables as input arguments to the function.
There are some points to consider. One of them is the size of m. If it is small, parallelism adds a lot of overhead for not much gain:
julia 36967257.jl 4
# Parallel:
0.040434 seconds (4.44 k allocations: 241.606 KB)
# Normal:
0.042141 seconds (29.13 k allocations: 1.308 MB)
For bigger m you could have better results:
julia 36967257.jl 4000
# Parallel:
0.054848 seconds (4.46 k allocations: 241.935 KB)
# Normal:
3.779843 seconds (29.13 k allocations: 1.308 MB)
Plus two remarks:
1/ initialisation could be simplified to:
for c = 1:m, d = 1:m
    if c == m || d == 1
        firstm[c,d] = 100.0
        secondm[c,d] = 100.0
    else
        firstm[c,d] = 0.0
        secondm[c,d] = 0.0
    end
end
2/ your finite difference scheme does not look stable. Please take a look at linear multistep methods or ADI/Crank-Nicolson.
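(An addition, not from the original answers: on Julia 1.x the distributed @parallel macro no longer exists. A minimal shared-memory sketch of the same kernel uses Threads.@threads on plain Arrays, assuming Julia is started with multiple threads, e.g. julia -t 4.)
using Base.Threads
function heat_threaded!(second, first, m)
    @threads for d in 2:m-1   # columns are independent, so thread over d
        for c in 2:m-1
            @inbounds second[c,d] = (first[c,d] + first[c+1,d] + first[c-1,d] +
                                     first[c,d+1] + first[c,d-1]) / 5.0
        end
    end
end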

Updating a dense vector by a sparse vector in Julia is slow

I am using Julia version 0.4.5 and I am experiencing the following issue:
As far as I know, taking the inner product of a sparse vector with a dense vector should be about as fast as updating the dense vector by a sparse vector. The latter is much slower:
A = sprand(100000,100000,0.01)
w = rand(100000)
@time for i=1:100000
    w += A[:,i]
end
26.304380 seconds (1.30 M allocations: 150.556 GB, 8.16% gc time)
@time for i=1:100000
    A[:,i]'*w
end
0.815443 seconds (921.91 k allocations: 1.540 GB, 5.58% gc time)
I created a simple sparse matrix type of my own, and the addition code was about the same speed as the inner product.
Am I doing something wrong? I feel like there should be a special function for the operation w += A[:,i], but I couldn't find it.
Any help is appreciated.
I asked the same question on GitHub and we came to the following conclusion. The type SparseVector was added as of Julia 0.4 and, with it, the BLAS function LinAlg.axpy!, which updates a (possibly dense) vector x in place by a sparse vector y multiplied by a scalar a, i.e. performs x += a*y efficiently. However, in Julia 0.4 it is not implemented properly; it works only in Julia 0.5:
@time for i=1:100000
    LinAlg.axpy!(1, A[:,i], w)
end
1.041587 seconds (799.49 k allocations: 1.530 GB, 8.01% gc time)
However, this code is still suboptimal, as it creates the temporary SparseVector A[:,i]. One can get an even faster version with the following function:
function upd!(w, A, i, c)
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for j = nzrange(A, i)
        w[rowval[j]] += c * nzval[j]
    end
    return w
end
@time for i=1:100000
    upd!(w, A, i, 1)
end
0.500323 seconds (99.49 k allocations: 1.518 MB)
This is exactly what I needed to achieve; after some research we managed to get there. Thanks everyone!
Assuming you want to compute w += c * A[:, i], there is an easy way to vectorize it:
>>> A = sprand(100000, 100000, 0.01)
>>> c = rand(100000)
>>> r1 = zeros(100000)
>>> @time for i = 1:100000
>>> r1 += A[:, i] * c[i]
>>> end
29.997412 seconds (1.90 M allocations: 152.077 GB, 12.73% gc time)
>>> @time r2 = sum(A .* c', 2);
1.191850 seconds (50 allocations: 1.493 GB, 0.14% gc time)
>>> all(r1 == r2)
true
First, create a vector c of the constants to multiply by. Then multiply the columns of A element-wise by the values of c (A .* c' broadcasts internally). Last, reduce over the columns of A (the sum(..., 2) part).
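(A side note, not in the original answer: the quantity being accumulated, the sum over i of c[i] * A[:, i], is exactly the matrix-vector product A*c, so on current Julia the whole computation collapses to a single line.)
using SparseArrays
A = sprand(100_000, 100_000, 0.01)
c = rand(100_000)
r3 = A * c   # same result as summing c[i] * A[:, i] over all columns i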
