Converting SuiteSparse.SPQR.QRSparseQ to SparseMatrixCSC? - julia

I have the problem that converting the native sparse format of the QR decomposition of a sparse matrix takes forever. However, I need it in CSC format for further computations.
using LinearAlgebra, SparseArrays
N = 1000
A = sprand(N,N,1e-4)
@time F = qr(A)
@time F.Q
@time Q_sparse = sparse(F.Q)
0.000420 seconds (1.15 k allocations: 241.017 KiB)
0.000008 seconds (6 allocations: 208 bytes)
6.067351 seconds (2.00 M allocations: 15.140 GiB, 36.25% gc time)
Any suggestions?

Okay, I found the problem. For other people trying to do it:
factors = F.Q.factors
τ = F.Q.τ
Nτ = size(factors, 2)
Isp = sparse(I(N));
@time Q_constr = prod(Isp - factors[:,i]*τ[i]*factors[:,i]' for i in 1:Nτ) # Q as a product of Householder reflectors
Q_constr ≈ Q_sparse
0.084461 seconds (62.64 k allocations: 3.321 MiB, 18.28% gc time)
true
You see that sparse(F.Q) somehow uses a very inefficient representation internally. If you construct Q from the stored Householder reflectors as above, it is considerably faster.
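A side note: if Q is only ever applied to vectors or matrices, it may not need to be materialized at all, since the factorization object can apply the stored reflectors lazily. A minimal sketch, reusing the names from above:
y = F.Q * ones(N) # applies the reflectors directly; no sparse or dense Q is formed
Materializing Q only pays off when Q itself is reused many times in sparse-matrix algebra.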


Optimize looping over a large string to reduce allocations

I am trying to loop over a string in Julia to parse it. I have a DefaultDict inside a struct, containing the number of times I have seen a particular character.
@with_kw mutable struct Metrics
...
nucleotides = DefaultDict{Char, Int64}(0)
...
end
I have written a function to loop over a string and increment the value of each character in the DefaultDict.
function compute_base_composition(sequence::String, metrics::Metrics)
for i in 1:sizeof(sequence)
metrics.nucleotides[sequence[i]] += 1
end
end
This function is called in a for loop because I need to do this for multiple strings (which can be up to 2 billion characters long). When I run the @time macro, I get this result:
@time compute_base_composition(sequence, metrics)
0.167172 seconds (606.20 k allocations: 15.559 MiB, 78.00% compilation time)
0.099403 seconds (1.63 M allocations: 24.816 MiB)
0.032346 seconds (633.24 k allocations: 9.663 MiB)
0.171382 seconds (3.06 M allocations: 46.751 MiB, 4.64% gc time)
As you can see, there are a lot of memory allocations for such a simple function. I have tried to change the for loop to something like for c in sequence but that didn't change much. Would there be a way to reduce them and make the function faster?
Work on bytes, not on Unicode chars
Use Vectors, not Dicts
Avoid untyped fields in containers
@with_kw struct MetricsB
nucleotides::Vector{Int}=zeros(Int, 256)
end
function compute_base_composition(sequence::String, metrics::MetricsB)
bs = Vector{UInt8}(sequence)
for i in 1:length(bs)
@inbounds metrics.nucleotides[bs[i]] += 1
end
end
And a benchmark, with a nice speedup of 90x:
julia> st = randstring(10_000_000);
julia> @time compute_base_composition(st, Metrics())
1.793991 seconds (19.94 M allocations: 304.213 MiB, 3.33% gc time)
julia> @time compute_base_composition(st, MetricsB())
0.019398 seconds (3 allocations: 9.539 MiB)
Actually you can almost totally avoid allocations with the following code:
function compute_base_composition2(sequence::String, metrics::MetricsB)
pp = pointer(sequence)
for i in 1:sizeof(sequence) # sizeof counts bytes; length counts characters
@inbounds metrics.nucleotides[Base.pointerref(pp, i, 1)] += 1
end
end
and now:
julia> @time compute_base_composition2(st, MetricsB())
0.021161 seconds (1 allocation: 2.125 KiB)
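On Julia 0.7 and later, the same zero-copy byte access is available without raw pointers via codeunits, which returns a byte view of the string. A sketch along the same lines (compute_base_composition3 is a hypothetical name, not from the original answer):
function compute_base_composition3(sequence::String, metrics::MetricsB)
# codeunits gives a zero-allocation AbstractVector{UInt8} view of the string's bytes
for b in codeunits(sequence)
@inbounds metrics.nucleotides[b] += 1
end
end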

Dot syntax causes 200x slowdown with vector addition

I was experimenting with the speed of vector addition and component-wise exponentiation, when I came across a strange result with the dot vectorization syntax.
The non-vectorized version,
julia> @time exp(randn(1000) + randn(1000))
takes about 0.001 seconds after a few runs. It also gives a deprecation warning as of 0.6.
If I vectorize the exponential function,
julia> @time exp.(randn(1000) + randn(1000))
I get a 4x speedup, to around 0.00025 seconds.
However, if I vectorize both the exponential function and addition of the vectors,
julia> @time exp.(randn(1000) .+ randn(1000))
I get a large slowdown to around 0.05 seconds. Why does this occur? When should the dot syntax be avoided to maximize performance?
.+ creates an anonymous function. In the REPL, this function is created every time the expression is entered, which blows up your timing results. In addition, the use of globals (dynamically typed, i.e. uninferrable) slows down all of your examples. In any real case your code will be in a function, and when it's in a function, it's only compiled the first time the function is called. Example:
> x = randn(1000); y = randn(1000);
> @time exp(x + y);
WARNING: exp(x::AbstractArray{T}) where T <: Number is deprecated, use exp.(x) instead.
Stacktrace:
[1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
[2] exp(::Array{Float64,1}) at .\deprecated.jl:57
[3] eval(::Module, ::Any) at .\boot.jl:235
[4] eval_user_input(::Any, ::Base.REPL.REPLBackend) at .\REPL.jl:66
[5] macro expansion at C:\Users\Chris\.julia\v0.6\Revise\src\Revise.jl:775 [inlined]
[6] (::Revise.##17#18{Base.REPL.REPLBackend})() at .\event.jl:73
while loading no file, in expression starting on line 237
0.620712 seconds (290.34 k allocations: 15.150 MiB)
> @time exp(x + y);
0.023072 seconds (27.09 k allocations: 1.417 MiB)
> @time exp(x + y);
0.000334 seconds (95 allocations: 27.938 KiB)
>
> @time exp.(x .+ y);
1.764459 seconds (735.52 k allocations: 39.169 MiB, 0.80% gc time)
> @time exp.(x .+ y);
0.017914 seconds (5.92 k allocations: 328.978 KiB)
> @time exp.(x .+ y);
0.017853 seconds (5.92 k allocations: 328.509 KiB)
>
> f(x,y) = exp.(x .+ y);
> @time f(x,y);
0.022357 seconds (21.59 k allocations: 959.157 KiB)
> @time f(x,y);
0.000020 seconds (5 allocations: 8.094 KiB)
> @time f(x,y);
0.000021 seconds (5 allocations: 8.094 KiB)
Notice that by putting it into a function it compiles and optimizes. This is one of the main things mentioned in the Julia Performance Tips.
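A common way to sidestep both pitfalls when timing small snippets is BenchmarkTools.jl: its @btime macro runs many samples and lets you interpolate globals with $ so they are treated as locals. A small sketch, assuming the package is installed:
using BenchmarkTools
x = randn(1000); y = randn(1000);
@btime exp.($x .+ $y); # $x and $y are interpolated, so the untyped globals don't distort the timing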

How to find the index of the last maximum in julialang?

I have an array that contains repeated nonnegative integers, e.g., A=[5,5,5,0,1,1,0,0,0,3,3,0,0]. I would like to find the position of the last maximum in A. That is, the largest index i such that A[i]>=A[j] for all j. In my example, i=3.
I tried to find the indices of all maximum of A then find the maximum of these indices:
A = [5,5,5,0,1,1,0,0,0,3,3,0,0];
Amax = maximum(A);
i = maximum(find(x -> x == Amax, A));
Is there any better way?
length(A) - indmax(@view A[end:-1:1]) + 1
should be pretty fast, but I didn't benchmark it.
EDIT: I should note that by definition @crstnbr's solution (writing the algorithm from scratch) is faster (how much faster is shown in Xiaodai's response). This is an attempt to do it using Julia's built-in array functions.
What about findlast(A.==maximum(A)) (which of course is conceptually similar to your approach)?
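On Julia 1.0 and later, a predicate form avoids allocating the temporary BitArray produced by A .== maximum(A); a small sketch, not from the original answer:
A = [5,5,5,0,1,1,0,0,0,3,3,0,0]
findlast(==(maximum(A)), A) # 3; ==(v) is a curried equality predicate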
The fastest thing would probably be explicit loop implementation like this:
function lastindmax(x)
k = 1
m = x[1]
@inbounds for i in eachindex(x)
if x[i]>=m
k = i
m = x[i]
end
end
return k
end
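A quick sanity check on the array from the question (hypothetical REPL output, reusing lastindmax as defined above):
julia> lastindmax([5,5,5,0,1,1,0,0,0,3,3,0,0])
3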
I tried @Michael's solution and @crstnbr's solution, and I found the latter much faster:
a = rand(Int8(1):Int8(5),1_000_000_000)
@time length(a) - indmax(@view a[end:-1:1]) + 1 # 19 seconds
@time length(a) - indmax(@view a[end:-1:1]) + 1 # 18 seconds
function lastindmax(x)
k = 1
m = x[1]
@inbounds for i in eachindex(x)
if x[i]>=m
k = i
m = x[i]
end
end
return k
end
@time lastindmax(a) # 3 seconds
@time lastindmax(a) # 2.8 seconds
@Michael's solution doesn't support Strings (ERROR: MethodError: no method matching view(::String, ::StepRange{Int64,Int64})) or generators, so I add another solution:
julia> lastimax(x) = maximum((j,i) for (i,j) in enumerate(x))[2]
julia> A="abžcdž"; lastimax(A) # unicode is OK
6
julia> lastimax(i^2 for i in -10:7)
1
If you would rather not catch an exception for an empty sequence:
julia> lastimax(x) = !isempty(x) ? maximum((j,i) for (i,j) in enumerate(x))[2] : 0;
julia> lastimax(i for i in 1:3 if i>4)
0
Simple(!) benchmarks:
This is up to 10 times slower than @Michael's solution for Float64:
julia> mlastimax(A) = length(A) - indmax(@view A[end:-1:1]) + 1;
julia> A = rand(Float64, 1_000_000); @time lastimax(A); @time mlastimax(A)
0.166389 seconds (4.00 M allocations: 91.553 MiB, 4.63% gc time)
0.019560 seconds (6 allocations: 240 bytes)
80346
Surprisingly, it is 2 times faster for Int64!
julia> A = rand(Int64, 1_000_000); @time lastimax(A); @time mlastimax(A)
0.015453 seconds (10 allocations: 304 bytes)
0.031197 seconds (6 allocations: 240 bytes)
423400
It is 2-3 times slower for Strings:
julia> A = ["A$i" for i in 1:1_000_000]; #time lastimax(A); #time mlastimax(A)
0.175117 seconds (2.00 M allocations: 61.035 MiB, 41.29% gc time)
0.077098 seconds (7 allocations: 272 bytes)
999999
EDIT2:
@crstnbr's solution is faster and works with Strings too (though not with generators). There is a difference between lastindmax and lastimax: the first returns a byte index, the second returns a character index:
julia> S = "1š3456789ž"
julia> length(S)
10
julia> lastindmax(S) # the return value is bigger than length(S)
11
julia> lastimax(S) # returns the character index (not the byte index into the String) of the last max character
10
julia> S[chr2ind(S, lastimax(S))]
'ž': Unicode U+017e (category Ll: Letter, lowercase)
julia> S[chr2ind(S, lastimax(S))]==S[lastindmax(S)]
true

Updating a dense vector by a sparse vector in Julia is slow

I am using Julia version 0.4.5 and I am experiencing the following issue:
As far as I know, taking the inner product of a sparse vector with a dense vector should be about as fast as updating the dense vector by a sparse vector. Yet the latter is much slower.
A = sprand(100000,100000,0.01)
w = rand(100000)
@time for i=1:100000
w += A[:,i]
end
26.304380 seconds (1.30 M allocations: 150.556 GB, 8.16% gc time)
@time for i=1:100000
A[:,i]'*w
end
0.815443 seconds (921.91 k allocations: 1.540 GB, 5.58% gc time)
I created a simple sparse matrix type of my own, and there the addition code performed about the same as the inner product.
Am I doing something wrong? I feel like there should be a special function doing the operation w += A[:,i], but I couldn't find it.
Any help is appreciated.
I asked the same question on GitHub and we came to the following conclusion. The type SparseVector was added in Julia 0.4 and with it the BLAS function LinAlg.axpy!, which updates a (possibly dense) vector x in place by a sparse vector y multiplied by a scalar a, i.e. performs x += a*y efficiently. However, in Julia 0.4 it is not implemented properly; it works only in Julia 0.5:
@time for i=1:100000
LinAlg.axpy!(1,A[:,i],w)
end
1.041587 seconds (799.49 k allocations: 1.530 GB, 8.01% gc time)
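(On Julia 0.7 and later, the LinAlg module was renamed LinearAlgebra, so the same call would be spelled as follows.)
using LinearAlgebra, SparseArrays
LinearAlgebra.axpy!(1, A[:, i], w) # in-place w += 1 * A[:, i]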
However, this code is still sub-optimal, as it creates the SparseVector A[:,i]. One can get an even faster version with the following function:
function upd!(w,A,i,c)
rowval = A.rowval
nzval = A.nzval
# iterate only over the stored nonzeros of column i
@inbounds for j in nzrange(A,i)
w[rowval[j]] += c*nzval[j]
end
return w
end
@time for i=1:100000
upd!(w,A,i,1)
end
0.500323 seconds (99.49 k allocations: 1.518 MB)
This is exactly what I needed to achieve. After some research we managed to get there; thanks everyone!
Assuming you want to compute w += c[i] * A[:, i] summed over all columns i, there is an easy way to vectorize it:
>>> A = sprand(100000, 100000, 0.01)
>>> c = rand(100000)
>>> r1 = zeros(100000)
>>> @time for i = 1:100000
>>> r1 += A[:, i] * c[i]
>>> end
29.997412 seconds (1.90 M allocations: 152.077 GB, 12.73% gc time)
>>> @time r2 = sum(A .* c', 2);
1.191850 seconds (50 allocations: 1.493 GB, 0.14% gc time)
>>> all(r1 == r2)
true
First, create a vector c of the constants to multiply with. Then multiply the columns of A element-wise by the values of c (A .* c' broadcasts c across the columns). Last, reduce over the columns of A (the sum(..., 2) part).
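It is worth noting that summing the columns of A scaled by c is exactly the matrix-vector product A * c, so (under the same setup as above) the sparse library can compute it directly:
r3 = A * c # same values as r1 and r2, computed by the built-in sparse mat-vec kernel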

writedlm a matrix of Bool as a 0,1 matrix

I have a matrix of Bool values, for example
x = bitrand(2,3)
If I try to save this to a file:
writedlm("mat.txt", x)
I get a matrix of true and false. I would like to get instead a matrix of 0 and 1 (where 0 replaces false and 1 replaces true). Is there a simple way to do this, perhaps by some options in writedlm, without writing the file line by line myself?
writedlm("mat.txt", map(Int8,x))
Takes each element of x and converts it to an integer using the Int8 function/constructor.
You could also use other integer types but Int8 is more memory efficient than for example Int64.
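On Julia 0.7 and later, writedlm lives in the DelimitedFiles standard library and broadcasting is the usual spelling of the conversion; a minimal sketch:
using DelimitedFiles, Random
x = bitrand(2, 3)
writedlm("mat.txt", Int8.(x)) # Int8.(x) broadcasts the conversion, writing 0/1 instead of false/true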
Try 1*x; it yields the numerical version (perhaps not super memory/time efficient, but good enough for non-"big data" stuff). 0x1*x will yield a UInt8 matrix, which is more memory-compact (but probably slower).
Another option that is slightly faster is to copy the array to UInt8 on the fly as Array{UInt8, ndims(x)}(x), rather than applying map:
>>> x = bitrand(100,100)
>>> a = map(UInt8, x)
>>> b = Array{UInt8, ndims(x)}(x)
>>> all(a .== b)
true
I ran some quick tests and it is slightly faster the larger the matrices are (at least on my computer).
for i in [10, 100, 1_000, 10_000]
x = bitrand(i,i)
println("$i x $i")
@time map(UInt8, x)
@time Array{UInt8, ndims(x)}(x)
end
Outputs:
10 x 10
0.000002 seconds (2 allocations: 208 bytes)
0.000006 seconds (2 allocations: 208 bytes)
100 x 100
0.000053 seconds (2 allocations: 9.891 KB)
0.000018 seconds (2 allocations: 9.891 KB)
1000 x 1000
0.001945 seconds (5 allocations: 976.703 KB)
0.001490 seconds (5 allocations: 976.703 KB)
10000 x 10000
0.224491 seconds (5 allocations: 95.368 MB)
0.117774 seconds (5 allocations: 95.368 MB)
