Type-stability in Julia's product iterator

I am trying to make A in the following code type-stable.
using Primes: factor
function f(n::T, p::T, k::T) where {T<:Integer}
return rand(T, n * p^k)
end
function g(m::T, n::T) where {T<:Integer}
i = 0
for A in Iterators.product((f(n, p, T(k)) for (p, k) in factor(m))...)
i = sum(A)
end
return i
end
Note that f is type-stable. The variable A is not type-stable because the product iterator will return different-sized tuples depending on the value of m (one factor of the product per distinct prime, so the tuple length varies). If there were an iterator like the product iterator that returned a Vector instead of a Tuple, I believe the type-instability would go away.
Does anyone have any suggestions to make A type-stable in the above code?
Edit: I should add that f returns a Vector of type T whose length varies at runtime.
One way I have solved the type-stability is by doing this.
function g(m::T, n::T) where {T<:Integer}
B = Vector{T}[T[]]
for (p, k) in factor(m)
C = Vector{T}[]
for (b, r) in Iterators.product(B, f(n, p, T(k)))
c = copy(b)
push!(c, r)
push!(C, c)
end
B = C
end
for A in B
i = sum(A)
end
return i
end
This (and in particular, A) is now type-stable, but at the cost of a lot of memory. I'm not sure of a better way to do this.

It's not easy to get this completely type-stable, but you can isolate the type instability with a function barrier: convert the factorization to a tuple in an outer function, then pass it to an inner function which is type-stable. This gives just one dynamic dispatch, instead of many:
# inner, type stable
function _g(n, tup)
i = 0
for A in Iterators.product((f(n, p, k) for (p, k) in tup)...)
i += sum(A) # or i = sum(A), whatever
end
return i
end
# outer function
g(m::T, n::T) where {T<:Integer} = _g(n, Tuple(factor(m)))
Some benchmarks:
julia> @btime g(7, 210); # OP version
149.600 μs (7356 allocations: 172.62 KiB)
julia> @btime g(7, 210); # my version
1.140 μs (6 allocations: 11.91 KiB)
You should expect to pay the compilation cost occasionally, whenever you hit a number whose factorization has a length you haven't seen before.
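To see why, note that the length of the tuple is encoded in its type, so each new factorization length forces a fresh specialization of _g. An illustrative REPL check (not from the original answer):
julia> using Primes: factor
julia> Tuple(factor(210)) # 210 = 2 * 3 * 5 * 7
(2 => 1, 3 => 1, 5 => 1, 7 => 1)
julia> typeof(Tuple(factor(210))) # the length 4 is part of the type
NTuple{4, Pair{Int64, Int64}}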

Related

Julia function to return non-unique elements of an array

Julia base has the unique function that returns a vector containing only the unique elements of an array (or any iterable). I was looking for a nonunique function to return an array containing all the elements that appear at least twice in its input. As far as I can tell Julia does not have such a function, which I found a bit surprising.
My first attempt was as follows:
function nonunique(x::AbstractArray)
uniqueindexes = indexin(unique(x),x)
nonuniqueindexes = setdiff(1:length(x),uniqueindexes)
unique(x[nonuniqueindexes])
end
But inspired by Bogumił Kamiński's answer to indices of unique elements of vector in Julia I wrote a second version:
function nonunique(x::AbstractArray{T}) where T
uniqueset = Set{T}()
duplicatedset = Set{T}()
duplicatedvector = Vector{T}()
for i in x
if(i in uniqueset)
if !(i in duplicatedset)
push!(duplicatedset, i)
push!(duplicatedvector, i)
end
else
push!(uniqueset, i)
end
end
duplicatedvector
end
In my tests, this version is about 4 times faster. It has the nice property that the result is ordered by where the second occurrence (i.e. the first repeat) of each duplicated element appears in the input. I believe that in is faster when checking membership of a Set than of an Array, which is why I keep the two variables duplicatedset and duplicatedvector.
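(A quick way to check that claim, as an illustrative sketch rather than part of the original question: membership tests hash into a Set but scan a Vector linearly.)
using BenchmarkTools
s = Set(1:10_000)
v = collect(1:10_000)
@btime 9_999 in $s # O(1) hash lookup
@btime 9_999 in $v # O(n) linear scan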
Is it really necessary for me to "roll my own" nonunique function and can the second version be improved?
You can get higher performance by sorting the list and then searching for duplicates:
function nonunique2(x::AbstractArray{T}) where T
xs = sort(x)
duplicatedvector = T[]
for i=2:length(xs)
if (isequal(xs[i],xs[i-1]) && (length(duplicatedvector)==0 || !isequal(duplicatedvector[end], xs[i])))
push!(duplicatedvector,xs[i])
end
end
duplicatedvector
end
Here are sample results:
julia> x = rand(1:1000,1000);
julia> using BenchmarkTools
julia> nn = @btime nonunique($x);
42.240 μs (39 allocations: 71.23 KiB)
julia> nn2s = @btime nonunique2($x);
26.453 μs (10 allocations: 16.33 KiB)
julia> sort(nn) == sort(nn2s)
true
It will be much better if you can do in-place sorting:
function nonunique2!(x::AbstractArray{T}) where T
sort!(x)
duplicatedvector = T[]
for i=2:length(x)
if (isequal(x[i],x[i-1]) && (length(duplicatedvector)==0 || !isequal(duplicatedvector[end], x[i])))
push!(duplicatedvector,x[i])
end
end
duplicatedvector
end
Here are the results (on the same data):
julia> nn2 = @btime nonunique2!($x)
9.813 μs (9 allocations: 8.39 KiB)
julia> sort(nn) == sort(nn2)
true
To add to the answer above: its limitations are that the type T must be sortable and that it is not order-preserving. I have two possible alternatives.
Here is another non-order preserving solution that uses StatsBase.jl. It can be faster than the sorting solution or slower depending on the density of the duplicates (also it does more work, but in some applications this information might be useful):
nonunique3(x) = [k for (k, v) in countmap(x) if v > 1]
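For completeness, a usage sketch (countmap comes from StatsBase, which must be installed; the example values are illustrative):
using StatsBase: countmap
nonunique3([1, 2, 2, 3, 3, 3]) # returns [2, 3], in no particular order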
If you want to speed up the order preserving approach you could do something like:
function nonunique4(x::AbstractArray{T}) where T
status = Dict{T, Bool}()
duplicatedvector = Vector{T}()
for i in x
if haskey(status, i)
if status[i]
push!(duplicatedvector, i)
status[i] = false
end
else
status[i] = true
end
end
duplicatedvector
end
In general benchmarking them is tricky as performance will depend on:
the density of duplicates, and of elements repeated more than twice, in x
the size of the type T (e.g. if it were a very large immutable type, things might change vs. the standard situation)
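A minimal harness along those lines (an illustrative sketch, not from the original answers; the values of k just control duplicate density):
using BenchmarkTools
for k in (10, 1_000, 100_000) # smaller k means denser duplicates
x = rand(1:k, 1_000)
println("element range 1:$k")
@btime nonunique($x) # Set-based, order-preserving
@btime nonunique2!(y) setup=(y = copy($x)) evals=1 # sort-based, mutates its input
@btime nonunique4($x) # Dict-based, order-preserving
end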
Not really an answer (the excellent answers are above), but a comment: the original implementation can be cleaned up a little, to:
function nonunique1(x::AbstractArray{T}) where T
uniqueset = Set{T}()
duplicatedset = Set{T}()
for i in x
if(i in uniqueset)
push!(duplicatedset, i)
else
push!(uniqueset, i)
end
end
collect(duplicatedset)
end
i.e. you don't need to check for existence before pushing to a set, and you don't need to fill a vector and a set separately. (Note that this version gives up the ordering guarantee, since it collects from a set at the end.) It's still not as fast as the sorting implementation.

Speed up deepcopy for new type

Question: I have a new type, type MyFloat; x::Float64; end. I want to perform a deepcopy on a Vector{MyFloat}. Using Julia v0.5.0 on Ubuntu 16.04, the operation runs roughly 150 times slower than a deepcopy call on an equivalent-length Vector{Float64}. Is it possible to speed up a deepcopy on my Vector{MyFloat}?
Code snippet: The 150 times slowdown can be seen with the following code snippet which can be pasted to the REPL:
#Just my own floating point type
type MyFloat
x::Float64
end
#This function performs N deepcopy operations on a Vector{MyFloat} of length J
function f1(J::Int, N::Int)
v = MyFloat.(rand(J))
x = [ deepcopy(v) for n = 1:N ]
end
#The same as f1, but on Vector{Float64} instead of Vector{MyFloat}
function f2(J::Int, N::Int)
v = rand(J)
x = [ deepcopy(v) for n = 1:N ]
end
#Pre-compilation step
f1(2, 2);
f2(2, 2);
#Timings
@time f1(100, 15000);
@time f2(100, 15000);
On my machine this produces:
julia> @time f1(100, 15000);
1.944410 seconds (4.61 M allocations: 167.888 MB, 7.72% gc time)
julia> @time f2(100, 15000);
0.013513 seconds (45.01 k allocations: 19.113 MB, 78.80% gc time)
Looking at the answer here it sounds like I can speed things up by defining my own copy method for MyFloat. I've tried things like:
Base.deepcopy(x::MyFloat)::MyFloat = MyFloat(x.x);
Base.deepcopy(v::Vector{MyFloat})::Vector{MyFloat} = [ MyFloat(y.x) for y in v ]
Base.copy(x::MyFloat)::MyFloat = MyFloat(x.x)
Base.copy(v::Vector{MyFloat})::Vector{MyFloat} = [ MyFloat(y.x) for y in v ]
but this doesn't make any difference.
Final note: Letting a = MyFloat.([1.0, 2.0]), I could just use b = copy(a) and there is no speed penalty. This is fine, as long as I am careful to only ever do operations like b[1] = MyFloat(3.0) (which will modify b but not a). But if I get sloppy and accidentally write b[1].x = 3.0, then this will modify both a and b.
By the way, it is entirely possible that I do not have a deep understanding of the differences between copy and deepcopy... I have read this great blog post (thanks @ChrisRackauckas), but I'm certainly a bit fuzzy about what is happening at a deeper level.
Try changing type MyFloat in the definition to immutable MyFloat or struct MyFloat (the keyword changed in 0.6). This makes the times almost equal.
As @Gnimuc mentioned, a mutable type is not a bitstype, so Julia has to keep track of a lot of extra bookkeeping for each object. See here and in the comments.
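Spelled out in the 0.6+ syntax, the fix looks like this (a sketch; it also addresses the "sloppy mutation" worry from the question's final note, since fields of an immutable struct cannot be reassigned):
struct MyFloat # immutable: a Vector{MyFloat} is stored inline, like a Vector{Float64}
x::Float64
end
a = MyFloat.([1.0, 2.0])
b = copy(a) # fast, and now also safe
b[1] = MyFloat(3.0) # rebinds the element; a is untouched
# b[1].x = 3.0 would now throw an error, so the aliasing accident cannot happen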

Recursive call signature keeps changing

I am going to implement a program that uses recursion quite a bit. So, before I started getting stack overflow exceptions, I figured it would be nice to have a trampoline implemented, and to use thunks where needed.
A first try I did was with factorial. Here the code:
callable(f) = !isempty(methods(f))
function trampoline(f, arg1, arg2)
v = f(arg1, arg2)
while callable(v)
v = v()
end
return v
end
function factorial(n, continuation)
if n == 1
continuation(1)
else
(() -> factorial(n-1, (z -> (() -> continuation(n*z)))))
end
end
function cont(x)
x
end
Also, I implemented a naive factorial to check if, as a matter of fact, I would be preventing stack overflows:
function factorial_overflow(n)
if n == 1
1
else
n*factorial_overflow(n-1)
end
end
The results are:
julia> factorial_overflow(140000)
ERROR: StackOverflowError:
#JITing with a small input
julia> trampoline(factorial, 10, cont)
3628800
#Testing
julia> trampoline(factorial, 140000, cont)
0
So, yes, I am avoiding stack overflows. And yes, I know the result is nonsense, as I am getting integer overflows, but here I just cared about the stack. A production version would of course have that fixed.
(Also, I know for the factorial case there is a built-in, I wouldn't use either of these, I made them for testing my trampoline).
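(As an aside, and not part of the original post: one sketch of a fix for the integer overflow is to seed the recursion with a BigInt, so every n*z promotes to arbitrary precision:)
julia> trampoline(factorial, big(25), cont)
15511210043330985984000000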
The trampoline version takes a lot of time on its first run, and only then gets quick, and only for the same or smaller inputs. If I then do trampoline(factorial, 150000, cont), I get some compilation time again.
It seems to me (educated guess) that I am JITing many different signatures for factorial: one for every thunk generated.
My question is: can I avoid this?
I think the problem is that every closure is its own type, which is specialized on the captured variables. To avoid this specialization, one can instead use functors that are not fully specialized:
struct L1
f
n::Int
z::Int
end
(o::L1)() = o.f(o.n*o.z)
struct L2
f
n::Int
end
(o::L2)(z) = L1(o.f, o.n, z)
struct Factorial
f
c
n::Int
end
(o::Factorial)() = o.f(o.n-1, L2(o.c, o.n))
callable(f) = false
callable(f::Union{Factorial, L1, L2}) = true
function myfactorial(n, continuation)
if n == 1
continuation(1)
else
Factorial(myfactorial, continuation, n)
end
end
function cont(x)
x
end
function trampoline(f, arg1, arg2)
v = f(arg1, arg2)
while callable(v)
v = v()
end
return v
end
Note that the function fields are untyped. Now the function runs much faster on the first run:
julia> @time trampoline(myfactorial, 10, cont)
0.020673 seconds (4.24 k allocations: 264.427 KiB)
3628800
julia> @time trampoline(myfactorial, 10, cont)
0.000009 seconds (37 allocations: 1.094 KiB)
3628800
julia> @time trampoline(myfactorial, 14000, cont)
0.001277 seconds (55.55 k allocations: 1.489 MiB)
0
julia> @time trampoline(myfactorial, 14000, cont)
0.001197 seconds (55.55 k allocations: 1.489 MiB)
0
I just translated every closure in your code into a corresponding functor. This might not be needed, and there are probably better solutions, but it works and hopefully demonstrates the approach.
Edit:
To make the reason for the slowdown clearer, one can use:
function factorial(n, continuation)
if n == 1
continuation(1)
else
tmp = (z -> (() -> continuation(n*z)))
@show typeof(tmp)
(() -> factorial(n-1, tmp))
end
end
This outputs:
julia> trampoline(factorial, 10, cont)
typeof(tmp) = ##31#34{Int64,#cont}
typeof(tmp) = ##31#34{Int64,##31#34{Int64,#cont}}
typeof(tmp) = ##31#34{Int64,##31#34{Int64,##31#34{Int64,#cont}}}
typeof(tmp) = ##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,#cont}}}}
typeof(tmp) = ##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,#cont}}}}}
typeof(tmp) = ##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,#cont}}}}}}
typeof(tmp) = ##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,#cont}}}}}}}
typeof(tmp) = ##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,#cont}}}}}}}}
typeof(tmp) = ##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,##31#34{Int64,#cont}}}}}}}}}
3628800
tmp is a closure. Its automatically created type ##31#34 looks similar to
struct Tmp{T,F}
n::T
continuation::F
end
The specialization on the type F of the continuation field is the reason for the long compilation times.
By using L2 instead, which is not specialized on the corresponding field f, the continuation argument to factorial always has the type L2, and the problem is avoided.

Julia pi approximation slow

I have pi approximation code very similar to that on the official page:
function piaprox()
sum = 1.0
for i = 2:m-1
sum = sum + (1.0/(i*i))
end
end
m = parse(Int,ARGS[1])
opak = parse(Int,ARGS[2])
@time for i = 0:opak
piaprox()
end
When I compare the timings of C and Julia, Julia is significantly slower: almost 38 sec for m = 100000000, while the C time is 0.1608328933 sec. Why is this happening?
julia> m=100000000
julia> function piaprox()
sum = 1.0
for i = 2:m-1
sum = sum + (1.0/(i*i))
end
end
piaprox (generic function with 1 method)
julia> @time piaprox()
28.482094 seconds (600.00 M allocations: 10.431 GB, 3.28% gc time)
I would like to mention two very important paragraphs from the Performance Tips section of the Julia documentation:
Avoid global variables: A global variable might have its value, and
therefore its type, change at any point. This makes it difficult for
the compiler to optimize code using global variables. Variables should
be local, or passed as arguments to functions, whenever possible. [...]
The macro @code_warntype (or its function variant code_warntype()) can
sometimes be helpful in diagnosing type-related problems.
julia> @code_warntype piaprox();
Variables:
sum::Any
#s1::Any
i::Any
It's clear from the @code_warntype output that the compiler could not infer the types of the local variables in piaprox(). So we try declaring the types and removing the global variable:
function piaprox(m::Int)
sum::Float64 = 1.0
i::Int = 0
for i = 2:m-1
sum = sum + (1.0/(i*i))
end
end
julia> @time piaprox(100000000)
0.009023 seconds (11.10 k allocations: 399.769 KB)
julia> @code_warntype piaprox(100000000);
Variables:
m::Int64
sum::Float64
i::Int64
#s1::Int64
EDIT
As @user3662120 commented, the super-fast behavior of this answer is the result of a mistake: without a return value, LLVM can optimize the for loop away entirely. By adding a return line, the @time result becomes:
julia> @time piaprox(100000000)
0.746795 seconds (11.11 k allocations: 400.294 KB, 0.45% gc time)
1.644934057834575
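For reference, the corrected version with an explicit return (a sketch of what that comment implies; the loop sums 1/i^2, converging to pi^2/6):
function piaprox(m::Int)
sum = 1.0
for i = 2:m-1
sum += 1.0 / (i * i)
end
return sum # without this, LLVM may elide the whole loop
end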

Copying array columns

I have an algorithm that requires one column of an array to be replaced by another column of the same array.
I tried doing it with slices, and element-wise.
const M = 10^4
const N = 10^4
A = rand(Float32, M, N)
B = rand(Float32, N, M)
function copy_col!(A::Array{Float32,2},col1::Int,col2::Int)
A[1:end,col2] = A[1:end,col1]
end
function copy_col2!(A::Array{Float32,2},col1::Int,col2::Int)
for i in 1:size(A,1)
A[i,col2] = A[i,col1]
end
end
[Both functions, and rand, are called once here for compilation]
@time (for i in 1:20000 copy_col!(B, rand(1:size(B,2)), rand(1:size(B,2))); end)
@time (for i in 1:20000 copy_col2!(B, rand(1:size(B,2)), rand(1:size(B,2))); end)
>> 0.607899 seconds (314.81 k allocations: 769.879 MB, 25.05% gc time)
>> 0.213387 seconds (117.96 k allocations: 2.410 MB)
Why does copying using slices perform so much worse? Is there a better way than what copy_col2! does?
A[1:end,col1] first makes a copy of the indexed column, and then that copy is written into A[1:end,col2], so copy_col! allocates more and runs longer. There are sub, slice, and view (the first two are from older Julia versions) that may remedy the allocations in this case.
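In current Julia, view (or the @views macro) is the usual way to avoid the temporary; a minimal sketch (the name copy_col3! is mine, not from the answer):
function copy_col3!(A::AbstractMatrix, col1::Int, col2::Int)
# the view wraps the source column without copying it first
A[:, col2] = view(A, :, col1)
return A
end
# or equivalently: @views A[:, col2] .= A[:, col1]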
