I am just starting to evaluate Julia (version 0.6.0), and I tested how resize! and sizehint! affect performance, using the @time macro.
The documentation says "# Run once to JIT-compile", but judging by the number of allocations, running once may not be enough.
module Test

function test(x::Int64; hint::Bool=false, resize::Bool=false)
    A::Array{Int64} = []
    n::Int64 = x
    if resize
        resize!(A, n)
        for i in 1:n
            A[i] = i
        end
    else
        if hint
            sizehint!(A, n)
        end
        for i in 1:n
            push!(A, i)
        end
    end
    A[end]
end

end
import Test

#Test.test(1); # (1)
#Test.test(1, hint=true); # (2)
#Test.test(1, resize=true); # (3)

@time Test.test(10_000_000)
@time Test.test(10_000_000, hint=true)
@time Test.test(10_000_000, resize=true)
I got different results depending on which "JIT-precompile" calls were uncommented.
Results from the code above (all three calls commented out):
0.494120 seconds (11.02 k allocations: 129.706 MiB, 22.77% gc time)
0.141155 seconds (3.43 k allocations: 76.537 MiB, 41.94% gc time)
0.068319 seconds (9 allocations: 76.294 MiB, 76.99% gc time)
If (1) is uncommented:
0.520939 seconds (112 allocations: 129.007 MiB, 21.79% gc time)
0.140845 seconds (3.43 k allocations: 76.537 MiB, 42.35% gc time)
0.068741 seconds (9 allocations: 76.294 MiB, 77.55% gc time)
If (1) && (2) are uncommented:
0.586479 seconds (112 allocations: 129.007 MiB, 19.28% gc time)
0.117521 seconds (9 allocations: 76.294 MiB, 50.56% gc time)
0.068275 seconds (9 allocations: 76.294 MiB, 76.84% gc time)
If (1) && (2) && (3) are uncommented:
0.509668 seconds (112 allocations: 129.007 MiB, 21.61% gc time)
0.112276 seconds (9 allocations: 76.294 MiB, 50.58% gc time)
0.065123 seconds (9 allocations: 76.294 MiB, 76.34% gc time)
If (3) is uncommented:
0.497802 seconds (240 allocations: 129.016 MiB, 22.53% gc time)
0.117035 seconds (11 allocations: 76.294 MiB, 52.56% gc time)
0.067170 seconds (11 allocations: 76.294 MiB, 76.93% gc time)
My questions:
Is this a bug?
If it is not a bug, is there a way to force complete compilation?
No, the docs here clearly state that this happens because you were running @time in global scope:
julia> function foo()
           Test.test(1) # warm-up
           @time Test.test(10_000_000)
           @time Test.test(10_000_000, hint=true)
           @time Test.test(10_000_000, resize=true)
       end
foo (generic function with 1 method)
julia> foo()
0.401256 seconds (26 allocations: 129.001 MiB, 47.38% gc time)
0.185094 seconds (6 allocations: 76.294 MiB, 37.13% gc time)
0.034649 seconds (6 allocations: 76.294 MiB, 30.99% gc time)
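Note also that each keyword-argument combination compiles its own specialization, which is why uncommenting a single warm-up call was not enough in the global-scope version. A minimal warm-up sketch covering all three variants (warmup is a hypothetical helper, not part of the original code):

function warmup()
    # call each keyword combination once so every specialization gets compiled
    Test.test(1)
    Test.test(1, hint=true)
    Test.test(1, resize=true)
end
warmup()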
In Julia, one can find conj and conj! for, respectively, the out-of-place and in-place conjugate of a complex array. Surprisingly, I could not find an in-place version of the opposite, i.e. the additive inverse, of an array.
The main interest is allocation: the in-place version allocates nothing.
Here is some benchmarking.
# using BenchmarkTools
a = rand(ComplexF64,1000,1000);
@btime conj($a);
@btime conj!($a);
@btime -$a;
@btime -1 .* $a;
@btime flipsign.(a,-1);
@btime .-$a;
julia> using BenchmarkTools
julia> a = rand(ComplexF64,1000,1000);
julia> @btime conj($a);
3.594 ms (2 allocations: 15.26 MiB)
julia> @btime conj!($a);
979.401 μs (0 allocations: 0 bytes)
julia> @btime -$a;
3.594 ms (2 allocations: 15.26 MiB)
julia> @btime -1 .* $a;
3.586 ms (2 allocations: 15.26 MiB)
julia> @btime flipsign.(a,-1);
3.588 ms (4 allocations: 15.26 MiB)
julia> @btime .-$a;
3.588 ms (2 allocations: 15.26 MiB)
In all cases except the one with conj! you are also measuring the allocation of the result of the expression. This goes away if you use broadcast assignment:
julia> @btime(@. $a = conj($a))
2.409 ms (0 allocations: 0 bytes)
julia> @btime(@. $a = -($a));
2.386 ms (0 allocations: 0 bytes)
The @. macro is just a shortcut for "put dots everywhere": @. a = -a expands to a .= .-(a).
Maybe your confusion arises from thinking about Julia types the wrong way. Unlike (I believe) in Matlab, scalars and arrays are strictly different. conj operates on complex scalars, while conj! is just a helper for arrays of complex numbers, equivalent to a -> a .= conj.(a). In-place operations can naturally only work on arrays (i.e., reference types), not scalars.
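If you want a named in-place negation analogous to conj!, a minimal sketch built on the same broadcast assignment (neg! is a hypothetical name, not a function from Base):

# hypothetical in-place negation helper; mutates its argument
neg!(a::AbstractArray) = a .= .-a

a = rand(ComplexF64, 1000, 1000);
neg!(a);  # zero allocations once compiled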
How can I check if a string is empty?
I am currently using the == operator:
julia> x = "";
julia> x == "";
true
Use isempty. It is more explicit and more likely to be optimized for its use case.
For example, on the latest Julia:
julia> using BenchmarkTools
julia> myisempty(x::String) = x == ""
myisempty (generic function with 1 method)
julia> #btime myisempty("")
2.732 ns (0 allocations: 0 bytes)
true
julia> #btime myisempty("bar")
3.001 ns (0 allocations: 0 bytes)
false
julia> #btime isempty("")
1.694 ns (0 allocations: 0 bytes)
true
julia> #btime isempty("bar")
1.594 ns (0 allocations: 0 bytes)
false
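The speed difference is plausible because isempty only needs the string's length, whereas == compares contents. A hedged sketch of such a length-based check (this mirrors, but is not necessarily identical to, Base's actual definition):

# length-based emptiness test: ncodeunits returns the number of bytes
# backing the string, so no content comparison is needed
myisempty2(s::AbstractString) = ncodeunits(s) == 0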
In Python one can use if inside a list comprehension to filter out elements. Is there a lazy filter equivalent in Julia?
for x in filter(x->x<2, 1:3)
    println(x)
end
works and prints only 1, but filter(x->x<2, 1:3) is eager and so may not be desirable for billions of records.
You can do this just like in Python:
julia> function f()
           for x in (i for i in 1:10^9 if i == 10^9)
               println(x)
           end
       end
f (generic function with 1 method)
julia> @time f()
1000000000
3.293702 seconds (139.87 k allocations: 7.107 MiB)
julia> @time f()
1000000000
3.224707 seconds (11 allocations: 352 bytes)
and you see that it essentially does not allocate. But it is faster to just perform the filter test inside the loop without using a generator:
julia> function g()
           for x in 1:10^9
               x == 10^9 && println(x)
           end
       end
g (generic function with 1 method)
julia> @time g()
1000000000
2.098305 seconds (53.49 k allocations: 2.894 MiB)
julia> @time g()
1000000000
2.094018 seconds (11 allocations: 352 bytes)
Edit: Finally, you can use Iterators.filter:
julia> function h()
           for x in Iterators.filter(==(10^9), 1:10^9)
               println(x)
           end
       end
h (generic function with 1 method)
julia> @time h()
1000000000
0.390966 seconds (127.96 k allocations: 6.599 MiB)
julia> @time h()
1000000000
0.311650 seconds (12 allocations: 688 bytes)
which in this case will be fastest (see also https://docs.julialang.org/en/latest/base/iterators/#Iteration-utilities-1).
You might also want to check out https://github.com/JuliaCollections/IterTools.jl.
EDIT 2
Sometimes Julia is more powerful than you would think. Check this out:
julia> function g2()
           for x in 1:1_000_000_000
               x == 1_000_000_000 && println(x)
           end
       end
g2 (generic function with 1 method)
julia> @time g2()
1000000000
0.029332 seconds (62.91 k allocations: 3.244 MiB)
julia> @time g2()
1000000000
0.000636 seconds (11 allocations: 352 bytes)
and we see that the compiler has essentially compiled out all our computations.
In essence, in the earlier Iterators.filter example constant propagation kicked in and replaced 10^9 with 1_000_000_000, so the compiler could optimize most of the work away there as well.
Therefore we have to devise a smarter test. Here it goes:
julia> using BenchmarkTools
julia> function f_rand(x)
           s = 0.0
           for v in (v for v in x if 0.1 < v < 0.2)
               s += v
           end
           s
       end
f_rand (generic function with 1 method)
julia> function g_rand(x)
           s = 0.0
           for v in x
               if 0.1 < v < 0.2
                   s += v
               end
           end
           s
       end
g_rand (generic function with 1 method)
julia> function h_rand(x)
           s = 0.0
           for v in Iterators.filter(v -> 0.1 < v < 0.2, x)
               s += v
           end
           s
       end
h_rand (generic function with 1 method)
julia> x = rand(10^6);
julia> @btime f_rand($x)
2.032 ms (0 allocations: 0 bytes)
14922.291597613703
julia> @btime g_rand($x)
1.804 ms (0 allocations: 0 bytes)
14922.291597613703
julia> @btime h_rand($x)
2.035 ms (0 allocations: 0 bytes)
14922.291597613703
And now we get what I was originally expecting (a plain loop with if is the fastest).
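As a side note, if all you need is the sum, the filtered generator can be passed straight to sum with no intermediate array; a minimal sketch using the x from above (not separately benchmarked here):

# lazy filtered generator consumed directly by sum
s = sum(v for v in x if 0.1 < v < 0.2)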
I am beginning to learn Julia after using Matlab for several years. I started by implementing a simple polynomial multiplication (without FFT) to try to understand the role of type stability. A big part of this project is the requirement for a fast polynomial multiplier. However, I get the following timings, which I can't understand at all.
function cauchyproduct(L::Array{Float64},R::Array{Float64})
    # good one for floats
    N = length(L)
    prodterm = zeros(1,2N-1)
    for n = 1:N
        Lterm = view(L,1:n)
        Rterm = view(R,n:-1:1)
        prodterm[n] = dot(Lterm,Rterm)
    end
    for n = 1:N-1
        Lterm = view(L,n+1:N)
        Rterm = view(R,N:-1:n+1)
        prodterm[N+n] = dot(Lterm,Rterm)
    end
    prodterm
end
testLength = 10000
goodL = rand(1,testLength)
goodR = rand(1,testLength)
for j in 1:10
    @time cauchyproduct(goodL,goodR)
end
@which cauchyproduct(goodL,goodR)
I get the following timings from two sequential runs of this code. The timings are completely erratic from one run to another; per test they can range from 0.05 s to 2 s. Typically, the timings within a single run through the for loop are all similar (as in the example below), but even this isn't always the case. Occasionally they alternate, such as:
0.05s
0.05s
1.9s
0.04s
0.05s
2.1s
etc.
Any idea why this is happening?
0.544795 seconds (131.08 k allocations: 5.812 MiB)
0.510395 seconds (120.00 k allocations: 5.340 MiB)
0.528362 seconds (120.00 k allocations: 5.340 MiB, 0.94% gc time)
0.507156 seconds (120.00 k allocations: 5.340 MiB)
0.507566 seconds (120.00 k allocations: 5.340 MiB)
0.507932 seconds (120.00 k allocations: 5.340 MiB)
0.527383 seconds (120.00 k allocations: 5.340 MiB)
0.513301 seconds (120.00 k allocations: 5.340 MiB, 0.83% gc time)
0.509347 seconds (120.00 k allocations: 5.340 MiB)
0.509177 seconds (120.00 k allocations: 5.340 MiB)
0.052247 seconds (120.00 k allocations: 5.340 MiB, 7.95% gc time)
0.049644 seconds (120.00 k allocations: 5.340 MiB)
0.047275 seconds (120.00 k allocations: 5.340 MiB)
0.049163 seconds (120.00 k allocations: 5.340 MiB)
0.049029 seconds (120.00 k allocations: 5.340 MiB)
0.054050 seconds (120.00 k allocations: 5.340 MiB, 8.36% gc time)
0.047010 seconds (120.00 k allocations: 5.340 MiB)
0.051240 seconds (120.00 k allocations: 5.340 MiB)
0.050961 seconds (120.00 k allocations: 5.340 MiB)
0.049841 seconds (120.00 k allocations: 5.340 MiB, 4.90% gc time)
Edit: The timings shown are obtained by executing the code beneath the defined functions twice in a row. Specifically, the code block
goodL = rand(1,testLength)
goodR = rand(1,testLength)
for j in 1:10
    @time cauchyproduct(goodL,goodR)
end
gives vastly different timings on different runs (without recompiling the functions above it). In all of the timings, the same method of cauchyproduct (the top version) is being called. Hopefully this clarifies the problem.
Edit 2: I changed the code block at the end to the following
testLength = 10000
goodL = rand(1,testLength)
goodR = rand(1,testLength)
for j = 1:3
    @time cauchyproduct(goodL,goodR)
end
for j = 1:3
    goodL = rand(1,testLength)
    goodR = rand(1,testLength)
    @time cauchyproduct(goodL,goodR)
end
@time cauchyproduct(goodL,goodR)
@time cauchyproduct(goodL,goodR)
@time cauchyproduct(goodL,goodR)
and got the following timings on 2 repeated executions of the new block.
Timing 1:
0.045936 seconds (120.00 k allocations: 5.340 MiB)
0.045740 seconds (120.00 k allocations: 5.340 MiB)
0.045768 seconds (120.00 k allocations: 5.340 MiB)
1.549157 seconds (120.00 k allocations: 5.340 MiB, 0.14% gc time)
0.046797 seconds (120.00 k allocations: 5.340 MiB)
0.046637 seconds (120.00 k allocations: 5.340 MiB)
0.047143 seconds (120.00 k allocations: 5.341 MiB)
0.049088 seconds (120.00 k allocations: 5.341 MiB, 3.88% gc time)
0.049246 seconds (120.00 k allocations: 5.341 MiB)
Timing 2:
2.250852 seconds (120.00 k allocations: 5.340 MiB)
2.370882 seconds (120.00 k allocations: 5.340 MiB)
2.247676 seconds (120.00 k allocations: 5.340 MiB, 0.14% gc time)
1.550661 seconds (120.00 k allocations: 5.340 MiB)
0.047258 seconds (120.00 k allocations: 5.340 MiB)
0.047169 seconds (120.00 k allocations: 5.340 MiB)
0.048625 seconds (120.00 k allocations: 5.341 MiB, 4.02% gc time)
0.045489 seconds (120.00 k allocations: 5.341 MiB)
0.049457 seconds (120.00 k allocations: 5.341 MiB)
So confused.
Short answer: Your code is a bit odd, and so is probably triggering garbage collection in unexpected ways, resulting in the varied timings.
Long answer: I agree that the timings you are getting are a bit strange. I'm not completely sure I can nail down exactly what is causing the problem, but I'm 99% certain it is something to do with garbage collection.
So, your code is a bit odd, because you allow input arrays of any dimension, even though you then call the dot function (a BLAS routine for taking the dot product of two vectors). In case you didn't realise: if you want a vector, use Array{Float64,1}, for a matrix Array{Float64,2}, and so on. Or you could also use the aliases Vector{Float64} and Matrix{Float64}.
The second odd thing I noticed is that in your test you generate rand(1, N). This returns an Array{Float64,2}, i.e. a matrix. To get an Array{Float64,1}, i.e. a vector, you would use rand(N). Then within your function you take views into your matrix, which are of size 1xN. Now, Julia uses column-major ordering, so using a 1xN object for a vector is going to be really inefficient, and is probably the source of your strange timings. Under the hood, I suspect the call to dot involves converting these things into regular vectors of floats, since dot eventually feeds through to the underlying BLAS routine, which needs this input type. All these conversions mean plenty of temporary storage, which needs to be garbage collected at some point, and this is probably the source of the varying timings (90% of the time, varied timings on the same code are the result of the garbage collector being triggered, sometimes in quite unexpected ways).
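To make the distinction concrete, a quick REPL check (illustrative only):

julia> typeof(rand(1, 5))   # 1xN: a matrix
Array{Float64,2}

julia> typeof(rand(5))      # N-element vector
Array{Float64,1}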
So, there are probably several ways to improve this further, but my quick-and-dirty version of your function looks like this:
function cauchyproduct(L::AbstractVector{<:Number}, R::AbstractVector{<:Number})
    length(L) != length(R) && error("Length mismatch in inputs")
    N = length(L)
    prodterm = zeros(1,2*N-1)
    R = flipdim(R, 1)
    for n = 1:N
        prodterm[n] = dot(view(L, 1:n), view(R, N-n+1:N))
    end
    for n = 1:N-1
        prodterm[N+n] = dot(view(L, n+1:N), view(R, 1:N-n))
    end
    return prodterm
end
Note, I flip R before the loop so the memory doesn't need to be re-ordered over and over within the loop. This was no doubt contributing to your strange garbage collection issues. Then, applying your test (I think it is a better idea to move array generation inside the loop in case some clever caching issue throws off timings):
testLength = 10000
for j = 1:20
    goodL = rand(testLength);
    goodR = rand(testLength);
    @time cauchyproduct(goodL,goodR);
end
we get something like this:
0.105550 seconds (78.19 k allocations: 3.935 MiB, 2.91% gc time)
0.022421 seconds (40.00 k allocations: 2.060 MiB)
0.022527 seconds (40.00 k allocations: 2.060 MiB)
0.022333 seconds (40.00 k allocations: 2.060 MiB)
0.021568 seconds (40.00 k allocations: 2.060 MiB)
0.021837 seconds (40.00 k allocations: 2.060 MiB)
0.022155 seconds (40.00 k allocations: 2.060 MiB)
0.022071 seconds (40.00 k allocations: 2.060 MiB)
0.021720 seconds (40.00 k allocations: 2.060 MiB)
0.024774 seconds (40.00 k allocations: 2.060 MiB, 9.13% gc time)
0.021714 seconds (40.00 k allocations: 2.060 MiB)
0.022066 seconds (40.00 k allocations: 2.060 MiB)
0.021815 seconds (40.00 k allocations: 2.060 MiB)
0.021819 seconds (40.00 k allocations: 2.060 MiB)
0.021928 seconds (40.00 k allocations: 2.060 MiB)
0.021795 seconds (40.00 k allocations: 2.060 MiB)
0.021837 seconds (40.00 k allocations: 2.060 MiB)
0.022285 seconds (40.00 k allocations: 2.060 MiB)
0.021380 seconds (40.00 k allocations: 2.060 MiB)
0.023828 seconds (40.00 k allocations: 2.060 MiB, 6.91% gc time)
The first iteration is measuring compile time, rather than runtime, and so should be ignored (if you don't know what I mean by this then check the performance tips section of the official docs). As you can see, the remaining iterations are much faster, and quite stable.
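For timings that exclude compilation and average over many samples, BenchmarkTools (used in other answers above) is the usual tool; a minimal sketch:

using BenchmarkTools

goodL = rand(10000);
goodR = rand(10000);
# $ interpolates the globals so dispatch on non-constant globals is not measured
@btime cauchyproduct($goodL, $goodR);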
In one of my applications, I have to store elements of different subtypes in an array, and I took a big performance hit from the JIT.
Below is a minimal example.
abstract A
immutable B <: A end
immutable C <: A end

b = B()
c = C()

@time getindex(A, b, b)
@time getindex(A, b, c)
@time getindex(A, c, c)
@time getindex(A, c, b)
@time getindex(A, b, c, b)
@time getindex(A, b, c, c);
0.007756 seconds (6.03 k allocations: 276.426 KB)
0.007878 seconds (5.01 k allocations: 223.087 KB)
0.005175 seconds (2.44 k allocations: 128.773 KB)
0.004276 seconds (2.42 k allocations: 127.546 KB)
0.004107 seconds (2.45 k allocations: 129.983 KB)
0.004090 seconds (2.45 k allocations: 129.983 KB)
As you see, each time I construct the array for a different combination of element types, it has to JIT-compile a new specialization.
I also tried [...] instead of T[...]; it appeared to be worse.
Restart the kernel and run the following:
b = B()
c = C()

@time Base.vect(b, b)
@time Base.vect(b, c)
@time Base.vect(c, c)
@time Base.vect(c, b)
@time Base.vect(b, c, b)
@time Base.vect(b, c, c);
0.008252 seconds (6.87 k allocations: 312.395 KB)
0.149397 seconds (229.26 k allocations: 12.251 MB)
0.006778 seconds (6.86 k allocations: 312.270 KB)
0.113640 seconds (178.26 k allocations: 9.132 MB, 3.04% gc time)
0.050561 seconds (99.19 k allocations: 5.194 MB)
0.031053 seconds (72.50 k allocations: 3.661 MB)
In my application I face a lot of different subtypes: each element is of type NTuple{N, A}, where N can change. So in the end the application was stuck in JIT compilation.
What's the best way to get around this? The only way I can think of is to create a wrapper, say W, and box all my elements into W before putting them into the array, so that the compiler only compiles the array function once.
immutable W
    value::NTuple
end
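A hedged sketch of how the wrapper would be used (illustrative only; note that the unparameterized NTuple field is abstract, so the tuples are boxed behind the single concrete type W):

# all elements share the concrete type W, so constructing a Vector{W}
# only has to be compiled once
elems = W[W((b,)), W((c, b)), W((b, c, c))]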
Thanks to @Matt B.; after overloading getindex with his definition:
c = C()
@time getindex(A, b, b)
@time getindex(A, b, c)
@time getindex(A, c, c)
@time getindex(A, c, b)
@time getindex(A, b, c, b)
@time getindex(A, b, c, c);
0.008493 seconds (6.43 k allocations: 289.646 KB)
0.000867 seconds (463 allocations: 19.012 KB)
0.000005 seconds (5 allocations: 240 bytes)
0.000003 seconds (5 allocations: 240 bytes)
0.004035 seconds (2.37 k allocations: 122.535 KB)
0.000003 seconds (5 allocations: 256 bytes)
Also, I realized that JIT compilation of tuple is actually quite efficient.
@time tuple(1,2)
@time tuple(b, b)
@time tuple(b, c)
@time tuple(c, c)
@time tuple(c, b)
@time tuple(b, c, b)
@time tuple(b, c, c);
@time tuple(b, b)
@time tuple(b, c)
@time tuple(c, c)
@time tuple(c, b)
@time tuple(b, c, b)
@time tuple(b, c, c);
0.000004 seconds (149 allocations: 10.183 KB)
0.000011 seconds (7 allocations: 336 bytes)
0.000008 seconds (7 allocations: 336 bytes)
0.000007 seconds (7 allocations: 336 bytes)
0.000007 seconds (7 allocations: 336 bytes)
0.000005 seconds (7 allocations: 352 bytes)
0.000004 seconds (7 allocations: 352 bytes)
0.000003 seconds (5 allocations: 192 bytes)
0.000004 seconds (5 allocations: 192 bytes)
0.000002 seconds (5 allocations: 192 bytes)
0.000002 seconds (5 allocations: 192 bytes)
0.000002 seconds (5 allocations: 192 bytes)
0.000002 seconds (5 allocations: 192 bytes)
The JIT heuristics here could probably be better tuned in the base library. While Julia does default to generating specialized methods for unique permutations of argument types, there are a few escape hatches you can use to reduce the number of specializations:
Use f(T::Type) instead of f{T}(::Type{T}). Both are well-typed and behave nicely through inference, but the former will only generate one method for all types.
Use the undocumented all-caps g(::ANY) flag instead of g(::Any). It's semantically identical, but ANY will prevent specialization for that argument.
In this case, you probably want to specialize on the type but not the values:
function Base.getindex{T<:A}(::Type{T}, vals::ANY...)
    a = Array(T, length(vals))
    @inbounds for i = 1:length(vals)
        a[i] = vals[i]
    end
    return a
end
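For readers on modern Julia (0.7+): the f{T}(...) method syntax and the ::ANY hint are gone; the equivalent spelling uses where and @nospecialize. A hedged sketch of the same idea in current syntax (assuming the hierarchy is declared as abstract type A end):

function Base.getindex(::Type{T}, vals...) where {T<:A}
    @nospecialize vals  # specialize on T, but not on the values' types
    a = Vector{T}(undef, length(vals))
    @inbounds for i in 1:length(vals)
        a[i] = vals[i]
    end
    return a
end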