I have an algorithm that requires one column of an array to be replaced by another column of the same array.
I tried doing it with slices, and element-wise.
const M = 10^4
const N = 10^4
A = rand(Float32, M, N)
B = rand(Float32, N, M)
function copy_col!(A::Array{Float32,2},col1::Int,col2::Int)
    A[1:end,col2] = A[1:end,col1]
end
function copy_col2!(A::Array{Float32,2},col1::Int,col2::Int)
    for i in 1:size(A,1)
        A[i,col2] = A[i,col1]
    end
end
[Both functions+rand are called here once for compilation]
@time (for i in 1:20000 copy_col!(B, rand(1:size(B,2)), rand(1:size(B,2))); end)
@time (for i in 1:20000 copy_col2!(B, rand(1:size(B,2)), rand(1:size(B,2))); end)
>> 0.607899 seconds (314.81 k allocations: 769.879 MB, 25.05% gc time)
>> 0.213387 seconds (117.96 k allocations: 2.410 MB)
Why does copying using slices perform so much worse? Is there a better way than what copy_col2! does?
A[1:end,col1] first makes a copy of the indexed column, and that copy is then written into A[1:end,col2], so copy_col! allocates more and runs longer. sub, slice, and view may remedy the allocations in this case.
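A minimal sketch of the view-based approach, assuming Julia 1.x (where sub and slice have been folded into view). The name copy_col_view! is hypothetical; the in-place broadcast .= writes straight into column col2 without first materializing a copy of column col1.

```julia
# Copy column col1 of A into column col2 without allocating a temporary column.
# copy_col_view! is a hypothetical name used only for this illustration.
function copy_col_view!(A::AbstractMatrix, col1::Integer, col2::Integer)
    # view(A, :, col1) references the source column instead of copying it;
    # the dotted assignment writes element by element into column col2.
    A[:, col2] .= view(A, :, col1)
    return A
end

A = [1.0 2.0; 3.0 4.0]
copy_col_view!(A, 1, 2)   # column 2 now holds the contents of column 1
```

Whether this beats the explicit loop in copy_col2! depends on the Julia version, so it is worth timing both on the target setup.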
I am trying to make A in the following code type-stable.
using Primes: factor
function f(n::T, p::T, k::T) where {T<:Integer}
    return rand(T, n * p^k)
end
function g(m::T, n::T) where {T<:Integer}
    i = 0
    for A in Iterators.product((f(n, p, T(k)) for (p, k) in factor(m))...)
        i = sum(A)
    end
    return i
end
Note that f is type-stable. The variable A is not type-stable because the product iterator will return different sized tuples depending on the values of n and m. If there was an iterator like the product iterator that returned a Vector instead of a Tuple, I believe that the type-instability would go away.
Does anyone have any suggestions to make A type-stable in the above code?
Edit: I should add that f returns a variable-sized Vector of type T.
One way I have removed the type instability is this:
function g(m::T, n::T) where {T<:Integer}
    B = Vector{T}[T[]]
    for (p, k) in factor(m)
        C = Vector{T}[]
        for (b, r) in Iterators.product(B, f(n, p, T(k)))
            c = copy(b)
            push!(c, r)
            push!(C, c)
        end
        B = C
    end
    i = 0  # define i in function scope so that `return i` can see it
    for A in B
        i = sum(A)
    end
    return i
end
This (and in particular, A) is now type-stable, but at the cost of a lot of memory. I'm not sure of a better way to do this.
It's not easy to get this completely type stable, but you can isolate the type instability with a function barrier. Convert the factorization to a tuple in an outer function, which you pass to an inner function which is type stable. This gives just one dynamic dispatch, instead of many:
# inner, type stable
function _g(n, tup)
    i = 0
    for A in Iterators.product((f(n, p, k) for (p, k) in tup)...)
        i += sum(A) # or i = sum(A), whatever
    end
    return i
end
# outer function
g(m::T, n::T) where {T<:Integer} = _g(n, Tuple(factor(m)))
Some benchmarks:
julia> @btime g(7, 210); # OP version
149.600 μs (7356 allocations: 172.62 KiB)
julia> @btime g(7, 210); # my version
1.140 μs (6 allocations: 11.91 KiB)
You should expect to hit compilation occasionally, whenever you get a number that contains a new number of factors.
Let x = randn(100, 2). I want to write x to its own file. This file will contain x, and only x, and x will only ever be of type Matrix{Float64}. In the past, I have always used HDF5 for this, but it occurs to me that this is overkill, since in this setup I will only have one array per file. Note that JLD uses HDF5, and so is also overkill.
1) What is the fastest method for reading and writing x assuming I will only ever want to read the entire matrix?
2) What is the fastest method for reading and writing x assuming I might want to read a slice of the matrix?
3) What is the fastest method for reading and writing x assuming I might want to read a slice of the matrix, or over-write a slice of the matrix (but not change the matrix size)?
You could use the serialize function, provided you heed the warnings in the documentation about non-guarantees between versions etc.
serialize(stream::IO, value)
Write an arbitrary value to a stream in an opaque format, such that it can be read back by deserialize. The read-back value will be as identical as possible to the original. In general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image. Ptr values are serialized as all-zero bit patterns (NULL).
An 8-byte identifying header is written to the stream first. To avoid writing the header, construct a SerializationState and use it as the first argument to serialize instead. See also Serializer.writeheader.
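A minimal round-trip sketch, assuming Julia 1.x, where serialize and deserialize live in the Serialization standard library (in the older versions quoted above they were available without an import). The temporary path is an assumption made for illustration only.

```julia
using Serialization  # standard library in Julia 1.x

x = randn(100, 2)
path = tempname()    # throwaway file path, used only for this sketch

# Write the whole matrix in Julia's opaque serialization format.
open(io -> serialize(io, x), path, "w")

# Read it back; the result is a Matrix{Float64} equal to x.
y = open(deserialize, path)
```

As the documentation warns, a file written this way should only be read back by the same Julia version that wrote it.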
Really though, JLD (or in fact, its successor, JLD2) is generally the recommended way*.
*Of particular interest to you might be the statements that: "JLD2 saves and loads Julia data structures in a format comprising a subset of HDF5, without any dependency on the HDF5 C library" and that "it typically outperforms the previous JLD package (sometimes by multiple orders of magnitude) and often outperforms Julia's built-in serializer".
Based on the suggestions made by Tasos above, I put together a rudimentary speed test for both writes and reads using 4 different methods:
h5 (using the HDF5 package)
jld (using the JLD2 package)
slz (using serialize and deserialize)
dat (write to a binary file, using the first 128 bits to store the dimension of the matrix)
I've pasted the test code at the bottom of this answer. The results are:
julia> @time f_write_test(N, "h5")
0.191555 seconds (2.11 k allocations: 76.380 MiB, 26.39% gc time)
julia> @time f_write_test(N, "jld")
0.774857 seconds (8.33 k allocations: 77.058 MiB, 0.32% gc time)
julia> @time f_write_test(N, "slz")
0.108687 seconds (2.61 k allocations: 76.495 MiB, 1.91% gc time)
julia> @time f_write_test(N, "dat")
0.087488 seconds (1.61 k allocations: 76.379 MiB, 1.08% gc time)
julia> @time f_read_test(N, "h5")
0.051646 seconds (5.81 k allocations: 76.515 MiB, 14.80% gc time)
julia> @time f_read_test(N, "jld")
0.071249 seconds (10.04 k allocations: 77.136 MiB, 57.60% gc time)
julia> @time f_read_test(N, "slz")
0.038967 seconds (3.11 k allocations: 76.527 MiB, 22.17% gc time)
julia> @time f_read_test(N, "dat")
0.068544 seconds (1.81 k allocations: 76.405 MiB, 59.21% gc time)
So for writes, the write to binary option outperforms even serialize, and is twice as fast as HDF5 and almost an order of magnitude faster than JLD2.
For reads, deserialize has the best performance, while HDF5, JLD2 and reading from binary are all fairly close in performance, with HDF5 being slightly ahead.
I haven't included a test for writing to slices, but may come back to this in the future. Obviously writing to slices is impossible using serialize (not to mention the versioning/system image issues that serialize also faces), and I'm not really sure how to do it using JLD2. My gut feeling is that writing a slice to binary will easily beat HDF5 if the slice is contiguous on disk, but will probably be significantly slower than HDF5 if it is non-contiguous and if the HDF5 method optimally exploits chunking. If HDF5 doesn't exploit chunking (which implies knowing at write time what slices you will want), then I suspect the binary method will come out ahead.
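To make the contiguous-slice case concrete, here is a rough sketch (untested for speed) of reading and overwriting a single column in the binary layout described above: two Int64 dimensions followed by the Float64 data in column-major order. It assumes Julia 1.x; the function names write_dat, read_dat_col and write_dat_col! are hypothetical and are not part of the test code.

```julia
# Assumed layout: 16-byte header (two Int64 dimensions), then column-major Float64 data.
function write_dat(path::AbstractString, x::Matrix{Float64})
    open(path, "w") do io
        write(io, Int64(size(x, 1)))
        write(io, Int64(size(x, 2)))
        write(io, x)
    end
    return nothing
end

# Byte offset of the first element of column j for a matrix with d1 rows.
col_offset(d1::Integer, j::Integer) = 16 + (j - 1) * d1 * sizeof(Float64)

# Read only column j by seeking past the header and the preceding columns.
function read_dat_col(path::AbstractString, j::Integer)
    open(path, "r") do io
        d1 = read(io, Int64)
        d2 = read(io, Int64)
        1 <= j <= d2 || throw(ArgumentError("column index out of range"))
        seek(io, col_offset(d1, j))
        col = Vector{Float64}(undef, d1)
        read!(io, col)
        return col
    end
end

# Overwrite column j in place; the matrix size stored in the header is unchanged.
function write_dat_col!(path::AbstractString, j::Integer, col::Vector{Float64})
    open(path, "r+") do io
        d1 = read(io, Int64)
        d2 = read(io, Int64)
        1 <= j <= d2 || throw(ArgumentError("column index out of range"))
        length(col) == d1 || throw(ArgumentError("column length mismatch"))
        seek(io, col_offset(d1, j))
        write(io, col)
    end
    return nothing
end
```

A row slice would be non-contiguous in this layout and would need one seek per element, which is the case where a chunked HDF5 file might do better.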
In summary, I'm going to go with the binary method, as I think that at this stage it is clearly the overall winner.
I suspect that eventually, JLD2 will probably be the method of choice, but there is a fair way to go here (the package itself is very new so not much time for the community to work on optimisations etc).
Test code follows:
using JLD2, HDF5
f_write_h5(fp::String, x::Matrix{Float64}) = h5write(fp, "G/D", x)
f_write_jld(fp::String, x::Matrix{Float64}) = @save fp x
f_write_slz(fp::String, x::Matrix{Float64}) = open(fid->serialize(fid, x), fp, "w")
f_write_dat_inner(fid1::IOStream, x::Matrix{Float64}) = begin ; write(fid1, size(x,1)) ; write(fid1, size(x,2)) ; write(fid1, x) ; end
f_write_dat(fp::String, x::Matrix{Float64}) = open(fid1->f_write_dat_inner(fid1, x), fp, "w")
f_read_h5(fp::String) = h5read(fp, "G/D")
f_read_jld(fp::String) = @load fp x
f_read_slz(fp::String) = open(deserialize, fp, "r")
f_read_dat_inner(fid1::IOStream) = begin ; d1 = read(fid1, Int) ; d2 = read(fid1, Int) ; read(fid1, Float64, (d1, d2)) ; end
f_read_dat(fp::String) = open(f_read_dat_inner, fp, "r")
function f_write_test(N::Int, filetype::String)
    dp = "/home/colin/Temp/"
    filetype == "h5" && [ f_write_h5("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    filetype == "jld" && [ f_write_jld("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    filetype == "slz" && [ f_write_slz("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    filetype == "dat" && [ f_write_dat("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    #[ rm("$(dp)$(n).$(filetype)") for n = 1:N ]
    nothing
end
function f_read_test(N::Int, filetype::String)
    dp = "/home/colin/Temp/"
    filetype == "h5" && [ f_read_h5("$(dp)$(n).$(filetype)") for n = 1:N ]
    filetype == "jld" && [ f_read_jld("$(dp)$(n).$(filetype)") for n = 1:N ]
    filetype == "slz" && [ f_read_slz("$(dp)$(n).$(filetype)") for n = 1:N ]
    filetype == "dat" && [ f_read_dat("$(dp)$(n).$(filetype)") for n = 1:N ]
    [ rm("$(dp)$(n).$(filetype)") for n = 1:N ]
    nothing
end
f_write_test(1, "h5")
f_write_test(1, "jld")
f_write_test(1, "slz")
f_write_test(1, "dat")
f_read_test(1, "h5")
f_read_test(1, "jld")
f_read_test(1, "slz")
f_read_test(1, "dat")
N = 100
@time f_write_test(N, "h5")
@time f_write_test(N, "jld")
@time f_write_test(N, "slz")
@time f_write_test(N, "dat")
@time f_read_test(N, "h5")
@time f_read_test(N, "jld")
@time f_read_test(N, "slz")
@time f_read_test(N, "dat")
Julia has two built-in functions, readdlm and writedlm, for doing this:
julia> x = randn(5, 5)
5×5 Array{Float64,2}:
-1.2837 -0.641382 0.611415 0.965762 -0.962764
0.106015 -0.344429 1.40278 0.862094 0.324521
-0.603751 0.515505 0.381738 -0.167933 -0.171438
-1.79919 -0.224585 1.05507 -0.753046 0.0545622
-0.110378 -1.16155 0.774612 -0.0796534 -0.503871
julia> writedlm("txtmat.txt", x, use_mmap=true)
julia> readdlm("txtmat.txt", use_mmap=true)
5×5 Array{Float64,2}:
-1.2837 -0.641382 0.611415 0.965762 -0.962764
0.106015 -0.344429 1.40278 0.862094 0.324521
-0.603751 0.515505 0.381738 -0.167933 -0.171438
-1.79919 -0.224585 1.05507 -0.753046 0.0545622
-0.110378 -1.16155 0.774612 -0.0796534 -0.503871
This is definitely not the fastest way (use Mmap.mmap directly, as DanGetz suggested in the comments, if performance is a big deal), but it seems to be the simplest, and the output file is human-readable.
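For comparison, a minimal sketch of the Mmap.mmap route mentioned above, assuming Julia 1.x with the Mmap standard library and a file that holds only the raw Float64 data (no header). The dimensions must then be known by the reader; the temporary path and the 100×2 size are assumptions taken from the question's example.

```julia
using Mmap  # standard library in Julia 1.x

path = tempname()          # throwaway file path, used only for this sketch
x = randn(100, 2)

# Write the raw column-major bytes of x, with no header.
open(path, "w") do io
    write(io, x)
end

# Map the file as a 100×2 Float64 matrix; indexing touches only the pages needed.
io = open(path, "r+")
m = Mmap.mmap(io, Matrix{Float64}, (100, 2))
first_col = m[:, 1]        # read a slice
m[1, 2] = 0.0              # overwrite one element in place
Mmap.sync!(m)              # flush the change to disk
close(io)
```

Because the mapping is backed by the file, assigning to m changes the file contents without rewriting the rest of the matrix.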
Question: I have defined a new type: type MyFloat; x::Float64; end. I want to perform a deepcopy on a Vector{MyFloat}. Using Julia v0.5.0 on Ubuntu 16.04, the operation runs roughly 150 times slower than a deepcopy call on a Vector{Float64} of the same length. Is it possible to speed up a deepcopy on my Vector{MyFloat}?
Code snippet: The 150 times slowdown can be seen with the following code snippet which can be pasted to the REPL:
#Just my own floating point type
type MyFloat
    x::Float64
end
#This function performs N deepcopy operations on a Vector{MyFloat} of length J
function f1(J::Int, N::Int)
    v = MyFloat.(rand(J))
    x = [ deepcopy(v) for n = 1:N ]
end
#The same as f1, but on Vector{Float64} instead of Vector{MyFloat}
function f2(J::Int, N::Int)
    v = rand(J)
    x = [ deepcopy(v) for n = 1:N ]
end
#Pre-compilation step
f1(2, 2);
f2(2, 2);
#Timings
@time f1(100, 15000);
@time f2(100, 15000);
On my machine this produces:
julia> @time f1(100, 15000);
1.944410 seconds (4.61 M allocations: 167.888 MB, 7.72% gc time)
julia> @time f2(100, 15000);
0.013513 seconds (45.01 k allocations: 19.113 MB, 78.80% gc time)
Looking at the answer here it sounds like I can speed things up by defining my own copy method for MyFloat. I've tried things like:
Base.deepcopy(x::MyFloat)::MyFloat = MyFloat(x.x);
Base.deepcopy(v::Vector{MyFloat})::Vector{MyFloat} = [ MyFloat(y.x) for y in v ]
Base.copy(x::MyFloat)::MyFloat = MyFloat(x.x)
Base.copy(v::Vector{MyFloat})::Vector{MyFloat} = [ MyFloat(y.x) for y in v ]
but this doesn't make any difference.
Final note: Letting a = MyFloat.([1.0, 2.0]), I could just use b = copy(a) and there is no speed penalty. This is fine, as long as I am careful to only ever do operations like b[1] = MyFloat(3.0) (which will modify b but not a). But if I get sloppy and accidentally write b[1].x = 3.0, then this will modify both a and b.
By the way, it is entirely possible that I do not have a deep understanding of the differences between copy and deepcopy... I have read this great blog post (thanks @ChrisRackauckas), but I'm certainly a bit fuzzy about what is happening at a deeper level.
Try changing type MyFloat in the definition to immutable MyFloat or struct MyFloat (the keyword changed in 0.6). This makes the times almost equal.
As @Gnimuc mentioned, a mutable type, which is not a bits type, makes Julia keep track of a lot of other stuff. See here and in the comments.
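The difference can be illustrated with a short sketch in Julia 1.x syntax, where struct is immutable by default and mutable struct corresponds to the old type keyword. The type names ImmutableFloat and MutableFloat are hypothetical stand-ins for MyFloat.

```julia
# Immutable version: a plain bits type, stored inline in a Vector.
struct ImmutableFloat
    x::Float64
end

# Mutable version: each element is a separate heap object that the Vector references.
mutable struct MutableFloat
    x::Float64
end

is_bits_immutable = isbitstype(ImmutableFloat)
is_bits_mutable = isbitstype(MutableFloat)

# With the immutable type, copy is already a full copy of the data.
a = ImmutableFloat.([1.0, 2.0])
b = copy(a)
b[1] = ImmutableFloat(3.0)       # changes b only

# With the mutable type, copy shares the elements; deepcopy does not.
c = MutableFloat.([1.0, 2.0])
d = copy(c)
d[1].x = 3.0                     # also visible through c
e = deepcopy(c)
e[1].x = 5.0                     # c is unaffected
```

With the immutable type, the sloppy b[1].x = 3.0 from the question is rejected with an error rather than silently modifying both vectors.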
I have pi approximation code very similar to that on the official page:
function piaprox()
    sum = 1.0
    for i = 2:m-1
        sum = sum + (1.0/(i*i))
    end
end
m = parse(Int,ARGS[1])
opak = parse(Int,ARGS[2])
@time for i = 0:opak
    piaprox()
end
When I compare the running times of C and Julia, Julia is significantly slower: almost 38 sec for m = 100000000 (the C time is 0.1608328933 sec). Why is this happening?
julia> m=100000000
julia> function piaprox()
           sum = 1.0
           for i = 2:m-1
               sum = sum + (1.0/(i*i))
           end
       end
piaprox (generic function with 1 method)
julia> @time piaprox()
28.482094 seconds (600.00 M allocations: 10.431 GB, 3.28% gc time)
I would like to mention two very important paragraphs from the Performance Tips section of the Julia documentation:
Avoid global variables: A global variable might have its value, and therefore its type, change at any point. This makes it difficult for the compiler to optimize code using global variables. Variables should be local, or passed as arguments to functions, whenever possible.....
The macro @code_warntype (or its function variant code_warntype()) can sometimes be helpful in diagnosing type-related problems.
julia> @code_warntype piaprox();
Variables:
sum::Any
#s1::Any
i::Any
It's clear from the @code_warntype output that the compiler could not infer the types of the local variables in piaprox(). So we declare the types and remove the global variable:
function piaprox(m::Int)
    sum::Float64 = 1.0
    i::Int = 0
    for i = 2:m-1
        sum = sum + (1.0/(i*i))
    end
end
julia> @time piaprox(100000000 )
0.009023 seconds (11.10 k allocations: 399.769 KB)
julia> @code_warntype piaprox(100000000);
Variables:
m::Int64
sum::Float64
i::Int64
#s1::Int64
EDIT
As @user3662120 commented, the very fast timing above is the result of a mistake: without a return value, LLVM may optimize the for loop away. After adding a return line, the @time result is:
julia> @time piaprox(100000000)
0.746795 seconds (11.11 k allocations: 400.294 KB, 0.45% gc time)
1.644934057834575
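For reference, a sketch of what the version with the return line might look like; the exact code used for the timing above is not shown in the answer, so this is an assumption. The accumulator is renamed to s only to avoid shadowing Base.sum. Passing m as an argument makes the type annotations unnecessary, and the partial sums approach π²/6.

```julia
# Partial sum of 1/i^2 for i = 1, ..., m-1, with m passed as an argument
# rather than read from a global variable.
function piaprox(m::Int)
    s = 1.0
    for i in 2:m-1
        s += 1.0 / (i * i)
    end
    return s   # returning the result keeps the loop from being optimized away
end

approx = piaprox(10^6)   # close to pi^2 / 6
```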
I want to find the key corresponding to the min or max value of a dictionary in Julia. In Python I would do the following:
my_dict = {1:20, 2:10}
min(my_dict, key=my_dict.get)
Which would return the key 2.
How can I do the same in Julia?
my_dict = Dict(1=>20, 2=>10)
minimum(my_dict)
The latter returns 1=>20 instead of 2=>10 or 2.
You could use reduce like this, which will return the key of the first smallest value in d:
reduce((x, y) -> d[x] ≤ d[y] ? x : y, keys(d))
This only works for non-empty Dicts, though. (But the notion of the “key of the minimal value of no values” does not really make sense, so that case should usually be handled separately anyway.)
Edit regarding efficiency.
Consider these definitions (none of which handle empty collections)...
m1(d) = reduce((x, y) -> d[x] ≤ d[y] ? x : y, keys(d))
m2(d) = collect(keys(d))[indmin(collect(values(d)))]
function m3(d)
    minindex(x, y) = d[x] ≤ d[y] ? x : y
    reduce(minindex, keys(d))
end
function m4(d)
    minkey, minvalue = next(d, start(d))[1]
    for (key, value) in d
        if value < minvalue
            minkey = key
            minvalue = value
        end
    end
    minkey
end
...along with this code:
function benchmark(n)
    d = Dict{Int, Int}(1 => 1)
    m1(d); m2(d); m3(d); m4(d)  # warm-up calls so compilation is not timed
    while length(d) < n
        setindex!(d, rand(-n:n), rand(-n:n))
    end
    @time m1(d)
    @time m2(d)
    @time m3(d)
    @time m4(d)
end
Calling benchmark(10000000) will print something like this:
1.455388 seconds (30.00 M allocations: 457.748 MB, 4.30% gc time)
0.380472 seconds (6 allocations: 152.588 MB, 0.21% gc time)
0.982006 seconds (10.00 M allocations: 152.581 MB, 0.49% gc time)
0.204604 seconds
From this we can see that m2 (from user3580870's answer) is indeed faster than my original solution m1 by a factor of around 3 to 4, and also uses less memory. This is apparently due to the function call overhead, but also to the fact that the λ expression in m1 is not optimized very well. We can alleviate the second problem by defining a helper function like in m3, which is better than m1, but not as good as m2.
However, m2 still allocates O(n) memory, which can be avoided: If you really need the efficiency, you should use an explicit loop like in m4, which allocates almost no memory and is also faster.
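Note that m4 uses the start/next iteration protocol, which no longer exists in Julia 1.x. A sketch of the same explicit loop for Julia 1.x might look as follows; the name min_key is hypothetical, and the empty-dictionary check is an addition not present in m4.

```julia
# Key of the smallest value in d, found with a single pass and no O(n) allocation.
# min_key is a hypothetical name; it mirrors m4 but uses first(d) to seed the loop.
function min_key(d::AbstractDict)
    isempty(d) && throw(ArgumentError("dictionary must be non-empty"))
    minkey, minvalue = first(d)      # a Pair destructures into key and value
    for (key, value) in d
        if value < minvalue
            minkey, minvalue = key, value
        end
    end
    return minkey
end

my_dict = Dict(1 => 20, 2 => 10)
k = min_key(my_dict)
```

In Julia 1.x, argmin(my_dict) should return the same key without any hand-written loop.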
another option is:
collect(keys(d))[indmin(collect(values(d)))]
it depends on properties of keys and values iterators which are not guaranteed, but in fact work for Dicts (and are guaranteed for OrderedDicts). like the reduce answer, d must be non-empty.
why mention this, when the reduce, pretty much nails it? it is 3 to 4 times faster (at least on my computer) !
Here is another way to find Min with Key and Value
my_dict = Dict(1 => 20, 2 =>10)
findmin(my_dict) gives the output as below
(10, 2)
to get only key use
findmin(my_dict)[2]
to get only value use
findmin(my_dict)[1]
Hope this helps.
If you only need the minimum value, you can use
minimum(values(my_dict))
If you need the key as well, I don't know a built-in function to do so, but you can easily write it yourself for numeric keys and values:
function find_min_key{K,V}(d::Dict{K,V})
    minkey = typemax(K)
    minval = typemax(V)
    for key in keys(d)
        if d[key] < minval
            minkey = key
            minval = d[key]
        end
    end
    minkey => minval
end
my_dict = Dict(1=>20, 2=>10)
find_min_key(my_dict)
findmax(dict)[2]
findmin(dict)[2]
Should also return the key corresponding to the max and min value(s). Here [2] is the index of the key in the returned tuple.