Julia: Why is the memory blowing up inside this loop?

I have some multi-threaded code in which each thread calls a function f(df::DataFrame) which reads a column of that DataFrame and finds the indices where the column is greater than 0:
function f(df::DataFrame)
X = df[:time]
return findall(x->x>0, X)
end
Inside the main thread I read in an R *.rds file which Julia converts to a DataFrame which I'm passing to f() as follows:
rds = "blabla.rds"
objs = load(rds);
params = collect(0.5:0.005:0.7)
for i in 1:length(objs)
cols = [string(name) for name in names(objs.data[i]) if occursin("bla",string(name))]
hypers = [(a,b) for a in cols, b in params] # length ~2000
Threads.@threads for hi in 1:length(hypers) # MEMORY BLOWS UP HERE
df = f(objs.data[i])
end
end
Each df that is passed to f() is roughly 0.7GB. When the multi-threaded loop runs, memory usage climbs to ~30GB. There are 25 threads and ~2000 calls to f(). Any idea why the memory is exploding?
NOTE: The problem seems to be ameliorated by calling GC.gc() inside the loop every so often, which seems like a botch...
NOTE also: This happens whether I use a regular or a multi-threaded loop.
EDIT:
Profiling the code as follows:
function foo(objs)
for i in 1:length(objs)
df = objs.data[i]
Threads.@threads for hi in 1:2000
tmp = f(df)
end
end
end
@benchmark foo($objs)
gives
BenchmarkTools.Trial:
memory estimate: 32.93 GiB
allocs estimate: 48820
--------------
minimum time: 2.577 s (0.00% GC)
median time: 2.614 s (0.00% GC)
mean time: 2.614 s (0.00% GC)
maximum time: 2.651 s (0.00% GC)
--------------
samples: 2
evals/sample: 1
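Note that the 32.93 GiB reported by BenchmarkTools is the total allocated over the run, not peak resident memory: every call to f builds a fresh index vector with findall (and, depending on your DataFrames version, df[:time] may copy the column too), so ~2000 calls per outer iteration add up, with the GC reclaiming in bursts. Since f(objs.data[i]) does not depend on the loop variable hi, one way out is to hoist it above the threaded loop. A minimal sketch, assuming the result really is identical for every hi:

function foo_hoisted(objs)
    for i in 1:length(objs)
        df = objs.data[i]
        idx = f(df)  # one allocation per outer iteration instead of ~2000
        Threads.@threads for hi in 1:2000
            # ... per-hyperparameter work using the shared, read-only idx ...
        end
    end
end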

Related

Create a primitive Type that behaves like another Type

I would like to create a primitive type that behaves like, say, Int64, but is named Foo. Is this possible?
I learned here that I can do something like this
# Declare the new type.
primitive type MyInt8 <: Signed 8 end
# A constructor to create values of the type MyInt8.
MyInt8(x :: Int8) = reinterpret(MyInt8, x)
# A constructor to convert back.
Int8(x :: MyInt8) = reinterpret(Int8, x)
# This allows the REPL to show values of type MyInt8.
Base.show(io :: IO, x :: MyInt8) = print(io, Int8(x))
But I do not understand why, even though MyInt8 is a subtype of Signed (MyInt8 <: Signed), there are no methods for this type:
a = MyInt8(Int8(3))
b = MyInt8(Int8(4))
a + b
ERROR: promotion of types MyInt8 and MyInt8 failed to change any arguments
I thought that as a subtype the primitive type would automatically "get" all the methods of the supertype as well.
Where am I wrong here?
As a subtype of Signed, MyInt8 does automatically "get" all methods defined for numbers. And there are actually quite a lot of them:
julia> methodswith(MyInt8, supertypes=true) |> length
1218
julia> methodswith(MyInt8, supertypes=true)[1:10]
[1] poll_fd(s::RawFD, timeout_s::Real; readable, writable) in FileWatching at /home/francois/.local/julia-1.5.2/share/julia/stdlib/v1.5/FileWatching/src/FileWatching.jl:649
[2] poll_file(s::AbstractString, interval_seconds::Real) in FileWatching at /home/francois/.local/julia-1.5.2/share/julia/stdlib/v1.5/FileWatching/src/FileWatching.jl:784
[3] poll_file(s::AbstractString, interval_seconds::Real, timeout_s::Real) in FileWatching at /home/francois/.local/julia-1.5.2/share/julia/stdlib/v1.5/FileWatching/src/FileWatching.jl:784
[4] watch_file(s::AbstractString, timeout_s::Real) in FileWatching at /home/francois/.local/julia-1.5.2/share/julia/stdlib/v1.5/FileWatching/src/FileWatching.jl:687
[5] watch_folder(s::String, timeout_s::Real) in FileWatching at /home/francois/.local/julia-1.5.2/share/julia/stdlib/v1.5/FileWatching/src/FileWatching.jl:716
[6] watch_folder(s::AbstractString, timeout_s::Real) in FileWatching at /home/francois/.local/julia-1.5.2/share/julia/stdlib/v1.5/FileWatching/src/FileWatching.jl:715
[7] randcycle(r::Random.AbstractRNG, n::T) where T<:Integer in Random at /home/francois/.local/julia-1.5.2/share/julia/stdlib/v1.5/Random/src/misc.jl:348
[8] randcycle(n::Integer) in Random at /home/francois/.local/julia-1.5.2/share/julia/stdlib/v1.5/Random/src/misc.jl:349
[9] randexp(rng::Random.AbstractRNG, ::Type{T}, dim1::Integer, dims::Integer...) where T in Random at /home/francois/.local/julia-1.5.2/share/julia/stdlib/v1.5/Random/src/normal.jl:204
[10] randperm(r::Random.AbstractRNG, n::T) where T<:Integer in Random at /home/francois/.local/julia-1.5.2/share/julia/stdlib/v1.5/Random/src/misc.jl:282
Some of them don't require that MyInt8 conform to any specific interface, and will work "out of the box":
julia> a = MyInt8(Int8(3))
3
julia> conj(a)
3
But some will provide an implementation that depends on MyInt8 conforming to an interface:
# Here `imag` does not know how to build a zero of the MyInt8 type
julia> imag(a)
ERROR: MethodError: no method matching MyInt8(::Int64)
Closest candidates are:
MyInt8(::T) where T<:Number at boot.jl:716
MyInt8(::Int8) at REPL[7]:2
MyInt8(::Float16) where T<:Integer at float.jl:71
...
Stacktrace:
[1] convert(::Type{MyInt8}, ::Int64) at ./number.jl:7
[2] oftype(::MyInt8, ::Int64) at ./essentials.jl:367
[3] zero(::MyInt8) at ./number.jl:241
[4] imag(::MyInt8) at ./complex.jl:78
[5] top-level scope at REPL[37]:1
The addition falls into that last category:
# There is indeed a generic implementation that MyInt8 inherits
julia> which(+, (MyInt8, MyInt8))
+(a::Integer, b::Integer) in Base at int.jl:918
# But it relies on promotion features, which are not implemented for MyInt8
julia> a + a
ERROR: promotion of types MyInt8 and MyInt8 failed to change any arguments
Stacktrace:
[1] error(::String, ::String, ::String) at ./error.jl:42
[2] sametype_error(::Tuple{MyInt8,MyInt8}) at ./promotion.jl:306
[3] not_sametype(::Tuple{MyInt8,MyInt8}, ::Tuple{MyInt8,MyInt8}) at ./promotion.jl:300
[4] +(::MyInt8, ::MyInt8) at ./int.jl:921
[5] top-level scope at REPL[41]:1
In that particular case, the inherited implementation of + is only there to handle the addition of two operands of different types; I guess you'll have to implement + for MyInt8 yourself, in much the same way as Int8 implements its own addition:
julia> which(+, (Int8, Int8))
+(x::T, y::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} in Base at int.jl:86
I'd probably go for an implementation like the following:
julia> Base.:+(a::MyInt8, b::MyInt8) = MyInt8(Int8(a)+Int8(b))
julia> a + a
6
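If you also want mixed-type arithmetic (MyInt8 with Int8) to go through the generic fallbacks, a single promotion rule is enough, given the constructors defined above. A minimal sketch, assuming widening to Int8 is the behaviour you want:

# Promote mixed MyInt8/Int8 operations to Int8; the constructors defined
# earlier already provide the conversions that promotion relies on.
Base.promote_rule(::Type{MyInt8}, ::Type{Int8}) = Int8

julia> MyInt8(Int8(3)) + Int8(4)  # both operands promote to Int8
7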
EDIT:
This should reach the same performance as the native Int8 type:
# two Int8 vectors
julia> a1 = rand(Int8, 1000); b1 = rand(Int8, 1000);
# same vectors, as MyInt8 values
julia> a2 = MyInt8.(a1); b2 = MyInt8.(b1);
# check that both ways of doing the calculation produce the same results
julia> @assert a1 .+ b1 == Int8.(a2 .+ b2)
# Benchmark
julia> using BenchmarkTools
julia> @benchmark $a1 .+ $b1
BenchmarkTools.Trial:
memory estimate: 1.06 KiB
allocs estimate: 1
--------------
minimum time: 145.274 ns (0.00% GC)
median time: 176.402 ns (0.00% GC)
mean time: 200.996 ns (4.20% GC)
maximum time: 1.349 μs (79.34% GC)
--------------
samples: 10000
evals/sample: 759
julia> @benchmark $a2 .+ $b2
BenchmarkTools.Trial:
memory estimate: 1.06 KiB
allocs estimate: 1
--------------
minimum time: 140.316 ns (0.00% GC)
median time: 172.119 ns (0.00% GC)
mean time: 195.947 ns (4.54% GC)
maximum time: 1.148 μs (73.34% GC)
--------------
samples: 10000
evals/sample: 750

Progress bar slows the loop

I've done some tests with the progress bar and it slows down the test code considerably.
Are there any alternatives or solutions? I'm looking for a way to track the current index while looping. There are some primitive ways, like adding conditions to print only when a step is reached, but isn't there something good that's built in?
Oh, and one more question: is there a way to print the time elapsed since the function started and display it with the index? Let me clarify: I know about @time etc., but is there a way to count time and display it with the corresponding index, like
"Reached index $i in iteration in time $time"
Code for the tests done:
function test(x)
summ = BigInt(0);
Juno.progress(name = "foo") do id
for i = 1:x
summ+=i;
#info "foo" progress=i/x _id=id
end
end
println("sum up to $x is $summ");
return summ;
end
@benchmark test(10^4)
function test2(x)
summ = BigInt(0);
for i = 1:x
summ+=i;
(i%10 == 0) && println("Reached this milestone $i")
end
println("sum up to $x is $summ");
return summ;
end
@benchmark test2(10^4)
EDIT 1
for Juno.progress:
BenchmarkTools.Trial:
memory estimate: 21.66 MiB
allocs estimate: 541269
--------------
minimum time: 336.595 ms (0.00% GC)
median time: 345.875 ms (0.00% GC)
mean time: 345.701 ms (0.64% GC)
maximum time: 356.436 ms (1.34% GC)
--------------
samples: 15
evals/sample: 1
For the crude simple version:
BenchmarkTools.Trial:
memory estimate: 1.22 MiB
allocs estimate: 60046
--------------
minimum time: 111.251 ms (0.00% GC)
median time: 117.110 ms (0.00% GC)
mean time: 119.886 ms (0.51% GC)
maximum time: 168.116 ms (15.31% GC)
--------------
samples: 42
evals/sample: 1
I'd recommend using Juno.@progress directly for much better performance:
using BenchmarkTools
function test(x)
summ = BigInt(0)
Juno.progress(name = "foo") do id
for i = 1:x
summ += i
#info "foo" progress = i / x _id = id
end
end
println("sum up to $x is $summ")
return summ
end
@benchmark test(10^4) # min: 326ms
function test1(x)
summ = BigInt(0)
Juno.#progress "foo" for i = 1:x
summ += i
end
println("sum up to $x is $summ")
return summ
end
@benchmark test1(10^4) # min 5.4ms
function test2(x)
summ = BigInt(0)
for i = 1:x
summ += i
end
println("sum up to $x is $summ")
return summ
end
@benchmark test2(10^4) # min 0.756ms
function test3(x)
summ = BigInt(0);
for i = 1:x
summ+=i;
(i%10 == 0) && println("Reached this milestone $i")
end
println("sum up to $x is $summ");
return summ;
end
@benchmark test3(10^4) # min 33ms
Juno.progress itself can't throttle updates for you, but you can implement that manually:
function test4(x)
summ = BigInt(0)
update_interval = x÷200 # update every 0.5%
Juno.progress(name = "foo") do id
for i = 1:x
summ += i
if i % update_interval == 0
#info "foo" progress = i / x _id = id
end
end
end
println("sum up to $x is $summ")
return summ
end
@benchmark test4(10^4) # min: 5.2ms
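For the side question about printing elapsed time alongside the index, I don't know of a built-in combined form, but a minimal sketch with time() (wall-clock seconds) covers it:

function test5(x)
    summ = BigInt(0)
    t0 = time()  # wall-clock start time, in seconds
    for i = 1:x
        summ += i
        if i % max(x ÷ 100, 1) == 0
            println("Reached index $i in time $(round(time() - t0, digits=3))s")
        end
    end
    return summ
end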
As stated by High Performance Mark, writing to the screen is fundamentally slow (crazy fast on a human scale, very slow on a computer scale). You could abandon writing output to the progress bar entirely, but you can also simply update it less often. In your test case you're doing 10,000 additions and updating the progress bar 10,000 times. To be honest, I've never used Julia and I have no idea what the progress bar looks like, but even if it's a GUI progress bar on a 4K screen and each of those updates actually changes it, I guarantee a human can't see the difference. I would update it at the beginning (to 0) and at the end (to 100%), and otherwise use an if statement with a modulo test so it only updates every so many additions. Example below in Python, which I'll claim is pseudocode since I've never used Julia:
updateEvery = 2
for i in range(1, x + 1):
    sum += i
    if i % updateEvery == 0:  # test i, not x
        updateProgressBar(i / x)
By varying updateEvery you can decrease or increase the number of progress-bar updates. You can even calculate it dynamically based on x, say updateEvery = x/100, which would make the progress bar line up pretty well with percentages. The inefficiency caused by the updates is also probably meaningless for small values of x, and as x increases the number of updates per addition will decrease (because the total number of updates stays constant).
Oh, and if you really need performance down to the clock-tick level (which you probably don't), modulo is faster for powers of 2, since it can be done with a bitwise AND. I assume Julia will figure this optimisation out for you, so you can just use % and round updateEvery up to the next power of 2. Though if you really care about that level of performance, you'd be best off dropping the progress bar and eliminating the check from the loop altogether.
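To illustrate that last point, a minimal sketch of the power-of-two trick in Julia (the progress-bar hook is hypothetical; any update call goes there):

function throttled_loop(x)
    updateEvery = nextpow(2, max(x ÷ 100, 1))  # round the interval up to a power of 2
    mask = updateEvery - 1
    for i in 1:x
        if i & mask == 0  # equivalent to i % updateEvery == 0 for powers of 2
            # update_progress(i / x)  -- hypothetical progress-bar hook
        end
    end
end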

What is the fastest method(s) for reading and writing a matrix of Float64 to file in Julia

Let x = randn(100, 2). I want to write x to its own file. This file will contain x, and only x, and x will only ever be of type Matrix{Float64}. In the past I have always used HDF5 for this, but it occurs to me that this is overkill, since in this setup I will only have one array per file. Note that JLD uses HDF5, and so is also overkill.
1) What is the fastest method for reading and writing x assuming I will only ever want to read the entire matrix?
2) What is the fastest method for reading and writing x assuming I might want to read a slice of the matrix?
3) What is the fastest method for reading and writing x assuming I might want to read a slice of the matrix, or over-write a slice of the matrix (but not change the matrix size)?
You could use the serialize function, provided you heed the warnings in the documentation about non-guarantees between versions etc.
serialize(stream::IO, value)
Write an arbitrary value to a stream in an opaque format, such that it can be read back by deserialize. The read-back value will be as identical as possible to the original. In general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image. Ptr values are serialized as all-zero bit patterns (NULL).
An 8-byte identifying header is written to the stream first. To avoid writing the header, construct a SerializationState and use it as the first argument to serialize instead. See also Serializer.writeheader.
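For completeness, a minimal round-trip sketch (on Julia ≥ 0.7 these functions live in the Serialization standard library; before that they were in Base):

using Serialization

x = randn(100, 2)
open(io -> serialize(io, x), "x.slz", "w")  # write
y = open(deserialize, "x.slz")              # read back
@assert x == y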
Really though, JLD (or in fact, its successor, JLD2) is generally the recommended way*.
*Of particular interest to you might be the statements that: "JLD2 saves and loads Julia data structures in a format comprising a subset of HDF5, without any dependency on the HDF5 C library" and that "it typically outperforms the previous JLD package (sometimes by multiple orders of magnitude) and often outperforms Julia's built-in serializer".
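For the single-array case in the question, JLD2 usage is a one-liner each way; a minimal sketch:

using JLD2

x = randn(100, 2)
@save "x.jld2" x  # writes x under the name "x"
@load "x.jld2" x  # loads it back into a variable named x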
Based on the suggestions made by Tasos above, I put together a rudimentary speed test for both writes and reads using 4 different methods:
h5 (using the HDF5 package)
jld (using the JLD2 package)
slz (using serialize and deserialize)
dat (write to a binary file, using the first 128 bits to store the dimension of the matrix)
I've pasted the test code at the bottom of this answer. The results are:
julia> #time f_write_test(N, "h5")
0.191555 seconds (2.11 k allocations: 76.380 MiB, 26.39% gc time)
julia> #time f_write_test(N, "jld")
0.774857 seconds (8.33 k allocations: 77.058 MiB, 0.32% gc time)
julia> #time f_write_test(N, "slz")
0.108687 seconds (2.61 k allocations: 76.495 MiB, 1.91% gc time)
julia> #time f_write_test(N, "dat")
0.087488 seconds (1.61 k allocations: 76.379 MiB, 1.08% gc time)
julia> #time f_read_test(N, "h5")
0.051646 seconds (5.81 k allocations: 76.515 MiB, 14.80% gc time)
julia> #time f_read_test(N, "jld")
0.071249 seconds (10.04 k allocations: 77.136 MiB, 57.60% gc time)
julia> #time f_read_test(N, "slz")
0.038967 seconds (3.11 k allocations: 76.527 MiB, 22.17% gc time)
julia> #time f_read_test(N, "dat")
0.068544 seconds (1.81 k allocations: 76.405 MiB, 59.21% gc time)
So for writes, the write to binary option outperforms even serialize, and is twice as fast as HDF5 and almost an order of magnitude faster than JLD2.
For reads, deserialize has the best performance, while HDF5, JLD2 and reading from binary are all fairly close in performance, with HDF5 being slightly ahead.
I haven't included a test for writing to slices, but may come back to this in the future. Obviously writing to slices is impossible using serialize (not to mention the versioning/system-image issues serialize also faces), and I'm not really sure how to do it using JLD2. My gut feel is that writing a slice to binary will easily beat HDF5 if the slice is contiguous on disk, but will probably be significantly slower than HDF5 if it is non-contiguous and the HDF5 method optimally exploits chunking. If HDF5 doesn't exploit chunking (which implies knowing at write time what slices you will want), then I suspect the binary method will come out ahead.
In summary, I'm going to go with the binary method, as I think that at this stage it is clearly the overall winner.
I suspect that eventually JLD2 will probably be the method of choice, but there is a fair way to go (the package itself is very new, so the community hasn't had much time to work on optimisations etc.).
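For the slice-oriented use cases in questions 2 and 3, one further option is to memory-map the raw binary file. A hedged sketch with the Mmap standard library, assuming the same two-Int header layout as the dat format in the test code below:

using Mmap

function mmap_dat(fp::String; writable::Bool=false)
    io = open(fp, writable ? "r+" : "r")
    d1 = read(io, Int)
    d2 = read(io, Int)
    # The returned matrix is backed by the file: slicing touches only the
    # pages actually read, and with writable=true, in-place writes persist.
    return Mmap.mmap(io, Matrix{Float64}, (d1, d2))
end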
Test code follows:
using JLD2, HDF5
f_write_h5(fp::String, x::Matrix{Float64}) = h5write(fp, "G/D", x)
f_write_jld(fp::String, x::Matrix{Float64}) = @save fp x
f_write_slz(fp::String, x::Matrix{Float64}) = open(fid->serialize(fid, x), fp, "w")
f_write_dat_inner(fid1::IOStream, x::Matrix{Float64}) = begin ; write(fid1, size(x,1)) ; write(fid1, size(x,2)) ; write(fid1, x) ; end
f_write_dat(fp::String, x::Matrix{Float64}) = open(fid1->f_write_dat_inner(fid1, x), fp, "w")
f_read_h5(fp::String) = h5read(fp, "G/D")
f_read_jld(fp::String) = @load fp x
f_read_slz(fp::String) = open(deserialize, fp, "r")
f_read_dat_inner(fid1::IOStream) = begin ; d1 = read(fid1, Int) ; d2 = read(fid1, Int) ; read(fid1, Float64, (d1, d2)) ; end
f_read_dat(fp::String) = open(f_read_dat_inner, fp, "r")
function f_write_test(N::Int, filetype::String)
dp = "/home/colin/Temp/"
filetype == "h5" && [ f_write_h5("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
filetype == "jld" && [ f_write_jld("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
filetype == "slz" && [ f_write_slz("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
filetype == "dat" && [ f_write_dat("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
#[ rm("$(dp)$(n).$(filetype)") for n = 1:N ]
nothing
end
function f_read_test(N::Int, filetype::String)
dp = "/home/colin/Temp/"
filetype == "h5" && [ f_read_h5("$(dp)$(n).$(filetype)") for n = 1:N ]
filetype == "jld" && [ f_read_jld("$(dp)$(n).$(filetype)") for n = 1:N ]
filetype == "slz" && [ f_read_slz("$(dp)$(n).$(filetype)") for n = 1:N ]
filetype == "dat" && [ f_read_dat("$(dp)$(n).$(filetype)") for n = 1:N ]
[ rm("$(dp)$(n).$(filetype)") for n = 1:N ]
nothing
end
f_write_test(1, "h5")
f_write_test(1, "jld")
f_write_test(1, "slz")
f_write_test(1, "dat")
f_read_test(1, "h5")
f_read_test(1, "jld")
f_read_test(1, "slz")
f_read_test(1, "dat")
N = 100
#time f_write_test(N, "h5")
#time f_write_test(N, "jld")
#time f_write_test(N, "slz")
#time f_write_test(N, "dat")
#time f_read_test(N, "h5")
#time f_read_test(N, "jld")
#time f_read_test(N, "slz")
#time f_read_test(N, "dat")
Julia has two built-in functions, readdlm and writedlm, for doing this:
julia> x = randn(5, 5)
5×5 Array{Float64,2}:
-1.2837 -0.641382 0.611415 0.965762 -0.962764
0.106015 -0.344429 1.40278 0.862094 0.324521
-0.603751 0.515505 0.381738 -0.167933 -0.171438
-1.79919 -0.224585 1.05507 -0.753046 0.0545622
-0.110378 -1.16155 0.774612 -0.0796534 -0.503871
julia> writedlm("txtmat.txt", x, use_mmap=true)
julia> readdlm("txtmat.txt", use_mmap=true)
5×5 Array{Float64,2}:
-1.2837 -0.641382 0.611415 0.965762 -0.962764
0.106015 -0.344429 1.40278 0.862094 0.324521
-0.603751 0.515505 0.381738 -0.167933 -0.171438
-1.79919 -0.224585 1.05507 -0.753046 0.0545622
-0.110378 -1.16155 0.774612 -0.0796534 -0.503871
Definitely not the fastest way (use Mmap.mmap directly, as DanGetz suggested in the comments, if performance is a big deal), but it seems to be the simplest way, and the output file is human-readable.

Copying array columns

I have an algorithm that requires one column of an array to be replaced by another column of the same array.
I tried doing it with slices, and element-wise.
const M = 10^4
const N = 10^4
A = rand(Float32, M, N)
B = rand(Float32, N, M)
function copy_col!(A::Array{Float32,2},col1::Int,col2::Int)
A[1:end,col2] = A[1:end,col1]
end
function copy_col2!(A::Array{Float32,2},col1::Int,col2::Int)
for i in 1:size(A,1)
A[i,col2] = A[i,col1]
end
end
[Both functions and rand are called here once, for compilation]
#time (for i in 1:20000 copy_col!(B, rand(1:size(B,2)),rand(1:size(B,2)) ); end )
#time (for i in 1:20000 copy_col2!(B, rand(1:size(B,2)),rand(1:size(B,2)) ); end )
>> 0.607899 seconds (314.81 k allocations: 769.879 MB, 25.05% gc time)
>> 0.213387 seconds (117.96 k allocations: 2.410 MB)
Why does copying using slices perform so much worse? Is there a better way than what copy_col2! does?
A[1:end,col1] first makes a copy of the indexed column and then copies it over to A[1:end,col2], so copy_col! allocates more and runs longer. There are sub, slice, and view, which may remedy the allocations in this case.
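On current Julia versions (where sub and slice are gone), broadcasting into the destination through a view avoids the temporary entirely; a minimal sketch:

function copy_col3!(A::Matrix{Float32}, col1::Int, col2::Int)
    @views A[:, col2] .= A[:, col1]  # copy in place, no temporary array
    return A
end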

Julia is slow with cat command

I wanted to have a look at the Julia language, so I wrote a little script to import a dataset I'm working with. But when I run and profile the script, it turns out to be much slower than a similar script in R.
Profiling tells me that all the cat commands have bad performance.
The files look like this:
#
#Metadata
#
Identifier1 data_string1
Identifier2 data_string2
Identifier3 data_string3
Identifier4 data_string4
//
I primarily want to get the data_strings and split them up into a matrix of single characters.
This is a somewhat minimal code example:
function loadfile()
f = open("/file1")
first=true
m = Array(Any, 1,0)
for ln in eachline(f)
if ln[1] != '#' && ln[1] != '\n' && ln[1] != '/'
s = split(ln[1:end-1])
s = split(s[2],"")
if first
m = reshape(s,1,length(s))
first = false
else
s = reshape(s,1,length(s))
println(size(m))
println(size(s))
m = vcat(m, s)
end
end
end
end
Any idea why Julia might be slow with the cat command, or how I can do it differently?
Thanks for any suggestions!
Using cat like that is slow because it requires a lot of memory allocations: every time we do a vcat we allocate a whole new array m, which is mostly the same as the old m, so building the matrix row by row is quadratic in the number of rows. Here is how I'd rewrite your code in a more Julian way, where m is only created at the end:
function loadfile2()
f = open("./sotest.txt","r")
first = true
lines = Any[]
for ln in eachline(f)
if ln[1] == '#' || ln[1] == '\n' || ln[1] == '/'
continue
end
data_str = split(ln[1:end-1]," ")[2]
data_chars = split(data_str,"")
# Can make even faster (2x in my tests) with
# data_chars = [data_str[i] for i in 1:length(data_str)]
# But this inherently assumes ASCII data
push!(lines, data_chars)
end
m = hcat(lines...)' # Stick column vectors together then transpose
end
I made a 10,000 line version of your example data and found the following performance:
Old version:
elapsed time: 3.937826405 seconds (3900659448 bytes allocated, 43.81% gc time)
elapsed time: 3.581752309 seconds (3900645648 bytes allocated, 36.02% gc time)
elapsed time: 3.57753696 seconds (3900645648 bytes allocated, 37.52% gc time)
New version:
elapsed time: 0.010351067 seconds (11568448 bytes allocated)
elapsed time: 0.011136188 seconds (11568448 bytes allocated)
elapsed time: 0.010654002 seconds (11568448 bytes allocated)
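For reference, a sketch of the same approach on current Julia versions (eachline accepts a filename, and reduce(hcat, ...) has a fast specialisation for vectors of vectors):

function loadfile3(path)
    lines = Vector{Vector{Char}}()
    for ln in eachline(path)
        (isempty(ln) || ln[1] in ('#', '/')) && continue
        push!(lines, collect(split(ln)[2]))  # second whitespace-separated field, as characters
    end
    return permutedims(reduce(hcat, lines))  # stick columns together, then transpose
end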
