Simple Table Operation Has Very Large Compilation Time with MLJ - julia

I am trying to use MLJ on a DataFrame (30,000 rows x 8,000 columns), and every table operation seems to take a huge amount of time to compile, even though it runs fast once compiled.
I have given an example with code below in which a 5 x 5000 DataFrame is generated; it gets stuck on the unpack line (line 3). When I run the same code on a 5 x 5 DataFrame, line 3 outputs “2.872309 seconds (9.09 M allocations: 565.673 MiB, 6.47% gc time, 99.84% compilation time)”.
This is a crazy amount of compilation time for a seemingly simple task and I would like to know how I can reduce this.
Thank you,
Jack
using MLJ
using DataFrames
[line 1] @time arr = [[rand(1:10) for i in 1:5] for i in 1:5000];
output: 0.053668 seconds (200.76 k allocations: 11.360 MiB, 22.16% gc time, 99.16% compilation time)
[line 2] @time df = DataFrames.DataFrame(arr, :auto)
output: 0.267325 seconds (733.43 k allocations: 40.071 MiB, 4.29% gc time, 98.67% compilation time)
[line 3] @time y, X = unpack(df, ==(:x1));
does not finish running

It's not unexpected that the Julia compiler struggles with very wide DataFrames, which have (potentially) heterogeneous column types. That said, I'm not sure why this has to be a problem for this particular operation - I've checked with the MLJ maintainers, who can hopefully chime in.
In the meantime you can simply do
y, X = df.x1, select!(df, Not(:x1))
which is instantaneous. (Note that select! will drop x1 from your underlying data; if you want to copy the data instead, use select.)
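For completeness, a minimal sketch of the copying variant mentioned in that note, assuming the same df as in the question:

y = df.x1                    # extract the target column
X = select(df, Not(:x1))     # select copies, so df keeps all of its columns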

Please don't cross-post a problem on multiple websites without linking.
The question has been answered at the Julia forum: https://discourse.julialang.org/t/simple-table-operation-has-very-large-compilation-time-with-mlj/82503/2. It was caused by a bug which is fixed in MLJBase 0.20.5.

Related

Julia: parallelize operations on complex data structures (eg DataFrames)

I would like to process a number of large datasets in parallel. Unfortunately the speedup I am getting from using Threads.@threads is very sublinear, as the following simplified example shows.
(I'm very new to Julia, so apologies if I missed something obvious)
Let's create some dummy input data - 8 dataframes with 2 integer columns each and 10 million rows:
using DataFrames
n = 8
dfs = Vector{DataFrame}(undef, n)
for i = 1:n
dfs[i] = DataFrame(Dict("x1" => rand(1:Int64(1e7), Int64(1e7)), "x2" => rand(1:Int64(1e7), Int64(1e7))))
end
Now do some processing on each dataframe (group by x1 and sum x2)
function process(df::DataFrame)::DataFrame
combine(groupby(df, :x1), :x2 => sum)
end
Finally, compare the speed of doing the processing on a single dataframe with doing it on all 8 dataframes in parallel. The machine I'm running this on has 50 cores and Julia was started with 50 threads, so ideally there should not be much of a time difference.
julia> dfs_res = Vector{DataFrame}(undef, n)
julia> @time for i = 1:1
dfs_res[i] = process(dfs[i])
end
3.041048 seconds (57.24 M allocations: 1.979 GiB, 4.20% gc time)
julia> Threads.nthreads()
50
julia> @time Threads.@threads for i = 1:n
dfs_res[i] = process(dfs[i])
end
5.603539 seconds (455.14 M allocations: 15.700 GiB, 39.11% gc time)
So the parallel run takes almost twice as long per dataset (and this gets worse with more datasets). I have a feeling this has something to do with inefficient memory management: GC time is pretty high for the second run, and I assume the preallocation with undef isn't efficient for DataFrames.
Pretty much all the examples I've seen for parallel processing in Julia are done on numeric arrays with fixed, a-priori known sizes. Here, however, the datasets can have arbitrary sizes, columns, etc. In R, workflows like that can be done very efficiently with mclapply. Is there something similar (or a different but efficient pattern) in Julia? I chose threads rather than multiprocessing to avoid copying data (Julia doesn't seem to support the fork process model like R's mclapply).
Multithreading in Julia does not scale well beyond 16 threads.
Hence you need to use multiprocessing instead.
Your code might look like this:
using DataFrames, Distributed
addprocs(4) # or 50
@everywhere using DataFrames, Distributed
n = 8
dfs = Vector{DataFrame}(undef, n)
for i = 1:n
dfs[i] = DataFrame(Dict("x1" => rand(1:Int64(1e7), Int64(1e7)), "x2" => rand(1:Int64(1e7), Int64(1e7))))
end
@everywhere function process(df::DataFrame)::DataFrame
combine(groupby(df, :x1), :x2 => sum)
end
dfs_res = @distributed (vcat) for i = 1:n
df = process(dfs[i])
(i, myid(), df)
end
What is important in this type of code is that transferring data between processes takes time, so sometimes you might want to just keep the separate DataFrames on separate workers. As always, it depends on your processing architecture.
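For example, here is a minimal sketch of that pattern (the worker count and data sizes are illustrative, and @spawnat :any needs Julia 1.3+): each DataFrame is generated and aggregated on a worker, so only the small aggregated result travels back to the master process.

using Distributed, DataFrames
addprocs(4)
@everywhere using DataFrames
@everywhere process(df) = combine(groupby(df, :x1), :x2 => sum)

# Build and reduce each DataFrame on a worker; only the aggregated
# result is serialized back to the master process.
futures = [@spawnat :any process(DataFrame(x1 = rand(1:10^7, 10^7),
                                           x2 = rand(1:10^7, 10^7)))
           for _ in 1:8]
results = fetch.(futures)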
Edit: some notes on performance.
For benchmarking, have your code in functions and use consts (or use BenchmarkTools.jl):
using DataFrames
const dfs = [DataFrame(Dict("x1" => rand(1:Int64(1e7), Int64(1e7)), "x2" => rand(1:Int64(1e7), Int64(1e7)))) for i in 1:8 ]
function process(df::DataFrame)::DataFrame
combine(groupby(df, :x1), :x2 => sum)
end
function p1!(res, d)
for i = 1:8
res[i] = process(d[i])
end
end
function p2!(res, d)
Threads.@threads for i = 1:8
res[i] = process(d[i])
end
end
const dres = Vector{DataFrame}(undef, 8)
And here are the results:
julia> GC.gc(); @time p1!(dres, dfs)
30.840718 seconds (507.28 M allocations: 16.532 GiB, 6.42% gc time)
julia> GC.gc(); @time p1!(dres, dfs)
30.827676 seconds (505.66 M allocations: 16.451 GiB, 7.91% gc time)
julia> GC.gc(); @time p2!(dres, dfs)
18.002533 seconds (505.77 M allocations: 16.457 GiB, 23.69% gc time)
julia> GC.gc(); @time p2!(dres, dfs)
17.675169 seconds (505.66 M allocations: 16.451 GiB, 23.64% gc time)
Why is the difference only approx. 2x on an 8-core machine? Because we spend most of the time garbage collecting! (Look at the output in your question - the problem is the same.)
If your workload allocates less RAM you will see a better multithreading speed-up, up to around 3x.

Julia push! - is not the right command while trying to collect data

I am struggling with Julia every time that I need to collect data in an array "outside" functions.
If I use push!(array, element) I can collect data in the array, but if the code is inside a loop then the array "grows" on each iteration.
What do you recommend?
I know is quite basic :) but thank you!
I assume the reason why you do not want to use push! is that you have used Matlab before, where this sort of operation is painfully slow (to be precise, it is an O(n^2) operation, so doubling n quadruples the runtime). This is not true for Julia's push!, since push! grows the array's capacity geometrically (amortized doubling), which makes filling an array of length n only O(n) (so doubling n only doubles the runtime).
You can easily check this experimentally. In Matlab we have
>> n = 100000; tic; a = []; for i = 1:n; a = [a;0]; end; toc
Elapsed time is 2.206152 seconds.
>> n = 200000; tic; a = []; for i = 1:n; a = [a;0]; end; toc
Elapsed time is 8.301130 seconds.
so indeed the runtime quadruples for an n of twice the size. In contrast, in Julia we have
julia> using BenchmarkTools
function myzeros(n)
a = Vector{Int}()
for i = 1:n
push!(a,0)
end
return a
end
@btime myzeros(100_000);
@btime myzeros(200_000);
486.054 μs (17 allocations: 2.00 MiB)
953.982 μs (18 allocations: 3.00 MiB)
so indeed the runtime only doubles for an n of twice the size.
Long story short: if you know the size of the final array, then preallocating the array is always best (even in Julia). However, if you don't know the final array size, then you can use push! without losing too much performance.
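To make both options concrete, here is a small sketch (the function names are made up for illustration): preallocate when the final length is known, or use sizehint! to reserve capacity up front when it is only roughly known, so push! avoids intermediate reallocations.

function myzeros_prealloc(n)
    a = Vector{Int}(undef, n)   # allocate the full array once
    for i = 1:n
        a[i] = 0
    end
    return a
end

function myzeros_hinted(n)
    a = Vector{Int}()
    sizehint!(a, n)             # reserve capacity so push! never reallocates
    for i = 1:n
        push!(a, 0)
    end
    return a
end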

Changing datatype UInt64 to Float in Julia time

I am trying to calculate running time of a function in Julia. For example:
time = tic(); 7^12000000; toc()
I want to get the result as a float. The type of "time" is UInt64; can anyone help me convert it to Float64?
Thanks in advance
The issue is that tic and toc got removed in Julia 1.0 (on 0.7 they work but throw a deprecation warning). What I propose below works on Julia 0.6, 0.7 and 1.0.
You can use:
the @elapsed macro from Base, which returns the time taken in seconds as a Float64 (in particular, on the first call of the benchmarked function this includes compilation time as well as run time, but on consecutive runs only run time, as the called function will already be compiled)
the @belapsed macro from BenchmarkTools.jl, which returns the same but is more sophisticated (see BenchmarkTools.jl for details; the main difference is that it runs your function many times and reports the minimum observed time)
Here is an example:
julia> @elapsed sum(rand(10^6)) # includes compilation time
0.182671045
julia> @elapsed sum(rand(10^6)) # benchmarked functions are already precompiled
0.007848933
julia> using BenchmarkTools
julia> @belapsed sum(rand(10^6)) # minimum time from many runs
0.006249196
Your question is not clear. tic() and toc() no longer exist in Julia. Use the macro @time.
julia> @time Float64(UInt(7^12000))
0.000048 seconds (7 allocations: 208 bytes)
6.871777734182465e18
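For readers who want the elapsed time as a plain Float64 value rather than printed output, a small sketch using time_ns() (roughly what tic()/toc() did internally; the variable names are illustrative):

t0 = time_ns()                       # UInt64 nanosecond counter
7^12000
elapsed_s = (time_ns() - t0) / 1e9   # dividing by 1e9 promotes the result to Float64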

@parallel vs. native loops in julia

I ran some examples and noticed a pattern: for a large number of iterations the parallel version gives a good result, but for a smaller number of iterations it gives a worse one.
I know there is a little overhead and that's absolutely OK, but is there any way to run a loop with a small number of iterations in parallel so that it beats the sequential version?
x = 0
@time for i=1:200000000
x = Int(rand(Bool)) + x
end
7.503359 seconds (200.00 M allocations: 2.980 GiB, 2.66% gc time)
x = @time @parallel (+) for i=1:200000000
Int(rand(Bool))
end
0.432549 seconds (3.91 k allocations: 241.138 KiB)
I got a good result for the parallel version here, but not in the following example.
x2 = 0
@time for i=1:100000
x2 = Int(rand(Bool)) + x2
end
0.006025 seconds (98.97 k allocations: 1.510 MiB)
x2 = @time @parallel (+) for i=1:100000
Int(rand(Bool))
end
0.084736 seconds (3.87 k allocations: 239.122 KiB)
Doing things in parallel will ALWAYS be less efficient in total work, because parallelism always carries synchronization overhead. Still, the hope is to get the result earlier in wall time than a pure sequential call (one computer, single core).
Your numbers are astonishing, and I found the cause.
First of all, make sure Julia is allowed to use all your cores; check in the REPL:
julia> nworkers()
4
# original case to get correct relative times
julia> x = 0
julia> @time for i=1:200000000
x = Int(rand(Bool)) + x
end
7.864891 seconds (200.00 M allocations: 2.980 GiB, 1.62% gc time)
julia> x = @time @parallel (+) for i=1:200000000
Int(rand(Bool))
end
0.350262 seconds (4.08 k allocations: 254.165 KiB)
99991471
# now a correct benchmark
julia> function test()
x = 0
for i=1:200000000
x = Int(rand(Bool)) + x
end
end
julia> @time test()
0.465478 seconds (4 allocations: 160 bytes)
What happened?
Your first test case uses a global variable x, and that is terribly slow: the loop accesses a slow global variable 200,000,000 times.
In the second test case the global variable x is assigned just once, so the poor performance does not come into play.
In my test case there is no global variable; I used a local variable. Local variables are much faster (due to better compiler optimizations).
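As a side note for readers on Julia 1.x: @parallel has since moved into the Distributed standard library under the name @distributed, so the reduction above would be written roughly as follows.

using Distributed
addprocs(4)

# same reduction, Julia 1.x syntax (@parallel was renamed @distributed):
x = @distributed (+) for i = 1:200000000
    Int(rand(Bool))
end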
Q: is there any way to run some loop with less amount of iteration in parallel way better than sequential way?
A: Yes.
1) Acquire more resources (processors to compute, memory to store), if all this makes sense for the problem at hand
2) Arrange the workflow smarter - to benefit from register-based code, from harnessing the cache-lines' sizes upon each first fetch, and to deploy re-use where possible (hard work? yes, it is hard work, but why repetitively pay 150+ [ns] instead of having paid this once and re-using well-aligned neighbouring cells within ~30 [ns] latency costs (if NUMA permits)?). A smarter workflow also often means code re-design with respect to increasing the resulting assembly code's "density" of computations and tweaking the code so as to better bypass the (optimising) superscalar processor hardware design tricks, which are of no use / positive benefit in highly-tuned HPC computing payloads
3) Avoid headbangs into any blocking resources & bottlenecks (central singularities like a host's unique hardware source of randomness, IO devices, et al.)
4) Get familiar with your optimising compiler's internal options and "shortcuts" - sometimes anti-patterns get generated at the cost of extended run-times
5) Get the maximum from your underlying operating system's tweaking; without doing this, your optimised code still waits (and a lot) in the O/S scheduler's queue

Julia Memory Allocation for Addition of Two Matrices in place

I'm curious why Julia's implementation of matrix addition appears to make a copy. Here's an example:
foo1=rand(1000,1000)
foo2=rand(1000,1000)
foo3=rand(1000,1000)
julia> @time foo1=foo2+foo3;
0.001719 seconds (9 allocations: 7.630 MB)
julia> sizeof(foo1)/10^6
8.0
The amount of memory allocated is roughly the same as the memory required by a matrix of these dimensions.
It looks like in order to process foo2+foo3 memory is allocated to store the result and then foo1 is assigned to it by reference.
Does this mean that for most linear algebra operations we need to call BLAS and LAPACK functions directly to do things in place?
To understand what is going on here, let's consider what foo1 = foo2 + foo3 actually does.
First it evaluates foo2 + foo3. To do this, it will allocate a new temporary array to hold the output.
Then it will bind the name foo1 to this new temporary array, undoing all the effort you put into pre-allocating the output array.
In short, you see that memory usage is about that of the resultant array because the routine is indeed allocating new memory for an array of that size.
Here are some alternatives:
write a loop
use broadcast!
use copy!(foo1, foo2+foo3) - the array you pre-allocated will be filled, but the temporary is still allocated (see below)
the original version posted here, for comparison
Here's some code for those 4 cases
julia> function with_loop!(foo1, foo2, foo3)
for i in eachindex(foo2)
foo1[i] = foo2[i] + foo3[i]
end
end
julia> function with_broadcast!(foo1, foo2, foo3)
broadcast!(+, foo1, foo2, foo3)
end
julia> function with_copy!(foo1, foo2, foo3)
copy!(foo1, foo2+foo3)
end
julia> function original(foo1, foo2, foo3)
foo1 = foo2 + foo3
end
Now let's time these functions
julia> for f in [:with_broadcast!, :with_loop!, :with_copy!, :original]
@eval $f(foo1, foo2, foo3) # compile
println("timing $f")
@eval @time $f(foo1, foo2, foo3)
end
timing with_broadcast!
0.001787 seconds (5 allocations: 192 bytes)
timing with_loop!
0.001783 seconds (4 allocations: 160 bytes)
timing with_copy!
0.003604 seconds (9 allocations: 7.630 MB)
timing original
0.002702 seconds (9 allocations: 7.630 MB, 97.91% gc time)
You can see that with_loop! and with_broadcast! do about the same, and both are much faster and more memory-efficient than the others; with_copy! and original are both slower and use more memory.
In general, to do in-place operations I'd recommend starting out by writing a loop.
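On Julia 0.6 and later you can also get the in-place behaviour without spelling out either the loop or the broadcast! call, because dotted assignment fuses into a single in-place broadcast!; a minimal sketch:

# .= with dotted operators lowers to broadcast!(+, foo1, foo2, foo3),
# so no temporary array is allocated:
foo1 .= foo2 .+ foo3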
First, read @spencerlyon2's answer. Another approach is to use Dahua Lin's package Devectorize.jl. It defines the @devec macro, which automates the translation of vector (array) expressions into looped code.
In this example we will define with_devec!(foo1,foo2,foo3) as follows,
julia> using Devectorize # install with Pkg.add("Devectorize")
julia> with_devec!(foo1,foo2,foo3) = @devec foo1[:] = foo2 + foo3
Running the benchmark achieves the 4-allocation result.
You can use the axpy! function from the LinearAlgebra standard library.
using LinearAlgebra
julia> @time BLAS.axpy!(1., foo2, foo3)
0.002187 seconds (4 allocations: 160 bytes)
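One caveat on the call above (a usage note, not from the original answer): axpy!(a, x, y) overwrites y with a*x + y, so the sum ends up in foo3 rather than foo1.

using LinearAlgebra
# axpy!(a, x, y) computes y .= a .* x .+ y in place;
# after this call foo3 holds the element-wise sum foo2 + foo3:
BLAS.axpy!(1.0, foo2, foo3)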
