I would like to process a number of large datasets in parallel. Unfortunately the speedup I am getting from using Threads.@threads is very sublinear, as the following simplified example shows.
(I'm very new to Julia, so apologies if I missed something obvious)
Let's create some dummy input data - 8 dataframes with 2 integer columns each and 10 million rows:
using DataFrames
n = 8
dfs = Vector{DataFrame}(undef, n)
for i = 1:n
    dfs[i] = DataFrame(Dict("x1" => rand(1:Int64(1e7), Int64(1e7)), "x2" => rand(1:Int64(1e7), Int64(1e7))))
end
Now do some processing on each dataframe (group by x1 and sum x2)
function process(df::DataFrame)::DataFrame
    combine(groupby(df, :x1), :x2 => sum)
end
Finally, compare the speed of doing the processing on a single dataframe with doing it on all 8 dataframes in parallel. The machine I'm running this on has 50 cores and Julia was started with 50 threads, so ideally there should not be much of a time difference.
julia> dfs_res = Vector{DataFrame}(undef, n)
julia> @time for i = 1:1
           dfs_res[i] = process(dfs[i])
       end
3.041048 seconds (57.24 M allocations: 1.979 GiB, 4.20% gc time)
julia> Threads.nthreads()
50
julia> @time Threads.@threads for i = 1:n
           dfs_res[i] = process(dfs[i])
       end
5.603539 seconds (455.14 M allocations: 15.700 GiB, 39.11% gc time)
So the parallel run takes almost twice as long as processing a single dataframe, even though all eight could in principle run concurrently (and this gets worse with more datasets). I have a feeling this has something to do with inefficient memory management: GC time is pretty high for the second run, and I assume the preallocation with undef isn't effective for DataFrames. Pretty much all the examples I've seen for parallel processing in Julia operate on numeric arrays with fixed, a priori known sizes, whereas here the datasets can have arbitrary sizes, columns, etc. In R, workflows like this can be handled very efficiently with mclapply. Is there something similar (or a different but efficient pattern) in Julia? I chose threads over multi-processing to avoid copying data (Julia doesn't seem to support the fork process model of R / mclapply).
In practice, multithreading in Julia often scales poorly beyond roughly 16 threads for allocation-heavy workloads like this one, because the garbage collector becomes the bottleneck.
Hence you may want to use multiprocessing instead.
Your code might look like this:
using DataFrames, Distributed
addprocs(4) # or 50
@everywhere using DataFrames, Distributed
n = 8
dfs = Vector{DataFrame}(undef, n)
for i = 1:n
    dfs[i] = DataFrame(Dict("x1" => rand(1:Int64(1e7), Int64(1e7)), "x2" => rand(1:Int64(1e7), Int64(1e7))))
end
@everywhere function process(df::DataFrame)::DataFrame
    combine(groupby(df, :x1), :x2 => sum)
end
dfs_res = @distributed (vcat) for i = 1:n
    df = process(dfs[i])
    (i, myid(), df)
end
What is important in this type of code is that transferring data between processes takes time, so sometimes you might want to just keep separate DataFrames on separate workers. As always, it depends on your processing architecture.
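As an aside, the closest analogue of R's mclapply in this setup is pmap from Distributed: it ships each element to a worker, applies the function there, and returns the results in input order. A minimal sketch, assuming the dfs vector from the question has already been built on the master process:

using Distributed, DataFrames
addprocs(4)
@everywhere using DataFrames

@everywhere process(df::DataFrame) = combine(groupby(df, :x1), :x2 => sum)

# each call serializes one DataFrame to a worker and serializes the result back
dfs_res = pmap(process, dfs)

Remember that both directions of that transfer cost time, which is exactly the caveat above.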
Edit: some notes on performance
For benchmarking, put your code in functions and use consts (or use BenchmarkTools.jl):
using DataFrames
const dfs = [DataFrame(Dict("x1" => rand(1:Int64(1e7), Int64(1e7)), "x2" => rand(1:Int64(1e7), Int64(1e7)))) for i in 1:8]
function process(df::DataFrame)::DataFrame
    combine(groupby(df, :x1), :x2 => sum)
end
function p1!(res, d)
    for i = 1:8
        res[i] = process(d[i])
    end
end
function p2!(res, d)
    Threads.@threads for i = 1:8
        res[i] = process(d[i])
    end
end
const dres = Vector{DataFrame}(undef, 8)
And here are the results:
julia> GC.gc(); @time p1!(dres, dfs)
 30.840718 seconds (507.28 M allocations: 16.532 GiB, 6.42% gc time)
julia> GC.gc(); @time p1!(dres, dfs)
 30.827676 seconds (505.66 M allocations: 16.451 GiB, 7.91% gc time)
julia> GC.gc(); @time p2!(dres, dfs)
 18.002533 seconds (505.77 M allocations: 16.457 GiB, 23.69% gc time)
julia> GC.gc(); @time p2!(dres, dfs)
 17.675169 seconds (505.66 M allocations: 16.451 GiB, 23.64% gc time)
Why is the difference only around 2x on an 8-core machine? Because we spend most of the time garbage collecting! (Look at the output in your question - the problem is the same.)
When your code allocates less RAM, you will see a better multithreading speed-up, up to around 3x.
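As mentioned above, BenchmarkTools.jl makes this kind of comparison more robust (it runs multiple samples, and the $ interpolation avoids measuring global-variable overhead); a minimal sketch using the functions defined above:

using BenchmarkTools
@btime p1!($dres, $dfs);
@btime p2!($dres, $dfs);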
Related
I struggle with Julia every time I need to collect data in an array "outside" a function.
If I use push!(array, element) I can collect data in the array, but if the code is inside a loop then the array "grows" on each iteration.
What do you recommend?
I know it's quite basic :) but thank you!
I assume the reason why you do not want to use push! is because you have used Matlab before, where this sort of operation is painfully slow (to be precise, growing an array element by element is an O(n^2) operation, so doubling n quadruples the runtime). This is not true for Julia's push!: it uses the geometric capacity-growth algorithm described here, which is only O(n) in total (so doubling n only doubles the runtime).
You can easily check this experimentally. In Matlab we have
>> n = 100000; tic; a = []; for i = 1:n; a = [a;0]; end; toc
Elapsed time is 2.206152 seconds.
>> n = 200000; tic; a = []; for i = 1:n; a = [a;0]; end; toc
Elapsed time is 8.301130 seconds.
so indeed the runtime quadruples for an n of twice the size. In contrast, in Julia we have
julia> using BenchmarkTools

julia> function myzeros(n)
           a = Vector{Int}()
           for i = 1:n
               push!(a, 0)
           end
           return a
       end

julia> @btime myzeros(100_000);
  486.054 μs (17 allocations: 2.00 MiB)

julia> @btime myzeros(200_000);
  953.982 μs (18 allocations: 3.00 MiB)
so indeed the runtime only doubles for an n of twice the size.
Long story short: if you know the size of the final array, then preallocating the array is always best (even in Julia). However, if you don't know the final array size, then you can use push! without losing too much performance.
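A middle ground, when you know an upper bound on the final size but not the exact count, is sizehint!: it reserves capacity up front so subsequent push! calls rarely reallocate. A small sketch (myzeros_hinted is just an illustrative name):

function myzeros_hinted(n)
    a = Vector{Int}()
    sizehint!(a, n)  # reserve capacity for n elements up front
    for i = 1:n
        push!(a, 0)
    end
    return a
end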
I ran some examples and noticed that for a large number of iterations the parallel loop gives a good result, but for a smaller number of iterations it performs worse than the sequential one.
I know there is a little overhead and that's absolutely fine, but is there any way to run a loop with a small number of iterations in parallel faster than sequentially?
x = 0
@time for i = 1:200000000
    x = Int(rand(Bool)) + x
end
7.503359 seconds (200.00 M allocations: 2.980 GiB, 2.66% gc time)
x = @time @parallel (+) for i = 1:200000000
    Int(rand(Bool))
end
0.432549 seconds (3.91 k allocations: 241.138 KiB)
I get a good result for the parallel version here, but not in the following example.
x2 = 0
@time for i = 1:100000
    x2 = Int(rand(Bool)) + x2
end
0.006025 seconds (98.97 k allocations: 1.510 MiB)
x2 = @time @parallel (+) for i = 1:100000
    Int(rand(Bool))
end
0.084736 seconds (3.87 k allocations: 239.122 KiB)
Doing things in parallel will ALWAYS be less efficient in total work, because parallelism always carries synchronization overhead. The hope is to get the result earlier in wall-clock time than a pure sequential run (one computer, single core).
Your numbers are astonishing, though, and I found the cause.
First of all, make sure all cores are available; in the REPL:
julia> nworkers()
4
# original case to get correct relative times
julia> x = 0
julia> @time for i = 1:200000000
           x = Int(rand(Bool)) + x
       end
7.864891 seconds (200.00 M allocations: 2.980 GiB, 1.62% gc time)
julia> x = @time @parallel (+) for i = 1:200000000
           Int(rand(Bool))
       end
0.350262 seconds (4.08 k allocations: 254.165 KiB)
99991471
# now a correct benchmark
julia> function test()
           x = 0
           for i = 1:200000000
               x = Int(rand(Bool)) + x
           end
       end
julia> @time test()
0.465478 seconds (4 allocations: 160 bytes)
What happened?
Your first test case uses a global variable x, and that is terribly slow: the loop accesses a slow global variable 200,000,000 times.
In the second test case the global variable x is assigned only once, so the poor performance does not come into play.
In my test case there is no global variable; I used a local variable. Local variables are much faster (thanks to better compiler optimizations).
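A side note for readers on Julia 1.0 and later: the @parallel macro used here was renamed @distributed and moved to the Distributed standard library, so a corrected, global-free version of the parallel benchmark would look roughly like this sketch:

using Distributed
addprocs(4)

function ptest(n)
    # reducing with (+) sums each worker's partial result
    @distributed (+) for i = 1:n
        Int(rand(Bool))
    end
end

@time ptest(200000000)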
Q: is there any way to run a loop with a small number of iterations in parallel faster than sequentially?
A: Yes.
1) Acquire more resources (processors to compute, memory to store), if all this makes sense
2) Arrange the workflow smarter - benefit from register-based code, harness the cache-line sizes upon each first fetch, and deploy re-use where possible (hard work? yes, it is hard work, but why repetitively pay 150+ [ns] instead of paying once and re-using well-aligned neighbouring cells within ~30 [ns] latency costs, if NUMA permits?). A smarter workflow also often means code re-design to increase the resulting assembly code's "density" of computations and to tweak the code to by-pass the (optimising) superscalar processor's hardware design tricks, which are of no benefit in highly-tuned HPC computing payloads
3) Avoid head-banging into blocking resources & bottlenecks (central singularities like a host's unique hardware source of randomness, IO devices, et al.)
4) Get familiar with your optimising compiler's internal options and "shortcuts" - sometimes anti-patterns get generated at the cost of extended run-times
5) Get the maximum from your underlying operating system's tuning; without this, your optimised code still waits (a lot) in the O/S scheduler's queue
I'm curious why Julia's implementation of matrix addition appears to make a copy. Here's an example:
foo1=rand(1000,1000)
foo2=rand(1000,1000)
foo3=rand(1000,1000)
julia> @time foo1=foo2+foo3;
0.001719 seconds (9 allocations: 7.630 MB)
julia> sizeof(foo1)/10^6
8.0
The amount of memory allocated is roughly the same as the memory required by a matrix of these dimensions.
It looks like in order to process foo2+foo3 memory is allocated to store the result and then foo1 is assigned to it by reference.
Does this mean that for most linear algebra operations we need to call BLAS and LAPACK functions directly to do things in place?
To understand what is going on here, let's consider what foo1 = foo2 + foo3 actually does.
First it evaluates foo2 + foo3. To do this it will allocate a new temporary array to hold the output
Then it will bind the name foo1 to this new temporary array, undoing all effort you put in to pre-allocate the output array.
In short, you see that memory usage is about that of the resultant array because the routine is indeed allocating new memory for an array of that size.
Here are some alternatives:
write a loop
use broadcast!
We could try to do copy!(foo1, foo2+foo3) and then the array you pre-allocated will be filled, but it will still allocate the temporary (see below)
The original version posted here
Here's some code for those 4 cases
julia> function with_loop!(foo1, foo2, foo3)
           for i in eachindex(foo2)
               foo1[i] = foo2[i] + foo3[i]
           end
       end

julia> function with_broadcast!(foo1, foo2, foo3)
           broadcast!(+, foo1, foo2, foo3)
       end

julia> function with_copy!(foo1, foo2, foo3)
           copy!(foo1, foo2+foo3)
       end

julia> function original(foo1, foo2, foo3)
           foo1 = foo2 + foo3
       end
Now let's time these functions
julia> for f in [:with_broadcast!, :with_loop!, :with_copy!, :original]
           @eval $f(foo1, foo2, foo3) # compile
           println("timing $f")
           @eval @time $f(foo1, foo2, foo3)
       end
timing with_broadcast!
0.001787 seconds (5 allocations: 192 bytes)
timing with_loop!
0.001783 seconds (4 allocations: 160 bytes)
timing with_copy!
0.003604 seconds (9 allocations: 7.630 MB)
timing original
0.002702 seconds (9 allocations: 7.630 MB, 97.91% gc time)
You can see that with_loop! and with_broadcast! perform about the same, and both are much faster and more memory-efficient than the others. with_copy! and original are both slower and use more memory.
In general, to do in-place operations I'd recommend starting out by writing a loop.
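For completeness, current Julia also has fused dot syntax, which lowers to broadcast! and therefore fills the pre-allocated array without any temporary:

# both forms are equivalent to broadcast!(+, foo1, foo2, foo3)
foo1 .= foo2 .+ foo3
@. foo1 = foo2 + foo3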
First, read @spencerlyon2's answer. Another approach is to use Dahua Lin's package Devectorize.jl. It defines the @devec macro which automates the translation of vector (array) expressions into looped code.
In this example we will define with_devec!(foo1,foo2,foo3) as follows,
julia> using Devectorize # install with Pkg.add("Devectorize")
julia> with_devec!(foo1,foo2,foo3) = @devec foo1[:] = foo2 + foo3
Running the benchmark achieves the 4-allocation result.
You can use the axpy! function from the LinearAlgebra standard library. Note that axpy!(a, x, y) updates y in place, so the result below lands in foo3, not foo1.
using LinearAlgebra
julia> @time BLAS.axpy!(1., foo2, foo3)
0.002187 seconds (4 allocations: 160 bytes)
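A small check of the in-place semantics (axpy!(a, x, y) computes y = a*x + y and returns y):

using LinearAlgebra

x = [1.0, 2.0]
y = [10.0, 20.0]
BLAS.axpy!(2.0, x, y)   # y is now [12.0, 24.0]; x is unchanged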
Is there a good way to do parallel matrix multiplication in julia? I tried using DArrays, but it was significantly slower than just a single-thread multiplication.
Parallel in what sense? If you mean single-machine, multi-threaded, then Julia does this by default as OpenBLAS (the underlying linear algebra library used) is multithreaded.
If you mean multiple-machine, distributed-computing-style, then you will be encountering a lot of communications overhead that will only be worth it for very large problems, and a customized approach might be needed.
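You can inspect and control the OpenBLAS thread count from Julia itself; a quick sketch (BLAS.get_num_threads requires Julia 1.6 or later, set_num_threads is older):

using LinearAlgebra

BLAS.set_num_threads(4)   # cap OpenBLAS at 4 threads
BLAS.get_num_threads()    # confirm the setting

a = rand(2000, 2000); b = rand(2000, 2000);
@time a * b;              # multithreaded inside OpenBLAS, no Julia-level parallelism needed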
The problem is most likely that direct (possibly single-threaded) matrix multiplication is normally performed with an optimized library function. In the case of OpenBLAS, it is already multithreaded. For arrays of size 2000x2000, the simple matrix multiplication
@time c = sa * sb;
takes 0.3 seconds multithreaded and 0.7 seconds singlethreaded.
Splitting a single dimension of the multiplication makes the times even worse, reaching around 17 seconds in singlethreaded mode.
@time for j = 1:n
    sc[:,j] = sa[:,:] * sb[:,j]
end
shared arrays
The solution to your problem might be the use of shared arrays, which share the same data across your processes on a single computer. Please note that shared arrays are still marked as experimental.
using Distributed, SharedArrays

# create shared arrays and initialize them with random numbers
sa = SharedArray{Float64}((n,n), init = s -> s[localindices(s)] = rand(length(localindices(s))))
sb = SharedArray{Float64}((n,n), init = s -> s[localindices(s)] = rand(length(localindices(s))))
sc = SharedArray{Float64}((n,n));
Then you have to create a function, which performs a cheap matrix multiplication on a subset of the matrix.
@everywhere function mymatmul!(n, w, sa, sb, sc)
    # works only for 4 workers (ids 2 to 5) and n divisible by 4
    range = 1+(w-2) * div(n,4) : (w-1) * div(n,4)
    sc[:,range] = sa[:,:] * sb[:,range]
end
Finally, the main process tells the workers to work on their part.
@time @sync begin
    for w in workers()
        @async remotecall_wait(mymatmul!, w, n, w, sa, sb, sc)
    end
end
which takes around 0.3 seconds - the same as the multithreaded single-process time.
It sounds like you're interested in dense matrices, in which case see the other answers. Should you be (or become) interested in sparse matrices, see https://github.com/madeleineudell/ParallelSparseMatMul.jl.
I am trying to write a function that would solve any general version of this problem:
If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.
The way I solved it for this instance was:
multiples = Int[]
[ (i % 3 == 0 || i % 5 == 0) && push!(multiples, i) for i in 1:1000 ]
sum(multiples)
I want to write a function that will take an array of multiples (in this case, [3,5]) and the final number (in this case, 1000). The point is that the array can consist of arbitrarily many numbers, not just two (e.g., [3,5,6]). Then the function should run i % N == 0 for each N.
How do I do that most efficiently? Could that involve metaprogramming? (The code doesn't have to be in a list comprehension format.)
Thanks!
Very first thing that popped into my head was the following, using modulo division and a functional style:
v1(N,M) = sum(filter(k->any(j->k%j==0,M), 1:N))
But I wanted to investigate some alternatives, as this has two problems:
Two levels of anonymous function, something Julia doesn't optimize well (yet).
Creates a temporary array from the range, then sums.
So here is the most obvious alternative, the C-style version of the one-liner:
function v2(N,M)
    sum_so_far = 0
    for k in 1:N
        for j in M
            if k%j==0
                sum_so_far += k
                break
            end
        end
    end
    return sum_so_far
end
But then I thought about it some more, and remembered reading somewhere that modulo division is a slow operation. I wanted to see how IntSets perform - a set type specialized for integers. So here is another one-liner: IntSets, without any modulo division, in a functional style!
v3(N,M) = sum(union(map(j->IntSet(j:j:N), M)...))
Expanding the map into a for loop and repeatedly applying union! to a single IntSet was not much better, so I won't include that here. To break this down:
IntSet(j:j:N) is all the multiples of j between j and N
j->IntSet(j:j:N) is an anonymous function that returns that IntSet
map(j->IntSet(j:j:N), M) applies that function to each j in M, and returns a Vector{IntSet}.
The ... "splats" the vector out into arguments of union
union creates an IntSet that is the union of its arguments - in this case, all multiples of the numbers in M
Then we sum it to finish
I benchmarked these with
N,M = 10000000, [3,4,5]
which gives you
One-liner: 2.857292874 seconds (826006352 bytes allocated, 10.49% gc time)
C-style: 0.190581908 seconds (176 bytes allocated)
IntSet no modulo: 0.121820101 seconds (16781040 bytes allocated)
So you can definitely even beat C-style code with higher level objects - modulo is that expensive I guess! The neat thing about the no modulo one is it parallelizes quite easily:
addprocs(3)
@everywhere worker(j,N) = IntSet(j:j:N)
v4(N,M) = sum(union(pmap(j->worker(j,N),M)...))
@show v4(1000, [3,5])
@time v3(1000000000,[3,4,5]); # bigger N than before
@time v4(1000000000,[3,4,5]);
which gives
elapsed time: 12.279323079 seconds (2147831540 bytes allocated, 0.94% gc time)
elapsed time: 10.322364457 seconds (1019935752 bytes allocated, 0.71% gc time)
which isn't much better, but it's something, I suppose.
Okay, here is my updated answer.
Based on the benchmarks in the answer of @IainDunning, the method to beat is his v2. My approach below appears to be much faster, but I'm not clever enough to generalize it to input vectors of length greater than 2. A good mathematician should be able to improve on my answer.
Quick intuition: for length(M) = 2, the problem reduces to the sum of all multiples of M[1] up to N, added to the sum of all multiples of M[2] up to N, where, to avoid double-counting, we then subtract the sum of all multiples of lcm(M[1], M[2]) up to N (for coprime entries such as [3,5] that is just M[1]*M[2]). A similar algorithm could be implemented for length(M) > 2, but the double-counting issue becomes much more complicated very quickly. A general algorithm for this certainly exists (it is the kind of inclusion-exclusion issue that crops up all the time in combinatorics), but I don't know it off the top of my head.
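For what it's worth, here is a sketch of the general inclusion-exclusion algorithm that intuition describes: iterate over every non-empty subset of M, and add the sum of multiples of the subset's lcm when the subset has odd size, subtracting it when the size is even. The name sum_multiples is hypothetical, and the subset count grows as 2^length(M), so this only makes sense for small M:

function sum_multiples(N, M)
    total = 0
    for mask = 1:(2^length(M) - 1)   # each mask encodes a non-empty subset of M
        sel = [M[j] for j = 1:length(M) if (mask >> (j-1)) & 1 == 1]
        l = reduce(lcm, sel)         # common multiples of the subset are multiples of its lcm
        k = div(N, l)                # how many multiples of l are <= N
        subtotal = l * k * (k + 1) ÷ 2   # l + 2l + ... + kl
        total += isodd(count_ones(mask)) ? subtotal : -subtotal
    end
    return total
end

sum_multiples(1000, [3, 5])   # matches f1(1000, [3, 5])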
Here is the test code for my approach (f1) versus v2:
function f1(N, M)
    if length(M) > 2
        error("I'm not clever enough for this case")
    end
    runningSum = 0
    for c = 1:length(M)
        runningSum += sum(M[c]:M[c]:N)
    end
    for c1 = 1:length(M)
        for c2 = c1+1:length(M)
            temp1 = lcm(M[c1], M[c2])  # lcm, not the raw product, so non-coprime pairs are counted correctly
            runningSum -= sum(temp1:temp1:N)
        end
    end
    return runningSum
end
function v2(N, M)
    sum_so_far = 0
    for k in 1:N
        for j in M
            if k%j==0
                sum_so_far += k
                break
            end
        end
    end
    return sum_so_far
end
f1(1000, [3,5])
v2(1000, [3,5])
N = 10000000
M = [3,5]
#time f1(N, M)
#time v2(N, M)
Timings are:
elapsed time: 4.744e-6 seconds (96 bytes allocated)
elapsed time: 0.201480996 seconds (96 bytes allocated)
Sorry, this is an interesting problem but I'm afraid I have to get back to work :-) I'll check back in later if I get a chance...