I am struggling with Julia every time that I need to collect data in an array "outside" functions.
If I use push!(array, element) I can collect data in the array, but if the code is inside a loop then the array "grows" each time.
What do you recommend?
I know it's quite basic :) but thank you!
I assume the reason you do not want to use push! is that you have used Matlab before, where this sort of operation is painfully slow (to be precise, growing an array to length n element by element is an O(n^2) operation there, so doubling n quadruples the runtime). This is not true for Julia's push!: it uses the geometric growth algorithm described here, so n pushes take only O(n) time in total (doubling n only doubles the runtime).
You can easily check this experimentally. In Matlab we have
>> n = 100000; tic; a = []; for i = 1:n; a = [a;0]; end; toc
Elapsed time is 2.206152 seconds.
>> n = 200000; tic; a = []; for i = 1:n; a = [a;0]; end; toc
Elapsed time is 8.301130 seconds.
so indeed the runtime quadruples for an n of twice the size. In contrast, in Julia we have
julia> using BenchmarkTools
function myzeros(n)
    a = Vector{Int}()
    for i = 1:n
        push!(a, 0)
    end
    return a
end
@btime myzeros(100_000);
@btime myzeros(200_000);
486.054 μs (17 allocations: 2.00 MiB)
953.982 μs (18 allocations: 3.00 MiB)
so indeed the runtime only doubles when n doubles.
Long story short: if you know the size of the final array, then preallocating the array is always best (even in Julia). However, if you don't know the final array size, then you can use push! without losing too much performance.
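As a middle ground (my addition, not part of the original answer): if you only know a rough upper bound on the final size, `sizehint!` lets you reserve capacity up front so that `push!` rarely needs to reallocate. A minimal sketch, where the function name `collect_squares` is just an illustration:

```julia
# Collect squares of the odd numbers up to n. The final length is not
# known in advance, but n is an upper bound, so we reserve that much
# capacity; push! then won't reallocate until the hint is exceeded.
function collect_squares(n)
    a = Vector{Int}()
    sizehint!(a, n)
    for i in 1:n
        iseven(i) || push!(a, i^2)   # push only for odd i
    end
    return a
end

collect_squares(10)  # [1, 9, 25, 49, 81]
```

The hint is purely an optimization: the result is identical with or without it.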
Related
I have a function that returns an array. I'd like to map the function over a vector of inputs and have the output be a simple concatenation of all the arrays. The function is:
function log_it(r, bzero = 0.25, N = 400)
    main = rand(Float16, N + 150)
    main[1] = bzero
    for i in 2:N+150
        main[i] = *(r, main[i-1], (1 - main[i-1]))
    end
    y = unique(main[(N+1):(N+150)])
    r_vec = repeat([r], size(y)[1])
    hcat(r_vec, y)
end
and I can map it fine:
map(log_it, 2.4:0.001:2.405)
but the result is gross:
[2.4 0.58349609375]
[2.401 0.58349609375]
[2.402 0.583984375; 2.402 0.58349609375]
[2.403 0.583984375]
[2.404 0.583984375]
[2.405 0.58447265625; 2.405 0.583984375]
NB, the length of the nested arrays is unbounded - I'm looking for a solution that doesn't depend on knowing the length of the nested arrays in advance.
What I want is something like this:
2.4 0.583496
2.401 0.583496
2.402 0.583984
2.402 0.583496
2.403 0.583984
2.404 0.583984
2.405 0.584473
2.405 0.583984
Which I made using a for loop:
results = Array{Float64, 2}(undef, 0, 2)
for i in 2.4:0.001:2.405
    results = cat(results, log_it(i), dims = 1)
end
results
The code works fine, but the for loop takes about four times as long. I also feel like map is the right way to do it and I'm just missing something - either in executing map in such a way that it returns a nice vector of arrays, or in some mutation of the array that will "unnest". I've tried looking through functions like flatten and collect but can't find anything.
Many thanks in advance!
Are you sure you're benchmarking this correctly? Especially with very fast operations, benchmarking can be tricky. As a starting point, I would recommend always wrapping any code you want to benchmark in a function, and using the BenchmarkTools package to get reliable timings.
There generally shouldn't be a performance penalty for writing loops in Julia, so a 3x increase in runtime for a loop compared to map sounds suspicious.
Here's what I get:
julia> using BenchmarkTools
julia> @btime map(log_it, 2.4:0.001:2.405)
121.426 μs (73 allocations: 14.50 KiB)
julia> function with_loop()
           results = Array{Float64, 2}(undef, 0, 2)
           for i in 2.4:0.001:2.405
               results = cat(results, log_it(i), dims = 1)
           end
           results
       end
julia> @btime with_loop()
173.492 μs (295 allocations: 23.67 KiB)
So the loop is about 50% slower, but that's because you're allocating more.
When you're using map there's usually a more Julia way of expressing what you're doing using broadcasting. This works for any user defined function:
julia> @btime log_it.(2.4:0.001:2.405)
121.434 μs (73 allocations: 14.50 KiB)
This is equivalent to your map expression. What you're looking for, I think, is just a way to stack all the resulting arrays - you can use vcat and splatting for that:
julia> @btime vcat(log_it.(2.4:0.001:2.405)...)
122.837 μs (77 allocations: 14.84 KiB)
and just to confirm:
julia> vcat(log_it.(2.4:0.001:2.405)...) == with_loop()
true
So using broadcasting and concatenating gives the same result as your loop at the speed and memory cost of your map solution.
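One caveat worth adding (my aside, not from the original answer): splatting with `...` turns every array into a separate function argument, which can get slow when the collection is large. `reduce(vcat, xs)` has a specialized method for exactly this case and stacks a whole vector of arrays efficiently:

```julia
# reduce(vcat, xs) uses an optimized method for concatenating a
# collection of arrays, avoiding the splatting overhead of vcat(xs...).
xs = [rand(2, 2) for _ in 1:1000]
stacked = reduce(vcat, xs)
size(stacked)  # (2000, 2)
```

For the half-dozen arrays in this question the difference is negligible, but for thousands of results `reduce(vcat, ...)` scales much better.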
I have two versions of code that seem to do the same thing:
sum = 0
for x in 1:100
    sum += x
end

sum = 0
for x in collect(1:100)
    sum += x
end
Is there a practical difference between the two approaches?
In Julia, 1:100 returns a particular struct called UnitRange that looks like this:
julia> dump(1:100)
UnitRange{Int64}
start: Int64 1
stop: Int64 100
This is a very compact struct representing ranges with step 1 and arbitrary (finite) size. UnitRange is a subtype of AbstractRange, a type representing ranges with arbitrary step, which is itself a subtype of AbstractVector.
Instances of UnitRange compute their elements on the fly whenever you use getindex (or the syntactic sugar vector[index]). For example, with @less (1:100)[3] you can see this method:
function getindex(v::UnitRange{T}, i::Integer) where {T<:OverflowSafe}
    @_inline_meta
    val = v.start + (i - 1)
    @boundscheck _in_unit_range(v, val, i) || throw_boundserror(v, i)
    val % T
end
This returns the i-th element of the vector by adding i - 1 to the first element (start) of the range. Some functions have optimised methods for UnitRange, or more generally for AbstractRange. For instance, with @less sum(1:100) you can see the following:
function sum(r::AbstractRange{<:Real})
    l = length(r)
    # note that a little care is required to avoid overflow in l*(l-1)/2
    return l * first(r) + (iseven(l) ? (step(r) * (l-1)) * (l>>1)
                                     : (step(r) * l) * ((l-1)>>1))
end
This method uses the formula for the sum of an arithmetic progression, which is extremely efficient as it's evaluated in a time independent of the size of the vector.
On the other hand, collect(1:100) returns a plain Vector with one hundred elements 1, 2, 3, ..., 100. The main difference with UnitRange (or other types of AbstractRange) is that getindex(vector::Vector, i) (or vector[i], with vector::Vector) doesn't do any computation but simply accesses the i-th element of the vector. The downside of a Vector over a UnitRange is that generally speaking there aren't efficient methods when working with them as the elements of this container are completely arbitrary, while UnitRange represents a set of numbers with peculiar properties (sorted, constant step, etc...).
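To make the contrast concrete, here is a small sketch (my illustration): both containers answer `getindex` with the same values, but one computes them while the other reads stored memory, and their footprints differ accordingly:

```julia
r = 1:100            # UnitRange: stores only start and stop
v = collect(1:100)   # Vector: stores all 100 elements in memory

r[42] == v[42]       # true: same value, computed vs. stored
sizeof(r)            # 16 bytes (two Int64 fields)
sizeof(v)            # 800 bytes (100 × 8-byte elements)
```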
If you compare the performance of methods for which UnitRange has super-efficient implementations, this type will win hands down (note the use of interpolation of variables with $(...) when using macros from BenchmarkTools):
julia> using BenchmarkTools
julia> @btime sum($(1:1000_000))
0.012 ns (0 allocations: 0 bytes)
500000500000
julia> @btime sum($(collect(1:1000_000)))
229.979 μs (0 allocations: 0 bytes)
500000500000
Remember that UnitRange comes with the cost of dynamically computing the elements every time you access them with getindex. Consider for example this function:
function test(vec)
    sum = zero(eltype(vec))
    for idx in eachindex(vec)
        sum += vec[idx]
    end
    return sum
end
Let's benchmark it with a UnitRange and a plain Vector:
julia> @btime test($(1:1000_000))
812.673 μs (0 allocations: 0 bytes)
500000500000
julia> @btime test($(collect(1:1000_000)))
522.828 μs (0 allocations: 0 bytes)
500000500000
In this case the function is faster with the plain array than with the UnitRange, because with the array it doesn't have to compute one million elements on the fly.
Of course, in these toy examples it'd be more sensible to iterate over the elements of vec rather than its indices, but in real-world code situations like these can come up. This last example, however, shows that a UnitRange is not necessarily more efficient than a plain array, especially if you need to compute all of its elements. UnitRanges are more efficient when you can take advantage of specialised methods (like sum) for which the operation can be performed in constant time.
As a final remark, note that if you start with a UnitRange it's not necessarily a good idea to convert it to a plain Vector to get good performance, especially if you're going to use it only once or very few times, as the conversion itself involves computing all elements of the range and allocating the necessary memory:
julia> @btime collect($(1:1000_000));
422.435 μs (2 allocations: 7.63 MiB)
julia> @btime test(collect($(1:1000_000)))
882.866 μs (2 allocations: 7.63 MiB)
500000500000
A programme I am writing has a user-written file containing parameters which are to be read in and used within the code. Users should be able to comment their input file by delimiting comments with a comment character (I have gone with "#", in keeping with Julia convention); in parsing the input file, the code removes these comments. Whilst making minor optimisations to this parser, I noted that instantiating the second variable prior to calling split() made a noticeable difference to the number of allocations:
function removecomments1(line::String; dlm::String="#")
    str::String = ""
    try
        str, tmp = split(line, dlm)
    catch
        str = line
    finally
        return str
    end
end

function removecomments2(line::String; dlm::String="#")
    str::String = ""
    tmp::SubString{String} = ""
    try
        str, tmp = split(line, dlm)
    catch
        str = line
    finally
        return str
    end
end
line = "Hello world # my comment"
@time removecomments1(line)
@time removecomments2(line)
$> 0.016092 seconds (27.31 k allocations: 1.367 MiB)
0.016164 seconds (31.26 k allocations: 1.548 MiB)
My intuition (coming from a C++ background) tells me that initialising both variables should have resulted in an increase in speed as well as minimising further allocations, since the compiler has already been told that a second variable is required as well as its corresponding type, however this doesn't appear to hold. Why would this be the case?
Aside: Are there any more efficient ways of achieving the same result as these functions?
EDIT:
Following a post by Oscar Smith, initialising str as type SubString{String} instead of String has reduced the allocations by around 10%:
$> 0.014811 seconds (24.29 k allocations: 1.246 MiB)
0.015045 seconds (28.25 k allocations: 1.433 MiB)
In your example, the only reason you need the try-catch block is that you're trying to destructure the output of split even though split returns a one-element array when the input line has no comment. If you simply extract the first element of split's output, you can avoid the try-catch construct, which will save you time and memory:
julia> using BenchmarkTools
julia> removecomments3(line::String; dlm::String = "#") = first(split(line, dlm))
removecomments3 (generic function with 1 method)
julia> @btime removecomments1($line);
198.522 ns (5 allocations: 224 bytes)
julia> @btime removecomments2($line);
208.507 ns (6 allocations: 256 bytes)
julia> @btime removecomments3($line);
147.001 ns (4 allocations: 192 bytes)
In partial answer to your original question, pre-allocation is mainly used for arrays, not for strings or other scalars. For more discussion of when to use pre-allocation, check out this SO post.
To reason about what this is doing, think about what the split function would return if it were written in C++. It would not copy the string, but would instead return a char*. As such, all that str::String = "" does is make Julia create an extra String object that is then ignored.
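Taking that reasoning one step further, you can avoid materializing the array from `split` altogether. This sketch (a hypothetical `removecomments4`, my addition to the thread) locates the delimiter with `findfirst` and returns a `SubString` view up to it, so no array is allocated at all:

```julia
# Hypothetical variant: find the delimiter and return a SubString view,
# skipping the array allocation that split performs.
function removecomments4(line::AbstractString; dlm::AbstractString = "#")
    r = findfirst(dlm, line)   # index range of the delimiter, or nothing
    return r === nothing ? SubString(line) : SubString(line, 1, first(r) - 1)
end

removecomments4("Hello world # my comment")  # "Hello world "
removecomments4("no comment here")           # "no comment here"
```

Like `first(split(...))`, this keeps any whitespace before the delimiter; trim with `rstrip` if that matters.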
I ran some examples and noticed that for a large number of iterations the parallel version performs well, but for a small number of iterations it performs worse than the sequential one.
I know there is some overhead and that's absolutely fine, but is there any way to make a loop with a small number of iterations run faster in parallel than sequentially?
x = 0
@time for i=1:200000000
    x = Int(rand(Bool)) + x
end
7.503359 seconds (200.00 M allocations: 2.980 GiB, 2.66% gc time)
x = @time @parallel (+) for i=1:200000000
    Int(rand(Bool))
end
0.432549 seconds (3.91 k allocations: 241.138 KiB)
I get a good result for parallel here, but not in the following example.
x2 = 0
@time for i=1:100000
    x2 = Int(rand(Bool)) + x2
end
0.006025 seconds (98.97 k allocations: 1.510 MiB)
x2 = @time @parallel (+) for i=1:100000
    Int(rand(Bool))
end
0.084736 seconds (3.87 k allocations: 239.122 KiB)
Doing things in parallel ALWAYS carries extra cost, because parallel work always has synchronization overhead. The hope is simply to get the result earlier in wall-clock time than a purely sequential run (one computer, single core) would.
Your numbers are surprising, and I found the cause.
First of all, make sure Julia is allowed to use all cores; check in the REPL:
julia> nworkers()
4
# original case to get correct relative times
julia> x = 0
julia> @time for i=1:200000000
           x = Int(rand(Bool)) + x
       end
7.864891 seconds (200.00 M allocations: 2.980 GiB, 1.62% gc time)
julia> x = @time @parallel (+) for i=1:200000000
           Int(rand(Bool))
       end
0.350262 seconds (4.08 k allocations: 254.165 KiB)
99991471
# now a correct benchmark
julia> function test()
           x = 0
           for i=1:200000000
               x = Int(rand(Bool)) + x
           end
       end
julia> @time test()
0.465478 seconds (4 allocations: 160 bytes)
What happened?
Your first test case uses a global variable x, and that is terribly slow: the loop reads and writes an untyped global variable 200,000,000 times.
In the second test case the global variable x is assigned only once, so the cost of the global doesn't come into play.
In my test case there is no global variable; I used a local variable. Local variables are much faster (due to better compiler optimizations).
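The global-versus-local difference can be illustrated directly (my sketch, not from the original answer): the compiler cannot assume an untyped global keeps its type, so every `+` on it is dispatched dynamically, while a local stays a known `Int` and compiles to a tight loop:

```julia
# Untyped global: x could be rebound to any type at any time,
# so every addition goes through dynamic dispatch and boxes the value.
x = 0
function sum_into_global(n)
    global x
    for i in 1:n
        x += 1
    end
end

# Local variable: the compiler proves `s` stays an Int,
# so the loop compiles to fast native code.
function sum_local(n)
    s = 0
    for i in 1:n
        s += 1
    end
    return s
end

sum_local(10^6)   # fast; sum_into_global(10^6) is orders of magnitude slower
```

Declaring the global as `const x = 0` (or annotating its type) would also let the compiler optimize, at the cost of fixing its type.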
Q: is there any way to run some loop with less amount of iteration in parallel way better than sequential way?
A: Yes.
1) Acquire more resources (processors to compute, memory to store), if that makes sense for the problem
2) Arrange the workflow smarter - benefit from register-based code, harness the cache-line size on each first fetch, and re-use data where possible (hard work? yes, but why repeatedly pay 150+ [ns] instead of paying it once and re-using well-aligned neighbouring cells within ~30 [ns] latency costs, if NUMA permits?). A smarter workflow also often means re-designing the code to increase the resulting assembly code's "density" of computations and to avoid defeating the (optimising) superscalar processor hardware, whose tricks bring no benefit to highly-tuned HPC computing payloads
3) Avoid running into blocking resources & bottlenecks (central singularities such as a host's unique hardware source of randomness, IO devices et al.)
4) Get familiar with your optimising compiler's internal options and "shortcuts" - sometimes anti-patterns get generated at the cost of extended run-times
5) Get the maximum from your underlying operating system's tuning; otherwise your optimised code still waits (a lot) in the O/S scheduler's queue
I'm curious why Julia's implementation of matrix addition appears to make a copy. Here's an example:
foo1=rand(1000,1000)
foo2=rand(1000,1000)
foo3=rand(1000,1000)
julia> @time foo1 = foo2 + foo3;
0.001719 seconds (9 allocations: 7.630 MB)
julia> sizeof(foo1)/10^6
8.0
The amount of memory allocated is roughly the same as the memory required by a matrix of these dimensions.
It looks like in order to process foo2+foo3 memory is allocated to store the result and then foo1 is assigned to it by reference.
Does this mean that for most linear algebra operations we need to call BLAS and LAPACK functions directly to do things in place?
To understand what is going on here, let's consider what foo1 = foo2 + foo3 actually does.
First it evaluates foo2 + foo3. To do this, Julia allocates a new temporary array to hold the output.
Then it binds the name foo1 to this new temporary array, undoing any effort you put into pre-allocating the output array.
In short, you see that memory usage is about that of the resultant array because the routine is indeed allocating new memory for an array of that size.
Here are some alternatives:
write a loop
use broadcast!
We could try to do copy!(foo1, foo2+foo3), and then the array you pre-allocated will be filled, but it will still allocate the temporary (see below)
The original version posted here
Here's some code for those 4 cases
julia> function with_loop!(foo1, foo2, foo3)
           for i in eachindex(foo2)
               foo1[i] = foo2[i] + foo3[i]
           end
       end

julia> function with_broadcast!(foo1, foo2, foo3)
           broadcast!(+, foo1, foo2, foo3)
       end

julia> function with_copy!(foo1, foo2, foo3)
           copy!(foo1, foo2+foo3)
       end

julia> function original(foo1, foo2, foo3)
           foo1 = foo2 + foo3
       end
Now let's time these functions
julia> for f in [:with_broadcast!, :with_loop!, :with_copy!, :original]
           @eval $f(foo1, foo2, foo3) # compile
           println("timing $f")
           @eval @time $f(foo1, foo2, foo3)
       end
timing with_broadcast!
0.001787 seconds (5 allocations: 192 bytes)
timing with_loop!
0.001783 seconds (4 allocations: 160 bytes)
timing with_copy!
0.003604 seconds (9 allocations: 7.630 MB)
timing original
0.002702 seconds (9 allocations: 7.630 MB, 97.91% gc time)
You can see that with_loop! and broadcast! do about the same and both are much faster and more efficient than the others. with_copy! and original are both slower and use more memory.
In general, to do in-place operations I'd recommend starting out by writing a loop.
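One more option worth noting (my addition, available on newer Julia versions than this answer originally targeted): dotted assignment fuses the whole right-hand side into a single in-place loop, giving the with_broadcast!/with_loop! behavior without writing either by hand:

```julia
# `.=` with dotted operators fuses into one in-place loop:
# no temporary array is allocated for the sum.
foo1 = zeros(1000, 1000)
foo2 = rand(1000, 1000)
foo3 = rand(1000, 1000)

foo1 .= foo2 .+ foo3   # equivalent to broadcast!(+, foo1, foo2, foo3)
```

Because every dotted call in the expression fuses, this extends naturally to longer expressions like `foo1 .= 2 .* foo2 .+ foo3` with still zero temporaries.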
First, read @spencerlyon2's answer. Another approach is to use Dahua Lin's package Devectorize.jl. It defines the @devec macro which automates the translation of vector (array) expressions into looped code.
In this example we will define with_devec!(foo1, foo2, foo3) as follows:
julia> using Devectorize # install with Pkg.add("Devectorize")

julia> with_devec!(foo1, foo2, foo3) = @devec foo1[:] = foo2 + foo3
Running the benchmark achieves the 4 allocations results.
You can use the axpy! function from the LinearAlgebra standard library.
using LinearAlgebra
julia> @time BLAS.axpy!(1., foo2, foo3)
0.002187 seconds (4 allocations: 160 bytes)
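Note the in-place semantics (my clarification): `BLAS.axpy!(a, x, y)` overwrites `y` with `a*x + y`, so in the call above it is `foo3`, not `foo1`, that receives the sum. A quick check on a small vector:

```julia
using LinearAlgebra

x = [1.0, 2.0]
y = [10.0, 20.0]
BLAS.axpy!(1.0, x, y)   # y ← 1.0 * x + y, computed in place
y                        # [11.0, 22.0]; x is unchanged
```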