Suppose I have a matrix
beta = ones(1000,1000)
and I want to update one of the rows of beta. Suppose I did
@btime randn!(view(beta,1,:))
I get 3.766 μs (3 allocations: 112 bytes)
However, if I did @btime beta[1,:] = randn(1000),
I get 3.453 μs (2 allocations: 7.95 KiB).
Two questions
Why is in-place update slower than creating a new array?
What is the fastest way to update beta[1,:] with randomly drawn numbers? If beta = ones(10,10), it looks like I should use StaticArrays, but I'm not sure how to.
The random number generation takes the bulk of the time. randn is optimized to generate collections of Normal random variables. Specifically, generating a vector is about 2.5x faster than generating N single samples for vectors of length N ~ 200 (and more).
If this is the part you wish to optimize (beware premature optimization), then filling a buffer with 1024 or so random Normals, drawing from it, and refreshing it once it is exhausted might be optimal.
In order to measure the speed ratio of the two methods I used the following code:
using BenchmarkTools
ratio(k) = ( @belapsed [randn() for i=1:$k]) / ( @belapsed randn($k) )
[ratio(k) for k in [1,2,4,8,16,32,64,128,256,512,1024,2048,4096]]
Giving the following ratios:
13-element Vector{Float64}:
1.0188417161077299
1.0517378917378917
1.0201505957529582
0.8502604933018909
1.2023468426000639
1.664313080265282
2.172084771444596
3.259981975486662
2.349527479585283
3.0571514578137093
2.670449479187776
2.9401818181818187
3.127237912876974
These values show that from lengths of around 100 the ratio levels off, fluctuating between roughly 2.2x and 3.3x, in line with the ~2.5x suggested above.
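The buffering idea mentioned above could be sketched roughly as follows. This is only an illustration, not a tuned implementation: the names NormalBuffer and draw! are hypothetical, and the default buffer length of 1024 is simply the figure suggested above.

```julia
using Random

# Hypothetical helper: holds a block of pre-generated standard Normal samples.
mutable struct NormalBuffer
    buf::Vector{Float64}   # pre-generated samples
    pos::Int               # index of the last sample handed out
end

# Fill the buffer once with a single vectorized randn call.
NormalBuffer(n::Integer = 1024) = NormalBuffer(randn(n), 0)

# Return one sample; refill the whole buffer in place when it is exhausted.
function draw!(b::NormalBuffer)
    if b.pos == length(b.buf)
        randn!(b.buf)      # in-place refill, no new allocation
        b.pos = 0
    end
    b.pos += 1
    return b.buf[b.pos]
end

b = NormalBuffer(4)                # tiny buffer so the refill path is exercised
xs = [draw!(b) for _ in 1:10]
println(length(xs))
```

Whether this actually beats calling randn() directly depends on the workload, so it would be worth benchmarking with @btime before adopting it.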
I am trying to do a simple function to check the differences between factorial and Stirling's approximation:
using DataFrames
n = 24
df_res = DataFrame(i = BigInt[],
                   f = BigInt[],
                   st = BigInt[])
for i in 1:n
    fac = factorial(big(i))
    sterling = i^i*exp(-i)*sqrt(i)*sqrt(2*pi)
    res = DataFrame(i = [i],
                    f = [fac],
                    st = [sterling])
    df_res = [df_res;res]
end
first(df_res, 24)
The result for sterling when i = 16 and i = 24 is 0. So I checked the power for both values, and the result is 0:
julia> 16^16
0
julia> 24^24
0
I wrote the same code in R, and there are no issues. What am I doing wrong, or what should I know about Julia that I don't?
It appears that Julia integers are either 32-bit or 64-bit, depending on your system, according to the Julia documentation for Integers and Floating-Point Numbers. Your exponentiation is overflowing your values, even if they're 64 bits.
Julia looks like it supports Arbitrary Precision Arithmetic, which you'll need to store the large resultant values.
According to the Overflow Section, writing big(n) makes n arbitrary precision.
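A small sketch of the wrap-around described above. Int64 is spelled out explicitly here (an assumption, so that the result does not depend on the machine's default Int), and the variable names are only illustrative.

```julia
# 16^16 equals 2^64, which does not fit in a signed 64-bit integer,
# so the fixed-width result silently wraps around to 0.
wrapped16 = Int64(16)^16
wrapped24 = Int64(24)^24      # 24^24 = 2^72 * 3^24, also a multiple of 2^64

# Promoting the base to BigInt first gives the exact value.
exact16 = big(16)^16

# A wider fixed-size type is enough for this particular value as well.
wide16 = Int128(16)^16

println(wrapped16)
println(exact16)
```

Note that big(16^16) would not help: the overflow has already happened by the time big is applied, so the conversion must be done on the base.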
While the question has been answered in another post, one more thing is worth saying.
Julia is one of very few languages that allow you to define your very own primitive types, so you can stay with fast fixed-precision numbers and still handle huge values. There is a package, BitIntegers, for that:
import BitIntegers
BitIntegers.@define_integers 512
Now you can do:
julia> Int512(2)^500
3273390607896141870013189696827599152216642046043064789483291368096133796404674554883270092325904157150886684127560071009217256545885393053328527589376
Usually you will get better performance from fixed-width integers, even large ones. For example:
julia> @btime Int512(2)^500;
174.687 ns (0 allocations: 0 bytes)
julia> @btime big(2)^500;
259.643 ns (9 allocations: 248 bytes)
There is a simple solution to your problem that does not involve using BigInt or any specialized number types, and which is much faster. Simply tweak your mathematical expression slightly.
foo(i) = i^i*exp(-i)*sqrt(i)*sqrt(2*pi) # this is your function
bar(i) = (i / exp(1))^i * sqrt(i) * sqrt(2*pi) # here's a better way
OK, let's test it:
1.7.2> foo(16)
0.0 # oops. not what we wanted
1.7.2> foo(big(16)) # works
2.081411441522312838373895982304611417026205959453251524254923609974529540404514e+13
1.7.2> bar(16) # also works
2.0814114415223137e13
Let's try timing it:
1.7.2> using BenchmarkTools
1.7.2> @btime foo(n) setup=(n=16)
18.136 ns (0 allocations: 0 bytes)
0.0
1.7.2> @btime foo(n) setup=(n=big(16))
4.457 μs (25 allocations: 1.00 KiB) # horribly slow
2.081411441522312838373895982304611417026205959453251524254923609974529540404514e+13
1.7.2> @btime bar(n) setup=(n=16)
99.682 ns (0 allocations: 0 bytes) # pretty fast
2.0814114415223137e13
Edit: It seems like
baz(i) = float(i)^i * exp(-i) * sqrt(i) * sqrt(2*pi)
might be an even better solution, since the numerical values are closer to the original.
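To get back to the original goal of comparing factorial with Stirling's approximation, a minimal sketch along these lines may help. It reuses the float(i)^i form from above; the names stirling and relative_error are hypothetical, and only the exact factorial uses BigInt.

```julia
# Stirling's approximation evaluated in Float64; float(i) avoids integer overflow in i^i.
stirling(i) = float(i)^i * exp(-i) * sqrt(i) * sqrt(2*pi)

# Relative error against the exact factorial, computed with BigInt.
function relative_error(i)
    exact = factorial(big(i))
    return Float64(abs(stirling(i) - exact) / exact)
end

for i in (1, 8, 16, 24)
    println("i = ", i, "  relative error = ", relative_error(i))
end
```

The relative error should shrink as i grows (roughly like 1/(12i)), which is the expected behaviour of the approximation.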
I am struggling with Julia every time that I need to collect data in an array "outside" functions.
If I use push!(array, element) I can collect data in the array, but if the code is inside a loop then the array "grows" each time.
What do you recommend?
I know this is quite basic :) but thank you!
I assume the reason why you do not want to use push! is that you have used Matlab before, where this sort of operation is painfully slow (to be precise, growing an array element by element is an O(n^2) operation, so doubling n quadruples the runtime). This is not true for Julia's push!, since push! uses the algorithm described here, which is only O(n) overall (so doubling n only doubles the runtime).
You can easily check this experimentally. In Matlab we have
>> n = 100000; tic; a = []; for i = 1:n; a = [a;0]; end; toc
Elapsed time is 2.206152 seconds.
>> n = 200000; tic; a = []; for i = 1:n; a = [a;0]; end; toc
Elapsed time is 8.301130 seconds.
so indeed the runtime quadruples for an n of twice the size. In contrast, in Julia we have
julia> using BenchmarkTools
function myzeros(n)
    a = Vector{Int}()
    for i = 1:n
        push!(a,0)
    end
    return a
end
@btime myzeros(100_000);
@btime myzeros(200_000);
486.054 μs (17 allocations: 2.00 MiB)
953.982 μs (18 allocations: 3.00 MiB)
so indeed the runtime only doubles for an n of twice the size.
Long story short: if you know the size of the final array, then preallocating the array is always best (even in Julia). However, if you don't know the final array size, then you can use push! without losing too much performance.
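The two options from the summary above could look roughly like this. The function names fill_prealloc and fill_push are hypothetical, and the sizehint! call is an optional extra (not mentioned above) that reserves capacity when a rough size is known in advance.

```julia
# Known final size: allocate once and write into the slots.
function fill_prealloc(n)
    a = Vector{Int}(undef, n)
    for i in 1:n
        a[i] = i
    end
    return a
end

# Unknown (or only roughly known) final size: grow with push!.
function fill_push(n)
    a = Int[]
    sizehint!(a, n)        # optional: reserve capacity up front
    for i in 1:n
        push!(a, i)
    end
    return a
end

println(fill_prealloc(5))
println(fill_push(5))
```

Creating the array inside the function, as here, also avoids the problem of an outer array that keeps growing every time the loop is rerun.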
I have two versions of code that seem to do the same thing:
sum = 0
for x in 1:100
    sum += x
end

sum = 0
for x in collect(1:100)
    sum += x
end
Is there a practical difference between the two approaches?
In Julia, 1:100 returns a particular struct called UnitRange that looks like this:
julia> dump(1:100)
UnitRange{Int64}
  start: Int64 1
  stop: Int64 100
This is a very compact struct to represent ranges with step 1 and arbitrary (finite) size. UnitRange is a subtype of AbstractRange, a type that represents ranges with arbitrary step and is itself a subtype of AbstractVector.
Instances of UnitRange compute their elements dynamically whenever you use getindex (or the syntactic sugar vector[index]). For example, with @less (1:100)[3] you can see this method:
function getindex(v::UnitRange{T}, i::Integer) where {T<:OverflowSafe}
    @_inline_meta
    val = v.start + (i - 1)
    @boundscheck _in_unit_range(v, val, i) || throw_boundserror(v, i)
    val % T
end
This is returning the i-th element of the vector by adding i - 1 to the first element (start) of the range. Some functions have optimised methods with UnitRange, or more generally with AbstractRange. For instance, with @less sum(1:100) you can see the following
function sum(r::AbstractRange{<:Real})
    l = length(r)
    # note that a little care is required to avoid overflow in l*(l-1)/2
    return l * first(r) + (iseven(l) ? (step(r) * (l-1)) * (l>>1)
                                     : (step(r) * l) * ((l-1)>>1))
end
This method uses the formula for the sum of an arithmetic progression, which is extremely efficient as it's evaluated in a time independent of the size of the vector.
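As a rough illustration of that closed form, here is a simplified version for integer ranges. The name range_sum is hypothetical, and unlike the Base method it takes no care to avoid overflow in the intermediate product.

```julia
# Sum of an arithmetic progression: (number of terms) * (first + last) / 2.
# For integer ranges the product is always even, so integer division is exact.
range_sum(r::AbstractRange{<:Integer}) = div(length(r) * (first(r) + last(r)), 2)

println(range_sum(1:100))
println(range_sum(1:3:100))
```

The cost is the same whether the range has a hundred elements or a billion, since only the two endpoints and the length are used.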
On the other hand, collect(1:100) returns a plain Vector with one hundred elements 1, 2, 3, ..., 100. The main difference with UnitRange (or other types of AbstractRange) is that getindex(vector::Vector, i) (or vector[i], with vector::Vector) doesn't do any computation but simply accesses the i-th element of the vector. The downside of a Vector over a UnitRange is that generally speaking there aren't efficient methods when working with them as the elements of this container are completely arbitrary, while UnitRange represents a set of numbers with peculiar properties (sorted, constant step, etc...).
If you compare the performance of methods for which UnitRange has super-efficient implementations, this type will win hands down (note the use of interpolation of variables with $(...) when using macros from BenchmarkTools):
julia> using BenchmarkTools
julia> @btime sum($(1:1000_000))
0.012 ns (0 allocations: 0 bytes)
500000500000
julia> @btime sum($(collect(1:1000_000)))
229.979 μs (0 allocations: 0 bytes)
500000500000
Remember that UnitRange comes with the cost of dynamically computing the elements every time you access them with getindex. Consider for example this function:
function test(vec)
    sum = zero(eltype(vec))
    for idx in eachindex(vec)
        sum += vec[idx]
    end
    return sum
end
Let's benchmark it with a UnitRange and a plain Vector:
julia> @btime test($(1:1000_000))
812.673 μs (0 allocations: 0 bytes)
500000500000
julia> @btime test($(collect(1:1000_000)))
522.828 μs (0 allocations: 0 bytes)
500000500000
In this case the call with the plain array is faster than the one with a UnitRange because it doesn't have to dynamically compute 1 million elements.
Of course, in these toy examples it'd be more sensible to iterate over all elements of vec rather than its indices, but in real-world cases a situation like this may be more realistic. This last example, however, shows that a UnitRange is not necessarily more efficient than a plain array, especially if you need to dynamically compute all of its elements. UnitRanges are more efficient when you can take advantage of specialised methods (like sum) for which the operation can be performed in constant time.
As a final remark, note that if you originally have a UnitRange it's not necessarily a good idea to convert it to a plain Vector to get good performance, especially if you're going to use it only once or very few times, as the conversion to Vector itself involves the dynamic computation of all elements of the range and the allocation of the necessary memory:
julia> @btime collect($(1:1000_000));
422.435 μs (2 allocations: 7.63 MiB)
julia> @btime test(collect($(1:1000_000)))
882.866 μs (2 allocations: 7.63 MiB)
500000500000
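For completeness, the element-wise variant alluded to above (iterating over the elements of vec rather than its indices) might look like this. The name test_elements is hypothetical, and no timings are claimed for it here.

```julia
# Same accumulation as test, but iterating over elements instead of indices.
function test_elements(vec)
    total = zero(eltype(vec))
    for v in vec
        total += v
    end
    return total
end

println(test_elements(1:1000_000))
println(test_elements(collect(1:1000_000)))
```

It accepts both a UnitRange and a plain Vector, so it can be benchmarked against test with the same @btime calls as above.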
I ran some examples and got some results. With a large number of iterations the parallel version performs well, but with a small number of iterations it performs worse.
I know there is a little overhead and that's absolutely OK, but is there any way to run a loop with a small number of iterations in parallel so that it is faster than the sequential version?
x = 0
@time for i=1:200000000
    x = Int(rand(Bool)) + x
end
7.503359 seconds (200.00 M allocations: 2.980 GiB, 2.66% gc time)
x = @time @parallel (+) for i=1:200000000
    Int(rand(Bool))
end
0.432549 seconds (3.91 k allocations: 241.138 KiB)
I got a good result for the parallel version here, but not in the following example.
x2 = 0
@time for i=1:100000
    x2 = Int(rand(Bool)) + x2
end
0.006025 seconds (98.97 k allocations: 1.510 MiB)
x2 = @time @parallel (+) for i=1:100000
    Int(rand(Bool))
end
0.084736 seconds (3.87 k allocations: 239.122 KiB)
Doing things in parallel will ALWAYS be less efficient in terms of total work, because parallel execution always carries synchronization overhead. The hope is to get the result earlier in wall-clock time than with a purely sequential call (one computer, single core).
Your numbers are astonishing, and I found the cause.
First of all, allow Julia to use all cores, then go into the REPL:
julia> nworkers()
4
# original case to get correct relative times
julia> x = 0
julia> @time for i=1:200000000
           x = Int(rand(Bool)) + x
       end
7.864891 seconds (200.00 M allocations: 2.980 GiB, 1.62% gc time)
julia> x = @time @parallel (+) for i=1:200000000
           Int(rand(Bool))
       end
0.350262 seconds (4.08 k allocations: 254.165 KiB)
99991471
# now a correct benchmark
julia> function test()
           x = 0
           for i=1:200000000
               x = Int(rand(Bool)) + x
           end
       end
julia> @time test()
0.465478 seconds (4 allocations: 160 bytes)
What happened?
Your first test case uses a global variable x, and that is terribly slow. The loop accesses a slow variable 200 000 000 times.
In the second test case the global variable x is assigned just once, so its poor performance does not come into play.
In my test case there is no global variable. I used a local variable. Local variables are much faster (due to better compiler optimizations).
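A sketch of how both versions could be wrapped in functions so that neither touches a global variable. The function names are hypothetical. Note also that in Julia 1.0 and later the @parallel macro used above is no longer available under that name; its replacement is @distributed from the Distributed standard library, which is what this sketch assumes.

```julia
using Distributed

# Sequential version: x is local to the function, so the compiler knows its type.
function count_heads(n)
    x = 0
    for i in 1:n
        x += Int(rand(Bool))
    end
    return x
end

# Distributed reduction; with no extra worker processes it simply runs on the main process.
function count_heads_distributed(n)
    return @distributed (+) for i in 1:n
        Int(rand(Bool))
    end
end

println(count_heads(100_000))
println(count_heads_distributed(100_000))
```

Timing both with @time (after a first warm-up call to exclude compilation) should give a fairer comparison than the global-scope loops, though for small n the distributed version will still pay its fixed communication overhead.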
Q: is there any way to run some loop with less amount of iteration in parallel way better than sequential way?
A: Yes.
1) Acquire more resources (processors to compute, memory to store), if that makes sense for your case
2) Arrange the workflow smarter: benefit from register-based code, exploit the cache-line sizes on each first fetch, and re-use data where possible (hard work? yes, it is hard work, but why repeatedly pay 150+ [ns] instead of paying this once and re-using well-aligned neighbouring cells within ~ 30 [ns] latency costs, if NUMA permits?). A smarter workflow also often means redesigning the code to increase the "density" of computations in the resulting assembly code, and tweaking the code so as to better by-pass the (optimising) superscalar processor hardware design tricks, which bring no benefit in highly tuned HPC computing payloads.
3) Avoid running head-first into blocking resources and bottlenecks (central singularities such as a host's single hardware source of randomness, IO devices, et al.)
4) Get familiar with your optimising compiler's internal options and "shortcuts"; sometimes anti-patterns get generated at the cost of extended run-times
5) Get the maximum from tuning your underlying operating system. Without this, your optimised code still waits (and a lot) in the O/S scheduler's queue
I need to work with some databases read with read.table from CSV (comma-separated values) files, and I wish to know how to compute the size of the allocated memory for each type of variable.
How can I do this?
Edit: in other words, how much memory does R allocate for a general data frame read from a .csv file?
You can get the amount of memory allocated to an object with object.size. For example:
x = 1:1000
object.size(x)
# 4040 bytes
This script might also be helpful: it lets you view or graph the amount of memory used by all of your current objects.
In answer to your question of why object.size(4) is 48 bytes, the reason is that there is some overhead in each numeric vector. (In R, the number 4 is not just an integer as in other languages; it is a numeric vector of length 1.) But that doesn't hurt performance, because the overhead does not increase with the size of the vector. If you try:
> object.size(1:100000) / 100000
4.0004 bytes
This shows you that each integer itself requires only 4 bytes (as you expect).
To summarize:
For a numeric vector of length n, the size in bytes is typically 40 + 8 * floor(n / 2). However, on my version of R and OS there is a single slight discontinuity, where it jumps to 168 bytes faster than you would expect (see plot below). Beyond that, the linear relationship holds, even up to a vector of length 10000000.
plot(sapply(1:50, function(n) object.size(1:n)))
For a categorical variable, you can see a very similar linear trend, though with a bit more overhead (see below). Outside of a few slight discontinuities, the relationship is quite close to 400 + 60 * n.
plot(sapply(1:100, function(n) object.size(factor(1:n))))