In what way(s) can I benchmark a Julia function?

Background
I've taught myself machine learning and have recently started delving into the Julia Machine Learning Ecosystem.
Coming from a Python background, with some TensorFlow and OpenCV/skimage experience, I want to benchmark Julia's ML libraries (Flux/JuliaImages) against their counterparts to see how fast (or slow) they really perform CV (or any other) tasks, and to decide whether I should shift to using Julia.
I know how to get the time taken to execute a function in Python using the timeit module, like this:
# Loading an image using OpenCV
import timeit

s = """\
img = cv2.imread('sample_image.png', 1)
"""
setup = """\
import cv2
"""
print(str(round(timeit.timeit(stmt=s, setup=setup, number=1) * 1000, 2)) + " ms")
# printing the time taken in ms, rounded to 2 digits
How does one compare the execution time of a function performing the same task in Julia using the appropriate library (in this case, JuliaImages)?
Does Julia provide any function/macro to time/benchmark?

using BenchmarkTools is the recommended way to benchmark Julia functions. Unless you are timing something that takes quite a while, use either @benchmark or the less verbose @btime macros exported from it. Because the machinery behind these macros evaluates the target function many times, @time is useful for benchmarking things that run slowly (e.g. where disk access or very time-consuming calculations are involved).
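For such one-off timings, @time can be used directly in the REPL; keep in mind that the first call to a function includes compilation time. A minimal sketch (mysum and x are just hypothetical example names):
mysum(v) = sum(v)   # hypothetical example function
x = rand(1000, 1000);
@time mysum(x)      # first call: includes compilation time
@time mysum(x)      # subsequent calls measure only the run time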
It is important to use @btime or @benchmark correctly; this avoids misleading results. Usually, you are benchmarking a function that takes one or more arguments. When benchmarking, all arguments should be external variables (shown here without the benchmark macro):
x = 1
f(x)
# do not use f(1)
The function will be evaluated many times. To prevent the function arguments from being re-evaluated whenever the function is evaluated, we must mark each argument by prefixing a $ to the name of each variable that is used as an argument. The benchmarking macros use this to indicate that the variable should be evaluated (resolved) once, at the start of the benchmarking process, and that the result is then reused directly as is:
julia> using BenchmarkTools
julia> a = 1/2;
julia> b = 1/4;
julia> c = 1/8;
julia> a, b, c
(0.5, 0.25, 0.125)
julia> function sum_cosines(x, y, z)
           return cos(x) + cos(y) + cos(z)
       end;
julia> @btime sum_cosines($a, $b, $c);  # the `;` suppresses printing the returned value
11.899 ns (0 allocations: 0 bytes)  # calling the function takes ~12 ns (nanoseconds)
# the function does not allocate any memory
# if we omit the '$', what we see is misleading
julia> @btime sum_cosines(a, b, c);  # the function appears more than twice as slow
28.441 ns (1 allocation: 16 bytes)  # the function appears to be allocating memory
# @benchmark can be used the same way that @btime is used
julia> @benchmark sum_cosines($a,$b,$c) # do not use a ';' here
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.111 ns (0.00% GC)
median time: 12.213 ns (0.00% GC)
mean time: 12.500 ns (0.00% GC)
maximum time: 39.741 ns (0.00% GC)
--------------
samples: 1500
evals/sample: 999
While there are parameters that can be adjusted, the default values usually work well. For additional information about BenchmarkTools for experienced users, see the manual.
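For example, a few of those parameters can be passed as keyword arguments directly to the macro, or changed globally via BenchmarkTools.DEFAULT_PARAMETERS (a minimal sketch; the values shown are arbitrary):
julia> @benchmark sum_cosines($a, $b, $c) samples=200 evals=50 seconds=1

julia> BenchmarkTools.DEFAULT_PARAMETERS.seconds = 10;  # raise the default time budget for all subsequent benchmarks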

Julia provides two macros for timing/benchmarking code at runtime. These are:
@time
@benchmark : external, installed via Pkg.add("BenchmarkTools")
Using BenchmarkTools' @benchmark is very easy and would be helpful to you in comparing the speed of the two languages.
Example of using @benchmark against the Python benchmark you provided:
using Images, FileIO, BenchmarkTools
@benchmark img = load("sample_image.png")
Output:
BenchmarkTools.Trial:
memory estimate: 3.39 MiB
allocs estimate: 322
--------------
minimum time: 76.631 ms (0.00% GC)
median time: 105.579 ms (0.00% GC)
mean time: 110.319 ms (0.41% GC)
maximum time: 209.470 ms (0.00% GC)
--------------
samples: 46
evals/sample: 1
Now, to compare the mean time, you should use the number of samples (46) as the number in your Python timeit code and divide by the same number to get the mean execution time.
print(str(round((timeit.timeit(stmt = s, setup = setup, number = 46)/46)*1000, 2)) + " ms")
You can follow this process for benchmarking any function in both Julia and Python.
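Alternatively, if you just want a single number to plug into a comparison like the one above, BenchmarkTools also exports @belapsed, which runs the same machinery but returns only the minimum time in seconds. A sketch, reusing the same image file as above:
using Images, FileIO, BenchmarkTools
t = @belapsed load("sample_image.png")       # minimum time of the benchmark, in seconds
println(round(t * 1000, digits = 2), " ms")  # print it in ms, rounded to 2 digits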
I hope your doubt has been cleared.
Note: From a statistical point of view, @benchmark is much better than @time.

Related

CUDA example in Julia doesn't use GPU

I'm taking my first steps in running Julia 1.6.5 code on a GPU. For some reason, it seems the GPU is not being used at all. These are the steps:
First of all, my GPU passed the test recommended in the CUDA Julia docs:
# install the package
using Pkg
Pkg.add("CUDA")
# smoke test (this will download the CUDA toolkit)
using CUDA
CUDA.versioninfo()
using Pkg
Pkg.test("CUDA") # takes ~40 minutes if using 1 thread
Secondly, the code below took around 8 minutes (real time) while supposedly running on my GPU. It creates and multiplies, 10 times over, two 10000 x 10000 matrices:
using CUDA
using Random
N = 10000
a_d = CuArray{Float32}(undef, (N, N))
b_d = CuArray{Float32}(undef, (N, N))
c_d = CuArray{Float32}(undef, (N, N))
for i in 1:10
    global a_d = randn(N, N)
    global b_d = randn(N, N)
    global c_d = a_d * b_d
end
global a_d = nothing
global b_d = nothing
global c_d = nothing
GC.gc()
The outcome on the terminal is as follows:
(base) ciro@ciro-G3-3500:~/projects/julia/cuda$ time julia cuda-gpu.jl
real 8m13,016s
user 50m39,146s
sys 13m16,766s
Then, equivalent code for the CPU is run. Execution time was also equivalent:
using Random
N = 10000
for i in 1:10
    a = randn(N, N)
    b = randn(N, N)
    c = a * b
end
Execution:
(base) ciro@ciro-G3-3500:~/projects/julia/cuda$ time julia cuda-cpu.jl
real 8m2,689s
user 50m9,567s
sys 13m3,738s
Moreover, following the output of the nvtop screen command, it is weird to see the GPU memory and cores being loaded/unloaded accordingly, while still using around 800% CPU (i.e. eight cores) of my regular CPU, which is the same usage the CPU version has.
Any hint is greatly appreciated.
There are a couple of things that prevent your code from working properly and fast.
First, you are overwriting your allocated CuArrays with normal CPU Arrays by using randn, which means that the matrix multiplication runs on the CPU.
You should use CUDA.randn instead. By using CUDA.randn!, you are not allocating any memory beyond what was already allocated.
Secondly, you are using global variables and the global scope, which is bad for performance.
Thirdly, you are using C = A * B, which allocates new memory each time; you should use the in-place version mul! instead.
I would propose the following solution:
using CUDA
using LinearAlgebra
N = 10000
a_d = CuArray{Float32}(undef, (N, N))
b_d = CuArray{Float32}(undef, (N, N))
c_d = CuArray{Float32}(undef, (N, N))
# wrap your code in a function
# `!` is a convention to indicate that the arguments will be modified
function randn_mul!(A, B, C)
    CUDA.randn!(A)
    CUDA.randn!(B)
    mul!(C, A, B)
end
# use CUDA.@time to time the GPU execution time and memory usage:
for i in 1:10
    CUDA.@time randn_mul!(a_d, b_d, c_d)
end
which runs pretty fast on my machine:
$ time julia --project=. cuda-gpu.jl
2.392889 seconds (4.69 M CPU allocations: 263.799 MiB, 6.74% gc time) (2 GPU allocations: 1024.000 MiB, 0.05% memmgmt time)
0.267868 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
0.274376 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
0.268574 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
0.274514 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
0.272016 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
0.272668 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
0.273441 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
0.274318 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
0.272389 seconds (60 CPU allocations: 2.000 KiB) (2 GPU allocations: 1024.000 MiB, 0.00% memmgmt time)
real 0m8.726s
user 0m6.030s
sys 0m0.554s
Note that the first time the function was called, the execution time and memory usage were higher, because you are measuring compilation time whenever a function is first called with a given type signature.
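If you want to keep compilation out of the measurement entirely, a common pattern is to call the function once as a warm-up before timing it (a minimal sketch, reusing the arrays defined above):
randn_mul!(a_d, b_d, c_d)              # warm-up call: triggers compilation
CUDA.@time randn_mul!(a_d, b_d, c_d)   # now measures only the actual run time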
After playing around a little, the following code also works. It is interesting to note the global declaration on the c_d variable; without it, the system complained about ambiguity between the (global) CuArray c_d and a different (unintended local) variable c_d.
using CUDA
using Random
N = 10000
a_d = CuArray{Float32}(undef, (N, N))
b_d = CuArray{Float32}(undef, (N, N))
c_d = CuArray{Float32}(undef, (N, N))
for i in 1:10
    randn!(a_d)
    randn!(b_d)
    global c_d = a_d * b_d
end
global a_d = nothing
global b_d = nothing
global c_d = nothing
GC.gc()
The outcome on a relatively modest GPU confirms the result:
(base) ciro@ciro-Inspiron-7460:~/projects/julia/cuda$ time julia cuda-gpu.jl
real 0m38,243s
user 0m36,810s
sys 0m1,413s

Using `\` instead of `/` in Julia

For scalars, the \ (solve linear system) operator is equivalent to the division operator /. Is the performance similar?
I ask because currently my code has a line like
x = (1 / alpha) * averylongfunctionname(input1, input2, input3)
Visually, it is important that the division by alpha happens on the "left," so I am considering replacing this with
x = alpha \ averylongfunctionname(input1, input2, input3)
What is the best practice in this situation, from the standpoint of style and the standpoint of performance?
Here are some perplexing benchmarking results:
julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]
julia> @btime x[1]\sum(x) setup=(x=rand(100))
15.014 ns (0 allocations: 0 bytes)
56.23358979466163
julia> @btime (1/x[1]) * sum(x) setup=(x=rand(100))
13.312 ns (0 allocations: 0 bytes)
257.4552413802698
julia> @btime sum(x)/x[1] setup=(x=rand(100))
14.929 ns (0 allocations: 0 bytes)
46.25209548841374
They are all about the same, but I'm surprised that the (1 / x) * foo approach has the best performance.
Scalar / and \ really should have the same meaning and performance. Let's define these two test functions:
f(a, b) = a / b
g(a, b) = b \ a
We can then see that they produce identical LLVM code:
julia> @code_llvm f(1.5, 2.5)
;  @ REPL[29]:1 within `f'
define double @julia_f_380(double %0, double %1) {
top:
; ┌ @ float.jl:335 within `/'
   %2 = fdiv double %0, %1
; └
  ret double %2
}
julia> @code_llvm g(1.5, 2.5)
;  @ REPL[30]:1 within `g'
define double @julia_g_382(double %0, double %1) {
top:
; ┌ @ operators.jl:579 within `\'
; │┌ @ float.jl:335 within `/'
    %2 = fdiv double %0, %1
; └└
  ret double %2
}
And the same machine code too. I'm not sure what is causing the differences in @btime results, but I'm pretty sure that the difference between / and \ is an illusion and not real.
As to x*(1/y), that does not compute the same thing as x/y: it will be potentially less accurate since there is rounding done when computing 1/y and then that rounded value is multiplied by x, which also rounds. For example:
julia> 17/0.7
24.28571428571429
julia> 17*(1/0.7)
24.285714285714285
Since floating-point division is guaranteed to be correctly rounded, doing the division directly is always going to be more accurate. If the divisor is shared by a lot of loop iterations, however, you can get a speedup by rewriting the computation like this, since floating-point multiplication is usually faster than division (although timing on my current computer does not show this). Be aware that this comes at a loss of accuracy, however, and if the divisor is not shared there would still be a loss of accuracy and no performance gain.
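As a sketch of the shared-divisor case (the function name is made up; whether the rewrite actually pays off depends on the CPU, and it trades a small amount of accuracy for speed):
function scale_all!(out, xs, y)
    inv_y = 1 / y                    # one division, rounded once
    @inbounds for i in eachindex(xs, out)
        out[i] = xs[i] * inv_y       # multiplication is usually cheaper than division
    end
    return out
end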
I don't know, but I can suggest you try the BenchmarkTools package: it can help you evaluate the performance of the two different statements. Here you can find more details. Bye!
I think that the best choice is (1/x)*foo, for two reasons:
it has the best performance (although not by much compared to the other options);
it is clearer for another person reading the code.

How to pass an array of objects to a function in julia?

I want to pass an array of objects to a function and return another array of the same size. How should I do this? The code should be as fast as possible. For example, consider the code below:
struct triangle
    height::Float64
    base::Float64
end
function area(height::Float64, base::Float64)
    return 0.5 * base * height
end
I want to define a function which returns the area of an array of triangles.
What is the fastest way to do it?
You can leverage Julia's broadcasting syntax for this. Consider:
struct Triangle  # Note: Julia convention is to capitalize types
    height::Float64
    base::Float64
end
# Define an area method for the Triangle type
area(t::Triangle) = 0.5 * t.height * t.base
# Create an array of random triangles
triangle_array = [Triangle(rand(), rand()) for _ in 1:10]
# Now we can use dot syntax to broadcast our area function over the array of triangles
area.(triangle_array)
Note that this differs from your code in that it directly uses the Triangle object for dispatch in the call to the area function. The area function then doesn't take height and base arguments but just a single Triangle object and accesses its height and base fields (t.height and t.base).
Here are some benchmarks for computing this with:
map: map(area, triangles)
a comprehension: [area(triangle) for triangle in triangles]
broadcasting: area.(triangles)
using interpolation ($) of the non-constant global variable triangles (based on @DNF's comment).
Definitions
using Pkg
Pkg.add("BenchmarkTools")
using BenchmarkTools
struct Triangle
    height::Float64
    base::Float64
end
function area(t::Triangle)
    0.5 * t.height * t.base
end
triangles = [Triangle(rand(), rand()) for _ in 1:1000000]
Results
julia> @benchmark map(area, $triangles)
BenchmarkTools.Trial:
memory estimate: 7.63 MiB
allocs estimate: 3
--------------
minimum time: 1.168 ms (0.00% GC)
median time: 2.510 ms (0.00% GC)
mean time: 2.485 ms (10.00% GC)
maximum time: 43.540 ms (91.62% GC)
--------------
samples: 2008
evals/sample: 1
julia> @benchmark [area(triangle) for triangle in $triangles]
BenchmarkTools.Trial:
memory estimate: 7.63 MiB
allocs estimate: 3
--------------
minimum time: 1.150 ms (0.00% GC)
median time: 1.921 ms (0.00% GC)
mean time: 2.327 ms (10.76% GC)
maximum time: 45.883 ms (91.42% GC)
--------------
samples: 2144
evals/sample: 1
julia> @benchmark area.($triangles)
BenchmarkTools.Trial:
memory estimate: 7.63 MiB
allocs estimate: 2
--------------
minimum time: 1.165 ms (0.00% GC)
median time: 1.224 ms (0.00% GC)
mean time: 1.961 ms (10.13% GC)
maximum time: 44.156 ms (89.33% GC)
--------------
samples: 2544
evals/sample: 1
This would indicate that, for this input size, the broadcasting method seems to be the fastest.
For different input sizes, the relative timings may differ, so it is probably a good idea to benchmark it yourself for your use case.
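For example, a quick way to check how the approaches scale is to loop over a few input sizes and record the minimum time with @belapsed (a sketch; the sizes are arbitrary and it reuses the Triangle and area definitions above):
using BenchmarkTools

for n in (10_000, 100_000, 1_000_000)
    ts = [Triangle(rand(), rand()) for _ in 1:n]
    t_bc  = @belapsed area.($ts)        # broadcasting
    t_map = @belapsed map(area, $ts)    # map
    println("n = $n: broadcast $(t_bc) s, map $(t_map) s")
end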

Why does Julia interpolation not work in juliabox tutorial?

I'm using juliabox.com to learn Julia, but am having a very basic issue. I can't do the interpolation I'm being told to do:
sum($foo) just doesn't work as described; it just returns "syntax: "$" expression outside quote". This is at https://www.juliabox.com/notebook/notebooks/tutorials/introductory-tutorials/intro-to-julia/Exploring_benchmarking_and_performance.ipynb .
Is there a problem with the tutorial, or with me?
Edit: To be clear, my confusion here was not knowing that '$' was paired with @benchmark in this context. The tutorial did not state this, so I saw no reason sum($foo) shouldn't work. Now I understand better. (Perhaps the tutorial's wording could be clearer.)
The tutorial you are using is aimed to teach you specifically how to correctly benchmark Julia code.
The key thing to understand about $ is that it interpolates a value into the benchmark expression, so that it behaves as a variable whose type Julia knows at compile time (https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#interpolating-values-into-benchmark-expressions).
Why is this needed? A major performance problem in Julia programs is using global variables (https://docs.julialang.org/en/latest/manual/performance-tips/#Avoid-global-variables-1). In the following code:
julia> using BenchmarkTools
julia> x = rand(10);
julia> @benchmark sum(x)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 18.492 ns (0.00% GC)
median time: 21.306 ns (0.00% GC)
mean time: 30.284 ns (17.51% GC)
maximum time: 38.387 μs (99.93% GC)
--------------
samples: 10000
evals/sample: 995
The variable x is global.
If you write $x instead of x, then the variable x will be local (thus its type will be known to Julia at compile time). Note that this interpolation trick is only used for benchmarking, not for the real code:
julia> @benchmark sum($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.199 ns (0.00% GC)
median time: 5.399 ns (0.00% GC)
mean time: 5.538 ns (0.00% GC)
maximum time: 48.301 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
And the performance difference is exactly due to the fact that in the first case x is global, and in the second it is local.
In order to see that what is going on is allowing Julia to know the type of x at compile time, consider the following code:
julia> const y = x;
julia> @benchmark sum(y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.799 ns (0.00% GC)
median time: 5.200 ns (0.00% GC)
mean time: 5.490 ns (0.00% GC)
maximum time: 30.900 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark sum($y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.199 ns (0.00% GC)
median time: 5.699 ns (0.00% GC)
mean time: 5.615 ns (0.00% GC)
maximum time: 30.600 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
In this case y is a global constant. Thus the compiler knows its type even though it is global, so in this case it does not matter whether you write y or $y.
Now you might ask why sum does not have to be prefixed with $ then. The answer is that sum is a function, and thus its type is known at compile time.
Another way to think about $ (I am simplifying a bit, as actually something different is done here; you can investigate the details using the @macroexpand macro) is that it turns this:
julia> f() = for i in 1:10^6
           sum(x)
       end
into this
julia> g(x) = for i in 1:10^6
           sum(x)
       end
And now if you measure the time of both functions with a simple @time you get:
julia> @time f()
0.032786 seconds (1.05 M allocations: 18.224 MiB)
julia> @time f()
0.024807 seconds (1.00 M allocations: 15.259 MiB, 13.19% gc time)
vs
julia> @time g(x)
0.017912 seconds (53.07 k allocations: 2.990 MiB, 17.93% gc time)
julia> @time g(x)
0.001044 seconds (4 allocations: 160 bytes)
(you should look at the second timing as the first includes compilation time)
In summary
Prefixing a global variable name with $ is used for benchmarking purposes only. It makes sure that you get information about the performance of the function in a type-stable context (and this is usually what you are interested in).
Additional cautionary note
Benchmarking Julia code is sometimes tricky, as its compiler is very aggressive in optimizing the code.
For example compare:
julia> using BenchmarkTools
julia> const z = 1
1
julia> @benchmark sin(cos(z))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.999 ns (0.00% GC)
median time: 2.201 ns (0.00% GC)
mean time: 2.394 ns (0.00% GC)
maximum time: 28.301 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark sin(cos($z))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 25.477 ns (0.00% GC)
median time: 33.030 ns (0.00% GC)
mean time: 35.307 ns (0.00% GC)
maximum time: 106.747 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 993
You might wonder why using $ in this case makes the execution time slower. The reason is that since z is a constant, sin(cos(z)) is fully evaluated during compilation (no computation takes place at run time), so what happens is something similar to:
julia> f() = sin(cos(z))
f (generic function with 1 method)
julia> @code_llvm f()
;  @ REPL[30]:1 within `f'
; Function Attrs: uwtable
define double @julia_f_16511() #0 {
top:
  ret double 0x3FE075ED0B926F7C
}
(and you see that if f() would be called it actually performs no computations).
On the other hand sin(cos($z)) gets expanded in a way that makes Julia create a fresh local variable, call it v, then assign the value of z to it, and finally evaluate sin(cos(v)) at run time (but knowing that the type of v is Int).
Note that this is faster than:
julia> x = 1
1
julia> @benchmark sin(cos(x))
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 2
--------------
minimum time: 39.246 ns (0.00% GC)
median time: 54.638 ns (0.00% GC)
mean time: 67.345 ns (8.80% GC)
maximum time: 41.383 μs (99.82% GC)
--------------
samples: 10000
evals/sample: 981
as in this case the compiler does not know the type of x.

Why is Julia very slow for the first evaluation?

I am a newbie in Julia. I run the following code:
using LinearAlgebra  # needed for det

const size = 100
@time A = rand(size, size) * rand(size, size)
@time B = rand(size, size) * rand(size, size)
@time a = det(A)
@time b = det(B)
print(a, "\n", b)
Then I get something like this:
0.825584 seconds (259.77 k allocations: 13.101 MiB)
0.000248 seconds (11 allocations: 234.813 KiB)
0.297366 seconds (44.59 k allocations: 2.591 MiB)
0.012814 seconds (12 allocations: 79.375 KiB)
-9.712788203190892e49
-5.471097050756647e49
Why is the first call of either the matrix multiplication or the determinant evaluation extremely slow? How can I avoid this?
You can't trust the outcome of timing with variables in global scope. Try putting it in a function and running it again. Also, the first time you run the function it will be compiled, so that will take some time.
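A minimal sketch of that advice, with the same computation wrapped in a function (the name work is arbitrary; LinearAlgebra is needed for det):
using LinearAlgebra

function work(n)
    A = rand(n, n) * rand(n, n)
    B = rand(n, n) * rand(n, n)
    return det(A), det(B)
end

@time work(100)   # first call: includes compilation of `work`
@time work(100)   # second call: measures only the run time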
