I am trying to benchmark the performance of functions using BenchmarkTools as in the example below. My goal is to obtain the outputs of @benchmark as a DataFrame.
In this example, I am benchmarking the performance of the following two functions:
"""Example function A: recodes negative values to 0"""
function negative_to_zero_a!(x::Array{<:Real,1})
for (i, v) in enumerate(x)
if v < 0
x[i] = zero(x[i]) # uses 'zero()'
end
end
end
"""Example function B: recodes negative values to 0"""
function negative_to_zero_b!(x::Array{<:Real,1})
for (i, v) in enumerate(x)
if v < 0
x[i] = 0 # does not use 'zero()'
end
end
end
Which are meant to mutate the following vectors:
int_a = [1, -2, 3, -4]
float_a = [1.0, -2.0, 3.0, -4.0]
int_b = copy(int_a)
float_b = copy(float_a)
I then produce the performance benchmarks using BenchmarkTools.
using BenchmarkTools
int_a_benchmark = @benchmark negative_to_zero_a!(int_a)
int_b_benchmark = @benchmark negative_to_zero_b!(int_b)
float_a_benchmark = @benchmark negative_to_zero_a!(float_a)
float_b_benchmark = @benchmark negative_to_zero_b!(float_b)
I would now like to retrieve the elements of each of the four BenchmarkTools.Trial objects into a DataFrame similar to the one below. In that DataFrame, each row contains the results of a given BenchmarkTools.Trial object. E.g.
DataFrame("id" => ["int_a_benchmark", "int_b_benchmark", "float_a_benchmark", "float_b_benchmark"],
"minimum" => [15.1516, 15.631, 14.615, 14.271],
"median" => [15.916, 15.731, 15.916, 15.879],
"maximum" => [149.15, 104.108, 63.363, 116.181],
"allocations" => [0, 0, 0, 0],
"memory_bytes" => [0, 0, 0, 0])
4×6 DataFrame
Row │ id minimum median maximum allocations memory_estimate
│ String Float64 Float64 Float64 Int64 Int64
─────┼────────────────────────────────────────────────────────────────────────────
1 │ int_a_benchmark 15.1516 15.916 149.15 0 0
2 │ int_b_benchmark 15.631 15.731 104.108 0 0
3 │ float_a_benchmark 14.615 15.916 63.363 0 0
4 │ float_b_benchmark 14.271 15.879 116.181 0 0
How can I retrieve the results of the benchmarks into a DataFrame like this one?
As usual with Julia, there are multiple ways to do what you want. I present here maybe not the simplest one, but one which hopefully shows an interesting approach that allows for some generalization.
But before we start, a small note: your benchmarks are not quite correct, since your functions mutate their argument. For proper benchmarking you should copy your data before each run, and do so every time the function is executed. You can find more information here: https://juliaci.github.io/BenchmarkTools.jl/dev/manual/#Setup-and-teardown-phases
So, from now on, we assume that you prepared the benchmarks like this:
int_a_benchmark = @benchmark negative_to_zero_a!(a) setup=(a = copy($int_a)) evals=1
int_b_benchmark = @benchmark negative_to_zero_b!(b) setup=(b = copy($int_b)) evals=1
float_a_benchmark = @benchmark negative_to_zero_a!(a) setup=(a = copy($float_a)) evals=1
float_b_benchmark = @benchmark negative_to_zero_b!(b) setup=(b = copy($float_b)) evals=1
The main idea is the following: if we can represent each benchmark as a DataFrame, then we can combine them into one large DataFrame and do all the necessary calculations there.
Of course, one can do it in a very simple way, with commands like
df = DataFrame(times = int_a_benchmark.times, gctimes = int_a_benchmark.gctimes)
df.memory .= int_a_benchmark.memory
df.allocs .= int_a_benchmark.allocs
but this is too boring and too verbose (though it is simple and is what you should do 99% of the time). It would be nicer if we could just write DataFrame(int_a_benchmark) and get the result immediately.
As it turns out, this is possible, because DataFrames supports the Tables.jl interface for working with table-like data. You can read the details in the Tables.jl manual, but generally you need to define a few meaningful things, like column names and column accessors, and the package will do everything else. I show the result here without further explanation.
using Tables
Tables.istable(::Type{<:BenchmarkTools.Trial}) = true
Tables.columnaccess(::Type{<:BenchmarkTools.Trial}) = true
Tables.columns(m::BenchmarkTools.Trial) = m
Tables.columnnames(m::BenchmarkTools.Trial) = [:times, :gctimes, :memory, :allocs]
Tables.schema(m::BenchmarkTools.Trial) = Tables.Schema(Tables.columnnames(m), (Float64, Float64, Int, Int))
function Tables.getcolumn(m::BenchmarkTools.Trial, i::Int)
    i == 1 && return m.times
    i == 2 && return m.gctimes
    i == 3 && return fill(m.memory, length(m.times))
    return fill(m.allocs, length(m.times))
end
Tables.getcolumn(m::BenchmarkTools.Trial, nm::Symbol) = Tables.getcolumn(m, nm == :times ? 1 : nm == :gctimes ? 2 : nm == :memory ? 3 : 4)
and we can see that it really works (almost magically)
julia> DataFrame(int_a_benchmark)
10000×4 DataFrame
Row │ times gctimes memory allocs
│ Float64 Float64 Int64 Int64
───────┼──────────────────────────────────
1 │ 309.0 0.0 0 0
2 │ 38.0 0.0 0 0
3 │ 25.0 0.0 0 0
4 │ 37.0 0.0 0 0
⋮ │ ⋮ ⋮ ⋮ ⋮
The next step is combining all the data frames into a single data frame. We need to take the following steps:
Convert each benchmark trial to a DataFrame
Add a column with the name of the corresponding benchmark
Concatenate them all together (with the vcat function)
Of course, you could do all of these steps one by one for each data frame, but that is too long (and boring, yes). Instead, we can use the amazing mapreduce function and the so-called do-block syntax. The map part prepares the necessary data frames and reduce combines them together:
df_benchmark = mapreduce(vcat, zip([int_a_benchmark, int_b_benchmark, float_a_benchmark, float_b_benchmark],
                                   ["int_a_benchmark", "int_b_benchmark", "float_a_benchmark", "float_b_benchmark"])) do (x, y)
    df = DataFrame(x)
    df.name .= y
    df
end
And now for the final part. We have a nice, large DataFrame that we want to aggregate. For this we can use the split-apply-combine strategy of DataFrames:
julia> combine(groupby(df_benchmark, :name),
               :times => minimum => :minimum,
               :times => median => :median,
               :times => maximum => :maximum,
               :allocs => first => :allocations,
               :memory => first => :memory_estimate)
4×6 DataFrame
Row │ name minimum median maximum allocations memory_estimate
│ String Float64 Float64 Float64 Int64 Int64
─────┼────────────────────────────────────────────────────────────────────────────
1 │ int_a_benchmark 22.0 24.0 3252.0 0 0
2 │ int_b_benchmark 20.0 23.0 489.0 0 0
3 │ float_a_benchmark 21.0 23.0 134.0 0 0
4 │ float_b_benchmark 21.0 23.0 129.0 0 0
As a bonus, the last calculation can look even better with the help of the Chain.jl package:
using Chain
@chain df_benchmark begin
    groupby(:name)
    combine(:times => minimum => :minimum,
            :times => median => :median,
            :times => maximum => :maximum,
            :allocs => first => :allocations,
            :memory => first => :memory_estimate)
end
You can do it e.g. like this:
julia> using Statistics, DataFrames, BenchmarkTools
julia> preprocess_trial(t::BenchmarkTools.Trial, id::AbstractString) =
           (id=id,
            minimum=minimum(t.times),
            median=median(t.times),
            maximum=maximum(t.times),
            allocations=t.allocs,
            memory_estimate=t.memory)
preprocess_trial (generic function with 1 method)
julia> output = DataFrame()
0×0 DataFrame
julia> for (fun, id) in [(sin, "sin"), (cos, "cos"), (log, "log")]
           push!(output, preprocess_trial(@benchmark($fun(1)), id))
       end
julia> output
3×6 DataFrame
Row │ id minimum median maximum allocations memory_estimate
│ String Float64 Float64 Float64 Int64 Int64
─────┼─────────────────────────────────────────────────────────────────
1 │ sin 0.001 0.001 0.1 0 0
2 │ cos 0.001 0.001 0.1 0 0
3 │ log 0.001 0.001 0.1 0 0
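For completeness, here is a minimal sketch of the same helper applied to the four trials from the question (assuming they were created as shown there):
output = DataFrame()
for (trial, id) in [(int_a_benchmark, "int_a_benchmark"), (int_b_benchmark, "int_b_benchmark"),
                    (float_a_benchmark, "float_a_benchmark"), (float_b_benchmark, "float_b_benchmark")]
    push!(output, preprocess_trial(trial, id))
end
output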
Related
pandas has a number of very handy utilities for manipulating datetime indices. Is there any similar functionality in Julia? I have not found any tutorials for working with such things, though it obviously must be possible.
Some examples of pandas utilities:
dti = pd.to_datetime(
["1/1/2018", np.datetime64("2018-01-01"),
datetime.datetime(2018, 1, 1)]
)
dti = pd.date_range("2018-01-01", periods=3, freq="H")
dti = dti.tz_localize("UTC")
dti.tz_convert("US/Pacific")
idx = pd.date_range("2018-01-01", periods=5, freq="H")
ts = pd.Series(range(len(idx)), index=idx)
ts.resample("2H").mean()
Julia libraries follow a "do only one thing, but do it right" philosophy, so their layout resembles Unix (a battery of small tools that together accomplish a common goal) more than Python's.
Hence you have separate libraries for DataFrames and Dates:
julia> using Dates, DataFrames
Going through some of the examples from your tutorial:
Pandas
dti = pd.to_datetime(
["1/1/2018", np.datetime64("2018-01-01"), datetime.datetime(2018, 1, 1)]
)
Julia
julia> DataFrame(dti=[Date("1/1/2018", "m/d/y"), Date("2018-01-01"), Date(2018,1,1)])
3×1 DataFrame
Row │ dti
│ Date
─────┼────────────
1 │ 2018-01-01
2 │ 2018-01-01
3 │ 2018-01-01
Pandas
dti = pd.date_range("2018-01-01", periods=3, freq="H")
Julia
julia> DateTime("2018-01-01") .+ Hour.(0:2)
3-element Vector{DateTime}:
2018-01-01T00:00:00
2018-01-01T01:00:00
2018-01-01T02:00:00
Pandas
dti = dti.tz_localize("UTC")
dti.tz_convert("US/Pacific")
Julia
Note that there is a separate library in Julia for time zones. Additionally, "US/Pacific" is a legacy time zone name.
julia> using TimeZones
julia> dti = ZonedDateTime.(dti, tz"UTC")
3-element Vector{ZonedDateTime}:
2018-01-01T00:00:00+00:00
2018-01-01T01:00:00+00:00
2018-01-01T02:00:00+00:00
julia> astimezone.(dti, TimeZone("US/Pacific", TimeZones.Class(:LEGACY)))
3-element Vector{ZonedDateTime}:
2017-12-31T16:00:00-08:00
2017-12-31T17:00:00-08:00
2017-12-31T18:00:00-08:00
Pandas
idx = pd.date_range("2018-01-01", periods=5, freq="H")
ts = pd.Series(range(len(idx)), index=idx)
ts.resample("2H").mean()
Julia
For resampling or other complex manipulations you will want to use the split-apply-combine pattern (see https://docs.juliahub.com/DataFrames/AR9oZ/1.3.1/man/split_apply_combine/)
julia> df = DataFrame(date=DateTime("2018-01-01") .+ Hour.(0:4), vals=1:5)
5×2 DataFrame
Row │ date vals
│ DateTime Int64
─────┼────────────────────────────
1 │ 2018-01-01T00:00:00 1
2 │ 2018-01-01T01:00:00 2
3 │ 2018-01-01T02:00:00 3
4 │ 2018-01-01T03:00:00 4
5 │ 2018-01-01T04:00:00 5
julia> df.date2 = floor.(df.date, Hour(2));
julia> using StatsBase
julia> combine(groupby(df, :date2), :date2, :vals => mean => :vals_mean)
5×2 DataFrame
Row │ date2 vals_mean
│ DateTime Float64
─────┼────────────────────────────────
1 │ 2018-01-01T00:00:00 1.5
2 │ 2018-01-01T00:00:00 1.5
3 │ 2018-01-01T02:00:00 3.5
4 │ 2018-01-01T02:00:00 3.5
5 │ 2018-01-01T04:00:00 5.0
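If you want a single row per 2-hour bin, matching the shape of the pandas ts.resample("2H").mean() result, a minimal variation under the same setup is to omit the per-row :date2 selection:
combine(groupby(df, :date2), :vals => mean => :vals_mean)
which returns three rows with vals_mean equal to 1.5, 3.5 and 5.0.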
I must be doing something wrong. I have a Julia script (below) that uses both vcat and plot. When I run the script, vcat returns an empty DataFrame. Another function calls plot and no plot is generated.
When I manually type the commands in the terminal window the commands behave normally.
Any help would be appreciated.
f_l = file_list[start_row_num:end_row_num] # Build a dataframe containing the data
len = length(f_l)
tmp_stock_df = DataFrame(CSV.File(f_l[1]))
vcat(s_d_df, tmp_stock_df)
println(s_d_df)
for i = 2:len
tmp_stock_df = DataFrame(CSV.File(f_l[i]))
tmp_stock_df.quote_datetime = map((x) -> DateTime(x, "yyyy-mm-dd HH:MM:SS"), tmp_stock_df.quote_datetime)
DataFrames.vcat(s_d_df, tmp_stock_df)
end
It's hard to say what you're doing differently when manually typing in the commands, but it seems to me that this code would never produce the results you're looking for. Apart from the fact that s_d_df is not defined, vcat does not mutate its arguments, so you're never actually adding anything to your DataFrame:
julia> using DataFrames
julia> df1 = DataFrame(a = rand(2), b = rand(2)); df2 = DataFrame(a = rand(2), b = rand(2));
julia> vcat(df1, df2)
4×2 DataFrame
Row │ a b
│ Float64 Float64
─────┼────────────────────
1 │ 0.918298 0.343344
2 │ 0.538763 0.188229
3 │ 0.347177 0.385166
4 │ 0.18795 0.98408
julia> df1
2×2 DataFrame
Row │ a b
│ Float64 Float64
─────┼────────────────────
1 │ 0.918298 0.343344
2 │ 0.538763 0.188229
You probably want s_d_df = vcat(s_d_df, tmp_stock_df) to assign the result of the concatenation.
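For illustration, here is a minimal sketch of the corrected loop (assuming CSV, DataFrames and Dates are loaded and f_l is your vector of file paths; as in your code, the first file's quote_datetime is left unparsed):
s_d_df = DataFrame(CSV.File(f_l[1]))
for i in 2:length(f_l)
    tmp_stock_df = DataFrame(CSV.File(f_l[i]))
    tmp_stock_df.quote_datetime = map(x -> DateTime(x, "yyyy-mm-dd HH:MM:SS"), tmp_stock_df.quote_datetime)
    global s_d_df = vcat(s_d_df, tmp_stock_df) # reassign; global is needed if this runs at the top level of a script
end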
On a related note, it looks like you just have a list of files f_l with different csv files stored on your system which you want to read into a single DataFrame, in which case you can just replace the whole loop with:
s_d_df = vcat(CSV.read.(f_l, DataFrame)...)
(potentially also use the dateformat = "yyyy-mm-dd HH:MM:SS" kwarg in CSV.read to directly parse the dates when reading in the file).
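For example, a one-liner combining both suggestions (a sketch, assuming all files share the same layout):
s_d_df = vcat(CSV.read.(f_l, DataFrame; dateformat = "yyyy-mm-dd HH:MM:SS")...)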
The following code creates a segfault for me - is this a bug? And if so, in which component?
using DataFrames
function test()
    Threads.@threads for i in 1:50
        df = DataFrame()
        df.foo = 1
    end
end
test()
(you need to start Julia with multithreading support for this to work, e.g. JULIA_NUM_THREADS=50 julia)
It only generates a segfault if the number of iterations / threads is sufficiently high, eg 50. For lower numbers it only sporadically / never does so.
My environment:
julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 50
It is most likely caused by the fact that you are using deprecated syntax, so something in the deprecation handling probably messes things up (I do not have enough cores to test it).
In general, your code uses deprecated syntax (and produces something different from what you probably expect):
~$ julia --depwarn=yes --banner=no
julia> using DataFrames
julia> df = DataFrame()
0×0 DataFrame
julia> df.foo=1
┌ Warning: `setproperty!(df::DataFrame, col_ind::Symbol, v)` is deprecated, use `df[!, col_ind] .= v` instead.
│ caller = top-level scope at REPL[3]:1
└ # Core REPL[3]:1
1
julia> df # note that the deprecated syntax has added the column, but it has 0 rows
0×1 DataFrame
julia> df2 = DataFrame()
0×0 DataFrame
julia> df2.foo = [1] # this is a correct syntax - assign a vector
1-element Array{Int64,1}:
1
julia> df2[:, :foo2] .= 1 # or use broadcasting
1-element Array{Int64,1}:
1
julia> insertcols!(df2, :foo3 => 1) # or use insertcols! which does broadcasting automatically, see the docstring for details
1×3 DataFrame
│ Row │ foo │ foo2 │ foo3 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 1 │
The reason why df.foo = 1 is disallowed and df.foo = [1] is required is that, as opposed to e.g. R, Julia distinguishes scalars and vectors (in R everything is a vector).
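A minimal REPL illustration of that distinction:
julia> 1 isa AbstractVector
false

julia> [1] isa AbstractVector
true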
Going back to the original question, something like this should work:
using DataFrames
function test()
    Threads.@threads for i in 1:50
        df = DataFrame()
        df.foo = [1]
    end
end
test()
Please let me know whether or not it causes problems. Thank you!
I am new to Julia and am working with creating a properly shaped multidimensional array.
function get_deets(curric)
    curric = curric.metrics
    return ["" curric["complexity"][1] curric["blocking factor"][1] curric["delay factor"][1]]
end

function compare_currics(currics...)
    headers = [" ", "Complexity", "Blocking Factor", "Delay Factor"]
    data = [get_deets(curric) for curric in currics]
    return pretty_table(data, headers)
end
The data I am getting back is:
3-element Array{Array{Any,2},1}:
["" 393.0 184 209.0]
["" 361.0 164 197.0]
["" 363.0 165 198.0]
However, I need something that looks like this:
3×4 Array{Any,2}:
"" 393.0 184 209.0
"" 361.0 164 197.0
"" 363.0 165 198.0
I would replace the comprehension [get_deets(curric) for curric in currics] with a reduction.
For example:
using Random

function getdeets(curric)
    # random "deets", as a 1-D Vector
    return [randstring(4), rand(), 10rand(), 100rand()]
end

function getdata(currics)
    # All 1-D vectors are concatenated horizontally, to produce a
    # 2-D matrix with "deets" as columns (efficient since Julia matrices
    # are stored in column-major order)
    data = reduce(hcat, getdeets(curric) for curric in currics)
    return data
end
With this, you get a slightly different structure than what you asked for: it is transposed, but that should be more efficient overall:
julia> getdata(1:3)
4×3 Array{Any,2}:
"B2Mq" "S0hO" "6KCn"
0.291359 0.00046518 0.905285
4.03026 0.612037 8.6458
35.3133 79.3744 6.49379
If you want your tabular data to be presented in the same way as your question, this solution can easily be adapted:
function getdeets(curric)
    # random "deets", as a row matrix
    return [randstring(4) rand() 10rand() 100rand()]
end

function getdata(currics)
    # All rows are concatenated vertically, to produce a 2-D matrix
    data = reduce(vcat, getdeets(curric) for curric in currics)
    return data
end
This produces:
julia> getdata(1:3)
3×4 Array{Any,2}:
"eU7p" 0.563626 0.282499 52.1877
"3pIw" 0.646435 8.16608 27.534
"AI6z" 0.86198 0.235428 25.7382
It looks like for the stuff you want to do you need a DataFrame rather than an Array.
Look at the sample Julia session below:
julia> using DataFrames, Random
julia> df = DataFrame(_=randstring(4), Complexity=rand(4), Blocking_Factor=rand(4), Delay_Factor=rand(4))
4×4 DataFrame
│ Row │ _ │ Complexity │ Blocking_Factor │ Delay_Factor │
│ │ String │ Float64 │ Float64 │ Float64 │
├─────┼────────┼────────────┼─────────────────┼──────────────┤
│ 1 │ S6vT │ 0.817189 │ 0.00723053 │ 0.358754 │
│ 2 │ S6vT │ 0.569289 │ 0.978932 │ 0.385238 │
│ 3 │ S6vT │ 0.990195 │ 0.232987 │ 0.434745 │
│ 4 │ S6vT │ 0.59623 │ 0.113731 │ 0.871375 │
julia> Matrix(df[!,2:end])
4×3 Array{Float64,2}:
0.817189 0.00723053 0.358754
0.569289 0.978932 0.385238
0.990195 0.232987 0.434745
0.59623 0.113731 0.871375
Note that in the last part we have converted the numerical part of the data into an Array (I assume you need an Array at some point). Note that this Array contains only Float64 elements. In practice this means that no boxing will occur when storing values, and any operation on such an Array will be an order of magnitude faster. To illustrate the point, have a look at the code below (I copy the data from df into two almost identical Arrays).
julia> m = Matrix(df[!,2:end])
4×3 Array{Float64,2}:
0.817189 0.00723053 0.358754
0.569289 0.978932 0.385238
0.990195 0.232987 0.434745
0.59623 0.113731 0.871375
julia> m2 = Matrix{Any}(df[!,2:end])
4×3 Array{Any,2}:
0.817189 0.00723053 0.358754
0.569289 0.978932 0.385238
0.990195 0.232987 0.434745
0.59623 0.113731 0.871375
julia> using BenchmarkTools
julia> @btime mean($m)
5.099 ns (0 allocations: 0 bytes)
0.5296580253263143
julia> @btime mean($m2)
203.103 ns (12 allocations: 192 bytes)
0.5296580253263143
Is there a way to convert an object in Julia to a code representation generating the same object?
I am basically looking for an equivalent to R's dput function.
So if I have an object like:
A = rand(2,2)
# which outputs
2×2 Array{Float64,2}:
0.0462887 0.365109
0.698356 0.302478
I can do something like dput(A), which prints to the console a representation like the following that can be copy-pasted to replicate the object:
[0.0462887 0.365109; 0.698356 0.302478]
I think you are looking for repr:
julia> A = rand(2, 2);
julia> repr(A)
"[0.427705 0.0971806; 0.395074 0.168961]"
Just use Base.dump.
julia> dump(rand(2,2))
Array{Float64}((2, 2)) [0.162861 0.434463; 0.0823066 0.519742]
You can copy the second part.
(This is a modified crosspost of https://stackoverflow.com/a/73337342/18431399)
repr might not work as expected for DataFrames.
Here is one way to mimic the behaviour of R's dput for DataFrames in Julia:
julia> using DataFrames
julia> using Random; Random.seed!(0);
julia> df = DataFrame(a = rand(3), b = rand(1:10, 3))
3×2 DataFrame
Row │ a b
│ Float64 Int64
─────┼──────────────────
1 │ 0.405699 1
2 │ 0.0685458 7
3 │ 0.862141 2
julia> repr(df) # Attempting with repr()
"3×2 DataFrame\n Row │ a b\n │ Float64 Int64\n─────┼──────────────────\n 1 │ 0.405699 1\n 2 │ 0.0685458 7\n 3 │ 0.862141 2"
julia> julian_dput(x) = invoke(show, Tuple{typeof(stdout), Any}, stdout, x);
julia> julian_dput(df)
DataFrame(AbstractVector[[0.4056994708920292, 0.06854582438651502, 0.8621408571954849], [1, 7, 2]], DataFrames.Index(Dict(:a => 1, :b => 2), [:a, :b]))
That is, julian_dput() takes a DataFrame as input and prints Julia code that can regenerate the input.
Source: https://discourse.julialang.org/t/given-an-object-return-julia-code-that-defines-the-object/80579/12