Datetimes for Julia dataframes - julia

pandas has a number of very handy utilities for manipulating datetime indices. Is there any similar functionality in Julia? I have not found any tutorials for working with such things, though it obviously must be possible.
Some examples of pandas utilities:
dti = pd.to_datetime(
["1/1/2018", np.datetime64("2018-01-01"),
datetime.datetime(2018, 1, 1)]
)
dti = pd.date_range("2018-01-01", periods=3, freq="H")
dti = dti.tz_localize("UTC")
dti.tz_convert("US/Pacific")
idx = pd.date_range("2018-01-01", periods=5, freq="H")
ts = pd.Series(range(len(idx)), index=idx)
ts.resample("2H").mean()

Julia libraries follow a "do one thing and do it well" philosophy, so the ecosystem is laid out more like Unix (a battery of small tools that combine to accomplish a common goal) than like Python's.
Hence you have separate libraries for DataFrames and Dates:
julia> using Dates, DataFrames
Going through some of the examples of your tutorial:
Pandas
dti = pd.to_datetime(
["1/1/2018", np.datetime64("2018-01-01"), datetime.datetime(2018, 1, 1)]
)
Julia
julia> DataFrame(dti=[Date("1/1/2018", "m/d/y"), Date("2018-01-01"), Date(2018,1,1)])
3×1 DataFrame
Row │ dti
│ Date
─────┼────────────
1 │ 2018-01-01
2 │ 2018-01-01
3 │ 2018-01-01
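When many strings share one layout, it may be worth building the format once and reusing it. The following is a minimal sketch using only the Dates standard library; the names fmt, raw and dates, and the second sample string, are illustrative and not part of the original example.

```julia
using Dates

# Build the format object once; the string macro parses the pattern a single time.
fmt = dateformat"m/d/y"

# Illustrative input strings in month/day/year order.
raw = ["1/1/2018", "12/31/2018"]

# Parse every string with the same pre-built format.
dates = [Date(s, fmt) for s in raw]
println(dates)
```

The resulting vector can be passed to DataFrame(dti=dates) in the same way as the hand-written dates above.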
Pandas
dti = pd.date_range("2018-01-01", periods=3, freq="H")
Julia
julia> DateTime("2018-01-01") .+ Hour.(0:2)
3-element Vector{DateTime}:
2018-01-01T00:00:00
2018-01-01T01:00:00
2018-01-01T02:00:00
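If you know the start and end points rather than the number of periods, a step range is another option. This is a small sketch using only the Dates standard library; the variable names are illustrative.

```julia
using Dates

# Hourly range from an explicit start to an explicit stop (both endpoints included).
start = DateTime(2018, 1, 1)
hourly = start:Hour(1):(start + Hour(2))

# collect materializes the lazy range into a Vector{DateTime}.
stamps = collect(hourly)
println(stamps)
```

The range itself is lazy, so it can also be iterated or indexed directly without calling collect.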
Pandas
dti = dti.tz_localize("UTC")
dti.tz_convert("US/Pacific")
Julia
Note that time zones live in a separate Julia library, TimeZones.jl. Additionally, "US/Pacific" is a legacy time zone name, so it has to be requested explicitly.
julia> using TimeZones
julia> dti = ZonedDateTime.(dti, tz"UTC")
3-element Vector{ZonedDateTime}:
2018-01-01T00:00:00+00:00
2018-01-01T01:00:00+00:00
2018-01-01T02:00:00+00:00
julia> astimezone.(dti, TimeZone("US/Pacific", TimeZones.Class(:LEGACY)))
3-element Vector{ZonedDateTime}:
2017-12-31T16:00:00-08:00
2017-12-31T17:00:00-08:00
2017-12-31T18:00:00-08:00
Pandas
idx = pd.date_range("2018-01-01", periods=5, freq="H")
ts = pd.Series(range(len(idx)), index=idx)
ts.resample("2H").mean()
Julia
For resampling or other complex manipulations you will want to use the split-apply-combine pattern (see https://docs.juliahub.com/DataFrames/AR9oZ/1.3.1/man/split_apply_combine/)
julia> df = DataFrame(date=DateTime("2018-01-01") .+ Hour.(0:4), vals=1:5)
5×2 DataFrame
Row │ date vals
│ DateTime Int64
─────┼────────────────────────────
1 │ 2018-01-01T00:00:00 1
2 │ 2018-01-01T01:00:00 2
3 │ 2018-01-01T02:00:00 3
4 │ 2018-01-01T03:00:00 4
5 │ 2018-01-01T04:00:00 5
julia> df.date2 = floor.(df.date, Hour(2));
julia> using StatsBase
julia> combine(groupby(df, :date2), :vals => mean => :vals_mean)
3×2 DataFrame
Row │ date2 vals_mean
│ DateTime Float64
─────┼────────────────────────────────
1 │ 2018-01-01T00:00:00 1.5
2 │ 2018-01-01T02:00:00 3.5
3 │ 2018-01-01T04:00:00 5.0
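The bucketing idea behind the resample can also be sketched without DataFrames, using only the Dates and Statistics standard libraries. The names times, buckets and resampled are illustrative; the goal here is one mean per two-hour bucket.

```julia
using Dates, Statistics

times = DateTime(2018, 1, 1) .+ Hour.(0:4)
vals = 1:5

# Assign each timestamp to the start of its two-hour bucket.
buckets = floor.(times, Hour(2))

# One (bucket start => mean) pair per distinct bucket, in order of first appearance.
resampled = [b => mean(vals[buckets .== b]) for b in unique(buckets)]

for (bucket, avg) in resampled
    println(bucket, "  ", avg)
end
```

This scans the data once per bucket, so it is only meant to show the logic; for larger data the groupby/combine approach is likely the better fit.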

Related

Malformed expression in DataFramesMeta.jl when @selecting from DataFrame using a Symbol("x") expression

I have a Julia DataFrame which I can work with fine in any normal way. For example, let's say the df is
df = DataFrame(:x => [1,2,3], :y => [4,5,6], :z => [7,8,9]);
I can easily do
julia> df[:, :x]
3-element Vector{Int64}:
1
2
3
julia> df[:, [:x, Symbol("y")]]
3×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
However, if I do the following I get a massive error.
julia> @select(df, Symbol("y"))
ERROR: LoadError: ArgumentError: Malformed expression in DataFramesMeta.jl macro
Stacktrace:
[1] fun_to_vec(ex::Expr; gensym_names::Bool, outer_flags::NamedTuple{(Symbol("@byrow"), Symbol("@passmissing"), Symbol("@astable")), Tuple{Base.RefValue{Bool}, Base.RefValue{Bool}, Base.RefValue{Bool}}}, no_dest::Bool)
@ DataFramesMeta ~/.julia/packages/DataFramesMeta/yzaoq/src/parsing.jl:289
[2] (::DataFramesMeta.var"#47#48"{NamedTuple{(Symbol("@byrow"), Symbol("@passmissing"), Symbol("@astable")), Tuple{Base.RefValue{Bool}, Base.RefValue{Bool}, Base.RefValue{Bool}}}})(ex::Expr)
@ DataFramesMeta ./none:0
[3] iterate(::Base.Generator{Vector{Any}, DataFramesMeta.var"#47#48"{NamedTuple{(Symbol("@byrow"), Symbol("@passmissing"), Symbol("@astable")), Tuple{Base.RefValue{Bool}, Base.RefValue{Bool}, Base.RefValue{Bool}}}}})
@ Base ./generator.jl:47
[4] select_helper(x::Symbol, args::Expr)
@ DataFramesMeta ~/.julia/packages/DataFramesMeta/yzaoq/src/macros.jl:1440
[5] var"@select"(__source__::LineNumberNode, __module__::Module, x::Any, args::Vararg{Any})
@ DataFramesMeta ~/.julia/packages/DataFramesMeta/yzaoq/src/macros.jl:1543
in expression starting at REPL[35]:1
Any clues?
If you want the call to the Symbol constructor to be evaluated as ordinary Julia code, you need to escape it with $:
julia> @select(df, $(Symbol("y")))
3×1 DataFrame
Row │ y
│ Int64
─────┼───────
1 │ 4
2 │ 5
3 │ 6
See https://juliadata.github.io/DataFramesMeta.jl/stable/#dollar for more examples.
This is needed because DataFramesMeta.jl introduces non-standard evaluation: its macros interpret the expressions you write instead of evaluating them, so anything that should be evaluated in the standard way has to be escaped.
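To see why a macro treats :y and Symbol("y") differently, it can help to look at what a macro actually receives. The toy macro below is hypothetical (it is not part of DataFramesMeta.jl) and only reports the type and text of its unevaluated argument.

```julia
# A toy macro, for illustration only: it returns a string describing the
# expression it was handed, without ever evaluating that expression.
macro describe(ex)
    return string(typeof(ex), ": ", ex)
end

# A call to Symbol arrives as a call expression (an Expr), not as a Symbol value.
println(@describe Symbol("y"))

# A literal :y arrives as a quoted symbol (a QuoteNode), which a macro can recognize directly.
println(@describe :y)
```

A macro that expects column names in the second form has no way to know, at expansion time, what the first form will evaluate to, which is roughly why an explicit escape is required.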

Annotations in plot Julia

I am starting to program in Julia.
I would like to know how to put the corresponding name in the scatter for each country. I am using only the Plots.jl package, below is a screenshot of what I have done.
Any help is appreciated !!!
Use the series_annotations property. See the code below.
julia> using DataFrames, Random, Plots
julia> df = DataFrame(x=rand(8),y=rand(8),label=[randstring(6) for _ in 1:8])
8×3 DataFrame
Row │ x y label
│ Float64 Float64 String
─────┼─────────────────────────────
1 │ 0.436953 0.307696 YVFnr8
2 │ 0.269204 0.303321 Rpsbz5
3 │ 0.120839 0.870337 TEHjOf
4 │ 0.329191 0.893599 9cERmd
5 │ 0.484566 0.852965 wn379M
6 │ 0.743256 0.856181 RW1AjF
7 │ 0.857837 0.0873707 YpdRO2
8 │ 0.668681 0.414274 gJ4HLw
julia> scatter(df.x, df.y, series_annotations = text.(df.label, :bottom))

Shaping Julia multidimensional arrays

I am new to Julia and am working with creating a properly shaped multidimensional array.
function get_deets(curric)
curric = curric.metrics
return ["" curric["complexity"][1] curric["blocking factor"][1] curric["delay factor"][1]]
end
function compare_currics(currics...)
headers = [" ", "Complexity", "Blocking Factor", "Delay Factor"]
data = [get_deets(curric) for curric in currics]
return pretty_table(data, headers)
end
The data I am getting back is:
3-element Array{Array{Any,2},1}:
["" 393.0 184 209.0]
["" 361.0 164 197.0]
["" 363.0 165 198.0]
However, I need something that looks like this:
3×4 Array{Any,2}:
"" 393.0 184 209.0
"" 361.0 164 197.0
"" 363.0 165 198.0
I would replace the comprehension [get_deets(curric) for curric in currics] with a reduction.
For example:
using Random
function getdeets(curric)
# random "deets", as a 1-D Vector
return [randstring(4), rand(), 10rand(), 100rand()]
end
function getdata(currics)
# All 1-D vectors are concatenated horizontally, to produce a
# 2-D matrix with "deets" as columns (efficient since Julia matrices
# are stored in column major order)
data = reduce(hcat, getdeets(curric) for curric in currics)
return data
end
With this, you get a slightly different structure than what you want: it is transposed, but that should be more efficient overall
julia> getdata(1:3)
4×3 Array{Any,2}:
"B2Mq" "S0hO" "6KCn"
0.291359 0.00046518 0.905285
4.03026 0.612037 8.6458
35.3133 79.3744 6.49379
If you want your tabular data to be presented in the same way as your question, this solution can easily be adapted:
function getdeets(curric)
# random "deets", as a row matrix
return [randstring(4) rand() 10rand() 100rand()]
end
function getdata(currics)
# All rows are concatenated vertically, to produce a
# 2-D matrix
data = reduce(vcat, getdeets(curric) for curric in currics)
return data
end
This produces:
julia> getdata(1:3)
3×4 Array{Any,2}:
"eU7p" 0.563626 0.282499 52.1877
"3pIw" 0.646435 8.16608 27.534
"AI6z" 0.86198 0.235428 25.7382
It looks like for the stuff you want to do you need a DataFrame rather than an Array.
Look at the sample Julia session below:
julia> using DataFrames, Random
julia> df = DataFrame(_=randstring(4), Complexity=rand(4), Blocking_Factor=rand(4), Delay_Factor=rand(4))
4×4 DataFrame
│ Row │ _ │ Complexity │ Blocking_Factor │ Delay_Factor │
│ │ String │ Float64 │ Float64 │ Float64 │
├─────┼────────┼────────────┼─────────────────┼──────────────┤
│ 1 │ S6vT │ 0.817189 │ 0.00723053 │ 0.358754 │
│ 2 │ S6vT │ 0.569289 │ 0.978932 │ 0.385238 │
│ 3 │ S6vT │ 0.990195 │ 0.232987 │ 0.434745 │
│ 4 │ S6vT │ 0.59623 │ 0.113731 │ 0.871375 │
julia> Matrix(df[!,2:end])
4×3 Array{Float64,2}:
0.817189 0.00723053 0.358754
0.569289 0.978932 0.385238
0.990195 0.232987 0.434745
0.59623 0.113731 0.871375
Note that in the last part we converted the numerical part of the data into an Array (I assume you need an Array at some point). This Array contains only Float64 elements. In practice this means that no boxing occurs when storing values, and operations on such an Array can be an order of magnitude faster. To illustrate the point, have a look at the code below (I copy the data from df into two almost identical Arrays).
julia> m = Matrix(df[!,2:end])
4×3 Array{Float64,2}:
0.817189 0.00723053 0.358754
0.569289 0.978932 0.385238
0.990195 0.232987 0.434745
0.59623 0.113731 0.871375
julia> m2 = Matrix{Any}(df[!,2:end])
4×3 Array{Any,2}:
0.817189 0.00723053 0.358754
0.569289 0.978932 0.385238
0.990195 0.232987 0.434745
0.59623 0.113731 0.871375
julia> using BenchmarkTools, Statistics
julia> @btime mean($m)
5.099 ns (0 allocations: 0 bytes)
0.5296580253263143
julia> @btime mean($m2)
203.103 ns (12 allocations: 192 bytes)
0.5296580253263143
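The difference between the two matrices can also be inspected without a benchmark. The sketch below uses small made-up values (chosen to be exactly representable) and only Base plus the Statistics standard library; the names m_any and m_back are illustrative.

```julia
using Statistics

# Exactly representable values, so both means agree bit for bit.
m = [0.5 0.25; 1.0 2.0]        # Matrix{Float64}
m_any = Matrix{Any}(m)         # same values, abstract element type

# Float64 is a plain bits type and can be stored inline; Any cannot.
println(eltype(m), " is a bits type: ", isbitstype(eltype(m)))
println(eltype(m_any), " is a bits type: ", isbitstype(eltype(m_any)))

# Broadcasting identity re-derives the narrowest element type from the values.
m_back = identity.(m_any)
println(typeof(m_back))
```

If a Matrix{Any} reaches you from elsewhere, narrowing it once in this way before doing numerical work is usually cheap compared with operating on the boxed values repeatedly.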

Julia DataFrame ERROR: BoundsError attempt to access attempt to access a data frame with X rows at index false

I'm trying to subset a DataFrame in Julia as follows:
df = DataFrame(a=[1,2,3], b=["x", "y", "z"])
df2 = df[df.a == 2, :]
I'd expect to get back just the second row, but instead I get an error:
ERROR: BoundsError: attempt to access "attempt to access a data frame
with 3 rows at index false"
What does this error mean and how do I subset the DataFrame?
Just to mention other options, note that you can also use the filter function here:
julia> filter(row -> row.a == 2, df)
1×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 2 │ y │
or
julia> df[findall(==(2), df.a), :]
1×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 2 │ y │
Fortunately, you only need to add one character: a dot (.). The . character enables broadcasting on any Julia function, even ones like ==. Therefore, your code would be as follows:
df = DataFrame(a=[1,2,3], b=["x", "y", "z"])
df2 = df[df.a .== 2, :]
Without the broadcast, the clause df.a == 2 returns false because it's literally comparing the Array [1,2,3], as a whole unit, to the scalar value of 2. An Array of shape (3,) will never be equal to a scalar value of 2, without broadcasting, because the sizes are different. Therefore, that clause just returns a single false.
The error you're getting tells you that you're trying to access the DataFrame at index false, which is not a valid index for a DataFrame with 3 rows. By broadcasting with ., you're now creating a Bool Array of shape (3,), which is a valid way to index a DataFrame with 3 rows.
For more on broadcasting, see the official Julia documentation.
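The difference between the two comparisons can be checked on a plain vector, without any DataFrame involved. This is a minimal sketch in Base Julia; the variable names are illustrative.

```julia
a = [1, 2, 3]

# Without the dot, the whole vector is compared to the scalar: a single Bool.
whole = (a == 2)

# With the dot, the comparison is applied elementwise: one Bool per element.
mask = (a .== 2)

# A Bool mask of matching length, or the integer positions it selects,
# are both valid ways to pick rows.
positions = findall(mask)

println(whole)
println(mask)
println(positions)
```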

Equivalent to R's dput in Julia

Is there a way to convert an object in Julia to a code representation generating the same object?
I am basically looking for an equivalent to R's dput function.
So if I have an object like:
A = rand(2,2)
# Which outputs
2×2 Array{Float64,2}:
0.0462887 0.365109
0.698356 0.302478
I can do something like dput(A) which prints something like the following to the console that can be copy-pasted to be able to replicate the object:
[0.0462887 0.365109; 0.698356 0.302478]
I think you are looking for repr:
julia> A = rand(2, 2);
julia> repr(A)
"[0.427705 0.0971806; 0.395074 0.168961]"
Just use Base.dump.
julia> dump(rand(2,2))
Array{Float64}((2, 2)) [0.162861 0.434463; 0.0823066 0.519742]
You can copy the second part.
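For plain arrays, one way to convince yourself that the text produced by repr really regenerates the object is to parse and evaluate it again. This is a small sketch in Base Julia with a fixed matrix instead of random values; the names code and B are illustrative.

```julia
A = [1.5 2.0; 0.25 4.0]

# repr returns the printed form as a String.
code = repr(A)
println(code)

# Parsing and evaluating that string rebuilds an equal array.
B = eval(Meta.parse(code))
println(A == B)
```

Only evaluate strings you produced yourself in this way, since eval runs arbitrary code.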
(This is a modified crosspost of https://stackoverflow.com/a/73337342/18431399)
repr might not work as expected for DataFrames.
Here is one way to mimic the behaviour of R's dput for DataFrames in Julia:
julia> using DataFrames
julia> using Random; Random.seed!(0);
julia> df = DataFrame(a = rand(3), b = rand(1:10, 3))
3×2 DataFrame
Row │ a b
│ Float64 Int64
─────┼──────────────────
1 │ 0.405699 1
2 │ 0.0685458 7
3 │ 0.862141 2
julia> repr(df) # Attempting with repr()
"3×2 DataFrame\n Row │ a b\n │ Float64 Int64\n─────┼──────────────────\n 1 │ 0.405699 1\n 2 │ 0.0685458 7\n 3 │ 0.862141 2"
julia> julian_dput(x) = invoke(show, Tuple{typeof(stdout), Any}, stdout, x);
julia> julian_dput(df)
DataFrame(AbstractVector[[0.4056994708920292, 0.06854582438651502, 0.8621408571954849], [1, 7, 2]], DataFrames.Index(Dict(:a => 1, :b => 2), [:a, :b]))
That is, julian_dput() takes a DataFrame as input and prints Julia code that can regenerate it.
Source: https://discourse.julialang.org/t/given-an-object-return-julia-code-that-defines-the-object/80579/12
