Julia Groupby with mean calculation - julia

I have this dataframe:
d=DataFrame(class=["A","A","A","B","C","D","D","D"],
num=[10,20,30,40,20,20,13,12],
last=[3,5,7,9,11,13,100,12])
and I want to do a groupby. In Python I would do:
d.groupby('class')[['num','last']].mean()
How can I do the same in Julia?
I am trying to use combine and groupby, but with no success so far.
Update: I managed to do it this way:
gd = groupby(d, :class)
combine(gd, :num => mean, :last => mean)
Is there any better way to do it?

It depends what you mean by "a better way". You can apply the same function to multiple columns like this:
combine(gd, [:num, :last] .=> mean)
or if you had a lot of columns and e.g. wanted to apply mean to all columns except a grouping column you could do:
combine(gd, Not(:class) .=> mean)
or (if you want to avoid having to remember which column was grouping)
combine(gd, valuecols(gd) .=> mean)
These are the basic patterns. The other issue is how to name your target columns. By default they get a name of the form "source_function", like this:
julia> combine(gd, [:num, :last] .=> mean)
4×3 DataFrame
Row │ class num_mean last_mean
│ String Float64 Float64
─────┼─────────────────────────────
1 │ A 20.0 5.0
2 │ B 40.0 9.0
3 │ C 20.0 11.0
4 │ D 15.0 41.6667
you can keep original column names like this (this is sometimes preferred):
julia> combine(gd, [:num, :last] .=> mean, renamecols=false)
4×3 DataFrame
Row │ class num last
│ String Float64 Float64
─────┼──────────────────────────
1 │ A 20.0 5.0
2 │ B 40.0 9.0
3 │ C 20.0 11.0
4 │ D 15.0 41.6667
or like this:
julia> combine(gd, [:num, :last] .=> mean .=> identity)
4×3 DataFrame
Row │ class num last
│ String Float64 Float64
─────┼──────────────────────────
1 │ A 20.0 5.0
2 │ B 40.0 9.0
3 │ C 20.0 11.0
4 │ D 15.0 41.6667
The last example shows that, as the last part, you can pass any function that takes the source column name as a string and generates your target column name, so you can do:
julia> combine(gd, [:num, :last] .=> mean .=> col -> "prefix_" * uppercase(col) * "_suffix")
4×3 DataFrame
Row │ class prefix_NUM_suffix prefix_LAST_suffix
│ String Float64 Float64
─────┼───────────────────────────────────────────────
1 │ A 20.0 5.0
2 │ B 40.0 9.0
3 │ C 20.0 11.0
4 │ D 15.0 41.6667
Edit
Doing the operation in a single line:
You can do just:
combine(groupby(d, :class), [:num, :last] .=> mean)
The benefit of storing groupby(d, :class) in a variable is that you perform grouping once and then can reuse the resulting object many times, which speeds up things.
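As a minimal sketch of that reuse, assuming the data frame d from the question and the Statistics standard library for mean and median, the grouped object can be built once and then passed to several combine calls:

```julia
using DataFrames, Statistics

d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])

# Group once ...
gd = groupby(d, :class)

# ... and reuse the grouped object for different aggregations.
means   = combine(gd, [:num, :last] .=> mean)
medians = combine(gd, [:num, :last] .=> median)
counts  = combine(gd, nrow)   # number of rows per group, stored in column :nrow
```

Each call only pays for the aggregation, not for the grouping step.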
Also if you use DataFramesMeta.jl you could write e.g.:
@chain d begin
groupby(:class)
combine([:num, :last] .=> mean)
end
which is more typing, but this is style that people coming from R tend to like.

Related

no method matching NearestNeighbors.KDTree(::Matrix{Int64}, ::Distances.Euclidean) in Impute.knn

I get this error when I want to use the k-Nearest Neighbor algorithm for imputing missing values using Impute.jl:
using Impute, DataFrames
df = DataFrame(
a=[1,2,3,4,missing],
b=[1, missing, 3, 4, missing],
c=[1, 2, missing, 5, 8],
)
# 5×3 DataFrame
# Row │ a b c
# │ Int64? Int64? Int64?
# ─────┼───────────────────────────
# 1 │ 1 1 1
# 2 │ 2 missing 2
# 3 │ 3 3 missing
# 4 │ 4 4 5
# 5 │ missing missing 8
julia> Impute.knn(Matrix(df), dims=:cols)
ERROR: MethodError: no method matching NearestNeighbors.KDTree(::Matrix{Int64}, ::Distances.Euclidean)
Closest candidates are:
NearestNeighbors.KDTree(::AbstractVecOrMat{T}, ::M; leafsize, storedata, reorder, reorderbuffer) where {T<:AbstractFloat, M<:Union{Distances.Chebyshev, Distances.Cityblock, Distances.Euclidean, Distances.Minkowski, Distances.WeightedCityblock, Distances.WeightedEuclidean, Distances.WeightedMinkowski}} at C:\Users\Shayan\.julia\packages\NearestNeighbors\huCPc\src\kd_tree.jl:85
NearestNeighbors.KDTree(::AbstractVector{V}, ::M; leafsize, storedata, reorder, reorderbuffer) where {V<:AbstractArray, M<:Union{Distances.Chebyshev, Distances.Cityblock, Distances.Euclidean, Distances.Minkowski, Distances.WeightedCityblock, Distances.WeightedEuclidean, Distances.WeightedMinkowski}} at C:\Users\Shayan\.julia\packages\NearestNeighbors\huCPc\src\kd_tree.jl:27
How should I fix this?
The problem is that I'm passing a Matrix with element type Union{Missing, Int64} rather than Union{Missing, Float64}. Based on the error, NearestNeighbors.KDTree accepts AbstractVecOrMat{T} where {T<:AbstractFloat}. So first, I should perform a conversion and then pass the result to the knn imputer:
julia> Impute.knn(
convert(Matrix{Union{Missing, Float64}}, Matrix(df)),
dims=:cols
)
5×3 Matrix{Union{Missing, Float64}}:
1.0 1.0 1.0
2.0 3.0 2.0
3.0 3.0 2.0
4.0 4.0 5.0
4.0 4.0 8.0
Additional Point
After this, I can narrow the eltype of the result using identity.(result), assuming that I bound the result of the imputation to result:
julia> identity.(result)
5×3 Matrix{Float64}:
1.0 1.0 1.0
2.0 3.0 2.0
3.0 3.0 2.0
4.0 4.0 5.0
4.0 4.0 8.0
The reason for this last step is that many functions don't accept an AbstractMatrix whose element type is Union{Missing, T}. So narrowing the element type is often unavoidable in such situations.
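Both element-type changes can be tried without Impute.jl. The following is a small sketch that uses only Base; the matrices m and result are hypothetical stand-ins for Matrix(df) and for the imputed output:

```julia
# Integer matrix that allows missing values, as Matrix(df) would produce.
m = Union{Missing, Int}[1 1 1; 2 missing 2; 3 3 missing]

# Widen the non-missing part of the element type to Float64.
mf = convert(Matrix{Union{Missing, Float64}}, m)

# Stand-in for an imputed result: the element type still allows missing,
# but no missing values are left.
result = Union{Missing, Float64}[1.0 1.0; 2.0 3.0]

# Broadcasting identity recomputes the element type from the actual values.
narrowed = identity.(result)
```

Here eltype(mf) is Union{Missing, Float64}, while eltype(narrowed) is plain Float64 because no missing value remains in result.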

Manually setting bin sizes for grouped histogram bins in a DataFrame Julia

I've been using the following code to generate histograms by binning one column of a DataFrame and using those bins to calculate a median from another column.
using Plots, Statistics, DataFrames
df = DataFrame(x=repeat(["a","b","c"], 5), y=1:15)
res = combine(groupby(df, :x, sort=true), :y => median)
bar(res.x, res.y_median, legend=false)
The code selects individual point values for the bins, and I would like to apply a bin range of values manually, if possible.
Row │ A B_median
│ Any Float64
─────┼───────────────────
1 │ 1515.74 0.09
2 │ 1517.7 0.81
3 │ 1527.22 10.23
4 │ 1529.88 2.95
5 │ 1530.72 17.32
6 │ 1530.86 15.22
7 │ 1532.26 1.45
8 │ 1532.68 18.51
9 │ 1541.08 1.32
10 │ 1541.22 15.78
11 │ 1541.36 0.12
12 │ 1541.5 13.55
13 │ 1541.92 11.99
14 │ 1542.06 21.14
15 │ 1542.34 10.645
16 │ 1542.62 19.95
17 │ 1542.76 21.0
18 │ 1543.32 20.91
For example, instead of calculating a median for each of rows 9 to 17 individually, could these rows be bunched together automatically, i.e. 1542.7+/-0.7, and a single median value be calculated for this range?
Many thanks!
I assume you want something like this:
julia> using DataFrames, CategoricalArrays, Random, Statistics
julia> Random.seed!(1234);
julia> df = DataFrame(A=rand(20), B=rand(20));
julia> df.A_group = cut(df.A, 4);
julia> res = combine(groupby(df, :A_group, sort=true), :B => median)
4×2 DataFrame
Row │ A_group B_median
│ Cat… Float64
─────┼─────────────────────────────────────────────
1 │ Q1: [0.014908849285099945, 0.532… 0.134685
2 │ Q2: [0.5323651749779272, 0.65860… 0.347995
3 │ Q3: [0.6586057536399257, 0.81493… 0.501756
4 │ Q4: [0.8149335702852593, 0.97213… 0.531899
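If the bin ranges should be set by hand rather than derived from quantiles, cut from CategoricalArrays also accepts a vector of break points. The following is a sketch only: the data is a small subset of the table in the question, and the edges vector is a hypothetical choice made for illustration.

```julia
using DataFrames, CategoricalArrays, Statistics

df = DataFrame(A=[1515.74, 1517.7, 1541.08, 1541.5, 1542.76, 1543.32],
               B=[0.09, 0.81, 1.32, 13.55, 21.0, 20.91])

# Hypothetical bin edges; each interval is closed on the left and open on the right.
edges = [1510.0, 1520.0, 1543.0, 1550.0]
df.A_group = cut(df.A, edges)

# One median of B per manually defined bin of A.
res = combine(groupby(df, :A_group, sort=true), :B => median)
```

Values outside the outermost edges raise an error by default, so the edges need to cover the whole range of A (or extend=true can be passed to cut).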

Julia reporting an extra ) when it doesn't exist

I have this for loop in Julia:
begin
countries_data_labels = ["Canada", "Italy", "China", "United States", "Spain"]
y_axis = DataFrame()
for country in countries_data_labels
new_dataframe = get_country(df, country)
new_dataframe = DataFrame(new_dataframe)
df_rows, df_columns = size(new_dataframe)
new_dataframe_long = stack(new_dataframe, begin:end-4)
y_axis[!, Symbol("$country")] = new_dataframe_long[!, :value]
end
end
and I'm getting this error:
syntax: extra token ")" after end of expression
I commented out all of the body of the for loop except the first line, then uncommented one line at a time and re-ran the cell to see which line was throwing this error. It was the 4th line in the body:
new_dataframe_long = stack(new_dataframe, begin:end-4)
There is no reason for this error to exist. There are no extra bracket pieces in this line.
My guess is that you mean here:
stack(new_dataframe[begin:end-4, :])
See the MWE example below:
julia> df = DataFrame(a=11:16,b=2.5:7.5)
6×2 DataFrame
Row │ a b
│ Int64 Float64
─────┼────────────────
1 │ 11 2.5
2 │ 12 3.5
3 │ 13 4.5
4 │ 14 5.5
5 │ 15 6.5
6 │ 16 7.5
julia> stack(df[begin:end-3, :])
3×3 DataFrame
Row │ a variable value
│ Int64 String Float64
─────┼──────────────────────────
1 │ 11 b 2.5
2 │ 12 b 3.5
3 │ 13 b 4.5
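The underlying rule is that begin and end only stand for the first and last index when they appear inside indexing brackets. Outside of brackets, begin is the keyword that opens a block, which is presumably why the parser complains in the original line. A minimal sketch using only Base, with a hypothetical vector v:

```julia
v = collect(11:16)

# Inside brackets, begin and end stand for the first and last index.
head_part = v[begin:end-3]

# Outside brackets, spell the range out with firstindex and lastindex.
r = firstindex(v):lastindex(v)-3
same_part = v[r]
```

Both head_part and same_part hold the first three elements of v.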

Not able to fetch top N rows of dataframe in Julia : UndefVarError: head not defined

I am new to Julia. I want to see the first 5 rows of a data frame, but when I write the code below
head(df,5)
I am getting
UndefVarError: head not defined
head is available in e.g. R but not in Julia. First - note that Julia has a nice data frame printing system out of the box that crops things to fit in the terminal window, so you do not need to subset your data frame to see its head and tail. Here is an example:
julia> df = DataFrame(rand(100, 100), :auto)
100×100 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 ⋯
│ Float64 Float64 Float64 Float64 Float64 Float64 Float6 ⋯
─────┼────────────────────────────────────────────────────────────────────────────
1 │ 0.915485 0.176254 0.381047 0.710266 0.597914 0.177617 0.4475 ⋯
2 │ 0.58495 0.551726 0.464703 0.630956 0.476727 0.804854 0.7908
3 │ 0.123723 0.183817 0.986624 0.306091 0.202054 0.148579 0.3433
4 │ 0.558321 0.117478 0.187091 0.482795 0.0718985 0.807018 0.9463
5 │ 0.771561 0.515823 0.830598 0.0742368 0.0831569 0.818487 0.4912 ⋯
6 │ 0.139018 0.182928 0.00129572 0.0439561 0.0929167 0.264609 0.1555
7 │ 0.16076 0.404707 0.0300284 0.665413 0.681704 0.431746 0.3460
8 │ 0.149331 0.132869 0.237446 0.599701 0.149257 0.70753 0.7687
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
93 │ 0.912703 0.98395 0.133307 0.493799 0.76125 0.295725 0.9249 ⋯
94 │ 0.153175 0.339036 0.685642 0.355421 0.365252 0.434604 0.1515
95 │ 0.780877 0.225312 0.511122 0.0506186 0.108054 0.729219 0.5275
96 │ 0.132961 0.348176 0.619712 0.791334 0.052787 0.577896 0.6696
97 │ 0.904386 0.938876 0.988184 0.831708 0.699214 0.627366 0.4320 ⋯
98 │ 0.0295777 0.704879 0.905364 0.142231 0.586725 0.584692 0.9546
99 │ 0.848715 0.177192 0.544509 0.771653 0.472267 0.584306 0.0089
100 │ 0.81299 0.00540772 0.107315 0.323288 0.592159 0.1297 0.3383
94 columns and 84 rows omitted
Now if you need to fetch the first 5 rows of your data frame and create a new data frame, then use the first function that is defined in Julia Base:
julia> first(df, 5)
5×100 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x ⋯
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 F ⋯
─────┼────────────────────────────────────────────────────────────────────────────
1 │ 0.915485 0.176254 0.381047 0.710266 0.597914 0.177617 0.447533 0 ⋯
2 │ 0.58495 0.551726 0.464703 0.630956 0.476727 0.804854 0.790866 0
3 │ 0.123723 0.183817 0.986624 0.306091 0.202054 0.148579 0.343316 0
4 │ 0.558321 0.117478 0.187091 0.482795 0.0718985 0.807018 0.946342 0
5 │ 0.771561 0.515823 0.830598 0.0742368 0.0831569 0.818487 0.491206 0 ⋯
93 columns omitted
In general, the design of DataFrames.jl is to limit the number of new function names as much as possible and to reuse what is defined in Julia Base where possible. This is one example of such a situation. This way users have fewer things to learn.
In Julia, the equivalent command is first rather than head.
first is used instead of head, and last is used instead of tail.
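A short sketch of both functions on a hypothetical data frame (the column names id and value are made up for illustration):

```julia
using DataFrames

df = DataFrame(id=1:100, value=(1:100) .^ 2)

top5    = first(df, 5)   # new data frame with rows 1 to 5
bottom5 = last(df, 5)    # new data frame with rows 96 to 100
row1    = first(df)      # without a count, a single DataFrameRow is returned
```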

Using R lag for Julia

I am attempting to create a lag +1 forward for a particular column in my data frame.
My data is like this:
julia> head(df)
6×9 DataFrames.DataFrame. Omitted printing of 1 columns
│ Row │ Date │ Open │ High │ Low │ Close │ Adj Close │ Volume │ Close_200sma │
├─────┼────────────┼─────────┼─────────┼─────────┼─────────┼───────────┼─────────┼──────────────┤
│ 1 │ 1993-02-02 │ 43.9687 │ 43.9687 │ 43.75 │ 43.9375 │ 27.6073 │ 1003200 │ NaN │
│ 2 │ 1993-02-03 │ 43.9687 │ 44.25 │ 43.9687 │ 44.25 │ 27.8036 │ 480500 │ NaN │
│ 3 │ 1993-02-04 │ 44.2187 │ 44.375 │ 44.125 │ 44.3437 │ 27.8625 │ 201300 │ NaN │
│ 4 │ 1993-02-05 │ 44.4062 │ 44.8437 │ 44.375 │ 44.8125 │ 28.1571 │ 529400 │ NaN │
│ 5 │ 1993-02-08 │ 44.9687 │ 45.0937 │ 44.4687 │ 45.0 │ 28.2749 │ 531500 │ NaN │
│ 6 │ 1993-02-09 │ 44.9687 │ 45.0625 │ 44.7187 │ 44.9687 │ 28.2552 │ 492100 │ NaN │
This is my attempt at lagging forward. In R I might use rep(NA, 1) and then prepend this to the subsetted data. Here is my Julia:
# Lag data +1 forward
lag = df[1:nrow(df)-1,[:Long]] # shorten vector by 1 (remove last element)
v = Float64[]
v = vec(convert(Array, lag)) # convert df column to vector
z = fill(NaN, 1) # rep NaN, 1 time (add this to front) to push all forward +1
lags = Float64[]
lags= vec[z; [v]] # join both arrays z=NA first , make vector same nrow(df)
When I join the NaN and my array I have a length(lags) of 2.
The data is split in two:
julia> length(lags[2])
6255
I see the longer length when I access the second portion.
If I join the other way, NaN at end, numbers first. I obtain correct length:
# try joining other way
lags_flip = [v; [z]]
julia> length(lags_flip)
6256
I can also add this back to my data frame (NaN is at the bottom, but I want it at the front):
# add back to data frame
df[:add] = lags_flip
1
1
1
1
1
1
1
1
[NaN]
My question is: when I join the NaN and my data like this:
lags_flip = [v; [z]]
I obtain the correct length. When I do it the other way,
NaN first:
lags= [z; [v]]
then it doesn't appear correct.
How can I offset my data +1 forward, placing a NaN in front, and add it back to my df? I feel I'm close but missing something.
EDIT:
On second thought, changing the length of a column in a DataFrame is probably not the best thing to do, and I assume you want a new column anyway. In this case this could be a basic approach:
df.LagLong = [missing; df[1:end-1,:Long]]
or if you want NaN (but probably you want missing as explained below):
df.LagLong = [NaN; df[1:end-1,:Long]]
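As a self-contained sketch of the new-column approach, written with the property syntax (df.LagLong = ...) that current DataFrames.jl releases accept, and using a tiny hypothetical Long column in place of the real data:

```julia
using DataFrames

# Hypothetical stand-in for the :Long column from the question.
df = DataFrame(Long=[1.0, 2.0, 3.0, 4.0])

# Prepend missing and drop the last value so the length stays nrow(df).
df.LagLong = [missing; df.Long[1:end-1]]
```

The new column has element type Union{Missing, Float64}, and every value of Long appears one row later in LagLong.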
PREVIOUS REPLY:
You can do it in place:
julia> x = [1.0,2.0,3.0]
3-element Vector{Float64}:
1.0
2.0
3.0
julia> pop!(pushfirst!(x, NaN))
3.0
julia> x
3-element Vector{Float64}:
NaN
1.0
2.0
Replace x in pop!(pushfirst!(x, NaN)) by an appropriate column selector like df.Long.
Note, however, that NaN is not NA in R. In Julia the equivalent of NA is missing. There are two cases:
if your column allows missing values (its element type is Union{Missing, [Something]}, as reported by eltype(df.Long)), then you do the same as above: pop!(pushfirst!(df.Long, missing)).
if it does not allow missing values, you have two options. The first is to call allowmissing!(df, :Long) to allow missing values and then proceed as described above. The other is similar to the approach you have proposed: df.Long = [missing; df[1:end-1, :Long]].
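As for why the concatenation in the question gave a length of 2: writing [v] wraps the whole vector in an outer one-element vector, so [z; [v]] stores v as a single element next to NaN. A minimal sketch using only Base, with a short hypothetical v:

```julia
v = [1.0, 2.0, 3.0]   # hypothetical stand-in for the shortened column
z = fill(NaN, 1)      # one-element vector holding NaN

nested = [z; [v]]     # [v] wraps v in an outer vector, so v is stored as one element
flat   = [z; v]       # concatenates the elements of z and v
```

Here nested has two elements (NaN and the vector v itself), while flat has length(v) + 1 elements with NaN in front, which is the shape needed for a column.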

Resources