Return the maximum sum in `DataFrames.jl`?

Suppose my DataFrame has two columns, v and g. First, I group the DataFrame by column g and compute the sum of column v within each group. Second, I use the function maximum to retrieve the largest of those sums. Is it possible to retrieve that value in one step? Thanks.
julia> using DataFrames, Random
julia> Random.seed!(1)
TaskLocalRNG()
julia> dt = DataFrame(v = rand(15), g = rand(1:3, 15))
15×2 DataFrame
Row │ v g
│ Float64 Int64
─────┼──────────────────
1 │ 0.0491718 3
2 │ 0.119079 2
3 │ 0.393271 2
4 │ 0.0240943 3
5 │ 0.691857 2
6 │ 0.767518 2
7 │ 0.087253 1
8 │ 0.855718 1
9 │ 0.802561 3
10 │ 0.661425 1
11 │ 0.347513 2
12 │ 0.778149 3
13 │ 0.196832 1
14 │ 0.438058 2
15 │ 0.0113425 1
julia> gdt = combine(groupby(dt, :g), :v => sum => :v)
3×2 DataFrame
Row │ g v
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
julia> maximum(gdt.v)
2.7572966050340257

I am not sure if this is what you mean, but you can retrieve the values of both v and g in one step with the following command:
julia> v, g = findmax(x-> (x.v, x.g), eachrow(gdt))[1]
(2.7572966050340257, 2)
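
If both the largest sum and the group it belongs to are needed, another option is to call findmax on the summed column and use the returned position to look up the group. The following is only a minimal sketch on a small hand-made table; the data and the names best_sum, best_idx and best_g are made up for illustration, and DataFrames.jl is assumed to be installed:

```julia
using DataFrames

# Small illustrative table (not the data from the question).
dt = DataFrame(v = [1.0, 2.0, 3.0, 4.0, 5.0], g = [1, 1, 2, 2, 3])

# Per-group sums, computed the same way as in the question.
gdt = combine(groupby(dt, :g), :v => sum => :v)

# findmax returns the largest value together with its position in the vector.
best_sum, best_idx = findmax(gdt.v)

# Use that position to read off the corresponding group.
best_g = gdt.g[best_idx]
```

Here the group sums are 3.0, 7.0 and 5.0, so best_sum is 7.0 and best_g is 2.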

DataFramesMeta.jl has a @by macro:
julia> @by(dt, :g, :sv = sum(:v))
3×2 DataFrame
Row │ g sv
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
which gives you somewhat neater syntax for the first part of this.
With that, you can do either:
julia> @by(dt, :g, :sv = sum(:v)).sv |> maximum
2.7572966050340257
or (IMO more readably):
julia> @chain dt begin
           @by(:g, :sv = sum(:v))
           maximum(_.sv)
       end
2.7572966050340257
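
For comparison, the same one-step result can probably be obtained with plain DataFrames.jl and no macros. The sketch below uses a small made-up table, and the names max_via_combine and max_via_generator are illustrative only:

```julia
using DataFrames

# Small illustrative table (not the data from the question).
dt = DataFrame(v = [1.0, 2.0, 3.0, 4.0, 5.0], g = [1, 1, 2, 2, 3])

# Option 1: build the per-group sums and take the maximum of that column.
max_via_combine = maximum(combine(groupby(dt, :g), :v => sum => :sv).sv)

# Option 2: iterate over the groups directly and skip the intermediate table.
max_via_generator = maximum(sum(sdf.v) for sdf in groupby(dt, :g))
```

Both expressions give 7.0 for this table. The second avoids materializing the summary data frame, which may matter only when there are many groups.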

Related

Is there as.factor analogue in Julia?

I have an integer column in a data frame. How can I convert its values into strings in Julia?
In R I can simply write:
mutate(column2 = as.factor(column1))
In Julia:
julia> using DataFramesMeta, CategoricalArrays
julia> df = DataFrame(a=1:3, b='a':'c')
3×2 DataFrame
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> @transform!(df, :b = categorical(:b))
3×2 DataFrame
Row │ a b
│ Int64 Cat…
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
Use @transform instead if you want a new data frame. The target column name can also be different, e.g. :b_categorical = categorical(:b).
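
Since the question mentions both factors and strings, here is a small sketch of the two conversions on an integer column without any macros. The table and the column names a_cat and a_str are made up for illustration; DataFrames.jl and CategoricalArrays.jl are assumed to be installed:

```julia
using DataFrames, CategoricalArrays

# Small illustrative table with an integer column.
df = DataFrame(a = [3, 1, 2, 1])

# Closest analogue of R's as.factor: a categorical column.
df.a_cat = categorical(df.a)

# Plain string conversion, broadcasting string over the column.
df.a_str = string.(df.a)

# The levels of the categorical column, sorted by default.
lv = levels(df.a_cat)
```

The categorical column keeps the integer values as its levels, while a_str holds ordinary String values.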

Julia panel data find data

Suppose I have the following data.
dt = DataFrame(
id = [1,1,1,1,1,2,2,2,2,2,],
t = [1,2,3,4,5, 1,2,3,4,5],
val = randn(10)
)
Row │ id t val
│ Int64 Int64 Float64
─────┼─────────────────────────
1 │ 1 1 0.546673
2 │ 1 2 -0.817519
3 │ 1 3 0.201231
4 │ 1 4 0.856569
5 │ 1 5 1.8941
6 │ 2 1 0.240532
7 │ 2 2 -0.431824
8 │ 2 3 0.165137
9 │ 2 4 1.22958
10 │ 2 5 -0.424504
I want to create a dummy variable indicating whether val > 0.5 at any point from t to t+2.
For instance, I want to create a new variable val_gr_0.5, as in the table below.
Could someone help me with how to do this?
Row │ id t val val_gr_0.5
│ Int64 Int64 Float64 Float64
─────┼─────────────────────────
1 │ 1 1 0.546673 0 (search t:1 to 3)
2 │ 1 2 -0.817519 1 (search t:2 to 4)
3 │ 1 3 0.201231 1 (search t:3 to 5)
4 │ 1 4 0.856569 missing
5 │ 1 5 1.8941 missing
6 │ 2 1 0.240532 0 (search t:1 to 3)
7 │ 2 2 -0.431824 1 (search t:2 to 4)
8 │ 2 3 0.165137 1 (search t:3 to 5)
9 │ 2 4 1.22958 missing
10 │ 2 5 -0.424504 missing
julia> using DataFramesMeta
julia> function checkvals(subsetdf)
           vals = subsetdf[!, :val]
           length(vals) < 3 && return missing
           any(vals .> 0.5)
       end
checkvals (generic function with 1 method)
julia> for sdf in groupby(dt, :id)
           transform!(sdf, :t => ByRow(t -> checkvals(@subset(sdf, @byrow t <= :t <= t+2))) => :val_gr)
       end
julia> dt
10×4 DataFrame
Row │ id t val val_gr
│ Int64 Int64 Float64 Bool?
─────┼──────────────────────────────────
1 │ 1 1 0.0619327 false
2 │ 1 2 0.278406 false
3 │ 1 3 -0.595824 true
4 │ 1 4 0.0466594 missing
5 │ 1 5 1.08579 missing
6 │ 2 1 -1.57656 true
7 │ 2 2 0.17594 true
8 │ 2 3 0.865381 true
9 │ 2 4 0.972024 missing
10 │ 2 5 1.54641 missing
First, define a function:
function run_max(x, window)
    window -= 1
    # positions without a full window ahead keep the value missing
    res = missings(eltype(x), length(x))
    for i in 1:length(x)-window
        res[i] = maximum(view(x, i:i+window))
    end
    res
end
Then use it with DataFrames.jl:
dt.new = dt.val .> 0.5
transform!(groupby(dt, 1), :new => x -> run_max(x, 3))
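
The forward-looking window check itself can be illustrated on a plain vector, without any packages. The helper any_above_in_window below is a hypothetical name introduced only for this sketch, and it assumes the values are already sorted by t with no gaps:

```julia
# Hypothetical helper: for each position i, report whether any of
# vals[i], ..., vals[i + window - 1] exceeds `threshold`.
# Positions without a full window ahead are left as `missing`.
function any_above_in_window(vals::AbstractVector{<:Real}, window::Integer, threshold::Real)
    n = length(vals)
    res = Vector{Union{Missing, Bool}}(missing, n)
    for i in 1:(n - window + 1)
        res[i] = any(>(threshold), view(vals, i:i + window - 1))
    end
    return res
end

# Illustrative values for a single id (made up, not the question's data).
vals = [0.2, -0.8, 0.2, 0.9, 1.9]
flags = any_above_in_window(vals, 3, 0.5)
```

For these values flags is [false, true, true, missing, missing]. Under the same assumptions, it could presumably be applied per id with something like transform(groupby(dt, :id), :val => (v -> any_above_in_window(v, 3, 0.5)) => :val_gr).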

Julia dataframe : Deleting row from sub dataframe

Problem statement: deleting a row from a sub-dataframe.
Code:
x=[rand(3) for i in 1:3]
dfx=DataFrame(x,:auto)
dfy=@view dfx[2:3,:]
Q: I want to delete the first row from dfy so that it is deleted from dfx too.
I take a subset of the original dfx to check further whether the subsetted rows fulfill some conditions. At the end I want to decide whether to keep each row in dfx or delete it. I operate on dfy, which is a subset of dfx.
You are not allowed to perform row deletion in views. Here is one example showing why it would be problematic:
julia> using DataFrames
julia> df = DataFrame(a=1:3)
3×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> dfv = view(df, [1, 1, 1, 1], :)
4×1 SubDataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 1
3 │ 1
4 │ 1
Now assume you want to remove rows 2 and 3 from the dfv view. Both refer to the same row of the parent, and you cannot remove a row from the parent twice. Also, what would the state of dfv be after such a deletion?
As for checking the subsetted rows against your conditions: you can use the parentindices function to get the indices in the parent of your view, so that you can later remove the appropriate rows from the parent.
EDIT
An example:
julia> x=[rand(3) for i in 1:3]
3-element Vector{Vector{Float64}}:
[0.9362990387940191, 0.872386665989372, 0.9062520245175714]
[0.31161625031197393, 0.21614040488877717, 0.7277794414244152]
[0.35548885964798926, 0.4422493896149622, 0.45150837090448315]
julia> dfx=DataFrame(x, :auto)
3×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.936299 0.311616 0.355489
2 │ 0.872387 0.21614 0.442249
3 │ 0.906252 0.727779 0.451508
julia> dfy=@view dfx[2:3, :]
2×3 SubDataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.872387 0.21614 0.442249
2 │ 0.906252 0.727779 0.451508
julia> row_to_remove = parentindices(dfy)[1][1]
2
julia> delete!(dfx, row_to_remove)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.936299 0.311616 0.355489
2 │ 0.906252 0.727779 0.451508
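
The same idea extends to removing several rows selected by a condition on the view. The sketch below uses made-up data and illustrative names (drop_in_view, rows_in_parent), and it assumes a DataFrames.jl version that provides deleteat! for data frames:

```julia
using DataFrames

# Small illustrative parent table and a view of rows 2 to 4.
dfx = DataFrame(x1 = [1, 2, 3, 4], x2 = [10, 20, 30, 40])
dfy = @view dfx[2:4, :]

# Positions inside the view that fail the (made-up) condition.
drop_in_view = findall(dfy.x2 .> 25)

# Translate view positions into row numbers of the parent.
rows_in_parent = parentindices(dfy)[1][drop_in_view]

# Remove those rows from the parent in one call.
deleteat!(dfx, rows_in_parent)
```

After the deletion the view dfy refers to rows that no longer exist in that form, so it should not be used again; create a fresh view if one is still needed.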

Condition-based column selection in Julia Programming Language

I'm relatively new to Julia. How can I select columns in DataFrames.jl based on a condition, e.g., all columns with a mean greater than 0?
One way to select columns based on a column-wise condition is to map that condition on the columns using eachcol, then use the resulting Bool array as a column selector on the DataFrame:
julia> using DataFrames, Statistics
julia> df = DataFrame(a=randn(10), b=randn(10) .- 1, c=randn(10) .+ 1, d=randn(10))
10×4 DataFrame
Row │ a b c d
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────
1 │ -1.05612 -2.01901 1.99614 -2.08048
2 │ -0.37359 0.00750529 2.11529 1.93699
3 │ -1.15199 -0.812506 -0.721653 -0.286076
4 │ 0.992366 -2.05898 0.474682 -0.210283
5 │ 0.206846 -0.922274 1.87723 -0.403679
6 │ -1.01923 -1.4401 -0.0769749 0.0557395
7 │ 1.99409 -0.463743 1.83163 -0.585677
8 │ 2.21445 0.658119 2.33056 -1.01474
9 │ 0.918917 -0.371214 1.76301 -0.234561
10 │ -0.839345 -1.09017 1.38716 -2.82545
julia> f(x) = mean(x) > 0
f (generic function with 1 method)
julia> df[:, map(f, eachcol(df))]
10×2 DataFrame
Row │ a c
│ Float64 Float64
─────┼───────────────────────
1 │ -1.05612 1.99614
2 │ -0.37359 2.11529
3 │ -1.15199 -0.721653
4 │ 0.992366 0.474682
5 │ 0.206846 1.87723
6 │ -1.01923 -0.0769749
7 │ 1.99409 1.83163
8 │ 2.21445 2.33056
9 │ 0.918917 1.76301
10 │ -0.839345 1.38716
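
An equivalent formulation, sketched here on a small deterministic table, collects the names of the qualifying columns first and then passes them to select. The data and the names keep and selected are made up for illustration:

```julia
using DataFrames, Statistics

# Small illustrative table: columns a and c have a positive mean, b does not.
df = DataFrame(a = [1.0, 2.0], b = [-1.0, -2.0], c = [0.5, 0.5])

# Collect the names of the columns whose mean is greater than 0.
keep = [name for name in names(df) if mean(df[!, name]) > 0]

# Build a new data frame containing only those columns.
selected = select(df, keep)
```

Keeping the names in a separate vector can be handy when the same set of columns has to be reused later.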

How do you apply a shift to a Julia Dataframe?

In Python's pandas, the shift function shifts the rows of a data frame forward or backward relative to the original, which is useful for calculating changes in time series data. What is the equivalent method in Julia?
Normally one would use ShiftedArrays.jl and apply it to columns that require shifting.
Here is a small working example:
using DataFrames, ShiftedArrays
df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
transform(df, :a => lag => :lag_a)
3×3 DataFrame
Row │ a b lag_a
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 1 4 missing
2 │ 2 5 1
3 │ 3 6 2
or you could do:
df.c = lag(df.a)
or, to have the lead of two rows:
df.c = lead(df.a, 2)
etc.
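
To make the shifting explicit, here is a package-free sketch of what a lag and a lead do to a vector, followed by a change calculation. The helpers lag_vec and lead_vec are hypothetical names written for this illustration (they are not the ShiftedArrays.jl functions, which return lazy shifted arrays), and they assume 0 <= n <= length(x):

```julia
# Hypothetical helpers: shift a vector by n positions and pad with missing.
lag_vec(x::AbstractVector, n::Integer = 1) = vcat(fill(missing, n), x[1:end-n])
lead_vec(x::AbstractVector, n::Integer = 1) = vcat(x[n+1:end], fill(missing, n))

# Illustrative series (made up).
prices = [10, 12, 15, 11]

# Previous value for each row; the first row has none.
lagged = lag_vec(prices)

# Value two rows ahead; the last two rows have none.
led = lead_vec(prices, 2)

# Period-over-period change; missing propagates through the subtraction.
change = prices .- lagged
```

Here lagged is [missing, 10, 12, 15], led is [15, 11, missing, missing], and change is [missing, 2, 3, -4].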
