Problem statement : deleting row from sub dataframe
Code:
x=[rand(3) for i in 1:3]
dfx=DataFrame(x,:auto)
dfy=#view dfx[2:3,:]
Q: I want to delete first row from dfy so it will be deleted from dfx too.
I do subset of original dfx to make further checking of subsetted rows if they fulfill conditions. At the end I want to decide to keep row in dfx or to delete it. I operate on subset of dfx which is dfy.
You are not allowed to perform row deletion in views. Here is one example showing why it would be problematic:
julia> using DataFrames
julia> df = DataFrame(a=1:3)
dfv = 3×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> dfv = view(df, [1, 1, 1, 1], :)
4×1 SubDataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 1
3 │ 1
4 │ 1
and now assume you want to remove rows 2 and 3 from the dfv view, but you cannot remove the row from the parent twice and also after such a deletion what would be the state of dfv?
I do subset of original dfx to make further checking of subsetted rows if they fulfill conditions.
Note that you can use parentindices function to get the indices in the parent of your view, so that you can later remove appropriate rows from the parent.
EDIT
An example:
julia> x=[rand(3) for i in 1:3]
3-element Vector{Vector{Float64}}:
[0.9362990387940191, 0.872386665989372, 0.9062520245175714]
[0.31161625031197393, 0.21614040488877717, 0.7277794414244152]
[0.35548885964798926, 0.4422493896149622, 0.45150837090448315]
julia> dfx=DataFrame(x, :auto)
3×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.936299 0.311616 0.355489
2 │ 0.872387 0.21614 0.442249
3 │ 0.906252 0.727779 0.451508
julia> dfy=#view dfx[2:3, :]
2×3 SubDataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.872387 0.21614 0.442249
2 │ 0.906252 0.727779 0.451508
julia> row_to_remove = parentindices(dfy)[1][1]
2
julia> delete!(dfx, row_to_remove)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.936299 0.311616 0.355489
2 │ 0.906252 0.727779 0.451508
Related
I have an integer column in dataframe. How can I convert its values into string in Julia?
In R a can simply write:
mutate(column2 = as.factor(column1))
In Julia:
julia> using DataFramesMeta, CategoricalArrays
julia> df = DataFrame(a=1:3, b='a':'c')
3×2 DataFrame
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> #transform!(df, :b = categorical(:b))
3×2 DataFrame
Row │ a b
│ Int64 Cat…
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
or #transform if you want a new data frame. Also target column name can be different e.g. :b_categorical = categorical(:b).
Suppose my DataFrame has two columns v and g. First, I grouped the DataFrame by column g and calculated the sum of the column v. Second, I used the function maximum to retrieve the maximum sum. I am wondering whether it is possible to retrieve the value in one step? Thanks.
julia> using Random
julia> Random.seed!(1)
TaskLocalRNG()
julia> dt = DataFrame(v = rand(15), g = rand(1:3, 15))
15×2 DataFrame
Row │ v g
│ Float64 Int64
─────┼──────────────────
1 │ 0.0491718 3
2 │ 0.119079 2
3 │ 0.393271 2
4 │ 0.0240943 3
5 │ 0.691857 2
6 │ 0.767518 2
7 │ 0.087253 1
8 │ 0.855718 1
9 │ 0.802561 3
10 │ 0.661425 1
11 │ 0.347513 2
12 │ 0.778149 3
13 │ 0.196832 1
14 │ 0.438058 2
15 │ 0.0113425 1
julia> gdt = combine(groupby(dt, :g), :v => sum => :v)
3×2 DataFrame
Row │ g v
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
julia> maximum(gdt.v)
2.7572966050340257
I am not sure if that is what you mean but you can retrieve the values of g and v in one step using the following command:
julia> v, g = findmax(x-> (x.v, x.g), eachrow(gdt))[1]
(4.343050512360169, 3)
DataFramesMeta.jl has an #by macro:
julia> #by(dt, :g, :sv = sum(:v))
3×2 DataFrame
Row │ g sv
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
which gives you somewhat neater syntax for the first part of this.
With that, you can do either:
julia> #by(dt, :g, :sv = sum(:v)).sv |> maximum
2.7572966050340257
or (IMO more readably):
julia> #chain dt begin
#by(:g, :sv = sum(:v))
maximum(_.sv)
end
2.7572966050340257
I'm relatively new to Julia - I wondered how to select some columns in DataFrames.jl, based on condition, e.q., all columns with an average greater than 0.
One way to select columns based on a column-wise condition is to map that condition on the columns using eachcol, then use the resulting Bool array as a column selector on the DataFrame:
julia> using DataFrames, Statistics
julia> df = DataFrame(a=randn(10), b=randn(10) .- 1, c=randn(10) .+ 1, d=randn(10))
10×4 DataFrame
Row │ a b c d
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────
1 │ -1.05612 -2.01901 1.99614 -2.08048
2 │ -0.37359 0.00750529 2.11529 1.93699
3 │ -1.15199 -0.812506 -0.721653 -0.286076
4 │ 0.992366 -2.05898 0.474682 -0.210283
5 │ 0.206846 -0.922274 1.87723 -0.403679
6 │ -1.01923 -1.4401 -0.0769749 0.0557395
7 │ 1.99409 -0.463743 1.83163 -0.585677
8 │ 2.21445 0.658119 2.33056 -1.01474
9 │ 0.918917 -0.371214 1.76301 -0.234561
10 │ -0.839345 -1.09017 1.38716 -2.82545
julia> f(x) = mean(x) > 0
f (generic function with 1 method)
julia> df[:, map(f, eachcol(df))]
10×2 DataFrame
Row │ a c
│ Float64 Float64
─────┼───────────────────────
1 │ -1.05612 1.99614
2 │ -0.37359 2.11529
3 │ -1.15199 -0.721653
4 │ 0.992366 0.474682
5 │ 0.206846 1.87723
6 │ -1.01923 -0.0769749
7 │ 1.99409 1.83163
8 │ 2.21445 2.33056
9 │ 0.918917 1.76301
10 │ -0.839345 1.38716
In python pandas, the shift function is useful to shift the rows in the dataframe forward and possible relative to the original which allows for calculating changes in time series data. What is the equivalent method in Julia?
Normally one would use ShiftedArrays.jl and apply it to columns that require shifting.
Here is a small working example:
using DataFrames, ShiftedArrays
df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
transform(df, :a => lag => :lag_a)
3×3 DataFrame
Row │ a b lag_a
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 1 4 missing
2 │ 2 5 1
3 │ 3 6 2
or you could do:
df.c = lag(df.a)
or, to have the lead of two rows:
df.c = lead(df.a, 2)
etc.
In R we can convert NA to 0 with:
df[is.na(df)] <- 0
This works for single columns:
df[ismissing.(df[:col]), :col] = 0
There a way for the full df?
I don't think there's such a function in DataFrames.jl yet.
But you can hack your way around it by combining colwise and recode. I'm also providing a reproducible example here, in case someone wants to iterate on this answer:
julia> using DataFrames
julia> df = DataFrame(a = [missing, 5, 5],
b = [1, missing, missing])
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1 │
│ 2 │ 5 │ missing │
│ 3 │ 5 │ missing │
julia> DataFrame(colwise(col -> recode(col, missing=>0), df), names(df))
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼───┼───┤
│ 1 │ 0 │ 1 │
│ 2 │ 5 │ 0 │
│ 3 │ 5 │ 0 │
This is a bit ugly as you have to reassign the dataframe column names.
Maybe a simpler way to convert all missing values in a DataFrame is to just use list comprehension:
[df[ismissing.(df[i]), i] = 0 for i in names(df)]