Julia panel data find data - julia

Suppose I have the following data.
dt = DataFrame(
id = [1,1,1,1,1,2,2,2,2,2,],
t = [1,2,3,4,5, 1,2,3,4,5],
val = randn(10)
)
Row │ id t val
│ Int64 Int64 Float64
─────┼─────────────────────────
1 │ 1 1 0.546673
2 │ 1 2 -0.817519
3 │ 1 3 0.201231
4 │ 1 4 0.856569
5 │ 1 5 1.8941
6 │ 2 1 0.240532
7 │ 2 2 -0.431824
8 │ 2 3 0.165137
9 │ 2 4 1.22958
10 │ 2 5 -0.424504
I want to create a dummy variable that indicates, for each row, whether val > 0.5 at any time from t to t+2 within the same id.
For instance, I want to add val_gr_0.5 as a new variable.
Could someone help me with how to do this?
Row │ id t val val_gr_0.5
│ Int64 Int64 Float64 Float64
─────┼─────────────────────────
1 │ 1 1 0.546673 0 (search t:1 to 3)
2 │ 1 2 -0.817519 1 (search t:2 to 4)
3 │ 1 3 0.201231 1 (search t:3 to 5)
4 │ 1 4 0.856569 missing
5 │ 1 5 1.8941 missing
6 │ 2 1 0.240532 0 (search t:1 to 3)
7 │ 2 2 -0.431824 1 (search t:2 to 4)
8 │ 2 3 0.165137 1 (search t:3 to 5)
9 │ 2 4 1.22958 missing
10 │ 2 5 -0.424504 missing

julia> using DataFramesMeta
julia> function checkvals(subsetdf)
vals = subsetdf[!, :val]
length(vals) < 3 && return missing
any(vals .> 0.5)
end
checkvals (generic function with 1 method)
julia> for sdf in groupby(dt, :id)
transform!(sdf, :t => ByRow(t -> checkvals(@subset(sdf, @byrow t <= :t <= t+2))) => :val_gr)
end
julia> dt
10×4 DataFrame
Row │ id t val val_gr
│ Int64 Int64 Float64 Bool?
─────┼──────────────────────────────────
1 │ 1 1 0.0619327 false
2 │ 1 2 0.278406 false
3 │ 1 3 -0.595824 true
4 │ 1 4 0.0466594 missing
5 │ 1 5 1.08579 missing
6 │ 2 1 -1.57656 true
7 │ 2 2 0.17594 true
8 │ 2 3 0.865381 true
9 │ 2 4 0.972024 missing
10 │ 2 5 1.54641 missing

First, define a function:
function run_max(x, window)
window -= 1
res = missings(eltype(x), length(x))
for i in 1:length(x)-window
res[i] = maximum(view(x, i:i+window))
end
res
end
Then use it with DataFrames.jl:
dt.new = dt.val .> 0.5
transform!(groupby(dt,1), :new => x->run_max(x, 3))
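The forward-looking window logic used in the answers above can also be sketched in plain Julia, without any package. This is only a minimal sketch: the name any_above_ahead and its keyword window are hypothetical, and it assumes the values of one id are already sorted by t with no gaps.

```julia
# For each position i, check whether any of x[i], ..., x[i+window-1]
# exceeds `threshold`. Positions without a full window stay `missing`.
# (`any_above_ahead` is a hypothetical helper name.)
function any_above_ahead(x::AbstractVector, threshold; window::Int=3)
    n = length(x)
    res = Vector{Union{Missing,Bool}}(missing, n)
    for i in 1:(n - window + 1)
        res[i] = any(>(threshold), view(x, i:(i + window - 1)))
    end
    return res
end

vals = [0.06, 0.28, -0.60, 0.05, 1.09]   # made-up values for one id
flags = any_above_ahead(vals, 0.5)
```

Assuming DataFrames.jl is loaded, the helper could then be applied per group with something like transform!(groupby(dt, :id), :val => (v -> any_above_ahead(v, 0.5)) => :val_gr).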

Related

How to filter a dataframe keeping the highest value of a certain column

Given the dataframe below, I want to filter records that share the same q2, id_q, check_id and keep only the ones with the highest value.
input dataframe:
q1      q2        id_q  check_id  value
sdfsdf  dfsdfsdf  10    10        90
hdfhhd  dfsdfsdf  10    10        80
There are two rows with the same q2, id_q, and check_id but different values: 90 and 80.
For each combination of q2, id_q, and check_id, I want to return only the row with the highest value.
In other words, I want to drop duplicates with respect to check_id and id_q and keep the row with the highest value in the value column.
Desired Output:
q1      q2        id_q  check_id  value
sdfsdf  dfsdfsdf  10    10        90
For this case, this code seems to be shorter than the ones referenced in other answers:
Suppose you have
julia> df = DataFrame(a=["a","a","b","b","b","b"], b=[1,1,2,2,3,3],c=11:16,notimportant=rand(6))
6×4 DataFrame
Row │ a b c notimportant
│ String Int64 Int64 Float64
─────┼────────────────────────────────────
1 │ a 1 11 0.93785
2 │ a 1 12 0.877777
3 │ b 2 13 0.845306
4 │ b 2 14 0.477606
5 │ b 3 15 0.722569
6 │ b 3 16 0.122807
Then you can just do:
julia> combine(groupby(df, [:a, :b]), :c => maximum => :c)
3×3 DataFrame
Row │ a b c
│ String Int64 Int64
─────┼──────────────────────
1 │ a 1 12
2 │ b 2 14
3 │ b 3 16
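Note that the combine call above returns only the grouping columns and the maximum of c. If the whole row with the highest value is needed, including the non-grouping columns, one possible sketch is to pick the row at argmax within each group. The column other and its values are made up for illustration, and ties are resolved by taking the first maximal row.

```julia
using DataFrames

# Same shape as the example above, with a made-up extra column `other`.
df = DataFrame(a=["a", "a", "b", "b", "b", "b"], b=[1, 1, 2, 2, 3, 3],
               c=11:16, other=["u", "v", "w", "x", "y", "z"])

# For each (a, b) group, keep the full row where c is largest.
best = combine(groupby(df, [:a, :b])) do sdf
    sdf[argmax(sdf.c), :]
end
```

Compared with the maximum-based version, this keeps every column of the winning row rather than just the aggregated one.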

Is there as.factor analogue in Julia?

I have an integer column in dataframe. How can I convert its values into string in Julia?
In R I can simply write:
mutate(column2 = as.factor(column1))
In Julia:
julia> using DataFramesMeta, CategoricalArrays
julia> df = DataFrame(a=1:3, b='a':'c')
3×2 DataFrame
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> @transform!(df, :b = categorical(:b))
3×2 DataFrame
Row │ a b
│ Int64 Cat…
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
Use @transform instead if you want a new data frame. Also, the target column name can be different, e.g. :b_categorical = categorical(:b).
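Since the question asks for strings specifically, broadcasting string over the column may be all that is needed. A minimal sketch in plain Julia, with made-up values standing in for column1:

```julia
column1 = [3, 1, 2, 3]          # made-up integer column
column2 = string.(column1)      # element-wise conversion to String
```

In a data frame this would look like df.column2 = string.(df.column1). Unlike categorical, the result is an ordinary vector of strings with no notion of levels.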

Return the maximum sum in `DataFrames.jl`?

Suppose my DataFrame has two columns v and g. First, I grouped the DataFrame by column g and calculated the sum of the column v. Second, I used the function maximum to retrieve the maximum sum. I am wondering whether it is possible to retrieve the value in one step? Thanks.
julia> using Random
julia> Random.seed!(1)
TaskLocalRNG()
julia> dt = DataFrame(v = rand(15), g = rand(1:3, 15))
15×2 DataFrame
Row │ v g
│ Float64 Int64
─────┼──────────────────
1 │ 0.0491718 3
2 │ 0.119079 2
3 │ 0.393271 2
4 │ 0.0240943 3
5 │ 0.691857 2
6 │ 0.767518 2
7 │ 0.087253 1
8 │ 0.855718 1
9 │ 0.802561 3
10 │ 0.661425 1
11 │ 0.347513 2
12 │ 0.778149 3
13 │ 0.196832 1
14 │ 0.438058 2
15 │ 0.0113425 1
julia> gdt = combine(groupby(dt, :g), :v => sum => :v)
3×2 DataFrame
Row │ g v
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
julia> maximum(gdt.v)
2.7572966050340257
I am not sure if that is what you mean but you can retrieve the values of g and v in one step using the following command:
julia> v, g = findmax(x-> (x.v, x.g), eachrow(gdt))[1]
(2.7572966050340257, 2)
DataFramesMeta.jl has an @by macro:
julia> @by(dt, :g, :sv = sum(:v))
3×2 DataFrame
Row │ g sv
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
which gives you somewhat neater syntax for the first part of this.
With that, you can do either:
julia> @by(dt, :g, :sv = sum(:v)).sv |> maximum
2.7572966050340257
or (IMO more readably):
julia> @chain dt begin
@by(:g, :sv = sum(:v))
maximum(_.sv)
end
2.7572966050340257
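To make explicit what the group, sum, then maximum pipeline computes, here is a minimal sketch in plain Julia that accumulates the per-group sums in a Dict and then asks for the largest one. The vectors v and g hold made-up values, and the variable names are hypothetical.

```julia
v = [1.0, 2.0, 3.0, 4.0]   # made-up values
g = [1, 2, 1, 2]           # made-up group labels

# Accumulate the sum of v for each group label.
sums = Dict{Int,Float64}()
for (gi, vi) in zip(g, v)
    sums[gi] = get(sums, gi, 0.0) + vi
end

# findmax on a Dict returns the largest value and its key.
best_sum, best_group = findmax(sums)
```

This returns both the maximum sum and the group it belongs to in a single call.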

Splitting datasets into train and test in julia

I am trying to split a dataset into train and test subsets in Julia. So far, I have tried using the MLDataUtils.jl package for this operation; however, the results are not what I expected.
Below are my findings and issues:
Code
# the inputs are
a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
)
b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
using MLDataUtils
(x1, y1), (x2, y2) = stratifiedobs((a,b), p=0.7)
#Output of this operation is: (which is not the expectation)
println("x1 is: $x1")
x1 is:
10×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │ 6 │
│ 7 │ 7 │ 7 │ 7 │
│ 8 │ 8 │ 8 │ 8 │
│ 9 │ 9 │ 9 │ 9 │
│ 10 │ 10 │ 10 │ 10 │
println("y1 is: $y1")
y1 is:
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
# but x2 is printed as
(0×3 SubDataFrame, Float64[])
# while y2 as
0-element view(::Array{Float64,1}, Int64[]) with eltype Float64)
However, I would like this dataset to be split in 2 parts with 70% data in train and 30% in test.
Please suggest a better approach to perform this operation in julia.
Thanks in advance.
Probably MLJ.jl developers can show you how to do it using the general ecosystem. Here is a solution using DataFrames.jl only:
julia> using DataFrames, Random
julia> a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
)
10×3 DataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 1
2 │ 2 2 2
3 │ 3 3 3
4 │ 4 4 4
5 │ 5 5 5
6 │ 6 6 6
7 │ 7 7 7
8 │ 8 8 8
9 │ 9 9 9
10 │ 10 10 10
julia> function splitdf(df, pct)
@assert 0 <= pct <= 1
ids = collect(axes(df, 1))
shuffle!(ids)
sel = ids .<= nrow(df) .* pct
return view(df, sel, :), view(df, .!sel, :)
end
splitdf (generic function with 1 method)
julia> splitdf(a, 0.7)
(7×3 SubDataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 3 3
2 │ 4 4 4
3 │ 6 6 6
4 │ 7 7 7
5 │ 8 8 8
6 │ 9 9 9
7 │ 10 10 10, 3×3 SubDataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 1
2 │ 2 2 2
3 │ 5 5 5)
I am using views to save memory, but alternatively you could just materialize train and test data frames if you prefer this.
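If materialized copies are preferred over views, one option is to split the row indices first and then index the data frame with them. The following is a minimal sketch using only the Random standard library; the name split_indices and its rng keyword are hypothetical.

```julia
using Random

# Shuffle the row indices 1:n and cut them at the requested share.
# (`split_indices` is a hypothetical helper name.)
function split_indices(n::Integer, pct::Real; rng=Random.default_rng())
    @assert 0 <= pct <= 1
    perm = randperm(rng, n)
    ntrain = round(Int, n * pct)
    return perm[1:ntrain], perm[ntrain+1:end]
end

train_idx, test_idx = split_indices(10, 0.7; rng=MersenneTwister(42))
```

With the data frame a from above, a[train_idx, :] and a[test_idx, :] would then give independent copies of the two parts.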
This is how I implemented it for generic arrays in the Beta Machine Learning Toolkit:
import Random  # provides Random.shuffle used below
"""
partition(data,parts;shuffle=true)
Partition (by rows) one or more matrices according to the shares in `parts`.
# Parameters
* `data`: A matrix/vector or a vector of matrices/vectors
* `parts`: A vector of the required shares (must sum to 1)
* `shuffle`: Whether to randomly shuffle the matrices (preserving the relative order between matrices)
"""
function partition(data::AbstractArray{T,1},parts::AbstractArray{Float64,1};shuffle=true) where T <: AbstractArray
n = size(data[1],1)
if !all(size.(data,1) .== n)
@error "All matrices passed to `partition` must have the same number of rows"
end
ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
return partition.(data,Ref(parts);shuffle=shuffle, fixedRIdx = ridx)
end
function partition(data::AbstractArray{T,N} where N, parts::AbstractArray{Float64,1};shuffle=true,fixedRIdx=Int64[]) where T
n = size(data,1)
nParts = length(parts)  # number of parts (size(parts) would return a tuple)
toReturn = []
if !(sum(parts) ≈ 1)
@error "The sum of `parts` in `partition` should total to 1."
end
ridx = fixedRIdx
if (isempty(ridx))
ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
end
current = 1
cumPart = 0.0
for (i,p) in enumerate(parts)
cumPart += parts[i]
final = i == nParts ? n : Int64(round(cumPart*n))
push!(toReturn,data[ridx[current:final],:])
current = final + 1
end
return toReturn
end
Use it with:
julia> x = [1:10 11:20]
julia> y = collect(31:40)
julia> ((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3])
Note that you can also partition into three or more parts, and the number of arrays to partition is variable as well.
By default the rows are also shuffled, but you can avoid this with shuffle=false.
using Pkg
Pkg.add("Lathe")
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(df)
There is also a positional argument, at, in the second position, which takes the percentage to split at.

Condition-based column selection in Julia Programming Language

I'm relatively new to Julia. I wondered how to select some columns in DataFrames.jl based on a condition, e.g., all columns with an average greater than 0.
One way to select columns based on a column-wise condition is to map that condition on the columns using eachcol, then use the resulting Bool array as a column selector on the DataFrame:
julia> using DataFrames, Statistics
julia> df = DataFrame(a=randn(10), b=randn(10) .- 1, c=randn(10) .+ 1, d=randn(10))
10×4 DataFrame
Row │ a b c d
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────
1 │ -1.05612 -2.01901 1.99614 -2.08048
2 │ -0.37359 0.00750529 2.11529 1.93699
3 │ -1.15199 -0.812506 -0.721653 -0.286076
4 │ 0.992366 -2.05898 0.474682 -0.210283
5 │ 0.206846 -0.922274 1.87723 -0.403679
6 │ -1.01923 -1.4401 -0.0769749 0.0557395
7 │ 1.99409 -0.463743 1.83163 -0.585677
8 │ 2.21445 0.658119 2.33056 -1.01474
9 │ 0.918917 -0.371214 1.76301 -0.234561
10 │ -0.839345 -1.09017 1.38716 -2.82545
julia> f(x) = mean(x) > 0
f (generic function with 1 method)
julia> df[:, map(f, eachcol(df))]
10×2 DataFrame
Row │ a c
│ Float64 Float64
─────┼───────────────────────
1 │ -1.05612 1.99614
2 │ -0.37359 2.11529
3 │ -1.15199 -0.721653
4 │ 0.992366 0.474682
5 │ 0.206846 1.87723
6 │ -1.01923 -0.0769749
7 │ 1.99409 1.83163
8 │ 2.21445 2.33056
9 │ 0.918917 1.76301
10 │ -0.839345 1.38716
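The approach above assumes every column is numeric, since mean would fail on, say, a column of strings. A hedged sketch of one way to guard against that is to test the element type before computing the mean. The data frame below is made up for illustration.

```julia
using DataFrames, Statistics

# Made-up data with one non-numeric column.
df = DataFrame(a=[1.0, 2.0], b=[-1.0, -3.0], label=["x", "y"])

# Build a Bool mask: numeric columns with a positive mean only.
# The && short-circuits, so mean is never called on `label`.
keep = [eltype(col) <: Real && mean(col) > 0 for col in eachcol(df)]
result = df[:, keep]
```

Columns that may contain missing values would need extra handling (for example skipmissing), which this sketch does not cover.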
