Linear regression per group in Julia - julia

To do a linear regression in Julia we can use the function lm like this:
using DataFrames
using GLM
df = DataFrame(x=[2,3,2,1,3,5,7,4,2],
y=[2,3,5,1,5,6,4,2,3])
9×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 2 2
2 │ 3 3
3 │ 2 5
4 │ 1 1
5 │ 3 5
6 │ 5 6
7 │ 7 4
8 │ 4 2
9 │ 2 3
lm(@formula(y~x), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
y ~ 1 + x
Coefficients:
───────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept) 2.14516 1.11203 1.93 0.0951 -0.484383 4.77471
x 0.403226 0.303282 1.33 0.2254 -0.313923 1.12037
───────────────────────────────────────────────────────────────────────
But I was wondering how to do a linear regression per group in Julia. Here is some reproducible data:
df = DataFrame(group = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
x=[2,3,2,1,3,5,7,4,2,3,4,5,2,6,3,1,6,1],
y=[2,3,5,1,5,6,4,2,3,3,2,4,7,1,8,4,3,1])
18×3 DataFrame
Row │ group x y
│ Char Int64 Int64
─────┼─────────────────────
1 │ A 2 2
2 │ A 3 3
3 │ A 2 5
4 │ A 1 1
5 │ A 3 5
6 │ A 5 6
7 │ A 7 4
8 │ A 4 2
⋮ │ ⋮ ⋮ ⋮
12 │ B 5 4
13 │ B 2 7
14 │ B 6 1
15 │ B 3 8
16 │ B 1 4
17 │ B 6 3
18 │ B 1 1
3 rows omitted
So I was wondering if anyone knows how to perform a linear regression per group (in this case for groups A and B in df) and get statistics such as the p-value and R-squared per group in Julia?

UPDATE: In light of @matnor's comment about the trouble of interpreting the results of a regression with a baseline (as in the original answer), here is a better formula which gives clearer grouped results:
lm(@formula(y~0 + group + x & group), df)
With this regression the table is mostly self-explanatory. Note covariance still needs interpretation (but depending on context may be more applicable).
ORIGINAL ANSWER:
It can be as simple as:
lm(@formula(y~1 + group + x*group), df)
GLM fits group as a categorical variable, adding a dummy coefficient (maybe today the PC crowd will change this name) for each group. The interaction term x*group adds another set of dummy coefficients. The first set represents the intercept of each group, and the second represents the slope. Here are the results:
StatsModels.TableRegressionModel{...}
y ~ 1 + x + group + x & group
Coefficients:
──────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|)
──────────────────────────────────────────────────────
(Intercept) 2.14516 1.47762 1.45 0.1686
x 0.403226 0.402988 1.00 0.3340
group: B 2.62322 2.10649 1.25 0.2335
x & group: B -0.723079 0.557198 -1.30 0.2154
──────────────────────────────────────────────────────
Note that group A doesn't appear because it is the baseline group (and corresponds to first intercept/slope pair).
If you look at the numbers for group B, for example, you can see 2.62322 and -0.723079, which need to be added to the baseline values to get the intercept/slope of the group:
julia> # 4.76838, -0.319853 are group B intercept/slope
julia> 2.14516 + 2.62322 ≈ 4.76838 # intercept
true
julia> 0.403226 + -0.723079 ≈ -0.319853 # slope
true
There are some benefits in terms of efficiency to this method, as well as added flexibility (GLM has more features).
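As a minimal sketch of the arithmetic above, the per-group intercept and slope can be read off the fitted model by name instead of by hand. It assumes the coefficient names match the table printed above ("group: B", "x & group: B"); the variable names are only illustrative.

```julia
using DataFrames, GLM

df = DataFrame(group = repeat(['A', 'B'], inner = 9),
               x = [2,3,2,1,3,5,7,4,2,3,4,5,2,6,3,1,6,1],
               y = [2,3,5,1,5,6,4,2,3,3,2,4,7,1,8,4,3,1])

model = lm(@formula(y ~ 1 + group + x*group), df)

# Map coefficient names to estimates (names assumed to match the printed table).
c = Dict(zip(coefnames(model), coef(model)))

# Group A is the baseline; group B is baseline plus its offsets.
intercept_A, slope_A = c["(Intercept)"], c["x"]
intercept_B = c["(Intercept)"] + c["group: B"]
slope_B = c["x"] + c["x & group: B"]
```

Note that the standard errors and p-values of the pooled model refer to the offsets, not to each group's own line, so they are not the same as those of separate per-group fits.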

Use groupby function from DataFrames:
julia> gp = groupby(df, :group)
GroupedDataFrame with 2 groups based on key: group
First Group (9 rows): group = 'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
Row │ group x y
│ Char Int64 Int64
─────┼─────────────────────
1 │ A 2 2
2 │ A 3 3
3 │ A 2 5
4 │ A 1 1
5 │ A 3 5
6 │ A 5 6
7 │ A 7 4
8 │ A 4 2
9 │ A 2 3
⋮
Last Group (9 rows): group = 'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
Row │ group x y
│ Char Int64 Int64
─────┼─────────────────────
1 │ B 3 3
2 │ B 4 2
3 │ B 5 4
⋮ │ ⋮ ⋮ ⋮
7 │ B 1 4
8 │ B 6 3
9 │ B 1 1
3 rows omitted
julia> for df in gp
           @show lm(@formula(y~x), df)
       end
lm(#= REPL[12]:2 =# @formula(y ~ x), df) = StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
y ~ 1 + x
Coefficients:
───────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept) 2.14516 1.11203 1.93 0.0951 -0.484383 4.77471
x 0.403226 0.303282 1.33 0.2254 -0.313923 1.12037
───────────────────────────────────────────────────────────────────────
lm(#= REPL[12]:2 =# @formula(y ~ x), df) = StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
y ~ 1 + x
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) 4.76838 1.79758 2.65 0.0328 0.517773 9.01899
x -0.319853 0.460734 -0.69 0.5099 -1.40932 0.769609
─────────────────────────────────────────────────────────────────────────
And if you want to save the objects returned by lm, you can take the following approach:
julia> res = Vector{StatsModels.TableRegressionModel}(undef, 2);
julia> for (idx,df) in enumerate(gp)
           res[idx] = lm(@formula(y~x), df)
       end
julia> res[1]
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
y ~ 1 + x
Coefficients:
───────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept) 2.14516 1.11203 1.93 0.0951 -0.484383 4.77471
x 0.403226 0.303282 1.33 0.2254 -0.313923 1.12037
───────────────────────────────────────────────────────────────────────
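Since the question asks for the p-value and R-squared per group, here is a hedged sketch that collects them into one table. It assumes `coef`, `coeftable` and `r2` as exported by GLM, and that the `CoefTable` returned by `coeftable` exposes the `cols` and `pvalcol` fields; `fit_stats` is a hypothetical helper name.

```julia
using DataFrames, GLM

df = DataFrame(group = repeat(['A', 'B'], inner = 9),
               x = [2,3,2,1,3,5,7,4,2,3,4,5,2,6,3,1,6,1],
               y = [2,3,5,1,5,6,4,2,3,3,2,4,7,1,8,4,3,1])

# Fit y ~ x on one group and return the statistics of interest as a NamedTuple.
function fit_stats(sdf)
    m = lm(@formula(y ~ x), sdf)
    ct = coeftable(m)
    pvals = ct.cols[ct.pvalcol]      # p-values, in coefficient order
    return (intercept = coef(m)[1],
            slope = coef(m)[2],
            p_slope = pvals[2],
            r2 = r2(m))
end

# One row per group, one column per statistic.
stats = combine(groupby(df, :group), fit_stats)
```

Returning a NamedTuple from the function lets `combine` build the columns, so no preallocated vector of models is needed.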

Related

How to filter a dataframe keeping the highest value of a certain column

Given the dataframe below, I want to filter the records that share the same q2, id_q and check_id, and keep only the one with the highest value.
input dataframe:
q1      q2        id_q  check_id  value
sdfsdf  dfsdfsdf  10    10        90
hdfhhd  dfsdfsdf  10    10        80
There are two rows with the same q2, id_q and check_id but different values: 90 and 80.
For each combination of q2, id_q and check_id, I want to return the row with the highest value.
In other words, I want to drop duplicates with respect to check_id and id_q, keeping the row with the highest value in the value column.
Desired Output:
q1      q2        id_q  check_id  value
sdfsdf  dfsdfsdf  10    10        90
For this case, this code seems to be shorter than the ones referenced in other answers:
Suppose you have
julia> df = DataFrame(a=["a","a","b","b","b","b"], b=[1,1,2,2,3,3],c=11:16,notimportant=rand(6))
6×4 DataFrame
Row │ a b c notimportant
│ String Int64 Int64 Float64
─────┼────────────────────────────────────
1 │ a 1 11 0.93785
2 │ a 1 12 0.877777
3 │ b 2 13 0.845306
4 │ b 2 14 0.477606
5 │ b 3 15 0.722569
6 │ b 3 16 0.122807
Then you can just do:
julia> combine(groupby(df, [:a, :b]), :c => maximum => :c)
3×3 DataFrame
Row │ a b c
│ String Int64 Int64
─────┼──────────────────────
1 │ a 1 12
2 │ b 2 14
3 │ b 3 16
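The `combine` call above keeps only the grouping columns and the aggregated column. If the whole row is needed (for example q1 in the question), one possible sketch is to pick the row at `argmax` within each group. The data below is the example from the question; `best` is just an illustrative name.

```julia
using DataFrames

df = DataFrame(q1 = ["sdfsdf", "hdfhhd"],
               q2 = ["dfsdfsdf", "dfsdfsdf"],
               id_q = [10, 10],
               check_id = [10, 10],
               value = [90, 80])

# For each (q2, id_q, check_id) group, return the full row where `value` is largest.
best = combine(groupby(df, [:q2, :id_q, :check_id])) do sdf
    sdf[argmax(sdf.value), :]
end
```

If several rows tie for the maximum, `argmax` returns the first of them, so exactly one row per group is kept.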

Counting occurrences and making calculations with a dataframe

I have column in a dataframe like this:
df = DataFrame(:num=>rand(0:10,20))
From df I want to make 2 others dataframe:
df1 = counter(df[!,:num])
To have the frequencies of each integer from 0 to 10. But I need the values sorted from 0 to 10:
0=>2
1=>3
2=>7
so on..
Then I want a new dataframe df2 where:
column_p = sum of occurrences of 9 and 10
column_n = sum of occurrences of 7 and 8
column_d = sum of occurrences of 0 to 6
I managed to get the first part, although the result is not sorted, but this last dataframe has been a challenge for my Julia skills (I am still learning).
UPDATE 1
I managed to write this function:
function f(dff)
    @eachrow dff begin
        if :num >= 9
            :class = "Positive"
        elseif :num >= 7
            :class = "Neutral"
        elseif :num < 7
            :class = "Negative"
        end
    end
end
This function does half of what I want and fails if there is no :class column in the dataframe.
Now I want to count how many positives, neutrals and negatives there are, to do this operation:
(positive - negative) / (negative + neutral + positive)
The first part is:
julia> using DataFrames, Random
julia> Random.seed!(1234);
julia> df = DataFrame(:num=>rand(0:10,20));
julia> df1 = combine(groupby(df, :num, sort=true), nrow)
10×2 DataFrame
Row │ num nrow
│ Int64 Int64
─────┼──────────────
1 │ 0 1
2 │ 2 2
3 │ 3 2
4 │ 4 2
5 │ 5 1
6 │ 6 2
7 │ 7 2
8 │ 8 4
9 │ 9 1
10 │ 10 3
I was not sure what you wanted in the second step, but here are two ways to achieve the third step using either df1 or df:
julia> (sum(df1.nrow[df1.num .>= 9]) - sum(df1.nrow[df1.num .<= 6])) / sum(df1.nrow)
-0.3
julia> (count(>=(9), df.num) - count(<=(6), df.num)) / nrow(df)
-0.3
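For the second step, one reading of the question is a one-row data frame holding the three class counts. The sketch below follows that assumption, using a small fixed vector instead of random data so the counts can be checked by hand; the column names are taken from the question.

```julia
using DataFrames

df = DataFrame(num = [0, 3, 7, 8, 9, 10, 10, 5, 6, 2])

# One row with the number of occurrences in each class.
df2 = DataFrame(column_p = count(>=(9), df.num),             # 9 and 10
                column_n = count(x -> 7 <= x <= 8, df.num),  # 7 and 8
                column_d = count(<=(6), df.num))             # 0 to 6

# (positive - negative) / total
score = (df2.column_p[1] - df2.column_d[1]) / nrow(df)
```

Here there are 3 positives, 2 neutrals and 5 negatives, so the score is (3 - 5) / 10 = -0.2.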

Julia panel data find data

Suppose I have the following data.
dt = DataFrame(
id = [1,1,1,1,1,2,2,2,2,2,],
t = [1,2,3,4,5, 1,2,3,4,5],
val = randn(10)
)
Row │ id t val
│ Int64 Int64 Float64
─────┼─────────────────────────
1 │ 1 1 0.546673
2 │ 1 2 -0.817519
3 │ 1 3 0.201231
4 │ 1 4 0.856569
5 │ 1 5 1.8941
6 │ 2 1 0.240532
7 │ 2 2 -0.431824
8 │ 2 3 0.165137
9 │ 2 4 1.22958
10 │ 2 5 -0.424504
For each row, I want to create a dummy variable indicating whether val > 0.5 at any time from t to t+2 (within the same id).
For instance, I want to add val_gr_0.5 as a new variable.
Could someone help me with how to do this?
Row │ id t val val_gr_0.5
│ Int64 Int64 Float64 Float64
─────┼─────────────────────────
1 │ 1 1 0.546673 0 (search t:1 to 3)
2 │ 1 2 -0.817519 1 (search t:2 to 4)
3 │ 1 3 0.201231 1 (search t:3 to 5)
4 │ 1 4 0.856569 missing
5 │ 1 5 1.8941 missing
6 │ 2 1 0.240532 0 (search t:1 to 3)
7 │ 2 2 -0.431824 1 (search t:2 to 4)
8 │ 2 3 0.165137 1 (search t:3 to 5)
9 │ 2 4 1.22958 missing
10 │ 2 5 -0.424504 missing
julia> using DataFramesMeta
julia> function checkvals(subsetdf)
           vals = subsetdf[!, :val]
           length(vals) < 3 && return missing
           any(vals .> 0.5)
       end
checkvals (generic function with 1 method)
julia> for sdf in groupby(dt, :id)
           transform!(sdf, :t => ByRow(t -> checkvals(@subset(sdf, @byrow t <= :t <= t+2))) => :val_gr)
       end
julia> dt
10×4 DataFrame
Row │ id t val val_gr
│ Int64 Int64 Float64 Bool?
─────┼──────────────────────────────────
1 │ 1 1 0.0619327 false
2 │ 1 2 0.278406 false
3 │ 1 3 -0.595824 true
4 │ 1 4 0.0466594 missing
5 │ 1 5 1.08579 missing
6 │ 2 1 -1.57656 true
7 │ 2 2 0.17594 true
8 │ 2 3 0.865381 true
9 │ 2 4 0.972024 missing
10 │ 2 5 1.54641 missing
First define a function:
function run_max(x, window)
    window -= 1
    res = missings(eltype(x), length(x))
    for i in 1:length(x)-window
        res[i] = maximum(view(x, i:i+window))
    end
    res
end
Then use it with DataFrames.jl:
dt.new = dt.val .> 0.5
transform!(groupby(dt,1), :new => x->run_max(x, 3))
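To make the windowing logic concrete, here is a small sketch in plain Julia (no packages) applied to the values of id 1 from the question. `window_any` is a hypothetical helper that plays the same role as `run_max` on a Boolean vector.

```julia
# For each position i, report whether any flag in i:i+window-1 is true;
# positions without a full window stay `missing`.
function window_any(flags::AbstractVector{Bool}, window::Int)
    n = length(flags)
    res = Vector{Union{Missing,Bool}}(missing, n)
    for i in 1:(n - window + 1)
        res[i] = any(view(flags, i:i+window-1))
    end
    return res
end

val = [0.546673, -0.817519, 0.201231, 0.856569, 1.8941]
flags = window_any(val .> 0.5, 3)
# flags is [true, true, true, missing, missing]
```

The first window (t = 1 to 3) is true because val at t = 1 is already above 0.5.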

Return the maximum sum in `DataFrames.jl`?

Suppose my DataFrame has two columns v and g. First, I grouped the DataFrame by column g and calculated the sum of the column v. Second, I used the function maximum to retrieve the maximum sum. I am wondering whether it is possible to retrieve the value in one step? Thanks.
julia> using Random
julia> Random.seed!(1)
TaskLocalRNG()
julia> dt = DataFrame(v = rand(15), g = rand(1:3, 15))
15×2 DataFrame
Row │ v g
│ Float64 Int64
─────┼──────────────────
1 │ 0.0491718 3
2 │ 0.119079 2
3 │ 0.393271 2
4 │ 0.0240943 3
5 │ 0.691857 2
6 │ 0.767518 2
7 │ 0.087253 1
8 │ 0.855718 1
9 │ 0.802561 3
10 │ 0.661425 1
11 │ 0.347513 2
12 │ 0.778149 3
13 │ 0.196832 1
14 │ 0.438058 2
15 │ 0.0113425 1
julia> gdt = combine(groupby(dt, :g), :v => sum => :v)
3×2 DataFrame
Row │ g v
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
julia> maximum(gdt.v)
2.7572966050340257
I am not sure if that is what you mean but you can retrieve the values of g and v in one step using the following command:
julia> v, g = findmax(x-> (x.v, x.g), eachrow(gdt))[1]
(4.343050512360169, 3)
DataFramesMeta.jl has a @by macro:
julia> @by(dt, :g, :sv = sum(:v))
3×2 DataFrame
Row │ g sv
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
which gives you somewhat neater syntax for the first part of this.
With that, you can do either:
julia> @by(dt, :g, :sv = sum(:v)).sv |> maximum
2.7572966050340257
or (IMO more readably):
julia> @chain dt begin
           @by(:g, :sv = sum(:v))
           maximum(_.sv)
       end
2.7572966050340257
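If DataFramesMeta.jl is not available, a possible one-liner with DataFrames.jl alone is to iterate over the groups and take the maximum of the per-group sums. This is only a sketch on small made-up data, so that the result can be checked by hand.

```julia
using DataFrames

dt = DataFrame(v = [1.0, 2.0, 3.0, 4.0], g = [1, 1, 2, 2])

# Iterating a GroupedDataFrame yields one SubDataFrame per group.
max_sum = maximum(sum(sdf.v) for sdf in groupby(dt, :g))
```

Group 1 sums to 3.0 and group 2 to 7.0, so `max_sum` is 7.0. No intermediate data frame of sums is materialized.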

Splitting datasets into train and test in julia

I am trying to split the dataset into train and test subsets in Julia. So far, I have tried using MLDataUtils.jl package for this operation, however, the results are not up to the expectations.
Below are my findings and issues:
Code
# the inputs are
a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
)
b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
using MLDataUtils
(x1, y1), (x2, y2) = stratifiedobs((a,b), p=0.7)
#Output of this operation is: (which is not the expectation)
println("x1 is: $x1")
x1 is:
10×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │ 6 │
│ 7 │ 7 │ 7 │ 7 │
│ 8 │ 8 │ 8 │ 8 │
│ 9 │ 9 │ 9 │ 9 │
│ 10 │ 10 │ 10 │ 10 │
println("y1 is: $y1")
y1 is:
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
# but x2 is printed as
(0×3 SubDataFrame, Float64[])
# while y2 as
0-element view(::Array{Float64,1}, Int64[]) with eltype Float64)
However, I would like this dataset to be split in 2 parts with 70% data in train and 30% in test.
Please suggest a better approach to perform this operation in julia.
Thanks in advance.
Probably MLJ.jl developers can show you how to do it using the general ecosystem. Here is a solution using DataFrames.jl only:
julia> using DataFrames, Random
julia> a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
)
10×3 DataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 1
2 │ 2 2 2
3 │ 3 3 3
4 │ 4 4 4
5 │ 5 5 5
6 │ 6 6 6
7 │ 7 7 7
8 │ 8 8 8
9 │ 9 9 9
10 │ 10 10 10
julia> function splitdf(df, pct)
           @assert 0 <= pct <= 1
           ids = collect(axes(df, 1))
           shuffle!(ids)
           sel = ids .<= nrow(df) .* pct
           return view(df, sel, :), view(df, .!sel, :)
       end
splitdf (generic function with 1 method)
julia> splitdf(a, 0.7)
(7×3 SubDataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 3 3
2 │ 4 4 4
3 │ 6 6 6
4 │ 7 7 7
5 │ 8 8 8
6 │ 9 9 9
7 │ 10 10 10, 3×3 SubDataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 1
2 │ 2 2 2
3 │ 5 5 5)
I am using views to save memory, but alternatively you could just materialize train and test data frames if you prefer this.
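A sketch of the materialized variant mentioned above could look as follows. `splitdf_copy` and its `rng` keyword are hypothetical additions (not part of DataFrames.jl); the keyword only makes the split reproducible.

```julia
using DataFrames, Random

# Return (train, test) as independent copies instead of views.
function splitdf_copy(df, pct; rng = Random.default_rng())
    @assert 0 <= pct <= 1
    ids = shuffle(rng, 1:nrow(df))          # random row order
    ntrain = round(Int, pct * nrow(df))     # number of training rows
    return df[ids[1:ntrain], :], df[ids[ntrain+1:end], :]
end

a = DataFrame(A = 1:10, B = 1:10, C = 1:10)
train, test = splitdf_copy(a, 0.7; rng = MersenneTwister(42))
```

Indexing with `df[rows, :]` copies the selected rows, so later changes to `train` or `test` do not affect `a`.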
This is how I implemented it for generic arrays in the Beta Machine Learning Toolkit:
using Random
"""
    partition(data,parts;shuffle=true)
Partition (by rows) one or more matrices according to the shares in `parts`.
# Parameters
* `data`: A matrix/vector or a vector of matrices/vectors
* `parts`: A vector of the required shares (must sum to 1)
* `shuffle`: Whether to randomly shuffle the matrices (preserving the relative order between matrices)
"""
function partition(data::AbstractArray{T,1},parts::AbstractArray{Float64,1};shuffle=true) where T <: AbstractArray
    n = size(data[1],1)
    if !all(size.(data,1) .== n)
        @error "All matrices passed to `partition` must have the same number of rows"
    end
    # Draw the row order once so that all arrays are shuffled consistently
    ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
    return partition.(data,Ref(parts);shuffle=shuffle, fixedRIdx = ridx)
end
function partition(data::AbstractArray{T,N} where N, parts::AbstractArray{Float64,1};shuffle=true,fixedRIdx=Int64[]) where T
    n = size(data,1)
    nParts = length(parts)
    toReturn = []
    if !(sum(parts) ≈ 1)
        @error "The sum of `parts` in `partition` should total to 1."
    end
    ridx = fixedRIdx
    if (isempty(ridx))
        ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
    end
    current = 1
    cumPart = 0.0
    for (i,p) in enumerate(parts)
        cumPart += parts[i]
        # The last part always ends at the final row
        final = i == nParts ? n : Int64(round(cumPart*n))
        push!(toReturn,data[ridx[current:final],:])
        current = (final +=1)
    end
    return toReturn
end
Use it with:
julia> x = [1:10 11:20]
julia> y = collect(31:40)
julia> ((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3])
Note that you can also partition into three or more parts, and the number of arrays to partition is also variable.
By default they are also shuffled, but you can avoid it with the parameter shuffle...
using Pkg
Pkg.add("Lathe")
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(df)
There is also a positional argument, at, in the second position that takes the percentage to split at.
