I have a dataset of 5000 variables: one target and 4999 covariates. I want to estimate one GLM per target-covariate combination (4999 models).
How can I do that without manually typing 4999 formulas for GLM?
In R I would simply define a list of 4999 strings ("target ~ x1"), convert each string to a formula and use map to estimate multiple GLMs. Is there something similar that can be done in Julia? Or is there an elegant alternative?
Thanks in advance.
You can programmatically create the formula via Term objects. The docs for that can be found here, but consider the following simple example which should meet your needs:
Start with dummy data
julia> using DataFrames, GLM
julia> df = hcat(DataFrame(y = rand(10)), DataFrame(rand(10, 5)))
10×6 DataFrame
│ Row │ y │ x1 │ x2 │ x3 │ x4 │ x5 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼───────────┼───────────┼──────────┼───────────┼────────────┼──────────┤
│ 1 │ 0.0200963 │ 0.924856 │ 0.947904 │ 0.429068 │ 0.00833488 │ 0.547378 │
│ 2 │ 0.169498 │ 0.0915296 │ 0.375369 │ 0.0341015 │ 0.390461 │ 0.835634 │
│ 3 │ 0.900145 │ 0.502495 │ 0.38106 │ 0.47253 │ 0.637731 │ 0.814095 │
│ 4 │ 0.255163 │ 0.865253 │ 0.791909 │ 0.0833828 │ 0.741899 │ 0.961041 │
│ 5 │ 0.651996 │ 0.29538 │ 0.161443 │ 0.23427 │ 0.23132 │ 0.947486 │
│ 6 │ 0.305908 │ 0.170662 │ 0.569827 │ 0.178898 │ 0.314841 │ 0.237354 │
│ 7 │ 0.308431 │ 0.835606 │ 0.114943 │ 0.19743 │ 0.344216 │ 0.97108 │
│ 8 │ 0.344968 │ 0.452961 │ 0.595219 │ 0.313425 │ 0.102282 │ 0.456764 │
│ 9 │ 0.126244 │ 0.593456 │ 0.818383 │ 0.485622 │ 0.151394 │ 0.043125 │
│ 10 │ 0.60174 │ 0.8977 │ 0.643095 │ 0.0865611 │ 0.482014 │ 0.858999 │
Now when you run a linear model with GLM, you'd do something like lm(@formula(y ~ x1), df), which indeed can't easily be used in a loop to construct different formulas. We'll therefore follow the docs and create the output of the @formula macro directly - remember macros in Julia just transform syntax to other syntax, so they don't do anything we can't write ourselves!
julia> lm(Term(:y) ~ Term(:x1), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
y ~ 1 + x1
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.428436 0.193671 2.21 0.0579 -0.0181696 0.875041
x1 -0.106603 0.304597 -0.35 0.7354 -0.809005 0.595799
──────────────────────────────────────────────────────────────────────────
You can verify for yourself that the above is equivalent to lm(@formula(y ~ x1), df).
Now it's hopefully an easy step to building the loop that you're looking for (restricted to two covariates below to limit the output):
julia> for x ∈ names(df[:, Not(:y)])[1:2]
           @show lm(term(:y) ~ term(x), df)
       end
lm(term(:y) ~ term(x), df) = StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
y ~ 1 + x1
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.428436 0.193671 2.21 0.0579 -0.0181696 0.875041
x1 -0.106603 0.304597 -0.35 0.7354 -0.809005 0.595799
──────────────────────────────────────────────────────────────────────────
lm(term(:y) ~ term(x), df) = StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
y ~ 1 + x2
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) 0.639633 0.176542 3.62 0.0068 0.232527 1.04674
x2 -0.502327 0.293693 -1.71 0.1256 -1.17958 0.17493
─────────────────────────────────────────────────────────────────────────
As Dave points out below, it's helpful to use the term() function here to create our terms rather than the Term() constructor directly - this is because names(df) returns a vector of Strings, while the Term() constructor expects Symbols. term() has a method for Strings that handles the conversion automatically.
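Putting the pieces together, a full loop over all covariates might look like the following sketch (the data here is made up; with your real data, replace the dummy DataFrame with your own):

```julia
using DataFrames, GLM, StatsModels

# dummy data: one target column y and five covariates x1..x5
df = hcat(DataFrame(y = rand(100)), DataFrame(rand(100, 5), :auto))

# one single-covariate model per column, keyed by covariate name
models = Dict(x => lm(term(:y) ~ term(x), df) for x in names(df) if x != "y")

coef(models["x3"])  # coefficients of the y ~ x3 model
```

With 4999 covariates this is the same code; only the number of columns changes.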
You can also use the low-level API and pass the dependent variable as a vector and the independent variable as a matrix directly without even building formulas. You will lose coefficient names, but since you have only one independent variable in each model it's probably OK.
This is documented in ?fit. The call for each model will look like glm([ones(length(x1)) x1], target, dist). The column full of ones is for the intercept.
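As a sketch of what that looks like (assuming a binary target and a Bernoulli family; substitute the distribution that matches your model):

```julia
using GLM

x1 = randn(100)
target = rand(0:1, 100)

# design matrix: a column of ones for the intercept plus the covariate
X = [ones(length(x1)) x1]
model = glm(X, target, Bernoulli())

coef(model)  # [intercept, slope]
```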
I am trying to generate n dataframes using a loop, where each dataframe has one column with i rows populated with random numbers where i=1:n. So far, none of the following iterating methods work for iterating over dataframe names in order to generate them:
n = 5;
for i = 1:n
"df_$i" = DataFrame(rand($i), :auto)
end
or
n = 5;
for i = 1:n
"df_$(i)" = DataFrame(rand($i), :auto)
end
Thanks!
Is this what you want?
julia> [DataFrame("col$i" => rand(i)) for i in 1:3]
3-element Vector{DataFrame}:
1×1 DataFrame
Row │ col1
│ Float64
─────┼──────────
1 │ 0.368821
2×1 DataFrame
Row │ col2
│ Float64
─────┼──────────
1 │ 0.757023
2 │ 0.201711
3×1 DataFrame
Row │ col3
│ Float64
─────┼──────────
1 │ 0.702651
2 │ 0.256179
3 │ 0.560374
(I additionally showed you how to dynamically generate the name of the column in each data frame)
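If you specifically want to address the results by names like df_1, ..., df_n, the idiomatic substitute for dynamically created variable names is a Dict (a sketch under the same setup):

```julia
using DataFrames

n = 5
# one entry per data frame, keyed by the name you would have generated
dfs = Dict("df_$i" => DataFrame("col$i" => rand(i)) for i in 1:n)

nrow(dfs["df_3"])  # 3 rows, as requested for i = 3
```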
Is there a way to access current_row_index in the following snippet?
@with df begin
    fn.(:col, current_row_index)
end
In this context, since you are broadcasting, just pass the first axis of df:
julia> using DataFramesMeta
julia> fn(x, y) = (x, y)
fn (generic function with 1 method)
julia> df = DataFrame(col=["a", "b", "c"])
3×1 DataFrame
Row │ col
│ String
─────┼────────
1 │ a
2 │ b
3 │ c
julia> @with df begin
           fn.(:col, axes(df, 1))
       end
3-element Vector{Tuple{String, Int64}}:
("a", 1)
("b", 2)
("c", 3)
Here on stackoverflow.com - when I provide sample data to make a reproducible example, how can I do it the Julian way?
In R for example dput(df) will output a string with which you can create df again. Hence, you just post the result here on stackoverflow and bam! - reproducible example. So, how should one do it in Julia?
I think the easiest thing to do generally is to simply construct an MWE DataFrame with random numbers etc in your example, so there's no need to read/write out.
In situations where that's inconvenient, you might consider writing out to an IO buffer and taking the string representation of that, which people can then read back in the same way in reverse:
julia> using CSV, DataFrames
julia> df = DataFrame(a = rand(5), b = rand(1:10, 5));
julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)
julia> string_representation = String(take!(CSV.write(io, df)))
"a,b\n0.5613453808585873,9\n0.3308122459718885,6\n0.631520224612919,9\n0.3533712075535982,3\n0.35289980394398723,9\n"
julia> CSV.read(IOBuffer(string_representation), DataFrame)
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64 │ Int64 │
├─────┼──────────┼───────┤
│ 1 │ 0.561345 │ 9 │
│ 2 │ 0.330812 │ 6 │
│ 3 │ 0.63152 │ 9 │
│ 4 │ 0.353371 │ 3 │
│ 5 │ 0.3529 │ 9 │
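A slightly shorter variant of the same idea uses sprint, which manages the buffer for you (same CSV round trip, just less ceremony):

```julia
using CSV, DataFrames

df = DataFrame(a = rand(5), b = rand(1:10, 5))

s = sprint(CSV.write, df)            # CSV text you can paste into a question
df2 = CSV.read(IOBuffer(s), DataFrame)

df == df2  # should hold for simple column types like these
```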
Here is one way to mimic the behaviour of R's dput in Julia:
julia> using DataFrames
julia> using Random; Random.seed!(0);
julia> df = DataFrame(a = rand(3), b = rand(1:10, 3))
3×2 DataFrame
Row │ a b
│ Float64 Int64
─────┼──────────────────
1 │ 0.405699 1
2 │ 0.0685458 7
3 │ 0.862141 2
julia> julian_dput(x) = invoke(show, Tuple{typeof(stdout), Any}, stdout, x);
julia> julian_dput(df)
DataFrame(AbstractVector[[0.4056994708920292, 0.06854582438651502, 0.8621408571954849], [1, 7, 2]], DataFrames.Index(Dict(:a => 1, :b => 2), [:a, :b]))
That is, julian_dput() takes a DataFrame as input and prints an expression that can recreate it.
Source: https://discourse.julialang.org/t/given-an-object-return-julia-code-that-defines-the-object/80579/12
I would like to do some residual analysis for a GLM.
My model is in the form
using GLM
model = glm(@formula(y ~ x), data, Binomial(), LogitLink())
My text book suggests that residual analysis in the GLM be performed using deviance residuals. I was glad to see that Julia's GLM has a devresid() function, and that it suggests how to use it for plotting (sign(y - μ) * sqrt(devresid(D, y, μ))). However, I'm at a total loss as to what the input arguments are supposed to be. Looking at the doc-string:
?devresid
devresid(D, y, μ::Real)
Return the squared deviance residual of μ from y for distribution D
The deviance of a GLM can be evaluated as the sum of the squared deviance residuals. This is the principal use for these values. The actual deviance residual, say for plotting, is the signed square root of this value
sign(y - μ) * sqrt(devresid(D, y, μ))
Examples
julia> devresid(Normal(), 0, 0.25) ≈ abs2(0.25)
true
julia> devresid(Bernoulli(), 1, 0.75) ≈ -2*log(0.75)
true
julia> devresid(Bernoulli(), 0, 0.25) ≈ -2*log1p(-0.25)
true
D: I'm guessing that in my case it is Binomial()
y: I'm guessing this is the indicator variable for a single case, i.e. 1 or 0
μ: What is this?
How can I use this function to produce things like plots of the deviance residual on a normal probability scale and versus fitted values?
Here's the data I'm using in CSV form
x,y
400,0
220,1
490,0
210,1
500,0
270,0
200,1
470,0
480,0
310,1
240,1
490,0
420,0
330,1
280,1
210,1
300,1
470,1
230,0
430,0
460,0
220,1
250,1
200,1
390,0
I understand this is what you want:
julia> data = DataFrame(X=[1,1,1,2,2], Y=[1,1,0,0,1])
5×2 DataFrame
│ Row │ X │ Y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 1 │ 1 │
│ 3 │ 1 │ 0 │
│ 4 │ 2 │ 0 │
│ 5 │ 2 │ 1 │
julia> model = glm(@formula(Y ~ X), data, Binomial(), LogitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Y ~ 1 + X
Coefficients:
─────────────────────────────────────────────────────────────────────────────
Estimate Std. Error z value Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept) 1.38629 2.82752 0.490286 0.6239 -4.15554 6.92813
X -0.693146 1.87049 -0.37057 0.7110 -4.35923 2.97294
─────────────────────────────────────────────────────────────────────────────
julia> p = predict(model)
5-element Array{Float64,1}:
0.6666664218508201
0.6666664218508201
0.6666664218508201
0.5
0.5
julia> y = data.Y
5-element Array{Int64,1}:
1
1
0
0
1
julia> @. sign(y - p) * sqrt(devresid(Bernoulli(), y, p))
5-element Array{Float64,1}:
0.9005170462928523
0.9005170462928523
-1.4823033118905455
-1.1774100225154747
1.1774100225154747
(this is what you would get from calling residuals(model, type="deviance") in R)
Note that in the last line I use @. to vectorize the whole line. Alternatively you could have written it as:
julia> sign.(y .- p) .* sqrt.(devresid.(Bernoulli(), y, p))
5-element Array{Float64,1}:
0.9005170462928523
0.9005170462928523
-1.4823033118905455
-1.1774100225154747
1.1774100225154747
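To then get the diagnostic plots you asked about, you can feed those residuals into any plotting package; here is a sketch with Plots.jl (the package choice and styling are assumptions, not part of GLM):

```julia
using Plots, Distributions

# assuming `p` holds the fitted values and `dres` the deviance residuals
# computed as above

# deviance residuals vs. fitted values
scatter(p, dres; xlabel = "Fitted value", ylabel = "Deviance residual",
        legend = false)

# normal probability (QQ) plot: ordered residuals vs. theoretical quantiles
n = length(dres)
theo = quantile.(Normal(), ((1:n) .- 0.5) ./ n)
scatter(theo, sort(dres); xlabel = "Theoretical quantile",
        ylabel = "Ordered deviance residual", legend = false)
```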
@noinline f1(x::Int) = x + 1
@noinline f2(x::Int) = x + 2
Base.@pure function f(x::Int, p::Int)
if p == 1
return f1(x)
else
return f2(x)
end
end
I would like a call such as f(1, 2) to be compiled as f2(1) directly without branching due to 2 being a constant.
@code_warntype f(1, 2)
Body::Int64
│╻ ==5 1 ─ %1 = (p === 1)::Bool
│ └── goto #3 if not %1
│ 6 2 ─ %3 = invoke Main.f1(_2::Int64)::Int64
│ └── return %3
│ 8 3 ─ %5 = invoke Main.f2(_2::Int64)::Int64
│ └── return %5
@code_native f(1, 2)
.text
; Function f {
; Location: In[1]:5
; Function ==; {
; Location: In[1]:5
pushq %rax
cmpq $1, %rsi
;}
jne L21
; Location: In[1]:6
movabsq $julia_f1_35810, %rax
callq *%rax
popq %rcx
retq
; Location: In[1]:8
L21:
movabsq $julia_f2_35811, %rax
callq *%rax
popq %rcx
retq
nopw %cs:(%rax,%rax)
;}
However, judging by the generated code, constant propagation doesn't happen. Is it possible that constant propagation does happen in real life, but inspection tools such as @code_native or @code_warntype are unable to show it because they don't treat 2 as a constant?
Constant propagation will happen if you call f in a compiled part of code with a constant argument (e.g. called from a function).
So in your case you have:
julia> @noinline f1(x::Int) = x + 1
f1 (generic function with 1 method)
julia> @noinline f2(x::Int) = x + 2
f2 (generic function with 1 method)
julia> function f(x::Int, p::Int)
if p == 1
return f1(x)
else
return f2(x)
end
end
f (generic function with 1 method)
julia> @code_warntype f(1,2)
Body::Int64
2 1 ─ %1 = (p === 1)::Bool │╻ ==
└── goto #3 if not %1 │
3 2 ─ %3 = invoke Main.f1(_2::Int64)::Int64 │
└── return %3 │
5 3 ─ %5 = invoke Main.f2(_2::Int64)::Int64 │
└── return %5 │
julia> g() = f(1,2)
g (generic function with 1 method)
julia> @code_warntype g()
Body::Int64
1 1 ─ return 3
julia> h(x) = f(x,2)
h (generic function with 1 method)
julia> @code_warntype h(10)
Body::Int64
1 1 ─ %1 = invoke Main.f2(_2::Int64)::Int64 │╻ f
└── return %1
As a side note, AFAIK the @pure macro should not be used with functions that call generic functions, as is the case for f.
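If you need the branch resolved at compile time regardless of the call site, a more robust pattern than @pure is to lift the constant into the type domain with Val, so that dispatch selects the branch (a sketch using the same f):

```julia
@noinline f1(x::Int) = x + 1
@noinline f2(x::Int) = x + 2

# the branch value becomes part of the type, so each Val gets its own method
f(x::Int, ::Val{1}) = f1(x)
f(x::Int, ::Val{P}) where {P} = f2(x)

f(1, Val(2))  # dispatches straight to f2, no runtime branch
```

When called with a literal, Val(2) is itself constant-folded, so the usual caveat about dynamic dispatch on Val only applies when the value is not known at compile time.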
EDIT: I have found an interesting corner case here:
julia> f(x,p) = (p==1 ? sin : cos)(x)
f (generic function with 1 method)
julia> @code_warntype f(10, 2)
Body::Any
1 1 ─ %1 = (p === 1)::Bool │╻ ==
└── goto #3 if not %1 │
2 ─ %3 = Main.sin::Core.Compiler.Const(sin, false) │
└── goto #4 │
3 ─ %5 = Main.cos::Core.Compiler.Const(cos, false) │
4 ┄ %6 = φ (#2 => %3, #3 => %5)::Union{typeof(cos), typeof(sin)} │
│ %7 = (%6)(x)::Any │
└── return %7 │
julia> g() = f(10, 2)
g (generic function with 1 method)
julia> @code_warntype g()
Body::Float64
1 1 ─ %1 = invoke Base.Math.cos(10.0::Float64)::Float64 │╻╷ f
└── return %1 │
julia> h(x) = f(x, 2)
h (generic function with 1 method)
julia> @code_warntype h(10)
Body::Any
1 1 ─ %1 = invoke Main.f(_2::Int64, 2::Int64)::Any │
└── return %1
julia> z() = h(10)
z (generic function with 1 method)
julia> @code_warntype z()
Body::Float64
1 1 ─ %1 = invoke Base.Math.cos(10.0::Float64)::Float64 │╻╷╷ h
└── return %1
The thing that is interesting is that for g constant propagation happens as above, but not for h, but then if you wrap h in a function it happens again.
So in general the conclusion probably is that in standard cases in compiled code you can expect constant propagation to happen, but in complex cases the compiler may not be smart enough (of course this can improve in the future).