Combining grep() family of functions with a conditional if statement - r

I'm using LASSO as a variable selection method for my analysis, but there's one particular variable that I wish to ensure is contained in the final formula. I have automated the entire process to return the variables that LASSO selects and spits them into a character string formula e.g. formula = y~x1+x2+x3+... However there is one variable in particular I would like to keep in the formula even if LASSO does not select it. Now I could easily manually add this variable to the formula after the fact, but in the interest of improving my R skills I'm trying to automate the entire process.
My idea so far was to nest the grep() function inside an ifelse() call, e.g. ifelse(grep("variable I'm concerned with",formula)!=1, formula=formula,formula=paste0(formula,'variable I'm concerned with',collapse="+")), but this has not done the trick.
Am I on the right track or can anyone think of alternative routes to take?

According to the glmnet documentation:
penalty.factor - Separate penalty factors can be applied to each
coefficient. This is a number that multiplies lambda to allow
differential shrinkage. Can be 0 for some variables, which implies no
shrinkage, and that variable is always included in the model. Default
is 1 for all variables (and implicitly infinity for variables listed
in exclude). Note: the penalty factors are internally rescaled to sum
to nvars, and the lambda sequence will reflect this change.
So apply this as an argument to glmnet using a penalty factor of 0 for your "key coefficient" and 1 elsewhere.
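A minimal sketch of that suggestion, assuming the glmnet package is installed. The simulated data, the column names x1 to x5, and the choice of x3 as the "key" variable are all made up for illustration.

```r
library(glmnet)

set.seed(1)
n <- 100
p <- 5
# Simulated predictors with hypothetical names x1..x5
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * x[, "x1"] + rnorm(n)

# Penalty factor 0 for the variable to keep ("x3" here), 1 for everything else
pf <- ifelse(colnames(x) == "x3", 0, 1)

# alpha = 1 is the LASSO penalty
fit <- glmnet(x, y, alpha = 1, penalty.factor = pf)

# Coefficients at the largest (most restrictive) lambda on the path
b <- coef(fit, s = max(fit$lambda))
print(b)
```

Because x3 is unpenalized, its coefficient is estimated freely at every lambda, so it should stay in the model whatever LASSO does to the other columns.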
A formula is not a character object, but you might want to explore terms.formula if your goal is to edit formulas directly based on character output. terms objects are a really powerful way of doing variable subsetting and selection. They take some exploring, though, because the formula language was not meant to be automated easily; it was meant to be a convenient and readable way to specify model fits (look at how difficult SAS is by comparison).
f <- y ~ x1 +x2
t <- terms(f)
## drop 'x2'
i.x2 <- match('x2', attr(t, 'term.labels'))
t <- t[-i.x2] ## drop the variable ("terms" objects are subset with a single index)
## t is still a "terms" object but `lm` and related functions have implicit methods for interpreting as a "formula" object.
lm(t)
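Going the other way, from character output to a formula, can be sketched with base R's reformulate(). The vectors selected and required below are hypothetical stand-ins for whatever the LASSO step returns and for the variable that must always be present.

```r
# Hypothetical LASSO output and the variable that must always be kept
selected <- c("x1", "x2")
required <- "x3"

# union() appends the required variable only if it is not already selected
rhs <- union(selected, required)

# Build a real formula object from the character vector of term labels
f <- reformulate(rhs, response = "y")
print(f)
```

Since union() drops duplicates, the same code works whether or not LASSO already picked the required variable.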

Currently, you are mixing the character value of a formula with a formula object, which will not work given the different types. Instead, consider stats::update, which will not duplicate a term that is already in the formula:
lasso_formula <- as.formula("y ~ x1 + x2 + x3")
# EXISTING TERM
lasso_formula <- update(lasso_formula, ~ . + x3)
lasso_formula
# y ~ x1 + x2 + x3
# NEEDED VARIABLE
lasso_formula <- update(lasso_formula, ~ . + myTerm)
lasso_formula
# y ~ x1 + x2 + x3 + myTerm
Should formula be a character string, be sure to use grepl (not grep) in ifelse. And do not assign with = inside ifelse, as ifelse is a function that returns a value and is not to be confused with if...else:
lasso_formula <- "y ~ x1 + x2 + x3"
lasso_formula <- ifelse(grepl("myterm", lasso_formula),
lasso_formula,
paste(lasso_formula, "+ myterm"))
lasso_formula
# [1] "y ~ x1 + x2 + x3 + myterm"
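One caveat with the grepl approach is that it matches substrings, so searching for "x1" also hits "x10". A possible way around this, sketched here with a hypothetical helper add_term(), is to compare against the exact variable names returned by all.vars():

```r
# Hypothetical helper: append `term` to a formula string unless that exact
# variable name already appears in the formula
add_term <- function(formula_string, term) {
  vars <- all.vars(as.formula(formula_string))
  if (term %in% vars) formula_string else paste(formula_string, "+", term)
}

# A substring match would wrongly report x1 as present because of x10
grepl("x1", "y ~ x10 + x2")

add_term("y ~ x10 + x2", "x1")
add_term("y ~ x1 + x2", "x1")
```

This uses a plain if/else rather than ifelse(), since only a single formula string is tested at a time.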

Related

Julia: Iterating over columns in a dataframe and calculate LinearRegression

I have a data frame with 10 columns(features) COL0 to COL9 and a column RESP. How do I calculate a LinearRegression Model for each pair COL0 to COL9 ~ RESP?
I am expecting to get 10 graphs showing the Model and also a table with the coefficients of my model for each column.
What I tried so far:
model2 = fit(LinearModel, @formula(RESP ~ EXPL_0 + EXPL_1 + EXPL_2 +
EXPL_3 + EXPL_4 + EXPL_5 + EXPL_6 + EXPL_7 + EXPL_8 + EXPL_9), df)
And I get what I want.
I still need to know how to plot all these graphs,
and if I had COL0 to COL1000, how can I avoid typing all the columns from 0 to 1000?
I am new to Julia and I really don't have a clue how to get this done. Any help?
Thanks
As Bogumil says, it's not ideal to ask many questions in one post on StackOverflow - your question should be well defined and targeted, ideally with a minimum working example to make it easiest for people to help you.
Let's therefore answer what I take from the title of the post to be your main question: how can I fit a linear regression model with GLM that includes many predictor columns? That question is almost a duplicate of this question, so a very similar answer applies: broadcast the term function over the names of the DataFrame columns you want to include on the right-hand side, like this:
julia> using DataFrames, GLM
julia> df = hcat(DataFrame(RESP = rand(100)), DataFrame(rand(100, 10), :auto));
julia> mymodel = lm(term(:RESP) ~ sum(term.(names(df[!, Not(:RESP)]))), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
RESP ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
(...)
You also ask about plotting, but that's probably best dealt with in a separate question. You can access the estimated coefficients of your linear model using the coef function, e.g.
julia> coef(mymodel)
11-element Vector{Float64}:
0.504236533528822
0.11712812154185266
-0.0206810430546413
-0.15693089456050294
-0.011916514466331067
-0.1030171434361648
0.10378957999147352
-0.09447743618381275
-0.08860977078650123
0.0816071818033377
0.09939548661830626
and the full output with coeftable.
Finally note that you won't necessarily see from that which columns "impact most" your model as you say in the comment, unless you have standardized your regressors.

Best way to tell if a formula contains a random effect?

I have a list of formulas that I would like to fit in a loop using a function. Some of these formulas are random effects models and others are straightforward linear models. I want the function to detect whether the model contains a random effect and if so, use lmer() to fit the model. Otherwise, it should use lm(). Any suggestions on how to check this condition (other than converting the formula to a string and checking for parentheses)? At this stage, they have the same class so I can't just check that. I could also use error handling to catch when lmer() returns an error from a model without a random effect and reroute towards regular lm(), but this also seems unnecessarily messy.
Example below:
fit_models <- function(formula_list) {
models <- list()
for(ii in seq_along(formula_list)) {
if(formula_list[[ii]] is lmer) { # Enter condition here
print("lmer")
} else {
print("lm")
}
}
}
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
formulas <- c(f1, f2)
fit_models(formulas)
I would say
length(lme4::findbars(f))>0
should reliably detect formulas containing a random-effects component (in the lme4 sense).
From the right hand side of a formula for a mixed-effects model,
determine the pairs of expressions that are separated by the
vertical bar operator.
This is (implicitly) the test that's done in the lme4 code, here ...
The symbols in formulas don't have inherent meanings. A function can reinterpret the symbols to mean whatever it likes. So just because there is a "|", that doesn't necessarily mean the formula has a random effect. That's just how lmer chose to interpret that symbol.
Given that formulas are basically just ordered collections of unevaluated symbols, there's not much more you can do than a basic equality check for a symbol, operating on just the formula itself. Rather than a straight-up character conversion, you could use all.names. So something like
f2 <- formula(y ~ 1 + x + (1 + x | z))
all.names(f2)
# [1] "~" "y" "+" "+" "x" "(" "|" "+" "x" "z"
"|" %in% all.names(f2)
# [1] TRUE
This won't be fooled if you have something like formula(`a|b` ~ x) where a|b is a (terrible) column name.
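Plugged into the loop from the question, the all.names check might look like the sketch below. The helper names has_bar() and choose_fitter() are made up here, and the function only reports which fitter would be used rather than calling lm() or lmer(). It also checks for "||", on the assumption that double-bar terms should be routed to lmer as well.

```r
# Hypothetical helper: TRUE if the formula contains a `|` or `||` operator
has_bar <- function(f) any(c("|", "||") %in% all.names(f))

# Report which fitting function each formula would be sent to
choose_fitter <- function(formula_list) {
  vapply(formula_list,
         function(f) if (has_bar(f)) "lmer" else "lm",
         character(1))
}

f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
f3 <- formula(`a|b` ~ x)   # odd column name, but no bar operator
choose_fitter(c(f1, f2, f3))
```

In a real version, the two branches would call lm() and lme4::lmer() with the data instead of returning a label.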
You can just convert the formula to a character and look for the pipe operator |:
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
formulas <- c(f1, f2)
sapply(formulas, function(x) any(grepl("\\|", as.character(x))))
#> [1] FALSE TRUE

Trying to fit a response surface with R's formula notation

I have a set of a couple of dozen numeric variables and am trying to figure out how to compactly express a quadratic form in those variables. I also want to include the variables themselves. The idea here is that we are fitting a response surface, rather than interacting a group of treatments, as the standard R formula notation seems to assume. I am trying to get appropriate expressions turned into an R formula, suitable for estimation by different techniques, with different data sets, or over different periods.
If there is an explicit statement of how R's formula notation works, anywhere, I have not been able to find it. There is an ancient paper from which R supposedly copied the notation, but it is by no means identical to current R usage. Every other description I have found just gives examples, that do not cover every case -- not even close to every case.
So, just as an example, here I try to construct a quadratic form in three variables, without writing out all the pairs by hand with an I() around each pair.
library(tidyverse)
A <- B <- C <- 1:10
LHS <- 1:10 * 600
tb <- tibble(LHS, A, B, C)
my_eq <- as.formula(LHS ~ I(A + B + C)*I(A + B + C))
I have not found any way to tell if I have succeeded.
Neither
my_eq nor
terms(my_eq)
seems at all enlightening.
For example, can one predict whether
identical(as.formula(LHS ~ I(A + B + C)*I(A + B + C)), as.formula(LHS ~ I((A + B + C)*(A + B + C))))
is true or false? I can not even guess. Or to take an even simpler case, is ~ A * I(A) equal to A, I(A^2), or something else? And how would you know?
To restate my question, I would like either a full statement of how R's formula notation works, adequate to cover every case and predict what each would mean, or, failing that, a straightforward way of producing an expansion of any existing formula into all the atomic terms for which coefficients will be estimated.
This may not answer your question, but I'll post this anyway since I think it may help a little.
The I function inhibits the interpretation of operators such as "+", so your formula is probably not going to do what you expect it to do. For example, the results of lm(my_eq) will be the same as the results of doing the following:
D <- A + B + C
lm(LHS ~ D * D)
And then you may as well just do lm(LHS ~ D).
For your question, I believe John Maindonald wrote a good book that explains R formulas for many situations. But it's in my office and today is a Sunday.
Edit: For the expansion, I believe you have to fit the model and then look at the call or the terms:
> my_eq <- as.formula(LHS ~ (A + B + C) * (A + B + C))
> my_formula <- lm(my_eq)
> attr(terms(my_formula), "term.labels")
[1] "A" "B" "C" "A:B" "A:C" "B:C"
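As far as I know, terms() can also be called on the formula itself, so the expansion can be inspected without fitting anything. The sketch below builds a full quadratic form from a character vector of names: (A + B + C)^2 supplies the main effects and pairwise interactions, and the I(x^2) terms supply the pure squares, which ^2 alone does not produce. The vector vars is a placeholder for the real column names.

```r
# Hypothetical variable names; any character vector of column names would do
vars <- c("A", "B", "C")

# Main effects and pairwise interactions, plus explicit squared terms
rhs <- paste0(
  "(", paste(vars, collapse = " + "), ")^2 + ",
  paste0("I(", vars, "^2)", collapse = " + ")
)
f <- as.formula(paste("LHS ~", rhs))

# Expanded term labels, obtained without fitting a model
labs <- attr(terms(f), "term.labels")
print(labs)
```

For numeric variables each label corresponds to one estimated coefficient, in addition to the intercept.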

LMEM: Chi-square = 0 , prob = 1 - what's wrong with my code?

I'm running a LMEM (linear mixed effects model) on some data, and compare the models (in pairs) with the anova function. However, on a particular subset of data, I'm getting nonsense results.
This is my full model:
m3_full <- lmer(totfix ~ psource + cond + psource:cond +
(1 + cond | subj) + (1 + psource + cond | object), data, REML=FALSE)
And this is the model I'm comparing it to: (basically dropping out one of the main effects)
m3_psource <- lmer (totfix ~ psource + cond + psource:cond -
psource + (1 + cond | subj) + (1 + psource + cond | object),
data, REML=FALSE)
Running the anova() function (anova(m3_full, m3_psource)) returns Chisq = 0, Pr(>Chisq) = 1.
I'm doing the same for a few other LMEMs and everything seems fine, it's just this particular response value that gives me the weird chi-square and probability values. Anyone has an idea why and how I can fix it? Any help will be much appreciated!
This is not really a mixed-model-specific question: rather, it has to do with the way that R constructs model matrices from formulas (and, possibly, with the logic of your model comparison).
Let's narrow it down to the comparison between
form1 <- ~ psource + cond + psource:cond
and
form2 <- ~ psource + cond + psource:cond - psource
(which is equivalent to ~cond + psource:cond). These two formulas give equivalent model matrices, i.e. model matrices with the same number of columns, spanning the same design space, and giving the same overall goodness of fit.
Making up a minimal data set to explore:
dd <- expand.grid(psource=c("A","B"),cond=c("a","b"))
What constructed variables do we get with each formula?
colnames(model.matrix(form1,data=dd))
## [1] "(Intercept)" "psourceB" "condb" "psourceB:condb"
colnames(model.matrix(form2,data=dd))
## [1] "(Intercept)" "condb" "psourceB:conda" "psourceB:condb"
We get the same number of contrasts.
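One way to check that the two model matrices really span the same space, and do not just happen to have the same number of columns, is to compare ranks: if binding the matrices together does not raise the rank, neither adds anything to the other. This is a sketch on the same made-up data set.

```r
# Same minimal data set as above
dd <- expand.grid(psource = c("A", "B"), cond = c("a", "b"))

X1 <- model.matrix(~ psource + cond + psource:cond, data = dd)
X2 <- model.matrix(~ psource + cond + psource:cond - psource, data = dd)

# Rank of each matrix, and of the two bound side by side
r1 <- qr(X1)$rank
r2 <- qr(X2)$rank
r12 <- qr(cbind(X1, X2))$rank

# TRUE when both matrices span the same column space
r1 == r2 && r12 == r1
```

With identical column spaces the fitted values and log-likelihoods are the same, which is why the likelihood ratio test comes out as a chi-square of 0.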
There are two possible responses to this problem.
There is one school of thought (typified by Nelder, Venables, etc.: e.g. see Venables' famous (?) but unpublished exegeses on linear models, section 5, or Wikipedia on the principle of marginality) that says that it doesn't make sense to try to test main effects in the presence of interaction terms, which is what you're trying to do.
There are occasional situations (e.g in a before-after-control-impact design where the 'before' difference between control and impact is known to be zero due to experimental protocol) where you really do want to do this comparison. In this case, you have to make up your own dummy variables and add them to your data, e.g.
## set up model matrix and drop intercept and "psourceB" column
dummies <- model.matrix(form1,data=dd)[,-(1:2)]
## d='dummy': avoid colons in column names
colnames(dummies) <- c("d_cond","d_source_by_cond")
colnames(model.matrix(~d_cond+d_source_by_cond,data.frame(dd,dummies)))
## [1] "(Intercept)" "d_cond" "d_source_by_cond"
This is a nuisance. My guess at the reason for this being difficult is that the original authors of R and S before it were from school of thought #1, and figured that generally when people were trying to do this it was a mistake; they didn't make it impossible, but they didn't go out of their way to make it easy.

glm switches coefficient names for interactions

I use the R code:
dat<-data.frame(p1=c(0,1,1,0,0), GAMMA.1=c(1,2,3,4,3), VAR1=c(2,2,1,3,4), GAMMA.2=c(1,1,3,4,1))
form <- p1 ~ GAMMA.1:VAR1 + GAMMA.2:VAR1
mod <- glm(formula=form, data=dat, family=binomial)
(coef <- coefficients(mod))
# (Intercept) GAMMA.1:VAR1 VAR1:GAMMA.2
# 1.7974974 -0.2563667 -0.2181079
As we can see, the coefficient name for the interaction GAMMA.2:VAR1 is not in the same order as in form (we get VAR1:GAMMA.2 instead). For several reasons, I need the output
# (Intercept) GAMMA.1:VAR1 GAMMA.2:VAR1
# 1.7974974 -0.2563667 -0.2181079
without changing the names of the coefficients afterwards. Specifically, I want the same names for the coefficients as I used in the form object (without switching as in the code above). Can I tell glm() not to switch the names of the interactions?
The answer is no, not without a lot of rewriting functions. The order of the label of interaction terms is determined by the terms.formula function, which itself is determined by the termsform function buried deep in the C code. There are no parameters that you can pass termsform that give you the behaviour that you want (although keep.order looked promising, it does not do what you want).
You would have to rewrite the terms.formula function to "swap back" the names after output from termsform, and then override the terms.formula function with your patched version, but are you sure you want that? It will be far easier to change the names of the coefficients afterwards.
You could also use terms.formula preemptively to determine how your formula will be reordered, and then create a mapping vector.
dat<-data.frame(p1=c(0,1,1,0,0), GAMMA.1=c(1,2,3,4,3), VAR1=c(2,2,1,3,4), GAMMA.2=c(1,1,3,4,1))
form <- p1 ~ GAMMA.1:VAR1 + GAMMA.2:VAR1
new.names<-labels(terms(form,data=dat,keep.order=TRUE))
names(new.names)<-as.character(form[[3]][-1])
new.names
# GAMMA.1:VAR1 GAMMA.2:VAR1
# "GAMMA.1:VAR1" "VAR1:GAMMA.2"
You could use that vector to map names if you had the need later on.
I have two possible workarounds for you.
One workaround comes from the observation that terms within interaction labels are ordered according to their order of appearance in the formula. In your example, the order is GAMMA.1, VAR1, GAMMA.2. You may be able to rewrite your formula with a different ordering so that the formula and coefficient names match:
form <- p1 ~ VAR1:GAMMA.1 + VAR1:GAMMA.2
mod <- glm(formula=form, data=dat, family=binomial)
coefficients(mod)
# (Intercept) VAR1:GAMMA.1 VAR1:GAMMA.2
# 1.7974974 -0.2563667 -0.2181079
Another workaround is to rename the coefficients according to the formula when you pull them out:
rhs_terms <- c("(Intercept)",as.character(form[[3]][2:length(form[[3]])]))
(coef <- setNames(coefficients(mod), rhs_terms))
# (Intercept) GAMMA.1:VAR1 GAMMA.2:VAR1
# 1.7974974 -0.2563667 -0.2181079
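The renaming workaround can be made less dependent on the number of terms by matching each coefficient name to the wanted label that has the same components in any order. This is only a sketch: the helper canon() and the vector wanted are made up here, and it assumes all variables are numeric, so that coefficient names carry no factor-level suffixes.

```r
dat <- data.frame(p1 = c(0, 1, 1, 0, 0), GAMMA.1 = c(1, 2, 3, 4, 3),
                  VAR1 = c(2, 2, 1, 3, 4), GAMMA.2 = c(1, 1, 3, 4, 1))

# Term labels exactly as they should appear in the output
wanted <- c("GAMMA.1:VAR1", "GAMMA.2:VAR1")
form <- reformulate(wanted, response = "p1")
mod <- glm(form, data = dat, family = binomial)

# Hypothetical helper: sort the components of each label so that
# "VAR1:GAMMA.2" and "GAMMA.2:VAR1" compare as equal
canon <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(parts) paste(sort(parts), collapse = ":"),
         character(1))
}

cf <- coefficients(mod)
idx <- match(canon(names(cf)), canon(wanted))

# Rename only the coefficients that correspond to a wanted label
names(cf)[!is.na(idx)] <- wanted[idx[!is.na(idx)]]
names(cf)
```

The intercept has no counterpart in wanted, so it keeps its original name.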
