Best way to tell if a formula contains a random effect?

I have a list of formulas that I would like to fit in a loop using a function. Some of these formulas are random effects models and others are straightforward linear models. I want the function to detect whether the model contains a random effect and if so, use lmer() to fit the model. Otherwise, it should use lm(). Any suggestions on how to check this condition (other than converting the formula to a string and checking for parentheses)? At this stage, they have the same class so I can't just check that. I could also use error handling to catch when lmer() returns an error from a model without a random effect and reroute towards regular lm(), but this also seems unnecessarily messy.
Example below:
fit_models <- function(formula_list) {
  models <- list()
  for(ii in seq_along(formula_list)) {
    if(formula_list[[ii]] is lmer) { # Enter condition here
      print("lmer")
    } else {
      print("lm")
    }
  }
}
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
formulas <- c(f1, f2)
fit_models(formulas)

I would say
length(lme4::findbars(f))>0
should reliably detect formulas containing a random-effects component (in the lme4 sense).
From the right hand side of a formula for a mixed-effects model,
determine the pairs of expressions that are separated by the
vertical bar operator.
This is (implicitly) the test that's done in the lme4 code, here ...

The symbols in formulas don't have inherent meanings. A function can reinterpret the symbols to mean whatever it likes. So just because there is a "|", that doesn't necessarily mean the formula has a random effect. That's just how lmer chose to interpret that symbol.
Given that formulas are basically just ordered collections of unevaluated symbols, there's not much more you can do with the formula alone than check whether a given symbol appears in it. Rather than a straight-up character conversion, you could use all.names. So something like
f2 <- formula(y ~ 1 + x + (1 + x | z))
all.names(f2)
# [1] "~" "y" "+" "+" "x" "(" "|" "+" "x" "z"
"|" %in% all.names(f2)
# [1] TRUE
This won't be fooled if you have something like formula(`a|b` ~ x) where a|b is a (terrible) column name.
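As a rough sketch of how that check could drive the dispatch asked about in the question: the helper names has_bar and choose_fitter are made up for illustration, and the code only reports which fitter would be used rather than fitting anything. It also treats "||" as a random-effects marker, on the assumption that the formulas are meant for lme4, which uses that operator for uncorrelated random effects.

```r
# Hypothetical helper: TRUE when the parsed formula uses `|` or `||` as an operator.
has_bar <- function(f) any(c("|", "||") %in% all.names(f))

# Hypothetical helper: name the fitter each formula would be routed to.
choose_fitter <- function(formula_list) {
  vapply(formula_list,
         function(f) if (has_bar(f)) "lmer" else "lm",
         character(1))
}

f1 <- y ~ x
f2 <- y ~ 1 + x + (1 + x | z)
choose_fitter(list(f1, f2))
# [1] "lm"   "lmer"

# A backticked column name containing a bar is a symbol, not an operator.
has_bar(`a|b` ~ x)
# [1] FALSE
```

Inside the loop you would replace the returned label with the actual call to lm() or lmer().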

You can just convert the formula to a character and look for the pipe operator |:
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
formulas <- c(f1, f2)
sapply(formulas, function(x) any(grepl("\\|", as.character(x))))
#> [1] FALSE TRUE

Related

Combining grep() family of functions with a conditional if statement

I'm using LASSO as a variable selection method for my analysis, but there's one particular variable that I wish to ensure is contained in the final formula. I have automated the entire process to return the variables that LASSO selects and spits them into a character string formula e.g. formula = y~x1+x2+x3+... However there is one variable in particular I would like to keep in the formula even if LASSO does not select it. Now I could easily manually add this variable to the formula after the fact, but in the interest of improving my R skills I'm trying to automate the entire process.
My idea so far was to nest the grep() function inside an ifelse() statement, e.g. ifelse(grep("variable I'm concerned with",formula)!=1, formula=formula,formula=paste0(formula,'variable I'm concerned with',collapse="+")), but this has not done the trick.
Am I on the right track or can anyone think of alternative routes to take?
According to documentation
penalty.factor - Separate penalty factors can be applied to each
coefficient. This is a number that multiplies lambda to allow
differential shrinkage. Can be 0 for some variables, which implies no
shrinkage, and that variable is always included in the model. Default
is 1 for all variables (and implicitly infinity for variables listed
in exclude). Note: the penalty factors are internally rescaled to sum
to nvars, and the lambda sequence will reflect this change.
So apply this as an argument to glmnet using a penalty factor of 0 for your "key coefficient" and 1 elsewhere.
A formula is not a character object, but you might want to explore terms.formula if your goal is to edit formulas directly based on character output. terms objects are a really powerful way of doing variable subsetting and selection. But you really need to explore them, because the formula language was not meant to be automated easily; rather, it was meant to be a convenient and readable way to specify model fits (look at how difficult SAS is by comparison).
f <- y ~ x1 + x2
t <- terms(f)
## locate 'x2' among the term labels
i.x2 <- match('x2', attr(t, 'term.labels'))
## drop the variable; the `[` method for terms takes a single index over the term labels
t <- t[-i.x2]
## t is still a "terms" object, but `lm` and related functions can interpret it as a "formula" object.
lm(t)
Currently, you are treating the character value of the formula as if it were a formula object, which will not work because the types differ. Instead, consider stats::update, which will not duplicate a term that is already in the formula:
lasso_formula <- as.formula("y ~ x1 + x2 + x3")
# EXISTING TERM
lasso_formula <- update(lasso_formula, ~ . + x3)
lasso_formula
# y ~ x1 + x2 + x3
# NEEDED VARIABLE
lasso_formula <- update(lasso_formula, ~ . + myTerm)
lasso_formula
# y ~ x1 + x2 + x3 + myTerm
If the formula is a character string, be sure to use grepl (not grep) in ifelse. And do not assign with = inside ifelse: it is a function that returns a value, not to be confused with if...else:
lasso_formula <- "y ~ x1 + x2 + x3"
lasso_formula <- ifelse(grepl("myterm", lasso_formula),
lasso_formula,
paste(lasso_formula, "+ myterm"))
lasso_formula
# [1] "y ~ x1 + x2 + x3 + myterm"
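One caveat with the grepl approach is that it matches substrings, so a term such as "myterm2" would count as "myterm". A minimal sketch that compares whole term labels instead, using only terms() and update() from base R, might look like the following. The function name ensure_term is hypothetical, and it assumes the formula is already a formula object rather than a string.

```r
# Hypothetical helper: add `term` to formula `f` only if it is not already a term.
ensure_term <- function(f, term) {
  labels <- attr(terms(f), "term.labels")
  if (term %in% labels) {
    f
  } else {
    # Build "~ . + term" and let update() append it to the right-hand side.
    update(f, as.formula(paste("~ . +", term)))
  }
}

deparse(ensure_term(y ~ x1 + x2, "myterm"))
# [1] "y ~ x1 + x2 + myterm"
deparse(ensure_term(y ~ x1 + myterm, "myterm"))
# [1] "y ~ x1 + myterm"
```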

I() function and its use cases [duplicate]

I haven't been able to find an answer to this question, largely because googling anything with a standalone letter (like "I") causes issues.
What does the "I" do in a model like this?
data(rock)
lm(area~I(peri - mean(peri)), data = rock)
Considering that the following does NOT work:
lm(area ~ (peri - mean(peri)), data = rock)
and that this does work:
rock$peri - mean(rock$peri)
Any key words on how to research this myself would also be very helpful.
I isolates or insulates the contents of I( ... ) from the gaze of R's formula parsing code. It allows the standard R operators to work as they would if you used them outside of a formula, rather than being treated as special formula operators.
For example:
y ~ x + x^2
would, to R, mean "give me:
x = the main effect of x, and
x^2 = the main effect and the second order interaction of x",
not the intended x plus x-squared:
> model.frame( y ~ x + x^2, data = data.frame(x = rnorm(5), y = rnorm(5)))
y x
1 -1.4355144 -1.85374045
2 0.3620872 -0.07794607
3 -1.7590868 0.96856634
4 -0.3245440 0.18492596
5 -0.6515630 -1.37994358
This is because ^ is a special operator in a formula, as described in ?formula. You end up only including x in the model frame because the main effect of x is already included from the x term in the formula, and there is nothing to cross x with to get the second-order interactions in the x^2 term.
To get the usual operator, you need to use I() to isolate the call from the formula code:
> model.frame( y ~ x + I(x^2), data = data.frame(x = rnorm(5), y = rnorm(5)))
y x I(x^2)
1 -0.02881534 1.0865514 1.180593....
2 0.23252515 -0.7625449 0.581474....
3 -0.30120868 -0.8286625 0.686681....
4 -0.67761458 0.8344739 0.696346....
5 0.65522764 -0.9676520 0.936350....
(that last column is correct, it just looks odd because it is of class AsIs.)
In your example, - when used in a formula would indicate removal of a term from the model, whereas you wanted - to have its usual binary operator meaning of subtraction:
> model.frame( y ~ x - mean(x), data = data.frame(x = rnorm(5), y = rnorm(5)))
Error in model.frame.default(y ~ x - mean(x), data = data.frame(x = rnorm(5), :
variable lengths differ (found for 'mean(x)')
This fails because mean(x) is a length 1 vector, and model.frame() quite rightly tells you this doesn't match the length of the other variables. A way round this is I():
> model.frame( y ~ I(x - mean(x)), data = data.frame(x = rnorm(5), y = rnorm(5)))
y I(x - mean(x))
1 1.1727063 1.142200....
2 -1.4798270 -0.66914....
3 -0.4303878 -0.28716....
4 -1.0516386 0.542774....
5 1.5225863 -0.72865....
Hence, where you want to use an operator that has special meaning in a formula, but you need its non-formula meaning, you need to wrap the elements of the operation in I( ).
Read ?formula for more on the special operators, and ?I for more details on the function itself and its other main use-case within data frames (which is where the AsIs bit originates from, if you are interested).
From the docs:
Function I has two main uses.
In function data.frame. Protecting an object by enclosing it in I() in a call to data.frame inhibits the conversion of character vectors to factors and the dropping of names, and ensures that matrices are inserted as single columns. I can also be used to protect objects which are to be added to a data frame, or converted to a data frame via as.data.frame.
To address this point:
df1 <- data.frame(stringi = I("dog"))
df2 <- data.frame(stringi = "dog")
str(df1)
str(df2)
In function formula. There it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so they are used as arithmetical operators. This is interpreted as a symbol by terms.formula.
To address this point:
lm(mpg ~ disp + drat, mtcars)
lm(mpg ~ I(disp + drat), mtcars)
The second line creates a new predictor that is the literal sum of disp and drat, whereas the first line fits disp and drat as two separate predictors.
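To see the effect on an actual fit rather than just the model frame, here is a small check using the built-in rock and mtcars data sets. It is only an illustration of the points above; the object names fit_raw and fit_centered are made up.

```r
# Centering inside I() changes the intercept but not the slope.
fit_raw      <- lm(area ~ peri, data = rock)
fit_centered <- lm(area ~ I(peri - mean(peri)), data = rock)

all.equal(unname(coef(fit_raw)[2]), unname(coef(fit_centered)[2]))
# [1] TRUE
# With a centered predictor the intercept is the mean of the response.
all.equal(unname(coef(fit_centered)[1]), mean(rock$area))
# [1] TRUE

# Without I(), disp^2 collapses to disp, so no quadratic term is fitted.
names(coef(lm(mpg ~ disp + disp^2, data = mtcars)))
# [1] "(Intercept)" "disp"
names(coef(lm(mpg ~ disp + I(disp^2), data = mtcars)))
# [1] "(Intercept)" "disp"        "I(disp^2)"
```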

Concatenation of formulae in R

Function constrain of the R package diversitree takes a list of formulae as input.
formulae <- list(lambda1 ~ lambda0, mu1 ~ mu0, q10 ~ q01)
constrain(lik, formulae=formulae)
I would like to pass these formulae via a decision tree and concatenate them as needed.
f1 <- "lambda1 ~ lambda0"
f2 <- "mu1 ~ mu0"
f3 <- "q10 ~ q01"
How do I arrive at the above-shown list formulae?
Unsuccessful attempt:
formulae <- as.formula(paste(f1,f2,f3, collapse=","))
EDIT 1:
I do not know the precise number of formulae a priori; it is determined via a decision tree. The number of individual formulae (i.e., f1, f2, f3, etc.) that go into the variable formulae should thus not be hard-coded.
You can use:
formulae = list(as.formula(f1),as.formula(f2),as.formula(f3))
If you originally have all string formulae in a vector, say f <- c(f1, f2, f3), you may use
lapply(f, as.formula)
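For the case where the number of formulae is not known in advance, a minimal sketch could look like this. The decisions vector is a made-up stand-in for whatever your decision tree returns, and candidates is just a hypothetical lookup table of the formula strings.

```r
# Hypothetical decision results; in practice these come from the decision tree.
decisions <- c(lambda = TRUE, mu = FALSE, q = TRUE)

# Candidate constraints, stored as strings and keyed by the same names.
candidates <- c(lambda = "lambda1 ~ lambda0",
                mu     = "mu1 ~ mu0",
                q      = "q10 ~ q01")

# Keep only the selected strings, then convert each one to a formula.
selected <- candidates[names(decisions)[decisions]]
formulae <- lapply(unname(selected), as.formula)

length(formulae)
# [1] 2
```

The resulting list can then be passed as the formulae argument exactly like the hand-written list in the question.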

Converting R formula format to mathematical equation

When we fit a statistical model in R, say
lm(y ~ x, data=dat)
We use R's special formula syntax: "y~x"
Is there something that converts from such a formula to the corresponding equation? In this case it could be written as:
y = B0 + B1*x
This would be very useful! For one, because with more complicated formulae I don't trust my translation. Second, in scientific papers written with R/Sweave/knitr, sometimes the model should be reported in equation form and for fully reproducible research, we'd like to do this in automated fashion.
Just had a quick play and got this working:
# define a function to take a linear regression
# (anything that supports coef() and terms() should work)
expr.from.lm <- function (fit) {
  # the terms we're interested in
  con <- names(coef(fit))
  # current expression (built from the inside out)
  expr <- quote(epsilon)
  # prepend expressions, working from the last symbol backwards
  for (i in length(con):1) {
    if (con[[i]] == '(Intercept)')
      expr <- bquote(beta[.(i-1)] + .(expr))
    else
      expr <- bquote(beta[.(i-1)] * .(as.symbol(con[[i]])) + .(expr))
  }
  # add in response
  expr <- bquote(.(terms(fit)[[2]]) == .(expr))
  # convert to expression (for easy plotting)
  as.expression(expr)
}
# generate and fit dummy data
df <- data.frame(iq=rnorm(10), sex=runif(10) < 0.5, weight=rnorm(10), height=rnorm(10))
f <- lm(iq ~ sex + weight + height, df)
# plot with our expression as the title
plot(resid(f), main=expr.from.lm(f))
There is a lot of freedom in what the variables are called and in whether you actually want the coefficients in there as well, but this seems good for a start.
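If you only need the plain-text form from the question (y = B0 + B1*x) and want to start from the formula rather than a fitted model, a rough sketch is below. The name eq_from_formula is made up, and it assumes one coefficient per term label, which holds for numeric predictors but not for factors with more than two levels.

```r
# Hypothetical helper: turn a formula into a plain-text equation string.
# Assumes each term label corresponds to exactly one coefficient.
eq_from_formula <- function(f) {
  tt   <- terms(f)
  labs <- attr(tt, "term.labels")
  rhs  <- if (attr(tt, "intercept") == 1) "B0" else character(0)
  rhs  <- c(rhs, sprintf("B%d*%s", seq_along(labs), labs))
  paste(deparse(tt[[2]]), "=", paste(rhs, collapse = " + "))
}

eq_from_formula(y ~ x)
# [1] "y = B0 + B1*x"
eq_from_formula(iq ~ sex + weight + height)
# [1] "iq = B0 + B1*sex + B2*weight + B3*height"
```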
