I() function and its use cases [duplicate] - r

I haven't been able to find an answer to this question, largely because googling anything with a standalone letter (like "I") causes issues.
What does the "I" do in a model like this?
data(rock)
lm(area~I(peri - mean(peri)), data = rock)
Considering that the following does NOT work:
lm(area ~ (peri - mean(peri)), data = rock)
and that this does work:
rock$peri - mean(rock$peri)
Any key words on how to research this myself would also be very helpful.

I isolates or insulates the contents of I( ... ) from the gaze of R's formula parsing code. It allows the standard R operators to work as they would if you used them outside of a formula, rather than being treated as special formula operators.
For example:
y ~ x + x^2
would, to R, mean "give me:
x = the main effect of x, and
x^2 = the main effect and the second order interaction of x",
not the intended x plus x-squared:
> model.frame( y ~ x + x^2, data = data.frame(x = rnorm(5), y = rnorm(5)))
y x
1 -1.4355144 -1.85374045
2 0.3620872 -0.07794607
3 -1.7590868 0.96856634
4 -0.3245440 0.18492596
5 -0.6515630 -1.37994358
This is because ^ is a special operator in a formula, as described in ?formula. You end up only including x in the model frame because the main effect of x is already included from the x term in the formula, and there is nothing to cross x with to get the second-order interactions in the x^2 term.
To get the usual operator, you need to use I() to isolate the call from the formula code:
> model.frame( y ~ x + I(x^2), data = data.frame(x = rnorm(5), y = rnorm(5)))
y x I(x^2)
1 -0.02881534 1.0865514 1.180593....
2 0.23252515 -0.7625449 0.581474....
3 -0.30120868 -0.8286625 0.686681....
4 -0.67761458 0.8344739 0.696346....
5 0.65522764 -0.9676520 0.936350....
(that last column is correct, it just looks odd because it is of class AsIs.)
In your example, - when used in a formula would indicate removal of a term from the model, where you wanted - to have it's usual binary operator meaning of subtraction:
> model.frame( y ~ x - mean(x), data = data.frame(x = rnorm(5), y = rnorm(5)))
Error in model.frame.default(y ~ x - mean(x), data = data.frame(x = rnorm(5), :
variable lengths differ (found for 'mean(x)')
This fails for reason that mean(x) is a length 1 vector and model.frame() quite rightly tells you this doesn't match the length of the other variables. A way round this is I():
> model.frame( y ~ I(x - mean(x)), data = data.frame(x = rnorm(5), y = rnorm(5)))
y I(x - mean(x))
1 1.1727063 1.142200....
2 -1.4798270 -0.66914....
3 -0.4303878 -0.28716....
4 -1.0516386 0.542774....
5 1.5225863 -0.72865....
Hence, where you want to use an operator that has special meaning in a formula, but you need its non-formula meaning, you need to wrap the elements of the operation in I( ).
Read ?formula for more on the special operators, and ?I for more details on the function itself and its other main use-case within data frames (which is where the AsIs bit originates from, if you are interested).

From the docs:
Function I has two main uses.
In function data.frame. Protecting an object by enclosing it in I() in a call to data.frame inhibits the conversion of character vectors to factors and the dropping of names, and ensures that matrices are inserted as single columns. I can also be used to protect objects which are to be added to a data frame, or converted to a data frame via as.data.frame.
To address this point:
df1 <- data.frame(stringi = I("dog"))
df2 <- data.frame(stringi = "dog")
str(df1)
str(df2)
In function formula. There it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so they are used as arithmetical operators. This is interpreted as a symbol by terms.formula.
To address this point:
lm(mpg ~ disp + drat, mtcars)
lm(mpg ~ I(disp + drat), mtcars)
Second line. "Creates a new predictor" that is the literal sum of disp + drat

Related

Best way to tell if a formula contains a random effect?

I have a list of formulas that I would like to fit in a loop using a function. Some of these formulas are random effects models and others are straightforward linear models. I want the function to detect whether the model contains a random effect and if so, use lmer() to fit the model. Otherwise, it should use lm(). Any suggestions on how to check this condition (other than converting the formula to a string and checking for parentheses)? At this stage, they have the same class so I can't just check that. I could also use error handling to catch when lmer() returns an error from a model without a random effect and reroute towards regular lm(), but this also seems unnecessarily messy.
Example below:
fit_models <- function(formula_list) {
models <- list()
for(ii in seq_along(formula_list)) {
if(formula_list[[ii]] is lmer) { # Enter condition here
print("lmer")
} else {
print("lm")
}
}
}
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
formulas <- c(f1, f2)
fit_models(formulas)
I would say
length(lme4::findbars(f))>0
should reliably detect formulas containing a random-effects component (in the lme4 sense).
From the right hand side of a formula for a mixed-effects model,
determine the pairs of expressions that are separated by the
vertical bar operator.
This is (implicitly) the test that's done in the lme4 code, here ...
The symbols in formulas don't have inherent meanings. A function can reinterpret the symbols to mean whatever they like. So just because there is a "|", that doesn't mean necessarily that that's a formula that has a random effect. That's just how lmer chose to interpret that symbol.
Given that formulas are basically just ordered collections of unevaluated symbols, there's not much more you can do than a basic equality check for a symbol operating on just the formula itself. Rather than a strait up character conversion, you could use all.names. So something like
f2 <- formula(y ~ 1 + x + (1 + x | z))
all.names(f2)
# [1] "~" "y" "+" "+" "x" "(" "|" "+" "x" "z"
"|" %in% all.names(f2)
# [1] TRUE
This won't be fooled if you have something like formula(`a|b` ~ x) where a|b is a (terrible) column name.
You can just convert the formula to a character and look for the pipe operator |:
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
formulas <- c(f1, f2)
sapply(formulas, function(x) any(grepl("\\|", as.character(x))))
#> [1] FALSE TRUE

Combining grep() family of functions with a conditional if statment

I'm using LASSO as a variable selection method for my analysis, but there's one particular variable that I wish to ensure is contained in the final formula. I have automated the entire process to return the variables that LASSO selects and spits them into a character string formula e.g. formula = y~x1+x2+x3+... However there is one variable in particular I would like to keep in the formula even if LASSO does not select it. Now I could easily manually add this variable to the formula after the fact, but in the interest of improving my R skills I'm trying to automate the entire process.
My thoughts to achieve my goal so far was nesting the grep() function inside an ifelse() statement e.g. ifelse(grep("variable I'm concerned with",formula)!=1, formula=formula,formula=paste0(formula,'variable I'm concerned with',collapse="+")) but this has not done the trick.
Am I on the right track or can anyone think of alternative routes to take?
According to documentation
penalty.factor - Separate penalty factors can be applied to each
coefficient. This is a number that multiplies lambda to allow
differential shrinkage. Can be 0 for some variables, which implies no
shrinkage, and that variable is always included in the model. Default
is 1 for all variables (and implicitly infinity for variables listed
in exclude). Note: the penalty factors are internally rescaled to sum
to nvars, and the lambda sequence will reflect this change.
So apply this as an argument to glmnet using a penalty factor of 0 for your "key coefficient" and 1 elsewhere.
Formula is not a character object, but you might want to explore terms.formula if your goal is to edit formulas directly based on character output. terms objects are really powerful ways of doing variable subset and selection. But you really need to explore it because the formula language was not really meant to be automated easily, rather it was meant to be a convenient and readable way to specify model fits (look at how difficult SAS is by comparison).
f <- y ~ x1 +x2
t <- terms(f)
## drop 'x2'
i.x2 <- match('x2', attr(t, 'term.labels'))
t <- t[, -i.x2] ## drop the variable
## t is still a "terms" object but `lm` and related functions have implicit methods for interpreting as a "formula" object.
lm(t)
Currently, you are attempting to adjust character value of formula to a formula object which will not work given the different types. Instead, consider stats::update which will not add any terms not already included as a term:
lasso_formula <- as.formula("y ~ x1 + x2 + x3")
# EXISTING TERM
lasso_formula <- update(lasso_formula, ~ . + x3)
lasso_formula
# y ~ x1 + x2 + x3
# NEEDED VARIABLE
lasso_formula <- update(lasso_formula, ~ . + myTerm)
lasso_formula
# y ~ x1 + x2 + x3 + myTerm
Should formula be a character string, be sure to use grepl (not grep) in ifelse. And do not assign with = inside ifelse as it is a function itself returning a value itself and not to be confused with if...else:
lasso_formula <- "y ~ x1 + x2 + x3"
lasso_formula <- ifelse(grepl("myterm", lasso_formula),
lasso_formula,
paste(lasso_formula, "+ myterm"))
lasso_formula
# [1] "y ~ x1 + x2 + x3 + myterm"

Nonlinear model with many independent variables (fixed effects) in R

I'm trying to fit a nonlinear model with nearly 50 variables (since there are year fixed effects). The problem is I have so many variables that I cannot write the complete formula down like
nl_exp = as.formula(y ~ t1*year.matrix[,1] + t2*year.matrix[,2]
+... +t45*year.matirx[,45] + g*(x^d))
nl_model = gnls(nl_exp, start=list(t=0.5, g=0.01, d=0.1))
where y is the binary response variable, year.matirx is a matrix of 45 columns (indicating 45 different years) and x is the independent variable. The parameters need to be estimated are t1, t2, ..., t45, g, d.
I have good starting values for t1, ..., t45, g, d. But I don't want to write a long formula for this nonlinear regression.
I know that if the model is linear, the expression can be simplified using
l_model = lm(y ~ factor(year) + ...)
I tried factor(year) in gnls function but it does not work.
Besides, I also tried
nl_exp2 = as.formula(y ~ t*year.matrix + g*(x^d))
nl_model2 = gnls(nl_exp2, start=list(t=rep(0.2, 45), g=0.01, d=0.1))
It also returns me error message.
So, is there any easy way to write down the nonlinear formula and the starting values in R?
Since you have not provided any example data, I wrote my own - it is completely meaningless and the model actually doesn't work because it has bad data coverage but it gets the point across:
y <- 1:100
x <- 1:100
year.matrix <- matrix(runif(4500, 1, 10), ncol = 45)
start.values <- c(rep(0.5, 45), 0.01, 0.1) #you could also use setNames here and do this all in one row but that gets really messy
names(start.values) <- c(paste0("t", 1:45), "g", "d")
start.values <- as.list(start.values)
nl_exp2 <- as.formula(paste0("y ~ ", paste(paste0("t", 1:45, "*year.matrix[,", 1:45, "]"), collapse = " + "), " + g*(x^d)"))
gnls(nl_exp2, start=start.values)
This may not be the most efficient way to do it, but since you can pass a string to as.formula it's pretty easy to use paste commands to construct what you are trying to do.

Defining an infix operator for use within a formula

I'm trying to create a more parsimonious version of this solution, which entails specifying the RHS of a formula in the form d1 + d1:d2.
Given that * in the context of a formula is a pithy stand-in for full interaction (i.e. d1 * d2 gives d1 + d2 + d1:d2), my approach has been to try and define an alternative operator, say %+:% using the infix approach I've grown accustomed to in other applications, a la:
"%+:%" <- function(d1,d2) d1 + d2 + d1:d2
However, this predictably fails because I haven't been careful about evaluation; let's introduce an example to illustrate my progress:
set.seed(1029)
v1 <- runif(1000)
v2 <- runif(1000)
y <- .8*(v1 < .3) + .2 * (v2 > .25 & v2 < .8) -
.4 * (v2 > .8) + .1 * (v1 > .3 & v2 > .8)
With this example, hopefully it's clear why simply writing out the two terms might be undesirable:
y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3)
One workaround which is close to my desired output is to define the whole formula as a function:
plus.times <- function(outvar, d1, d2){
as.formula(paste0(quote(outvar), "~", quote(d1),
"+", quote(d1), ":", quote(d2)))
}
This gives the expected coefficients when passed to lm, but with names that are harder to interpret directly (especially in the real data where we take care to give d1 and d2 descriptive names, in contrast to this generic example):
out1 <- lm(y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3))
out2 <- lm(plus.times(y, cut(v2, breaks = c(0, .25, .8, 1)), I(v1 < .3)))
any(out1$coefficients != out2$coefficients)
# [1] FALSE
names(out2$coefficients)
# [1] "(Intercept)" "d1(0.25,0.8]" "d1(0.8,1]" "d1(0,0.25]:d2TRUE"
# [5] "d1(0.25,0.8]:d2TRUE" "d1(0.8,1]:d2TRUE"
So this is less than optimal.
Is there any way to define the adjust the code so that the infix operator I mentioned above works as expected? How about altering the form of plus.times so that the variables are not renamed?
I've been poking around (?formula, ?"~", ?":", getAnywhere(formula.default), this answer, etc.) but haven't seen how exactly R interprets * when it's encountered in a formula so that I can make my desired minor adjustments.
You do not need to define a new operator in this case: in a formula d1/d2 expands to d1 + d1:d2. In other words d1/d2 specifies that d2 is nested within d1. Continuing your example:
out3 <- lm(y ~ cut(v2,breaks=c(0,.25,.8,1))/I(v1 < .3))
all.equal(coef(out1), coef(out3))
# [1] TRUE
Further comments
Factors may be crossed or nested. Two factors are crossed if it possible to observe every combination of levels of the two factors, e.g. sex and treatment, temperature and pH, etc. A factor is nested within another if each level of that factor can only be observed within one of the levels of the other factor, e.g. town and country, staff member and store etc.
These relationships are reflected in the parametrization of the model. For crossed factors we use d1*d2 or d1 + d2 + d1:d2, to give the main effect of each factor, plus the interaction. For nested factors we use d1/d2 or d1 + d1:d2 to give a separate submodel of the form 1 + d2 for each level of d1.
The idea of nesting is not restricted to factors, for example we may use sex/x to fit a separate linear regression on x for males and females.
In a formula, %in% is equivalent to :, but it may be used to emphasize the nested, or hierarchical structure of the data/model. For example, a + b %in% a is the same as a + a:b, but reading it as "a plus b within a" gives a better description of the model being fitted. Even so, using / has the advantage of simplifying the model formula at the same time as emphasizing the structure.

What does the capital letter "I" in R linear regression formula mean?

I haven't been able to find an answer to this question, largely because googling anything with a standalone letter (like "I") causes issues.
What does the "I" do in a model like this?
data(rock)
lm(area~I(peri - mean(peri)), data = rock)
Considering that the following does NOT work:
lm(area ~ (peri - mean(peri)), data = rock)
and that this does work:
rock$peri - mean(rock$peri)
Any key words on how to research this myself would also be very helpful.
I isolates or insulates the contents of I( ... ) from the gaze of R's formula parsing code. It allows the standard R operators to work as they would if you used them outside of a formula, rather than being treated as special formula operators.
For example:
y ~ x + x^2
would, to R, mean "give me:
x = the main effect of x, and
x^2 = the main effect and the second order interaction of x",
not the intended x plus x-squared:
> model.frame( y ~ x + x^2, data = data.frame(x = rnorm(5), y = rnorm(5)))
y x
1 -1.4355144 -1.85374045
2 0.3620872 -0.07794607
3 -1.7590868 0.96856634
4 -0.3245440 0.18492596
5 -0.6515630 -1.37994358
This is because ^ is a special operator in a formula, as described in ?formula. You end up only including x in the model frame because the main effect of x is already included from the x term in the formula, and there is nothing to cross x with to get the second-order interactions in the x^2 term.
To get the usual operator, you need to use I() to isolate the call from the formula code:
> model.frame( y ~ x + I(x^2), data = data.frame(x = rnorm(5), y = rnorm(5)))
y x I(x^2)
1 -0.02881534 1.0865514 1.180593....
2 0.23252515 -0.7625449 0.581474....
3 -0.30120868 -0.8286625 0.686681....
4 -0.67761458 0.8344739 0.696346....
5 0.65522764 -0.9676520 0.936350....
(that last column is correct, it just looks odd because it is of class AsIs.)
In your example, - when used in a formula would indicate removal of a term from the model, where you wanted - to have it's usual binary operator meaning of subtraction:
> model.frame( y ~ x - mean(x), data = data.frame(x = rnorm(5), y = rnorm(5)))
Error in model.frame.default(y ~ x - mean(x), data = data.frame(x = rnorm(5), :
variable lengths differ (found for 'mean(x)')
This fails for reason that mean(x) is a length 1 vector and model.frame() quite rightly tells you this doesn't match the length of the other variables. A way round this is I():
> model.frame( y ~ I(x - mean(x)), data = data.frame(x = rnorm(5), y = rnorm(5)))
y I(x - mean(x))
1 1.1727063 1.142200....
2 -1.4798270 -0.66914....
3 -0.4303878 -0.28716....
4 -1.0516386 0.542774....
5 1.5225863 -0.72865....
Hence, where you want to use an operator that has special meaning in a formula, but you need its non-formula meaning, you need to wrap the elements of the operation in I( ).
Read ?formula for more on the special operators, and ?I for more details on the function itself and its other main use-case within data frames (which is where the AsIs bit originates from, if you are interested).
From the docs:
Function I has two main uses.
In function data.frame. Protecting an object by enclosing it in I() in a call to data.frame inhibits the conversion of character vectors to factors and the dropping of names, and ensures that matrices are inserted as single columns. I can also be used to protect objects which are to be added to a data frame, or converted to a data frame via as.data.frame.
To address this point:
df1 <- data.frame(stringi = I("dog"))
df2 <- data.frame(stringi = "dog")
str(df1)
str(df2)
In function formula. There it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so they are used as arithmetical operators. This is interpreted as a symbol by terms.formula.
To address this point:
lm(mpg ~ disp + drat, mtcars)
lm(mpg ~ I(disp + drat), mtcars)
Second line. "Creates a new predictor" that is the literal sum of disp + drat

Resources