Significance of 'I' keyword in lm model in R [duplicate]

(Duplicate of: What does the capital letter "I" in R linear regression formula mean?)
I was creating a linear model for my assignment:
lm(revenue ~ (max_cpc - max_cpc.mean), data = traffic)
But it throws:
Error in model.frame.default(formula = revenue ~ (max_cpc - max_cpc.mean), :
variable lengths differ (found for 'max_cpc.mean')
Then, through trial and error, I slightly modified my code:
lm(revenue ~ I(max_cpc - max_cpc.mean), data = traffic)
and bingo!!! It worked well.
But now I am trying to figure out the significance of 'I' and how it fixed my problem. Can anyone explain it to me?

I() prevents the formula interface from interpreting the argument; instead, the expression inside is passed along to be evaluated as ordinary arithmetic.
In the formula interface, -x means 'remove x from the predictors', so you can write y ~ . - x to mean 'fit y against everything but x'.
You don't want it to do that - you actually want to make a variable that is the difference of two variables and regress on that, so you don't want the formula interface to parse that expression.
I() achieves that for you.
Terms with squaring in them (x^2) also need the same treatment. The formula interface does something special with powers, and if you actually want a variable squared you have to I() it.
I() has other uses in other contexts as well. See ?I.
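For a concrete illustration, here is a small simulated version (the data frame and column values are made up to mirror the question):
# Simulated stand-in for the OP's data
set.seed(42)
traffic <- data.frame(revenue = rnorm(100), max_cpc = runif(100, 1, 5))
traffic$max_cpc.mean <- mean(traffic$max_cpc)
# Without I(), '-' is formula syntax ("drop this term"), not subtraction:
lm(revenue ~ (max_cpc - max_cpc.mean), data = traffic)  # silently fits revenue ~ max_cpc
# With I(), the subtraction happens first, giving a single centred predictor:
lm(revenue ~ I(max_cpc - max_cpc.mean), data = traffic)
# Same idea for powers: x^2 in a formula is a (self-)interaction, so use I():
lm(revenue ~ max_cpc + I(max_cpc^2), data = traffic)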

Related

The Validation Set Approach

I am studying the book ISLR on my own
library(ISLR)
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower, data= Auto, subset=train)
attach(Auto)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
[1] 23.26601
If I don't use attach():
library(ISLR)
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower, data= Auto, subset=train)
mean((mpg-predict(lm.fit,data=Auto))[-train]^2)
[1] 97.06483
Why does the result change so substantially?
Also, I don't understand the syntax of this code: mean((mpg-predict(lm.fit,data=Auto))[-train]^2);
what does [] represent?
Also, why mpg-predict? We usually use ~ for formulas.
I tried to use ?mean but it didn't show the answers.
Many thanks in advance for the help.
The second argument to predict.lm is not data, it is newdata. So the first set of instructions matched the Auto data frame to the newdata argument. If you run the second set of instructions with newdata as the parameter, you get the same result:
mean((mpg-predict(lm.fit,newdata=Auto))[-train]^2)
[1] 23.26601
When you execute mpg - predict(lm.fit, newdata=Auto) you are getting the residuals from the model: the difference between the mpg values and the predictions for that variable. (It's just a minus sign between the expressions mpg and predict(...).)
The next part of the code, [-train], removes the training set from consideration. These are often called the "out-of-bag" residuals when you are doing cross-validation.
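A quick way to see what went wrong, reusing lm.fit, train and Auto from the snippets above:
# predict.lm has no 'data' argument, so data = Auto falls into '...' and is
# silently ignored; you just get the fitted values for the 196 training rows.
length(predict(lm.fit, data = Auto))     # 196
length(predict(lm.fit, newdata = Auto))  # 392, one prediction per row of Auto
# mpg has length 392, so subtracting a length-196 vector recycles it,
# which is where the misleading 97.06483 came from.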
Instead of attaching, use with:
with(Auto, mean((mpg-predict(lm.fit,Auto))[-train]^2))
#[1] 23.26601
The use of - on 'train' (see ?Extract) subsets the vector by removing the positions created by sample, before taking the mean of the squared residuals.
with(Auto, (mpg-predict(lm.fit,Auto)))
If we don't use attach or with, the 'mpg' object is not found in the global environment, so it results in an error:
mean((mpg-predict(lm.fit,data=Auto))[-train]^2)
Error in mean((mpg - predict(lm.fit, data = Auto))[-train]^2) :
object 'mpg' not found
If the OP got a different value, then mpg may have come from a different dataset.
Regarding the use of ~ in a formula, according to ?lm:
Models for lm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
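You can check these expansions yourself with terms(), which reports the term labels a formula generates:
attr(terms(y ~ first * second), "term.labels")
# [1] "first"  "second"  "first:second"
attr(terms(y ~ (a + b + c)^2), "term.labels")
# [1] "a"  "b"  "c"  "a:b"  "a:c"  "b:c"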

Specifying multiple fixed effects in felm

I am trying to estimate the following model:
y_{it} = \alpha + \beta x_{it} + \eta_i + \gamma_t + group_i \times \gamma_t + \epsilon_{it}
#Clear everything and load the needed libraries:
rm(list=ls())
library(data.table)
library(lfe)   # provides felm() and getfe(), used below
#Define nr of individuals:
nr_ind<-1000
#Define time periods
nr_time<-5
#Define groups:
nr_groups<-2
#Create individual indicators:
pers_id<-rep(1:nr_ind,each = nr_time)
time_id<-rep(1:nr_time,nr_ind)
data<-data.table(pers_id=pers_id,time_id=time_id)
#Create time varying regressor:
data<-data[,x:=rnorm(1,0.01),by=c("pers_id","time_id")]
#Create time effect:
data<-data[,mean_x_time:=3*mean(x),by=c("time_id")]
#Create fixed effect:
data<-data[,mean_x_person:=1.5*mean(x),by=c("pers_id")]
#Create group varying time effect:
data_group<-data.table(pers_id=1:nr_ind,group=sample(c("M","F"),nr_ind,replace=TRUE))
data<-merge(data,data_group,by="pers_id",all.x=TRUE)
data<-data[,group_effect:=ifelse(group=="M",mean_x_time+mean_x_time^2+0.03,0)]
#Define the model:
data$y<-0.1+0.3*data$x+data$mean_x_person+data$mean_x_time+data$group_effect+rnorm(dim(data)[1])
data<-data[,time_id:=as.factor(time_id)]
data<-data[,group:=as.factor(group)]
model<-felm(y~x|pers_id+time_id*group,data=data)
When I then type:
getfe(model)
I obtain an error, which is an expected result given that pers_id and group are collinear. As far as I understand what felm does, it creates:
pers_id+time_id+group_id+time_id:group_id
Currently I can do something like this:
interaction_term<-interaction(data$time_id,data$group)
data$interaction_term<-as.character(interaction_term)
data$dummy_1<-ifelse(as.character(data$interaction_term)=="1.M",1,0)
data$dummy_2<-ifelse(as.character(data$interaction_term)=="2.M",1,0)
data$dummy_3<-ifelse(as.character(data$interaction_term)=="3.M",1,0)
data$dummy_4<-ifelse(as.character(data$interaction_term)=="4.M",1,0)
data$dummy_5<-ifelse(as.character(data$interaction_term)=="5.M",1,0)
model<-felm(y~x+dummy_2+dummy_3+dummy_4+dummy_5|pers_id+time_id,data=data)
But this is a little bit clumsy and becomes infeasible when I have a lot of time periods. So my question is whether it is possible in felm to specify felm(y~x|f1:f2) and get only the interaction effect, i.e. f1:f2, and not f1+f2+f1:f2.
The construction a*b is not supported in the fixed-effect part of formulas in felm. That field is not parsed with the ordinary parser, mainly because expressions like f*x, where f is a factor and x is numeric, would create havoc. That is, it would create fixed effects like x + f + f:x, but x, being numeric, should normally be treated as an ordinary continuous variable, i.e. be put in the first part of the formula. This is of course possible to do automatically, but is not currently supported by felm. Neither is f*g with two factors; what actually happens then, I don't know.
The parser is quite simplistic: it consists of letting : be an infix function of two variables, and the fixed-effect part of the formula is then eval'ed in the model frame. The function : creates an interaction factor if both its arguments are factors. It works recursively, so things like f:g:h with three factors also work (by chance, rather than by thought). If one of the arguments is numeric, it creates an lfe-internal structure so that the expression is treated as an interaction between a factor and a continuous covariate.
In short, there is very limited formula functionality in the fixed-effect part of the formula. Full interaction with '*' is not supported; only things like time_id:group, or time_id:group + time_id + group, are.
There is currently an error in getfe() which may lead to an obscure error message if an interaction between two factors is specified in the fixed-effect field to felm. (It's an idiotic error; expressions like attr(f,'x') match partially on 'x', and I haven't specified exact=TRUE, so it matches something it shouldn't, deep inside lfe.) This will be fixed in the next version, which is due in a week or so.
This mess is due to the fact that the syntax f:x was introduced in the fixed-effect field to support interaction between a factor f and a continuous covariate x. Interaction between two factors was implemented as an afterthought.
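So, applied to the model in the question, the supported spelling is the plain : interaction in the fixed-effect field (a sketch; as noted above, getfe() on such a model may hit the bug until the fix is released):
# individual fixed effect plus a group-varying time effect
model <- felm(y ~ x | pers_id + time_id:group, data = data)
# getfe(model)  # may trigger the getfe() error described above in current versions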

What is the meaning of "~ -1 + ." in R?

I am trying to understand a R script and I came upon this line:
train <- cbind(train[,c(1,2)],model.matrix(~ -1 + .,train[,-c(1,2)]))
train is a data.frame. I think it is trying to combine the first two columns of train with all the other columns after they have been through some sort of matrix manipulation. However, I cannot understand exactly what the model formula(?) is doing. From the comment in the script, its purpose is to turn all the other columns into 0's and 1's, but I'm not sure how. If someone could clarify that would be great. Thanks!
From ?formula:
The - operator removes the specified terms... [i]t can also be used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin.
Further:
There are two special interpretations of . in a formula. The usual one is in the context of a data argument of model fitting functions and means ‘all columns not otherwise in the formula’
So, you have a formula specifying that the response is a function of all variables in train[,-c(1,2)], with the intercept removed (a line through the origin). Without an intercept, model.matrix expands factor columns into 0/1 indicator columns (one column per level for the first factor, since there is no intercept to absorb a baseline level), which is what turns the other columns into 0's and 1's.
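A toy illustration (data frame invented here) of how a factor column becomes 0/1 indicator columns:
df <- data.frame(size = factor(c("S", "M", "L")), n = 1:3)
model.matrix(~ -1 + ., df)
#   sizeL sizeM sizeS n
# 1     0     0     1 1
# 2     0     1     0 2
# 3     1     0     0 3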

R tilde operator: What does ~0+a mean?

I have seen how to use the ~ operator in a formula. For example, y~x means: y is distributed as x.
However, I am really confused about what ~0+a means in this code:
require(limma)
a = factor(1:3)
model.matrix(~0+a)
Why does just model.matrix(a) not work? Why is the result of model.matrix(~a) different from model.matrix(~0+a)? And finally, what is the meaning of the ~ operator here?
~ creates a formula: it separates the left-hand and right-hand sides of a formula.
From ?`~`
Tilde is used to separate the left- and right-hand sides in a model formula.
Quoting from the help for formula
The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b is interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also be used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can also be specified as y ~ x + 0 or y ~ 0 + x.
So, regarding the specific issue with ~0+a (equivalently ~a+0):
You are creating a model matrix without an intercept. As a is a factor, model.matrix(~a) will return an intercept column that absorbs the first level a1 (you need only n-1 indicators plus an intercept to fully specify n classes), whereas model.matrix(~0+a) returns one indicator column per level.
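Running both versions on the question's factor makes the difference concrete (printed attributes trimmed):
a <- factor(1:3)
model.matrix(~a)      # intercept column absorbs level a1
#   (Intercept) a2 a3
# 1           1  0  0
# 2           1  1  0
# 3           1  0  1
model.matrix(~0+a)    # no intercept: one indicator column per level
#   a1 a2 a3
# 1  1  0  0
# 2  0  1  0
# 3  0  0  1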
The help files for each function are well written, detailed and easy to find!
why doesn't model.matrix(a) work
model.matrix(a) doesn't work because a is a factor variable, not a formula or terms object
From the help for model.matrix
object: an object of an appropriate class. For the default method, a model formula or a terms object.
R is looking for a particular class of object; by passing the formula ~a you are passing an object of class formula. model.matrix(terms(~a)) would also work (passing the terms object corresponding to the formula ~a).
general note
@BenBolker helpfully notes in his comment that this is a modified version of Wilkinson-Rogers notation.
There is a good description in the Introduction to R.
After reading several manuals, I was confused by the meaning of model.matrix(~0+x) until I found this excellent book chapter.
In mathematics 0+a is equal to a, so writing a term like 0+a looks very strange. However, we are here dealing with linear models: a simple high-school equation such as y=ax+b that describes the relationship between the predictor variable (x) and the observation (y).
So we can think of ~0+x, or equally ~x+0, as an equation of the form y=ax+b. By adding 0 we are forcing b to be zero; that means we are looking for a line passing through the origin (no intercept). If we specified a model like ~x+1, or just ~x, the fitted equation could possibly contain a non-zero term b. Equally, we may suppress b with the formula ~x-1 or ~-1+x, which both mean: no intercept (the same way we exclude a row or column in R by a negative index). However, something like ~x-2 or ~x+3 is meaningless.
Thanks to @mnel for the useful comment; finally, what's the reason to use ~ and not =? In standard mathematical terminology/symbology, y~x denotes that y is equivalent to x; it is somewhat weaker than y=x. When you are fitting a linear model, you aren't really saying y=x, but rather that you can model y as a linear function of x (y = ax+b, for example).
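A small simulated check of the two forms (made-up data):
set.seed(1)
x <- 1:10
y <- 2 * x + rnorm(10)
coef(lm(y ~ x))      # intercept b estimated freely
coef(lm(y ~ 0 + x))  # b forced to zero: a line through the origin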
To answer part of your question, tilde is used to separate the left- and right-hand sides in model formula. See ?"~" for more help.

In R formulas, why do I have to use the I() function on power terms, like y ~ I(x^3)

I'm trying to get my head around the use of the tilde operator and associated functions. My first question: why does I() need to be used to apply arithmetic operators? For example, these two plots generate different results (the former showing a straight line, the latter the expected curve):
x <- c(1:100)
y <- seq(0.1,10,0.1)
plot(y~x^3)
plot(y~I(x^3))
Further, both of the following plots generate the expected result:
plot(x^3, y)
plot(I(x^3), y)
My second question is, perhaps the examples I've been using are too simple, but: where should ~ actually be used?
The tilde operator is actually a function that returns an unevaluated expression, a type of language object. The expression then gets interpreted by modeling functions in a manner that is different than the interpretation of operators operating on numeric objects.
The issue here is how formulas, and specifically the "+", ":", and "^" operators in them, are interpreted. (A side note: the statistically preferred procedure is to use the function poly when making higher-order terms in a regression formula.) Within R formulas, the infix operators "+", "*", ":" and "^" have entirely different meanings than when used in calculations with numeric vectors. In a formula, the tilde (~) separates the left-hand side from the right-hand side. The ^ and : operators are used to construct interactions, so x = x^2 = x^3 rather than the mathematical powers you might expect. (A variable interacting with itself is just the same variable.) If you had typed (x+y)^2 the R interpreter would have produced (for its own good internal use), not the mathematical x^2 + 2xy + y^2, but rather the symbolic x + y + x:y, where x:y is an interaction term without its main effects. (The ^ gives you both main effects and interactions.)
?formula
The I() function acts to convert the argument to "as.is", i.e. what you expect. So I(x^2) would return a vector of values raised to the second power.
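One way to see this without fitting anything is model.frame(), which shows the columns a formula actually produces (small made-up data frame):
d <- data.frame(x = 1:5, y = 1:5)
model.frame(y ~ x^3, d)$x                # 1 2 3 4 5  (x^3 collapsed to x)
model.frame(y ~ I(x^3), d)[["I(x^3)"]]   # 1 8 27 64 125 (computed "as is")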
The ~ should be thought of as saying "is distributed as" or "is dependent on" when seen in regression functions. The ~ is an infix function in its own right. You can see that LHS ~ RHS is almost shorthand for formula(LHS, RHS) by typing this at the console:
`~`(LHS,RHS)
#LHS ~ RHS
class( `~`(LHS,RHS) )
#[1] "formula"
identical( `~`(LHS,RHS), as.formula("LHS~RHS") )
#[1] TRUE # cannot use `formula` since it interprets its first argument
In regression functions, the error term in model descriptions will be in whatever form that regression function presumes, or as specifically called for by the family argument. The mean for the base level will generally be labelled (Intercept). The function context and arguments may also determine a link function, such as log() or logit(), from the family value, and it is also possible to have a non-canonical family/link combination.
The "+" symbol in a formula is not really adding two variables but is usually an implicit request to calculate a regression coefficient(s) for that variable in the context of the rest of the variables that are on the RHS of a formula. The regression functions use `model.matrix and that function will recognize the presence of factors or character vectors in the formula and build a matrix that expand the levels of the discrete components of the formula.
In plot()-ting functions it basically reverses the usual (x, y) order of arguments that plot takes. A plot.formula method was written so that formulas could be used as a more "mathematical" mode of communicating with R. In graphics::plot.formula, curve, and the 'lattice' and 'ggplot' functions, it governs how multiple factors or numeric vectors are displayed and "facetted".
The overloading of the "+" operator is discussed in the comments below; it is also done in the plotting packages ggplot2 and gridExtra, where it separates functions that deliver object results. There it acts as a pass-through and layering operator. Some aggregation functions have a formula method which uses "+" as an "arrangement" and grouping operator.
