What is the meaning of "~ -1 + ." in R?

I am trying to understand an R script and I came upon this line:
train <- cbind(train[,c(1,2)],model.matrix(~ -1 + .,train[,-c(1,2)]))
train is a data.frame. I think it is trying to combine the first two columns of train with all the other columns after they have been through some sort of matrix manipulation. However, I cannot understand exactly what the model formula(?) is doing. According to a comment in the script, its purpose is to turn all the other columns into 0s and 1s, but I'm not sure how. If someone could clarify, that would be great. Thanks!

From ?formula:
The - operator removes the specified terms... [i]t can also be used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin.
Further:
There are two special interpretations of . in a formula. The usual one is in the context of a data argument of model fitting functions and means ‘all columns not otherwise in the formula’
So, you have a formula specifying that the model matrix should be built from all variables in train[,-c(1,2)], with the intercept removed. When model.matrix() builds that matrix, each factor column is expanded into 0/1 dummy (indicator) columns, which is how the other columns get turned into 0s and 1s.
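For illustration, here is a minimal sketch with a hypothetical data frame showing how model.matrix() expands factor columns into 0/1 indicator columns once the intercept is removed:
df <- data.frame(f = factor(c("a", "b", "a")), x = c(1.5, 2.0, 3.0))
# ~ -1 + . means: all columns of df, with no intercept column
model.matrix(~ -1 + ., df)
# Produces columns fa, fb, x: fa and fb are 0/1 indicators for the levels of f,
# while the numeric column x passes through unchanged.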

Related

The Validation Set Approach

I am studying the book ISLR on my own.
library(ISLR)
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower, data= Auto, subset=train)
attach(Auto)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
23.26601
If I don't use attach():
library(ISLR)
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower, data= Auto, subset=train)
mean((mpg-predict(lm.fit,data=Auto))[-train]^2)
97.06483
Why does the result change so substantially?
Also, I don't know the syntax of this code: mean((mpg-predict(lm.fit,data=Auto))[-train]^2);
what does [] represent?
Also, why mpg - predict? We usually use ~ for formulas.
I tried ?mean but it didn't show the answers.
Many thanks in advance for the help.
The second argument to predict.lm is not "data", it is newdata. So the first set of instructions matched the Auto dataframe positionally to the newdata argument. If you run the second set of instructions with newdata as the parameter, you get the same result:
mean((mpg-predict(lm.fit,newdata=Auto))[-train]^2)
[1] 23.26601
When you execute mpg - predict(lm.fit, newdata=Auto) you are getting the residuals from the model: the difference between each mpg value and the model's prediction for it. (It's just a minus sign between the expressions: mpg - predict(...).)
The next part of the code, [-train], removes the training-set positions from consideration, as shown in the sketch below. These are often called the "out-of-bag" residuals when you are doing cross-validation.
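For illustration, a minimal sketch of negative indexing with hypothetical values:
x <- c(10, 20, 30, 40, 50)
idx <- c(1, 3)  # positions to drop, analogous to the train indices
x[-idx]         # returns 20 40 50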
Instead of attaching, use with:
with(Auto, mean((mpg-predict(lm.fit,Auto))[-train]^2))
#[1] 23.26601
The use of - on 'train' (see ?Extract) subsets the elements of the vector by removing the positions created by sample(), before the mean of the squared differences is taken. The differences themselves come from:
with(Auto, (mpg-predict(lm.fit,Auto)))
If we don't use attach or with, the 'mpg' object is not found in the global environment, so it results in an error:
mean((mpg-predict(lm.fit,data=Auto))[-train]^2)
Error in mean((mpg - predict(lm.fit, data = Auto))[-train]^2) :
object 'mpg' not found
If the OP got a different value instead of an error, then mpg may be coming from a different dataset.
Regarding the use of ~ in formulas, according to ?lm:
Models for lm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
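For illustration, a minimal sketch (with hypothetical factors f and g) confirming that first*second expands to first + second + first:second:
f <- factor(c("a", "a", "b", "b"))
g <- factor(c("x", "y", "x", "y"))
colnames(model.matrix(~ f * g))        # "(Intercept)" "fb" "gy" "fb:gy"
colnames(model.matrix(~ f + g + f:g))  # identical columns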

Specifying multiple fixed effects in felm

I am trying to estimate the following model:
y_{it} = \alpha + \beta x_{it} + \eta_i + \gamma_t + group_i \times \gamma_t + \epsilon_{it}
#Clear everything and load the needed libraries:
rm(list=ls())
library(data.table)
library(lfe)  # needed for felm() and getfe() below
#Define nr of individuals:
nr_ind<-1000
#Define time periods
nr_time<-5
#Define groups:
nr_groups<-2
#Create individual indicators:
pers_id<-rep(1:nr_ind,each = nr_time)
time_id<-rep(1:nr_time,nr_ind)
data<-data.table(pers_id=pers_id,time_id=time_id)
#Create time varying regressor:
data<-data[,x:=rnorm(1,0.01),by=c("pers_id","time_id")]
#Create time effect:
data<-data[,mean_x_time:=3*mean(x),by=c("time_id")]
#Create fixed effect:
data<-data[,mean_x_person:=1.5*mean(x),by=c("pers_id")]
#Create group varying time effect:
data_group<-data.table(pers_id=1:nr_ind,group=sample(c("M","F"),nr_ind,replace=TRUE))
data<-merge(data,data_group,by="pers_id",all.x=TRUE)
data<-data[,group_effect:=ifelse(group=="M",mean_x_time+mean_x_time^2+0.03,0)]
#Define the model:
data$y<-0.1+0.3*data$x+data$mean_x_person+data$mean_x_time+data$group_effect+rnorm(dim(data)[1])
data<-data[,time_id:=as.factor(time_id)]
data<-data[,group:=as.factor(group)]
model<-felm(y~x|pers_id+time_id*group,data=data)
When I then type:
getfe(model)
I obtain an error, which is an expected result given that pers_id and group are collinear. As far as I understand what felm does, it creates:
pers_id+time_id+group_id+time_id:group_id
Currently I can do something like this:
interaction_term<-interaction(data$time_id,data$group)
data$interaction_term<-as.character(interaction_term)
data$dummy_1<-ifelse(as.character(data$interaction_term)=="1.M",1,0)
data$dummy_2<-ifelse(as.character(data$interaction_term)=="2.M",1,0)
data$dummy_3<-ifelse(as.character(data$interaction_term)=="3.M",1,0)
data$dummy_4<-ifelse(as.character(data$interaction_term)=="4.M",1,0)
data$dummy_5<-ifelse(as.character(data$interaction_term)=="5.M",1,0)
model<-felm(y~x+dummy_2+dummy_3+dummy_4+dummy_5|pers_id+time_id,data=data)
But this is a little bit clumsy and becomes infeasible when I have a lot of time periods. So my question is whether it is possible in felm to specify felm(y~x|f1:f2) and get only the interaction effect, i.e. f1:f2 and not f1+f2+f1:f2.
The construction a*b is not supported in the fixed-effect part of formulas in felm. That field is not parsed with the ordinary parser, mainly because expressions like f*x, where f is a factor and x is a numeric, would create havoc. That is, it would create fixed effects like x + f + f:x, but x, being a numeric, should normally be treated as an ordinary continuous variable, i.e. be put in the first part of the formula. This is of course possible to do automatically, but it is not currently supported by felm. Neither is f*g with two factors. What actually happens then, I don't know.
The parser is quite simplistic: it consists of letting : be an infix function of two variables, and the fixed-effect part of the formula is then eval'ed in the model frame. The function : creates an interaction factor if both its arguments are factors. It works recursively, so things like f:g:h with three factors also work (by chance, rather than by thought). If one of the arguments is a numeric, it creates an lfe-internal structure so that the expression is treated as an interaction between a factor and a continuous covariate.
In short, there is very limited formula functionality in the fixed-effect part of the formula. Full interaction with * is not supported; only things like time_id:group, or time_id:group + time_id + group.
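For the model in the question, that suggests a call along the following lines (a sketch only; note the getfe() caveat in the next paragraph):
# interaction-only fixed effects: pers_id plus one effect per (time_id, group) cell
model <- felm(y ~ x | pers_id + time_id:group, data = data)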
There is currently an error in getfe() which may lead to an obscure error message if an interaction between two factors is specified in the fixed-effect field of felm. (It's an idiotic error: expressions like attr(f,'x') match partially on 'x', and I haven't specified exact=TRUE, so it matches something it shouldn't, deep inside lfe.) This will be fixed in the next version, which is due in a week or so.
This mess is due to the fact that the syntax f:x was introduced in the fixed-effect field to support interaction between a factor f and a continuous covariate x. Interaction between two factors was implemented as an afterthought.

Significance of 'I' keyword in lm model in R [duplicate]

This question already has answers here:
What does the capital letter "I" in R linear regression formula mean?
(2 answers)
Closed 8 years ago.
I was creating a linear model for my assignment:
lm(revenue ~ (max_cpc - max_cpc.mean), data = traffic)
But it throws:
Error in model.frame.default(formula = revenue ~ (max_cpc - max_cpc.mean), :
variable lengths differ (found for 'max_cpc.mean')
Then, through trial and error, I slightly modified my code:
lm(revenue ~ I(max_cpc - max_cpc.mean), data = traffic)
and bingo! It worked well.
But now I am trying to figure out the significance of 'I' and how it fixed my problem. Can anyone explain it to me?
I() prevents the formula interface from interpreting the operators in its argument; instead the expression is evaluated as ordinary arithmetic.
In the formula interface, -x means 'remove x from the predictors'. So you can write y ~ . - x to mean 'fit y against everything but x'.
You don't want that here: you actually want to make a variable that is the difference of two variables and regress on that, so you don't want the formula interface to parse the expression.
I() achieves that for you.
Terms with squaring in them (x^2) also need the same treatment: the formula interface does something special with powers (^ means factor crossing there), and if you actually want a variable squared you have to wrap it in I().
I() has some other uses in other contexts as well; see ?I and the sketch below.
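For illustration, a minimal sketch with hypothetical data contrasting the formula reading and the arithmetic reading of the same expressions:
set.seed(1)
d <- data.frame(revenue = rnorm(10), a = rnorm(10), b = rnorm(10))
lm(revenue ~ a - b, data = d)     # formula reading: regress on a, with b removed from the terms
lm(revenue ~ I(a - b), data = d)  # arithmetic reading: one predictor, the difference a - b
lm(revenue ~ I(a^2), data = d)    # a squared as a predictor; without I(), ^ means crossing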

R: multiple linear regression model and prediction model

Starting from a linear model, model1 = lm(temp~alt+sdist), I need to develop a prediction model where new data will come to hand and predictions about temp will be made.
I have tried doing something like this:
model2 = predict.lm(model1, newdata=newdataset)
However, I am not sure this is the right way. What I would like to know is whether this is the right way to make predictions about temp. Also, I am a bit confused about the newdataset: which values should be filled in, etc.?
I am putting everything from the comments into this answer.
1) You can use predict rather than predict.lm, as predict will see that your input is of class lm and dispatch to the right method automatically.
2) The newdataset should be a data.frame with the same variables as your original predictors, in this case alt and sdist.
3) If you are bringing in your data using read.table, by default it will create a data.frame. Assuming the new data has columns named alt and sdist, you can then do:
NewDataSet<-read.table(whatever)
NewPredictions <- predict(model1, newdata=NewDataSet)
4) After you have done this, if you want to check the predictions you can do the following:
summary(model1)
This will give you the intercept and the coefficients for alt and sdist.
NewDataSet[1,]
This should give you the alt and sdist values for the first row; you can change the 1 in the brackets to any row you want. Then use the information from summary(model1) to calculate what the predicted value should be, using any method that you trust.
Finally use
NewPredictions[1]
to get what predict() gave you for the first row (or change the 1 to any other row)
Hopefully that should all work out.
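For illustration, a minimal sketch of the hand check described in step 4 (assuming model1, NewDataSet, and NewPredictions as above):
cf <- coef(model1)  # named vector: (Intercept), alt, sdist
manual <- cf["(Intercept)"] + cf["alt"] * NewDataSet$alt[1] + cf["sdist"] * NewDataSet$sdist[1]
all.equal(unname(manual), unname(NewPredictions[1]))  # should be TRUE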

Regression coefficients by group in dataframe R

I have data on various companies' financial information, organized by company ticker. I'd like to regress one of the columns' values against the others while holding the company constant. Is there an easy way to write this out in lm() notation?
I've tried using:
reg <- lmList(lead2.dDA ~ paudit1 + abs.d.GINDEX + logcapx + logmkvalt +
logmkvalt2|pp, data=reg.df)
where pp is a vector of company names, but this returns coefficients as though I regressed all the data at once (and did not separate by company name).
A convenient and apparently little-known syntax for estimating separate regression coefficients by group in lm() involves using the nesting operator, /. In this case it would look like:
reg <- lm(lead2.dDA ~ 0 + pp/(paudit1 + abs.d.GINDEX + logcapx +
logmkvalt + logmkvalt2), data=reg.df)
Make sure that pp is a factor, not a numeric. Also notice that the overall intercept must be suppressed for this to work; in the new formulation, we have a different "intercept" for each group.
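For illustration, a minimal sketch with a hypothetical two-group dataset:
set.seed(1)
toy <- data.frame(pp = factor(rep(c("A", "B"), each = 50)), x = rnorm(100))
toy$y <- ifelse(toy$pp == "A", 1 + 2 * toy$x, -1 + 0.5 * toy$x) + rnorm(100, sd = 0.1)
# one "intercept" and one slope per level of pp
coef(lm(y ~ 0 + pp/x, data = toy))
# ppA and ppB recover roughly 1 and -1; ppA:x and ppB:x recover roughly 2 and 0.5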
A couple of comments:
Although the regression coefficients obtained this way will match those given by lmList(), it should be noted that with lm() we estimate only a single residual variance across all the groups, whereas lmList() would estimate separate residual variances for each group.
As I mentioned in my earlier comment, the lmList() syntax that you gave looks like it should have worked. Since you say it didn't, I suspect the real problem is something else (although it's hard to tell what without a reproducible example), and so it seems likely that the solution I posted will fail for you as well, for the same unknown reasons. If you want more detailed guidance, please provide more information; help us help you.
