Specifying multiple fixed effects in felm (R)

I am trying to estimate the following model:
y_{it} = \alpha + \beta x_{it} + \eta_i + \gamma_t + group_i \times \gamma_t + \epsilon_{it}
#Clear everything and load the needed libraries:
rm(list=ls())
library(data.table)
library(lfe)
#Define nr of individuals:
nr_ind<-1000
#Define time periods
nr_time<-5
#Define groups:
nr_groups<-2
#Create individual indicators:
pers_id<-rep(1:nr_ind,each = nr_time)
time_id<-rep(1:nr_time,nr_ind)
data<-data.table(pers_id=pers_id,time_id=time_id)
#Create time varying regressor:
data<-data[,x:=rnorm(1,0.01),by=c("pers_id","time_id")]
#Create time effect:
data<-data[,mean_x_time:=3*mean(x),by=c("time_id")]
#Create fixed effect:
data<-data[,mean_x_person:=1.5*mean(x),by=c("pers_id")]
#Create group varying time effect:
data_group<-data.table(pers_id=1:nr_ind,group=sample(c("M","F"),nr_ind,replace=TRUE))
data<-merge(data,data_group,by="pers_id",all.x=TRUE)
data<-data[,group_effect:=ifelse(group=="M",mean_x_time+mean_x_time^2+0.03,0)]
#Define the model:
data$y<-0.1+0.3*data$x+data$mean_x_person+data$mean_x_time+data$group_effect+rnorm(dim(data)[1])
data<-data[,time_id:=as.factor(time_id)]
data<-data[,group:=as.factor(group)]
model<-felm(y~x|pers_id+time_id*group,data=data)
When I then type:
getfe(model)
I obtain an error, which is an expected result given that pers_id and group are collinear. As far as I understand, what felm does is create:
pers_id + time_id + group + time_id:group
Currently I can do something like this:
interaction_term<-interaction(data$time_id,data$group)
data$interaction_term<-as.character(interaction_term)
data$dummy_1<-ifelse(data$interaction_term=="1.M",1,0)
data$dummy_2<-ifelse(data$interaction_term=="2.M",1,0)
data$dummy_3<-ifelse(data$interaction_term=="3.M",1,0)
data$dummy_4<-ifelse(data$interaction_term=="4.M",1,0)
data$dummy_5<-ifelse(data$interaction_term=="5.M",1,0)
model<-felm(y~x+dummy_2+dummy_3+dummy_4+dummy_5|pers_id+time_id,data=data)
But this is a little bit clumsy and becomes infeasible when I have many time periods. So my question is whether it is possible in felm to specify felm(y~x|f1:f2) and obtain only the interaction effect, i.e. f1:f2 and not f1+f2+f1:f2.

The construction a*b is not supported in the fixed-effect part of formulas in felm. That field is not parsed with the ordinary parser, mainly because expressions like f*x, where f is a factor and x is a numeric, would create havoc. That is, it would create fixed effects like x + f + f:x, but x, being a numeric, should normally be treated as an ordinary continuous variable, i.e. be put in the first part of the formula. This is of course possible to do automatically, but is not currently supported by felm. Neither is f*g with two factors. What actually happens then, I don't know.
The parser is quite simplistic: it consists of letting : be an infix function of two variables, and the fixed-effect part of the formula is then eval'ed in the model frame. The function : creates an interaction factor if both its arguments are factors. It works recursively, so things like f:g:h with three factors also work (by chance, rather than by thought). If one of the arguments is a numeric, it creates an lfe-internal structure so that the expression is treated as an interaction between a factor and a continuous covariate.
In short, there is very limited formula functionality in the fixed-effect part of the formula. Full interaction with '*' is not supported; only things like time_id:group, or time_id:group + time_id + group.
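Applied to the model in the question, the accepted syntax would look like the sketch below (using the simulated data from above; whether getfe() can recover the individual effects is a separate identification question, and see the bug note below):
model <- felm(y ~ x | pers_id + time_id:group, data = data)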
There is currently an error in getfe() which may lead to an obscure error message if an interaction between two factors is specified in the fixed-effect field of felm. (It's an idiotic error: expressions like attr(f,'x') match partially on 'x', and I haven't specified exact=TRUE, so it matches something it shouldn't, deep inside lfe.) This will be fixed in the next version, which is due in a week or so.
This mess is due to the fact that the syntax f:x was introduced in the fixed-effect field to support interaction between a factor f and a continuous covariate x. Interaction between two factors was implemented as an afterthought.


The Validation Set Approach

I am studying the book ISLR on my own.
library(ISLR)
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower, data= Auto, subset=train)
attach(Auto)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
[1] 23.26601
If I don't use the attach()
library(ISLR)
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower, data= Auto, subset=train)
mean((mpg-predict(lm.fit,data=Auto))[-train]^2)
[1] 97.06483
Why does the result change substantially?
Also, I don't know the syntax of this code: mean((mpg-predict(lm.fit,data=Auto))[-train]^2).
What does [] represent?
Also, why mpg - predict? We usually use ~ for formulas.
I tried ?mean, but it didn't show the answers.
Many thanks in advance for the help.
The second argument to predict.lm is not "data", it is newdata. So the first set of instructions matched the Auto dataframe to the newdata argument. If you run the second set of instructions with newdata as the parameter, you get the same result:
mean((mpg-predict(lm.fit,newdata=Auto))[-train]^2)
[1] 23.26601
When you execute mpg-predict(lm.fit,newdata=Auto), you are getting the residuals from the model: you are asking for the difference between the mpg variable's value and the prediction for that variable. (It's just a minus sign between the expressions mpg and predict(...).) In your second call, predict.lm has no data argument, so data=Auto was silently absorbed by ...; predict() then returned fitted values for the 196 training observations only, which were recycled against the full 392-element mpg vector, hence the very different number.
The next part of the code, [-train], removes the training set from consideration. These are often called the "out-of-bag" residuals when you are doing cross-validation.
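To answer the [] question directly: [ is R's subsetting operator (see ?Extract), and a negative index drops the listed positions. A tiny illustration:
v <- c(10, 20, 30, 40)
v[-c(1, 3)]
# [1] 20 40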
Instead of attaching, use with
with(Auto, mean((mpg-predict(lm.fit,Auto))[-train]^2))
#[1] 23.26601
The use of - on 'train' (see ?Extract) subsets the vector by removing the positions produced by sample(), before squaring and taking the mean.
with(Auto, (mpg-predict(lm.fit,Auto)))
If we don't use attach or with, the 'mpg' object is not found in the global environment, so it results in an error:
mean((mpg-predict(lm.fit,data=Auto))[-train]^2)
Error in mean((mpg - predict(lm.fit, data = Auto))[-train]^2) :
object 'mpg' not found
If the OP got a different value instead of an error, then mpg may have come from a different dataset.
Regarding the use of ~ in formulas, according to ?lm:
Models for lm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
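As a quick sanity check of that last equivalence, here is a small example with two made-up factors f and g showing that the expanded design matrices are identical:
f <- factor(c("a", "b", "a", "b"))
g <- factor(c("x", "x", "y", "y"))
all.equal(model.matrix(~ f * g), model.matrix(~ f + g + f:g))
# [1] TRUE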

What is the meaning of "~ -1 + ." in R?

I am trying to understand a R script and I came upon this line:
train <- cbind(train[,c(1,2)],model.matrix(~ -1 + .,train[,-c(1,2)]))
train is a data.frame. I think it is trying to combine the first two columns of train with all the other columns after they have been through some sort of matrix manipulation. However, I cannot understand exactly what the model formula(?) is doing. From the comment in the script, its purpose is to turn all the other columns into 0's and 1's, but I'm not sure how. If someone could clarify, that would be great. Thanks!
From ?formula:
The - operator removes the specified terms... [i]t can also be used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin.
Further:
There are two special interpretations of . in a formula. The usual one is in the context of a data argument of model fitting functions and means ‘all columns not otherwise in the formula’
So you have a formula specifying that the response is a function of all variables in train[,-c(1,2)], with no intercept. For model.matrix, suppressing the intercept means no reference level is dropped for the first factor, so every level of that factor gets its own 0/1 dummy column, which is where the 0's and 1's come from.
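A small illustration of what model.matrix does with -1 + ., using a made-up data frame df with one factor and one numeric column:
df <- data.frame(f = factor(c("a", "b", "c")), x = c(1.5, 2.0, 2.5))
model.matrix(~ -1 + ., df)
#   fa fb fc   x
# 1  1  0  0 1.5
# 2  0  1  0 2.0
# 3  0  0  1 2.5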

Multiple comparisions using glht with repeated measure anova

I'm using the following code to try to get at post-hoc comparisons for my cell means:
result.lme3<-lme(Response~Pressure*Treatment*Gender*Group, mydata, ~1|Subject/Pressure/Treatment)
aov.result<-aov(result.lme3, mydata)
TukeyHSD(aov.result, "Pressure:Treatment:Gender:Group")
This gives me a result, but most of the adjusted p-values are incredibly small - so I'm not convinced the result is correct.
Alternatively I'm trying this:
summary(glht(result.lme3, linfct=mcp(???? = "Tukey")))
I don't know how to get the Pressure:Treatment:Gender:Group in the glht code.
Help is appreciated - even if it is just a link to a question I didn't find previously.
I have 504 observations, Pressure has 4 levels and is repeated in each subject, Treatment has 2 levels and is repeated in each subject, Group has 3 levels, and Gender is obvious.
Thanks
I solved a similar problem by creating an interaction dummy variable with the interaction() function, which contains all combinations of the levels of your 4 variables.
In the many tests I made, the estimates shown for the various levels of this variable give the joint effect of the active levels plus the interaction effect.
For example, if:
temperature ~ interaction(infection(y/n), acetaminophen(y/n))
(I put the possible levels in parentheses for clarity), the interaction variable will have a level like "infection.y:acetaminophen.y" which shows the effect on temperature of infection, acetaminophen, and the interaction of the two, in comparison with the intercept (where both variables are n).
Instead, if the model was:
temperature ~ infection(y/n) * acetaminophen(y/n)
then to get the same coefficient for the case when both variables are y, you would have had to add the two simple effects plus the interaction effect. The result is the same, but I prefer using interaction() since it is cleaner and more elegant.
Then in glht you use:
summary(glht(model, linfct = mcp(interaction_var = 'Tukey')))
to achieve your post-hoc, where interaction_var <- interaction(infection, acetaminophen).
TO BE NOTED: I never tested this methodology with nested and mixed models, so beware!
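Putting the pieces together for the original model, a minimal sketch (untested with nested repeated measures, per the caveat above; the random-effects structure is simplified, and the variable name cells is made up):
library(nlme)
library(multcomp)
# Collapse the four factors into a single cell-means factor:
mydata$cells <- interaction(mydata$Pressure, mydata$Treatment, mydata$Gender, mydata$Group)
result.cells <- lme(Response ~ cells, data = mydata, random = ~1 | Subject)
# Tukey all-pairs comparisons of the cell means:
summary(glht(result.cells, linfct = mcp(cells = "Tukey")))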

Explanation of the formula object used in the coxph function in R

I am a complete novice when it comes to survival analysis. I am working on a project that requires I use the coxph function in the "survival" package, but I am running into trouble because I do not understand what is required by the formula object.
Most descriptions I can find about the function are as follows:
"a formula object, with the response on the left of a ~ operator, and the terms on the right. The response must be a survival object as returned by the Surv function. "
I know what needs to be on the left of the operator, the issue is what the function expects from the right-hand side.
Here is a link to what my data looks like (the actual data set is much larger; I'm only displaying the first 20 data points for brevity):
Short explanation of data:
- Row 1 is the header.
- Each row after that is a separate patient.
- The first column is the age of the patient at the time of the study.
- Columns 2 through 14 (headed by x2-x13), and 19 (x18) and 20 (x19), are covariates such as race, relationship status, and medical conditions that take on either true (1) or false (0) values.
- Columns 15 (x14) through 18 (x17) are covariates such as tumor size, which take on whole-number values greater than 0.
- The second-to-last column, "sur", is the number of months survived, and "index" is whether or not that is a right-censored time (1 for true, 0 for false).
Given this data I need to plot a Cox proportional hazards curve, but I end up with an incorrect plot because the right-hand side of the formula object is wrong.
Here is my code, "temp4" is the name I gave to the data table:
library("survival")
temp4 <- read.table("~/data.txt", header=TRUE)
seerCox <- coxph(Surv(sur, index)~ temp4$x1 + temp4$x2 + temp4$x3 + temp4$x4 + temp4$x5 + temp4$x6 + temp4$x7 + temp4$x8 + temp4$x9 + temp4$x10 + temp4$x11 + temp4$x12 + temp4$x13 + temp4$x14 + temp4$x15 + temp4$x16 + temp4$x17 + temp4$x18 + temp4$x19, data=temp4, singular.ok=TRUE)
plot(survfit(seerCox), main= "Cox Estimate", mark.time=FALSE, ylab="Probability", xlab="Survival Time in Months", col=c("blue", "red", "green"))
I should also note that I have tried replacing the right-hand side with the number 1, with a period, and leaving it blank. These methods produce a Kaplan-Meier curve.
The console output (not reproduced here) shows a different error depending on how I filter the data (e.g., if I only include patients with ages greater than 85, etc.).
If someone could explain how it works, it would be greatly appreciated.
PS: I have searched for over a week for a solution, and I am asking for help here as a last resort.
You should not be using the prefix temp4$ if you are also using a data argument. The whole purpose of supplying a data argument is to allow dropping those prefixes in the formula.
seerCox <- coxph( Surv(sur, index) ~ . , data=temp4, singular.ok=TRUE)
The above would use all of the x-variables in your temp4 data.frame. This will use just the first 3:
seerCox <- coxph( Surv(sur, index) ~ x1+x2+x3 , data=temp4)
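For reference, the same pattern on a fully self-contained example, using the lung dataset that ships with the survival package:
library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
# survfit() on a coxph fit predicts survival for reference covariate values
plot(survfit(fit), xlab = "Days", ylab = "Survival probability")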
Exactly what the warnings signify depends on the data (as you have, in one sense, already exemplified by producing different sorts of collinearity with different subsets). If you have collinear columns, you get singularities in the inversion of the model matrix, and the software will attempt to drop aliased columns with a warning. This is really telling you that you do not have enough data to build the large models you are attempting. Exploring that possibility with table() calls is often informative.
Bottom line: This is not a problem with your formula construction so much as a problem of not understanding the limitations of the chosen method with the dataset you have assembled. You need to be more careful about defining your goals. What is the highest priority in this research? Do you really need every variable? Is it possible to aggregate some of these anonymous variables into clinically meaningful categories, such as diagnostic categories or comorbidities?

Regression coefficients by group in dataframe R

I have data of various companies' financial information organized by company ticker. I'd like to regress one of the columns' values against the others while keeping the company constant. Is there an easy way to write this out in lm() notation?
I've tried using:
reg <- lmList(lead2.dDA ~ paudit1 + abs.d.GINDEX + logcapx + logmkvalt +
logmkvalt2|pp, data=reg.df)
where pp is a vector of company names, but this returns coefficients as though I regressed all the data at once (and did not separate by company name).
A convenient and apparently little-known syntax for estimating separate regression coefficients by group in lm() involves using the nesting operator, /. In this case it would look like:
reg <- lm(lead2.dDA ~ 0 + pp/(paudit1 + abs.d.GINDEX + logcapx +
logmkvalt + logmkvalt2), data=reg.df)
Make sure that pp is a factor and not a numeric. Also notice that the overall intercept must be suppressed for this to work; in the new formulation, we have a different "intercept" for each group.
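Here is a small self-contained sketch of the nesting trick on made-up data (hypothetical names d, g, x, y):
set.seed(1)
d <- data.frame(g = factor(rep(c("A", "B"), each = 50)), x = rnorm(100))
d$y <- ifelse(d$g == "A", 1 + 2 * d$x, -1 + 0.5 * d$x) + rnorm(100)
# 0 + g/x expands to g + g:x with no overall intercept:
coef(lm(y ~ 0 + g/x, data = d))
# gA and gB are the per-group intercepts; gA:x and gB:x the per-group slopes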
A couple of comments:
Although the regression coefficients obtained this way will match those given by lmList(), it should be noted that with lm() we estimate only a single residual variance across all the groups, whereas lmList() would estimate separate residual variances for each group.
Like I mentioned in my earlier comment, the lmList() syntax that you gave looks like it should have worked. Since you say it didn't, this leads me to suspect that the problem is really something else (although it's hard to tell what without a reproducible example), and so it seems likely that the solution I posted will fail for you as well, for the same unknown reasons. If you want more detailed guidance, please provide more information; help us help you.
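For comparison, the lmList() version on the same made-up data d from the sketch above (lmList() lives in the nlme package; lme4 provides a similar function):
library(nlme)
coef(lmList(y ~ x | g, data = d))
# one (Intercept) and x estimate per level of g, matching the lm() point estimates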
