Using broom::tidy on a felm result with clustered standard errors

I'm trying to extract point estimates and confidence intervals from a panel data model. The following reproduces the error using the canned example from the lfe documentation; the only small change I've made is to cluster standard errors at the firm level (in est2), which replicates my issue.
library(lfe)
## create covariates
x <- rnorm(1000)
x2 <- rnorm(length(x))
## individual and firm
id <- factor(sample(20,length(x),replace=TRUE))
firm <- factor(sample(13,length(x),replace=TRUE))
## effects for them
id.eff <- rnorm(nlevels(id))
firm.eff <- rnorm(nlevels(firm))
## left hand side
u <- rnorm(length(x))
y <- x + 0.5*x2 + id.eff[id] + firm.eff[firm] + u
## estimate and print result
est1 <- felm(y ~ x+x2| id + firm)
summary(est1)
## estimate and print result with clustered std errors
est2 <- felm(y ~ x+x2| id + firm | 0 | firm)
summary(est2)
I can tidy either fit without the fixed effects, and the non-clustered fit with them:
library(broom)
tidy(est1)
tidy(est2)
tidy(est1, fe = TRUE)
But I can't tidy the clustered fit if I ask for the fixed effects:
tidy(est2, fe = TRUE)
The error is this: Error in overscope_eval_next(overscope, expr) : object 'se' not found
I'm not sure if this is a broom side problem or an lfe side problem. It is possible I'm doing something wrong, but there should be point estimates and standard errors for the fixed effects whether or not I cluster the SEs. (And the fact that there are fewer clusters than FEs is probably an econometric issue, but it doesn't seem to be driving this particular problem.) Any suggestions?

The problem here is that lfe::getfe() is supposed to return the columns c('effect','se','obs','comp','fe','idx'), according to its help page. However, if you run
lfe::getfe(est1, se = TRUE)
lfe::getfe(est2, se = TRUE)
you'll see that in the second case the standard errors land in a column named clusterse instead of se.
The error message is a result of broom:::tidy.felm calling lfe::getfe() and then dplyr::select(se). I guess technically it's an lfe problem, but I'm not sure which package will be easier to amend.
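In the meantime, one workaround is to call lfe::getfe() yourself and rename the column before any downstream code that expects an se column (a sketch, relying on the column names described above):
# Fetch the fixed effects directly and rename the clustered-SE column
# to the documented name 'se'.
fe <- lfe::getfe(est2, se = TRUE)
names(fe)[names(fe) == "clusterse"] <- "se"
head(fe)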
Update: I emailed Simen Gaure (the package author) and he'll be releasing a fix to CRAN sometime this spring.

Related

Why do R and PROCESS give different results for a mediation model (one is significant, the other is not)?

As a newcomer just getting started in R, I am confused about the result of a mediation analysis.
My model is simple: IV 'T1Incivi', mediator 'T1Envied', DV 'T2PSRB'. I ran the same model in SPSS using PROCESS, where the indirect effect was not significant; in R, however, it is significant. Since I am not that familiar with R, could you please help me check whether there is anything wrong with my code, and tell me why the result is significant in R but not in SPSS? Thanks a bunch!
My code in R:
# X predicts M
apath <- lm(T1Envied ~ T1Incivi, data = dat)
summary(apath)
# X and M predict Y
bpath <- lm(T2PSRB ~ T1Envied + T1Incivi, data = dat)
summary(bpath)
# Bootstrapping for indirect effect
getindirect <- function(dataset, random){
  d <- dataset[random, ]
  apath <- lm(T1Envied ~ T1Incivi, data = d)
  bpath <- lm(T2PSRB ~ T1Envied + T1Incivi, data = dat)
  indirect <- apath$coefficients["T1Incivi"]*bpath$coefficients["T1Envied"]
  return(indirect)
}
library(boot)
set.seed(6452234)
Ind1 <- boot(data = dat,
             statistic = getindirect,
             R = 5000)
boot.ci(Ind1,
        conf = .95,
        type = "norm")  # PSRB as outcome
In your function getindirect, all of the regressions should be fit on the freshly resampled data in d. However, the line
bpath <- lm(T2PSRB ~ T1Envied + T1Incivi, data = dat)
refers to the full dataset dat, which should not be used inside this function at all; every bootstrap replicate therefore reuses the original b path instead of the resampled one. That alone can explain the inconsistent results.
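A corrected version of the function, with both regressions fit on the resampled data d:
getindirect <- function(dataset, random){
  d <- dataset[random, ]
  apath <- lm(T1Envied ~ T1Incivi, data = d)
  bpath <- lm(T2PSRB ~ T1Envied + T1Incivi, data = d)  # d, not dat
  indirect <- apath$coefficients["T1Incivi"]*bpath$coefficients["T1Envied"]
  return(indirect)
}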

Allowing for aliased coefficients when running `grangertest()` in R

I'm currently trying to run a Granger causality analysis in R/RStudio. I am receiving errors about aliased coefficients when using the function grangertest(). From my understanding, this occurs because there is perfect multicollinearity between the variables.
Because I have a very large number of pairwise comparisons (200+), I would like to simply run the Granger test with the aliased coefficients as per normal rather than getting an error. According to one answer here, the solution is (or was) to set singular.ok=TRUE, but either I am doing it incorrectly or the answer is out of date. I've tried checking the documentation, but have come up empty. Any help would be appreciated.
library(lmtest)
x <- c(0,1,2,3)
y <- c(0,3,6,9)
grangertest(x,y,1) # I want this to run successfully even if there are aliased coefficients.
grangertest(x,y,1, singular.ok=TRUE) # this also doesn't work
"Error in waldtest.lm(fm, 2, ...) :
there are aliased coefficients in the model"
Additionally, is there a way to flag that x and y are actually aliased variables? There seem to be some answers like here, but I'm having issues getting it to work properly.
alias(x ~ y)
Thanks in advance.
After some investigation and emailing the author of grangertest() (in the lmtest package), they sent me this solution. It should run on aliased variables when grangertest() does not; when the variables are not aliased, it should give the same values as the normal Granger test.
library(lmtest)
library(dynlm)
# Some data that is multicollinear
x <- c(0,1,2,3,4)
y <- c(0,3,6,9,12)
# Some data that is not multicollinear
# x <- c(0,125,200,230,777)
# y <- c(0,3,6,9,200)
# Convert to time series (this is an important step)
x <- ts(x)
y <- ts(y)
# This will run even when the data is multicollinear (but also when it is not),
# and is functionally the same as running the Granger test (which by default
# uses a Wald test)
m1 <- dynlm(x ~ L(x, 1:1) + L(y, 1:1))
m2 <- dynlm(x ~ L(x, 1:1))
result <- anova(m1, m2, test = "F")
# This will fail if the data is multicollinear or aliased, but should otherwise
# give the same results as the anova (F value, p value, etc.)
# grangertest(y, x, 1)
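Since the question mentions running 200+ pairwise comparisons, it may be convenient to wrap this model comparison in a helper. A minimal sketch under the same assumptions as the code above (granger_anova is a made-up name, not part of lmtest or dynlm):
# Hypothetical helper wrapping the dynlm + anova comparison above.
# Tests whether 'cause' Granger-causes 'effect' at the given lag order,
# tolerating aliased coefficients where grangertest() errors out.
granger_anova <- function(effect, cause, order = 1) {
  effect <- ts(effect)
  cause  <- ts(cause)
  m1 <- dynlm(effect ~ L(effect, 1:order) + L(cause, 1:order))
  m2 <- dynlm(effect ~ L(effect, 1:order))
  anova(m1, m2, test = "F")
}
granger_anova(x, y)  # same comparison as m1 vs m2 above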

Why do statsmodels and R disagree on AIC computation?

I have googled this and could not find a solution.
It seems R has an issue with AIC/BIC calculation. It produces incorrect results. A simple example is shown below:
link = 'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv'
df = read.csv(link, row.names = 'model')
form = 'mpg ~ disp + hp + wt + qsec + gear'
my_model = lm(form, data = df)
summary(my_model)
cat('AIC:',AIC(my_model),'\tBIC:',AIC(my_model, k = log(nrow(df))))
AIC: 157.4512 BIC: 167.7113
Doing exactly the same thing in Python, I obtain:
import pandas as pd
from statsmodels.formula.api import ols as lm
link = 'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv'
df = pd.read_csv(link, index_col='model')
form = 'mpg ~ disp + hp + wt + qsec + gear'
my_model = lm(form, df).fit()
my_model.summary()
print(f'AIC: {my_model.aic:.4f}\tBIC: {my_model.bic:.4f}')
AIC: 155.4512 BIC: 164.2456
You can check summary(my_model) in R and my_model.summary() in Python and you will notice that the two models are EXACTLY the same in everything, apart from the AIC and BIC.
I decided to compute it manually in R:
p = length(coef(my_model)) # number of predictors INCLUDING the intercept, i.e. 6
s = sqrt(sum(resid(my_model)^2)/nrow(df)) #sqrt(sigma(my_model)^2 * (nrow(df) - p)/nrow(df))
logl = -2* sum(dnorm(df$mpg, fitted(my_model),s, log = TRUE))
c(aic = logl + 2*p, bic = logl + log(nrow(df))*p)
aic bic
155.4512 164.2456
Which matches the results produced by python.
Digging deeper, I noticed that AIC does use the logLik function, and that is where the problem arises: logLik(my_model) gives exactly the same value as the logl above (before the multiplication by -2), but the df is given as 7 instead of 6.
If I brute-force the rank to make it 6, I get the correct results, i.e.:
my_model$rank = my_model$rank - 1
cat('AIC:',AIC(my_model),'\tBIC:',AIC(my_model, k = log(nrow(df))))
AIC: 155.4512 BIC: 164.2456
Why does R add 1 to the number of predictors? You can see the logLik function used by base R by typing stats:::logLik.lm in your R console and pressing Enter. These two lines seem to be where the issue arises:
function (object, REML = FALSE, ...)
{
...
p <- object$rank
...
attr(val, "df") <- p + 1 # This line here. Why does R ADD 1?
...
}
This is clearly a deliberate choice: R counts the scale parameter in the set of estimated parameters. From ?logLik.lm:
For ‘"lm"’ fits it is assumed that the scale has been estimated
(by maximum likelihood or REML)
(see also here, pointed out by @MrFlick in the comments). This kind of ambiguity (along with whether normalization constants are included in the log-likelihood: in R, they are) always has to be checked before comparing results across platforms, and sometimes even across procedures or functions within the same platform.
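A quick numerical check that this convention fully accounts for the discrepancy: counting sigma^2 as an additional estimated parameter reproduces R's AIC and BIC from the same log-likelihood.
# Counting sigma^2 as an estimated parameter reproduces R's numbers:
k  <- length(coef(my_model)) + 1    # 6 coefficients + 1 for sigma^2
ll <- as.numeric(logLik(my_model))  # same log-likelihood Python computes
c(aic = -2*ll + 2*k, bic = -2*ll + log(nrow(df))*k)
#      aic      bic
# 157.4512 167.7113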
For what it's worth, there also seems to be lots of discussion of this from the statsmodels side, e.g. this (closed) issue about why AIC/BIC are inconsistent between R and statsmodels ...
This commit in March 2002 shows Martin Maechler changing the "df" (degrees of freedom/number of model parameters) attribute back to object$rank+1 with the following additional annotations:
The help page ?logLik.lm gains:
Note that error variance \eqn{\sigma^2} is estimated in \code{lm()} and hence
counted as well.
(this message was obviously edited at some later point to the version seen above).
The NEWS file gains (under "BUG FIXES"):
o logLik.lm() now uses "df = p + 1" again (`+ sigma'!).
It was hard for me to do the archaeology back further than this (presumably, based on the messages here, the p+1 reckoning was used originally, someone then changed it to p, and MM changed it back in 2002), because functions moved around (this file was created in 2001, so earlier versions are harder to find). I didn't find any discussion of this in the r-devel mailing list archives for February or March 2002 ...

How to cluster standard errors with small sample corrections in R

I have the following code:
library(lmtest)
library(sandwich)
library(plm)
library(multiwayvcov)
reg <- lm(Y ~ x1 + x1_sq + x2 + x2_sq + x1x2 + d1 + d2 + d3 + d4, df)
coeftest(reg, vcov = vcovHC(reg, type = "HC1"))
coeftest(reg, vcov = vcovHC(reg, type="sss", cluster="study"))
I want to compare the regression with typical heteroskedasticity-robust standard errors against clustering the standard errors at the study level with a small-sample correction. The regression and the first coeftest work, but the second spits out this error message:
Error in match.arg(type) : 'arg' should be one of “HC3”, “const”, “HC”, “HC0”, “HC1”, “HC2”, “HC4”, “HC4m”, “HC5”
I found code online that uses type="sss" as a small-sample correction, but it doesn't seem to work here. Is there something I am doing wrong, or was the code updated so that one of the types listed in the error message now covers the heteroskedasticity-adjusted covariance matrix? Clearly I cannot use type="sss", but I don't know how else to incorporate the small-sample correction.
Using vcovHC(..., type="sss", cluster="study") is a dated way to incorporate small-sample corrections; that syntax is no longer accepted, which is why the valid types listed in the error message do not include it. For heteroskedasticity-robust standard errors, the first line,
coeftest(reg, vcov = vcovHC(reg, type = "HC1"))
is appropriate, with the desired sandwich estimator (HC0-HC4) given in the type argument. The issue was only with the dated syntax of the second line.
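For the clustered part, a sketch of one current approach using sandwich::vcovCL, assuming df contains a study column identifying the clusters:
library(sandwich)
library(lmtest)
# Cluster-robust covariance at the study level: type = "HC1" gives the
# usual small-sample scaling, and cadjust = TRUE adds the G/(G-1)
# cluster adjustment (both are documented sandwich arguments).
coeftest(reg, vcov = vcovCL(reg, cluster = ~ study, type = "HC1", cadjust = TRUE))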

Partial residual plots for linear model including an interaction term

My model includes one response variable, five predictors and one interaction term for predictor_1 and predictor_2. I would like to plot partial residual plots for every predictor variable which I would normally realize using the crPlots function from the package car. Unfortunately the function complains that it doesn't work with models that include interaction terms.
Is there another way of doing what I want?
EDIT: I created a small example illustrating the problem
require(car)
R <- c(0.53,0.60,0.64,0.52,0.75,0.66,0.71,0.49,0.52,0.59)
P1 <- c(3.1,1.8,1.8,1.8,1.8,3.2,3.2,2.8,3.1,3.3)
P2 <- c(2.1,0.8,0.3,0.5,0.4,1.3,0.5,1.2,1.6,2.1)
lm.fit1 <- lm(R ~ P1 + P2)
summary(lm.fit1)
crPlots(lm.fit1) # works fine
lm.fit2 <- lm(R ~ P1*P2)
summary(lm.fit2)
crPlots(lm.fit2) # not available
Another way to do this is to put the interaction term in as a separate variable (which avoids hacking the code for crPlot(...)).
df <- data.frame(R,P1,P2,P1.P2=P1*P2)
lm.fit1 <- lm(R ~ ., df)
summary(lm.fit1)
crPlots(lm.fit1)
Note that summary(lm.fit1) yields exactly the same result as summary(lm(R ~ P1*P2, df)).
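A quick check of that equivalence (the interaction coefficient is simply named P1.P2 instead of P1:P2, so compare the unnamed values):
# Both parameterizations should produce identical coefficients:
all.equal(unname(coef(lm.fit1)), unname(coef(lm(R ~ P1*P2, df))))
# expected: TRUE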
I must admit I'm not that familiar with partial residual plots, so I'm not entirely sure what their proper interpretation should be given an interaction term. But basically, the equivalent of
crPlot(lm.fit1, "P1")
is
x <- predict(lm.fit1, type = "terms", terms = "P1")
y <- residuals(lm.fit1, type = "partial")[, "P1"]
plot(x, y)
abline(lm(y ~ x), col = "red", lty = 2)
loessLine(x, y, col = "green3", log.x = FALSE, log.y = FALSE, smoother.args = list())
so really, there's no real reason the same idea couldn't work with an interaction term as well. We just leave the partial contribution from a variable due to the interaction as a separate entity and focus on the non-interaction contribution. So what I'm going to do is just take out the check for the interaction term, after which we can use the function. The check sits at position 11 of the function body:
body(car:::crPlot.lm)[[11]]
# if (any(attr(terms(model), "order") > 1)) {
# stop("C+R plots not available for models with interactions.")
# }
we can copy and modify the function to create a new one without the check:
crPlot2 <- car:::crPlot.lm
body(crPlot2) <- body(crPlot2)[-11]
environment(crPlot2) <- asNamespace("car")
And then we can run
layout(matrix(1:2, ncol=2))
crPlot2(lm.fit2, "P1")
crPlot2(lm.fit2, "P2")
to get the partial residual plots for P1 and P2.
I'm sure the authors had a good reason for not incorporating models with interaction terms, so use this hack at your own risk. It's just unclear to me what should happen to the residual from the interaction term when making the plot.
