Different R-squared by plm within regression and fixest? - r

I am running a FE regression of firm characteristics on the dependant variable effective tax rates.
I tried both plm package and fixest package. I understand the differences in the standard errors (and I correct them with coeftest for the plm regression, not shown here), however I do not understand the difference in adjusted R-squared between fixest and plm.
Coefficients are the same in both models, so adjusted R-squared should be the same, right?
> fe <- feols(GETR ~ SIZE + LEV + CAPINT + INVINT + ROA + LLEV + CF + EK| id + year,data = panel52,cluster = ~ id+year)
> summary(fe)
OLS estimation, Dep. Var.: GETR
Observations: 19,240
Fixed-effects: id: 1,924, year: 10
Standard-errors: Clustered (id & year)
Estimate Std. Error t value Pr(>|t|)
SIZE 0.031979 0.010624 3.010150 0.0147123 *
LEV -0.021880 0.033039 -0.662243 0.5244090
CAPINT 0.098979 0.027374 3.615754 0.0056088 **
INVINT 0.045080 0.039294 1.147250 0.2808605
ROA 0.222094 0.089892 2.470664 0.0355315 *
LLEV 0.015973 0.025740 0.620558 0.5502796
CF -0.237174 0.098485 -2.408230 0.0393631 *
EK 0.027064 0.063651 0.425196 0.6806793
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.160357 Adj. R2: 0.212174
Within R2: 0.004764
> summary(fe52)
Twoways effects Within Model
Call:
plm(formula = GETR ~ SIZE + LEV + CAPINT + INVINT + ROA + LLEV +
CF + EK, data = panel52, na.action = na.exclude, effect = "twoways",
model = "within")
Balanced Panel: n = 1924, T = 10, N = 19240
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.7032714 -0.0635238 -0.0079128 0.0376269 0.9293129
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
SIZE 0.0319790 0.0065342 4.8941 9.967e-07 ***
LEV -0.0218800 0.0356996 -0.6129 0.5400
CAPINT 0.0989786 0.0222388 4.4507 8.612e-06 ***
INVINT 0.0450804 0.0366761 1.2292 0.2190
ROA 0.2220941 0.0389295 5.7050 1.182e-08 ***
LLEV 0.0159730 0.0180534 0.8848 0.3763
CF -0.2371736 0.0425711 -5.5712 2.567e-08 ***
EK 0.0270641 0.0380943 0.7104 0.4774
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 497.11
Residual Sum of Squares: 494.74
R-Squared: 0.004764
Adj. R-Squared: -0.10685
F-statistic: 10.3509 on 8 and 17299 DF, p-value: 1.4467e-14```

Related

model selection using StepAIC; how can I see other models besides my final model?

I am conducting model selection on my dataset using the package MASS and the function stepAIC. This is the current code I am using:
mod <- lm(Distance~DiffAge + DiffR + DiffSize + DiffRep + DiffSeason +
Diff.Bkp + Diff.Fzp + Diff.AO + Diff.Aow +
Diff.Lag.NAOw + Diff.Lag.NAO + Diff.Lag.AO + Diff.Lag.Aow, data=data,
na.action="na.exclude")
library(MASS)
step.model<-stepAIC(mod, direction = "both",
trace = FALSE)
summary(step.model)
this gives me the following output:
Call:
lm(formula = Distance ~ Diff.Lag.NAOw + Diff.Lag.AO + DiffSeason,
data = data, na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-146.984 -48.397 -9.533 42.169 194.950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.944 20.247 3.850 0.000184 ***
Diff.Lag.NAOw 11.868 6.261 1.896 0.060209 .
Diff.Lag.AO 24.696 17.475 1.413 0.159947
DiffSeasonEW-LW 41.891 18.607 2.251 0.026014 *
DiffSeasonLW-LW 22.863 20.791 1.100 0.273465
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 67.2 on 132 degrees of freedom
Multiple R-squared: 0.06031, Adjusted R-squared: 0.03183
F-statistic: 2.118 on 4 and 132 DF, p-value: 0.08209
If I am reading this right, the output only shows me the top model (Let me know if this is incorrect!). I would like to see the other, lower-ranked models as well, with their accompanying AIC scores.
Any suggestions on how I can achieve this? Should I modify my code in any way?

How to right two lags periods?

I have a question about R; I would like to use two lags periods instead of one (please check my below code) in my model but I don't know how to write it in R. Can someone help please?
Here below are details of my R code:
library(plm)
fixed = plm(sp ~lag(debt)+lag(I(debt^2))+outgp+gvex+vlimp+vlexp+bcour+infcpi, data=pdata, index=c("country", "year"), model="within")
The lags must be on the variable debt.
This should give 2 lags on the debt variable.
library(plm)
fixed = plm(sp ~lag(debt, k=1:2)+lag(I(debt^2))+outgp+gvex+vlimp+vlexp+bcour+infcpi, data=pdata, index=c("country", "year"), model="within")
For example:
data("Grunfeld", package = "plm")
lags2mod <- plm(inv ~ lag(value, k=1:2) + capital, data = Grunfeld, model = "within")
summary(lags2mod)
Oneway (individual) effect Within Model
Call:
plm(formula = inv ~ lag(value, k = 1:2) + capital, data = Grunfeld,
model = "within")
Balanced Panel: n = 10, T = 18, N = 180
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-272.21434 -19.24168 0.42825 18.09930 260.85548
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
lag(value, k = 1:2)1 0.078234 0.015438 5.0677 1.059e-06 ***
lag(value, k = 1:2)2 -0.018754 0.016078 -1.1664 0.2451
capital 0.352658 0.021003 16.7910 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 2034500
Residual Sum of Squares: 617850
R-Squared: 0.69631
Adj. R-Squared: 0.67449
F-statistic: 127.633 on 3 and 167 DF, p-value: < 2.22e-16

How to do a t.test on a linear model for a given value of beta1?

data("hprice2")
reg1 <- lm(price ~ rooms + crime + nox, hprice2)
summary(reg1)
Call:
lm(formula = price ~ rooms + crime + nox, data = hprice2)
Residuals:
Min 1Q Median 3Q Max
-18311 -3218 -772 2418 39164
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19371.47 3250.94 -5.959 4.79e-09 ***
rooms 7933.18 407.87 19.450 < 2e-16 ***
crime -199.70 35.05 -5.697 2.08e-08 ***
nox -1306.06 266.14 -4.907 1.25e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6103 on 502 degrees of freedom
Multiple R-squared: 0.5634, Adjusted R-squared: 0.5608
F-statistic: 215.9 on 3 and 502 DF, p-value: < 2.2e-16
Question 1.
Run two alternative (two-sided) t-tests for: H0: B1 = 8000
predict(reg1, data.frame(rooms=8000, crime = -199.70, nox = -1306.06), interval = .99)
Report your t-statistic and whether you reject or fail to reject the null at 90, 95, and/or 99 percent confidence levels.
I suppose by beta1 you mean rooms in this case. Your t.test in the summary is tested against beta0 = 0, you can see from wiki:
so using the example of nox:
tstat = (-1306.06 - 0)/266.14
[2] -4.907417
And p.value is
2*pt(-abs(tstat),502)
[2] 1.251945e-06
the null hypothesis in your case will be 8000 and you test rooms = 8000:
tstat = (7933.18 - 8000)/407.87
2*pt(-abs(tstat),502)
You can also use linearHypothesis from cars to do the above:
library(car)
linearHypothesis(reg1, c("rooms = 8000"))

emmip (emmeans) with quadratic term

Is it possible to plot with emmip the marginal (log odds) means from a geeglm model when you have a quadratic term? I have repeated measures data and the model fits better with a treatment x time squared term in addition to an interaction term with linear time.
I just want to be able to visualise the predicted curve in the data. If it's possible I don't know how to specify it. I've tried:
mod3 <- geeglm(outcome ~ treatment*time + treatment*time_sq, data = dat, id = id, family = "binomial", corstr = "exchangeable"))
mod3a.rg <- ref_grid(mod3, at = list(time = c(1,2,3,4,5,6), time_sq = c(1,4,9,16,25,36)))
emmip(mod3a.rg, treatment ~ time)
I don't think your mod3 is including your quadratic term correctly (hard to tell since you did not include reproducible code). This will let you include your squared term for time correctly:
mod3 <- geeglm(outcome ~ treatment*time + treatment*I(time^2), data =
dat, id = id, family = "binomial", corstr = "exchangeable"))
The add plotit = TRUE to your call to emmip():
emmip(mod3a.rg, treatment ~ time, plotit = TRUE)
Here's a simple reproducible example with the savings dataset in the MASS, faraway package for comparison
library(MASS)
data(savings, package="faraway")
#fit model with polynomial term
mod <- lm(sr ~ ddpi+I(ddpi^2))
summary(mod)
The summary produces this output, note the additonal coefficient for your quadratic term
> Call: lm(formula = sr ~ ddpi + I(ddpi^2), data = savings)
>
> Residuals:
> Min 1Q Median 3Q Max
> -8.5601 -2.5612 0.5546 2.5735 7.8080
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
>(Intercept) 5.13038 1.43472 3.576 0.000821 ***
>ddpi 1.75752 0.53772 3.268 0.002026 **
>I(ddpi^2) -0.09299 0.03612 -2.574 0.013262 *
> --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 4.079 on 47 degrees of freedom Multiple
> R-squared: 0.205, Adjusted R-squared: 0.1711 F-statistic: 6.059 on
> 2 and 47 DF, p-value: 0.004559
If you don't enclose the quadratic term with I() your summary will only include the term for ddpi.
mod2 <- lm(sr ~ ddpi+ddpi^2)
summary(mod2)
produces the following summary with a coefficient only for ddpi
> lm(formula = sr ~ ddpi + ddpi^2, data = savings)
>
> Residuals:
> Min 1Q Median 3Q Max
> -8.5535 -3.7349 0.9835 2.7720 9.3104
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
>(Intercept) 7.8830 1.0110 7.797 4.46e-10 ***
>ddpi 0.4758 0.2146 2.217 0.0314 *
> --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 4.311 on 48 degrees of freedom Multiple
> R-squared: 0.0929, Adjusted R-squared: 0.074 F-statistic: 4.916 on
> 1 and 48 DF, p-value: 0.03139

R plm lag - what is the equivalent to L1.x in Stata?

Using the plm package in R to fit a fixed-effects model, what is the correct syntax to add a lagged variable to the model? Similar to the 'L1.variable' command in Stata.
Here is my attempt adding a lagged variable (this is a test model and it might not make sense):
library(foreign)
nlswork <- read.dta("http://www.stata-press.com/data/r11/nlswork.dta")
pnlswork <- plm.data(nlswork, c('idcode', 'year'))
ffe <- plm(ln_wage ~ ttl_exp+lag(wks_work,1)
, model = 'within'
, data = nlswork)
summary(ffe)
R output:
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + lag(wks_work), data = nlswork,
model = "within")
Unbalanced Panel: n=3911, T=1-14, N=19619
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.77000 -0.10100 0.00293 0.11000 2.90000
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.02341057 0.00073832 31.7078 < 2.2e-16 ***
lag(wks_work) 0.00081576 0.00010628 7.6755 1.744e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 1296.9
Residual Sum of Squares: 1126.9
R-Squared: 0.13105
Adj. R-Squared: -0.085379
F-statistic: 1184.39 on 2 and 15706 DF, p-value: < 2.22e-16
However, I got different results compared what Stata produces.
In my actual model, I would like to instrument an endogenous variable with its lagged value.
Thanks!
For reference, here is the Stata code:
webuse nlswork.dta
xtset idcode year
xtreg ln_wage ttl_exp L1.wks_work, fe
Stata output:
Fixed-effects (within) regression Number of obs = 10,680
Group variable: idcode Number of groups = 3,671
R-sq: Obs per group:
within = 0.1492 min = 1
between = 0.2063 avg = 2.9
overall = 0.1483 max = 8
F(2,7007) = 614.60
corr(u_i, Xb) = 0.1329 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ttl_exp | .0192578 .0012233 15.74 0.000 .0168597 .0216558
|
wks_work |
L1. | .0015891 .0001957 8.12 0.000 .0012054 .0019728
|
_cons | 1.502879 .0075431 199.24 0.000 1.488092 1.517666
-------------+----------------------------------------------------------------
sigma_u | .40678942
sigma_e | .28124886
rho | .67658275 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(3670, 7007) = 4.71 Prob > F = 0.0000
lag() as it is in plm lags the observations row-wise without "looking" at the time variable, i.e. it shifts the variable (per individual). If there are gaps in the time dimension, you probably want to take the value of the time variable into account. There is the (as of now) unexported function plm:::lagt.pseries which takes the time variable into account and hence handles gaps in data as you might expect.
Edit: Since plm version 1.7-0, default behaviour of lag in plm is to shift time-wise but one can control behaviour by argument shift(shift = c("time", "row")) to shift either time-wise or row-wise (old behaviour).
Use it as follows:
library(plm)
library(foreign)
nlswork <- read.dta("http://www.stata-press.com/data/r11/nlswork.dta")
pnlswork <- pdata.frame(nlswork, c('idcode', 'year'))
ffe <- plm(ln_wage ~ ttl_exp + plm:::lagt.pseries(wks_work,1)
, model = 'within'
, data = pnlswork)
summary(ffe)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + plm:::lagt.pseries(wks_work,
1), data = nlswork, model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
plm:::lagt.pseries(wks_work, 1) 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16
Btw1: Better use pdata.frame() instead of plm.data().
Btw2: You can check for gaps in your data with plm's is.pconsecutive():
is.pconsecutive(pnlswork)
all(is.pconsecutive(pnlswork))
You can also make the data consecutive first and then use lag(), like this:
pnlswork2 <- make.pconsecutive(pnlswork)
pnlswork2$wks_work_lag <- lag(pnlswork2$wks_work)
ffe2 <- plm(ln_wage ~ ttl_exp + wks_work_lag
, model = 'within'
, data = pnlswork2)
summary(ffe2)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + wks_work_lag, data = pnlswork2,
model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
wks_work_lag 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16
Or simply:
ffe3 <- plm(ln_wage ~ ttl_exp + lag(wks_work)
, model = 'within'
, data = pnlswork2) # note: it is the consecutive panel data set here
summary(ffe3)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + lag(wks_work), data = pnlswork2,
model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
lag(wks_work) 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16

Resources