Using the plm package in R to fit a fixed-effects model, what is the correct syntax to add a lagged variable to the model? I am looking for something similar to the 'L1.variable' operator in Stata.
Here is my attempt adding a lagged variable (this is a test model and it might not make sense):
library(foreign)
nlswork <- read.dta("http://www.stata-press.com/data/r11/nlswork.dta")
pnlswork <- plm.data(nlswork, c('idcode', 'year'))
ffe <- plm(ln_wage ~ ttl_exp+lag(wks_work,1)
, model = 'within'
, data = nlswork)
summary(ffe)
R output:
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + lag(wks_work), data = nlswork,
model = "within")
Unbalanced Panel: n=3911, T=1-14, N=19619
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.77000 -0.10100 0.00293 0.11000 2.90000
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.02341057 0.00073832 31.7078 < 2.2e-16 ***
lag(wks_work) 0.00081576 0.00010628 7.6755 1.744e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 1296.9
Residual Sum of Squares: 1126.9
R-Squared: 0.13105
Adj. R-Squared: -0.085379
F-statistic: 1184.39 on 2 and 15706 DF, p-value: < 2.22e-16
However, I got different results compared to what Stata produces.
In my actual model, I would like to instrument an endogenous variable with its lagged value.
Thanks!
For reference, here is the Stata code:
webuse nlswork.dta
xtset idcode year
xtreg ln_wage ttl_exp L1.wks_work, fe
Stata output:
Fixed-effects (within) regression Number of obs = 10,680
Group variable: idcode Number of groups = 3,671
R-sq: Obs per group:
within = 0.1492 min = 1
between = 0.2063 avg = 2.9
overall = 0.1483 max = 8
F(2,7007) = 614.60
corr(u_i, Xb) = 0.1329 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ttl_exp | .0192578 .0012233 15.74 0.000 .0168597 .0216558
|
wks_work |
L1. | .0015891 .0001957 8.12 0.000 .0012054 .0019728
|
_cons | 1.502879 .0075431 199.24 0.000 1.488092 1.517666
-------------+----------------------------------------------------------------
sigma_u | .40678942
sigma_e | .28124886
rho | .67658275 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(3670, 7007) = 4.71 Prob > F = 0.0000
lag() as it is in plm lags the observations row-wise without "looking" at the time variable, i.e. it simply shifts the variable by one row within each individual. If there are gaps in the time dimension, you probably want to take the value of the time variable into account. There is the (as of now) unexported function plm:::lagt.pseries, which takes the time variable into account and hence handles gaps in the data as you might expect.
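To make the difference concrete, here is a minimal base-R sketch on a made-up panel with a gap in the years. The data frame toy and the names lag_row and lag_time are purely illustrative and not part of plm; the code only mimics the two kinds of shifting.

```r
# Toy panel: individual 1 has no observation for 2002 (a gap)
toy <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  year = c(2000, 2001, 2003, 2000, 2001),
  x    = c(10, 20, 30, 5, 6)
)

# Row-wise lag: shift by one row within each individual, ignoring the year
lag_row <- ave(toy$x, toy$id, FUN = function(v) c(NA, head(v, -1)))

# Time-wise lag: look up the value of the same individual in year - 1
key      <- paste(toy$id, toy$year)
key_prev <- paste(toy$id, toy$year - 1)
lag_time <- toy$x[match(key_prev, key)]

cbind(toy, lag_row, lag_time)
```

For the 2003 row of individual 1, the row-wise lag picks up the 2001 value, whereas the time-wise lag is NA because 2002 is not observed. That NA drops the row from the estimation sample, which would explain the smaller number of observations in the Stata output.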
Edit: Since plm version 1.7-0, the default behaviour of lag() in plm is to shift time-wise, but one can control the behaviour with the argument shift (shift = c("time", "row")) to shift either time-wise or row-wise (the old behaviour).
Use it as follows:
library(plm)
library(foreign)
nlswork <- read.dta("http://www.stata-press.com/data/r11/nlswork.dta")
pnlswork <- pdata.frame(nlswork, c('idcode', 'year'))
ffe <- plm(ln_wage ~ ttl_exp + plm:::lagt.pseries(wks_work,1)
, model = 'within'
, data = pnlswork)
summary(ffe)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + plm:::lagt.pseries(wks_work,
1), data = nlswork, model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
plm:::lagt.pseries(wks_work, 1) 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16
Btw1: Better use pdata.frame() instead of plm.data().
Btw2: You can check for gaps in your data with plm's is.pconsecutive():
is.pconsecutive(pnlswork)
all(is.pconsecutive(pnlswork))
You can also make the data consecutive first and then use lag(), like this:
pnlswork2 <- make.pconsecutive(pnlswork)
pnlswork2$wks_work_lag <- lag(pnlswork2$wks_work)
ffe2 <- plm(ln_wage ~ ttl_exp + wks_work_lag
, model = 'within'
, data = pnlswork2)
summary(ffe2)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + wks_work_lag, data = pnlswork2,
model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
wks_work_lag 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16
Or simply:
ffe3 <- plm(ln_wage ~ ttl_exp + lag(wks_work)
, model = 'within'
, data = pnlswork2) # note: it is the consecutive panel data set here
summary(ffe3)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + lag(wks_work), data = pnlswork2,
model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
lag(wks_work) 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16
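For plm 1.7-0 or later, the shift argument mentioned in the edit above can be tried on a small made-up panel. This is only a sketch: the data frame toy is invented for illustration, and it assumes a plm version whose lag() method for pseries accepts shift.

```r
library(plm)

# Made-up panel with a gap: individual 1 is not observed in 2002
toy <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  year = c(2000, 2001, 2003, 2000, 2001),
  x    = c(10, 20, 30, 5, 6)
)
ptoy <- pdata.frame(toy, index = c("id", "year"))

# Time-wise shifting respects the gap; row-wise shifting ignores it
lag_time <- lag(ptoy$x, k = 1, shift = "time")
lag_row  <- lag(ptoy$x, k = 1, shift = "row")

data.frame(
  id       = toy$id,
  year     = toy$year,
  x        = toy$x,
  lag_time = as.numeric(lag_time),
  lag_row  = as.numeric(lag_row)
)
```

Comparing the two columns for the 2003 row shows whether the installed version lags across the gap or not.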
For the Boston dataset, I want to perform polynomial regression with degrees 5, 4, 3 and 2. I want to use a loop, but I get this error:
Error in [.data.frame(data, 0, cols, drop = FALSE) :
undefined columns selected
library(caret)
train_control <- trainControl(method = "cv", number=10)
#set.seed(5)
cv <-rep(NA,4)
n=c(5,4,3,2)
for (i in n) {
cv[i]=train(nox ~ poly(dis,degree=i ), data = Boston, trncontrol = train_control, method = "lm")
}
Outside the loop, train(nox ~ poly(dis,degree=i ), data = Boston, trncontrol = train_control, method = "lm") works well.
Since you are using poly(..., raw = FALSE), which is the default, you are getting orthogonal polynomials. Hence there is no need for a for-loop: fit the maximum degree, since the estimates of the lower-order coefficients do not change when higher-order terms are added. (The standard errors do change slightly, because the residual variance changes.)
Check the quick example below using lm and the iris dataset:
summary(lm(Sepal.Length~poly(Sepal.Width, 2), iris))
Call:
lm(formula = Sepal.Length ~ poly(Sepal.Width, 2), data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.63153 -0.62177 -0.08282 0.50531 2.33336
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84333 0.06692 87.316 <2e-16 ***
poly(Sepal.Width, 2)1 -1.18838 0.81962 -1.450 0.1492
poly(Sepal.Width, 2)2 -1.41578 0.81962 -1.727 0.0862 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8196 on 147 degrees of freedom
Multiple R-squared: 0.03344, Adjusted R-squared: 0.02029
F-statistic: 2.543 on 2 and 147 DF, p-value: 0.08209
summary(lm(Sepal.Length~poly(Sepal.Width, 3), iris))
Call:
lm(formula = Sepal.Length ~ poly(Sepal.Width, 3), data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6876 -0.5001 -0.0876 0.5493 2.4600
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84333 0.06588 88.696 <2e-16 ***
poly(Sepal.Width, 3)1 -1.18838 0.80687 -1.473 0.1430
poly(Sepal.Width, 3)2 -1.41578 0.80687 -1.755 0.0814 .
poly(Sepal.Width, 3)3 1.92349 0.80687 2.384 0.0184 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8069 on 146 degrees of freedom
Multiple R-squared: 0.06965, Adjusted R-squared: 0.05054
F-statistic: 3.644 on 3 and 146 DF, p-value: 0.01425
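Take a look at the two coefficient tables. The estimates for the intercept and the first two polynomial terms are identical; only poly(Sepal.Width, 3)3 was added when a degree of 3 was used. The standard errors, t values and p-values differ slightly because the residual variance changes. This means that from a degree-3 fit we can easily tell what the degree-2 estimates will look like. Hence there is no need for a for-loop.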
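The same comparison can be done programmatically. The following is a small sketch using only base R and iris; the object names are illustrative. It also fits the raw-polynomial versions for contrast, where the lower-order estimates are not expected to carry over.

```r
# Orthogonal polynomials (the default, raw = FALSE)
fit2 <- lm(Sepal.Length ~ poly(Sepal.Width, 2), data = iris)
fit3 <- lm(Sepal.Length ~ poly(Sepal.Width, 3), data = iris)

# Intercept and the first two terms of the degree-3 fit vs. the degree-2 fit
same_orthogonal <- isTRUE(all.equal(unname(coef(fit3)[1:3]),
                                    unname(coef(fit2))))

# Raw polynomials: lower-order estimates change when the cubic term is added
raw2 <- lm(Sepal.Length ~ poly(Sepal.Width, 2, raw = TRUE), data = iris)
raw3 <- lm(Sepal.Length ~ poly(Sepal.Width, 3, raw = TRUE), data = iris)
same_raw <- isTRUE(all.equal(unname(coef(raw3)[1:3]),
                             unname(coef(raw2))))

c(orthogonal = same_orthogonal, raw = same_raw)
```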
Note that you could use several variables in poly, e.g. poly(cbind(Sepal.Width, Petal.Length, Petal.Width), 4), and still be able to easily recover poly(Sepal.Width, 2).
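If the goal of the loop was to compare cross-validated prediction error across degrees, one fit per degree is still needed. The sketch below does a manual 10-fold cross-validation with lm instead of caret, so it sidesteps the train() call from the question rather than diagnosing it. It assumes Boston comes from the MASS package; fold, degrees and cv_rmse are illustrative names. The degree is pasted into the formula text so that no loop variable has to be looked up later.

```r
library(MASS)  # assumed source of the Boston data set

set.seed(5)
k <- 10
# Randomly assign each row to one of k folds
fold <- sample(rep(seq_len(k), length.out = nrow(Boston)))
degrees <- c(5, 4, 3, 2)

cv_rmse <- sapply(degrees, function(d) {
  # Build the formula with the degree written out, e.g. nox ~ poly(dis, 3)
  form <- as.formula(sprintf("nox ~ poly(dis, %d)", d))
  fold_mse <- sapply(seq_len(k), function(f) {
    fit  <- lm(form, data = Boston[fold != f, ])
    pred <- predict(fit, newdata = Boston[fold == f, ])
    mean((Boston$nox[fold == f] - pred)^2)
  })
  sqrt(mean(fold_mse))
})
names(cv_rmse) <- paste0("degree_", degrees)
cv_rmse
```

Storing the results in a named numeric vector (or a list, if whole model objects are wanted) avoids assigning a fitted object into a single element of an atomic vector, as cv[i] = train(...) tries to do.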
I am running a FE regression of effective tax rates (the dependent variable) on firm characteristics.
I tried both the plm package and the fixest package. I understand the differences in the standard errors (and I correct them with coeftest for the plm regression, not shown here); however, I do not understand the difference in adjusted R-squared between fixest and plm.
Coefficients are the same in both models, so adjusted R-squared should be the same, right?
> fe <- feols(GETR ~ SIZE + LEV + CAPINT + INVINT + ROA + LLEV + CF + EK| id + year,data = panel52,cluster = ~ id+year)
> summary(fe)
OLS estimation, Dep. Var.: GETR
Observations: 19,240
Fixed-effects: id: 1,924, year: 10
Standard-errors: Clustered (id & year)
Estimate Std. Error t value Pr(>|t|)
SIZE 0.031979 0.010624 3.010150 0.0147123 *
LEV -0.021880 0.033039 -0.662243 0.5244090
CAPINT 0.098979 0.027374 3.615754 0.0056088 **
INVINT 0.045080 0.039294 1.147250 0.2808605
ROA 0.222094 0.089892 2.470664 0.0355315 *
LLEV 0.015973 0.025740 0.620558 0.5502796
CF -0.237174 0.098485 -2.408230 0.0393631 *
EK 0.027064 0.063651 0.425196 0.6806793
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.160357 Adj. R2: 0.212174
Within R2: 0.004764
> summary(fe52)
Twoways effects Within Model
Call:
plm(formula = GETR ~ SIZE + LEV + CAPINT + INVINT + ROA + LLEV +
CF + EK, data = panel52, na.action = na.exclude, effect = "twoways",
model = "within")
Balanced Panel: n = 1924, T = 10, N = 19240
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.7032714 -0.0635238 -0.0079128 0.0376269 0.9293129
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
SIZE 0.0319790 0.0065342 4.8941 9.967e-07 ***
LEV -0.0218800 0.0356996 -0.6129 0.5400
CAPINT 0.0989786 0.0222388 4.4507 8.612e-06 ***
INVINT 0.0450804 0.0366761 1.2292 0.2190
ROA 0.2220941 0.0389295 5.7050 1.182e-08 ***
LLEV 0.0159730 0.0180534 0.8848 0.3763
CF -0.2371736 0.0425711 -5.5712 2.567e-08 ***
EK 0.0270641 0.0380943 0.7104 0.4774
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 497.11
Residual Sum of Squares: 494.74
R-Squared: 0.004764
Adj. R-Squared: -0.10685
F-statistic: 10.3509 on 8 and 17299 DF, p-value: 1.4467e-14
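One possible reading, going only by the numbers printed above: the two summaries report different R-squared measures. plm's R-Squared (0.004764) equals fixest's Within R2, and plm's adjusted value can be reproduced from it with a degrees-of-freedom penalty that counts the absorbed fixed effects. fixest's Adj. R2 (0.212) presumably refers to the full model including the fixed effects, so the two need not agree even with identical coefficients. The arithmetic below is a sketch of that reading, not a statement about either package's internals; all variable names are illustrative.

```r
# Numbers taken from the two summaries above
n_obs     <- 19240     # observations
n_id      <- 1924      # individual fixed effects
n_year    <- 10        # time fixed effects
n_coef    <- 8         # slope coefficients
r2_within <- 0.004764  # plm "R-Squared" = fixest "Within R2"

# Residual degrees of freedom after absorbing two-way fixed effects
df_resid <- n_obs - n_id - n_year + 1 - n_coef

# Within R-squared with a df penalty that includes the fixed effects
adj_r2_within <- 1 - (1 - r2_within) * (n_obs - 1) / df_resid

df_resid
round(adj_r2_within, 5)
```

This gives 17299 residual degrees of freedom, as in plm's F-statistic line, and an adjusted value that rounds to plm's -0.10685.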
I have a question about R: I would like to use two lag periods instead of one in my model (please check my code below), but I don't know how to write it in R. Can someone help, please?
Here are the details of my R code:
library(plm)
fixed = plm(sp ~lag(debt)+lag(I(debt^2))+outgp+gvex+vlimp+vlexp+bcour+infcpi, data=pdata, index=c("country", "year"), model="within")
The lags must be on the variable debt.
This should give 2 lags on the debt variable.
library(plm)
fixed = plm(sp ~lag(debt, k=1:2)+lag(I(debt^2))+outgp+gvex+vlimp+vlexp+bcour+infcpi, data=pdata, index=c("country", "year"), model="within")
For example:
data("Grunfeld", package = "plm")
lags2mod <- plm(inv ~ lag(value, k=1:2) + capital, data = Grunfeld, model = "within")
summary(lags2mod)
Oneway (individual) effect Within Model
Call:
plm(formula = inv ~ lag(value, k = 1:2) + capital, data = Grunfeld,
model = "within")
Balanced Panel: n = 10, T = 18, N = 180
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-272.21434 -19.24168 0.42825 18.09930 260.85548
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
lag(value, k = 1:2)1 0.078234 0.015438 5.0677 1.059e-06 ***
lag(value, k = 1:2)2 -0.018754 0.016078 -1.1664 0.2451
capital 0.352658 0.021003 16.7910 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 2034500
Residual Sum of Squares: 617850
R-Squared: 0.69631
Adj. R-Squared: 0.67449
F-statistic: 127.633 on 3 and 167 DF, p-value: < 2.22e-16
I'm running the following regression using the felm() function in R and receiving the following warning messages:
malmquist.felm <- felm(malmquist~ IT + logTA + npl | firm + year | 0 | firm + year , data=Data)
Warning message:
In newols(mm, nostats = nostats[1], exactDOF = exactDOF, onlyse = onlyse, :
Negative eigenvalues set to zero in multiway clustered variance matrix. See felm(...,psdef=FALSE)
> summary(malmquist.felm)
Call:
felm(formula = malmquist ~ IT + logTA + npl | firm + year | 0 | firm + year, data = Data)
Residuals:
Min 1Q Median 3Q Max
-0.212863 -0.043140 -0.002299 0.040482 0.302387
Coefficients:
Estimate Cluster s.e. t value Pr(>|t|)
IT -9.373e-03 8.749e-03 -1.071 0.315
logTA 8.828e-02 2.955e-03 29.871 1.71e-09 ***
npl 2.435e-03 3.410e-03 0.714 0.496
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08268 on 317 degrees of freedom
(98 observations deleted due to missingness)
Multiple R-squared(full model): 0.2651 Adjusted R-squared: 0.1352
Multiple R-squared(proj model): 0.08175 Adjusted R-squared: -0.08046
F-statistic(full model, *iid*):2.042 on 56 and 317 DF, p-value: 6.886e-05
F-statistic(proj model): 0.7007 on 6 and 8 DF, p-value: 0.6582
Warning message:
In chol.default(mat, pivot = TRUE, tol = tol) :
the matrix is either rank-deficient or indefinite
Does anyone know how to solve this problem or can I just ignore the warning messages?
I'm a newbie in R and I have this fitted model:
> mqo_reg_g <- lm(G ~ factor(year), data = data)
> summary(mqo_reg_g)
Call:
lm(formula = G ~ factor(year), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.11134 -0.06793 -0.04239 0.01324 0.85213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.111339 0.005253 21.197 < 2e-16 ***
factor(year)2002 -0.015388 0.007428 -2.071 0.038418 *
factor(year)2006 -0.016980 0.007428 -2.286 0.022343 *
factor(year)2010 -0.024432 0.007496 -3.259 0.001131 **
factor(year)2014 -0.025750 0.007436 -3.463 0.000543 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.119 on 2540 degrees of freedom
Multiple R-squared: 0.005952, Adjusted R-squared: 0.004387
F-statistic: 3.802 on 4 and 2540 DF, p-value: 0.004361
I want to test the difference between the coefficients of factor(year)2002 and Intercept; factor(year)2006 and factor(year)2002; and so on.
In Stata, I know people use the "test" command, which performs Wald tests about the parameters of the fitted model. But I could not find how to do this in R.
How can I do it?
Thanks!
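One way to approach this in base R is to build the linear combination by hand from coef() and vcov(). The sketch below uses mtcars as a stand-in because the original data are not available, so the model and coefficient names are assumptions for illustration; with the model above, the same pattern would apply to factor(year)2006 and factor(year)2002. Packages such as car (linearHypothesis) offer a ready-made interface for this kind of test.

```r
# Stand-in model with a factor regressor (the original data are not available)
fit <- lm(mpg ~ factor(cyl), data = mtcars)

b <- coef(fit)
V <- vcov(fit)

# Contrast vector for H0: coefficient of cyl == 8 minus coefficient of cyl == 6 is 0
contrast <- setNames(numeric(length(b)), names(b))
contrast["factor(cyl)8"] <- 1
contrast["factor(cyl)6"] <- -1

est     <- sum(contrast * b)                            # estimated difference
se      <- sqrt(drop(t(contrast) %*% V %*% contrast))   # its standard error
t_stat  <- est / se
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

c(estimate = est, std_error = se, t = t_stat, p = p_value)
```

Note that each factor(year) coefficient is already a difference from the baseline year absorbed in the intercept, so its own t-test in summary() covers the comparison with the baseline; the contrast approach is needed only for comparing two non-baseline years.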