I'm running the following regression using the felm() function in R and receiving the following warning messages:
malmquist.felm <- felm(malmquist~ IT + logTA + npl | firm + year | 0 | firm + year , data=Data)
Warning message:
In newols(mm, nostats = nostats[1], exactDOF = exactDOF, onlyse = onlyse, :
Negative eigenvalues set to zero in multiway clustered variance matrix. See felm(...,psdef=FALSE)
> summary(malmquist.felm)
Call:
felm(formula = malmquist ~ IT + logTA + npl | firm + year | 0 | firm + year, data = Data)
Residuals:
Min 1Q Median 3Q Max
-0.212863 -0.043140 -0.002299 0.040482 0.302387
Coefficients:
Estimate Cluster s.e. t value Pr(>|t|)
IT -9.373e-03 8.749e-03 -1.071 0.315
logTA 8.828e-02 2.955e-03 29.871 1.71e-09 ***
npl 2.435e-03 3.410e-03 0.714 0.496
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08268 on 317 degrees of freedom
(98 observations deleted due to missingness)
Multiple R-squared(full model): 0.2651 Adjusted R-squared: 0.1352
Multiple R-squared(proj model): 0.08175 Adjusted R-squared: -0.08046
F-statistic(full model, *iid*):2.042 on 56 and 317 DF, p-value: 6.886e-05
F-statistic(proj model): 0.7007 on 6 and 8 DF, p-value: 0.6582
Warning message:
In chol.default(mat, pivot = TRUE, tol = tol) :
the matrix is either rank-deficient or indefinite
Does anyone know how to solve this problem or can I just ignore the warning messages?
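The warning itself points at one diagnostic: re-fitting with psdef = FALSE keeps any negative eigenvalues in the multiway-clustered variance matrix, so you can compare the standard errors with and without the adjustment. A minimal sketch on simulated data (variable names mirror the model above; the data-generating process is invented for illustration):

```r
library(lfe)
set.seed(1)

# Simulated panel mirroring the model above: 40 firms x 9 years.
# Few clusters in one dimension (here, 9 years) is a common source of
# the negative-eigenvalue warning under two-way clustering.
d <- expand.grid(firm = factor(1:40), year = factor(2001:2009))
d$IT        <- rnorm(nrow(d))
d$logTA     <- rnorm(nrow(d))
d$npl       <- rnorm(nrow(d))
d$malmquist <- 0.1 * d$logTA + rnorm(nrow(d))

fit <- felm(malmquist ~ IT + logTA + npl | firm + year | 0 | firm + year,
            data = d, psdef = FALSE)  # keep negative eigenvalues
summary(fit)
```

If the clustered standard errors barely change with psdef = FALSE, the adjustment is harmless and the warning can usually be ignored; large differences suggest one clustering dimension has too few clusters for two-way clustering to be reliable.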
I am conducting model selection on my dataset using the package MASS and the function stepAIC. This is the current code I am using:
mod <- lm(Distance~DiffAge + DiffR + DiffSize + DiffRep + DiffSeason +
Diff.Bkp + Diff.Fzp + Diff.AO + Diff.Aow +
Diff.Lag.NAOw + Diff.Lag.NAO + Diff.Lag.AO + Diff.Lag.Aow, data=data,
na.action="na.exclude")
library(MASS)
step.model<-stepAIC(mod, direction = "both",
trace = FALSE)
summary(step.model)
This gives me the following output:
Call:
lm(formula = Distance ~ Diff.Lag.NAOw + Diff.Lag.AO + DiffSeason,
data = data, na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-146.984 -48.397 -9.533 42.169 194.950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.944 20.247 3.850 0.000184 ***
Diff.Lag.NAOw 11.868 6.261 1.896 0.060209 .
Diff.Lag.AO 24.696 17.475 1.413 0.159947
DiffSeasonEW-LW 41.891 18.607 2.251 0.026014 *
DiffSeasonLW-LW 22.863 20.791 1.100 0.273465
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 67.2 on 132 degrees of freedom
Multiple R-squared: 0.06031, Adjusted R-squared: 0.03183
F-statistic: 2.118 on 4 and 132 DF, p-value: 0.08209
If I am reading this right, the output only shows me the top model (Let me know if this is incorrect!). I would like to see the other, lower-ranked models as well, with their accompanying AIC scores.
Any suggestions on how I can achieve this? Should I modify my code in any way?
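If what you want is the AIC of the models considered along the way, two things help: stepAIC(..., trace = TRUE) prints every candidate model with its AIC as it is evaluated, and the returned object stores the accepted path in its anova component. A self-contained sketch on toy data (the toy variables stand in for your Diff* predictors):

```r
library(MASS)
set.seed(42)

# Toy data standing in for Distance ~ Diff* predictors
df <- data.frame(y  = rnorm(100),
                 x1 = rnorm(100),
                 x2 = rnorm(100),
                 x3 = rnorm(100))
mod <- lm(y ~ x1 + x2 + x3, data = df)

# trace = TRUE prints the AIC of every candidate model at each step
step.model <- stepAIC(mod, direction = "both", trace = TRUE)

# The accepted path (one row per step, with its AIC) is kept here:
step.model$anova
```

If you instead want every subset of predictors ranked by AIC, not just the stepwise path, MuMIn::dredge is a common alternative (note it requires na.action = na.fail on the global model).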
I am doing a negative binomial analysis on some count data, available at the following link: https://www.dropbox.com/s/q7fwqicw3ebvwlg/stackquestion.csv?dl=0
I got some error messages when I tried to fit all the independent variables in the model at once, which made me look at each independent variable one by one to find out which variable caused the problem. Here is what I found:
For all the other variables, fitting them against the response Y (column A) looks normal:
m2 <- glm.nb(A~K, data=d)
summary(m2)
Call:
glm.nb(formula = A ~ K, data = d, init.theta = 0.5569971932,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5070 -1.2538 -0.4360 0.1796 1.9588
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.66185 0.84980 -0.779 0.436
K 0.25628 0.03016 8.498 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.557) family taken to be 1)
Null deviance: 113.202 on 56 degrees of freedom
Residual deviance: 70.092 on 55 degrees of freedom
AIC: 834.86
Number of Fisher Scoring iterations: 1
Theta: 0.5570
Std. Err.: 0.0923
2 x log-likelihood: -828.8570
However, for variable L, when I fit it against Y, I got this:
m1 <- glm.nb(A~L, data=d)
There were 50 or more warnings (use warnings() to see the first 50)
summary(m1)
Call:
glm.nb(formula = A ~ L, data = d, init.theta = 5136324.722, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-67.19 -18.93 -12.07 13.25 64.00
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.45341 0.01796 192.3 <2e-16 ***
L 0.24254 0.00103 235.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(5136325) family taken to be 1)
Null deviance: 97084 on 56 degrees of freedom
Residual deviance: 28529 on 55 degrees of freedom
AIC: 28941
Number of Fisher Scoring iterations: 1
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
invalid 'nsmall' argument
You can see that init.theta and the AIC are far too large, and there are 50 warnings and an error message.
The warning message is:
In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > ... :
iteration limit reached
Actually, variables M and L are two measurements of the same thing, and I did not find anything abnormal in variable L. Of all the data, only column L has this problem.
So I am wondering what exactly this error message means: Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L,: invalid 'nsmall' argument. Since these are just observed data, how should I fix this error? Thank you!
The important message is in the warnings(): when L is the independent variable, the default number of iterations in the GLM convergence procedure is not high enough to converge on a model fit.
If you manually set the maxit parameter to a higher value, you can fit A ~ L without error:
glm.nb(A ~ L, data = d, control = glm.control(maxit = 500))
See the glm.control documentation for more. Note that you can also set a reasonable value for init.theta, which will prevent both theta and the AIC from blowing up to unreasonable values:
m1 <- glm.nb(A ~ L, data = d, control = glm.control(maxit = 500), init.theta = 1.0)
Output:
Call:
glm.nb(formula = A ~ L, data = d, control = glm.control(maxit = 500),
    init.theta = 0.8016681349, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3020 -0.9347 -0.3578 0.1435 2.5420
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.25962 0.40094 3.142 0.00168 **
L 0.38823 0.02994 12.967 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.8017) family taken to be 1)
Null deviance: 160.693 on 56 degrees of freedom
Residual deviance: 67.976 on 55 degrees of freedom
AIC: 809.41
Number of Fisher Scoring iterations: 1
Theta: 0.802
Std. Err.: 0.140
2 x log-likelihood: -803.405
data("hprice2")
reg1 <- lm(price ~ rooms + crime + nox, hprice2)
summary(reg1)
Call:
lm(formula = price ~ rooms + crime + nox, data = hprice2)
Residuals:
Min 1Q Median 3Q Max
-18311 -3218 -772 2418 39164
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19371.47 3250.94 -5.959 4.79e-09 ***
rooms 7933.18 407.87 19.450 < 2e-16 ***
crime -199.70 35.05 -5.697 2.08e-08 ***
nox -1306.06 266.14 -4.907 1.25e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6103 on 502 degrees of freedom
Multiple R-squared: 0.5634, Adjusted R-squared: 0.5608
F-statistic: 215.9 on 3 and 502 DF, p-value: < 2.2e-16
Question 1.
Run two alternative (two-sided) t-tests for: H0: B1 = 8000
predict(reg1, data.frame(rooms=8000, crime = -199.70, nox = -1306.06), interval = .99)
Report your t-statistic and whether you reject or fail to reject the null at 90, 95, and/or 99 percent confidence levels.
I suppose by beta1 you mean the coefficient on rooms in this case. The t-test reported in the summary is against the null beta = 0; in general, the t-statistic is (estimate - hypothesized value) / standard error.
So, using nox as an example:
tstat = (-1306.06 - 0)/266.14
[1] -4.907417
And the p-value is
2*pt(-abs(tstat), 502)
[1] 1.251945e-06
In your case the null hypothesis is B1 = 8000, so you test rooms = 8000:
tstat = (7933.18 - 8000)/407.87
2*pt(-abs(tstat),502)
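The same calculation can be wrapped in a small helper (the function name coef_ttest is made up here) that pulls the estimate and standard error from the coefficient table:

```r
# Hypothetical helper: two-sided t-test of a single coefficient
# against a hypothesized value b0.
coef_ttest <- function(fit, coef_name, b0 = 0) {
  tab   <- coef(summary(fit))
  est   <- tab[coef_name, "Estimate"]
  se    <- tab[coef_name, "Std. Error"]
  tstat <- (est - b0) / se
  pval  <- 2 * pt(-abs(tstat), df = fit$df.residual)
  c(t = tstat, p.value = pval)
}

# With the regression above: coef_ttest(reg1, "rooms", b0 = 8000)
```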
You can also use linearHypothesis from the car package to do the above:
library(car)
linearHypothesis(reg1, c("rooms = 8000"))
I'm a newbie in R and I have this fitted model:
> mqo_reg_g <- lm(G ~ factor(year), data = data)
> summary(mqo_reg_g)
Call:
lm(formula = G ~ factor(year), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.11134 -0.06793 -0.04239 0.01324 0.85213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.111339 0.005253 21.197 < 2e-16 ***
factor(year)2002 -0.015388 0.007428 -2.071 0.038418 *
factor(year)2006 -0.016980 0.007428 -2.286 0.022343 *
factor(year)2010 -0.024432 0.007496 -3.259 0.001131 **
factor(year)2014 -0.025750 0.007436 -3.463 0.000543 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.119 on 2540 degrees of freedom
Multiple R-squared: 0.005952, Adjusted R-squared: 0.004387
F-statistic: 3.802 on 4 and 2540 DF, p-value: 0.004361
I want to test the difference between the coefficients of factor(year)2002 and Intercept; factor(year)2006 and factor(year)2002; and so on.
In Stata I know people use the command test, which performs Wald tests on the parameters of a fitted model, but I could not find how to do this in R.
How can I do it?
Thanks!
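One way to replicate Stata's test in R is linearHypothesis() from the car package (the same function used in an earlier answer above). A sketch on simulated data shaped like yours (the baseline year 1998 and the data-generating process are invented for illustration); with your own model you would pass mqo_reg_g instead of fit:

```r
library(car)
set.seed(1)

# Simulated stand-in for the panel of G values by election year
d <- data.frame(G    = rnorm(200),
                year = factor(rep(c(1998, 2002, 2006, 2010, 2014), each = 40)))
fit <- lm(G ~ factor(year), data = d)

# Wald test of H0: the 2006 coefficient equals the 2002 coefficient
linearHypothesis(fit, "factor(year)2006 = factor(year)2002")

# Wald test of H0: the 2002 coefficient equals the intercept
linearHypothesis(fit, "factor(year)2002 = (Intercept)")
```

Each hypothesis string names coefficients exactly as they appear in coef(fit), and multiple restrictions can be tested jointly by passing a character vector of such strings.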
Using the plm package in R to fit a fixed-effects model, what is the correct syntax to add a lagged variable to the model? Similar to the 'L1.variable' command in Stata.
Here is my attempt adding a lagged variable (this is a test model and it might not make sense):
library(foreign)
nlswork <- read.dta("http://www.stata-press.com/data/r11/nlswork.dta")
pnlswork <- plm.data(nlswork, c('idcode', 'year'))
ffe <- plm(ln_wage ~ ttl_exp+lag(wks_work,1)
, model = 'within'
, data = nlswork)
summary(ffe)
R output:
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + lag(wks_work), data = nlswork,
model = "within")
Unbalanced Panel: n=3911, T=1-14, N=19619
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.77000 -0.10100 0.00293 0.11000 2.90000
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.02341057 0.00073832 31.7078 < 2.2e-16 ***
lag(wks_work) 0.00081576 0.00010628 7.6755 1.744e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 1296.9
Residual Sum of Squares: 1126.9
R-Squared: 0.13105
Adj. R-Squared: -0.085379
F-statistic: 1184.39 on 2 and 15706 DF, p-value: < 2.22e-16
However, I got different results compared to what Stata produces.
In my actual model, I would like to instrument an endogenous variable with its lagged value.
Thanks!
For reference, here is the Stata code:
webuse nlswork.dta
xtset idcode year
xtreg ln_wage ttl_exp L1.wks_work, fe
Stata output:
Fixed-effects (within) regression Number of obs = 10,680
Group variable: idcode Number of groups = 3,671
R-sq: Obs per group:
within = 0.1492 min = 1
between = 0.2063 avg = 2.9
overall = 0.1483 max = 8
F(2,7007) = 614.60
corr(u_i, Xb) = 0.1329 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ttl_exp | .0192578 .0012233 15.74 0.000 .0168597 .0216558
|
wks_work |
L1. | .0015891 .0001957 8.12 0.000 .0012054 .0019728
|
_cons | 1.502879 .0075431 199.24 0.000 1.488092 1.517666
-------------+----------------------------------------------------------------
sigma_u | .40678942
sigma_e | .28124886
rho | .67658275 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(3670, 7007) = 4.71 Prob > F = 0.0000
lag() as implemented in plm lags the observations row-wise, without "looking" at the time variable, i.e. it simply shifts the variable within each individual. If there are gaps in the time dimension, you probably want the value of the time variable to be taken into account. There is the (as of now) unexported function plm:::lagt.pseries, which takes the time variable into account and hence handles gaps in the data as you might expect.
Edit: Since plm version 1.7-0, the default behaviour of lag in plm is to shift time-wise, but you can control this with the shift argument (shift = c("time", "row")) to shift either time-wise or row-wise (the old behaviour).
Use it as follows:
library(plm)
library(foreign)
nlswork <- read.dta("http://www.stata-press.com/data/r11/nlswork.dta")
pnlswork <- pdata.frame(nlswork, c('idcode', 'year'))
ffe <- plm(ln_wage ~ ttl_exp + plm:::lagt.pseries(wks_work,1)
, model = 'within'
, data = pnlswork)
summary(ffe)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + plm:::lagt.pseries(wks_work,
1), data = nlswork, model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
plm:::lagt.pseries(wks_work, 1) 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16
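The shift argument mentioned in the edit above can be used directly on a pseries; a small self-contained illustration with the Grunfeld data that ships with plm:

```r
library(plm)
data("Grunfeld", package = "plm")
pG <- pdata.frame(Grunfeld, index = c("firm", "year"))

# Time-wise lag (default since plm 1.7-0): respects gaps in the time index
head(lag(pG$inv, 1, shift = "time"))

# Row-wise lag (the old behaviour): simply shifts within each individual
head(lag(pG$inv, 1, shift = "row"))
```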
Btw1: Better use pdata.frame() instead of plm.data().
Btw2: You can check for gaps in your data with plm's is.pconsecutive():
is.pconsecutive(pnlswork)
all(is.pconsecutive(pnlswork))
You can also make the data consecutive first and then use lag(), like this:
pnlswork2 <- make.pconsecutive(pnlswork)
pnlswork2$wks_work_lag <- lag(pnlswork2$wks_work)
ffe2 <- plm(ln_wage ~ ttl_exp + wks_work_lag
, model = 'within'
, data = pnlswork2)
summary(ffe2)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + wks_work_lag, data = pnlswork2,
model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
wks_work_lag 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16
Or simply:
ffe3 <- plm(ln_wage ~ ttl_exp + lag(wks_work)
, model = 'within'
, data = pnlswork2) # note: it is the consecutive panel data set here
summary(ffe3)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + lag(wks_work), data = pnlswork2,
model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
lag(wks_work) 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16