I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, that vary over time for a set of uniquely identified regions. I would like to run a regression that includes both regional (region in the code below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, the two approaches give very different results. The problem disappears when I drop the time fixed effects and use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that would let me run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
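For what it's worth, here is a base-R sketch of why the two approaches should agree on a balanced panel. The data are simulated and the helper name demean2 is made up for illustration. The simple double-demeaning formula used here is only exact when every region appears once in every year, so an unbalanced panel or duplicated region-year rows is one thing that might be worth checking in the real data.

```r
# Simulated balanced panel; all names and values are illustrative only.
set.seed(1)
n_region <- 30
n_year <- 8
panel <- expand.grid(year = seq_len(n_year), region = seq_len(n_region))
panel$a <- rnorm(nrow(panel))
panel$b <- rnorm(nrow(panel))
region_fe <- rnorm(n_region)[panel$region]
year_fe <- rnorm(n_year)[panel$year]
panel$y <- 2 * panel$a - 1.5 * panel$b + region_fe + year_fe + rnorm(nrow(panel))

# Approach 1: least-squares dummy variables
lsdv <- lm(y ~ a + b + factor(region) + factor(year), data = panel)

# Approach 2: two-way within transformation
# (hypothetical helper; this simple form assumes a balanced panel)
demean2 <- function(x, g1, g2) x - ave(x, g1) - ave(x, g2) + mean(x)
within_df <- data.frame(
  y = demean2(panel$y, panel$region, panel$year),
  a = demean2(panel$a, panel$region, panel$year),
  b = demean2(panel$b, panel$region, panel$year)
)
within_fit <- lm(y ~ a + b - 1, data = within_df)

coef_lsdv <- coef(lsdv)[c("a", "b")]
coef_within <- coef(within_fit)
print(rbind(lsdv = coef_lsdv, within = coef_within))
```

The two rows of the printed matrix should coincide up to floating-point error.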
I'm doing the computer exercises from Wooldridge (2012), Introductory Econometrics, in R, specifically Chapter 14, CE.1 (the data are in the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python):
import statsmodels.formula.api as smf
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the plm package in R gives the same results for the first-difference model:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
               data = data, model = "fd")
No problem so far. However, the fixed-effects model reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu, data = data,
               model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.
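One possible reading, not checked against the rental data: with two periods, first differencing with an intercept is equivalent to fixed effects on the unit plus a period dummy, whereas effect = "time" removes only the period means and leaves the unit effects in. A small base-R simulation (all names and numbers are made up) illustrates the difference:

```r
# Simulated two-period panel; everything here is illustrative.
set.seed(2)
n_units <- 50
toy <- data.frame(id = rep(seq_len(n_units), each = 2),
                  period = rep(1:2, times = n_units))
toy$x <- rnorm(nrow(toy))
unit_fe <- rnorm(n_units)[toy$id]
toy$y <- 1 + 0.5 * toy$x + 0.4 * (toy$period == 2) + unit_fe + rnorm(nrow(toy))

# First differences with an intercept (the intercept picks up the period effect)
first <- toy[toy$period == 1, ]
second <- toy[toy$period == 2, ]
dy <- second$y - first$y
dx <- second$x - first$x
fd_fit <- lm(dy ~ dx)

# Unit fixed effects plus a period dummy
fe_fit <- lm(y ~ x + factor(id) + factor(period), data = toy)

# Period dummies only, which is what a time-only within model removes
time_fit <- lm(y ~ x + factor(period), data = toy)

cols <- c("Estimate", "Std. Error")
fd_est <- coef(summary(fd_fit))["dx", cols]
fe_est <- coef(summary(fe_fit))["x", cols]
time_est <- coef(summary(time_fit))["x", cols]
print(rbind(fd = fd_est, unit_plus_period = fe_est, period_only = time_est))
```

In this sketch the first two rows agree in both the estimate and the standard error, while the period-only row does not. If the same logic applies to the exercise, effect = "twoways" (or effect = "individual" with a year dummy in the formula) would be the specification to compare against the first-difference fit.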
I understand from this question here that coefficients are the same whether we use a lm regression with as.factor() and a plm regression with fixed effects.
However, the R-squared values differ significantly. Which one is correct, and how does the interpretation change between the two models? In my case, the R-squared is much larger for the plm specification and is even negative for the lm + factor one.
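A likely source of the gap, stated here as an assumption about how the two summaries are computed: the R-squared from lm() counts the variation absorbed by the dummies as explained, while a within R-squared is computed on the demeaned data only. The base-R sketch below uses simulated data and a made-up helper (demean2) to show that the residual sum of squares is the same in both fits and only the total sum of squares in the denominator changes.

```r
# Simulated balanced panel with strong region effects; illustrative only.
set.seed(3)
n_region <- 40
n_year <- 10
panel <- expand.grid(year = seq_len(n_year), region = seq_len(n_region))
panel$a <- rnorm(nrow(panel))
panel$y <- 2 * panel$a + 3 * rnorm(n_region)[panel$region] +
  rnorm(n_year)[panel$year] + rnorm(nrow(panel))

# Dummy-variable regression: R-squared includes what the dummies explain
lsdv <- lm(y ~ a + factor(region) + factor(year), data = panel)
r2_lsdv <- summary(lsdv)$r.squared

# Within regression on two-way demeaned data (balanced panel assumed)
demean2 <- function(x, g1, g2) x - ave(x, g1) - ave(x, g2) + mean(x)
y_w <- demean2(panel$y, panel$region, panel$year)
a_w <- demean2(panel$a, panel$region, panel$year)
within_fit <- lm(y_w ~ a_w - 1)

ssr_lsdv <- sum(resid(lsdv)^2)
ssr_within <- sum(resid(within_fit)^2)
r2_within <- 1 - ssr_within / sum(y_w^2)

print(c(ssr_lsdv = ssr_lsdv, ssr_within = ssr_within))
print(c(r2_lsdv = r2_lsdv, r2_within = r2_within))
```

Neither number is wrong; they answer different questions. The within version describes how much of the variation left after removing region and year means is explained by the regressors.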
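I am fitting training data with glm() and want to plot the coefficients. However, I have no idea how to produce a plot like the one linked above. My code is as follows:
library(caret)   # createDataPartition(), train(), trainControl()
library(dplyr)   # %>% and mutate()
set.seed(1)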
trn_index = createDataPartition(y = development$EQUAL_PAY, p = 0.80, list = FALSE)
trn_pay = development[trn_index, ]
tst_pay = development[-trn_index, ]
trn_pay_f <- trn_pay %>%
mutate(EQUAL_PAY = relevel(factor(EQUAL_PAY),ref = "YES"))
pay_lgr = train(EQUAL_PAY ~ .- EQUAL_WORK - COUNTRY, method = "glm", family = binomial(link = "logit"), data = trn_pay_f,trControl = trainControl(method = 'cv', number = 10))
summary(pay_lgr)
##Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.560e+00 2.552e+00 -1.003 0.3158
GDP_PER_CAP -5.253e-05 3.348e-05 -1.569 0.1167
CO2_PER_CAP 1.695e-01 7.882e-02 2.151 0.0315 *
PERC_ACCESS_ELECTRICITY -7.833e-03 1.249e-02 -0.627 0.5304
ATMS_PER_1E5 -2.473e-03 8.012e-03 -0.309 0.7576
PERC_INTERNET_USERS -2.451e-02 2.047e-02 -1.198 0.2310
SCIENTIFIC_ARTICLES_PER_YR 2.698e-05 1.519e-05 1.776 0.0757 .
PERC_FEMALE_SECONDARY_EDU 1.126e-01 5.934e-02 1.897 0.0578 .
PERC_FEMALE_LABOR_FORCE -6.559e-03 1.477e-02 -0.444 0.6569
PERC_FEMALE_PARLIAMENT -4.786e-02 2.191e-02 -2.184 0.0289 *
## extract all parameters in a dataframe
pay_lgrFrame <- data.frame(COEFFICIENT = rownames(summary(pay_lgr)$coef),
p_value = summary(pay_lgr)$coef[,4],
z_value = summary(pay_lgr)$coef[,3],
SE = summary(pay_lgr)$coef[,2],
Estimate = summary(pay_lgr)$coef[,1])
## This is where I got stuck: I don't know how to turn this into a plot like the image linked above.
Pulling in your summary table (you can get this directly as ss <- coef(summary(pay_lgr)), but I don't have your data set):
ss <- read.delim(header=TRUE,check.names=FALSE,text="
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.560e+00 2.552e+00 -1.003 0.3158
GDP_PER_CAP -5.253e-05 3.348e-05 -1.569 0.1167
CO2_PER_CAP 1.695e-01 7.882e-02 2.151 0.0315
PERC_ACCESS_ELECTRICITY -7.833e-03 1.249e-02 -0.627 0.5304
ATMS_PER_1E5 -2.473e-03 8.012e-03 -0.309 0.7576
PERC_INTERNET_USERS -2.451e-02 2.047e-02 -1.198 0.2310
SCIENTIFIC_ARTICLES_PER_YR 2.698e-05 1.519e-05 1.776 0.0757
PERC_FEMALE_SECONDARY_EDU 1.126e-01 5.934e-02 1.897 0.0578
PERC_FEMALE_LABOR_FORCE -6.559e-03 1.477e-02 -0.444 0.6569
PERC_FEMALE_PARLIAMENT -4.786e-02 2.191e-02 -2.184 0.0289")
Convert row names to a column called term:
ss2 <- tibble::rownames_to_column(ss,"term")
Draw the barplot:
library(ggplot2)
ggplot(ss2, aes(term,Estimate))+
geom_bar(stat="identity")+
coord_flip()
ggsave("bar.png")
As others have commented, there are probably better (both easier and preferable in terms of visual communication) ways to plot the coefficients. The dotwhisker::dwplot() function does several convenient things:
automatically extracts coefficients and plots them
automatically scales continuous predictors by 2*std dev, to enable comparison between coefficients (use by_2sd=FALSE if you don't want this)
automatically leaves out the intercept, which is on a different scale from the other parameters and is rarely of inferential interest
library(dotwhisker)
dwplot(lm(Murder/Population ~ ., data=as.data.frame(state.x77)))
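If you would rather not add a dependency, a dot-and-whisker plot can be sketched in base graphics as well. The example below uses the built-in infert data rather than your data set, approximate 95% Wald intervals (estimate plus or minus 1.96 standard errors), and no rescaling of predictors, so it is a rough stand-in and not a replacement for dwplot().

```r
# Logistic regression on a built-in data set (stand-in for the real model)
fit <- glm(case ~ spontaneous + induced + age, data = infert, family = binomial)

# Coefficient table without the intercept
tab <- coef(summary(fit))
tab <- tab[rownames(tab) != "(Intercept)", , drop = FALSE]

# Approximate 95% Wald intervals on the log-odds scale
ci <- data.frame(
  term = rownames(tab),
  estimate = tab[, "Estimate"],
  lower = tab[, "Estimate"] - 1.96 * tab[, "Std. Error"],
  upper = tab[, "Estimate"] + 1.96 * tab[, "Std. Error"]
)

# One row per coefficient: point for the estimate, segment for the interval
ypos <- seq_len(nrow(ci))
plot(ci$estimate, ypos, pch = 19, yaxt = "n", ylab = "",
     xlab = "Estimate (log-odds)", xlim = range(0, ci$lower, ci$upper))
segments(ci$lower, ypos, ci$upper, ypos)
axis(2, at = ypos, labels = ci$term, las = 1)
abline(v = 0, lty = 2)  # reference line at no effect
```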
I am replicating a negative binomial regression model in R. When calculating robust standard errors, the output does not match Stata output of standard errors.
The original Stata code is
nbreg displaced eei lcostofwar cfughh roadskm lpopdensity ltkilled, robust nolog
I have attempted both manual calculation and vcovHC from sandwich. However, neither produces the same results.
My regression model is as follows:
mod1 <- glm.nb(displaced ~ eei + costofwar_log + cfughh + roadskm + popdensity_log + tkilled_log, data = mod1_df)
With vcovHC I have tried every option from HC0 to HC5.
Attempt 1:
cov_m1 <- vcovHC(mod1, type = "HC0", sandwich = T)
se <- sqrt(diag(cov_m1))
Attempt 2:
mod1_rob <- coeftest(mod1, vcovHC = vcov(mod1, type = "HC0"))
The most successful has been HC0 and vcov = sandwich but no SEs are correct.
Any suggestions?
EDIT
My output is as follows (using HC0):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3281183 1.5441312 0.8601 0.389730
eei -0.0435529 0.0183359 -2.3753 0.017536 *
costofwar_log 0.2984376 0.1350518 2.2098 0.027119 *
cfughh -0.0380690 0.0130254 -2.9227 0.003470 **
roadskm 0.0020812 0.0010864 1.9156 0.055421 .
popdensity_log -0.4661079 0.1748682 -2.6655 0.007688 **
tkilled_log 1.0949084 0.2159161 5.0710 3.958e-07 ***
The Stata output I am attempting to replicate is:
Estimate Std. Error
(Intercept) 1.328 1.272
eei -0.044 0.015
costofwar_log 0.298 0.123
cfughh -0.038 0.018
roadskm 0.002 0.0001
popdensity_log -0.466 0.208
tkilled_log 1.095 0.209
The dataset is found here and the recoded variables are:
mod1_df <- table %>%
select(displaced, eei_01, costofwar, cfughh, roadskm, popdensity,
tkilled)
mod1_df$popdensity_log <- log(mod1_df$popdensity + 1)
mod1_df$tkilled_log <- log(mod1_df$tkilled + 1)
mod1_df$costofwar_log <- log(mod1_df$costofwar + 1)
mod1_df$eei <- mod1_df$eei_01*100
Stata uses the observed Hessian for its computations, whereas glm.nb() uses the expected Hessian. Therefore, the default bread() employed by the sandwich() function is different, leading to different results. There are other R packages that employ the observed Hessian for their variance-covariance estimates (e.g., gamlss), but these do not supply an estfun() method for the sandwich package.
Hence, below I simply set up a dedicated bread_obs() function that extracts the ML estimates from a negbin object, sets up the negative log-likelihood, computes the observed Hessian numerically via numDeriv::hessian() and computes the "bread" from it (omitting the estimate for log(theta)):
bread_obs <- function(object, method = "BFGS", maxit = 5000, reltol = 1e-12, ...) {
## data and estimated parameters
Y <- model.response(model.frame(object))
X <- model.matrix(object)
par <- c(coef(object), "log(theta)" = log(object$theta))
## dimensions
n <- NROW(X)
k <- length(par)
## nb log-likelihood
nll <- function(par) suppressWarnings(-sum(dnbinom(Y,
mu = as.vector(exp(X %*% head(par, -1))),
size = exp(tail(par, 1)), log = TRUE)))
## covariance based on observed Hessian
rval <- numDeriv::hessian(nll, par)
rval <- solve(rval) * n
rval[-k, -k]
}
With that function I can compare the sandwich() output (based on the expected Hessian) with the output using the bread_obs() (based on the observed Hessian).
s_exp <- sandwich(mod1)
s_obs <- sandwich(mod1, bread. = bread_obs)  # sandwich() takes the bread via its bread. argument
cbind("Coef" = coef(mod1), "SE (Exp)" = sqrt(diag(s_exp)), "SE (Obs)" = sqrt(diag(s_obs)))
## Coef SE (Exp) SE (Obs)
## (Intercept) 1.328 1.259 1.259
## eei -0.044 0.017 0.015
## costofwar_log 0.298 0.160 0.121
## cfughh -0.038 0.015 0.018
## roadskm 0.002 0.001 0.001
## popdensity_log -0.466 0.135 0.207
## tkilled_log 1.095 0.179 0.208
This still has slight differences compared to Stata but these are likely numerical differences from the optimization etc.
If you create a new dedicated bread() method for negbin objects
bread.negbin <- bread_obs
then the method dispatch will use this if you do sandwich(mod1).
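To make the observed-versus-expected distinction concrete, here is a simplified sketch on simulated data. Unlike bread_obs() above, it holds theta fixed at its estimate and uses the closed-form second derivatives for the log link, so it is an illustration of the idea and not a reproduction of the Stata numbers. All object names are made up.

```r
library(MASS)  # glm.nb()

# Simulated negative binomial data; values are illustrative only
set.seed(4)
n <- 1000
x <- rnorm(n)
y <- rnbinom(n, mu = exp(0.5 + 0.3 * x), size = 2)
fit <- glm.nb(y ~ x)

X <- model.matrix(fit)
mu <- as.vector(fitted(fit))
theta <- fit$theta

# Per-observation scores for beta (theta treated as fixed); their
# cross-product is the "meat" of the sandwich
scores <- X * ((y - mu) / (1 + mu / theta))
meat <- crossprod(scores)

# Negative second derivatives of the log-likelihood w.r.t. the linear predictor
w_exp <- mu / (1 + mu / theta)                      # expected information
w_obs <- mu * (1 + y / theta) / (1 + mu / theta)^2  # observed information

inv_info_exp <- solve(crossprod(X, X * w_exp))
inv_info_obs <- solve(crossprod(X, X * w_obs))

se_exp <- sqrt(diag(inv_info_exp %*% meat %*% inv_info_exp))
se_obs <- sqrt(diag(inv_info_obs %*% meat %*% inv_info_obs))
print(cbind(Coef = coef(fit), "SE (Exp)" = se_exp, "SE (Obs)" = se_obs))
```

With well-specified simulated data the two columns are close; on real data, where the model fits less well, they can diverge more, which is the pattern in the table above.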
In R you need to provide the degrees-of-freedom correction manually, so try this, which I borrowed from this source:
dfa <- (G/(G - 1)) * (N - 1)/pm1$df.residual
# display with cluster VCE and df-adjustment
firm_c_vcov <- dfa * vcovHC(pm1, type = "HC0", cluster = "group", adjust = T)
coeftest(pm1, vcov = firm_c_vcov)
Here G is the number of panels (clusters) in your data set, N is the number of observations, and pm1 is your estimated model. Obviously, you could drop the clustering.
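For readers who want to see what the correction does, the following base-R sketch builds a cluster-robust covariance matrix by hand for a plain lm() fit and then applies the same factor. The data are simulated, the variable names are made up, and lm() stands in for the plm model, so treat it as an illustration of the arithmetic only.

```r
# Simulated clustered data; all names and values are illustrative
set.seed(5)
G <- 40                      # number of clusters (panels)
per_cluster <- 5
N <- G * per_cluster         # number of observations
firm <- rep(seq_len(G), each = per_cluster)
x <- rnorm(N)
y <- 1 + 0.5 * x + rnorm(G)[firm] + rnorm(N)
fit <- lm(y ~ x)

X <- model.matrix(fit)
u <- resid(fit)

# Uncorrected cluster-robust sandwich: sum scores within cluster first
xtx_inv <- solve(crossprod(X))
cluster_scores <- rowsum(X * u, group = firm)
vcov_cluster <- xtx_inv %*% crossprod(cluster_scores) %*% xtx_inv

# Degrees-of-freedom adjustment, same form as dfa above
dfa <- (G / (G - 1)) * (N - 1) / fit$df.residual
se_raw <- sqrt(diag(vcov_cluster))
se_adj <- sqrt(diag(dfa * vcov_cluster))
print(cbind(Coef = coef(fit), "SE (raw)" = se_raw, "SE (adjusted)" = se_adj))
```

The adjusted standard errors are the raw ones multiplied by sqrt(dfa), so the correction matters most when the number of clusters is small.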
I have the (sample) dataset below:
round<-c( 0.125150, 0.045800, -0.955299, -0.232007, 0.120880, -0.041525, 0.290473, -0.648752, 0.113264, -0.403685)
square<-c(-0.634753, 0.000492, -0.178591, -0.202462, -0.592054, -0.583173, -0.632375, -0.176673, -0.680557, -0.062127)
ideo<-c(0,1,0,1,0,1,0,0,1,1)
ex<-data.frame(round,square,ideo)
When I ran the GEE regression in SPSS, I got this table as a result.
I used the packages gee and geepack in R to run the same analysis and got these results:
#gee
summary(gee(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 1.0541 0.4099 2.572 0.1328 7.937
square 1.1811 0.8321 1.419 0.4095 2.884
round 0.7072 0.5670 1.247 0.1593 4.439
#geepack
summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 1.054 0.133 63.00 2.1e-15 ***
square 1.181 0.410 8.32 0.0039 **
round 0.707 0.159 19.70 9.0e-06 ***
---
I would like to recreate the layout of the SPSS table exactly (not the numbers, since I am using a subset of the original dataset), but I do not know how to obtain all of these quantities.
A tiny bit of tidyverse magic can get the same results - more or less.
Get the information from coef(summary(geeglm())) and compute the necessary columns:
library("tidyverse")
library("geepack")
coef(summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))) %>%
mutate(lowerWald = Estimate-1.96*Std.err, # Lower Wald CI
upperWald=Estimate+1.96*Std.err, # Upper Wald CI
df=1,
ExpBeta = exp(Estimate)) %>% # Transformed estimate
mutate(lWald=exp(lowerWald), # Lower transformed
uWald=exp(upperWald)) # Upper transformed
This produces the following (with the data you provided). The order and the names of the columns could be modified to suit your needs.
Estimate Std.err Wald Pr(>|W|) lowerWald upperWald df ExpBeta lWald uWald
1 1.0541 0.1328 62.997 2.109e-15 0.7938 1.314 1 2.869 2.212 3.723
2 1.1811 0.4095 8.318 3.925e-03 0.3784 1.984 1 3.258 1.460 7.270
3 0.7072 0.1593 19.704 9.042e-06 0.3949 1.019 1 2.028 1.484 2.772
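In case it helps to see where each column comes from, the same quantities can be rebuilt in base R from nothing more than the estimates and robust standard errors. The sketch below types in the rounded values from the geeglm output, so the results match the table only up to rounding; the column names are my own choice.

```r
# Rounded estimates and robust standard errors copied from the output above
est <- c("(Intercept)" = 1.0541, square = 1.1811, round = 0.7072)
se <- c(0.1328, 0.4095, 0.1593)

wald <- (est / se)^2                                 # Wald chi-square, 1 df each
p_value <- pchisq(wald, df = 1, lower.tail = FALSE)
z <- qnorm(0.975)                                    # about 1.96
lower <- est - z * se
upper <- est + z * se

spss_like <- data.frame(
  B = est, Std.err = se,
  lowerWald = lower, upperWald = upper,
  Wald = wald, df = 1, Sig = p_value,
  ExpB = exp(est), ExpLower = exp(lower), ExpUpper = exp(upper)
)
print(round(spss_like, 4))
```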
I am fitting an ordered logit model with four different packages (VGAM, rms, MASS and ordinal), using the wine data set from the ordinal package.
First is vglm():
library(VGAM)
vglmfit <- vglm(rating ~ temp * contact, data = wine,
family=cumulative(parallel=TRUE, reverse=TRUE))
The coefficients are:
(Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4
1.4112568 -1.1435551 -3.3770742 -4.9419773
tempwarm contactyes tempwarm:contactyes
2.3212033 1.3474598 0.3595241
Second is orm():
library(rms)
ormfit <- orm(rating ~ temp * contact, data = wine)
Coef:
Coef S.E. Wald Z Pr(>|Z|)
y>=2 1.4113 0.5454 2.59 0.0097
y>=3 -1.1436 0.5097 -2.24 0.0248
y>=4 -3.3771 0.6382 -5.29 <0.0001
y>=5 -4.9420 0.7509 -6.58 <0.0001
temp=warm 2.3212 0.7009 3.31 0.0009
contact=yes 1.3475 0.6604 2.04 0.0413
temp=warm * contact=yes 0.3595 0.9238 0.39 0.6971
Third, polr:
library(MASS)
polrfit <- polr(rating ~ temp * contact, method="logistic", data = wine)
coef:
Coefficients:
tempwarm contactyes tempwarm:contactyes
2.3211214 1.3474055 0.3596357
Intercepts:
1|2 2|3 3|4 4|5
-1.411278 1.143507 3.377005 4.941901
Last, clm():
library(ordinal)
clmfit <- clm(rating ~ temp * contact, link="logit", data = wine)
coef:
Coefficients:
tempwarm contactyes tempwarm:contactyes
2.3212 1.3475 0.3595
Threshold coefficients:
1|2 2|3 3|4 4|5
-1.411 1.144 3.377 4.942
Besides, when reverse=FALSE in vglm(),
library(VGAM)
vglmfit <- vglm(rating ~ temp * contact, data = wine,
family=cumulative(parallel=TRUE, reverse=FALSE))
Coefficients:
(Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4
-1.4112568 1.1435551 3.3770742 4.9419773
tempwarm contactyes tempwarm:contactyes
-2.3212033 -1.3474598 -0.3595241
You may notice that the coefficients from vglm() with reverse=TRUE match those from orm(), and the ones from polr() match those from clm(). So there are two sets of coefficients, and the only difference between them is the sign of the intercepts.
When I set reverse=FALSE, it does flip the sign of the intercepts, but it also flips the sign of the slope coefficients, which I don't want.
What is going on here? How can I get exactly the same result, or how can I explain the difference?
This is all just a matter of parametrizations. One classical way to introduce the ordered logistic regression model is to assume that there is a latent continuous response
y* = x'b + e
where e has a standard logistic distribution. Then it is assumed that y* itself is not observed, but only a discretized category y = j if y* falls between the cut-offs a_{j-1} and a_j. This then leads to the model equation:
logit(P(y <= j)) = a_j - x'b
Other motivations lead to similar equations but with P(y >= j) and/or a_j + x'b. This just leads to switches in the signs of the a and/or b coefficients that you observe in the different implementations. The corresponding models and predictions are equivalent, of course. Which interpretation you find easier is mostly a matter of taste.
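A quick numerical check of that equivalence, using the rounded thresholds and the tempwarm coefficient from the clm() output above (the choice of a warm, no-contact observation is arbitrary):

```r
# Rounded values copied from the clm()/polr() output in the question
a <- c(-1.411, 1.144, 3.377, 4.942)  # thresholds 1|2, 2|3, 3|4, 4|5
b_warm <- 2.3212                     # tempwarm coefficient
eta <- b_warm * 1                    # temp = warm, contact = no

# polr/clm parametrization: logit(P(y <= j)) = a_j - x'b
cum_le <- plogis(a - eta)
p_polr <- diff(c(0, cum_le, 1))

# vglm(reverse = TRUE)/orm parametrization: logit(P(y >= j + 1)) = -a_j + x'b
cum_ge <- plogis(-a + eta)
p_rev <- -diff(c(1, cum_ge, 0))

# vglm(reverse = FALSE): logit(P(y <= j)) = a_j + x'(-b), slopes change sign
cum_le_alt <- plogis(a + (-b_warm) * 1)
p_alt <- diff(c(0, cum_le_alt, 1))

print(round(rbind(p_polr, p_rev, p_alt), 4))
```

All three rows give the same category probabilities, so the fits are the same model written with different sign conventions.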