Here's my data:
subject arm treat bline change
'subject1' 'L' N 6.3597 4.9281
'subject1' 'R' T 10.3499 1.8915
'subject3' 'L' N 12.4108 -0.9008
'subject3' 'R' T 13.2422 -0.7357
'subject4' 'L' T 8.7383 2.756
'subject4' 'R' N 10.8257 -0.531
'subject5' 'L' N 7.1766 2.0536
'subject5' 'R' T 8.1369 1.9841
'subject6' 'L' T 10.3978 9.0743
'subject6' 'R' N 11.3184 3.381
'subject8' 'L' T 10.7251 2.9658
'subject8' 'R' N 10.9818 2.9908
'subject9' 'L' T 7.3745 2.9143
'subject9' 'R' N 9.4863 -3.0847
'subject10' 'L' T 11.8132 -2.1629
'subject10' 'R' N 9.5287 0.1401
'subject11' 'L' T 8.2977 6.2219
'subject11' 'R' N 9.3691 0.7408
'subject12' 'L' T 12.6003 -0.7645
'subject12' 'R' N 11.7329 0.0342
'subject13' 'L' N 9.4918 2.0716
'subject13' 'R' T 9.6205 1.5705
'subject14' 'L' T 9.3945 4.6176
'subject14' 'R' N 11.0176 1.445
'subject16' 'L' T 8.0221 1.4751
'subject16' 'R' N 9.8307 -2.3697
When I fit a mixed model with treat and arm as factors:
m <- lmer(change ~ bline + treat + arm + (1|subject), data=change1)
ls_means(m, which = NULL, level=0.95, ddf="Kenward-Roger")
The ls_means() call returns an empty table. Can anyone help with what is going wrong?
I too see empty results:
> ls_means(m, which = NULL, level=0.95, ddf="Kenward-Roger")
Least Squares Means table:
Estimate Std. Error df t value lower upper Pr(>|t|)
Confidence level: 95%
Degrees of freedom method: Kenward-Roger
However, the emmeans package works fine. You can use emmeans() or lsmeans() -- the latter just re-labels the emmeans() results. "Estimated marginal means" is a more generally appropriate term.
> library(emmeans)
> lsmeans(m, "treat")
treat lsmean SE df lower.CL upper.CL
N 0.996 0.72 15 -0.539 2.53
T 2.290 0.72 15 0.755 3.82
Results are averaged over the levels of: arm
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
> lsmeans(m, "arm")
arm lsmean SE df lower.CL upper.CL
L 1.97 0.737 15.6 0.403 3.53
R 1.32 0.737 15.6 -0.248 2.88
Results are averaged over the levels of: treat
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
I suspect that lmerTest::ls_means() does not support predictors of class "character". If you change treat and arm to factors, it may work.
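For example, a minimal sketch (using your change1 data frame and the same model call as above):
change1$treat <- factor(change1$treat)
change1$arm <- factor(change1$arm)
m <- lmer(change ~ bline + treat + arm + (1|subject), data = change1)
ls_means(m, which = NULL, level = 0.95, ddf = "Kenward-Roger")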
We're going to need more information. Here's a reproducible example that seems just fine:
set.seed(101)
library(lme4)
library(lmerTest)
dd <- expand.grid(subject=factor(1:40), arm=c("L","R"))
## replicate N/T in random order for each subject
dd$treat <- c(replicate(40,sample(c("N","T"))))
dd$bline <- rnorm(nrow(dd))
dd$change <- simulate(~bline+treat+arm+(1|subject),
                      newdata=dd,
                      newparams=list(beta=rep(1,4),
                                     theta=1,
                                     sigma=1))[[1]]
m <- lmer(change ~ bline + treat + arm + (1|subject), data=dd)
ls_means(m, which = NULL, level=0.95, ddf="Kenward-Roger")
## Least Squares Means table:
##
## Estimate Std. Error df t value lower upper Pr(>|t|)
## armL 1.37494 0.22716 55.6 6.0527 0.91981 1.83007 1.275e-07 ***
## armR 2.54956 0.22716 55.6 11.2235 2.09443 3.00469 6.490e-16 ***
My best guess at this point is that you are having some problem with model fitting. lmerTest can sometimes be opaque and swallow warnings or error messages. Did you get any warnings that you neglected to tell us about? If you re-fit the model with lme4::lmer(...) (i.e., use the basic version from lme4, not the augmented version in lmerTest), do you see any warnings?
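For example, a quick check along these lines (just a sketch; change1 is your data frame from the question):
m0 <- lme4::lmer(change ~ bline + treat + arm + (1|subject), data = change1)
summary(m0)           # look for convergence warnings in the printed output
lme4::isSingular(m0)  # TRUE would point to a degenerate random-effects fit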
I'm using emmeans to perform custom comparisons to a control group. The trt.vs.ctrl approach works perfectly for me if I'm only interested in comparing one factor, but then fails (or I fail) when I set the comparison to be more complicated (i.e., the control group is described by a specific combination of 2+ variables).
Example code is below. Say that, using the pigs data, I want to compare all diets to the low-percent fish diet. Note how in the nd data frame, "fish" only has 9% associated with it. However, when I run emmeans, the function does not pick up on the nesting; the control is correct, but the treatment groups also include fish at the other percent values, which means that the p-value adjustment is wrong.
So these are the two approaches I can think of:
How do I make emmeans pick up on the nesting in this case? Or,
How do I do the dunnettx adjustment manually (i.e., can I use adjust = "none", pull out the tests I actually want, and adjust the p-values myself)?
library(emmeans)
library(dplyr)
pigs.lm <- lm(log(conc) ~ source + factor(percent), data = pigs)
nd <- expand.grid(source = levels(pigs$source), percent = unique(pigs$percent)) %>%
filter(percent == 9 | source != "fish")
emmeans(pigs.lm, trt.vs.ctrl ~ source + percent,
data = nd, covnest = TRUE, cov.reduce = FALSE)
Appreciate your help.
The suggestion to use include worked perfectly. Posting my code here in case anyone else has the same issue in the future.
library(emmeans)
library(dplyr)
library(tidyr)
pigs.lm <- lm(log(conc) ~ source + factor(percent), data = pigs)
nd <- expand.grid(source = levels(pigs$source), percent = unique(pigs$percent)) %>%
filter(percent == 9 | source != "fish")
ems <- emmeans(pigs.lm, trt.vs.ctrl ~ source + percent,
data = nd, covnest = TRUE, cov.reduce = FALSE)
# to identify which levels to exclude - in this case,
# I only want the low-percent fish to remain as the ref level
aux <- as.data.frame(ems[[1]]) %>%
mutate(ID = 1:n()) %>%
filter(!grepl("fish", source) | ID == 1)
emmeans(pigs.lm, trt.vs.ctrl ~ source + percent,
data = nd, covnest = TRUE, cov.reduce = FALSE, include = aux$ID)
I'm not totally clear on what you are trying to accomplish, but I don't think filtering the data is the solution.
If your goal is to compare the marginal means for source with the (fish, 9 percent) combination, you can do it by constructing two sets of emmeans, then subsetting and combining:
emm1 = emmeans(pigs.lm, "source")
emm2 = emmeans(pigs.lm, ~source*percent)
emm3 = emm2[1] + emm1 # or rbind(emm2[1], emm1)
Then you get
> confint(emm3, adjust ="none")
source percent emmean SE df lower.CL upper.CL
fish 9 3.22 0.0536 23 3.11 3.33
fish . 3.39 0.0367 23 3.32 3.47
soy . 3.67 0.0374 23 3.59 3.74
skim . 3.80 0.0394 23 3.72 3.88
Results are averaged over some or all of the levels of: percent
Results are given on the log (not the response) scale.
Confidence level used: 0.95
> contrast(emm3, "trt.vs.ctrl1")
contrast estimate SE df t.ratio p.value
fish,. - fish,9 0.174 0.0366 23 4.761 0.0002
soy,. - fish,9 0.447 0.0678 23 6.595 <.0001
skim,. - fish,9 0.576 0.0696 23 8.286 <.0001
Results are averaged over some or all of the levels of: percent
Results are given on the log (not the response) scale.
P value adjustment: dunnettx method for 3 tests
Another (much more tedious, more error-prone) way to do the same thing is to get the EMMs for the factor combinations, and then use custom contrasts:
> contrast(emm2, list(con1 = c(-3,0,0, 1,0,0, 1,0,0, 1,0,0)/4,
+ con2 = c(-4,1,0, 0,1,0, 0,1,0, 0,1,0)/4,
+ con3 = c(-4,0,1, 0,0,1, 0,0,1, 0,0,1)/4),
+ adjust = "mvt")
contrast estimate SE df t.ratio p.value
con1 0.174 0.0366 23 4.761 0.0002
con2 0.447 0.0678 23 6.595 <.0001
con3 0.576 0.0696 23 8.286 <.0001
Results are given on the log (not the response) scale.
P value adjustment: mvt method for 3 tests
(The mvt adjustment is the exact correction for which dunnettx is only an approximation. It doesn't default to mvt because it is computationally heavy for a large number of tests.)
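With only three tests here, the exact adjustment is cheap, so you can request it directly, e.g.:
contrast(emm3, "trt.vs.ctrl1", adjust = "mvt")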
In answer to the last part of the question, you may use exclude (or include) to focus on a subset of the levels; see ? pairwise.emmc.
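For instance, with emm2 from above, something like the sketch below drops the other fish cells from the comparison set while keeping (fish, 9 percent) as the reference. The indices 4, 7, 10 are an assumption based on the default ordering of the source-by-percent grid, so check them against summary(emm2) first:
contrast(emm2, "trt.vs.ctrl1", exclude = c(4, 7, 10))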
I am replicating a negative binomial regression model in R. When calculating robust standard errors, the output does not match the standard errors reported by Stata.
The original Stata code is
nbreg displaced eei lcostofwar cfughh roadskm lpopdensity ltkilled, robust nolog
I have attempted both manual calculation and vcovHC from sandwich. However, neither produces the same results.
My regression model is as follows:
mod1 <- glm.nb(displaced ~ eei + costofwar_log + cfughh + roadskm + popdensity_log + tkilled_log, data = mod1_df)
With vcovHC I have tried every option from HC0 to HC5.
Attempt 1:
cov_m1 <- vcovHC(mod1, type = "HC0", sandwich = T)
se <- sqrt(diag(cov_m1))
Attempt 2:
mod1_rob <- coeftest(mod1, vcovHC = vcov(mod1, type = "HC0"))
The closest match has been with HC0 and vcov = sandwich, but none of the SEs are correct.
Any suggestions?
EDIT
My output is as follows (using HC0):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3281183 1.5441312 0.8601 0.389730
eei -0.0435529 0.0183359 -2.3753 0.017536 *
costofwar_log 0.2984376 0.1350518 2.2098 0.027119 *
cfughh -0.0380690 0.0130254 -2.9227 0.003470 **
roadskm 0.0020812 0.0010864 1.9156 0.055421 .
popdensity_log -0.4661079 0.1748682 -2.6655 0.007688 **
tkilled_log 1.0949084 0.2159161 5.0710 3.958e-07 ***
The Stata output I am attempting to replicate is:
Estimate Std. Error
(Intercept) 1.328 1.272
eei -0.044 0.015
costofwar_log 0.298 0.123
cfughh -0.038 0.018
roadskm 0.002 0.0001
popdensity_log -0.466 0.208
tkilled_log 1.095 0.209
The dataset is found here and the recoded variables are:
mod1_df <- table %>%
select(displaced, eei_01, costofwar, cfughh, roadskm, popdensity,
tkilled)
mod1_df$popdensity_log <- log(mod1_df$popdensity + 1)
mod1_df$tkilled_log <- log(mod1_df$tkilled + 1)
mod1_df$costofwar_log <- log(mod1_df$costofwar + 1)
mod1_df$eei <- mod1_df$eei_01*100
Stata uses the observed Hessian for its computations, while glm.nb() uses the expected Hessian. Therefore, the default bread() employed by the sandwich() function is different, leading to different results. There are other R packages that employ the observed Hessian for their variance-covariance estimates (e.g., gamlss), but these do not supply an estfun() method for the sandwich package.
Hence, below I simply set up a dedicated bread_obs() function that extracts the ML estimates from a negbin object, sets up the negative log-likelihood, computes the observed Hessian numerically via numDeriv::hessian() and computes the "bread" from it (omitting the estimate for log(theta)):
bread_obs <- function(object, method = "BFGS", maxit = 5000, reltol = 1e-12, ...) {
  ## data and estimated parameters
  Y <- model.response(model.frame(object))
  X <- model.matrix(object)
  par <- c(coef(object), "log(theta)" = log(object$theta))
  ## dimensions
  n <- NROW(X)
  k <- length(par)
  ## nb log-likelihood
  nll <- function(par) suppressWarnings(-sum(dnbinom(Y,
    mu = as.vector(exp(X %*% head(par, -1))),
    size = exp(tail(par, 1)), log = TRUE)))
  ## covariance based on observed Hessian
  rval <- numDeriv::hessian(nll, par)
  rval <- solve(rval) * n
  rval[-k, -k]
}
With that function I can compare the sandwich() output (based on the expected Hessian) with the output using the bread_obs() (based on the observed Hessian).
s_exp <- sandwich(mod1)
s_obs <- sandwich(mod1, bread. = bread_obs)
cbind("Coef" = coef(mod1), "SE (Exp)" = sqrt(diag(s_exp)), "SE (Obs)" = sqrt(diag(s_obs)))
## Coef SE (Exp) SE (Obs)
## (Intercept) 1.328 1.259 1.259
## eei -0.044 0.017 0.015
## costofwar_log 0.298 0.160 0.121
## cfughh -0.038 0.015 0.018
## roadskm 0.002 0.001 0.001
## popdensity_log -0.466 0.135 0.207
## tkilled_log 1.095 0.179 0.208
This still has slight differences compared to Stata but these are likely numerical differences from the optimization etc.
If you create a new dedicated bread() method for negbin objects
bread.negbin <- bread_obs
then the method dispatch will use this if you do sandwich(mod1).
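With that method registered, the usual workflow applies unchanged; for example (a sketch; coeftest() is from lmtest, which the question already uses):
library(lmtest)
sandwich(mod1)                   # now uses the observed-Hessian bread via bread.negbin
coeftest(mod1, vcov = sandwich)  # z tests based on these robust standard errors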
In R you need to provide the degrees-of-freedom correction manually, so try the following, which I borrowed from this source:
dfa <- (G/(G - 1)) * (N - 1)/pm1$df.residual
# display with cluster VCE and df-adjustment
firm_c_vcov <- dfa * vcovHC(pm1, type = "HC0", cluster = "group", adjust = T)
coeftest(pm1, vcov = firm_c_vcov)
Here G is the number of panels in your data set, N is the number of observations, and pm1 is your estimated model. Obviously, you could drop the clustering.
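As a sketch of where G and N could come from (pm1 and its index are assumptions here, since the fitted model is not shown):
G <- length(unique(index(pm1)[[1]]))  # number of panels, i.e. levels of the first index variable
N <- length(residuals(pm1))           # number of observations used in the fit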
I have the (sample) dataset below:
round<-c( 0.125150, 0.045800, -0.955299, -0.232007, 0.120880, -0.041525, 0.290473, -0.648752, 0.113264, -0.403685)
square<-c(-0.634753, 0.000492, -0.178591, -0.202462, -0.592054, -0.583173, -0.632375, -0.176673, -0.680557, -0.062127)
ideo<-c(0,1,0,1,0,1,0,0,1,1)
ex<-data.frame(round,square,ideo)
When I ran the GEE regression in SPSS, I got this table as a result.
I used the gee and geepack packages in R to run the same analysis, and I got these results:
#gee
summary(gee(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 1.0541 0.4099 2.572 0.1328 7.937
square 1.1811 0.8321 1.419 0.4095 2.884
round 0.7072 0.5670 1.247 0.1593 4.439
#geepack
summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 1.054 0.133 63.00 2.1e-15 ***
square 1.181 0.410 8.32 0.0039 **
round 0.707 0.159 19.70 9.0e-06 ***
---
I would like to recreate exactly the SPSS table (not the numbers themselves, as I am using a subset of the original dataset), but I do not know how to obtain all of these quantities.
A tiny bit of tidyverse magic can get the same results - more or less.
Get the information from coef(summary(geeglm())) and compute the necessary columns:
library("tidyverse")
library("geepack")
coef(summary(geeglm(ideo ~ square + round, data = ex, id = ideo,
                    corstr = "independence"))) %>%
  mutate(lowerWald = Estimate - 1.96 * Std.err,  # Lower Wald CI
         upperWald = Estimate + 1.96 * Std.err,  # Upper Wald CI
         df = 1,
         ExpBeta = exp(Estimate)) %>%            # Transformed estimate
  mutate(lWald = exp(lowerWald),                 # Lower transformed
         uWald = exp(upperWald))                 # Upper transformed
This produces the following (with the data you provided). The order and the names of the columns can be modified to suit your needs (see the sketch after the output):
Estimate Std.err Wald Pr(>|W|) lowerWald upperWald df ExpBeta lWald uWald
1 1.0541 0.1328 62.997 2.109e-15 0.7938 1.314 1 2.869 2.212 3.723
2 1.1811 0.4095 8.318 3.925e-03 0.3784 1.984 1 3.258 1.460 7.270
3 0.7072 0.1593 19.704 9.042e-06 0.3949 1.019 1 2.028 1.484 2.772
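For instance, a sketch of reordering and renaming the columns (the labels chosen here are illustrative, not SPSS's exact ones):
tab <- coef(summary(geeglm(ideo ~ square + round, data = ex, id = ideo,
                           corstr = "independence"))) %>%
  mutate(lowerWald = Estimate - 1.96 * Std.err,
         upperWald = Estimate + 1.96 * Std.err,
         df = 1,
         ExpBeta = exp(Estimate),
         lWald = exp(lowerWald),
         uWald = exp(upperWald)) %>%
  select(B = Estimate, `Std. Error` = Std.err, lowerWald, upperWald,
         Wald, df, Sig. = `Pr(>|W|)`, ExpBeta, lWald, uWald)
tab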
I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, that vary over time for a set of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second call I specify the indices (region and year), the model type ('within', i.e. FE), and the nature of the FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I drop the time fixed effects and use the argument effect = 'individual' instead.
What's the deal here? Am I missing something? Are there any other R packages that allow me to run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made-up data. You can also use felm from the lfe package to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in first differences (in Python):
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the plm package in R gives the same results for the first-differenced model:
library(plm)
modelfd <- plm(lrent~lpop + lavginc + pctstu,
               data=data, model = "fd")
No problem so far. However, the fixed-effects estimation reports different estimates.
modelfx <- plm(lrent~lpop + lavginc + pctstu, data=data,
               model = "within", effect="time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.
I am fitting ordered logit models with different packages -- VGAM, rms, MASS and ordinal -- using the wine data set from the ordinal package.
First is vglm():
library(VGAM)
vglmfit <- vglm(rating ~ temp * contact, data = wine,
family=cumulative(parallel=TRUE, reverse=TRUE))
The coefficients are:
(Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4
1.4112568 -1.1435551 -3.3770742 -4.9419773
tempwarm contactyes tempwarm:contactyes
2.3212033 1.3474598 0.3595241
Second is orm():
library(rms)
ormfit <- orm(rating ~ temp * contact, data = wine)
Coef:
Coef S.E. Wald Z Pr(>|Z|)
y>=2 1.4113 0.5454 2.59 0.0097
y>=3 -1.1436 0.5097 -2.24 0.0248
y>=4 -3.3771 0.6382 -5.29 <0.0001
y>=5 -4.9420 0.7509 -6.58 <0.0001
temp=warm 2.3212 0.7009 3.31 0.0009
contact=yes 1.3475 0.6604 2.04 0.0413
temp=warm * contact=yes 0.3595 0.9238 0.39 0.6971
Third, polr:
library(MASS)
polrfit <- polr(rating ~ temp * contact, method="logistic", data = wine)
coef:
Coefficients:
tempwarm contactyes tempwarm:contactyes
2.3211214 1.3474055 0.3596357
Intercepts:
1|2 2|3 3|4 4|5
-1.411278 1.143507 3.377005 4.941901
Last, clm():
library(ordinal)
clmfit <- clm(rating ~ temp * contact, link="logit", data = wine)
coef:
Coefficients:
tempwarm contactyes tempwarm:contactyes
2.3212 1.3475 0.3595
Threshold coefficients:
1|2 2|3 3|4 4|5
-1.411 1.144 3.377 4.942
Besides, when I set reverse=FALSE in vglm():
library(VGAM)
vglmfit <- vglm(rating ~ temp * contact, data = wine,
family=cumulative(parallel=TRUE, reverse=FALSE))
Coefficients:
(Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4
-1.4112568 1.1435551 3.3770742 4.9419773
tempwarm contactyes tempwarm:contactyes
-2.3212033 -1.3474598 -0.3595241
You may notice that the coefficients from vglm() with reverse=TRUE and those from orm() are the same, and that the ones from polr() and clm() are the same. So there are two sets of coefficients, and the only difference between them is the sign of the intercepts.
And when I set reverse=FALSE, it does flip the sign of the intercepts, but it also flips the sign of the slope coefficients, which I don't want.
What is going on here? How can I get exactly the same results from all of them, or how should I explain the differences?
This is all just a matter of parametrizations. One classical way to introduce the ordered logistic regression model is to assume that there is a latent continuous response
y* = x'b + e
where e has a standard logistic distribution. Then, it is assumed that not y* itself is observed but only a discretized category y = j if y* falls between the cut-offs a_{j-1} and a_j. This then leads to the model equation:
logit(P(y <= j)) = a_j - x'b
Other motivations lead to similar equations but with P(y >= j) and/or a_j + x'b. This just leads to switches in the signs of the a and/or b coefficients that you observe in the different implementations. The corresponding models and predictions are equivalent, of course. Which interpretation you find easier is mostly a matter of taste.
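As a quick numerical check (a sketch re-using the fits from the question; vglm is refit with reverse=TRUE so the result does not depend on which block was run last), the slopes agree directly and the vglm intercepts are just the negated clm thresholds:
library(VGAM)
library(ordinal)
vglm_rev <- vglm(rating ~ temp * contact, data = wine,
                 family = cumulative(parallel = TRUE, reverse = TRUE))
clm_fit <- clm(rating ~ temp * contact, link = "logit", data = wine)
cbind(vglm = coef(vglm_rev)[1:4], minus_clm = -clm_fit$alpha)  # intercepts vs. negated thresholds
cbind(vglm = coef(vglm_rev)[-(1:4)], clm = clm_fit$beta)       # identical slope estimates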