There are two methods available to estimate confidence intervals for a gls model in R: the confint function and the intervals function. The results are not the same, and I want to know what causes the differences and which method is preferred for gls (and also lme) models.
I will use the cats data set for this example, with four different approaches to estimate the mean difference (MD) of Hwt between the sexes:
t-test (heterogeneous variance)
Linear model, using lm (homogeneous variance)
Linear model, using gls (homogeneous variance)
Heteroscedastic linear model, using gls (heterogeneous variance)
For the gls approaches, both confint and intervals are available for calculating confidence intervals.
Here is the code:
library(pacman)
p_load(tidyverse)
p_load(MASS)
p_load(nlme)
set.seed(150)
cats%>%ggplot(aes(x=Sex,y=Hwt))+
geom_boxplot()+theme_bw()
###different approaches for the same mean difference estimation
cats_ttest<-t.test(Hwt~Sex,data=cats)
cats$Sex<-relevel(cats$Sex,ref="M")
cats_lm<-lm(Hwt~Sex,data=cats)
cats_gls_hom<-gls(Hwt~Sex,data=cats)
cats_gls_het<-gls(Hwt~Sex,weights=varIdent(form=~1|Sex),data=cats)
###store estimations and CI's from different approaches
a<-rbind(confint(cats_lm),confint(cats_gls_hom),confint(cats_gls_het),
intervals(cats_gls_hom,which = "coef")$coef[,c(1,3)],
intervals(cats_gls_het,which = "coef")$coef[,c(1,3)]) %>% data.frame%>% {cbind(par=rownames(.),.)}
a$par<-a$par %>% str_remove_all("X.|.1|.2|.3|.4")
a$par<-factor(a$par,levels =c("Intercept.","SexF"),
labels =c("Intercept.","SexF") )
a$est<-c(rep(cats_lm %>% coef,3),
cats_gls_hom %>% coef,cats_gls_het %>% coef
)
a$mod<-c(rep("cats_lm_ci",2),rep("cats_gls_hom_ci",2),rep("cats_gls_het_ci",2),
rep("cats_gls_hom_int",2),rep("cats_gls_het_int",2)
)
colnames(a)[2:3]<-c("LCI","UCI")
a<-rbind(data.frame(par="SexF",LCI=cats_ttest$conf.int[1],
UCI=cats_ttest$conf.int[2],est=cats_ttest$estimate[1]-cats_ttest$estimate[2],
mod="ttest"),a)
a$mod<-factor(a$mod,levels =c("ttest","cats_lm_ci","cats_gls_hom_ci","cats_gls_het_ci","cats_gls_hom_int","cats_gls_het_int"))
a$diff<-a$UCI-a$LCI
rownames(a)<-NULL
###results
a[order(a$par,a$diff),]
#> par LCI UCI est mod diff
#> 4 Intercept. 10.879181 11.766179 11.322680 cats_gls_hom_ci 0.8869980
#> 2 Intercept. 10.875369 11.769992 11.322680 cats_lm_ci 0.8946223
#> 8 Intercept. 10.875369 11.769992 11.322680 cats_gls_hom_int 0.8946223
#> 6 Intercept. 10.816754 11.828606 11.322680 cats_gls_het_ci 1.0118521
#> 10 Intercept. 10.812406 11.832955 11.322680 cats_gls_het_int 1.0205495
#> 7 SexF -2.758218 -1.482888 -2.120553 cats_gls_het_ci 1.2753295
#> 11 SexF -2.763699 -1.477407 -2.120553 cats_gls_het_int 1.2862917
#> 1 SexF -2.763753 -1.477352 -2.120553 ttest 1.2864011
#> 5 SexF -2.896844 -1.344261 -2.120553 cats_gls_hom_ci 1.5525835
#> 3 SexF -2.903517 -1.337588 -2.120553 cats_lm_ci 1.5659288
#> 9 SexF -2.903517 -1.337588 -2.120553 cats_gls_hom_int 1.5659288
a %>% ggplot(aes(x=par,y=est,color=mod,group=mod))+geom_point(position=position_dodge(0.5))+
geom_errorbar(aes(ymin=LCI, ymax=UCI), width=.2,
position=position_dodge(0.5))+theme_bw()
Created on 2022-09-11 by the reprex package (v2.0.1)
As you can see, there are mild differences in CI widths across the methods and, as expected, the methods that account for differences in variance produce the narrowest CIs for the mean difference (parameter SexF in data frame a).
So, why are there two methods available for estimating confidence intervals for gls models, what are the differences between them, and which one is preferred for this kind of model?
tl;dr use intervals(), it gives you CIs based on a Student-t rather than a Normal sampling distribution.
If you look at methods(class = "gls") you'll see that confint() is not listed. That means that when you call confint(gls_fit), R falls back to the default confint method. If we look at the code for stats::confint.default you'll see fac <- qnorm(a); ...; ci[] <- cf[parm] + ses %o% fac. In other words, confint.default is constructing CIs based on a Normal distribution.
In contrast, nlme:::intervals.gls uses
len <- -qt((1 - level)/2, dims$N - dims$p) * sqrt(diag(object$varBeta))
— i.e., an interval based on a t-distribution.
It makes very little difference in this case (CI width of 1.55 vs. 1.56).
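As a quick hand check (assuming the fits from your reprex are still in the workspace), you can reproduce both intervals for SexF directly from the coefficient and its standard error; the Normal-based version matches confint() and the t-based version matches intervals():
beta_f <- coef(cats_gls_hom)["SexF"]
se_f <- sqrt(diag(cats_gls_hom$varBeta))["SexF"]
n <- nrow(cats); p <- length(coef(cats_gls_hom))
beta_f + c(-1, 1) * qnorm(0.975) * se_f  ## Normal-based: matches confint(cats_gls_hom)
beta_f + c(-1, 1) * qt(0.975, n - p) * se_f  ## t-based: matches intervals(cats_gls_hom)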
For what it's worth, you can streamline this kind of comparison a little bit using broom/broom.mixed (although this does not include the confint.default option for gls!)
library(broom)
library(broom.mixed)
options(pillar.sigfig = 7)
(tibble::lst(cats_ttest, cats_lm, cats_gls_hom, cats_gls_het)
|> map_dfr(tidy, .id = "model", conf.int = TRUE)
## t-test doesn't have a "term" element
|> mutate(across(term, ~ifelse(is.na(.), "SexF", term)))
|> select(model, term, estimate, lwr = conf.low, upr = conf.high)
|> mutate(width = upr - lwr)
|> arrange(term)
)
As a general rule, you should use the most specific method available — this usually happens automatically, it's sort of an accident that confint() works for gls objects (partly because the nlme package predates R itself, so doesn't follow all of its conventions ...)
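You can see the dispatch for yourself: intervals() has a gls-specific method, whereas no confint.gls method exists, so confint() silently falls back to the default (a quick check; nlme must be attached):
library(nlme)
grep("intervals", as.character(methods(class = "gls")), value = TRUE)  ## "intervals.gls"
grep("gls", as.character(methods("confint")), value = TRUE)  ## character(0): falls back to confint.default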
I have a sample of more than 50 million observations. I estimate the following model in R:
model1 <- feglm(rejection~ variable1+ variable1^2 + variable2+ variable3+ variable4 | city_fixed_effects + year_fixed_effects, family=binomial(link="logit"), data=database)
Based on the estimates from model1, I calculate the marginal effects:
mfx2 <- marginaleffects(model1)
summary(mfx2)
This line of code also calculates the marginal effects of each fixed effect, which slows down R. I only need to calculate the average marginal effects of variables 1, 2, and 3. If I separately calculate the marginal effects using mfx2 <- marginaleffects(model1, variables = "variable1"), then it does not show the standard error and the p-value of the average marginal effects.
Any solution for this issue?
Both the fixest and the marginaleffects packages have made recent
changes to improve interoperability. The next official CRAN releases
will be able to do this, but as of 2021-12-08 you can use the
development versions. Install:
library(remotes)
install_github("lrberge/fixest")
install_github("vincentarelbundock/marginaleffects")
I recommend converting your fixed effects variables to factors before
fitting your models:
library(fixest)
library(marginaleffects)
dat <- mtcars
dat$gear <- as.factor(dat$gear)
mod <- feglm(am ~ mpg + mpg^2 + hp + hp^3| gear,
family = binomial(link = "logit"),
data = dat)
Then, you can use marginaleffects and summary to compute average
marginal effects:
mfx <- marginaleffects(mod, variables = "mpg")
summary(mfx)
## Average marginal effects
## type Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 response mpg 0.3352 40 0.008381 0.99331 -78.06 78.73
##
## Model type: fixest
## Prediction type: response
Note that computing average marginal effects requires calculating a
distinct marginal effect for every single row of your dataset. This can
be computationally expensive when your data includes millions of
observations.
Instead, you can compute marginal effects for specific values of the
regressors using the newdata argument and the typical function.
Please refer to the marginaleffects documentation for details on
those:
marginaleffects(mod,
variables = "mpg",
newdata = typical(mpg = 22, gear = 4))
## rowid type term dydx std.error hp mpg gear predicted
## 1 1 response mpg 1.068844 50.7849 146.6875 22 4 0.4167502
How can I use the rms package in R to execute a negative binomial regression? (I originally posted this question on Statistics SE, but it was closed apparently because it is a better fit here.)
With the MASS package, I use the glm.nb function, but I am trying to switch to the rms package because I sometimes get weird errors when bootstrapping with glm.nb and some other functions. But I cannot figure out how to do a negative binomial regression with the rms package.
Here is sample code of what I would like to do (copied from the rms::Glm function documentation):
library(rms)
## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
f <- Glm(counts ~ outcome + treatment, family=poisson())
f
anova(f)
summary(f, outcome=c('1','2','3'), treatment=c('1','2','3'))
So, instead of using family=poisson(), I would like to use something like family=negative.binomial(), but I cannot figure out how to do this.
In the documentation for family {stats}, I found this note in the "See also" section:
For binomial coefficients, choose; the binomial and negative binomial distributions, Binomial, and NegBinomial.
But even after clicking the link for ?NegBinomial, I cannot make any sense of this.
I would appreciate any help on how to use the rms package in R to execute a negative binomial regression.
opinion up front You might be better off posting (as a separate question) a reproducible example of the "weird errors" from your bootstrap attempts and seeing whether people have ideas for resolving them. It's fairly common for NB fitting procedures to throw warnings or errors when data are equi- or underdispersed, as the estimates of the dispersion parameter become infinite in this case ...
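As a small illustration (made-up data, unrelated to yours): fitting an NB model to equidispersed Poisson counts typically drives the estimate of theta toward infinity and produces exactly this kind of warning.
set.seed(1)
y_pois <- rpois(50, lambda = 5)  ## equidispersed (Poisson) counts
m_pois <- MASS::glm.nb(y_pois ~ 1)  ## usually warns: "iteration limit reached"
m_pois$theta  ## very large estimate of the dispersion parameter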
@coffeinjunky is correct that using family = negative.binomial(theta=VALUE) will work (where VALUE is a numeric constant, e.g. theta=1 for the geometric distribution [a special case of the NB]). However, you won't be able (without significantly more work) to fit the general NB model, i.e. the model where the dispersion parameter (theta) is estimated as part of the fitting procedure. That's what MASS::glm.nb does, and AFAICS there is no analogue in the rms package.
There are a few other packages/functions in addition to MASS::glm.nb that fit the negative binomial model, including (at least) bbmle and glmmTMB — there may be others such as gamlss.
## Dobson (1990) Page 93: Randomized Controlled Trial :
dd <- data.frame(
counts = c(18,17,15,20,10,20,25,13,12),
outcome = gl(3,1,9),
treatment = gl(3,3))
MASS::glm.nb
library(MASS)
m1 <- glm.nb(counts ~ outcome + treatment, data = dd)
## "iteration limit reached" warning
glmmTMB
library(glmmTMB)
m2 <- glmmTMB(counts ~ outcome + treatment, family = nbinom2, data = dd)
## "false convergence" warning
bbmle
library(bbmle)
m3 <- mle2(counts ~ dnbinom(mu = exp(logmu), size = exp(logtheta)),
parameters = list(logmu ~outcome + treatment),
data = dd,
start = list(logmu = 0, logtheta = 0)
)
signif(cbind(MASS=coef(m1), glmmTMB=fixef(m2)$cond, bbmle=coef(m3)[1:5]), 5)
MASS glmmTMB bbmle
(Intercept) 3.0445e+00 3.04540000 3.0445e+00
outcome2 -4.5426e-01 -0.45397000 -4.5417e-01
outcome3 -2.9299e-01 -0.29253000 -2.9293e-01
treatment2 -1.1114e-06 0.00032174 8.1631e-06
treatment3 -1.9209e-06 0.00032823 6.5817e-06
These all agree fairly well (at least for the intercept/outcome parameters). This example is fairly difficult for a NB model (5 parameters + dispersion for 9 observations, data are Poisson rather than NB).
Based on this, the following seems to work:
library(rms)
library(MASS)
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
Glm(counts ~ outcome + treatment, family = negative.binomial(theta = 1))
General Linear Model
rms::Glm(formula = counts ~ outcome + treatment, family = negative.binomial(theta = 1))
                       Model Likelihood
                             Ratio Test
Obs              9    LR chi2      0.31
Residual d.f.    4    d.f.            4
g        0.2383063    Pr(> chi2) 0.9892
Coef S.E. Wald Z Pr(>|Z|)
Intercept 3.0756 0.2121 14.50 <0.0001
outcome=2 -0.4598 0.2333 -1.97 0.0487
outcome=3 -0.2962 0.2327 -1.27 0.2030
treatment=2 -0.0347 0.2333 -0.15 0.8819
treatment=3 -0.0503 0.2333 -0.22 0.8293
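If you want theta estimated from the data rather than fixed in advance, one rough workaround (just a sketch; it ignores the uncertainty from estimating theta) is to estimate it first with MASS::glm.nb and plug the result into Glm():
theta_hat <- MASS::glm.nb(counts ~ outcome + treatment)$theta
Glm(counts ~ outcome + treatment, family = negative.binomial(theta = theta_hat))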
I have a multinomial logit model created with the nnet R package, using the multinom command. The dependent variable has three categories/choice options. I am modelling the probability of selecting a certain irrigation type (no irrigation, surface irrigation, drip irrigation) based on farmer characteristics.
I would like to estimate marginal effects, i.e. by how much does the probability of selecting irrigation type Y change when I increase independent variable X by one unit? I have tried doing this with the margins package (marginal_effects), but this gives only 1 value per observation in the dataset. I was expecting three values, since I want the marginal effect for each of the three irrigation types.
Does someone know if there is a better R package to use for this? Or whether I am doing something wrong with the margins packages? Thank you.
You can use the marginaleffects package to do that (disclaimer: I am the maintainer). Please note the warning.
library(nnet)
library(marginaleffects)
mod <- multinom(factor(cyl) ~ hp + mpg, data = mtcars, trace = FALSE)
mfx <- marginaleffects(mod, type = "probs")
## Warning in sanity_model_specific.multinom(model, ...): The standard errors
## estimated by `marginaleffects` do not match those produced by Stata for
## `nnet::multinom` models. Please be very careful when interpreting the results.
summary(mfx)
## Average marginal effects
## type Group Term Effect Std. Error z value Pr(>|z|) 2.5 %
## 1 probs 6 hp 2.792e-04 0.000e+00 Inf < 2.22e-16 2.792e-04
## 2 probs 6 mpg -1.334e-03 0.000e+00 -Inf < 2.22e-16 -1.334e-03
## 3 probs 8 hp 2.396e-05 1.042e-126 2.298e+121 < 2.22e-16 2.396e-05
## 4 probs 8 mpg -2.180e-04 1.481e-125 -1.472e+121 < 2.22e-16 -2.180e-04
## 97.5 %
## 1 2.792e-04
## 2 -1.334e-03
## 3 2.396e-05
## 4 -2.180e-04
##
## Model type: multinom
## Prediction type: probs
The marginaleffects package should work in theory, but in my case it fails because it tries to allocate a 1.5 GB vector, which exceeds my available RAM. It's not even that large a dataset, which is odd.
If you use marginal_effects() (margins package) for multinomial models, it only displays the output for the default category. You have to manually set each category you want to see. You can clean up the output with broom and then combine the results some other way. It's clunky, but it can work.
marginal_effects(model, category = 'cat1')
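Here is a rough sketch of that loop; the category labels are placeholders for the three irrigation types and model stands for your multinom fit, with the category argument used as in the call above:
library(margins)
library(purrr)
cats <- c("none", "surface", "drip")  ## hypothetical outcome labels
mfx_list <- set_names(map(cats, ~ marginal_effects(model, category = .x)), cats)
## average the unit-level effects to get AMEs, one row per outcome category
map_dfr(mfx_list, ~ as.data.frame(t(colMeans(.x))), .id = "category")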
I have a fixed effects model with only few observations and would therefore like to bootstrap in order to obtain more accurate standard errors. At the same time, I assume SE to be clustered thus I would also like to correct for clustering, i.e. do a cluster bootstrap.
I found a function for lm models (vcovBS), however could not find anything for plm models. Does anybody know an analogous function to obtain cluster bootstrapped SE for fixed effects models?
The clusterSEs package has an implementation of the wild cluster bootstrap for plm models: https://www.rdocumentation.org/packages/clusterSEs/versions/2.6.2/topics/cluster.wild.plm.
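A minimal sketch of that route (y, x, firm, year and dat are placeholders for your own data, and the argument names follow the linked help page, so double-check them against your installed version):
library(plm)
library(clusterSEs)
p_mod <- plm(y ~ x, data = dat, model = "within", index = c("firm", "year"))
## wild cluster bootstrap on the group (firm) dimension
cluster.wild.plm(p_mod, dat = dat, cluster = "group", boot.reps = 999)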
An alternative package is fwildclusterboot. It does not work with plm but with two other fixed effects regression packages, lfe and fixest, and should be significantly faster than clusterSEs.
With the fixest package, its syntax would look like this:
library(fixest)
library(fwildclusterboot)
# load data set voters included in fwildclusterboot
data(voters)
# estimate the regression model via feols
feols_fit <- feols(proposition_vote ~ treatment + ideology1 + log_income + Q1_immigration , data = voters)
# bootstrap inference
boot_feols <- boottest(feols_fit, clustid = "group_id1", param = "treatment", B = 9999)
summary(boot_feols)
#> boottest.fixest(object = lm_fit, clustid = "group_id1", param = "treatment",
#> B = 9999)
#>
#> Observations: 300
#> Bootstr. Iter: 9999
#> Bootstr. Type: rademacher
#> Clustering: 1-way
#> Confidence Sets: 95%
#> Number of Clusters: 40
#>
#> term estimate statistic p.value conf.low conf.high
#> 1 treatment 0.073 3.786 0.001 0.033 0.114
I'm running a piecewise linear random coefficient model testing the influence of a covariate on the second piece. Thereby, I want to test whether the coefficient of the second piece under the influence of the covariate (piece2 + piece2:covariate) differs from the coefficient of the first piece (piece1), hence whether the growth rate differs.
I set up some exemplary data:
set.seed(100)
# set up dependent variable
temp <- rep(seq(0,23),50)
y <- c(rep(seq(0,23),50)+rnorm(24*50), ifelse(temp <= 11, temp + runif(1200), temp + rnorm(1200) + (temp/sqrt(temp))))
# set up ID variable, variables indicating pieces and the covariate
id <- sort(rep(seq(1,100),24))
piece1 <- rep(c(seq(0,11), rep(11,12)),100)
piece2 <- rep(c(rep(0,12), seq(1,12)),100)
covariate <- c(rep(0,24*50), rep(c(rep(0,12), rep(1,12)), 50))
# data frame
example.data <- data.frame(id, y, piece1, piece2, covariate)
# run piecewise linear random effects model and show results
library(lme4)
lmer.results <- lmer(y ~ piece1 + piece2*covariate + (1|id) , example.data)
summary(lmer.results)
I came across the linearHypothesis() command from the car package to test differences in coefficients. However, I could not find an example on how to use it when including interactions.
Can I even use linearHypothesis() to test this or am I aiming for the wrong test?
I appreciate your help.
Many thanks in advance!
Mac
Assuming your output looks like this
Estimate Std. Error t value
(Intercept) 0.26293 0.04997 5.3
piece1 0.99582 0.00677 147.2
piece2 0.98083 0.00716 137.0
covariate 2.98265 0.09042 33.0
piece2:covariate 0.15287 0.01286 11.9
if I understand correctly what you want, you are looking for the contrast:
piece1-(piece2+piece2:covariate)
or
c(0,1,-1,0,-1)
My preferred tool for this is the estimable function in gmodels; you could also do it by hand or with one of the functions in Frank Harrell's packages.
library(gmodels)
estimable(lmer.results,c(0,1,-1,0,-1),conf.int=TRUE)
giving
Estimate Std. Error p value Lower.CI Upper.CI
(0 1 -1 0 -1) -0.138 0.0127 0 -0.182 -0.0928
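For completeness, the same contrast can be tested with linearHypothesis() from car (the function you asked about), or computed by hand from the fixed effects and their covariance matrix; this is a sketch assuming the lmer fit from your example:
library(car)
linearHypothesis(lmer.results, "piece1 - piece2 - piece2:covariate = 0")
## by hand: estimate, SE and a Wald 95% CI for the contrast
K <- c(0, 1, -1, 0, -1)
est <- drop(K %*% fixef(lmer.results))
se <- drop(sqrt(t(K) %*% as.matrix(vcov(lmer.results)) %*% K))
c(estimate = est, lower = est - 1.96 * se, upper = est + 1.96 * se)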