I fitted a model with MCMCglmm (MCMCglmm package). Here is the summary of this model:
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 211.0108
G-structure: ~Region
post.mean l-95% CI u-95% CI eff.samp
Region 0.2164 5.163e-17 0.358 1000
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.5529 0.1808 1.045 449.3
Location effects: Abondance ~ Human_impact/Fish.sp
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 1.335628 0.780363 1.907249 642.4 0.004 **
Human_impact 0.005781 -0.294084 0.347743 876.6 0.914
Human_impact:Fish.spA. perideraion -0.782846 -1.158798 -0.399131 649.9 <0.001 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Where are the coefficients?
Is post.mean the mean of the posterior distribution?
Can post.mean be considered the equivalent of the estimates from a standard lm?
What does eff.samp mean?
How can I find the degrees of freedom?
This model is based on Bayesian statistics, is that correct?
You can use summary.MCMCglmm from the MCMCglmm package. It is the summary method for class "MCMCglmm", and the returned object is suitable for printing with the print.summary.MCMCglmm method. Its components include:
DIC: Deviance Information Criterion
fixed.formula: model formula for the fixed terms
random.formula: model formula for the random terms
residual.formula: model formula for the residual terms
solutions: posterior mean, 95% HPD interval, MCMC p-values and effective sample size of fixed (and random) effects
Gcovariances: posterior mean, 95% HPD interval and effective sample size of random effect (co)variance components
Rcovariances: posterior mean, 95% HPD interval and effective sample size of residual (co)variance components
cutpoints: posterior mean, 95% HPD interval and effective sample size of cut-points from an ordinal model
cstats: chain length, burn-in and thinning interval
Gterms: indexes random effect (co)variances by the component terms defined in the random formula
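To the questions above: yes, the model is Bayesian; post.mean is the mean of the posterior distribution and is the closest analogue of an lm estimate; eff.samp is the effective number of roughly independent MCMC samples behind each estimate; and no residual degrees of freedom are reported, because inference comes from the posterior samples rather than a t or F reference distribution. A minimal sketch of extracting these quantities, assuming the fitted model is stored in an object called m:
library(MCMCglmm)
library(coda)
summary(m)$solutions      # post.mean, 95% HPD interval, eff.samp and pMCMC for the fixed effects
colMeans(m$Sol)           # posterior means of the fixed-effect coefficients
HPDinterval(m$Sol)        # 95% highest posterior density intervals
effectiveSize(m$Sol)      # effective sample sizes (eff.samp)
summary(m)$Gcovariances   # the Region variance component
summary(m)$Rcovariances   # the residual (units) variance component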
I am under the impression that MCMCglmm does not implement a "true" Bayesian GLMM. As in the frequentist model, one would have g(E(y|u)) = Xβ + Zu, with a prior required on the dispersion parameter ϕ1 in addition to the fixed parameters β and the "G" variance of the random effect u.
But according to the MCMCglmm vignette, the model implemented in MCMCglmm is g(E(y|u,e)) = Xβ + Zu + e, which includes an observation-level residual e and does not involve the dispersion parameter ϕ1. It is not the same as the classical frequentist model.
Degrees of freedom
mcmcglmm is a wrapper for the MCMCglmm() function. The wrapper allows for two variants of default priors on the covariance matrices. The two defaults are InvW, an inverse-Wishart prior, which sets the degrees-of-freedom parameter equal to the dimension of each covariance matrix, and InvG, an inverse-Gamma prior, which sets the degrees-of-freedom parameter to 0.002 more than one less than the dimension of the covariance matrix.
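For reference, in MCMCglmm itself the (co)variance priors are supplied through the prior argument as a list with G and R components. The following is only an illustrative sketch using commonly seen weakly informative inverse-Gamma-style values (V = 1, nu = 0.002), not the defaults of the model shown above:
prior1 <- list(G = list(G1 = list(V = 1, nu = 0.002)),  # prior for the Region variance component
               R = list(V = 1, nu = 0.002))              # prior for the residual (units) variance
m <- MCMCglmm(Abondance ~ Human_impact / Fish.sp, random = ~Region,
              data = dat, prior = prior1, verbose = FALSE)  # 'dat' stands in for the original data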
I'm working on question C9 in Wooldridge's Intro to Econometrics Textbook. It asks you to obtain the unweighted fitted values and residuals from a weighted least squares regression. Does the following code give me the weighted or unweighted fitted values and residuals?
fitted(wlsmodel)
resid(wlsmodel)
I'm getting different answers from those in the textbook and I'm thinking it must be because the code I'm entering is giving me weighted fitted values and residuals. If this is the case, is there a way to get unweighted fitted values and residuals from a wls regression?
Okay, I've figured it out.
Chapter 8, question C9
(i) Obtain the OLS estimates in equation (8.35)
library(wooldridge)
reg<-lm(cigs~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,data=smoke)
(ii) Obtain the hhat used in the WLS estimation of equation (8.36) and reproduce equation (8.36). From this equation, obtain the unweighted residuals and fitted values; call these uhat and yhat, respectively.
uhat<-resid(reg)
uhat2<-uhat^2
ghat<-fitted(lm(log(uhat^2)~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,data=smoke))
hhat<-exp(ghat)
wls<-lm(cigs~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,weight=1/hhat,data=smoke)
uhatwls<-resid(wls)
yhatwls<-fitted(wls)
(iii) Let utilde=uhat/sqrt(hhat) and ytilde=yhat/sqrt(hhat) be the weighted quantities. Carry out the special case of the white test for heteroskedasticity by regressing utilde^2 on ytilde and ytilde^2, being sure to include an intercept, as always. Do you find heteroskedasticity in the weighted residuals?
utilde<-uhatwls/sqrt(hhat)
ytilde<-yhatwls/sqrt(hhat)
utilde2<-utilde^2
ytilde2<-ytilde^2
whitetest<-lm(utilde2~ytilde+ytilde2)
summary(whitetest)
Call:
lm(formula = utilde2 ~ ytilde + ytilde2)
Residuals:
Min 1Q Median 3Q Max
-5.579 -1.801 -1.306 -0.855 90.871
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.06146 0.96043 -0.064 0.94899
ytilde 0.28667 1.41212 0.203 0.83918
ytilde2 2.40597 0.78615 3.060 0.00228 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.414 on 804 degrees of freedom
Multiple R-squared: 0.027, Adjusted R-squared: 0.02458
F-statistic: 11.15 on 2 and 804 DF, p-value: 1.667e-05
The process above gives the correct answers from the solutions manual, so I know it has been done correctly. What was confusing me was the request to obtain the 'unweighted' residuals from the WLS. It turns out that these are just the residuals returned by default from that regression, which are then weighted in part (iii) of the question, as above. The goal is then to test the WLS regression for heteroskedasticity, which is indeed present in the WLS regression.
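As a quick check that fitted() and resid() from a weighted lm really are on the original (unweighted) scale, you can rebuild them from the design matrix. This is just a verification sketch, not part of the textbook question:
X <- model.matrix(wls)
y <- model.response(model.frame(wls))
all.equal(unname(fitted(wls)), unname(drop(X %*% coef(wls))))    # TRUE: fitted values are X %*% beta, no weighting applied
all.equal(unname(resid(wls)), unname(y - drop(X %*% coef(wls)))) # TRUE: residuals are y - X %*% beta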
I'm working with ecological data, where I have used cameras to sample animal detections (converted to biomass) and run various models, choosing the best-fitting one by looking at diagnostic plots, AIC and parameter effect sizes. The model is a Gamma GLM with a log link (biomass is a continuous, positive response). The chosen model has distance to water ("dist_water") and distance to forest patch ("dist_F3") as predictors. This is the model summary:
glm(formula = RAI_biomass ~ Dist_water.std + Dist_F3.std, family = Gamma(link = "log"),
data = biomass_RAI)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3835 -1.0611 -0.3937 0.4355 1.5923
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3577 0.2049 26.143 2.33e-16 ***
Dist_water.std -0.7531 0.2168 -3.474 0.00254 **
Dist_F3.std 0.5831 0.2168 2.689 0.01452 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.9239696)
Null deviance: 41.231 on 21 degrees of freedom
Residual deviance: 24.232 on 19 degrees of freedom
AIC: 287.98
Number of Fisher Scoring iterations: 7
The covariates were standardised prior to running the model. What I need to do now is back-transform this model into natural units in order to predict biomass values at unsampled locations (in this case, farms). I made a table of each farm with its respective distance to water and distance to forest patch. I thought the way to do this would be to use exp(coef(biomass_glm)), but when I did this the dist_water.std coefficient changed direction and became positive.
exp(coef(biomass_glm8))
## Intercept Dist_water.std Dist_F3.std
## 212.2369519 0.4709015 1.7915026
To me this seems problematic, as in the original GLM an increasing distance to water meant a decrease in biomass (which makes sense) - but now we are seeing the opposite? The calculated biomass response had a very narrow range, from 210.97 to 218.9331 (for comparison, in the original data biomass ranged from 3.04 to 2227.99).
I then tried to take the exponent of the entire model, without taking the exponent of each coefficient individually:
farms$biomass_est2 <- exp(5.3577 + (-0.7531*farms$Farm_dist_water_std) + (0.5831*farms$Farm_dist_F3_std))
and this gave me a new biomass response that makes a bit more sense, i.e. more variation given the variation in the two covariates (2.93-1088.84).
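For reference, exponentiating the full linear predictor like this is exactly what predict() does with type = "response" for a log link, and it avoids copying coefficients by hand. A sketch, assuming the farm covariates were standardised with the same means and SDs as the data used to fit the model:
newdat <- data.frame(Dist_water.std = farms$Farm_dist_water_std,
                     Dist_F3.std = farms$Farm_dist_F3_std)   # column names must match the model terms
farms$biomass_est3 <- predict(biomass_glm8, newdata = newdat, type = "response")  # exp(linear predictor)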
I then tried converting the coefficient estimates by computing exp(B) - 1, which again gave different results, although these were the most similar to the ones obtained through exp(coef(biomass_glm)):
exp(-0.7531)-1 #dist_water = -0.5290955
exp(0.5831)-1 #dist_F3 = 0.7915837
exp(5.3577)-1 #intercept = 211.2362
My question is, why are these estimates different, and what is the best way to take this gamma GLM with a log link and convert it into a format that can be used to calculate predicted values? Any help would be greatly appreciated!
I'm working on a meta-analysis of epidemiological studies. The studies are very heterogeneous in terms of population, intervention and analysis, so I'm using a random effects model for meta-analysis using "metafor" in R.
I subsetted the studies into subgroups with comparable outcomes. 5/6 look fine.
However, there is one subgroup that looks all wrong because tau is 0 and I^2 is 0. Looking at the data, I don't see why total heterogeneity would be 0.
res <- rma(yi=beta, sei=se, slab=(1:7), measure="OR",data=SIPVdata, digits=3, method= "ML")
Random-Effects Model (k = 3; tau^2 estimator: ML)
logLik deviance AIC BIC AICc
-0.217 2.635 4.433 2.630 16.433
tau^2 (estimated amount of total heterogeneity): 0.000 (SE = 0.044)
tau (square root of estimated tau^2 value): 0.001
I^2 (total heterogeneity / total variability): 0.00%
H^2 (total variability / sampling variability): 1.00
Test for Heterogeneity:
Q(df = 2) = 2.635, p-val = 0.268
Model Results:
estimate se zval pval ci.lb ci.ub
-0.350 0.145 -2.417 0.016 -0.634 -0.066 *
Plotting the model output gives the following forest plot (figure not included here):
In that plot you can see that two observations (5 and 3), which have small confidence intervals and similar estimates, have the most influence in the sample. The other estimates have wide CIs, which all overlap. I might expect the estimated heterogeneity to be low in this case, but not exactly 0, and certainly not tau = 0.
Does anyone have an idea what is going on in this meta-analysis?
Thank you very much!
The ML estimator of tau^2 is known to have negative bias. That of course does not mean that it is too low in this particular case, but I would suggest switching to an estimator that is known to be approximately unbiased. The one I would recommend is REML, which is in fact the default estimator (i.e., what is used if you do not specify the method argument).
Also, note that with 7 studies, the estimate of tau^2 (and hence I^2) is not going to be very precise. Run confint(res) and you will see that the confidence interval for I^2 is going to be very wide. In other words, all values within the CI are compatible with these data, so really there could indeed be no heterogeneity or a lot of it.
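A minimal sketch of both suggestions, reusing the call from the question but dropping method = "ML" so the default REML estimator is used:
res_reml <- rma(yi = beta, sei = se, slab = 1:7, measure = "OR",
                data = SIPVdata, digits = 3)   # method = "REML" is the default
res_reml
confint(res_reml)   # confidence intervals for tau^2, tau, I^2 and H^2; expect these to be wide with so few studies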
I'm currently struggling with how to report, following APA-6 recommendations, the output of rstanarm::stan_lmer().
First, I'll fit a mixed model within the frequentist approach, then I will try to do the same using the Bayesian framework.
Here's the reproducible code to get the data:
library(tidyverse)
library(neuropsychology)
library(rstanarm)
library(lmerTest)
df <- neuropsychology::personality %>%
select(Study_Level, Sex, Negative_Affect) %>%
mutate(Study_Level=as.factor(Study_Level),
Negative_Affect=scale(Negative_Affect)) # I understood that scaling variables is important
Now, let's fit a linear mixed model in the "traditional" way to test the impact of Sex (male/female) on Negative Affect (negative mood), with study level (years of education) as a random factor.
fit <- lmer(Negative_Affect ~ Sex + (1|Study_Level), df)
summary(fit)
The output is the following:
Linear mixed model fit by REML t-tests use Satterthwaite approximations to degrees of
freedom [lmerMod]
Formula: Negative_Affect ~ Sex + (1 | Study_Level)
Data: df
REML criterion at convergence: 3709
Scaled residuals:
Min 1Q Median 3Q Max
-2.58199 -0.72973 0.02254 0.68668 2.92841
Random effects:
Groups Name Variance Std.Dev.
Study_Level (Intercept) 0.04096 0.2024
Residual 0.94555 0.9724
Number of obs: 1327, groups: Study_Level, 8
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.01564 0.08908 4.70000 0.176 0.868
SexM -0.46667 0.06607 1321.20000 -7.064 2.62e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
SexM -0.149
To report it, I would say: "We fitted a linear mixed model with negative affect as the outcome variable, sex as a predictor, and study level entered as a random effect. Within this model, the male level led to a significant decrease of negative affect (beta = -0.47, t(1321) = -7.06, p < .001)."
Is that correct?
Then, let's try to fit the model within a Bayesian framework using rstanarm:
fitB <- stan_lmer(Negative_Affect ~ Sex + (1|Study_Level),
data=df,
prior=normal(location=0, scale=1),
prior_intercept=normal(location=0, scale=1),
prior_PD=F)
print(fitB, digits=2)
This returns:
stan_lmer
family: gaussian [identity]
formula: Negative_Affect ~ Sex + (1 | Study_Level)
------
Estimates:
Median MAD_SD
(Intercept) 0.02 0.10
SexM -0.47 0.07
sigma 0.97 0.02
Error terms:
Groups Name Std.Dev.
Study_Level (Intercept) 0.278
Residual 0.973
Num. levels: Study_Level 8
Sample avg. posterior predictive
distribution of y (X = xbar):
Median MAD_SD
mean_PPD 0.00 0.04
------
For info on the priors used see help('prior_summary.stanreg').
I think that Median is the median of the posterior distribution of the coefficient, and MAD_SD is the equivalent of the standard deviation (standard error). These values are close to the beta and standard error of the frequentist model, which is reassuring. However, I do not know how to formalize this and put the output into words.
Moreover, if I do the summary of the model (summary(fitB, probs=c(.025, .975), digits=2)), I get other features of the posterior distribution:
...
Estimates:
mean sd 2.5% 97.5%
(Intercept) 0.02 0.11 -0.19 0.23
SexM -0.47 0.07 -0.59 -0.34
...
Is something like the following good?
"we fitted a linear mixed model within the bayesian framework with negative affect as outcome variable, sex as predictor and study level was entered as a random effect. Priors for the coefficient and the intercept were set to normal (mean=0, sd=1). Within this model, the features of the posterior distribution of the coefficient associated with the male level suggest a decrease of negative affect (mean = -0.47, sd = 0.11, 95% CI[-0.59, -0.34]).
Thanks for your help.
The following is personal opinion that may or may not be acceptable to a psychology journal.
To report it, I would say: "We fitted a linear mixed model with negative affect as the outcome variable, sex as a predictor, and study level entered as a random effect. Within this model, the male level led to a significant decrease of negative affect (beta = -0.47, t(1321) = -7.06, p < .001)."
Is that correct?
That is considered correct from a frequentist perspective.
The key concepts from a Bayesian perspective are that (conditional on the model, of course)
There is a 0.5 probability that the true effect is less than the posterior median and a 0.5 probability that the true effect is greater than the posterior median. Frequentists tend to see a posterior median as being like a numerical optimum.
The posterior_interval function yields credible intervals around the median with a default probability of 0.9 (although a lower number produces more accurate estimates of the bounds). So, you can legitimately say that there is a probability of 0.9 that the true effect is between those bounds. Frequentists tend to see confidence intervals as being like credible intervals.
The as.data.frame function will give you access to the raw draws, so mean(as.data.frame(fitB)$SexM > 0) yields the probability that the expected difference in the outcome between men and women at the same study level is positive. Frequentists tend to see these probabilities as being like p-values.
For a Bayesian approach, I would say
We fitted a linear model using Markov chain Monte Carlo with negative affect as the outcome variable, sex as a predictor, and an intercept that was allowed to vary by study level.
And then talk about the estimates using the three concepts above.
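For example, a short sketch of those three quantities for the SexM coefficient (the 0.9 probability is the posterior_interval default; names follow the output shown above):
draws <- as.data.frame(fitB)                          # raw posterior draws, one column per parameter
median(draws$SexM)                                    # posterior median of the sex effect
posterior_interval(fitB, prob = 0.9, pars = "SexM")   # 90% credible interval
mean(draws$SexM < 0)                                  # posterior probability that the effect is negative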
I'm trying to evaluate the output from a negative binomial mixed model fitted with glmmadmb. To summarize the output I'm comparing the summary function with the output from the mcmc option. I have run this model:
pre1 <- glmmadmb(walleye~(1|year.center) + (1|Site) ,data=pre,
family="nbinom2",link="log",
mcmc=TRUE,mcmc.opts=mcmcControl(mcmc=1000))
I have two random intercepts: year and site. Year has 33 levels and site has 15.
The random effect parameter estimates for site and year from summary(pre1) do not seem to agree with the posterior distribution from the mcmc output. I am using the 50% confidence interval as the estimate that should coincide with the parameter estimate from the summary function. Is that incorrect? Is there a way to obtain an error around the random effect parameters using the summary function, to gauge whether this is a variance issue? I tried using postvar=T with ranef but that did not work. Also, is there a way to format the mcmc output with informative row names to ensure I'm using the proper estimates?
summary output from glmmabmb:
summary(pre1)
Call:
glmmadmb(formula = walleye ~ (1 | year.center) + (1 | Site),
data = pre, family = "nbinom2", link = "log", mcmc = TRUE,
mcmc.opts = mcmcControl(mcmc = 1000))
AIC: 4199.8
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.226 0.154 21 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Number of observations: total=495, year.center=33, Site=15
Random effect variance(s):
Group=year.center
Variance StdDev
(Intercept) 0.1085 0.3295
Group=Site
Variance StdDev
(Intercept) 0.2891 0.5377
Negative binomial dispersion parameter: 2.0553 (std. err.: 0.14419)
Log-likelihood: -2095.88
mcmc output:
m <- as.mcmc(pre1$mcmc)
CI <- t(apply(m,2,quantile,c(0.025,0.5,0.975)))
2.5% 50% 97.5%
(Intercept) 2.911667943 3.211775843 3.5537371345
tmpL.1 0.226614903 0.342206509 0.4600328729
tmpL.2 0.395353518 0.554211483 0.8619127547
alpha 1.789687691 2.050871824 2.3175742167
u.01 0.676758365 0.896844797 1.0726750539
u.02 0.424938481 0.588191585 0.7364795440
These estimates continue through u.48, covering the year- and site-specific coefficients.
Thank you in advance for any thoughts on this issue.
Tiffany
The random effect parameter estimates for site and year from summary(pre1) do not seem to agree with the posterior distribution from the mcmc output. I am using the 50% confidence interval as the estimate that should coincide with the parameter estimate from the summary function. Is that incorrect?
It's not the 50% confidence interval, it's the 50% quantile (i.e. the median). The point estimates from the Laplace approximation of the among-year and among-site standard deviations respectively are {0.3295,0.5377}, which seem quite close to the MCMC median estimates {0.342206509,0.554211483} ... as discussed below, the MCMC tmpL parameters are the random-effects standard deviations, not the variances -- this might be the main cause of your confusion?
Is there a way to obtain an error around the random effect parameters using the summary function, to gauge whether this is a variance issue? I tried using postvar=T with ranef but that did not work.
The lme4 package (not the glmmadmb package) allows estimates of the variances of the conditional modes (i.e. the random effects associated with particular levels) via ranef(...,condVar=TRUE) (postVar=TRUE is now deprecated). The equivalent information on the uncertainty of the conditional modes is available via ranef(model,sd=TRUE) (see ?ranef.glmmadmb).
However, I think you might be looking for the $S (variance-covariance matrices) and $sd_S (Wald standard errors of the variance-covariance estimates) instead (although as stated above, I don't think there's really a problem).
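Putting those pieces together, a short sketch using the fitted object from the question:
ranef(pre1, sd = TRUE)   # conditional modes for year.center and Site with their standard deviations
pre1$S                   # estimated random-effect (co)variance matrices
pre1$sd_S                # Wald standard errors of the (co)variance estimates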
Also, is there a way to format the mcmc output with informative row names to ensure I'm using the proper estimates?
See p. 15 of vignette("glmmADMB",package="glmmADMB"):
The MCMC output in glmmADMB is not completely translated. It includes, in order:
pz: zero-inflation parameter (raw)
fixed-effect parameters: named in the same way as the results of coef() or fixef()
tmpL: variances (standard-deviation scale)
tmpL1: correlation/off-diagonal elements of variance-covariance matrices (off-diagonal elements of the Cholesky factor of the correlation matrix). If you need to transform these to correlations, you will need to construct the relevant matrices with 1 on the diagonal and compute the cross-product, CC^T (see tcrossprod); if this makes no sense to you, contact the maintainers.
alpha: overdispersion/scale parameter
u: random effects (unscaled: these can be scaled using the estimated random-effects standard deviations from VarCorr())
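As a small sketch of making the MCMC output easier to read (assuming, as in the summary above, that the first tmpL column corresponds to year.center and the second to Site):
library(coda)
m <- as.mcmc(pre1$mcmc)
colnames(m)[colnames(m) == "tmpL.1"] <- "sd.year.center"   # random-effect SDs (not variances)
colnames(m)[colnames(m) == "tmpL.2"] <- "sd.Site"
summary(m)          # posterior means, SDs and quantiles with informative names
HPDinterval(m)      # 95% highest posterior density intervals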