Set contrasts in glm in R

I have binomial count data, coming from a set of conditions, that are overdispersed. To simulate them I use the beta-binomial distribution implemented by the rbetabinom function of the emdbook R package:
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors = FALSE)
df$k <- sapply(1:nrow(df), function(x)
  rbetabinom(n = 1, prob = df$p[x], size = df$n[x],
             theta = df$theta[x], shape1 = 1, shape2 = 1))
I want to find the effect of each condition (cond) on the counts (k).
I think the glm.nb model of the MASS R package allows modelling that:
library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
My question is how to set the contrasts such that I get the effect of each condition relative to the mean effects over all conditions rather than relative to the dummy condition A?

Two things: (1) if you want contrasts relative to the mean, use contr.sum rather than the default contr.treatment; (2) you probably shouldn't fit beta-binomial data with a negative binomial model; use a beta-binomial model instead (e.g. via VGAM or bbmle)!
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors = FALSE)
## slightly abbreviated: rbetabinom is vectorized, so sapply() isn't needed
df$k <- rbetabinom(n = nrow(df), prob = df$p,
                   size = df$n, theta = df$theta, shape1 = 1, shape2 = 1)
With VGAM:
library(VGAM)
## note dbetabinom/rbetabinom from emdbook are masked
options(contrasts = c("contr.sum", "contr.poly"))
vglm(cbind(k, n - k) ~ cond, data = df,
     family = betabinomialff(zero = 2)  ## hold shape parameter 2 constant
)
## Coefficients:
## (Intercept):1 (Intercept):2         cond1         cond2
##     0.4312181     0.5197579    -0.3121925     0.3011559
## Log-likelihood: -147.7304
Here the intercept is the mean shape parameter across the levels; cond1 and cond2 are the differences of levels 1 and 2 from the mean. (This doesn't directly give you the difference of level 3 from the mean, but by construction it should be -(cond1+cond2) ...)
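If you want level 3's deviation explicitly, you can compute it from the fitted coefficients. A minimal sketch, assuming the vglm fit above is saved in an object (the name fit_vglm is mine):
fit_vglm <- vglm(cbind(k, n - k) ~ cond, data = df,
                 family = betabinomialff(zero = 2))
cc <- coef(fit_vglm)
## deviation of level 3 ("C") from the mean effect:
-(cc["cond1"] + cc["cond2"])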
I find the parameterization with bbmle (with logit-probability and dispersion parameter) a little easier:
detach("package:VGAM")
library(bbmle)
mle2(k ~ dbetabinom(prob = plogis(lprob),
                    size = n, theta = exp(ltheta)),
     parameters = list(lprob ~ cond),
     data = df,
     start = list(lprob = 0, ltheta = 0))
## Coefficients:
## lprob.(Intercept)       lprob.cond1       lprob.cond2            ltheta
##       -0.09606536       -0.31615236        0.17353311        1.15201809
##
## Log-likelihood: -148.09
The log-likelihoods are about the same (the VGAM parameterization is a bit better); in theory, if we allowed both shape1 and shape2 (VGAM) or lprob and ltheta (bbmle) to vary across conditions, we'd get the same log-likelihoods for both parameterizations.
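For completeness, a sketch of that fully varying bbmle fit (both lprob and ltheta vary by condition); I haven't checked it against the output above:
mle2(k ~ dbetabinom(prob = plogis(lprob), size = n, theta = exp(ltheta)),
     parameters = list(lprob ~ cond, ltheta ~ cond),
     data = df,
     start = list(lprob = 0, ltheta = 0))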

Effects must be estimated relative to some base level. The effect of having any of the 3 conditions would be the same as a constant in the regression.
Since the intercept is the expected value when the dummy variables for both estimated levels (i.e. "B" and "C") are 0, it is the mean value of the reference group (i.e. "A") only.
Therefore, you basically already have this information in your model, or at least as close to it as you can get.
The mean value of a comparison group is the intercept plus that group's coefficient. Each comparison group's coefficient therefore gives you the effect of that group being present (bearing in mind that each level of your categorical variable is a dummy variable which equals 1 when that level is present) relative to the reference group.
So your results give you the means and relative effects of each level. You can of course switch out the reference level according to your preference, as sketched below.
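Switching the reference level just means releveling the factor and refitting; a minimal sketch with the model from the question (making cond a factor first, since it was created as a character vector):
df$cond <- relevel(factor(df$cond), ref = "B")
fit_B <- glm.nb(k ~ cond + offset(log(n)), data = df)
## coefficients are now relative to condition B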
That should hopefully give you all the information you need. If not then you need to ask yourself precisely what information it is that you're after.

Related

lme4: How to specify random slopes while constraining all correlations to 0?

Due to an interesting turn of events, I'm trying to use the lme4 package in R to fit a model in which the random slopes are not allowed to correlate with each other or with the random intercept. Effectively, I want to estimate the variance parameter for each random slope, but none of the correlations/covariances. From the reading I've done so far, I think what I want is effectively a diagonal variance/covariance structure for the random effects.
An answer to a similar question here provides a workaround to specify a model where slopes are correlated with intercepts, but not with each other. I also know the || syntax in lme4 makes slopes that are correlated with each other, but not with the intercepts. Neither of these seems to fully accomplish what I'm looking to do.
Borrowing the example from the earlier post, if my model is:
m1 <- lmer (Y ~ A + B + (1+A+B|Subject), data=mydata)
is there a way to specify the model such that I estimate variance parameters for A and B while constraining all three correlations to 0? I would like to achieve a result that looks something like this:
VarCorr(m1)
## Groups   Name        Std.Dev. Corr
## Subject  (Intercept) 1.41450
##          A           1.49374  0.000
##          B           2.47895  0.000  0.000
## Residual             0.96617
I'd prefer a solution that could achieve this for an arbitrary number of random slopes. For example, if I were to add a random effect for a third variable C, there would be 6 correlation parameters to fix at 0 rather than 3. However, anything that could get me started in the right direction would be extremely helpful.
Edit:
On asking this question, I misunderstood what the || syntax does in lme4. Struck through the incorrect statement above to avoid misleading anyone in the future.
This is exactly what the double-bar notation does. However, note that the || in lme4 does not work as one might expect for factor variables. It does work 'properly' in glmmTMB, and the afex::mixed() function is a wrapper for [g]lmer which does implement a fully functional version of ||. (I have meant to import this into lme4 for years but just haven't gotten around to it yet ...)
simulated example
library(lme4)
set.seed(101)
dd <- data.frame(A = runif(500), B = runif(500),
                 Subject = factor(rep(1:25, 20)))
dd$Y <- simulate(~ A + B + (1 + A + B | Subject),
                 newdata = dd,
                 family = gaussian,
                 newparams = list(beta = rep(1, 3), theta = rep(1, 6), sigma = 1))[[1]]
solution
summary(m <- lmer (Y ~ A + B + (1+A+B||Subject), data=dd))
The correlations aren't listed because they are structurally absent (internally, the random effects term is expanded to (1|Subject) + (0 + A|Subject) + (0+B|Subject), which is also why the groups are listed as Subject, Subject.1, Subject.2).
Random effects:
 Groups    Name        Variance Std.Dev.
 Subject   (Intercept) 0.8744   0.9351
 Subject.1 A           2.0016   1.4148
 Subject.2 B           2.8718   1.6946
 Residual              0.9456   0.9724
Number of obs: 500, groups: Subject, 25
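The expansion described above can also be written out by hand, which should give an identical fit; a quick sketch to check the equivalence:
m2 <- lmer(Y ~ A + B + (1 | Subject) + (0 + A | Subject) + (0 + B | Subject),
           data = dd)
all.equal(logLik(m), logLik(m2))  ## same model, different syntax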

Cohen's d effect size for multiple simple effect comparisons with more than 2 levels (following an interaction)

I'm trying to find an easy way to compute Cohen's d (standardised mean difference) for multiple simple effect comparisons following an interaction.
In this case, I have one factor with 2 levels and a second factor with 3 levels.
I'd compute a 2x3 ANOVA and find an interaction. Then I'd want to follow up with the specific t-test comparisons and report an effect size.
If there is a package or a simple function or an easy way to do this, please share!
First, make some data:
df1 <- data.frame(cond1 = rep(0:1, 500),
                  cond2 = sample(0:2, 1000, replace = TRUE),
                  dv = rnorm(1000, 2, 1))
#fit the model
model <- lm(dv ~ cond1*cond2, df1)
#test pairwise comparisons for the interaction (which isn't sig. here, but pretend that it is)
emm <- emmeans::emmeans(model, pairwise ~ cond1|cond2)
#can do cond1|cond2 or cond2|cond1, both work
This seems like it should work, but I can't figure out why I get this error message:
emmeans::eff_size(emm, sigma = sigma(model), edf = df.residual(model))
#Error in update.default(object, tran = NULL) : need an object with call component
This works:
summary(psych::cohen.d.by(df1 ~ cond1 + cond2))
but this does not work if I wanted to get the pairwise comparisons stratified the other way:
summary(psych::cohen.d.by(df1 ~ cond2 + cond1))
#Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
If I only had one condition variable, I would use rstatix::cohens_d(). However, as far as I know, this package and function do not allow more than one grouping variable as input:
rstatix::cohens_d(df1, dv ~ cond)
Any other suggestions?
What I'm looking for is the standardised mean difference for each comparison, returned in a list.
I know it's a lot of comparisons, but it's a common procedure in social science and there should be a function made to do this.
There is nothing wrong with your code. It runs as given:
set.seed(123)
df1 <- data.frame(cond1 = rep(0:1, 500),
                  cond2 = sample(0:2, 1000, replace = TRUE),
                  dv = rnorm(1000, 2, 1))
model <- lm(dv ~ cond1*cond2, df1)
emm <- emmeans::emmeans(model, pairwise ~ cond1|cond2)
emmeans::eff_size(emm, sigma = sigma(model), edf = df.residual(model))
## Since 'object' is a list, we are using the contrasts already present.
## cond2 = 1.02:
##  contrast effect.size     SE  df lower.CL upper.CL
##  (0 - 1)       0.0245 0.0633 996  -0.0996    0.149
##
## sigma used for effect sizes: 1.006
## Confidence level used: 0.95
Created on 2021-08-23 by the reprex package (v2.0.0)
So you must have done some other manipulations (e.g., something that replaced model with another object) before you ran eff_size().
I do suggest, however, that you set up cond2 as a factor in your model. You are treating it like a numerical predictor, and that's why it gets reduced to its average (1.02 in this case) rather than having separate effect sizes for each level.
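A minimal sketch of that factor-based refit (using the same df1 as above; you could argue cond1 should be a factor too):
df1$cond2 <- factor(df1$cond2)
model2 <- lm(dv ~ cond1 * cond2, df1)
emm2 <- emmeans::emmeans(model2, pairwise ~ cond1 | cond2)
emmeans::eff_size(emm2, sigma = sigma(model2), edf = df.residual(model2))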

Quick way to calculate a confidence interval after changing dispersion parameter

I'm teaching a modeling class in R. The students are all SAS users, and I have to create course materials that exactly match (when possible) SAS output. I'm working on the Poisson regression section and trying to match PROC GENMOD, with a "dscale" option that modifies the dispersion index so that the deviance/df==1.
Easy enough to do, but I need confidence intervals. I'd like to show the students how to get them without hand calculation, something akin to confint.default() or confint().
Data
skin_cancer <- data.frame(CASES = c(1,16,30,71,102,130,133,40,4,38,
                                    119,221,259,310,226,65),
                          CITY = c(rep(0,8), rep(1,8)),
                          N = c(172875,123065,96216,92051,72159,54722,
                                32185,8328,181343,146207,121374,111353,
                                83004,55932,29007,7583),
                          agegp = c(1:8, 1:8))
skin_cancer$ln_n = log(skin_cancer$N)
The model
fit <- glm(CASES ~ CITY, family="poisson", offset=ln_n, data=skin_cancer)
Changing the dispersion index
summary(fit, dispersion = deviance(fit) / df.residual(fit))
That gets me the "correct" standard errors (correct according to SAS). But obviously I can't run confint() on a summary() object.
Any ideas? Bonus points if you can tell me how to change the dispersion index within the model so I don't have to do it within the summary() call.
Thanks.
This is an interesting question, and slightly deeper than it seems.
The simplest potential answer is to use family="quasipoisson" instead of poisson:
fitQ <- update(fit, family="quasipoisson")
confint(fitQ)
However, this won't let you adjust the dispersion to be whatever you want; it specifically changes the dispersion to the estimate R calculates in summary.glm, which is based on the Pearson chi-squared (sum of squared Pearson residuals) rather than the deviance, i.e.
sum((object$weights * object$residuals^2)[object$weights > 0])/df.r
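To see how the two estimators differ, you can compute both by hand; a sketch using the fit object from the question (the Pearson version is equivalent to the weighted-residuals expression above):
disp_pearson  <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
disp_deviance <- deviance(fit) / df.residual(fit)  ## what PROC GENMOD's "dscale" uses
c(pearson = disp_pearson, deviance = disp_deviance)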
You should be aware that confint() applied to a glm fit (the method actually lives in the MASS package) computes profile confidence intervals rather than Wald confidence intervals (i.e., this is not just a matter of adjusting the standard deviations).
If you're satisfied with Wald confidence intervals (which are generally less accurate), you could hack stats::confint.default() as follows (note that the dispersion argument is a little bit misleading: this function basically assumes that the original dispersion of the model is fixed to 1, so it won't work as expected with a model that estimates dispersion).
confint_wald_glm <- function(object, parm, level = 0.95, dispersion = NULL) {
  cf <- coef(object)
  pnames <- names(cf)
  if (missing(parm))
    parm <- pnames
  else if (is.numeric(parm))
    parm <- pnames[parm]
  a <- (1 - level)/2
  a <- c(a, 1 - a)
  pct <- stats:::format.perc(a, 3)
  fac <- qnorm(a)  ## normal quantiles: this is what makes it a Wald interval
  ci <- array(NA, dim = c(length(parm), 2L),
              dimnames = list(parm, pct))
  ses <- sqrt(diag(vcov(object)))[parm]
  ## inflate the standard errors by sqrt(dispersion) if one is supplied
  if (!is.null(dispersion)) ses <- sqrt(dispersion) * ses
  ci[] <- cf[parm] + ses %o% fac
  ci
}
confint_wald_glm(fit)
confint_wald_glm(fit,dispersion=2)

Simulate data for mixed-effects model with predefined parameter

I'm trying to simulate data for a model expressed with the following formula:
lme4::lmer(y ~ a + b + (1|subject), data) but with a set of given parameters:
a <- rnorm(), measured at the subject level (e.g. nSubjects = 50)
y is measured at the observation level (e.g. nObs = 7 for each subject)
b <- rnorm(), measured at the observation level and correlated at a given r with a
the variance ratio of the random effects in lmer(y ~ 1 + (1 | subject), data) is fixed at, for example, 50/50 or 10/90 (and so on)
some random noise is present (so that a full model does not explain all the variance)
the effect size of the fixed effects can be set at a predefined level (e.g. dCohen = 0.5)
I played with various packages like: powerlmm, simstudy or simr but still fail to find a working solution that will accommodate the amount of parameters I'd like to define beforehand.
Also for my learning purposes I'd prefer a base R method than a package solution.
The closest example I found is a blog post by Ben Ogorek "Hierarchical linear models and lmer" which looks great but I can't figure out how to control for parameters listed above.
Any help would be appreciated.
Also if there a package that I don't know of, that can do these type of simulations please let me know.
Some questions about the model definition:
How do we specify a correlation between two random vectors that are different lengths? I'm not sure: I'll sample 350 values (nObs*nSubject) and throw away most of the values for the subject-level effect.
Not sure about "variance ratio" here. By definition, the theta parameters (standard deviations of the random effects) are scaled by the residual standard deviation (sigma); e.g. if sigma=2 and theta=2, then the residual std dev is 2 and the among-subject std dev is 4.
Define parameter/experimental design values:
nSubjects <- 50
nObs <- 7
## means of a, b are 0 without loss of generality
sdvec <- c(a = 1, b = 1)
rho <- 0.5  ## correlation between a and b
betavec <- c(intercept = 0, a = 1, b = 2)
beta_sc <- c(betavec[1], betavec[-1]*sdvec)  ## keep the intercept; scale the slopes by sd
theta <- 0.4  ## = 20/50
sigma <- 1
Set up data frame:
library(lme4)
set.seed(101)
## generate a, b variables
mm <- MASS::mvrnorm(nSubjects*nObs,
                    mu = c(0, 0),
                    Sigma = matrix(c(1, rho, rho, 1), 2, 2)*outer(sdvec, sdvec))
subj <- factor(rep(seq(nSubjects), each = nObs))  ## or ?gl
## sample every nObs'th value of a (one value per subject)
avec <- mm[seq(1, nObs*nSubjects, by = nObs), "a"]
avec <- rep(avec, each = nObs)  ## replicate to the observation level
bvec <- mm[,"b"]
dd <- data.frame(a = avec, b = bvec, Subject = subj)
Simulate:
dd$y <- simulate(~ a + b + (1 | Subject),
                 newdata = dd,
                 newparams = list(beta = beta_sc, theta = theta, sigma = 1),
                 family = gaussian)[[1]]
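As a sanity check (my addition, not part of the original recipe), refit the target model and see whether the inputs are roughly recovered:
fit <- lmer(y ~ a + b + (1 | Subject), data = dd)
fixef(fit)    ## should be near beta_sc
VarCorr(fit)  ## among-subject std dev should be near theta*sigma = 0.4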

Choice of statistical test (in R) of two apparently different distributions

I have the following list of data; each has 10 samples.
The values indicate the binding strength of a particular molecule.
What I want to show is that 'x' is statistically different from 'y', 'z' and 'w'. It looks that way: 'x' has more large values (2.8, 1.00, 5.4, etc.) than the others.
I tried t-tests, but all of them show an insignificant difference with high p-values.
What's the appropriate test for that?
Below is my code:
#!/usr/bin/Rscript
x <- c(2.852672123,0.076840264,1.009542943,0.430716968,5.4016,0.084281843,
       0.065654548,0.971907344,3.325405405,0.606504718)
y <- c(0.122615039,0.844203734,0.002128992,0.628740077,0.87752229,
       0.888600425,0.728667099,0.000375047,0.911153571,0.553786408)
z <- c(0.766445916,0.726801899,0.389718652,0.978733927,0.405585807,
       0.408554832,0.799010791,0.737676439,0.433279599,0.947906524)
w <- c(0.000124984,1.486637663,0.979713013,0.917105894,0.660855127,
       0.338574774,0.211689885,0.434050179,0.955522972,0.014195184)
t.test(x,y)
t.test(x,z)
You have not specified in what way you expect the samples to differ. One typically assumes you mean that the means differ across samples. In that case, the t-test is appropriate. While x has some high values, it also has some low values that pull its mean back in. It seems that what you thought was a significant difference (visually) is actually just larger variance.
If your question is about variance, then you need an F-test.
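For example, a two-sample F-test of equal variances in base R (only sensible if the data are roughly normal):
var.test(x, y)  ## compares the variances of x and y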
The classic test for this type of data is analysis of variance (ANOVA). ANOVA tells you whether the means of all four categories are likely the same (failure to reject the null hypothesis) or whether at least one mean likely differs from the others (rejection of the null hypothesis).
If the anova is significant, you will often want to perform the Tukey HSD post-hoc test to figure out which category differs from the others. Tukey HSD yields p-values that are already adjusted for multiple comparisons.
library(ggplot2)
library(reshape2)
x <- c(2.852672123,0.076840264,1.009542943,0.430716968,5.4016,0.084281843,
       0.065654548,0.971907344,3.325405405,0.606504718)
y <- c(0.122615039,0.844203734,0.002128992,0.628740077,0.87752229,
       0.888600425,0.728667099,0.000375047,0.911153571,0.553786408)
z <- c(0.766445916,0.726801899,0.389718652,0.978733927,0.405585807,
       0.408554832,0.799010791,0.737676439,0.433279599,0.947906524)
w <- c(0.000124984,1.486637663,0.979713013,0.917105894,0.660855127,
       0.338574774,0.211689885,0.434050179,0.955522972,0.014195184)
dat <- data.frame(x, y, z, w)
mdat <- melt(dat)  ## long format: one 'variable' column, one 'value' column
anova_results <- aov(value ~ variable, data = mdat)
summary(anova_results)
#             Df Sum Sq Mean Sq F value Pr(>F)
# variable     3   5.83  1.9431   2.134  0.113
# Residuals   36  32.78  0.9105
The ANOVA p-value is 0.113, and the Tukey test p-values for your "x" category are in a similar range. This quantifies your intuition that "x" is different from the others. Most researchers would find p = 0.11 suggestive, but carrying too high a risk of being a false positive. Note that the large difference in means (the diff column) along with the boxplot figure below might be more persuasive than the p-value.
TukeyHSD(anova_results)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
# Fit: aov(formula = value ~ variable, data = mdat)
#
# $variable
#            diff       lwr       upr     p adj
# y-x -0.92673335 -2.076048 0.2225815 0.1506271
# z-x -0.82314118 -1.972456 0.3261737 0.2342515
# w-x -0.88266565 -2.031981 0.2666492 0.1828672
# z-y  0.10359217 -1.045723 1.2529071 0.9948795
# w-y  0.04406770 -1.105247 1.1933826 0.9995981
# w-z -0.05952447 -1.208839 1.0897904 0.9990129
plot_1 <- ggplot(mdat, aes(x = variable, y = value, colour = variable)) +
  geom_boxplot() +
  geom_point(size = 5, shape = 1)
ggsave("plot_1.png", plot_1, height = 3.5, width = 7, units = "in")
In your question you referred to the distributions being different because some of them had more values greater than 0. If you define the distributions according to the "number of values greater than 0", then you would use the binomial distribution (after converting the values to 1's and 0's). A function you could then use would be prop.test().
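A sketch of that approach; note that all of these values are strictly positive, so "greater than 0" would make every observation a success, and the cutoff of 1 used here is my assumption:
## count "large" values per group, then compare the four proportions
counts <- sapply(list(x = x, y = y, z = z, w = w), function(v) sum(v > 1))
prop.test(counts, n = rep(10, 4))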
