Generating similar estimates of interactions in afex, lsmeans, and lme4 packages - r

I would like to know if there is a way get the same estimates of an interaction effect in afex & lsmeans packages as in lmer. The toy data below is for two groups with different intercepts and slopes.
set.seed(1234)
A0 <- rnorm(4,2,1)
B0 <- rnorm(4,2+3,1)
A1 <- rnorm(4,6,1)
B1 <- rnorm(4,6+2,1)
A2 <- rnorm(4,10,1)
B2 <- rnorm(4,10+1,1)
A3 <- rnorm(4,14,1)
B3 <- rnorm(4,14+0,1)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- factor(rep(1:8,times = 4, length = 32))
time <- factor(rep(0:3, each = 8, length = 32))
timeNum <- as.numeric(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id, group, time, timeNum, score)
df
And here is the plot
(ggplot(df, aes(x = time, y = score, group = group)) +
stat_summary(fun.y = "mean", geom = "line", aes(linetype = group)) +
stat_summary(fun.y = "mean", geom = "point", aes(shape = group), size = 3) +
coord_cartesian(ylim = c(0,18)))
When I run a standard lmer on the data looking for an estimate of the difference in change in score over time between groups.
summary(modelLMER <- lmer(score ~ group * timeNum + (timeNum|id), df))
I get an estimate for the group*time interaction of -1.07, which means that the increase in score for a one-unit increase in time is ~1 point less in group B than group A. This estimate matches the preset differences I built into the dataset.
What I would like to know is how to do a similar thing in the afex and lsmeans packages.
library(afex)
library(lsmeans)
First I generated the afex model object
modelLM <- aov_ez(id="id", dv="score", data=df, between="group", within="time",
type=3, return="lm")
Then passed that into the lsmeans function
lsMeansLM <- lsmeans(modelLM, ~rep.meas:group)
My goal is to generate an accurate estimate of the group*time interaction in afex and lsmeans. To do so requires specifying custom contrast matrices based on the split specified in the lsmeans function above.
groupMain = list(c(-1,-1,-1,-1,1,1,1,1)) # group main effect
linTrend = list(c(-3,-1,1,3,-3,-1,1,3)) # linear trend
linXGroup = mapply("*", groupMain, linTrend) # group x linear trend interaction
Then I made a master list
contrasts <- list(groupMain=groupMain, linTrend=linTrend, linXGroup=linXGroup)
Which I passed into the contrast function in lsmeans.
contrast(lsMeansLM, contrasts)
The F and p values in the output match those for the automatic tests for linear trend and for the group difference in linear trend generated from a mixed ANCOVA in SPSS. However the mixed ANCOVA does not generate an estimate.
The estimate of the effect using the procedure above, instead of being approx. -1, like in the lmer (and matching the difference I built into the data) is approx. -10, which is wildly inaccurate.
I assume it has something to do with how I am coding the contrast coefficients. I know if I normalise the coefficients of the groupMain matrix by dividing all coefficients by four that yields an accurate estimate of the main effect of group averaged across all timepoints. But I have no idea how to get an accurate estimate either of linear trend averaged across groups (linTrend), or an accurate estimate of the difference in linear trend across groups (linXGroup).
I am not sure if this question is more suitable for here or Cross Validated. I figured here first because it seems to be software related, but I know there are probably deeper issues involved. Any help would be much appreciated.

The issue here is that timeNum is a numeric predictor. Therefore, the interaction is a comparison of slopes. Note this:
> lstrends(modelLMER, ~group, var = "timeNum")
group timeNum.trend SE df lower.CL upper.CL
A 4.047168 0.229166 6.2 3.490738 4.603598
B 2.977761 0.229166 6.2 2.421331 3.534191
Degrees-of-freedom method: satterthwaite
Confidence level used: 0.95
> pairs(.Last.value)
contrast estimate SE df t.ratio p.value
A - B 1.069407 0.3240897 6.2 3.3 0.0157
There's your 1.07 - the opposite sign because the comparison is in the other direction.
I will further explain that the lsmeans result you describe in the question is a comparison of the two group means, not an interaction contrast. lsmeans uses a reference grid:
> ref.grid(modelLMER)
'ref.grid' object with variables:
group = A, B
timeNum = 1.5
and as you can see, timeNum is being held fixed at its mean of 1.5. The LS means are predictions for each group at timeNum = 1.5 -- often called the adjusted means; and the difference is thus the difference between those two adjusted means.
Regarding the discrepancy claimed in obtaining your linear contrast of about 10.7: The linear contrast coefficients c(-3,-1,1,3) give you a multiple of the slope of the line. To get the slope, you need to divide by sum(c(-3,-1,1,3)^2) -- and also multiply by 2, because the contrast coefficients increment by 2.

Thanks to the invaluable help of #rvl I was able to solve this. Here is the code.
In order to generate the correct contrast matrices we first need to normalise them
(mainMat <- c(-1,-1,-1,-1,1,1,1,1)) # main effects matrix
(trendMat <- c(-3,-1,1,3,-3,-1,1,3) # linear trend contrast coefficients
(nTimePoints <- 4) # number of timePoints
(mainNorm <- 1/nTimePoints)
(nGroups <- 2) # number of between-Ss groups
(trendIncrem <- 2) # the incremental increase of each new trend contrast coefficient
(trendNorm <- trendIncrem/(sum(trendMat^2))) # normalising the trend coefficients
Now we create several contrast matrices in the form of lists. These are normalised using the objects we created above
(groupMain = list(mainMat*mainNorm)) # normalised group main effect
(linTrend = list(trendMat*trendNorm)) # normalised linear trend
(linXGroup = list((mainMat*trendMat)*(nGroups*trendNorm))) # group x linear trend interaction
Now pass those lists of matrices into a master list
contrasts <- list(groupMain=groupMain, linTrend=linTrend, linXGroup=linXGroup)
And pass that master list into the contrasts function in lsmeans
contrast(lsMeansLM, contrasts)
This is the output
contrast estimate SE df t.ratio p.value
c(-0.25, -0.25, -0.25, -0.25, 0.25, 0.25, 0.25, 0.25) 1.927788 0.2230903 6 8.641 0.0001
c(-0.15, -0.05, 0.05, 0.15, -0.15, -0.05, 0.05, 0.15) 3.512465 0.1609290 6 21.826 <.0001
c(0.3, 0.1, -0.1, -0.3, -0.3, -0.1, 0.1, 0.3) -1.069407 0.3218581 6 -3.323 0.0160
How do we check if these are accurate estimates?
Note first that the estimate of the group*time interaction is now approximately the same value as is returned by
summary(modelLMER)
The 'main effect' trend (for want of a better descriptor), which is the rate of change in score across the four time points averaged across both levels of group, is 3.51. If we change the coding of the group factor to simple coding via
contrasts(df$group) <- c(-.5,.5)
and run summary(modelLMER) again, the time estimate will now be 3.51.
Finally for the main effect of group, that is, the difference in score between groups averaged across all time points. We can run
pairs(lsmeans(modelLM,"group"))
And this will be -1.92. Thank you #rvl. A great answer. Using afex and lsmeans we have now forced a mixed ANCOVA that treats the repeated measures variable as categorical to give us estimates of group differences in trend and main effects that match those returned by a mixed-effects model where the repeated measures variable is continuous, and with p- and F-values that match those of SPSS.

Related

Specifying quantile (tau) varying formula in rqpd regression with non-quantile varying fixed effects in R

I am using the rqpd package from the R Forge repository in R to run quantile regressions with fixed effects. In my setting, I am also applying instrumental variables in line with Chernozhukov and Hansen (2006) (the paper that the implementation approximates is Galvao (2011)).
Part of my implementation of the method consists in, for a given quantile tau, estimating y - alpha(tau) * D ~ Z + X | as.factor(country), where y is the outcome variable, D is an endogenous explanatory variable, Z is the instrument for that endogenous variable, X is an exogenous variable, and alpha(tau) is a, for the given estimation, fixed coefficient on the endogenous variable. This estimation should be run repeatedly over different values of alpha(tau) (looping over a vector of candidate alphas for each tau (quantile). The specification that yields the smallest norm of the coefficients on the instrument (or, in the case with only one instrument, as here, the smallest absolute value of the t-statistic). However, the fixed effects should be independent of the tau's.
In order to keep the fixed effects independent of the tau's, I am not using the standard rq command but the rqpd command which allows for fixed effects and runs estimations simultaneously across all specified values of tau. However, in doing this, I am not sure how to allow the alpha values to vary with taus, so that the chosen alpha coefficient on the endogenous variable can be a different one for every tau.
The code looks something like this (but is, as you see, not correct at the moment, given that I don't know how to specify the alphas in the rqpd setting).
taus <- c(0.3, 0.6, 0.9)
alpha <- matrix(0, nrow = 20, ncol = length(taus))
for(i in 1:ncol(alpha)) {
alpha[,i] <- seq(0.05, 0.95, length.out = 20)
}
for (i in 1:nrow(alpha)){
est <- rqpd(y - alpha[i,] * D ~ Z + X | as.factor(country), data = df_final, panel(taus=taus, tauw=rep(1/length(taus), length(taus))), method = "pfe", na.omit)
}
How can alpha vary with tau while the fixed effects do not?

converting the glmer output from logit to response scale

I hoping get some can help solving this mystery in my mind. The 7th coefficient in my glmer() call is 0.28779305 on the logit scale.
This coefficient can be also obtained by using contrast() in the emmeans package. However, this package apparently outputs the 7th coefficient on a different scale, the response scale.
I wonder how to convert the estimate given by the contrast() call so it matches the 7th coefficient in my glmer() call?
ps. This answer provides some insight but I don't see a way the coefficient from these two packages might be related.
library(lme4)
library(emmeans)
library(broom.mixed)
dat <- read.csv("https://raw.githubusercontent.com/fpqq/w/main/d.csv")
form2 <- y ~ item_type*time + (1 | user_id)
m2 <- glmer(form2, family = binomial, data = dat,
control =
glmerControl(optimizer = "bobyqa"))
coef(summary(m2))[7,]
# Estimate Std. Error z value Pr(>|z|) # This Estimate
#0.28779305 0.11271202 2.55334842 0.01066927 # is on logit scale
#------------------------------------------------------------------
EMM <- emmeans(m2, ~ item_type * time)
CON <- list(c1 = c(1, 0, -1, 0, -1, 0, 1, 0))
contrast(regrid(EMM), CON)
# contrast estimate SE df z.ratio p.value # This Estimate
# c1 0.106 0.0299 Inf 3.526 0.0004 # is on response scale
Following on from #RussLenth's answer:
contrast(EMM, CON) gives you log(o1) - log(o3) - log(o5) + log(o7), and it is indeed equal to the 7th coefficient value.
contrast(REMM, CON) (i.e. regridded) gives you p1 - p3 - p5 - p7.
(where oi = odds in group i, pi = probability in group i, oi = pi/(1-pi)).
While it is possible to convert an individual log-odds value to a probability (if loi is the log-odds value, then pi = 1/(1+exp(-loi))), I don't think there's any way to convert a linear combination of log-odds values directly to the corresponding linear combination of probability values; instead, you'd have to do what emmeans is doing anyway — i.e., convert the individual log-odds values to the probability scale and then compute the linear combination. (In fact emmeans is going the other direction — computing the probabilities from the log-odds when you specify regridding.)
Your question also reveals a misunderstanding of what's going on here: you say
apparently outputs the 7th coefficient on a different scale, the response scale
The package does not output the 7th coefficient on the response scale; instead, it applies the same contrast that would give the 7th coefficient on the log scale to the group values on the probability scale. "[T]he 7th coefficient on the response scale" would be
L(log(o1) - log(o3) - log(o5) + log(o7))
(where L(x) is the logistic function 1/(1+exp(-x)) [plogis() in R]). This is not the same as
L(log(o1)) - L(log(o3)) - L(log(o5)) + L(log(o7))
which is what emmeans gives you.
The bottom line is that if you do
contrast(EMM, CON)
then maybe you'll get that coefficient you are asking about. That's assuming that the 7th coefficient is correctly identified.
The EMM object contains information about estimates of 8 different probabilities (call them p1, p2, ..., p8), on the logit scale.
If you do summary(EMM), you will get estimates of logit(p1), logit(p2), ..., logit(p8).
If you do summary(EMM, type = "response"), you will get estimates of p1, p2, ..., p8
If you do contrast(EMM, CON) you will get an estimate of logit(p1) - logit(p3) - logit(p5) + logit(p7) = log(o1) - log(o3) - log(o5) + log(o7), where oj = pj / (1 - pj) is the odds for the jth case.
If you do contrast(EMM, CON, type = "response"), you get an estimate of exp(log(o1) - log(o3) - log(o5) + log(o7)) = (o1*o7) / (o3*o5)
Now, as documented, REMM = regrid(EMM) undoes the logit transformation once and for all. It preserves no memory of where it came from, it just has information about estimates on the response scale and their covariance matrix. Thus
If you do summary(REMM), you get estimates of p1, p2, ..., p8
If you do summary(REMM, type = "response"), you get estimates of p1, p2, ..., p8. There is no transformation information in REMM, it is already on the response scale.
If you do contrast(REMM, CON) (or contrast(REMM, CON, type = "response")), you get an estimate of p1 - p3 - p5 + p7.
The emmeans package documents all of this, and in addition contains several vignettes with examples. In particular, the one on transformations and the one on comparisons and contrasts are especially pertinent here.

Identifying lead/lags using multivariate regression analysis

I have three time-series variables (x,y,z) measured in 3 replicates. x and z are the independent variables. y is the dependent variable. t is the time variable. All the three variables follow diel variation, they increase during the day and decrease during the night. An example with a simulated dataset is below.
library(nlme)
library(tidyverse)
n <- 100
t <- seq(0,4*pi,,100)
a <- 3
b <- 2
c.unif <- runif(n)
amp <- 2
datalist = list()
for(i in 1:3){
y <- 3*sin(b*t)+rnorm(n)*2
x <- 2*sin(b*t+2.5)+rnorm(n)*2
z <- 4*sin(b*t-2.5)+rnorm(n)*2
data = as_tibble(cbind(y,x,z))%>%mutate(t = 1:100)%>% mutate(replicate = i)
datalist[[i]] <- data
}
df <- do.call(rbind,datalist)
ggplot(df)+
geom_line(aes(t,x),color='red')+geom_line(aes(t,y),color='blue')+
geom_line(aes(t,z),color = 'green')+facet_wrap(~replicate, nrow = 1)+theme_bw()
I can identify the lead/lag of y with respect to x and z individually. This can be done with ccf() function in r. For example
ccf(x,y)
ccf(z,y)
But I would like to do it in a multivariate regression approach. For example, nlme package and lme function indicates y and z are negatively affecting x
lme = lme(data = df, y~ x+ z , random=~1|replicate, correlation = corCAR1( form = ~ t| replicate))
It is impossible (in actual data) that x and z can negatively affect y.
I need the time-lead/lag and also I would like to get the standardized coefficient (t-value to compare the effect size), both from the same model.
Is there any multivariate model available that can give me the lead/lag and also give me regression coefficient?
We might be considering the " statistical significance of Cramer Rao estimation of a lower bound". In order to find Xbeta-Xinfinity, taking the expectation of Xbeta and an assumed mean neu; will yield a variable, neu^squared which can replace Xinfinity. Using the F test-likelihood ratio, the degrees of freedom is p2-p1 = n-p2.
Put it this way, the estimates are n=(-2neu^squared/neu^squared+n), phi t = y/Xbeta and Xbeta= (y-betazero)/a.
The point estimate is derived from y=aXbeta + b: , Xbeta. The time lead lag is phi t and the standardized coefficient is n. The regression generates the lower bound Xbeta, where t=beta.
Spectral analysis of the linear distribution indicates a point estimate beta zero = 0.27 which is a significant peak of
variability. Scaling Xbeta by Betazero would be an appropriate idea.

rstanarm for Bayesian hierarchical modeling of binomial experiments

Suppose there are three binomial experiments conducted chronologically. For each experiment, I know the #of trials as well as the #of successes. To use the first two older experiments as prior for the third experiment, I want to "fit a Bayesian hierarchical model on the two older experiments and use the posterior form that as prior for the third experiment".
Given my available data (below), my question is: is my rstanarm code below capturing what I described above?
Study1_trial = 70
Study1_succs = 27
#==================
Study2_trial = 84
Study2_succs = 31
#==================
Study3_trial = 100
Study3_succs = 55
What I have tried in package rstanarm:
library("rstanarm")
data <- data.frame(n = c(70, 84, 100), y = c(27, 31, 55));
mod <- stan_glm(cbind(y, n - y) ~ 1, prior = NULL, data = data, family = binomial(link = 'logit'))
## can I use a beta(1.2, 1.2) as prior for the first experiment?
TL;DR: If you were directly predicting the probability of success, the model would be a Bernoulli likelihood with parameter theta (the probability of success) that could take on values between zero and one. You could use a Beta prior for theta in this case. But with a logistic regression model, you're actually modeling the log odds of success, which can take on any value from -Inf to Inf, so a prior with a normal distribution (or some other prior that can take on any real value within some range determined by the available prior information) would be more appropriate.
For a model where the only parameter is the intercept, the prior is the probability distribution for the log odds of success. Mathematically, the model is:
log(p/(1-p)) =  a
Where p is the probability of success and a, the parameter you're estimating, is the intercept, which can be any real number. If the odds of success are 1:1 (that is, p = 0.5) then a = 0. If the odds are greater than 1:1 then a is positive. If the odds are less than 1:1 then a is negative.
Since we want a prior for a, we need a probability distribution that can take on any real value. If we didn't know anything about the odds of success, we might use a very weakly informative prior like a normal distribution with, say, mean=0 and sd=10 (this is the rstanarm default), meaning that one standard deviation would encompass odds of success ranging from about 22000:1 to 1:22000! So this prior is essentially flat.
If we take your first two studies to construct the prior, we can use the probability density based on those studies and then transform it to the log odds scale:
# Possible outcomes (that is, the possible number of successes)
s = 0:(70+84)
# Probability density over all possible outcomes
dens = dbinom(s, 70+84, (27+31)/(70+84))
Assuming we'll use a normal distribution for the prior, we want the most likely probability of success (which will be the mean for the prior) and the standard deviation of the mean.
# Prior parameters
pp = s[which.max(dens)]/(70+84) # most likely probability
psd = sum(dens * (s/max(s) - pp)^2)^0.5 # standard deviation
# Convert prior to log odds scale
pp_logodds = log(pp/(1-pp))
psd_logodds = log(pp/(1-pp)) - log((pp-psd)/(1 - (pp-psd)))
c(pp_logodds, psd_logodds)
[1] -0.5039052 0.1702006
You could generate essentially the same prior by running stan_glm on the first two studies with the default (flat) prior:
prior = stan_glm(cbind(y, n-y) ~ 1,
data = data[1:2,],
family = binomial(link = 'logit'))
c(coef(prior), se(prior))
[1] -0.5090579 0.1664091
Now let's fit the model using data from Study 3 using the default prior and the prior we just generated. I've switched to a standard data frame, since stan_glm seems to fail when the data frame has only one row (as in data = data[3, ]).
# Default weakly informative prior
mod1 <- stan_glm(y ~ 1,
data = data.frame(y=rep(0:1, c(45,55))),
family = binomial(link = 'logit'))
# Prior based on studies 1 & 2
mod2 <- stan_glm(y ~ 1,
data = data.frame(y=rep(0:1, c(45,55))),
prior_intercept = normal(location=pp_logodds, scale=psd_logodds),
family = binomial(link = 'logit'))
For comparison, let's also generate a model with all three studies and the default flat prior. We would expect this model to give virtually the same results as mod2:
mod3 <- stan_glm(cbind(y, n - y) ~ 1,
data = data,
family = binomial(link = 'logit'))
Now let's compare the three models:
library(tidyverse)
list(`Study 3, Flat Prior`=mod1,
`Study 3, Prior from Studies 1 & 2`=mod2,
`All Studies, Flat Prior`=mod3) %>%
map_df(~data.frame(log_odds=coef(.x),
p_success=predict(.x, type="response")[1]),
.id="Model")
Model log_odds p_success
1 Study 3, Flat Prior 0.2008133 0.5500353
2 Study 3, Prior from Studies 1 & 2 -0.2115362 0.4473123
3 All Studies, Flat Prior -0.2206890 0.4450506
For Study 3 with the flat prior (row 1), the predicted probability of success is 0.55, as expected, since that's what the data says and the prior provides no additional information.
For Study 3 with a prior based on studies 1 and 2, the probability of success is 0.45. The lower probability of success is due to the lower probability of success in Studies 1 and 2 adding additional information. In fact, the probability of success from mod2 is exactly what you'd calculate directly from the data: with(data, sum(y)/sum(n)). mod3 puts all the information into the likelihood instead of splitting it between the prior and the likelihood, but is otherwise essentially the same as mod2.
Answer to (now deleted) comment: If all you know is the number of trials and successes and you think that a binomial probability is a reasonable model for how the data were generated, then it doesn't matter how you split up the data into "prior" and "likelihood" or whether you shuffle the order of the data. The resulting model fit will be the same.

How to obtain profile confidence intervals of the difference in probability of success between two groups from a logit model (glmer)?

I am struggling to transform the log odds ratio profile confidence intervals obtained from a logit model into probabilities. I would like to know how to calculate the confidence intervals of the difference between two groups.
If the p-value is > 0.05, the 95% CI of the difference should span from below zero to above zero. However, I don’t know how negative values can be obtained when the log ratios have to be exponentiated. Therefore I tried to calculate the CI of one of the groups (B) and see what the difference of the lower and the upper end of the CI to the estimate of group A is. I believe this is not the correct way to calculate the CI of the difference because the estimate of A is also uncertain.
I would be happy if anyone could help me out.
library(lme4)
# Example data:
set.seed(11)
treatment = c(rep("A",30), rep("B", 40))
site = rep(1:14, each = 5)
presence = c(rbinom(30, 1, 0.6),rbinom(40, 1, 0.8))
df = data.frame(presence, treatment, site)
# Likelihood ratio test
M0 = glmer(presence ~ 1 + (1|site), family = "binomial", data = df)
M1 = glmer(presence ~ treatment + (1|site), family = "binomial", data = df)
anova(M1, M0)
# Calculating confidence intervals
cc <- confint(M1, parm = "beta_")
ctab <- cbind(est = fixef(M1), cc)
cdat = as.data.frame(ctab)
# Function to back-transform to probability (0-1)
unlogit = function(y){
y_retransfromed = exp(y)/(1+exp(y))
y_retransfromed
}
# Getting estimates
A_est = unlogit(cdat$est[1])
B_est = unlogit(cdat$est[1] + cdat$est[2])
B_lwr = unlogit(cdat$est[1] + cdat[2,2])
B_upr = unlogit(cdat$est[1] + cdat[2,3])
Difference_est = B_est - A_est
# This is how I tried to calculate the CI of the difference
Difference_lwr = B_lwr - A_est
Difference_upr = B_upr - A_est
# However, I believe this is wrong because A_est is also “uncertain”
How to get the confidence interval of the difference of the probability of presence?
We can calculate the average treatment effect in the following way. From the original data, create two new datasets, one in which all units receive treatment A, and one in which all units receive treatment B. Now, based on your model estimates (in your case, M1), we compute predicted outcomes for units in each of these two datasets. We then compute the mean difference in the outcomes between the two datasets to get our estimated average treatment effect. Here, we can write a function that takes a glmer object and computes the average treatment effect:
ate <- function(.) {
treat_A <- treat_B <- df
treat_A$treatment <- "A"
treat_B$treatment <- "B"
c("ate" = mean(predict(., newdata = treat_B, type = "response") -
predict(., newdata = treat_A, type = "response")))
}
ate(M1)
# ate
# 0.09478276
How do we get the uncertainty interval? We can use the bootstrap, i.e. re-estimate the model many times using randomly generated samples from your original data, calculating the average treatment effect each time. We can then use the distribution of the bootstrapped average treatment effects to compute our uncertainty interval. Here we generate 100 simulations using the bootMer function
out <- bootMer(M1, ate, seed = 1234, nsim = 100)
and inspect the distribution of the effect:
quantile(out$t, c(0.025, 0.5, 0.975))
# 2.5% 50% 97.5%
# -0.06761338 0.10508751 0.26907504

Resources