I'm trying to run a relatively straightforward glmer model and get a warning that the fit is singular (isSingular), and I can't figure out why.
In my dataset, 40 participants did 108 trials. They responded to a question (the response is coded as correct/incorrect - 0/1) and rated confidence in their response on a continuous scale from 0 to 1.
library(lme4)
library(tidybayes)
library(tidyverse)
set.seed(5)
n_trials = 108
n_subjs = 40
# simulate 40 subjects x 108 trials; note that correct and confidence
# are drawn independently of each other and of subject
data =
  tibble(
    subject = as.factor(rep(c(1:n_subjs), n_trials)),
    correct = sample(c(0, 1), replace = TRUE, size = n_trials * n_subjs),
    confidence = runif(n_trials * n_subjs)
  )
I'm trying to run a mixed-effects logistic regression to estimate each participant's ability to assign high confidence to correct responses only. That is, I have good reason to include a random slope for confidence in my model.
The simplest model that I'm interested in gives me:
model = glmer(correct ~ confidence + (confidence | subject),
              data = data,
              family = binomial)
boundary (singular) fit: see ?isSingular

and

> isSingular(model)
[1] TRUE
So I simplify the model beyond usefulness, and get the same problem:
model = glmer(correct ~ confidence + (1 | subject),
              data = data,
              family = binomial)
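(For reference, one way to inspect what exactly is singular is to look at the estimated random-effect variances; a singular fit typically shows a variance of essentially zero or a correlation of ±1. A minimal check, assuming the model object from above:)

# inspect estimated random-effect variances; near-zero values
# (or correlations of +/-1) indicate the singular dimension
VarCorr(model)
summary(rePCA(model))  # PCA of the random-effects covariance (lme4)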
I tried to bin confidence (I'm sure there are more elegant ways), in case that helped, but it didn't:
# initialize as a vector of 0s
data$confidence_binned <- numeric(dim(data)[1])
nbins = 4
bins = seq(0, 1, length.out = (nbins + 1))
# assign bin b to confidence values in [bins[b], bins[b+1])
for (b in 1:(length(bins) - 1)) {
  data$confidence_binned[data$confidence >= bins[b] & data$confidence < bins[b + 1]] = b
}
# confidence values of exactly 1 fall outside the half-open intervals above,
# so put them in the top bin
data$confidence_binned[data$confidence == 1] = nbins
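(For what it's worth, the same binning can be written more compactly with cut(); a sketch that should be equivalent to the loop above:)

# equivalent binning with cut(): half-open intervals [a, b),
# with confidence == 1 folded into the top bin
data$confidence_binned <- cut(data$confidence, breaks = bins,
                              labels = FALSE, right = FALSE,
                              include.lowest = TRUE)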
model = glmer(correct ~ confidence_binned + (confidence_binned | subject),
              data = data,
              family = binomial)
boundary (singular) fit: see ?isSingular
There are many posts and SO questions about the isSingular warning, but every one I've found says the model is too complex for the data, and the usual advice is to 'keep it maximal'. However, this model is as simple as it can get, and I'm confused that it still fails with what sounds to me like plenty of trials.
I also tried changing the optimizer control settings, but it didn't help:
ctrl = glmerControl(optimizer = "bobyqa",
                    boundary.tol = 1e-5,   # tolerance for declaring a boundary (singular) fit
                    calc.derivs = TRUE,    # gradient/Hessian checks after optimization
                    use.last.params = FALSE,
                    sparseX = FALSE,
                    tolPwrss = 1e-7,       # convergence tolerance for the PIRLS step
                    compDev = TRUE,
                    nAGQ0initStep = TRUE,  # do the initial nAGQ = 0 optimization step
                    ## optimizer args
                    optCtrl = list(maxfun = 1e5))
model <- glmer(correct ~ confidence_binned + (confidence_binned | subject),
               data = data,
               verbose = TRUE,
               control = ctrl,
               family = binomial)
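As a side note, lme4 also provides allFit(), which refits the same model with every available optimizer; if all of them agree on a boundary estimate, the singularity is a property of the data rather than of the optimizer. A minimal sketch, assuming the model object from above:

aa <- allFit(model)   # refit with all available optimizers
summary(aa)$llik      # log-likelihoods should agree closely
summary(aa)$sdcor     # random-effect SD/correlation estimates per optimizer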
Any help or pointers on what to look out for in the data are appreciated.
EDIT to respond to a comment:
The result of ggplot(data, aes(x = subject, y = correct)) + stat_summary(fun.data = mean_cl_normal):
[plot: by-subject means of correct with normal-approximation 95% CIs]
GLMMs with correlated random intercepts and random slopes (a.k.a. the maximal model) are notoriously difficult to fit even with well-behaved data, despite those who advocate for this approach. Unless you see seriously fluctuating by-subject or by-item variance in the random-slope predictors, my best advice is to fit a random-intercepts-only model and see whether it fits better.
For three comprehensive papers with very different views on this subject, see below. The first is the paper most often cited for the maximal approach. The second is by Douglas Bates, the author of the lme4 package, who argues for parsimonious models. The third is a further peer-reviewed paper coauthored by Bates, recommended by Ben Bolker.
Citations:
Maximal Model Perspective: Barr et al., 2013
Parsimonious Model Perspective: Bates et al., 2018
Balancing Type I Error and Power: Matuschek et al., 2017
I'm dealing with a problem that has three parts, each of which I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To address the first two parts, I chose a quasi-Poisson mixed-effects model. Since stats::glm can't include random effects and lme4::glmer doesn't support the quasi-families, I fit glmer(family = "poisson") and then adjusted the standard errors, z statistics and p-values as recommended here and discussed here. So I basically turn a Poisson mixed-effects regression into a quasi-Poisson one "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure for analyzing multiply imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using the mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effects regression. Is it even possible to (A) pool across models based on a quasi-distribution, and (B) get residuals from a pooled object (class "mipo")? I'm not sure. I'm also not sure how to interpret the pooled results for mixed models (I miss the random effects in the pooled output, although I've found this page, which I'm currently working through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example data are here (repre_d_v1 and repre_all_data are stored there), and below is the crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings, but that's because I'm sharing only a modified subset of the data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
  # Pearson dispersion estimate: sum of squared Pearson residuals / residual df
  phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
  qctab = within(as.data.frame(ctab),
                 {`Std. Error` = `Std. Error` * sqrt(phi)  # inflate SEs by sqrt(phi)
                  `z value` = Estimate / `Std. Error`
                  `Pr(>|z|)` = 2 * pnorm(abs(`z value`), lower.tail = FALSE)
                 })
  return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values, one per imputed-data fit
phivec <- vapply(modelMultiple$analyses,
                 function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
                 FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)

ss <- summary(pool(modelMultiple))  # class "mipo" ("mipo.summary")

## adjust the pooled coefficient table with the mean dispersion
qctab <- within(as.data.frame(ss),
                { std.error <- std.error * sqrt(phi_mean)
                  statistic <- estimate / std.error
                  p.value   <- 2 * pnorm(abs(statistic), lower.tail = FALSE)
                })
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...
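If this comes up repeatedly, the same machinery can be wrapped into a helper that mirrors quasi_table() for pooled fits; quasi_pool_table() below is a hypothetical name, not part of mice:

## hypothetical helper: quasi-Poisson adjustment for a mice "mira" object
quasi_pool_table <- function(fits, pooled = summary(pool(fits))) {
  # mean Pearson dispersion across the imputed-data fits
  phivec <- vapply(fits$analyses,
                   function(m) sum(residuals(m, type = "pearson")^2) / df.residual(m),
                   FUN.VALUE = numeric(1))
  phi <- mean(phivec)
  within(as.data.frame(pooled), {
    std.error <- std.error * sqrt(phi)  # inflate SEs by sqrt(phi)
    statistic <- estimate / std.error
    p.value   <- 2 * pnorm(abs(statistic), lower.tail = FALSE)
  })
}

quasi_pool_table(modelMultiple)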
I have a multivariate model with this (approximate) form:
library(MCMCglmm)
mod.1 <- MCMCglmm(
cbind(OFT1, MIS1, PC1, PC2) ~
trait-1 +
trait:sex +
trait:date,
random = ~us(trait):squirrel_id + us(trait):year,
rcov = ~us(trait):units,
family = c("gaussian", "gaussian", "gaussian", "gaussian"),
data= final_MCMC,
prior = prior.invgamma,
verbose = FALSE,
pr=TRUE, #this saves the BLUPs
nitt=103000, #number of iterations
thin=100, #interval at which the Markov chain is stored
burnin=3000)
For publication purposes, I've been asked to report the Gelman-Rubin statistic to indicate that the model has converged.
I have been trying to run:
gelman.diag(mod.1)
But, I get this error:
Error in mcmc.list(x) : Arguments must be mcmc objects
Any suggestions on the proper approach? I assume that the error means I can't pass my mod.1 output through gelman.diag(), but I am not sure what it is I am supposed to put there instead? My knowledge is quite limited here, so I'd appreciate any and all help!
Note that I haven't added the data here, but I suspect the answer is more code syntax and not data related.
gelman.diag() requires an mcmc.list, i.e. more than one chain. If we run the same model several times (so that the chains have different starting values), we can extract the 'Sol' component (the posterior samples of the fixed effects) from each fit and place them in an mcmc.list (below, the same model is simply fitted twice):
library(MCMCglmm)
model1 <- MCMCglmm(PO~1, random=~FSfamily, data=PlodiaPO, verbose=FALSE,
nitt=1300, burnin=300, thin=1)
model2 <- MCMCglmm(PO~1, random=~FSfamily, data=PlodiaPO, verbose=FALSE,
nitt=1300, burnin=300, thin=1 )
mclist <- mcmc.list(model1$Sol, model2$Sol)
gelman.diag(mclist)
# gelman.diag(mclist)
# Potential scale reduction factors:
#
#             Point est. Upper C.I.
# (Intercept)          1          1
According to the documentation, it is applicable when more than one mcmc chain is run:
Gelman and Rubin (1992) propose a general approach to monitoring convergence of MCMC output in which m > 1 parallel chains are run with starting values that are overdispersed relative to the posterior distribution.
The input x here is
x - An mcmc.list object with more than one chain, and with starting values that are overdispersed with respect to the posterior distribution.
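As a follow-up, coda (loaded along with MCMCglmm) also offers a visual version of the same diagnostic, which can be applied to the same list:

library(coda)
gelman.plot(mclist)  # shrink factor vs. iteration for each fixed effect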
I am trying to run code in R (I am very new at this). I was given a very large dataset that I need to use to fit a Poisson GLM of the form log(mu) = beta0 + beta1*xi. Let Yi be the response count for subject i, with xi = 1 for Black subjects and xi = 0 for white subjects.
The dataset can be found at www.stat.ufl.edu/~aa/glm/data.
I loaded the data, and I am having difficulty understanding a model for this.
Here is the code I have so far, but clearly I am missing something.
str(hdata)
head(hdata)
attach(hdata)
hfit = glm(count ~ factor(race), family = poisson(link = log))
summary(hfit)
#plot the model
par(mfrow = c(2,2))
plot(hfit)
#overdispersion test
library(AER)
dispersiontest(hfit, trafo =1)
#goodness of fit test
sum(resid(hfit, type="pearson")^2)
#pvalue
1 - pchisq(2279.873, 1306)
I need help with this model because I can't seem to separate each race, and I think that is what I need to do. When I ran the summary of hfit, I got -2.38 for the intercept and 1.73 for factor(race)1; the AIC was 1122. Also, when I ran the overdispersion test I got c = 0.743, and under equidispersion c would be 0. Am I right? Thank you.
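To the "separate each race" point: with a single binary predictor, the two fitted mean counts can be read off the coefficients, since log(mu) = beta0 when x = 0 and beta0 + beta1 when x = 1. A sketch, assuming the hfit object fitted above:

# back-transform the coefficients to fitted mean counts per race
b <- coef(hfit)
exp(b[1])         # fitted mean count when race = 0 (white)
exp(b[1] + b[2])  # fitted mean count when race = 1 (Black)
# equivalently, via predict():
predict(hfit, newdata = data.frame(race = c(0, 1)), type = "response")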
I am fitting a maximal model to untransformed response times from correct trials, with two two-level centered categorical predictors (Stimulation, Cognate Status) and an orthogonal second-order polynomial over 5 levels (Block). Random effects include the full crossed structure with correlations: 32 subjects, 60 items, a balanced within-subjects design, 12,406 observations. The model converges, but the summary takes an age to process.
The model runs without any convergence issues, but summary() kicks off a memory-intensive process and never finishes compiling/printing the output. I don't have any issues with the summary() function for other objects.
I have included the code for the model for reference.
Max.lmer.RT = lmer(RT ~ StimCent.r*(ot1 + ot2)*CogStatCent.r +
(1 + StimCent.r*(ot1 + ot2)*CogStatCent.r | PID) +
(1 + StimCent.r*(ot1 + ot2) | DutchName:CogStatCent.r),
data = TDL.cent.RT, REML = FALSE, control =
lmerControl(optimizer = "nloptwrap2", optCtrl =
list(maxfun = 100000)))
summary(Max.lmer.RT)
A fix or suggestions on what might be causing this would be much appreciated.
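One thing that may be worth trying (an educated guess, not a verified fix): with a random-effects structure this large, part of the work in displaying the summary goes into the correlation matrix of the fixed effects, which lme4's print method can skip:

# print the summary without the fixed-effect correlation table
print(summary(Max.lmer.RT), correlation = FALSE)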
Disregarding how "important" it is, I am interested in estimating how much of the variance is attributable to a single fixed effect (be it a main effect or an interaction term).
As a quick thought, I imagined that constructing a linear model for the predicted values of the mixed model (without the random effects) and assessing the ANOVA table would provide an estimate (yes, the residual variance will then be zero, but we know(?) this from the mixed model). However, from playing around, apparently not.
Where is the flaw in my reasoning? Or did I do something wrong along the way? Is there an alternative method?
Disclaimer: I know some people have suggested looking at the change in residual variance when removing/adding fixed effects, but as this does not take into account the correlation between fixed and random effects, I am not interested.
data(Orthodont,package="nlme")
Orthodont = na.omit(Orthodont)
#Fitting a linear mixed model
library(lme4)
mod = lmer(distance ~ age*Sex + (1|Subject) , data=Orthodont)
# Predicting across all observed values,
pred.frame = expand.grid(age = seq(min(Orthodont$age, na.rm = T),max(Orthodont$age, na.rm=T)),
Sex = unique(Orthodont$Sex))
# But not including random effects
pred.frame$fit = predict(mod, newdata = pred.frame, re.form=NA)
anova(lm(fit~age*Sex, data = pred.frame))
library(data.table)
Orthodont = data.table(Orthodont)
# to test the validity of the approach
# by estimating a linear model using a random observation
# per individual and look at the means
tmp = sapply(1:500, function(x){
  print(x)
  # fit an lm on one randomly sampled observation per subject,
  # and keep the sums of squares
  as.matrix(anova(lm(distance ~ age*Sex, data = Orthodont[, .SD[sample(1:.N, 1)], "Subject"])))[, 2]
})
# These are clearly not similar
prop.table(as.table(rowMeans(tmp)[-4]))
       age        Sex    age:Sex
0.60895615 0.31874622 0.07229763

> prop.table(as.table(anova(lm(fit~age*Sex, data = pred.frame))[1:3,2]))
         A          B          C
0.52597575 0.44342996 0.03059429
# (A, B and C correspond to the age, Sex and age:Sex rows of the ANOVA table)