Variable sample size per cluster/group in mixed effects logistic regression - r

I am attempting to run mixed effects logistic regression models, but I am concerned about the variable sample sizes in each cluster/group, and also the very low number of "successes" in some models.
I have ~ 700 trees distributed across 163 field plots (i.e., the cluster/group), visited annually from 2004 to 2011. I am fitting separate mixed effects logistic regression models (hereafter GLMMs) for each year of the study to compare this output with inference from a shared frailty model (i.e., a survival analysis with a random effect).
The number of trees per plot varies from 1 to 22. Also, some years have a very low number of "successes" (i.e., diseased trees). For example, in 2011 there were only 4 "successes" against 694 "failures" (i.e., healthy trees).
My questions are: (1) is there a general rule for the ideal number of samples per group when the inference focus is only on estimating the fixed effects in the GLMM, and (2) are GLMMs stable when there is such an extreme imbalance in the ratio of successes to failures?
Thank you for any advice or suggestions of sources.
-Sarah

(Hi, Sarah, sorry I didn't answer previously via e-mail ...)
It's hard to answer these questions in general -- you're stuck
with your data, right? So it's not a question of power analysis.
If you want to make sure that your results will be reasonably
reliable, probably the best thing to do is to run some simulations.
I'm going to show off a fairly recent feature of lme4 (in the
development version 1.1-1, on Github), which is to simulate
data from a GLMM given a formula and a set of parameters.
First I have to simulate the predictor variables (you wouldn't
have to do this, since you already have the data -- although
you might want to try varying the range of number of plots,
trees per plot, etc.).
set.seed(101)
## simulate number of trees per plot
## want mean of 700/163 = 4.3 trees, range 1-22
## by trial and error this is about right
r1 <- rnbinom(163, mu = 3.3, size = 2) + 1
## generate plots and trees within plots
d <- data.frame(plot = factor(rep(1:163, r1)),
                tree = factor(unlist(lapply(r1, seq))))
## expand by year
library(plyr)
d2 <- ddply(d, c("plot", "tree"),
            transform, year = factor(2004:2011))
Now set up the parameters: I'm going to assume year is a fixed
effect and that overall disease incidence is plogis(-2)=0.12 except
in 2011 when it is plogis(-2-3)=0.0067. The among-plot standard deviation
is 1 (on the logit scale), as is the among-tree-within-plot standard
deviation:
beta <- c(-2,0,0,0,0,0,0,-3)
theta <- c(1,1) ## sd by plot and plot:tree
Now simulate: year as fixed effect, plot and tree-within-plot as
random effects
library(lme4)
s1 <- simulate(~ year + (1 | plot/tree), family = binomial,
               newdata = d2, newparams = list(beta = beta, theta = theta))
d2$diseased <- s1[[1]]
Summarize/check:
d2sum <- ddply(d2, c("year", "plot"),
               summarise,
               n = length(tree),
               nDis = sum(diseased),
               propDis = nDis/n)
library(ggplot2)
library(Hmisc)  ## for mean_cl_boot
theme_set(theme_bw())
ggplot(d2sum, aes(x = year, y = propDis)) +
    geom_point(aes(size = n), alpha = 0.3) +
    stat_summary(fun.data = mean_cl_boot, colour = "red")
Now fit the model:
g1 <- glmer(diseased ~ year + (1 | plot/tree), family = binomial,
            data = d2)
fixef(g1)
You can try this many times and see how often the results are reliable ...
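For example, here is a rough sketch of wrapping that simulate/refit cycle in a loop (nsim = 20, the suppressWarnings() wrapper, and the focus on the 2011 coefficient are arbitrary choices on my part, not part of the original answer):
nsim <- 20
res <- matrix(NA, nsim, length(beta))
for (i in seq_len(nsim)) {
    si <- simulate(~ year + (1 | plot/tree), family = binomial,
                   newdata = d2, newparams = list(beta = beta, theta = theta))
    d2$diseased <- si[[1]]
    fit <- suppressWarnings(glmer(diseased ~ year + (1 | plot/tree),
                                  family = binomial, data = d2))
    res[i, ] <- fixef(fit)
}
summary(res[, 8])  ## column 8 is the year2011 contrast (true value -3)
The spread of these estimates around beta gives a sense of how much to trust any single fit.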

As Josh said, this is a better question for Cross Validated.
There are no hard and fast rules for logistic regression, but one rule of thumb is that 10 successes and 10 failures are needed per cell in the design (the cluster, in this case), times the number of continuous variables in the model.
In your case, I would expect the model, if it converges at all, to be unstable. You can examine that by bootstrapping the standard errors of the fixed-effect estimates.
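For instance, a minimal sketch of a parametric bootstrap with lme4::bootMer, assuming a fitted glmer model such as g1 above (nsim = 200 is arbitrary):
b1 <- bootMer(g1, FUN = fixef, nsim = 200)
apply(b1$t, 2, sd)  ## bootstrap standard errors of the fixed-effect estimates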

Related

Justifying need for zero-inflated model in GLMMs

I am using GLMMs in R to examine the influence of continuous predictor variables (x) on biological count variables (y). My response variables (n=5) each have a high number of zeros, so I have tested the fit of various distributions (genpois, poisson, nbinom1, nbinom2, zip, zinb1, zinb2) and selected the best-fitting one according to the lowest AIC/logLik value.
According to this selection criterion, the three response variables with the highest number of zeros are best fit by the zero-inflated negative binomial (zinb2) distribution. Compared to the regular (non-zero-inflated) NB distribution, the delta AIC is between 30 and 150.
My question is: must I use the ZI models for these variables, given the dAIC? I have received advice from a statistician that if the dAIC between the ZI and non-ZI models is small enough, one should use the non-ZI model even if it is a marginally worse fit, since ZI models involve much more complicated modelling and interpretation. The distribution matters in this case because the ZINB and NB models select a different combination of top candidate models when testing my predictors.
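(For reference, the comparison described above might look roughly like the following. glmmTMB is an assumption based on the family names listed; y, x, site, and dat are placeholder names.)
library(glmmTMB)
m_nb   <- glmmTMB(y ~ x + (1 | site), family = nbinom2, data = dat)
m_zinb <- glmmTMB(y ~ x + (1 | site), family = nbinom2,
                  ziformula = ~ 1, data = dat)
AIC(m_nb, m_zinb)  ## the delta AIC between the two fits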
Thank you for any clarification!

Extracting linear term from a polynomial predictor in a GLM

I am relatively new to both R and Stack Overflow, so please bear with me. I am currently using GLMs to model ecological count data under a negative binomial distribution in brms. Here is my general model structure, which I have chosen based on fit, convergence, low LOOIC compared to other models, etc.:
My goal is to characterize population trends of study organisms over the study period. I have created marginal effects plots by using the model to predict on a new dataset where all covariates are constant except year (shaded areas are 80% and 95% credible intervals for posterior predicted means):
I am now hoping to extract trend magnitudes that I can report and compare across species (i.e., say that a certain species declined or increased by x% (+/- y%) per year). Because I use poly() in the model, my understanding is that R uses orthogonal polynomials, and the resulting polynomial coefficients are not easily interpretable. I have tried generating raw polynomials (setting raw=TRUE in poly()), which I thought would produce the same fit with directly interpretable coefficients. However, the resulting models don't really run (after 5 hours neither chain gets through even a single iteration, whereas the same model with raw=FALSE takes only a few minutes). Very simplified versions of the model (e.g. count ~ poly(year, 2, raw=TRUE)) do run, but take several orders of magnitude longer than with raw=FALSE, and the resulting model also predicts different counts than the model with orthogonal polynomials. My questions are: (1) what is going on here? and (2) more broadly, how can I feasibly extract the linear term of the quartic polynomial describing the response to year, or otherwise get at a value corresponding to population trend?
I feel like this should be relatively simple, and I apologize if I'm overlooking something obvious. Please let me know if there is further code I should share for more clarity; I didn't want to make the initial post overly long, but I am happy to show specific predictions from different models or anything else. Thank you for any help.
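As a small check of the premise about raw vs. orthogonal polynomials, using lm() on toy data rather than brms: the two parameterizations give identical fitted values, only the coefficients differ.
set.seed(1)
year  <- rep(2010:2020, each = 5)
count <- rpois(length(year), exp(1 + 0.05 * (year - 2015)))
f_orth <- lm(count ~ poly(year, 2))
f_raw  <- lm(count ~ poly(year, 2, raw = TRUE))
all.equal(fitted(f_orth), fitted(f_raw))  ## TRUE: same fit, different coefficients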

Type of regression for large dataset, nonlinear, skewed in R

I'm researching moth biomass in different biotopes, and I want to find a model that estimates the biomass. I have measured the length and width of the forewing, abdomen and thorax of 37,088 specimens, and I have weighed each of them individually (dried).
First, I wanted to do a simple linear regression of each variable on the biomass. The problem is that none of the assumptions are met: the data are not linear, biomass (and some of the variables) do not follow a normal distribution, there is heteroskedasticity, and there are a lot of outliers. I have tried transforming my data using log, x^2, 1/x, and Box-Cox, but none of them actually helped. I have also tried Theil-Sen regression (not possible because of too much data) and Siegel regression (biomass is not a vector). Is there some other form of non-parametric or median-based regression that I can try? I am really out of ideas.
Here is a frequency histogram for biomass:
[Figure: frequency histogram of the dry biomass]
So what I actually want to do is build a model that accurately estimates the dry biomass based on the measurements I performed. I have a power function (Rogers et al.) that is general for all insects, but there is a significant difference between this estimate and what I actually weighed. Therefore, I just want to build a model with all significant variables. I am not very familiar with power functions, but maybe it is possible to build one myself? Can anyone recommend a method? Thanks in advance.
To fit a power function, you could perhaps try nlsLM from the minpack.lm package:
library(minpack.lm)
## start values are needed for a and b; a = 1, b = 1 is just a generic default
m <- nlsLM(y ~ a * x^b, data = your.data.here, start = list(a = 1, b = 1))
Then see if it performs satisfactorily.
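One possible refinement (my suggestion, not part of the original answer): better starting values can be obtained from a linear fit on the log-log scale, which usually helps nlsLM converge. The names biomass, wing.length, and moths below are placeholders:
lin <- lm(log(biomass) ~ log(wing.length), data = moths)
start.vals <- list(a = unname(exp(coef(lin)[1])), b = unname(coef(lin)[2]))
m <- nlsLM(biomass ~ a * wing.length^b, data = moths, start = start.vals)
summary(m)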

R language, how to use bootstraps to generate maximum likelihood and AICc?

Sorry for a rather basic question. I am doing multiple comparisons of morphological traits through correlations of bootstrapped data. I'm curious whether such multiple comparisons affect my level of inference, and also about the effect of potential multicollinearity in my data. Perhaps a reasonable option would be to use my bootstraps to generate maximum likelihoods and then AICc values to compare all of my parameters, to see which come out as most important. The problem is that although I have the approach (more or less) clear, I don't know how to implement it in R. Could anybody be so kind as to throw some light on this for me?
So far, here is an example (using R, but not my data):
library(boot)
data(iris)
head(iris)
# The bootstrap statistic: Pearson correlation plus the two medians
pearson <- function(data, indices){
    dt <- data[indices, ]
    c(
        cor(dt[, 1], dt[, 2], method = 'p'),
        median(dt[, 1]),
        median(dt[, 2])
    )
}
# One example: iris$Sepal.Length ~ iris$Sepal.Width
# I calculate the correlation with 1000 replications
set.seed(12345)
dat <- iris[, c(1, 2)]
dat <- na.omit(dat)
results <- boot(dat, statistic = pearson, R = 1000)
# 95% CIs
boot.ci(results, type = "bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = "bca")
Intervals :
Level BCa
95% (-0.2490, 0.0423 )
Calculations and Intervals on Original Scale
plot(results)
I have several more pairs of comparisons.
More of a Cross Validated question.
Multicollinearity shouldn't be a problem if you're just assessing the relationship between two variables (in your case, a correlation). Multicollinearity only becomes an issue when you fit a model with several highly correlated predictors, e.g. multiple regression.
Multiple comparisons are always a problem, though, because they inflate your type-I error rate. The way to address that is to apply a multiple comparison correction, e.g. Bonferroni-Holm or the less conservative FDR. That can have its downsides, especially if you have a lot of predictors and few observations: it may lower your power so much that you won't be able to find any effect, no matter how big it is.
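In base R that correction is a one-liner (assuming a vector pvals holding the correlation p-values):
p.adjust(pvals, method = "holm")  # Bonferroni-Holm
p.adjust(pvals, method = "BH")    # FDR (Benjamini-Hochberg)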
In a high-dimensional setting like this, your best bet may be some sort of regularized regression method. With regularization, you put all predictors into your model at once, similarly to multiple regression; the trick is that you constrain the model so that all of the regression slopes are pulled towards zero, and only the ones with big effects "survive". The machine-learning versions of regularized regression are called ridge, LASSO, and elastic net, and they can be fitted using the glmnet package. There are also Bayesian equivalents, the so-called shrinkage priors, such as the horseshoe (see e.g. https://avehtari.github.io/modelselection/regularizedhorseshoe_slides.pdf). You can fit Bayesian regularized regression using the brms package.
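A minimal sketch of the glmnet route, with toy data standing in for your trait matrix and response:
library(glmnet)
set.seed(1)
X <- matrix(rnorm(50 * 10), ncol = 10)  # toy predictor matrix (your traits)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(50)  # toy response
cvfit <- cv.glmnet(X, y, alpha = 1)     # alpha = 1 gives the LASSO
coef(cvfit, s = "lambda.1se")           # slopes that "survive" the shrinkage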

estimating density in a multidimensional space with R

I have two types of individuals, say M and F, each described by six variables (forming a 6-D space S). I would like to identify the regions of S where the densities of M and F differ the most. I first tried a binomial logistic model linking F/M to the six variables, but the result of this GLM is very hard to interpret (in part because of the numerous significant interaction terms). I am therefore considering a "spatial" analysis in which I would separately estimate the density of M and F individuals everywhere in S and then calculate the difference in densities. Finally, I would manually look for the largest difference in densities and extract the values of the six variables there.
I found the function sm.density in the sm package, which can estimate densities in up to three dimensions, but I have found nothing for a space with n > 3. Do you know of something in R that could do this? Alternatively, do you have a more elegant method to answer my original question (second sentence)?
Thanks a lot in advance for your help.
The function kde in the ks package performs kernel density estimation for multivariate data with dimensions ranging from 1 to 6.
The pdfCluster and np packages provide functions to perform kernel density estimation in higher dimensions.
If you prefer parametric techniques, you can look at R packages that perform Gaussian mixture estimation, such as mclust or mixtools.
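A minimal sketch with ks::kde, using toy data in place of the six measured variables; evaluating each density at the observed points avoids building a full 6-D grid:
library(ks)
set.seed(1)
X   <- matrix(rnorm(600 * 6), ncol = 6)        # toy stand-in for the 6 variables
sex <- factor(sample(c("M", "F"), 600, TRUE))  # toy stand-in for the M/F label
## each group's density estimated at every observed point
fM <- kde(X[sex == "M", ], eval.points = X)$estimate
fF <- kde(X[sex == "F", ], eval.points = X)$estimate
dens.diff <- fM - fF
## the observed points where the two densities differ most
head(X[order(-abs(dens.diff)), ])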
The ability to do this with GLM models may be constrained both by the interpretability issues you have already encountered and by numerical stability issues. Furthermore, you don't describe the GLM models, so it's not possible to see whether you included any consideration of non-linearity. If you have lots of data, you might consider using 2-D crossed spline terms. (These are not really density estimates.) If I were doing initial exploration with the facilities in the rms/Hmisc packages in five dimensions, it might look like this:
library(rms)
dd <- datadist(dat)
options(datadist = "dd")
big.mod <- lrm( MF ~ ( rcs(var1, 3) +   # `lrm` is logistic regression in rms
                       rcs(var2, 3) +
                       rcs(var3, 3) +
                       rcs(var4, 3) +
                       rcs(var5, 3) )^2, # all 2-way interactions
                data = dat,
                maxit = 50)              # these fits can take a long time
bplot( Predict(big.mod, var1, var2, np = 10) )
That should show the simultaneous functional form of var1's and var2's contributions to the "5-dimensional" model estimates at 10 points each, with the three other variables held at their median values.
