Zero-inflated negative binomial model in R: Computationally singular

I have been comparing Poisson, negative binomial (NB), and zero-inflated Poisson and NB models in R. My dependent variable is a symptom count for generalized anxiety disorder (GAD), and my predictors are two personality traits (disinhibition [ZDis_winz] and meanness [ZMean_winz]), their interaction, and covariates of age and assessment site (dummy-coded; there are 8 sites so I have 7 of these dummy variables). I have a sample of 1206 with full data (and these are the only individuals included in the data frame).
I am using NB models for this disorder because the variance (~40) far exceeds the mean (~4). I wanted to consider the possibility of a ZINB model as well, given that ~30% of the sample has 0 symptoms.
For other symptom counts (e.g., conduct disorder), I have run ZINB models perfectly fine in R, but I am getting an error when I do the exact same thing with the GAD model. The standard NB model works fine for GAD; it is only the GAD ZINB model that's erroring out.
Here is the error I'm receiving:
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 4.80021e-36
Here is the code I'm using for the (working) NB model:
library(MASS)  # provides glm.nb

summary(
  NB_GAD_uw_int <- glm.nb(
    # ZMean_winz*ZDis_winz expands to both main effects plus their interaction
    dawbac_bl_GAD_sxs_uw ~ ZMean_winz*ZDis_winz + age_years +
      Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy +
      Mannheim_dummy + Paris_dummy + Dresden_dummy,
    data = eurodata))
Here is the code I'm using for the (not working) ZINB model (which is identical to other ZINB models I've run for other disorders):
library(pscl)  # provides zeroinfl

summary(
  ZINB_GAD_uw_int <- zeroinfl(
    dawbac_bl_GAD_sxs_uw ~ ZMean_winz*ZDis_winz + age_years +
      Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy +
      Mannheim_dummy + Paris_dummy + Dresden_dummy,
    data = eurodata,
    dist = "negbin",
    model = TRUE,
    y = TRUE, x = TRUE))
I have seen a few other posts on StackOverflow and other forums about this type of issue. As far as I can tell, people generally say it is an issue of either 1) collinear predictors or 2) too complex a model for too little data. (Please let me know if I am misinterpreting this! I'm fairly new to Poisson-based models.) However, I am still confused by these answers, because: 1) none of my predictors are correlated above .15, except for the main predictors of interest (ZMean_winz and ZDis_winz), which correlate about .45, and the same predictors are used in other ZINB models that have worked; and 2) with 1206 participants, and having run the same ZINB model on similarly distributed count data for other disorders, I don't see how this could be too complex a model for my data.
If anyone has any explanation for why this version of my model will not run and/or any suggestions for troubleshooting, I would really appreciate it! I am also happy to provide more info if needed.
Thank you so much!

The problem may be that zeroinfl is not converting categorical variables into dummy variables.
You can dummify your variables using model.matrix, which is what glm, glm.nb, etc. call internally to encode categorical variables. Letting model.matrix do this is usually preferred over dummifying by hand: it avoids mistakes and helps ensure that your model matrix has full rank (a full-rank matrix is non-singular).
If you do want explicit dummy columns, use model.matrix to transform your input data involving categorical variables (and potentially interactions between categorical variables and other variables) into the correct model matrix.
Here is an example:
set.seed(2017)
df <- data.frame(
  DV = rnorm(100),
  IV1_num = rnorm(100),
  IV2_cat = sample(c("catA", "catB", "catC"), 100, replace = TRUE))
head(df)
# DV IV1_num IV2_cat
#1 1.43420148 0.01745491 catC
#2 -0.07729196 1.37688667 catC
#3 0.73913723 -0.06869535 catC
#4 -1.75860473 0.84190898 catC
#5 -0.06982523 -0.96624056 catB
#6 0.45190553 -1.96971566 catC
mat <- model.matrix(DV ~ IV1_num + IV2_cat, data = df)
head(mat)
# (Intercept) IV1_num IV2_catcatB IV2_catcatC
#1 1 0.01745491 0 1
#2 1 1.37688667 0 1
#3 1 -0.06869535 0 1
#4 1 0.84190898 0 1
#5 1 -0.96624056 1 0
#6 1 -1.96971566 0 1
The manually dummified input data would then be
df.dummified <- cbind.data.frame(DV = df$DV, mat[, -1])
# DV IV1_num IV2_catcatB IV2_catcatC
#1 1.43420148 0.01745491 0 1
#2 -0.07729196 1.37688667 0 1
#3 0.73913723 -0.06869535 0 1
#4 -1.75860473 0.84190898 0 1
#5 -0.06982523 -0.96624056 1 0
#6 0.45190553 -1.96971566 0 1
which you'd use in e.g.
glm.nb(DV ~ ., data = df.dummified)
(with a real count outcome; the Gaussian toy DV here only illustrates the dummification, since glm.nb itself requires non-negative counts).
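To bring this back to the question: you can apply the same idea to check whether the design matrix behind your ZINB model is full rank. A sketch (eurodata and the variable names are taken from the question, so adjust to match how your data are actually stored):
# Build the model matrix for the count-model formula and check its rank.
X <- model.matrix(
  ~ ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy +
    Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
  data = eurodata)
qr(X)$rank == ncol(X)  # FALSE indicates a rank-deficient (singular) design
kappa(X)               # a very large condition number also signals near-singularity
If the matrix is full rank, another thing worth trying is simplifying the zero-inflation component: by default zeroinfl uses the full regressor set for both the count part and the zero part, roughly doubling the number of parameters, and a reduced zero model can be specified after a | separator in the formula (e.g. dawbac_bl_GAD_sxs_uw ~ ZMean_winz*ZDis_winz + ... | 1).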

Related

Plotting multiple probabilities of a logistic multilevel model

I need to estimate and plot a logistic multilevel model. I've got the binary dependent variable employment status (empl) (0 = unemployed; 1 = employed), the continuous independent variable level of internet connectivity (isoc), and I need to include random effects (intercept and slope) grouped by education level (educ) (1 = low-skilled worker; 2 = middle-skilled; 3 = high-skilled). I also have some control variables I'm not going to mention here. I'm using the glmer function from the lme4 package. Here is a sample data frame and my (simplified) code:
library(lme4)
library(lmerTest)
library(tidyverse)
library(dplyr)
library(sjPlot)
library(moonBook)
library(sjmisc)
library(sjlabelled)
set.seed(1212)
d <- data.frame(
  empl = c(1,1,1,0,1,0,1,1,0,1,1,1,0,1,0,1,1,1,1,0),
  isoc = runif(20, min = -0.2, max = 0.2),
  educ = sample(1:3, 20, replace = TRUE))
Results:
empl isoc educ
1 1 0.078604108 1
2 1 0.093667591 3
3 1 -0.061523272 2
4 0 0.009468908 3
5 1 -0.169220134 2
6 0 -0.038594789 3
7 1 0.170506490 1
8 1 -0.098487991 1
9 0 0.073339737 1
10 1 0.144211813 3
11 1 -0.133510687 1
12 1 -0.045306606 3
13 0 0.124211903 3
14 1 -0.003908486 3
15 0 -0.080673475 3
16 1 0.061406993 3
17 1 0.015401951 2
18 1 0.073351501 2
19 1 0.075648137 2
20 0 0.041450192 1
Fit:
m <- glmer(empl ~ isoc + (1 + isoc | educ),
data=d,
family=binomial("logit"),
nAGQ = 0)
summary(m)
Now the question: I'm looking for a plot with three graphs, one graph for each educ level, showing probabilities (values between 0 and 1). Here is a sample image from the web (image not included here):
Below is my (simplified) code for the plot. But it only produces crap I cannot interpret.
plot_model(m, type="pred",
terms=c("isoc [all]", "educ"),
show.data=TRUE)
There is one thing I can do to get a "kind-of" right plot but I have to alter the model above in a way I think it's wrong (keyword: multicollinearity). Additionally I don't think the three graphs of this plot are correct either. The modified model looks like this:
m <- glmer(empl ~ isoc + educ + (1 + isoc | educ),
data=d,
family=binomial("logit"),
nAGQ = 0)
summary(m)
I appreciate any help! I think my problem resembles this question, but that one has no answer yet, and my reputation is too low to comment.
I think you want
plot_model(m, type="pred",
pred.type = "re",
terms = c("isoc[n=100]","educ"), show.data = TRUE)
pred.type = "re" takes the random effects into account when making predictions
isoc[n=100] uses 100 distinct values across the range of isoc - this is better than making predictions only at the observed values of isoc, which is what [all] specifies
For the example you've given the prediction lines are all on top of each other (because the fit is singular/the random-effects variance is effectively zero), but that's presumably because your sample data set is so small.
For what it's worth, although this is a perfectly well-posed programming problem, I would not recommend treating educ as a random effect (see the sketch after this list):
the number of levels is impractically small
the levels are not exchangeable (i.e. it wouldn't make sense to relabel "high-skilled" as "low-skilled").
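A minimal sketch of the fixed-effect alternative (my addition, not code from the original answer; with educ fixed and no other grouping variable in this toy data set, a plain glm is the natural fit):
# Treat educ as an ordinary (fixed) factor; plot_model works the same way.
d$educ_f <- factor(d$educ)
m_fixed <- glm(empl ~ isoc * educ_f, data = d, family = binomial("logit"))
plot_model(m_fixed, type = "pred",
           terms = c("isoc[n=100]", "educ_f"), show.data = TRUE)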
Feel free to ask more questions about your model setup/definition on CrossValidated

Logistic regression with proportions in R (dependent variable is not binary). What is R doing?

So I stumbled across the following code:
#Importing the data:
seeds.df <-
read.table('http://www.uib.no/People/nzlkj/statkey/data/seeds.csv',header=T)
attach(seeds.df)
#Making a plot of seeds eaten depending on seed type:
plot(Seed.type, Eaten)
#Testing the hypothesis:
fit1.glm <- glm(cbind(Eaten,Not.eaten)~Seed.type, binomial)
summary(fit1.glm)
From https://folk.uib.no/nzlkj/statkey/logistic.html#proportions
which provides a method for doing logistic regression on proportion data.
My question is: what is R actually doing mathematically? The response variable is two columns, and as far as I knew, logistic regression is supposed to be performed on a binary dependent variable.
Is R creating a new response variable of length Eaten + Not.eaten, populated by rep(1, Eaten) and rep(0, Not.eaten), and performing logistic regression on that?
e.g. for row 1 in seeds.df, Eaten = 2 and Not.eaten = 48:
row#  eaten.or.not  seed.type  Hamster
1     1             B          1
2     1             B          1
3     0             B          1
4     0             B          1
...
50    0             B          1
then R would do glm(eaten.or.not ~ seed.type, family = 'binomial')
I tested the above and it didn't produce the same answer
or is R modeling the proportion directly:
ln(prob of being eaten / (1 - prob of being eaten)) = intercept + B1 * seed.type
I also tested this and I got something different, but I'm not sure I did it correctly.
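For reference, here is one way the expansion in the first description can be built (a sketch; constructed this way, the two fits should give identical coefficient estimates and standard errors, with only the deviance and residual df differing):
seeds.df <- read.table("http://www.uib.no/People/nzlkj/statkey/data/seeds.csv",
                       header = TRUE)
# Expand each row into Eaten ones and Not.eaten zeros.
expanded <- data.frame(
  eaten.or.not = unlist(mapply(function(e, ne) c(rep(1, e), rep(0, ne)),
                               seeds.df$Eaten, seeds.df$Not.eaten,
                               SIMPLIFY = FALSE)),
  Seed.type = rep(seeds.df$Seed.type,
                  times = seeds.df$Eaten + seeds.df$Not.eaten))
fit.agg <- glm(cbind(Eaten, Not.eaten) ~ Seed.type, binomial, data = seeds.df)
fit.bin <- glm(eaten.or.not ~ Seed.type, binomial, data = expanded)
coef(fit.agg)
coef(fit.bin)  # identical point estimates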
Anyway, if someone could shed light on what R is doing mathematically for logistic regression with a proportion, that would be great.
Thank you for your time.

Repeated measures ANOVA and link to mixed-effect models in R

I have a problem when performing a two-way repeated-measures ANOVA in R on the following data (link: https://drive.google.com/open?id=1nIlFfijUm4Ib6TJoHUUNeEJnZnnNzO29):
subjectnbr is the id of the subject, blockType and linesTTL are the independent variables, and RT2 is the dependent variable.
I first performed the rm ANOVA through using ezANOVA with the following code:
ANOVA_RTS <- ezANOVA(
data=castRTs
, dv=RT2
, wid=subjectnbr
, within = .(blockType,linesTTL)
, type = 2
, detailed = TRUE
, return_aov = FALSE
)
ANOVA_RTS
The result is correct (I double-checked using Statistica).
However, when I perform the rm ANOVA using the lme function, I do not get the same answer, and I have no clue why. Here is my code:
library(nlme)
lmeRTs <- lme(
  RT2 ~ blockType*linesTTL,
  random = ~1|subjectnbr/blockType/linesTTL,
  data = castRTs)
anova(lmeRTs)
Here are the outputs of both ezANOVA and lme (not reproduced here).
I hope I have been clear enough and have given you all the information needed.
I'm looking forward to your help, as I have been trying to figure this out for at least 4 hours!
Thanks in advance.
Here is a step-by-step example on how to reproduce ezANOVA results with nlme::lme.
The data
We read in the data and ensure that all categorical variables are factors.
# Read in data
library(tidyverse);
df <- read.csv("castRTs.csv");
df <- df %>%
mutate(
blockType = factor(blockType),
linesTTL = factor(linesTTL));
Results from ezANOVA
As a check, we reproduce the ez::ezANOVA results.
## ANOVA using ez::ezANOVA
library(ez);
model1 <- ezANOVA(
data = df,
dv = RT2,
wid = subjectnbr,
within = .(blockType, linesTTL),
type = 2,
detailed = TRUE,
return_aov = FALSE);
model1;
# $ANOVA
# Effect DFn DFd SSn SSd F p
#1 (Intercept) 1 13 2047405.6654 34886.767 762.9332235 6.260010e-13
#2 blockType 1 13 236.5412 5011.442 0.6136028 4.474711e-01
#3 linesTTL 1 13 6584.7222 7294.620 11.7348665 4.514589e-03
#4 blockType:linesTTL 1 13 1019.1854 2521.860 5.2538251 3.922784e-02
# p<.05 ges
#1 * 0.976293831
#2 0.004735442
#3 * 0.116958989
#4 * 0.020088855
Results from nlme::lme
We now run nlme::lme
## ANOVA using nlme::lme
library(nlme);
model2 <- anova(lme(
RT2 ~ blockType * linesTTL,
random = list(subjectnbr = pdBlocked(list(~1, pdIdent(~blockType - 1), pdIdent(~linesTTL - 1)))),
data = df))
model2;
# numDF denDF F-value p-value
#(Intercept) 1 39 762.9332 <.0001
#blockType 1 39 0.6136 0.4382
#linesTTL 1 39 11.7349 0.0015
#blockType:linesTTL 1 39 5.2538 0.0274
Results/conclusion
We can see that the F test results from both methods are identical. The somewhat complicated structure of the random effect definition in lme arises from the fact that you have two crossed random effects. Here "crossed" means that for every combination of blockType and linesTTL there exists an observation for every subjectnbr.
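For completeness, the same crossed structure can be written in lme4 syntax (a sketch, not part of the original answer; lmerTest's Satterthwaite dfs may differ slightly from the lme output above):
library(lme4)
library(lmerTest)
model3 <- lmer(RT2 ~ blockType * linesTTL +
                 (1 | subjectnbr) + (1 | subjectnbr:blockType) +
                 (1 | subjectnbr:linesTTL),
               data = df)
anova(model3)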
Some additional (optional) details
To understand the role of pdBlocked and pdIdent, we need to take a look at the corresponding two-level mixed-effects model.
The predictor variables are your categorical variables blockType and linesTTL, which are generally encoded using dummy variables.
The variance-covariance matrix for the random effects can take different forms, depending on the underlying correlation structure of your random effect coefficients. To be consistent with the assumptions of a two-level repeated measure ANOVA, we must specify a block-diagonal variance-covariance matrix pdBlocked, where we create diagonal blocks for the offset ~1, and for the categorical predictor variables blockType pdIdent(~blockType - 1) and linesTTL pdIdent(~linesTTL - 1), respectively. Note that we need to subtract the offset from the last two blocks (since we've already accounted for the offset).
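To see what this structure buys you, inspect the estimated variance components of the fitted model (a small sketch; model2 above already wraps the fit in anova(), so we refit without it):
lme_fit <- lme(
  RT2 ~ blockType * linesTTL,
  random = list(subjectnbr = pdBlocked(list(~1,
                                            pdIdent(~blockType - 1),
                                            pdIdent(~linesTTL - 1)))),
  data = df)
VarCorr(lme_fit)  # one variance for subject intercepts, one per crossed-factor block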
Some relevant/interesting resources
Pinheiro and Bates, Mixed-Effects Models in S and S-PLUS, Springer (2000)
Potvin and Schutz, Statistical power for the two-factor repeated measures ANOVA, Behavior Research Methods, Instruments & Computers, 32, 347-356 (2000)
Deming Mi, How to understand and apply mixed-effect models, Department of Biostatistics, Vanderbilt University

Adding a random coefficient for an interaction term in a GLMM using lmer() in R

I am trying to build a model with nested random effects and a random coefficient for an interaction term using lme4 in R.
As seen in the created data below, I have a binary Response and two explanatory variables: Time is continuous and Binary is a factor. These data are taken from 6 individuals (AAA:FFF) in three StudyAreas (CO, UT, MT). Because individuals only occur at one StudyArea, IndID is nested within StudyArea.
#Make data
Response <- as.factor(round(runif(150, 0, 1)))
Time <- round(runif(150, 2, 50))
Binary <- as.factor(round(runif(150, 0, 1)))  # a factor, per the description above
IndID <- as.factor(rep(c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF"), 25))
StudyArea <- as.factor(rep(c("CO", "UT", "MT"), 50))
Data <- data.frame(Response, Time, Binary, IndID, StudyArea)
head(Data)
Response Time Binary IndID StudyArea
1 0 44 1 AAA CO
2 1 16 0 BBB UT
3 1 43 0 CCC MT
4 0 13 1 DDD CO
5 0 34 1 EEE UT
6 1 10 1 FFF MT
Because I want to account for differences across IndID and also StudyArea, I have included both terms as random effects, with adjustments to the intercept, in the model below.
require(lme4)
# A binary response calls for glmer(); lmer() with a family argument is defunct.
lmer1 <- glmer(Response ~ Time + Binary + (1|StudyArea) + (1|IndID),
               data = Data, family = binomial)
summary(lmer1)
Let's say that within a GLM structure the interaction between Time and StudyArea (i.e. Time*StudyArea) is a significant term. Thus, in addition to the adjustments to the intercept, I also need an adjustment to the slope to account for differences in Time as a function of StudyArea.
While I have seen a number of examples in the Bates book (http://lme4.r-forge.r-project.org/book/Ch4.pdf) and other posts for adding a random coefficient, I have not seen a random coefficient for an interaction term. From what I have gleaned from other posts, the model structure should look something like the model below, but I look forward to the feedback and suggestions of others. This code will fit a model, although I am not sure it is correct theoretically:
lmer2 <- glmer(Response ~ Time + Binary + (0 + Time|StudyArea) + (1|StudyArea) + (1|IndID),
               data = Data, family = binomial)
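A hedged note (my addition, not from the original post): the specification above keeps the StudyArea intercepts and Time slopes uncorrelated; the more common default is to let them correlate, though with only three study areas either variance estimate will be very imprecise:
lmer3 <- glmer(Response ~ Time + Binary + (1 + Time | StudyArea) + (1 | IndID),
               data = Data, family = binomial)
summary(lmer3)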
Note: These are made up data and the results/p-values are obviously meaningless.
Thanks in advance.

Repeated-measures / within-subjects ANOVA in R

I'm attempting to run a repeated-measures ANOVA using R. I've gone through various examples on various websites, but they never seem to talk about the error I'm encountering. I assume I'm misunderstanding something important.
The ANOVA I'm trying to run is on some data from an experiment using human participants. It has one DV and three IVs. All of the levels of all of the IVs are run on all participants, making it a three-way repeated-measures / within-subjects ANOVA.
The code I'm running in R is as follows:
aov.output <- aov(DV ~ IV1 * IV2 * IV3 + Error(PARTICIPANT_ID / (IV1 * IV2 * IV3)),
                  data = fulldata)
When I run this, I get the following warning:
Error() model is singular
Any ideas what I might be doing wrong?
Try using the lmer function in the lme4 package; the aov function is probably not appropriate here. Look for references from Douglas Bates, e.g. http://lme4.r-forge.r-project.org/book/Ch4.pdf (the other chapters are great too, but that is the repeated-measures chapter; this is the intro: http://lme4.r-forge.r-project.org/book/Ch1.pdf). The R code is available at the same place. For longitudinal data, it seems to be generally considered wrong these days to fit plain OLS instead of a components-of-variance model as in lme4 or nlme (which, to me, seems to have been wildly overtaken by lme4 in popularity recently). You may note that Brian Ripley's post, referenced in the comments above, also recommends switching to lme.
By the way, a huge advantage off the jump is you will be able to get estimates for the level of each effect as adjustments to the grand mean with the typical syntax:
lmer(DV ~ 1 + IV1*IV2*IV3 + (IV1*IV2*IV3 | Subject), data = dataset)
Note your random effects will be vector valued.
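If that maximal specification fails to converge (quite possible, since it asks for a large vector-valued effect per subject), a random-intercept-only fallback is a common simplification. A sketch (DV, the IVs, and Subject are the names assumed above):
library(lme4)
library(lmerTest)  # adds dfs and p-values to the anova() table
fit <- lmer(DV ~ IV1*IV2*IV3 + (1 | Subject), data = dataset)
anova(fit)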
I know the answer has been chosen for this post. I still wish to point out how to specify a correct error term/random effect when fitting an aov or lmer model to multi-way repeated-measures data. I assume that both independent variables (IVs) are fixed, and are crossed with each other and with subjects, meaning all subjects are exposed to all combinations of the IVs. I am going to use data taken from Kirk's Experimental Design: Procedures for the Behavioral Sciences (2013).
library(lme4)
library(foreign)
library(lmerTest)
library(dplyr)
file_name <- "http://www.ats.ucla.edu/stat/stata/examples/kirk/rbf33.dta" #1
d <- read.dta(file_name) %>% #2
mutate(a_f = factor(a), b_f = factor(b), s_f = factor(s)) #3
head(d)
## a b s y a_f b_f s_f
## 1 1 1 1 37 1 1 1
## 2 1 2 1 43 1 2 1
## 3 1 3 1 48 1 3 1
## 4 2 1 1 39 2 1 1
## 5 2 2 1 35 2 2 1
In this study 5 subjects (s) are exposed to 2 treatments - type of beat (a) and training duration (b) - with 3 levels each. The outcome variable is attitude toward minorities. In #3 I made a, b, and s into factor variables a_f, b_f, and s_f. Let p and q be the numbers of levels of a_f and b_f (3 each), and n be the sample size (5).
In this example the degrees of freedom (dfs) for the tests of a_f, b_f, and their interaction should be p-1=2, q-1=2, and (p-1)*(q-1)=4, respectively. The df for the s_f error term is (n-1) = 4, and the df for the Within (s_f:a_f:b_f) error term is (n-1)(pq-1)=32. So the correct model(s) should give you these dfs.
Using aov
Now let’s try different model specifications using aov:
aov(y ~ a_f*b_f + Error(s_f), data=d) %>% summary() # m1
aov(y ~ a_f*b_f + Error(s_f/a_f:b_f), data=d) %>% summary() # m2
aov(y ~ a_f*b_f + Error(s_f/a_f*b_f), data=d) %>% summary() # m3
Simply specifying the error as Error(s_f) in m1 gives you the correct dfs and F-ratios, matching the values in the book. m2 gives the same values as m1, but also the infamous "Warning: Error() model is singular". m3 is doing something strange: it further partitions the Within residuals of m1 (634.9) into residuals for three error terms: s_f:a_f (174.2), s_f:b_f (173.6), and s_f:a_f:b_f (287.1). This is wrong, since we would not get three error terms if we ran a 2-way between-subjects ANOVA! Multiple error terms are also contrary to the point of using block factorial designs, which allow us to use the same error term for the tests of A, B, and AB, unlike split-plot designs, which require 2 error terms.
Using lmer
How can we get the same dfs and F-values using lmer? If your data are balanced, the Kenward-Roger approximation available in lmerTest will give you exact dfs and F-ratios.
lmer(y ~ a_f*b_f + (1|s_f), data=d) %>% anova() # mem1
lmer(y ~ a_f*b_f + (1|s_f/a_f:b_f), data=d) %>% anova() # mem2
lmer(y ~ a_f*b_f + (1|s_f/a_f*b_f), data=d) %>% anova() # mem3
lmer(y ~ a_f*b_f + (1|s_f:a_f:b_f), data=d) %>% anova() # mem4
lmer(y ~ a_f*b_f + (a_f*b_f|s_f), data=d) %>% anova() # mem5
Again, simply specifying the random effect as (1|s_f) gives you the correct dfs and F-ratios (mem1). mem2-mem5 did not even give results, presumably because the numbers of random effects they needed to estimate were greater than the sample size.
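One small follow-up sketch (my addition): lmerTest's anova() defaults to Satterthwaite degrees of freedom; the Kenward-Roger approximation mentioned above can be requested explicitly (it requires the pbkrtest package):
lmer(y ~ a_f*b_f + (1|s_f), data = d) %>% anova(ddf = "Kenward-Roger")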
