I am running a model with the lavaan R package that predicts a continuous outcome from a continuous predictor and two categorical predictors. One of them is a dichotomous variable (let's call it A; 0 = no, 1 = yes) and the other is a three-level categorical variable (let's call it B; 0 = low, 1 = medium, 2 = high). Below is an example of the data:
outcome gender age continuous A B
1 1.333333 2 23.22404 1.333333 1 0
2 1.500000 2 23.18033 1.833333 1 1
3 1.500000 2 22.37978 2.166667 1 NA
4 2.250000 1 18.74044 1.916667 1 0
5 1.250000 1 22.37978 1.916667 1 1
6 1.500000 2 20.16940 1.500000 1 NA
In addition to a continuous, a dichotomous, and a three-level categorical variable, my model also includes some control variables:
model.1a <- 'outcome ~ gender + age + continuous + A + B
A ~~ continuous
A ~~ B
continuous ~~ B'
fit.1a <- sem(model=model.1a, data=dat)
summary(fit.1a, fit.measures=TRUE, standardized=TRUE, ci=TRUE, rsquare=T)
In a second step, I also want to include an interaction between variables A and B. For this, I first centered these two variables and then included their interaction in the model.
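A sketch of the centering and product-term step (the column names A_centr, B_centr, and interaction are the ones used in the model below; the exact preprocessing code is an assumption):
dat$A_centr <- dat$A - mean(dat$A, na.rm = TRUE)        # center the dichotomous code
dat$B_centr <- dat$B - mean(dat$B, na.rm = TRUE)        # center the three-level code
dat$interaction <- dat$A_centr * dat$B_centr            # product term for the interaction
The model then becomes: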
model.1b <- 'outcome ~ gender + age + continuous + A_centr + B_centr + interaction
A_centr ~~ continuous
A_centr ~~ B_centr
continuous ~~ B_centr
interaction ~~ 0*gender + 0*age
gender ~~ age'
fit.1b <- sem(model=model.1b, data=dat)
summary(fit.1b, fit.measures=TRUE, standardized=TRUE, ci=TRUE, rsquare=T)
However, when I run this model, I get the following error:
Error in lav_samplestats_icov(COV = cov[[g]], ridge = ridge, x.idx = x.idx[[g]], :
lavaan ERROR: sample covariance matrix is not positive-definite
From what I can tell, this happens because the interaction term is highly correlated with the original variables, but I am unsure how to solve this. Does anyone have a suggestion?
For your information, I have already tried using the non-centered versions of one or both categorical variables, both for creating the interaction term and in the regression model.
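For reference, the sample covariance matrix that lavaan complains about can be inspected directly, e.g. (a sketch, assuming the variables sit in dat under the names used above):
vars <- c("outcome", "gender", "age", "continuous", "A_centr", "B_centr", "interaction")
S <- cov(dat[, vars], use = "pairwise.complete.obs")
eigen(S)$values  # eigenvalues <= 0 mean the matrix is not positive-definite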
I need to estimate and plot a logistic multilevel model. I've got the binary dependent variable employment status (empl; 0 = unemployed, 1 = employed), level of internet connectivity (isoc) as a continuous independent variable, and I need to include random effects (intercept and slope) grouped by education level (educ; 1 = low-skilled worker, 2 = middle-skilled, 3 = high-skilled). I also have some control variables that I'm not going to mention here. I'm using the glmer function of the lme4 package. Here is a sample data frame and my (simplified) code:
library(lme4)
library(lmerTest)
library(tidyverse)
library(dplyr)
library(sjPlot)
library(moonBook)
library(sjmisc)
library(sjlabelled)
set.seed(1212)
d <- data.frame(empl=c(1,1,1,0,1,0,1,1,0,1,1,1,0,1,0,1,1,1,1,0),
isoc=runif(20, min=-0.2, max=0.2),
educ=sample(1:3, 20, replace=TRUE))
Results:
empl isoc educ
1 1 0.078604108 1
2 1 0.093667591 3
3 1 -0.061523272 2
4 0 0.009468908 3
5 1 -0.169220134 2
6 0 -0.038594789 3
7 1 0.170506490 1
8 1 -0.098487991 1
9 0 0.073339737 1
10 1 0.144211813 3
11 1 -0.133510687 1
12 1 -0.045306606 3
13 0 0.124211903 3
14 1 -0.003908486 3
15 0 -0.080673475 3
16 1 0.061406993 3
17 1 0.015401951 2
18 1 0.073351501 2
19 1 0.075648137 2
20 0 0.041450192 1
Fit:
m <- glmer(empl ~ isoc + (1 + isoc | educ),
data=d,
family=binomial("logit"),
nAGQ = 0)
summary(m)
Now the question: I'm looking for a plot with three graphs, one for each educ level, showing the predicted probabilities (values between 0 and 1), similar to a sample image I found on the web.
Below is my (simplified) code for the plot, but it only produces a plot I cannot interpret.
plot_model(m, type="pred",
terms=c("isoc [all]", "educ"),
show.data=TRUE)
There is one thing I can do to get a "kind of" right plot, but I have to alter the model above in a way that I think is wrong (keyword: multicollinearity). Additionally, I don't think the three graphs of this plot are correct either. The modified model looks like this:
m <- glmer(empl ~ isoc + educ + (1 + isoc | educ),
data=d,
family=binomial("logit"),
nAGQ = 0)
summary(m)
I appreciate any help! I think my problem resembles this question, but unfortunately it has not been answered yet, and I'm not able to comment because of my low reputation.
I think you want
plot_model(m, type="pred",
pred.type = "re",
terms = c("isoc[n=100]","educ"), show.data = TRUE)
pred.type = "re" takes the random effects into account when making predictions
isoc[n=100] uses 100 distinct values across the range of isoc - this is better than making predictions only at the observed values of isoc, which is what [all] specifies
For the example you've given the prediction lines are all on top of each other (because the fit is singular/the random-effects variance is effectively zero), but that's presumably because your sample data set is so small.
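One quick way to confirm the singular fit (a sketch using lme4's own helpers):
isSingular(m)  # TRUE when the random-effects fit is singular
VarCorr(m)     # shows the (near-zero) variance components for educ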
For what it's worth, although this is a perfectly well-posed programming problem, I would not recommend treating educ as a random effect, for two reasons (a sketch of a fixed-effect alternative follows these points):
the number of levels is impractically small
the levels are not exchangeable (i.e. it wouldn't make sense to relabel "high-skilled" as "low-skilled").
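If educ is instead treated as a categorical fixed effect, a minimal sketch (not part of the original answer; m_fixed is just an illustrative name) could look like this:
d$educ <- factor(d$educ)            # treat education level as categorical
m_fixed <- glm(empl ~ isoc * educ,  # fixed effect of educ and its interaction with isoc
               data = d,
               family = binomial("logit"))
plot_model(m_fixed, type = "pred",
           terms = c("isoc [n=100]", "educ"), show.data = TRUE)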
Feel free to ask more questions about your model setup/definition on CrossValidated.
I have been comparing Poisson, negative binomial (NB), and zero-inflated Poisson and NB models in R. My dependent variable is a symptom count for generalized anxiety disorder (GAD), and my predictors are two personality traits (disinhibition [ZDis_winz] and meanness [ZMean_winz]), their interaction, and covariates of age and assessment site (dummy-coded; there are 8 sites so I have 7 of these dummy variables). I have a sample of 1206 with full data (and these are the only individuals included in the data frame).
I am using NB models for this disorder because the variance (~40) far exceeds the mean (~4). I wanted to consider the possibility of a ZINB model as well, given that ~30% of the sample has 0 symptoms.
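A quick way to check those descriptives (a sketch; the variable and data-frame names are the ones used in the models below):
mean(eurodata$dawbac_bl_GAD_sxs_uw, na.rm = TRUE)       # ~4, mean symptom count
var(eurodata$dawbac_bl_GAD_sxs_uw, na.rm = TRUE)        # ~40, variance
mean(eurodata$dawbac_bl_GAD_sxs_uw == 0, na.rm = TRUE)  # ~0.30, proportion of zero counts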
For other symptom counts (e.g., conduct disorder), I have run ZINB models perfectly fine in R, but I am getting an error when I do the exact same thing with the GAD model. The standard NB model works fine for GAD; it is only the GAD ZINB model that's erroring out.
Here is the error I'm receiving:
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 4.80021e-36
Here is the code I'm using for the (working) NB model:
library(MASS)  # glm.nb() comes from the MASS package
summary(
NB_GAD_uw_int <- glm.nb(
dawbac_bl_GAD_sxs_uw ~ ZMean_winz + ZDis_winz + ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
data=eurodata))
Here is the code I'm using for the (not working) ZINB model (which is identical to other ZINB models I've run for other disorders):
library(pscl)  # zeroinfl() comes from the pscl package
summary(
ZINB_GAD_uw_int <- zeroinfl(
dawbac_bl_GAD_sxs_uw ~ ZMean_winz + ZDis_winz + ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
data = eurodata,
dist = "negbin",
model = TRUE,
y = TRUE, x = TRUE))
I have seen a few other posts on StackOverflow and other forums about this type of issue. As far as I can tell, people generally say that this is an issue of either 1) collinear predictors or 2) too complex a model for too little data. (Please let me know if I am misinterpreting this! I'm fairly new to Poisson-based models.) However, I am still confused about these answers because:
1) In this case, none of my predictors are correlated more highly than .15, except for the main predictors of interest (ZMean_winz and ZDis_winz), which are correlated about .45. The same predictors are used in other ZINB models that have worked.
2) With 1206 participants, and having run the same ZINB model with similarly distributed count data for other disorders, I'm a little confused how this could be too complex a model for my data.
If anyone has any explanation for why this version of my model will not run and/or any suggestions for troubleshooting, I would really appreciate it! I am also happy to provide more info if needed.
Thank you so much!
The problem may be that zeroinfl is not converting categorical variables into dummy variables.
You can dummify your variables using model.matrix, which is what glm, glm.nb, etc. call internally to dummify categorical variables. This is usually preferred over manually dummifying categorical variables, and should be done to avoid mistakes and to ensure full rank of your model matrix (a full rank matrix is non-singular).
You can of course dummify categorical variables yourself; in that case I would use model.matrix to transform your input data involving categorical variables (and potentially interactions between categorical variables and other variables) into the correct model matrix.
Here is an example:
set.seed(2017)
df <- data.frame(
DV = rnorm(100),
IV1_num = rnorm(100),
IV2_cat = sample(c("catA", "catB", "catC"), 100, replace = T))
head(df)
# DV IV1_num IV2_cat
#1 1.43420148 0.01745491 catC
#2 -0.07729196 1.37688667 catC
#3 0.73913723 -0.06869535 catC
#4 -1.75860473 0.84190898 catC
#5 -0.06982523 -0.96624056 catB
#6 0.45190553 -1.96971566 catC
mat <- model.matrix(DV ~ IV1_num + IV2_cat, data = df)
head(mat)
# (Intercept) IV1_num IV2_catcatB IV2_catcatC
#1 1 0.01745491 0 1
#2 1 1.37688667 0 1
#3 1 -0.06869535 0 1
#4 1 0.84190898 0 1
#5 1 -0.96624056 1 0
#6 1 -1.96971566 0 1
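To verify that this model matrix is indeed full rank (and hence non-singular), a quick check could be:
qr(mat)$rank == ncol(mat)  # TRUE when the dummified matrix has full column rank
kappa(mat)                 # condition number; very large values signal near-singularity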
The manually dummified input data would then be
df.dummified = cbind.data.frame(DV = df$DV, mat[, -1])
# DV IV1_num IV2_catB IV2_catC
#1 1.43420148 0.01745491 0 1
#2 -0.07729196 1.37688667 0 1
#3 0.73913723 -0.06869535 0 1
#4 -1.75860473 0.84190898 0 1
#5 -0.06982523 -0.96624056 1 0
#6 0.45190553 -1.96971566 0 1
which you'd use in e.g.
library(MASS)  # provides glm.nb()
glm.nb(DV ~ ., data = df.dummified)
In R, what is the best way to incorporate an interaction term between a covariate and time when the proportionality test (with coxph) shows that the proportional hazards assumption of the Cox model is violated? I know that you can either use strata or an interaction-with-time term; I'm interested in the latter. I haven't been able to find a clear, definitive explanation with examples of how to do this online. In the most common example, using the Rossi dataset, Fox suggested doing:
coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + age:stop + prio, data = Rossi.2)
Is there a difference between modeling with age:stop versus age:start? Does the formula have to use this format? If I use Surv with the two-parameter format, would the following also make sense?
coxph(formula = Surv(week, arrest) ~ fin + age + age:week + prio, data = Rossi)
Or do you have to split the dataset and use the Surv(start, stop, event) method?
Also, there is the time-transform method:
coxph(formula = Surv(week, arrest) ~ fin + age + tt(age) + prio, data = Rossi, tt=function(x,t,...) x*t)
I know that some people would prefer to model with log(t) instead of t here. But which of these is the correct way to model an interaction with time? Do these all refer to the same underlying statistical model, or to different ones? And in the end, are they all modeling (for the interaction term) h(t) = h0(t) * exp(b*X*t)?
This is essentially a 3-part question:
How to estimate time-varying effects?
What is the difference between different specifications of time-varying effects using survival::coxph function
How to decide what shape the time-variation has, i.e., linear, logarithmic, ...
I will try to answer these questions in the following using the veteran data example, which is featured in section 4.2 of the vignette on time-dependent covariates and time-dependent coefficients (also known as time-varying effects) in the survival package:
library(dplyr)
library(survival)
data("veteran", package = "survival")
veteran <- veteran %>%
mutate(
trt = 1L * (trt == 2),
prior = 1L * (prior == 10))
head(veteran)
#> trt celltype time status karno diagtime age prior
#> 1 0 squamous 72 1 60 7 69 0
#> 2 0 squamous 411 1 70 5 64 1
#> 3 0 squamous 228 1 60 3 38 0
#> 4 0 squamous 126 1 60 9 63 1
#> 5 0 squamous 118 1 70 11 65 1
#> 6 0 squamous 10 1 20 5 49 0
1. How to estimate time-varying effects
There are different popular methods and implementations, e.g. survival::coxph, timereg::aalen or using GAMs after appropriate data transformation (see below).
Although the specific methods and their implementations differ, the general idea is to create a long-format data set where
the follow-up is partitioned into intervals
for each subject, the status is 0 in all intervals except the last (if an event)
the time variable is updated in each interval
Then, time (or a transformation of time, e.g. log(t)) is simply a covariate, and time-varying effects can be estimated as an interaction between the covariate of interest and the (transformed) time covariate.
If the functional form of the time-variation is known, you can use the tt() approach:
cph_tt <- coxph(
formula = Surv(time, status) ~ trt + prior + karno + tt(karno),
data = veteran,
tt = function(x, t, ...) x * log(t + 20))
2. What is the difference between different specifications of time-varying effects using survival::coxph function
There is no difference. I assume the tt() function is simply a short-cut for the estimation via transformation to the long-format. You can verify that the two approaches are equivalent using the code below:
# transform to long format
veteran_long <- survSplit(Surv(time, status)~., data = veteran, id = "id",
cut = unique(veteran$time)) %>%
mutate(log_time = log(time + 20))
head(veteran_long) %>% select(id, trt, age, tstart, time, log_time, status)
#> id trt age tstart time log_time status
#> 1 1 0 69 0 1 3.044522 0
#> 2 1 0 69 1 2 3.091042 0
#> 3 1 0 69 2 3 3.135494 0
#> 4 1 0 69 3 4 3.178054 0
#> 5 1 0 69 4 7 3.295837 0
#> 6 1 0 69 7 8 3.332205 0
cph_long <- coxph(formula = Surv(tstart, time, status)~
trt + prior + karno + karno:log_time, data = veteran_long)
## models are equivalent, just different specification
cbind(coef(cph_long), coef(cph_tt))
#> [,1] [,2]
#> trt 0.01647766 0.01647766
#> prior -0.09317362 -0.09317362
#> karno -0.12466229 -0.12466229
#> karno:log_time 0.02130957 0.02130957
3. How to decide what shape the time-variation has?
As mentioned before, time-varying effects are simply interactions of a covariate x and time t, thus time-varying effects can have different specifications, equivalent to interactions in standard regression models, e.g.
x*t: linear covariate effect, linearly time-varying effect
f(x)*t: non-linear covariate effect, linearly time-varying effect
f(t)*x: linear covariate effect, non-linearly time-varying (for categorical x) this essentially represents a stratified baseline hazard
f(x, t): non-linear, non-linearly time-varying effect
In each case, the functional form of the effect f can either be estimated from the data or prespecified (e.g. f(t)*x = karno * log(t + 20) above).
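For instance, the simplest x*t case (a linearly time-varying effect of karno) could be written with tt() analogously to the log-time model above (a sketch, not from the original answer; cph_tt_linear is just an illustrative name):
cph_tt_linear <- coxph(
  formula = Surv(time, status) ~ trt + prior + karno + tt(karno),
  data = veteran,
  tt = function(x, t, ...) x * t)  # effect of karno varies linearly with time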
In most cases you would prefer to estimate f from the data. Support for the (penalized) estimation of such effects is, to my knowledge, limited in the survival package. However, you can use mgcv::gam to estimate any of the effects specified above (after appropriate data transformation). An example is given below; it shows that the effect of karno goes towards 0 as time progresses, regardless of the Karnofsky score at the beginning of the follow-up (see here for details and also Section 4.2 here):
library(pammtools)
# data transformation
ped <- as_ped(veteran, Surv(time, status)~., max_time = 400)
# model
pam <- mgcv::gam(ped_status ~ s(tend) + trt + prior + te(tend, karno, k = 10),
data = ped, family = poisson(), offset = offset, method = "REML")
p_2d <- gg_tensor(pam)
p_slice <- gg_slice(ped, pam, "karno", tend = unique(tend), karno = c(20, 50, 80), reference = list(karno = 60))
gridExtra::grid.arrange(p_2d, p_slice, nrow = 1)
I am trying to fit a mixed model with random effects using INLA, having previously attempted the fit with glmer under a frequentist approach; that attempt failed to converge due to the large number of random effects in my data.
The data come from a case-control type study (1 = case, 0 = control), and a list of risk factors (x1, x2, x3, ...) was calculated for each sample. All variables were categorised into groups, and the data look as follows:
res age breed x1 x2 location
0 1 (0-1 yrs) beef 1 1 A1
0 2 (1-2 yrs) dairy 1 2 A1
1 1 beef 1 2 B2
0 1 beef 2 1 C1
1 3 (>3 yrs) dairy 3 3 B1
1 2 beef 1 1 A1
0 3 beef 2 1 B4
... ... ... .. ..
There are around 20,000 data points with 9000 distinct locations. The INLA procedure I used is:
formula <- res ~ age + breed + x1 + x2 + x3 + f(location, model = "iid")
model <- inla(formula, data = data, family = "binomial", Ntrials = 1, control.compute = list(dic = TRUE, cpo = TRUE))
Results from a standard logistic regression (excluding the random effect) give similar parameter estimates between glm and INLA; however, when the random effect is included in the model structure as above, the parameter estimates (on the logit scale) increase by more than a factor of two. Because the odds ratio is obtained by exponentiating the estimate (i.e. exp(parameter estimate)), this increase is amplified exponentially, and an odds ratio of around 40 does not seem to make sense...
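For example, exponentiating one of the logit-scale posterior means from the output below (a quick arithmetic check):
exp(3.8959)  # ~ 49, the implied odds ratio for age_2 versus the reference age group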
My question: is the INLA model specification ("iid") appropriate for this type of analysis? And if so, how do I interpret the results in terms of odds ratios between different risk groups? (Using "rw2" seems to give reasonable estimates, but I cannot interpret the random-effect estimate under that approach.)
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) -10.8871 0.4949 -11.8440 -10.8927 -9.9506 -10.9516 0
age_2 3.8959 0.3272 3.2739 3.8889 4.5581 3.8746 0
age_3 4.1865 0.3421 3.5346 4.1797 4.8772 4.1659 0
breedDairy 1.2053 0.1365 0.9393 1.2046 1.4746 1.2032 0
x1_2 4.8721 0.4258 4.0498 4.8682 5.7156 4.8600 0
x1_3 4.1444 0.3322 3.5055 4.1408 4.8039 4.1337 0
x2_2 -1.0174 0.2727 -1.5485 -1.0189 -0.4782 -1.0220 0
x2_3 1.9669 0.4119 1.1570 1.9672 2.7744 1.9677 0
Random effects:
Name Model
location IID model
Model hyperparameters:
mean sd 0.025quant 0.5quant 0.975quant mode
Precision for location 0.0621 0.0051 0.0536 0.0615 0.0734 0.0602
Expected number of effective parameters(std dev): 4174.80(49.56)
Number of equivalent replicates : 4.458
Deviance Information Criterion: 7865.86
Effective number of parameters: 2875.54
Marginal Likelihood: -5074.21
CPO and PIT are computed
Many thanks for your help!
I am running a logistic regression on three factors that are all binary.
My data
table1<-expand.grid(Crime=factor(c("Shoplifting","Other Theft Acts")),Gender=factor(c("Men","Women")),
Priorconv=factor(c("N","P")))
table1<-data.frame(table1,Yes=c(24,52,48,22,17,60,15,4),No=c(1,9,3,2,6,34,6,3))
and the model
fit4<-glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
summary(fit4)
R seems to take 1 for prior conviction P and 1 for crime Shoplifting. As a result, the interaction dummy is 1 only if both of the above are 1. I would now like to try different combinations for the interaction term; for example, I would like to see what it would be if prior conviction is P and the crime is not shoplifting.
Is there a way to make R take different cases for the 1s and the 0s? It would facilitate my analysis greatly.
Thank you.
You're already getting all four combinations of the two categorical variables in your regression. You can see this as follows:
Here's the output of your regression:
Call:
glm(formula = cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime,
family = binomial, data = table1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9062 0.3231 5.899 3.66e-09 ***
PriorconvP -1.3582 0.3835 -3.542 0.000398 ***
CrimeShoplifting 0.9842 0.6069 1.622 0.104863
PriorconvP:CrimeShoplifting -0.5513 0.7249 -0.761 0.446942
So, for Priorconv, the reference category (the one with dummy value = 0) is N. And for Crime the reference category is Other. So here's how to interpret the regression results for each of the four possibilities (where log(p/(1-p)) is the log of the odds of a Yes result):
1. PriorConv = N and Crime = Other. This is just the case where both dummies are
zero, so your regression is just the intercept:
log(p/(1-p)) = 1.90
2. PriorConv = P and Crime = Other. So the Priorconv dummy equals 1 and the
Crime dummy is still zero:
log(p/(1-p)) = 1.90 - 1.36
3. PriorConv = N and Crime = Shoplifting. So the Priorconv dummy is 0 and the
Crime dummy is now 1:
log(p/(1-p)) = 1.90 + 0.98
4. PriorConv = P and Crime = Shoplifting. Now both dummies are 1:
log(p/(1-p)) = 1.90 - 1.36 + 0.98 - 0.55
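As a quick numeric check (not part of the original answer), the case 4 log-odds convert to the predicted probability reported further below:
plogis(1.9062 - 1.3582 + 0.9842 - 0.5513)  # ~ 0.727, P(Yes) for PriorConv = P, Crime = Shoplifting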
You can reorder the factor values of the two predictor variables, but that will just change which combinations of variables fall into each of the four cases above.
Update: Regarding how the regression coefficients relate to the ordering of the factors: changing the reference level will change the coefficients, because they will then represent contrasts between different combinations of categories, but it won't change the predicted probabilities of a Yes or No outcome. (Regression modeling wouldn't be all that credible if you could change the predictions just by changing the reference category.) Note, for example, that the predicted probabilities are the same even if we switch the reference category for Priorconv:
m1 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
predict(m1, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
table2 = table1
table2$Priorconv = relevel(table2$Priorconv, ref = "P")
m2 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table2,family=binomial)
predict(m2, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
I agree with the interpretation provided by @eipi10. You can also use relevel to change the reference level before fitting the model:
levels(table1$Priorconv)
## [1] "N" "P"
table1$Priorconv <- relevel(table1$Priorconv, ref = "P")
levels(table1$Priorconv)
## [1] "P" "N"
m <- glm(cbind(Yes, No) ~ Priorconv*Crime, data = table1, family = binomial)
summary(m)
Note that I changed the formula argument of glm() to include Priorconv*Crime, which is more compact (it expands to Priorconv + Crime + Priorconv:Crime).