Plotting multiple probabilities of a logistic multilevel model - r

I need to estimate and plot a logistic multilevel model. I have the binary dependent variable employment status (empl) (0 = unemployed; 1 = employed) and level of internet connectivity (isoc) as a (continuous) independent variable, and I need to include random effects (random intercept and slope) across education level (educ) (1 = low-skilled; 2 = middle-skilled; 3 = high-skilled). I also have some control variables I'm not going to mention here. I'm using the glmer function of the lme4 package. Here is a sample data frame and my (simplified) code:
library(lme4)
library(lmerTest)
library(tidyverse)
library(dplyr)
library(sjPlot)
library(moonBook)
library(sjmisc)
library(sjlabelled)
set.seed(1212)
d <- data.frame(empl=c(1,1,1,0,1,0,1,1,0,1,1,1,0,1,0,1,1,1,1,0),
isoc=runif(20, min=-0.2, max=0.2),
educ=sample(1:3, 20, replace=TRUE))
Results:
empl isoc educ
1 1 0.078604108 1
2 1 0.093667591 3
3 1 -0.061523272 2
4 0 0.009468908 3
5 1 -0.169220134 2
6 0 -0.038594789 3
7 1 0.170506490 1
8 1 -0.098487991 1
9 0 0.073339737 1
10 1 0.144211813 3
11 1 -0.133510687 1
12 1 -0.045306606 3
13 0 0.124211903 3
14 1 -0.003908486 3
15 0 -0.080673475 3
16 1 0.061406993 3
17 1 0.015401951 2
18 1 0.073351501 2
19 1 0.075648137 2
20 0 0.041450192 1
Fit:
m <- glmer(empl ~ isoc + (1 + isoc | educ),
data=d,
family=binomial("logit"),
nAGQ = 0)
summary(m)
Now the question: I'm looking for a plot with three graphs, one for each educ level, showing the predicted probabilities (values between 0 and 1). Here is a sample image from the web (not reproduced here):
Below is my (simplified) code for the plot. But it only produces crap I cannot interpret.
plot_model(m, type="pred",
terms=c("isoc [all]", "educ"),
show.data=TRUE)
There is one thing I can do to get a "kind of" right plot, but I have to alter the model above in a way I think is wrong (keyword: multicollinearity). Additionally, I don't think the three graphs of that plot are correct either. The modified model looks like this:
m <- glmer(empl ~ isoc + educ + (1 + isoc | educ),
data=d,
family=binomial("logit"),
nAGQ = 0)
summary(m)
I appreciate any help! I think my problem resembles this question, but there has been no answer to it yet, and I'm not able to comment because of my low reputation.

I think you want
plot_model(m, type="pred",
pred.type = "re",
terms = c("isoc[n=100]","educ"), show.data = TRUE)
pred.type = "re" takes the random effects into account when making predictions
isoc[n=100] uses 100 distinct values across the range of isoc - this is better than making predictions only at the observed values of isoc, which is what [all] specifies
For the example you've given the prediction lines are all on top of each other (because the fit is singular/the random-effects variance is effectively zero), but that's presumably because your sample data set is so small.
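If you'd rather build the plot yourself, here is a minimal sketch that computes the same predictions by hand with predict() and facets by educ (it assumes the model m and data d from above; ggplot2 is loaded with the tidyverse):
newdat <- expand.grid(
  isoc = seq(min(d$isoc), max(d$isoc), length.out = 100),
  educ = 1:3)
# type = "response" returns probabilities; the default re.form includes the random effects
newdat$prob <- predict(m, newdata = newdat, type = "response")
ggplot(newdat, aes(isoc, prob)) +
  geom_line() +
  facet_wrap(~ educ) +
  ylim(0, 1)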
For what it's worth, although this is a perfectly well-posed programming problem, I would not recommend treating educ as a random effect:
the number of levels is impractically small
the levels are not exchangeable (i.e. it wouldn't make sense to relabel "high-skilled" as "low-skilled").
Feel free to ask more questions about your model setup/definition on CrossValidated

Related

How to structure dataset to run Binomial GLM of ratio of counts over time?

I am trying to do an analysis using a binomial GLM to test for differences in relative count frequency over time (Days). The GLM model/formula would look something like this:
(A1 : A2) ~ β * Day
Here we are testing for the effect of Day on the A1:A2 frequency. Basically this is a binomial generalized linear model where A1 and A2 refer to the read counts of the alternative alleles at each gene and Day is a multilevel factor. The other thing is that I would be testing this on many different genes (hundreds), so we would be doing many tests.
The basic model formula in R is straightforward (e.g. using a long-format dataset):
glm(AF1:AF2 ~ Day, data = dfLong, family = "binomial")
But I'm not really sure how to structure the data or loop over the Gene variable to accomplish this task.
Here is an example dataframe:
> df<-read.csv("test.csv")
> df
Gene A.count_1 A.count_2 Day
1 1 60 40 1
2 2 100 30 1
3 3 100 3 1
4 1 55 100 3
5 2 423 410 3
6 3 191 89 3
7 1 20 10 5
8 2 200 10 5
9 3 100 20 5
The output I'd like is a test of the effect of Day as a factor (not a numeric variable) on the allele count ratio for each gene, producing a p-value per gene (genes 1, 2, and 3 here, or hundreds of genes in the general case).
Any help to set me in the right direction would be much appreciated.
Thanks!!
I think that
library('lme4')
m <- lmList(cbind(A.count_1, A.count_2) ~ Day | Gene, data = df,
            family = "binomial")
summary(m)
should probably do it? (From ?binomial, a two-column matrix response is treated as {number of successes, number of failures})
This works, for some built-in data that comes with the lme4 package:
lmList(cbind(incidence, size-incidence) ~ period | herd,
data = cbpp, family = binomial)
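If you want a per-gene p-value for Day treated as a factor (as asked in the question), here is a rough sketch of an explicit loop with glm() and a likelihood-ratio test; it assumes the example data frame df above and is just one way to structure it:
per_gene <- lapply(split(df, df$Gene), function(dg) {
  fit <- glm(cbind(A.count_1, A.count_2) ~ factor(Day), data = dg, family = binomial)
  # likelihood-ratio test for the overall effect of Day (as a factor)
  anova(fit, test = "Chisq")[["Pr(>Chi)"]][2]
})
data.frame(Gene = names(per_gene), p_value = unlist(per_gene))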

Logistic regression with proportions in R (dependent variable is not binary). What is R doing?

So I stumbled across the following code:
#Importing the data:
seeds.df <-
read.table('http://www.uib.no/People/nzlkj/statkey/data/seeds.csv',header=T)
attach(seeds.df)
#Making a plot of seeds eaten depending on seed type:
plot(Seed.type, Eaten)
#Testing the hypothesis:
fit1.glm <- glm(cbind(Eaten,Not.eaten)~Seed.type, binomial)
summary(fit1.glm)
From https://folk.uib.no/nzlkj/statkey/logistic.html#proportions
which provides a method for doing logistic regression on proportion data.
My question is: what is R actually doing mathematically? The response variable is two columns, and as far as I knew, logistic regression is supposed to be performed on a binary dependent variable.
Is R creating a new response variable of length Eaten + Not.eaten, populated by rep(1, Eaten) and rep(0, Not.eaten), and performing logistic regression on that?
e.g. for row 1 in seeds.df, Eaten = 2 and Not.eaten = 48:
row# eaten.or.not seed.type Hamster
1 1 B 1
2 1 B 1
3 0 B 1
4 0 B 1
...
50 0 B 1
then R would do glm(eaten.or.not ~ seed.type, family = 'binomial')
I tested the above and it didn't produce the same answer.
Or is R doing the following:
ln(p(eaten) / (1 - p(eaten))) = intercept + B1 * seed.type
I also tested this and I got something different, but I'm not sure I did it correctly.
Anyway, if someone could shed light on what R is doing mathematically for logistic regression with a proportion response, that would be great.
thank you for your time
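For what it's worth, the two-column cbind() response fits the binomial likelihood on the aggregated counts, and it should give the same coefficient estimates as expanding each row into Eaten ones and Not.eaten zeros. A small self-contained check with made-up counts (hypothetical data, not the seeds.csv file):
agg <- data.frame(eaten = c(2, 30), not.eaten = c(48, 20), seed.type = c("A", "B"))
fit_agg <- glm(cbind(eaten, not.eaten) ~ seed.type, family = binomial, data = agg)
# expand to one 0/1 row per seed
expanded <- data.frame(
  eaten.or.not = unlist(Map(function(k, n) c(rep(1, k), rep(0, n)),
                            agg$eaten, agg$not.eaten)),
  seed.type = rep(agg$seed.type, agg$eaten + agg$not.eaten))
fit_bin <- glm(eaten.or.not ~ seed.type, family = binomial, data = expanded)
cbind(aggregated = coef(fit_agg), expanded = coef(fit_bin))  # identical coefficients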

Zero-inflated negative binomial model in R: Computationally singular

I have been comparing Poisson, negative binomial (NB), and zero-inflated Poisson and NB models in R. My dependent variable is a symptom count for generalized anxiety disorder (GAD), and my predictors are two personality traits (disinhibition [ZDis_winz] and meanness [ZMean_winz]), their interaction, and covariates of age and assessment site (dummy-coded; there are 8 sites so I have 7 of these dummy variables). I have a sample of 1206 with full data (and these are the only individuals included in the data frame).
I am using NB models for this disorder because the variance (~40) far exceeds the mean (~4). I wanted to consider the possibility of a ZINB model as well, given that ~30% of the sample has 0 symptoms.
For other symptom counts (e.g., conduct disorder), I have run ZINB models perfectly fine in R, but I am getting an error when I do the exact same thing with the GAD model. The standard NB model works fine for GAD; it is only the GAD ZINB model that's erroring out.
Here is the error I'm receiving:
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 4.80021e-36
Here is the code I'm using for the (working) NB model:
summary(
NB_GAD_uw_int <- glm.nb(
dawbac_bl_GAD_sxs_uw ~ ZMean_winz + ZDis_winz + ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
data=eurodata))
Here is the code I'm using for the (not working) ZINB model (which is identical to other ZINB models I've run for other disorders):
summary(
ZINB_GAD_uw_int <- zeroinfl(
dawbac_bl_GAD_sxs_uw ~ ZMean_winz + ZDis_winz + ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
data = eurodata,
dist = "negbin",
model = TRUE,
y = TRUE, x = TRUE))
I have seen a few other posts on StackOverflow and other forums about this type of issue. As far as I can tell, people generally say that this is an issue of either 1) collinear predictors or 2) too complex a model for too little data. (Please let me know if I am misinterpreting this! I'm fairly new to Poisson-based models.) However, I am still confused about these answers because: 1) In this case, none of my predictors are correlated more highly than .15, except for the main predictors of interest (ZMean_winz and ZDis_winz), which are correlated about .45. The same predictors are used in other ZINB models that have worked. 2) With 1206 participants, and having run the same ZINB model with similarly distributed count data for other disorders, I'm a little confused how this could be too complex a model for my data.
If anyone has any explanation for why this version of my model will not run and/or any suggestions for troubleshooting, I would really appreciate it! I am also happy to provide more info if needed.
Thank you so much!
The problem may be that zeroinfl is not converting categorical variables into dummy variables.
You can dummify your variables using model.matrix, which is what glm, glm.nb, etc. call internally to dummify categorical variables. This is usually preferred over dummifying categorical variables by hand, both to avoid mistakes and to ensure full rank of your model matrix (a full-rank matrix is non-singular).
If you do want to dummify the categorical variables yourself, I would still use model.matrix to transform your input data involving categorical variables (and potentially interactions between categorical variables and other variables) into the correct model matrix.
Here is an example:
set.seed(2017)
df <- data.frame(
DV = rnorm(100),
IV1_num = rnorm(100),
IV2_cat = sample(c("catA", "catB", "catC"), 100, replace = T))
head(df)
# DV IV1_num IV2_cat
#1 1.43420148 0.01745491 catC
#2 -0.07729196 1.37688667 catC
#3 0.73913723 -0.06869535 catC
#4 -1.75860473 0.84190898 catC
#5 -0.06982523 -0.96624056 catB
#6 0.45190553 -1.96971566 catC
mat <- model.matrix(DV ~ IV1_num + IV2_cat, data = df)
head(mat)
# (Intercept) IV1_num IV2_catcatB IV2_catcatC
#1 1 0.01745491 0 1
#2 1 1.37688667 0 1
#3 1 -0.06869535 0 1
#4 1 0.84190898 0 1
#5 1 -0.96624056 1 0
#6 1 -1.96971566 0 1
The manually dummified input data would then be
df.dummified = cbind.data.frame(DV = df$DV, mat[, -1])
# DV IV1_num IV2_catB IV2_catC
#1 1.43420148 0.01745491 0 1
#2 -0.07729196 1.37688667 0 1
#3 0.73913723 -0.06869535 0 1
#4 -1.75860473 0.84190898 0 1
#5 -0.06982523 -0.96624056 1 0
#6 0.45190553 -1.96971566 0 1
which you'd use in e.g.
glm.nb(DV ~ ., data = df.dummified)
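As a quick diagnostic (a sketch that assumes your eurodata data frame and the variable names from the question), you can also check whether the design matrix is full rank and how close to singular it is:
X <- model.matrix(
  ~ ZMean_winz * ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy +
    Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
  data = eurodata)
qr(X)$rank  # should equal ncol(X); anything smaller means some columns are collinear
kappa(X)    # a very large condition number indicates near-singularity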

How to model interaction of covariate with time when proportionality assumption is violated in survival analysis in R

In R, what is the best way to incorporate an interaction term between a covariate and time when the proportionality test (with coxph) shows that the proportional hazards assumption of the Cox model is violated? I know that you can either use strata or an interaction-with-time term; I'm interested in the latter. I haven't been able to find a definitive, clear explanation with examples of how to do this on the internet. In the most common example, using the Rossi dataset, Fox suggested doing:
coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + age:stop + prio, data = Rossi.2)
Is there a difference between modeling with age:stop versus age:start? Does the formula have to use this format? If I use Surv with the two-parameter format, would the following also make sense?
coxph(formula = Surv(week, arrest) ~ fin + age + age:week + prio, data = Rossi)
Or do you have to split the dataset and use the Surv(start, stop, event) method?
Also, there is the time-transform method:
coxph(formula = Surv(week, arrest) ~ fin + age + tt(age) + prio, data = Rossi, tt=function(x,t,...) x*t)
I know that some people would prefer to model with log(t) instead of t here. But which one of these is the correct way to model an interaction with time? Do these all refer to the same underlying statistical model, or to different ones? And in the end, are they all modeling (for the interaction term) h(t) = h0(t) * exp(b*X*t)?
This is essentially a 3 part question:
How to estimate time-varying effects?
What is the difference between different specifications of time-varying effects using survival::coxph function
How to decide what shape the time-variation has, i.e., linear, logarithmic, ...
I will try to answer these questions in the following using the veteran data example, which is featured in section 4.2 of the vignette on time-dependent covariates and time-dependent coefficients (also known as time-varying effects) in the survival package:
library(dplyr)
library(survival)
data("veteran", package = "survival")
veteran <- veteran %>%
mutate(
trt = 1L * (trt == 2),
prior = 1L * (prior == 10))
head(veteran)
#> trt celltype time status karno diagtime age prior
#> 1 0 squamous 72 1 60 7 69 0
#> 2 0 squamous 411 1 70 5 64 1
#> 3 0 squamous 228 1 60 3 38 0
#> 4 0 squamous 126 1 60 9 63 1
#> 5 0 squamous 118 1 70 11 65 1
#> 6 0 squamous 10 1 20 5 49 0
1. How to estimate time-varying effects
There are different popular methods and implementations, e.g. survival::coxph, timereg::aalen or using GAMs after appropriate data transformation (see below).
Although the specific methods and their implementations differ, a general idea is to create a long-format data set where
the follow-up is partitioned into intervals
for each subject, the status is 0 in all intervals except the last (if an event)
the time variable is updated in each interval
Then, the time (or a transformation of time, e.g. log(t)) is simply a covariate and time-varying effects can be estimated by an interaction between the covariate of interest and the (transformed) covariate of time.
If the functional form of the time-variation is known, you can use the tt() approach:
cph_tt <- coxph(
formula = Surv(time, status) ~ trt + prior + karno + tt(karno),
data = veteran,
tt = function(x, t, ...) x * log(t + 20))
2. What is the difference between different specifications of time-varying effects using survival::coxph function
There is no difference. I assume the tt() function is simply a short-cut for the estimation via transformation to the long-format. You can verify that the two approaches are equivalent using the code below:
Transform to long format:
veteran_long <- survSplit(Surv(time, status)~., data = veteran, id = "id",
cut = unique(veteran$time)) %>%
mutate(log_time = log(time + 20))
head(veteran_long) %>% select(id, trt, age, tstart, time, log_time, status)
#> id trt age tstart time log_time status
#> 1 1 0 69 0 1 3.044522 0
#> 2 1 0 69 1 2 3.091042 0
#> 3 1 0 69 2 3 3.135494 0
#> 4 1 0 69 3 4 3.178054 0
#> 5 1 0 69 4 7 3.295837 0
#> 6 1 0 69 7 8 3.332205 0
cph_long <- coxph(formula = Surv(tstart, time, status)~
trt + prior + karno + karno:log_time, data = veteran_long)
## models are equivalent, just different specification
cbind(coef(cph_long), coef(cph_tt))
#> [,1] [,2]
#> trt 0.01647766 0.01647766
#> prior -0.09317362 -0.09317362
#> karno -0.12466229 -0.12466229
#> karno:log_time 0.02130957 0.02130957
3. How to decide what shape the time-variation has?
As mentioned before, time-varying effects are simply interactions of a covariate x and time t, thus time-varying effects can have different specifications, equivalent to interactions in standard regression models, e.g.
x*t: linear covariate effect, linearly time-varying effect
f(x)*t: non-linear covariate effect, linearly time-varying effect
f(t)*x: linear covariate effect, non-linearly time-varying effect (for categorical x this essentially represents a stratified baseline hazard)
f(x, t): non-linear, non-linearly time-varying effect
In each case, the functional form of the effect f can either be estimated from the data or prespecified (e.g. f(t)*x = karno * log(t + 20) above).
In most cases you would prefer to estimate f from the data. Support for the (penalized) estimation of such effects is, to my knowledge, limited in the survival package. However, you can use mgcv::gam to estimate any of the effects specified above (after appropriate data transformation). An example is given below; it shows that the effect of karno goes towards 0 as time progresses, regardless of the Karnofsky score at the beginning of the follow-up (see here for details and also Section 4.2 here):
library(pammtools)
# data transformation
ped <- as_ped(veteran, Surv(time, status)~., max_time = 400)
# model
pam <- mgcv::gam(ped_status ~ s(tend) + trt + prior + te(tend, karno, k = 10),
data = ped, family = poisson(), offset = offset, method = "REML")
p_2d <- gg_tensor(pam)
p_slice <- gg_slice(ped, pam, "karno", tend = unique(tend), karno = c(20, 50, 80), reference = list(karno = 60))
gridExtra::grid.arrange(p_2d, p_slice, nrow = 1)

lm function in R does not give coefficients for all factor levels in categorical data

I was trying out linear regression in R with categorical attributes and observed that I don't get a coefficient value for each of the factor levels I have.
Please see my code below: I have 5 factor levels for states, but see coefficient values for only 4 of them.
> states = c("WA","TE","GE","LA","SF")
> population = c(0.5,0.2,0.6,0.7,0.9)
> df = data.frame(states,population)
> df
states population
1 WA 0.5
2 TE 0.2
3 GE 0.6
4 LA 0.7
5 SF 0.9
> states=NULL
> population=NULL
> lm(formula=population~states,data=df)
Call:
lm(formula = population ~ states, data = df)
Coefficients:
(Intercept) statesLA statesSF statesTE statesWA
0.6 0.1 0.3 -0.4 -0.1
I also tried with a larger data set by doing the following, but still see the same behavior
for(i in 1:10)
{
df = rbind(df,df)
}
EDIT: Thanks to the responses from eipi10, MrFlick and economy, I now understand that one of the levels is being used as the reference level. But when I get new test data whose state value is "GE", how do I substitute it into the equation y = m1x1 + m2x2 + ... + c?
I also tried flattening out the data so that each factor level gets its own column, but again, for one of the columns I get NA as the coefficient. If I have new test data whose state is 'WA', how can I get the population value? What do I substitute as its coefficient?
> df1
population GE MI TE WA
1 1 0 0 0 1
2 2 1 0 0 0
3 2 0 0 1 0
4 1 0 1 0 0
lm(formula = population ~ (GE+MI+TE+WA),data=df1)
Call:
lm(formula = population ~ (GE + MI + TE + WA), data = df1)
Coefficients:
(Intercept) GE MI TE WA
1 1 0 1 NA
GE, being first alphabetically, is absorbed into the intercept as the reference level. As eipi10 stated, you can interpret the coefficients for the other levels of states with GE as the baseline (statesLA = 0.1 means LA is, on average, 0.1 higher than GE).
EDIT:
To respond to your updated question:
If you include dummy variables for all of the levels in a linear regression, you're going to have a situation called perfect collinearity, which is responsible for the strange results you're seeing when you force each category into its own column. I won't get into the full explanation of that (any introductory treatment of the "dummy variable trap" will cover it), but the short version is that linear regression cannot estimate the model when the level dummies sum to the intercept column, so one of them is redundant. If you want to see all of the levels in a regression, you can perform a regression without an intercept term, as suggested in the comments, but again, this is ill-advised unless you have a specific reason to.
As for the interpretation of GE in your y = mx + c equation, you can calculate the expected y by noting that the dummies for the other states are binary (zero or one), and if the state is GE, they are all zero.
e.g.
y = x1b1 + x2b2 + x3b3 + c
y = b1(0) + b2(0) + b3(0) + c
y = c
If you don't have any other variables, as in your first example, the prediction for GE will simply be the intercept term (0.6).
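In practice you rarely need to do the substitution by hand; here is a minimal sketch (assuming the model and data frame from the question) that lets predict() handle the dummy coding for you:
fit <- lm(population ~ states, data = df)
newdata <- data.frame(states = "GE")
predict(fit, newdata = newdata)  # returns the intercept, 0.6, since GE is the reference level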
