Marginal effects of logit in weighted survey data using R

I'm trying to estimate the marginal effects of a logit model in which I have several dichotomous explanatory variables.
Let's say the model is estimated by:
logit <- svyglm(if_member ~ if_female + dummy_agegroup_2 + dummy_agegroup_3 +
                  dummy_education_2 + dummy_education_3 + dummy_education_4,
                family = quasibinomial(link = "logit"), design = survey_design)
I know about the marginpred function in the survey package, but I am not very familiar with it. Since I have only dichotomous variables in the model, I am wondering how to estimate marginal effects with this function; in particular, I am not sure about predictat (a data frame giving values of the variables in the model to predict at).

Are you looking for marginal effects or marginal predictions?
As the name implies, the marginpred() function returns predictions. The predictat argument takes a data frame with both the control variables and the variables that are in the model. Let me emphasize that: the control variables should be left out of the model itself; they are passed via adjustfor.
library("survey")
odds2prob <- function(x) x / (x + 1)
prob2odds <- function(x) x / (1 - x)
expit <- function(x) odds2prob(exp(x))
logit <- function(x) log(prob2odds(x))
set.seed(1)
survey_data <- data.frame(
  if_female = rbinom(n = 100, size = 1, prob = 0.5),
  agegroup = factor(sample(x = 1:3, size = 100, replace = TRUE)),
  education = NA_integer_,
  if_member = NA_integer_)
survey_data["agegroup"] <- relevel(survey_data$agegroup, ref = 3)
# Different probabilities between female and male persons
survey_data[survey_data$if_female == 0, "education"] <- sample(
  x = 1:4,
  size = sum(survey_data$if_female == 0),
  replace = TRUE,
  prob = c(0.1, 0.1, 0.5, 0.3))
survey_data[survey_data$if_female == 1, "education"] <- sample(
  x = 1:4,
  size = sum(survey_data$if_female == 1),
  replace = TRUE,
  prob = c(0.1, 0.1, 0.3, 0.5))
survey_data["if_member"] <- rbinom(n = 100, size = 1,
                                   prob = expit((survey_data$education - 3) / 2))
survey_data["education"] <- factor(survey_data$education)
survey_data["education"] <- relevel(survey_data$education, ref = 3)
survey_design <- svydesign(ids = ~ 1, data = survey_data)
logit <- svyglm(if_member ~ if_female + agegroup + education,
                family = quasibinomial(link = "logit"),
                design = survey_design)
exp(cbind(`odds ratio` = coef(logit), confint(logit)))
newdf <- data.frame(if_female = 0:1, education = c(3, 3), agegroup = c(3, 3))
# Fails: agegroup and education appear both in the model and in adjustfor
mp <- marginpred(model = logit, adjustfor = ~ agegroup + education,
                 predictat = newdf, se = TRUE, type = "response")
logit2 <- svyglm(if_member ~ if_female,
                 family = quasibinomial(link = "logit"),
                 design = survey_design)
# Works: the control variables appear only in adjustfor
mp <- marginpred(model = logit2, adjustfor = ~ agegroup + education,
                 predictat = newdf, se = TRUE, type = "response")
# Probability for male and for female persons controlling for agegroup and education
cbind(prob = mp, confint(mp))
Here is how I estimate marginal effects with the survey package:
# Probability difference between female and male persons
# when agegroup and education are set to 3
svycontrast(logit, quote(
  (exp(`(Intercept)` + if_female) / (exp(`(Intercept)` + if_female) + 1)) -
    (exp(`(Intercept)`) / (exp(`(Intercept)`) + 1))))
# Can't use custom functions like expit :_(
There are probably smarter ways, but I hope it helps.
Please note that the difference between the probabilities predicted by marginpred() is not the same as the difference estimated by svycontrast(). The probabilities predicted by marginpred() don't seem to be affected by changing the values of the control variables (for example, education = c(4, 4) instead of education = c(3, 3)), but the estimates from svycontrast() are affected, as implied by the regression model.
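To check this yourself, here is a small sketch reusing the objects from the example above (the coefficient name education4 is an assumption that depends on your factor coding; check names(coef(logit)) first):
# Marginal predictions with the control variables set to another level;
# these should barely move compared to education = c(3, 3)
newdf4 <- data.frame(if_female = 0:1, education = c(4, 4), agegroup = c(3, 3))
mp4 <- marginpred(model = logit2, adjustfor = ~ agegroup + education,
                  predictat = newdf4, se = TRUE, type = "response")
cbind(prob = mp4, confint(mp4))

# The model-based contrast, by comparison, shifts with the covariate values,
# because education = 4 enters through its regression coefficient
svycontrast(logit, quote(
  (exp(`(Intercept)` + if_female + education4) /
     (exp(`(Intercept)` + if_female + education4) + 1)) -
    (exp(`(Intercept)` + education4) / (exp(`(Intercept)` + education4) + 1))))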


How to generate covariate-adjusted cox survival/hazard functions?

I'm using the survminer package to try to generate survival and hazard function graphs for a longitudinal student-level dataset that has 5 subgroups of interest.
I've had success creating a model that shows the survival functions without adjusting for student-level covariates using ggsurvplot.
ggsurvplot(survfit(Surv(expectedgr, sped) ~ langstatus_new, data=mydata), pvalue=TRUE)
Output example
However, I cannot manage to get these curves adjusted for covariates. My aim is to create graphs like these. As you can see, these are covariate-adjusted survival curves according to some factor variable. Does anyone know how such graphs can be obtained in R?
You want to obtain survival probabilities from a Cox model for certain values of some covariate of interest, while adjusting for other covariates. However, because we do not make any assumption about the distribution of the survival times in a Cox model, we cannot directly obtain survival probabilities from it. We first have to estimate the baseline hazard function, which is typically done with the non-parametric Breslow estimator. When the Cox model is fitted with coxph from the survival package, we can obtain such probabilities with a call to the survfit() function. You may consult ?survfit.coxph for more information.
Let's see how we can do this by using the lung data set.
library(survival)
# select covariates of interest
df <- subset(lung, select = c(time, status, age, sex, ph.karno))
# assess whether there are any missing observations
apply(df, 2, \(x) sum(is.na(x))) # 1 in ph.karno
# listwise delete missing observations
df <- df[complete.cases(df), ]
# Cox model
fit <- coxph(Surv(time, status == 2) ~ age + sex + ph.karno, data = df)
## Note that I ignore the fact that ph.karno does not satisfy the PH assumption.
# specify for which combinations of values of age, sex, and
# ph.karno we want to derive survival probabilies
ND1 <- with(df, expand.grid(
  age = median(age),
  sex = c(1, 2),
  ph.karno = median(ph.karno)
))
ND2 <- with(df, expand.grid(
  age = median(age),
  sex = 1, # males
  ph.karno = round(create_intervals(n_groups = 3L))
))
# Obtain the expected survival times
sfit1 <- survfit(fit, newdata = ND1)
sfit2 <- survfit(fit, newdata = ND2)
The code behind the function create_intervals() can be found in this post. I simply replaced speed with ph.karno in the function.
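Since create_intervals() is defined in that other post, here is a minimal sketch of what it might look like in this setting (my reconstruction, not the original code; I assume it returns a representative value of ph.karno for each of n_groups equal-frequency groups):
# Hypothetical reconstruction of create_intervals():
# split ph.karno into n_groups equal-frequency groups and
# return the mean ph.karno within each group
create_intervals <- function(n_groups) {
  x <- df$ph.karno
  grp <- cut(x,
             breaks = quantile(x, probs = seq(0, 1, length.out = n_groups + 1)),
             include.lowest = TRUE)
  tapply(x, grp, mean)
}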
The output sfit1 contains the expected median survival times and the corresponding 95% confidence intervals for the combinations of covariates as specified in ND1.
> sfit1
Call: survfit(formula = fit, newdata = ND1)

    n events median 0.95LCL 0.95UCL
1 227    164    283     223     329
2 227    164    371     320     524
Survival probabilities at specific follow-up times can be obtained with the times argument of the summary() method.
# survival probabilities at 200 days of follow-up
summary(sfit1, times = 200)
The output again contains the expected survival probabilities, but now after 200 days of follow-up, wherein survival1 corresponds to the first row of ND1 (a male patient of median age with median ph.karno) and survival2 to the second row (the corresponding female patient).
> summary(sfit1, times = 200)
Call: survfit(formula = fit, newdata = ND1)

 time n.risk n.event survival1 survival2
  200    144      71     0.625     0.751
The 95% confidence limits associated with these two probabilities can be manually extracted from summary().
sum_sfit <- summary(sfit1, times = 200)
sum_sfit <- t(rbind(sum_sfit$surv, sum_sfit$lower, sum_sfit$upper))
colnames(sum_sfit) <- c("S_hat", "2.5 %", "97.5 %")
# ------------------------------------------------------
> sum_sfit
      S_hat     2.5 %    97.5 %
1 0.6250586 0.5541646 0.7050220
2 0.7513961 0.6842830 0.8250914
If you would like to use ggplot to depict the expected survival probabilities (and the corresponding 95% confidence intervals) for the combinations of values as specified in ND1 and ND2, we first need to make data.frames that contain all the information in an appropriate format.
# function which returns the output from a survfit object
# in an appropriate format, which can be used in a call
# to ggplot()
df_fun <- \(surv_obj, newdata, factor) {
  len <- length(unique(newdata[[factor]]))
  out <- data.frame(
    time = rep(surv_obj[['time']], times = len),
    n.risk = rep(surv_obj[['n.risk']], times = len),
    n.event = rep(surv_obj[['n.event']], times = len),
    surv = stack(data.frame(surv_obj[['surv']]))[, 'values'],
    upper = stack(data.frame(surv_obj[['upper']]))[, 'values'],
    lower = stack(data.frame(surv_obj[['lower']]))[, 'values']
  )
  out[, 7] <- gl(len, length(surv_obj[['time']]))
  names(out)[7] <- 'factor'
  return(out)
}
# data for the first panel (A)
df_leftPanel <- df_fun(surv_obj = sfit1, newdata = ND1, factor = 'sex')
# data for the second panel (B)
df_rightPanel <- df_fun(surv_obj = sfit2, newdata = ND2, factor = 'ph.karno')
Now that we have defined our data.frames, we need to define a new function which allows us to plot the 95% CIs. We assign it the generic name geom_stepribbon.
library(ggplot2)
# Function for geom_stepribbon
geom_stepribbon <- function(
    mapping = NULL,
    data = NULL,
    stat = "identity",
    position = "identity",
    na.rm = FALSE,
    show.legend = NA,
    inherit.aes = TRUE, ...) {
  layer(
    data = data,
    mapping = mapping,
    stat = stat,
    geom = GeomStepribbon,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

GeomStepribbon <- ggproto(
  "GeomStepribbon", GeomRibbon,
  extra_params = c("na.rm"),
  draw_group = function(data, panel_scales, coord, na.rm = FALSE) {
    if (na.rm) data <- data[complete.cases(data[c("x", "ymin", "ymax")]), ]
    data <- rbind(data, data)
    data <- data[order(data$x), ]
    data$x <- c(data$x[2:nrow(data)], NA)
    data <- data[complete.cases(data["x"]), ]
    GeomRibbon$draw_group(data, panel_scales, coord, na.rm = FALSE)
  }
)
Finally, we can plot the expected survival probabilities for ND1 and ND2.
yl <- 'Expected Survival probability\n'
xl <- '\nTime (days)'
# left panel
my_colours <- c('blue4', 'darkorange')
adj_colour <- \(x) adjustcolor(x, alpha.f = 0.2)
my_colours <- c(
my_colours, adj_colour(my_colours[1]), adj_colour(my_colours[2])
)
left_panel <- ggplot(df_leftPanel,
                     aes(x = time, colour = factor, fill = factor)) +
  geom_step(aes(y = surv), size = 0.8) +
  geom_stepribbon(aes(ymin = lower, ymax = upper), colour = NA) +
  scale_colour_manual(name = 'Sex',
                      values = c('1' = my_colours[1],
                                 '2' = my_colours[2]),
                      labels = c('1' = 'Males',
                                 '2' = 'Females')) +
  scale_fill_manual(name = 'Sex',
                    values = c('1' = my_colours[3],
                               '2' = my_colours[4]),
                    labels = c('1' = 'Males',
                               '2' = 'Females')) +
  ylab(yl) + xlab(xl) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 12),
        legend.position = 'top')
# right panel
my_colours <- c('blue4', 'darkorange', '#00b0a4')
my_colours <- c(
my_colours, adj_colour(my_colours[1]),
adj_colour(my_colours[2]), adj_colour(my_colours[3])
)
right_panel <- ggplot(df_rightPanel,
                      aes(x = time, colour = factor, fill = factor)) +
  geom_step(aes(y = surv), size = 0.8) +
  geom_stepribbon(aes(ymin = lower, ymax = upper), colour = NA) +
  scale_colour_manual(name = 'Ph.karno',
                      values = c('1' = my_colours[1],
                                 '2' = my_colours[2],
                                 '3' = my_colours[3]),
                      labels = c('1' = 'Low',
                                 '2' = 'Middle',
                                 '3' = 'High')) +
  scale_fill_manual(name = 'Ph.karno',
                    values = c('1' = my_colours[4],
                               '2' = my_colours[5],
                               '3' = my_colours[6]),
                    labels = c('1' = 'Low',
                               '2' = 'Middle',
                               '3' = 'High')) +
  ylab(yl) + xlab(xl) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 12),
        legend.position = 'top')
# composite plot
library(ggpubr)
ggarrange(left_panel, right_panel,
          ncol = 2, nrow = 1,
          labels = c('A', 'B'))
Output
Interpretation
Panel A shows the expected survival probabilities for a male and female patient of median age with a median ph.karno.
Panel B shows the expected survival probabilities for three male patients of median age with ph.karnos of 67 (low), 83 (middle), and 100 (high).
These survival curves will always satisfy the PH assumption, as they were derived from the Cox model.
Note: use function(x) instead of \(x) if you use a version of R < 4.1.0.
Although correct, I believe that the method described in the answer by Dion Groothof is not what is usually of interest. Usually, researchers are interested in visualizing the causal effect of a variable adjusted for confounders. Simply showing the predicted survival curve for one single covariate combination does not really do the trick here. I would recommend reading up on confounder-adjusted survival curves. See https://arxiv.org/abs/2203.10002 for an example.
These types of curves can be calculated in R using the adjustedCurves package: https://github.com/RobinDenz1/adjustedCurves
In your example, the following code could be used:
library(survival)
library(devtools)
# install adjustedCurves from github, load it
devtools::install_github("RobinDenz1/adjustedCurves")
library(adjustedCurves)
# "event" needs to be binary
lung$status <- lung$status - 1
# "variable" needs to be a factor
lung$ph.ecog <- factor(lung$ph.ecog)
fit <- coxph(Surv(time, status) ~ ph.ecog + age + sex, data = lung,
             x = TRUE)
# calculate and plot curves
adj <- adjustedsurv(data = lung, variable = "ph.ecog", ev_time = "time",
                    event = "status", method = "direct",
                    outcome_model = fit, conf_int = TRUE)
plot(adj)
Producing the following output:
These survival curves are adjusted for the effect of age and sex. More information on how this adjustment works can be found in the documentation of the adjustedCurves package or the article I cited above.
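If you are curious what the "direct" method does conceptually, here is a rough sketch of direct standardization (g-computation) using plain survfit; this is my own illustration of the idea, not code from the adjustedCurves package:
# Rough illustration of direct adjustment (g-computation):
# for each level of ph.ecog, set everyone's ph.ecog to that level,
# predict a survival curve for every individual, then average the curves
direct_adjust <- function(level, fit, data) {
  nd <- data
  nd$ph.ecog <- factor(level, levels = levels(data$ph.ecog))
  sf <- survfit(fit, newdata = nd)
  data.frame(time = sf$time, surv = rowMeans(sf$surv), ph.ecog = level)
}
adj_manual <- do.call(rbind, lapply(levels(lung$ph.ecog), direct_adjust,
                                    fit = fit, data = lung))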

Can anyone please help fix the following glmmLasso R package errors?

I have been trying to run the following code and I am getting various errors. Does anyone know how to fix the current one? I am trying to run a generalized linear mixed model with a tuning parameter (specifically LASSO), but I wanted to start with the basics and get the fixed effects to work first. Thank you!
library(glmmLasso)
y <- rbinom(n = 50, size = 1, prob = .5)
x <- rnorm(n = 50, mean = 1, sd = .5)
data <- data.frame(x, y)
mod1 <- glmmLasso(fix = y ~ x, rnd = NULL, family = binomial(link = logit), lambda = 10, data = data)
error: the condition has length > 1 and only the first element will be used (the message appears twice)
another error: data length is not a multiple of split variable (this does not happen with this simulated data, but it does with my real data)
Another note - I have tried the exact code in the help documentation for generalized linear mixed models with the soccer data, and I get the same error about the length > 1.
I guess it is a problem with R 4.0.3.
I used glmmLasso without any errors, and this error occurred when I updated my base R from 3.6 to 4.0.3. I wrote an email to the author.
My guess is that with only one predictor and data that has no real relationship, your lambda is too high for one variable and produces a degenerate solution: the relative-change convergence check below presumably ends up dividing zero by zero, yielding the missing value in the if() condition. You can see that the source code throws the error at these two lines:
finish<-(sqrt(sum((Eta.ma[l,]-Eta.ma[l+1,])^2))/sqrt(sum((Eta.ma[l,])^2))<eps)
finish2<-(sqrt(sum((Eta.ma[l-1,]-Eta.ma[l+1,])^2))/sqrt(sum((Eta.ma[l-1,])^2))<eps)
if(finish || finish2)
To make this reproducible:
set.seed(2)
y <- rbinom(n = 50, size = 1, prob = .5)
x <- rnorm(n = 50, mean = 1, sd = .5)
data <- data.frame(x, y)
mod1 <- glmmLasso(fix = y ~ x , rnd = NULL, family = binomial(link = logit), lambda = 10, data = data)
Error in if (finish || finish2) break :
missing value where TRUE/FALSE needed
mod1 <- glmmLasso(fix = y ~ x , rnd = NULL, family = binomial(link = logit), lambda = 1, data = data)
mod1
Call:
glmmLasso(fix = y ~ x, rnd = NULL, data = data, lambda = 1, family = binomial(link = logit))

Fixed Effects:

Coefficients:
(Intercept)           x
 -0.8089034   0.8678967
If we try another seed, you can see there's no issue, although the end solution has the coefficient set to zero:
set.seed(1)
y <- rbinom(n = 50, size = 1, prob = .5)
x <- rnorm(n = 50, mean = 1, sd = .5)
data <- data.frame(x, y)
mod1 <- glmmLasso(fix = y ~ x , rnd = NULL, family = binomial(link = logit), lambda = 10, data = data)
mod1
Call:
glmmLasso(fix = y ~ x, rnd = NULL, data = data, lambda = 10,
    family = binomial(link = logit))

Fixed Effects:

Coefficients:
(Intercept)           x
  0.1603426   0.0000000
To sum up: most likely, for your data, you need to move through a grid of lambdas to examine the fit.
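A minimal sketch of such a lambda search (my own illustration; the try() wrapper simply skips lambdas that hit the convergence error above):
library(glmmLasso)

set.seed(2)
y <- rbinom(n = 50, size = 1, prob = .5)
x <- rnorm(n = 50, mean = 1, sd = .5)
data <- data.frame(x, y)

# try a grid of lambdas, keeping only the fits that succeed
lambdas <- c(0.1, 0.5, 1, 2, 5, 10)
fits <- lapply(lambdas, function(l) {
  try(glmmLasso(fix = y ~ x, rnd = NULL,
                family = binomial(link = logit),
                lambda = l, data = data),
      silent = TRUE)
})
ok <- !vapply(fits, inherits, logical(1), what = "try-error")

# inspect how the coefficients shrink along the path
cbind(lambda = lambdas[ok],
      t(vapply(fits[ok], function(f) f$coefficients, numeric(2))))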

Logistic regression for non-linear data

I have data with a continuous independent variable and a binary dependent variable, so I was trying to apply logistic regression to analyse it. However, in contrast to the classical case with one S-shaped transition, I have two transitions.
Here is an example of what I mean:
library(ggplot2)
library(visreg)
classic.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
                          y = c(rep(0, times = 14), 1, 0, rep(1, times = 14)))
model.classic = glm(formula = y ~ x,
                    data = classic.data,
                    family = "binomial")
summary(model.classic)
visreg(model.classic,
       partial = FALSE,
       scale = "response",
       alpha = 0)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
                     y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
model.my = glm(formula = y ~ x,
               data = my.data,
               family = "binomial")
summary(model.my)
visreg(model.my,
       partial = FALSE,
       scale = "response",
       alpha = 0)
The blue lines on both plots show the outcome of glm, while the red line is what I want to have.
Is there any way to apply logistic regression to such data? Or should I apply some other type of regression analysis?
In your second model, y is not a monotone function of x. When you write y ~ x you assume that when x increases, y will increase/decrease depending on a positive/negative coefficient. That is not the case here: y first increases and then decreases, making the average effect of x zero (hence the straight line). You therefore need a non-linear function. You could do that with a gam from the mgcv package, where the effect of x is modelled as a smooth function:
library(mgcv)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
                     y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
m = gam(y ~ s(x), data = my.data, family = binomial)
plot(m)
That would lead to the following fit on the original scale:
my.data$prediction = predict(m, type = "response")
plot(my.data$x, my.data$y)
lines(my.data$x, my.data$prediction, col = "red")

Plotting with GLMMadaptive for zero-inflated, semi-continuous data?

I'm trying to utilize effectPlotData as described here: https://cran.r-project.org/web/packages/GLMMadaptive/vignettes/Methods_MixMod.html
But I'm trying to apply it to a model (a two-part mixed model for zero-inflated semi-continuous data) that includes random/fixed effects for both a linear and a logistic portion (hurdle lognormal). I get the following error:
'Error in Qs[1, ] : incorrect number of dimensions'
I think this comes from having more than one set of random/fixed effect outcomes, but if anyone else has come across this error or can advise, it would be appreciated! I've tried changing the terms in the new data frame and tried a couple of different options with length.out (attempted this as the number of subjects and then the number of total observations across all subjects), but I get the same error each time.
The code below specifies the model as m and the new data frame as nDF:
m <- mixed_model(Y ~ X, random = ~ 1 | Subject,
                 data = data_combined_temp_Fix_Num3,
                 family = hurdle.lognormal,
                 n_phis = 1, zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                 na.action = na.exclude)
nDF <- with(data_combined_temp_Fix_Num3,
            expand.grid(X = seq(min(X), max(X), length.out = 908),
                        Y = levels(Y)))
effectPlotData(m, nDF)
It seems to work with the following example:
library("GLMMadaptive")
set.seed(1234)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time
# we construct a data frame with the design:
# everyone has a baseline measurement, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
                 time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
                 sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
# design matrices for the fixed and random effects non-zero part
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ time, data = DF)
# design matrices for the fixed and random effects zero part
X_zi <- model.matrix(~ sex, data = DF)
Z_zi <- model.matrix(~ 1, data = DF)
betas <- c(-2.13, -0.25, 0.24, -0.05) # fixed effects coefficients non-zero part
sigma <- 0.5 # standard deviation error terms non-zero part
gammas <- c(-1.5, 0.5) # fixed effects coefficients zero part
D11 <- 0.5 # variance of random intercepts non-zero part
D22 <- 0.1 # variance of random slopes non-zero part
D33 <- 0.4 # variance of random intercepts zero part
# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)), rnorm(n, sd = sqrt(D33)))
# linear predictor non-zero part
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, 1:2, drop = FALSE]))
# linear predictor zero part
eta_zi <- as.vector(X_zi %*% gammas + rowSums(Z_zi * b[DF$id, 3, drop = FALSE]))
# we simulate log-normal longitudinal data
DF$y <- exp(rnorm(n * K, mean = eta_y, sd = sigma))
# we set the zeros from the logistic regression
DF$y[as.logical(rbinom(n * K, size = 1, prob = plogis(eta_zi)))] <- 0
###############################################################################
km1 <- mixed_model(y ~ sex * time, random = ~ 1 | id, data = DF,
family = hurdle.lognormal(),
zi_fixed = ~ sex)
km1
nDF <- with(DF, expand.grid(time = seq(min(time), max(time), length.out = 15),
                            sex = levels(sex)))
plot_data <- effectPlotData(km1, nDF)
library("lattice")
xyplot(pred + low + upp ~ time | sex, data = plot_data,
type = "l", lty = c(1, 2, 2), col = c(2, 1, 1), lwd = 2,
xlab = "Follow-up time", ylab = "")
To plot the predictions on the scale of the original outcome rather than the log scale, the model's mean and link functions can be overridden before calling effectPlotData():
local({
km1$Funs$mu_fun <- function (eta) {
pmax(exp(eta + 0.5 * exp(2 * km1$phis)), .Machine$double.eps)
}
km1$family$linkfun <- function (mu) log(mu)
plot_data <- effectPlotData(km1, nDF)
xyplot(exp(pred) + exp(low) + exp(upp) ~ time | sex, data = plot_data,
type = "l", lty = c(1, 2, 2), col = c(2, 1, 1), lwd = 2,
xlab = "Follow-up time", ylab = "")
})
In case someone comes across the same error: I was filtering data from my data frame within the model call, which caused the dimensions of the model and the variables from the data frame to not match. I applied the same filtering to the new data frame. (I've also moved forward with a completely new data frame that only includes trials that are actually used by the model, so that no filtering has to be applied at any step.)
m <- mixed_model(Y ~ X, random = ~ 1 | Subject,
                 data = data_combined_temp_Fix_Num3[data_combined_temp_Fix_Num3$Z >= 4 &
                                                      data_combined_temp_Fix_Num3$ZZ >= 4, ],
                 family = hurdle.lognormal,
                 n_phis = 1, zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                 na.action = na.exclude)
nDF <- with(data_combined_temp_Fix_Num3,
            expand.grid(X = seq(min(X[data_combined_temp_Fix_Num3$Z >= 4 &
                                       data_combined_temp_Fix_Num3$ZZ >= 4]),
                                max(X[data_combined_temp_Fix_Num3$Z >= 4 &
                                        data_combined_temp_Fix_Num3$ZZ >= 4]),
                                length.out = 908),
                        Y = levels(Y)))
effectPlotData(m, nDF)

R lmer with variance structure

I want to fit a mixed model, but with a variance structure.
I create my data frame:
DTF <- data.frame(Y = rnorm(7*4, 0, 1),
                  TIME = rep(1:7, 4),
                  GROUP = c(rep(1,7), rep(2,7), rep(3,7), rep(4,7)),
                  X = rep(rnorm(7, 20, 5), 4))
and then I run this command:
model <- lmer(Y ~ TIME:GROUP + X + (1 | GROUP), DTF, REML = FALSE,
varWeights(varFixed(~ TIME)))
and I get:
Error in lmer(...): 'control' is not a list; use lmerControl()
I don't know what I have to change.
Thanks for your help!
