Plotting predicted survival curves for continuous covariates in ggplot - r

How can I plot survival curves for representative values of a continuous covariate in a Cox proportional hazards model? Specifically, I would like to do this in ggplot using a "survfit.cox" "survfit" object.
This may seem like a question that has already been answered, but I have searched through everything in SO with the terms 'survfit' and 'newdata' (plus many other search terms). This is the thread that comes closest to answering my question so far: Plot Kaplan-Meier for Cox regression
In keeping with the reproducible example offered in one of the answers to that post:
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
df <- read.table(url, header = TRUE)
library(dplyr)
library(ggplot2)
library(survival)
library(magrittr)
library(broom)
# Identifying the 25th and 75th percentiles for prio (continuous covariate)
summary(df$prio)
# Cox proportional hazards model with other covariates
# 'prio' is our explanatory variable of interest
m1 <- coxph(Surv(week, arrest) ~
fin + age + race + prio,
data = df)
# Creating new df to get survival predictions
# Want separate curves for the different 'fin' and 'race'
# groups as well as the 25th and 75th percentile of prio
newdf <- df %$%
expand.grid(fin = levels(fin),
age = 30,
race = levels(race),
prio = c(1,4))
# Obtain the fitted survival curve, then tidy
# into a dataframe that can be used in ggplot
survcurv <- survfit(m1, newdata = newdf) %>%
tidy()
The problem is that once I have this dataframe called survcurv, I cannot tell which of the 'estimate' variables belongs to which pattern, because none of the original variables are retained. For example, which of the 'estimate' variables represents the fitted curve for a 30-year-old with race = 'other', prio = 4, fin = 'no'?
In all the other examples I've seen, the survfit object is usually passed to the generic plot() function without a legend. I want to use ggplot and add a legend for each of the predicted curves.
In my own dataset, the model is a lot more complex and there are a lot more curves than I show here, so as you can imagine seeing 40 different 'estimate.1'..'estimate.40' variables makes it hard to understand what is what.

Thanks for providing a well-phrased question and a good example. I'm a little surprised that tidy does a relatively poor job of creating sensible output here. Please see below for my attempt at creating some plottable data:
library(tidyr)
newdf$group <- as.character(1:nrow(newdf))
survcurv <- survfit(m1, newdata = newdf) %>%
tidy() %>%
gather('key', 'value', -time, -n.risk, -n.event, -n.censor) %>%
mutate(group = sub(".*\\.", "", key),        # trailing index, e.g. "10" in "estimate.10"
key = sub("\\.[0-9]+$", "", key)) %>%  # column stem, e.g. "estimate"
left_join(newdf, 'group') %>%
spread(key, value)
And then create a plot (perhaps you'd like to use geom_step instead, but unfortunately there is no step-shaped ribbon):
ggplot(survcurv, aes(x = time, y = estimate, ymin = conf.low, ymax = conf.high,
col = race, fill = race)) +
geom_line(size = 1) +
geom_ribbon(alpha = 0.2, col = NA) +
facet_grid(prio ~ fin)

Try defining your survcurv like this:
survcurv <-
lapply(1:nrow(newdf),
function(x, m1, newdata){
cbind(newdata[x, ], survfit(m1, newdata[x, ]) %>% tidy)
},
m1,
newdf) %>%
bind_rows()
This will include all of the predictor values as columns with the predicted estimates.
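As a rough sketch of the same idea without tidy(): for a Cox model, the surv, lower and upper components of the survfit object are matrices with one column per row of newdata, in the same order, so each column can be matched back to its covariate pattern directly. The example below uses the lung data that ships with survival as a stand-in for the Rossi data; the names m, nd and curves are made up for illustration.

```r
library(survival)

# Cox model on the built-in lung data (a stand-in for the Rossi data above)
m <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Covariate patterns to predict for; one survival curve per row
nd <- expand.grid(age = c(56, 69), sex = c(1, 2))

sf <- survfit(m, newdata = nd)

# sf$surv, sf$lower and sf$upper have one row per time point and
# one column per row of nd. as.vector() stacks them column by column,
# so repeat each row of nd once per time point to line things up.
idx <- rep(seq_len(nrow(nd)), each = length(sf$time))
curves <- data.frame(
  nd[idx, , drop = FALSE],
  time      = rep(sf$time, times = nrow(nd)),
  estimate  = as.vector(sf$surv),
  conf.low  = as.vector(sf$lower),
  conf.high = as.vector(sf$upper),
  row.names = NULL
)

head(curves)
```

The resulting long data frame carries the predictor values alongside every estimate, so it can go straight into ggplot, for example with colour mapped to factor(age) and a facet per sex.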

Related

Produce a facet with an overall group in ggsurvplot in R

In R, I am using ggsurvplot_facet to produce survival curves for groups defined by sex, faceted by a variable ecog. However, I would also like to have an overall group in the same facet layout. Is this possible?
ggsurvplot_add_all did not help.
Here is some example data:
library(survival)   # provides Surv() and the lung dataset
library(survminer)
lung$ecog <- ifelse(lung$ph.ecog == 0, 0, 1)
fit <- surv_fit(Surv(time, status) ~ sex, data = lung)
fig_os <- ggsurvplot_facet(fit, data = lung, facet.by = 'ecog')
I need a survival curve for the whole population, independent of ecog.
Cheat a little: row-bind lung to a copy of itself in which ecog has been set to a constant (here 2), then convert ecog in the combined dataset to a factor so that the constant becomes an "Overall" level.
lung2 <- lung
lung2$ecog <- 2
lung2 <- rbind(lung,lung2)
lung2$ecog <- factor(lung2$ecog,labels = c("0", "1", "Overall"))
Then use your code above, but using lung2 as the dataset.
fit <- surv_fit(Surv(time, status) ~ sex, data = lung2)
fig_os <- ggsurvplot_facet(fit, data = lung2, facet.by = 'ecog')
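To see why the trick is legitimate, here is a small check in base R plus survival only (no survminer): the rows labelled "Overall" are simply a second copy of the full dataset, so curves fitted on that subset match those from the whole population. The names d, overall and stacked are illustrative and not part of the original answer.

```r
library(survival)

d <- lung
d$ecog <- ifelse(d$ph.ecog == 0, 0, 1)

# Second copy of the data with ecog forced to a constant
overall <- d
overall$ecog <- 2

stacked <- rbind(d, overall)
stacked$ecog <- factor(stacked$ecog, levels = c(0, 1, 2),
                       labels = c("0", "1", "Overall"))

# The "Overall" level holds every patient exactly once
table(stacked$ecog, useNA = "ifany")

# Curves on the "Overall" subset equal curves on the full data
fit_overall <- survfit(Surv(time, status) ~ sex,
                       data = subset(stacked, ecog == "Overall"))
fit_all <- survfit(Surv(time, status) ~ sex, data = lung)
all.equal(fit_overall$surv, fit_all$surv)
```

One thing to keep in mind is that the stacked data counts each patient twice, so it is suited to drawing the facets and not to fitting a single model on the combined rows.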

Function to update `ggplot2::labs(caption=)` using data passed to `ggplot2::ggplot(data=)`

I've recently written my first ggplot2 stat and geom methods. I want to write another that uses the data passed in ggplot2::ggplot(data=) to add a p-value as a caption to the figure. Is that possible?
For example, I would like to write something like this:
library(ggplot2)
mtcars |>
ggplot(aes(x = mpg, y = cyl)) +
add_pvalue()
Where add_pvalue() would calculate a p-value (e.g. an anova p-value for different mean MPG by the number of cylinders), and add the p-value as a caption, labs(caption = "p = 0.45").
Thank you!
Daniel, it's possible. Here is an example that adds a precomputed p-value as a caption. I hope it helps!
library(ggplot2)
library(glue)
p_value <- 0.05
mtcars |>
ggplot(aes(x = mpg, y = cyl)) +
labs(caption = glue("p = {p_value}"))
You could do something like the following, picking your preferred statistical model, "types" of p-values, and formatting of the p-value. If you wanted to build in lots of functionality to make it useful for a wide variety of models, you'd want to add conditional extractor functions for those models.
# Packages
library(ggplot2)
library(dplyr)
library(rlang)
# Define "add_pvalue()" function
# adds p-value from linear regression of y on x
# note that this assumes x and y are reals or integers
add_pvalue <- function(ggplot_obj) {
# Get x and y variable names from ggplot object
x <- ggplot_obj$mapping$x %>%
rlang::quo_get_expr() %>%
deparse()
y <- ggplot_obj$mapping$y %>%
rlang::quo_get_expr() %>%
deparse()
# Build regression model formula, fit model, return model summary
mod <- paste0(y, "~ ", x) %>%
as.formula() %>%
lm(data = ggplot_obj$data) %>%
summary()
# Extract two-tailed t-test p-value from model object (reformat as desired)
pval <- mod$coefficients[x, "Pr(>|t|)"]
# Add p_value as plot caption
ggplot_obj +
labs(caption = paste0("p = ", pval))
}
# Example with p-value for linear model and 95% confidence intervals
mtcars %>%
ggplot(aes(x = mpg, y = cyl)) %>%
add_pvalue() +
geom_smooth(method = "lm", se = TRUE, level = 0.95)
#> `geom_smooth()` using formula 'y ~ x'
Note that blindly fitting a linear regression or ANOVA to your data is probably not the best decision since x or y may not be real or integer types. If they aren't, this won't really make sense since some models either throw runtime errors or employ one-hot encoding when passed other types of variables.
Similarly, the p-values you obtain may be utterly meaningless if, for example, each row in the data is not an independent observation, you run lots of models that invalidate the sampling assumptions of p-values, your hypothesis doesn't match the test, etc.
Finally, you could also try using the output of stat_smooth() that is produced when you call geom_smooth() to do this. The upside would be that you wouldn't need to fit the model twice to have both that geom and the p-value (using the standard error and coefficients plus normal distribution to get the p-value). That's a bit outside of the scope and would be more limiting since you're stuck with the models it employs and the same issues plague those as well. It's also pretty annoying to extract those: Method to extract stat_smooth line fit
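For the ANOVA case mentioned in the question, a minimal sketch is to keep the statistics separate from the plotting: compute the caption text from the data and a formula, then hand the string to labs(caption = ...). The helper name pvalue_caption() and its arguments are hypothetical, and the formatting via format.pval() is only one possible choice.

```r
# Hypothetical helper: one-way ANOVA p-value formatted as a caption string
pvalue_caption <- function(data, formula, digits = 3) {
  fit <- lm(formula, data = data)
  # p-value of the F test for the first term in the model
  p <- anova(fit)[1, "Pr(>F)"]
  txt <- format.pval(p, digits = digits, eps = 10^-digits)
  # format.pval() returns e.g. "<0.001" for very small values
  if (startsWith(txt, "<")) paste("p", txt) else paste("p =", txt)
}

# Does mean mpg differ by number of cylinders?
cap <- pvalue_caption(mtcars, mpg ~ factor(cyl))
cap
```

The string can then be used as labs(caption = cap) on any ggplot built from the same data. The caveats above about choosing an appropriate test apply here as well.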

Graphing model results of longitudinal data in R

I am looking to create a graph of longitudinal data by age and sex, similar to the graph in this image, from this paper: https://www.thelancet.com/journals/lanpub/article/PIIS2468-2667(20)30258-9/fulltext.
To graph model results in the past, I have used both ggplot2 and ggpredict. I prefer ggpredict because it graphs the results accounting for covariates, but I am OK with graphing in ggplot2 if it can't be done in ggpredict.
I am providing a minimal reproducible example below, with id, wave (2 waves, separated by 6 years), age, sex, tst (total sleep time), and bmi for a covariate.
id<-rep(1:50, 2)
wave<-c(rep(1, 50),rep(2, 50))
tst<-c(sample(7:9,50, replace = T),sample(4:7,50, replace = T))
mydf<-data.frame(id,wave,tst)
mydf$age[mydf$wave==1]<-sample(40:90,50, replace = T)
mydf$age[mydf$wave==2]<-mydf$age[mydf$wave==1]+6
mydf$bmi<-sample(20:30,50, replace = T)
mydf$sex<-sample(1:2,50, replace = T)
mydf$age.cat<-cut(mydf$age[mydf$wave==1], breaks = 3,labels = c(1,2,3))
## Overall model (lmer() comes from the lme4 package) ##
library(lme4)
(model <- lmer( tst ~ wave + age + sex + bmi +(1|id), data = mydf))
I tried to graph it with ggplot2 using the following syntax, however I'm not sure that the graph is exactly what I'm looking for. I would like to graph change in tst between waves 1 and 2, by age group and sex. TST would be on the y axis, age would be on the x axis, with separate lines for age group and sex, with standard errors. The lines will correspond to within-person change in TST between waves 1 and 2.
I think that the graph right now is showing the between subjects effects of age on tst, and not taking into account the fact that the data is nested within-person. Any help would be greatly appreciated.
ggplot(mydf,aes(x=age, y=tst, color=as.factor(sex), group=as.factor(age.cat), linetype=as.factor(age.cat)))+
geom_smooth(data=mydf[mydf$sex==1,], method = lm, formula = y~x)+
geom_smooth(data=mydf[mydf$sex==2,], method = lm, formula = y~x)+
geom_point() +
theme_bw()

Using predictNLS to create confidence intervals around fitted values in R?

I want to build confidence intervals around a large set of fitted values using predictNLS from the propagate package in R. As an example, I will use the data set referenced in the function description (https://rdrr.io/github/anspiess/propagate/man/predictNLS.html), DNase, and build a model that takes the values conc and density as features:
library(propagate)
library(dplyr)
library(modelr)
DNase <- DNase
modeldna <- DNase %>% group_by(Run) %>%
do(run_model = nls(density ~ a * exp(b * conc),
start = list(a = 1 , b = 0.5),
data = .)) %>% ungroup()
I then want to give each row the model that it is assigned to so that predictions can be added:
DNApredict <- full_join(as_tibble(DNase), modeldna, by = "Run")
Add in the predictions:
DNApredict <- DNApredict %>%
group_by(Run) %>%
do(add_predictions(., var = "predicted_density", first(.$run_model)))
And then, I want to add the confidence interval data that predictNLS seems to provide, by giving it that same data and asking it to give a confidence interval for each fitted point in the predicted_density column:
confidence_interval <- predictNLS(model = modeldna, newdata = DNApredict$predicted_density, interval = "confidence")
However, the following error arises:
Error in as.list(object$call$formula) :
argument "object" is missing, with no default
Does anyone know what might be causing this? I know that it will likely seem obvious to some of you what the object it is calling is, so I apologize if this is a ridiculous question. I am really hoping to be able to use this function to create confidence intervals around a series of fitted values. Thank you very much in advance.
Since you are running an nls on each Run in the sample data set, it is easy to get a list of nls models by splitting the data into one data frame per run and running nls on each data frame using lapply:
library(propagate)
DNase <- DNase
# split() returns a named list with one data frame per Run
modeldna <- split(DNase, DNase$Run)
models <- lapply(modeldna, function(d) nls(density ~ a * exp(b * conc),
start = list(a = 1 , b = 0.5),
data = d))
Now we can get predictions for each point in each model just as easily by running predictNLS on each model (again inside lapply):
results <- lapply(seq_along(modeldna), function(i) {
predictNLS(models[[i]], newdata = data.frame(conc = modeldna[[i]]$conc))
})
Because of the output structure of predictNLS, we need to extract the predictions for each row and coerce them into a data frame:
predictions <- lapply(results, function(x) {
as.data.frame(do.call(rbind, lapply(x$prop, function(y) y$prop)))})
Finally, we can stick our predictions (including confidence intervals) back onto the original data frame:
all_results <- do.call(rbind, lapply(seq_along(modeldna),
function(i) cbind(modeldna[[i]], predictions[[i]])))
This now gives us a complete data frame of original data points, and the relevant predictions with confidence intervals.
To show this, we can plot the results in ggplot. Here we show one plot for each run, including its original data, the predicted value as a dotted line, and the 95% confidence limit as a pale blue ribbon:
library(ggplot2)
ggplot(all_results, aes(x = conc, y = density)) +
geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`),
fill = "deepskyblue4", alpha = 0.2) +
geom_point() +
geom_line(aes(y = Mean.1), linetype = 2) +
facet_wrap(.~factor(Run, levels = 1:11)) +
theme_bw()
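The structural point behind the fix can be sketched in base R alone: a prediction function needs a single "nls" object and a newdata data frame whose column names match the predictors, not a data frame of models and a bare vector. The sketch below deliberately uses stats::predict() and not predictNLS, so it returns fitted values only and no confidence intervals; the names by_run, fitted_by_run and all_fitted are illustrative.

```r
# Plain data frame copy of the built-in DNase data, split by run
dnase <- as.data.frame(DNase)
by_run <- split(dnase, dnase$Run)

# One nls fit per run, using the same model as in the thread
models <- lapply(by_run, function(d) {
  nls(density ~ a * exp(b * conc),
      start = list(a = 1, b = 0.5),
      data = d)
})

# Every list element is a single "nls" object
sapply(models, class)

# newdata is a data frame with a column named after the predictor (conc)
fitted_by_run <- Map(function(m, d) {
  cbind(d, predicted_density = predict(m, newdata = data.frame(conc = d$conc)))
}, models, by_run)

all_fitted <- do.call(rbind, fitted_by_run)
head(all_fitted)
```

Swapping predict() for predictNLS() inside the Map() call gives the interval version shown above, at the cost of the extra extraction step for its nested output.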

Extract Model for Specific Factor

Say I've fit a model as follows: fit = lm(Y ~ X + Dummy1 + Dummy2)
How can I extract the regression for a specific dummy variable?
I'm hoping to do something like the following to plot all the regressions:
plot(...)
abline(extracted.lm.dummy1)
abline(extracted.lm.dummy2)
I would look into the sjPlot package. Here is the documentation for sjp.lm, which can be used to visualize linear models in various ways. The package also has some nice tools for tabular summaries of models.
An example:
library(sjPlot)
library(dplyr)
# add a second categorical variable to the iris dataset
# then generate a linear model
set.seed(123)
fit <- iris %>%
mutate(Category = factor(sample(c("A", "B"), 150, replace = TRUE))) %>%
lm(Sepal.Length ~ Sepal.Width + Species + Category, data = .)
Different kinds of plot include:
Marginal effects plot, probably closest to what you want
sjp.lm(fit, type = "eff", vars = c("Category", "Species"))
"Forest plot" (beta coefficients + confidence interval)
sjp.lm(fit)
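For the abline() approach sketched in the question, base R is enough. Assuming the default treatment contrasts and a model without interactions, each dummy coefficient shifts the intercept while the slope is shared, so the line for one group is the reference intercept plus that group's coefficient. This sketch uses iris with Species as the categorical variable; the names b, intercepts and slope are illustrative.

```r
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
b <- coef(fit)

# Shared slope; group-specific intercepts (setosa is the reference level)
slope <- b[["Sepal.Width"]]
intercepts <- c(
  setosa     = b[["(Intercept)"]],
  versicolor = b[["(Intercept)"]] + b[["Speciesversicolor"]],
  virginica  = b[["(Intercept)"]] + b[["Speciesvirginica"]]
)

# Scatter plot coloured by group, then one regression line per group
plot(Sepal.Length ~ Sepal.Width, data = iris, col = as.integer(iris$Species))
for (i in seq_along(intercepts)) {
  abline(a = intercepts[[i]], b = slope, col = i)
}
```

If the model included an interaction between the continuous predictor and the dummy, the slope would also differ by group and the interaction coefficient would have to be added to it in the same way.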
