Accelerated Failure Time modelling: plotting survival probabilities with CIs (example provided) in R

As the proportional hazards assumption is violated in my real data, I am using an AFT model and trying to calculate adjusted survival probabilities for the study groups of interest. The example below uses the kidney data, and I tried to follow the ciTools vignette.
library(tidyverse)
library(ciTools)
library(here)
library(survival)
library(survminer)
# data
kidney
The model:
fit1 = survreg(Surv(time, censored) ~ sex + age + disease, data = kidney)
Call:
survreg(formula = Surv(time, censored) ~ sex + age + disease,
data = kidney)
Coefficients:
(Intercept) sexfemale age diseaseGN diseaseAN diseasePKD
8.41830937 -0.93959839 -0.01578812 -0.25274448 -0.38306425 -0.32830433
Scale= 1.642239
Loglik(model)= -122.1 Loglik(intercept only)= -122.7
Chisq= 1.33 on 5 degrees of freedom, p= 0.931
n= 76
Adding the probability of surviving at least 365 days for both sexes:
probs = ciTools::add_probs(kidney, fit1, q = 365,
                           name = c("prob", "lcb", "ucb"),
                           comparison = ">")
probs
Trying to plot the one-year survival probabilities for both sexes, I get multiple point estimates per sex in geom_point. It seems to me that a point estimate is produced for each covariate combination (for example, each age value). Can I edit the prediction so that it is made at the mean or median age?
probs %>% ggplot(aes(x = prob, y = sex)) +
  ggtitle("1-year survival probability") +
  xlim(c(0, 1)) +
  theme_bw() +
  geom_point(alpha = 0.5, colour = "red") +
  geom_linerange(aes(xmin = lcb, xmax = ucb), alpha = 0.5)
However, this approach does work with a simpler model:
fit2 = survreg(Surv(time, censored) ~ sex, data = kidney)
probs2 = ciTools::add_probs(kidney, fit2, q = 365,
                            name = c("prob", "lcb", "ucb"),
                            comparison = ">")
probs2 %>% ggplot(aes(x = prob, y = sex)) +
  ggtitle("1-year survival probability") +
  xlim(c(0, 1)) +
  theme_bw() +
  geom_point(alpha = 0.5, colour = "red") +
  geom_linerange(aes(xmin = lcb, xmax = ucb), alpha = 0.5)
Questions:
How can I get adjusted survival probabilities for both sexes? If that is not possible with this approach, what would the alternatives be? Code would help.
If I want adjusted survival probabilities for both sexes at different time points, should I just edit the q value in ciTools::add_probs() (for example, q = 30 for one month, q = 90 for three months, etc.), or should I run a separate model for each time period?

The way you have set up these models, the predictions are being returned for each individual in the kidney data set, based on the covariate values included in the model.
In your first model, you have included sex + age + disease so that you get a prediction for each combination of those 3 covariate values in your data set.
In the second model, you have only included sex as a predictor, so you only get predictions based on sex.
Survival model prediction functions allow you to specify a set of covariate values from which to predict in a new data frame. According to the manual for add_probs.survreg, you do so by specifying a new data frame with specified covariate values in the df argument to the function. You used the kidney data frame there, so you got predictions for all those cases.
I'm not familiar with ciTools::add_probs specifically, but such software typically will (without warning) accept your values for the covariates you specify and then use some type of "average" value for the covariates that you don't specify. As "averages" don't have much meaning for categorical covariates like disease, it's usually better to specify a complete useful set of values for all covariates yourself.
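As a minimal sketch (assuming sex and disease are factors, as the coefficient names in your fit1 output suggest), build a small prediction frame so that add_probs() returns one adjusted estimate per sex:
# One row per sex; age held at its median, disease at its reference level.
new_data <- expand.grid(
  sex     = c("male", "female"),
  age     = median(kidney$age),
  disease = "Other"   # the reference level implied by your fit1 output
)
probs_adj <- ciTools::add_probs(new_data, fit1, q = 365,
                                name = c("prob", "lcb", "ucb"),
                                comparison = ">")
probs_adj
This also addresses your second question: other time points only require changing q (q = 30, q = 90, ...) on the same fitted model; the parametric fit determines the whole survival curve, so no refitting is needed.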
The functions in the rms package in R often do a better job at providing useful predictions, as they choose typical rather than "average" values for unspecified covariates, based on an initial evaluation of the data set by a datadist() function whose output you then specify as a system option. The learning curve for this package is a bit steep but well worth it if you will be doing a lot of survival or other regression modeling.
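A hedged sketch of that workflow with rms (untested on your data; it is also worth double-checking that your censored variable is coded 1 = event, which is what Surv() expects):
library(rms)
dd <- datadist(kidney)
options(datadist = "dd")
fit_rms <- psm(Surv(time, censored) ~ sex + age + disease, data = kidney)
surv_fn <- Survival(fit_rms)  # S(t | linear predictor) for this parametric fit
# 1-year survival by sex, with age and disease held at datadist defaults:
Predict(fit_rms, sex, fun = function(lp) surv_fn(365, lp))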

Related

LMM on ChickWeight data

I would like to write a model with random intercepts and random slopes with respect to time. I am not sure if my code is correct.
model4<-lmer(weight~Time + Diet + Time*Diet + (1+Time|Chick), data = Data, REML = TRUE)
summary(model4)
Yes, that is the correct specification for those random effects. You can check this by fitting a similar model, temporarily removing the fixed effect of diet and the interaction between time and diet:
library(lme4)
library(ggplot2)
model4 <- lmer(weight ~ Time + (1 + Time | Chick), data = ChickWeight, REML = TRUE)
Column-bind the original data with the predictions from this simpler model, and select five random Chicks to plot:
weight_hat = predict(model4)
cw = cbind(ChickWeight, weight_hat)
random_chicks = sample(unique(cw$Chick), 5)
ggplot(cw[cw$Chick %in% random_chicks, ], aes(Time, color = Chick)) +
  geom_point(aes(y = weight), size = 2) +
  geom_line(aes(y = weight_hat), size = 1.5) +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 1))
You can see that the intercept and slope for each Chick differs.
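If you want to see those per-Chick deviations directly, ranef() returns each Chick's departure from the fixed intercept and Time slope:
# One row per Chick: the (Intercept) and Time columns are the random effects.
head(ranef(model4)$Chick)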

ROC for Logistic regression in R

I would like to ask for help with my project. My goal is to get a ROC curve from an existing logistic regression.
First of all, here is what I'm analyzing.
glm.fit <- glm(Severity_Binary ~ Side + State + Timezone + Temperature.F. + Wind_Chill.F. + Humidity... + Pressure.in. + Visibility.mi. + Wind_Direction + Wind_Speed.mph. + Precipitation.in. + Amenity + Bump + Crossing + Give_Way + Junction + No_Exit + Railway + Station + Stop + Traffic_Calming + Traffic_Signal + Sunrise_Sunset , data = train_data, family = binomial)
glm.probs <- predict(glm.fit,type = "response")
glm.probs = predict(glm.fit, newdata = test_data, type = "response")
glm.pred = ifelse(glm.probs > 0.5, "1", "0")
This part works fine; I am able to show a table of the predictions and the mean result. But here comes the problem: I'm using the pROC library, but I am open to anything else you can help me with. My test_data has approximately 975 rows, but the variable proc has only 3 sensitivity/specificity values.
library(pROC)
proc <- roc(test_data$Severity_Binary,glm.probs)
test_data$sens <- proc$sensitivities[1:975]
test_data$spec <- proc$specificities[1:975]
ggplot(test_data, aes(x=spec, y=sens)) + geom_line()
HereĀ“s what I have as a result:
With Warning message:
Removed 972 row(s) containing missing values (geom_path).
As I found out, proc has only 3 values as I said.
You can't (and shouldn't) assign the sensitivities and specificities to your data: they are summary statistics and live in a different dimension than your data.
Specifically, these two lines are wrong and make no sense:
test_data$sens <- proc$sensitivities[1:975]
test_data$spec <- proc$specificities[1:975]
Instead you must either save them to a new data.frame, or use some of the existing functions like ggroc:
ggroc(proc)
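If you prefer to build the plot yourself, store the ROC coordinates in their own data frame, one row per threshold rather than per observation:
library(ggplot2)
roc_df <- data.frame(sens = proc$sensitivities,
                     spec = proc$specificities)
ggplot(roc_df, aes(x = spec, y = sens)) +
  geom_line() +
  scale_x_reverse()  # ROC curves are conventionally drawn with specificity decreasing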
If you consider what the ROC curve does, there is no reason to expect it to have the same dimensions as your data frame. It provides summary statistics of your model's performance (sensitivity, specificity), evaluated on your dataset at different prediction thresholds.
Usually you would expect more nuance on the curve (more than the 3 data points at thresholds -Inf, 0.5, Inf). Look at the distribution of your glm.probs: this ROC curve indicates that nearly all predictions are either 0 or 1, with very little in between (hence the single threshold at 0.5 on your curve). It could also mean that you unintentionally used the binary glm.pred for calculating the ROC curve rather than glm.probs as shown in the question.
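A quick way to check is a histogram of the predicted probabilities:
# If nearly all mass sits at 0 and 1, the 3-point ROC curve comes from
# the model's predictions, not from the plotting code.
hist(glm.probs, breaks = 50, main = "Predicted probabilities")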
This seems to be more an issue with your model than with your code. Here is an example from a different dataset, using the same steps you took (glm(..., family = binomial), then predict(..., type = "response")). It produces a ROC curve with 333 steps for ~1300 data points.
PS: ignore the fact that this is evaluated on the training data; the point is that the code looks alright up to the point of generating the ROC curve.
# dftitanic is a local copy of the Titanic passenger data
m1 <- glm(survived ~ passengerClass + sex + age, data = dftitanic, family = binomial)
myroc <- roc(dftitanic$survived, predict(m1, dftitanic, type = "response"))
plot(myroc)

Shaded confidence interval bands for glm coefficients with covariates set to mean values

I would like to plot the fitted line and a shaded 95% confidence interval band (for example using polygon(), or with ggplot) from a glm model (family = binomial). For linear models (lm) I have previously been able to plot confidence intervals from the predictions, as they included the fit and the lower and upper levels, but I do not know how to do it here. I have tried predict.glm with se.fit = TRUE and then prediction +/- 1.96 * std.error to calculate the confidence intervals, but it did not work for me.
Thanks in advance for any help. You can find the data I used here (10 variables, 996 observations): https://drive.google.com/file/d/1Yu7Dk2eh0R1ztKiuNTtN_W5Yg4C2Ne-2/view?usp=sharing Code and figure below:
# Model
mod = glm(site ~ S + age + pH + soil + peat +
            spruce + I(spruce^2) + pine + birch +
            tsumma + I(tsumma^2),
          data = test.dat, family = binomial)
# Means of all covariates
means = apply(test.dat[, c("S", "pH", "soil", "spruce", "pine", "birch", "tsumma")],
              2, mean, na.rm = TRUE)
# Constant given by all other covariates at their means, assuming only pine on the plot
const = mod$coefficients[1] +
  mod$coefficients["S"] * means["S"] +
  mod$coefficients["pH"] * means["pH"] +
  mod$coefficients["soil"] * means["soil"] +
  mod$coefficients["spruce"] * means["spruce"] +
  mod$coefficients["I(spruce^2)"] * means["spruce"]^2 +
  mod$coefficients["pine"] * means["pine"] +
  mod$coefficients["birch"] * means["birch"] +
  mod$coefficients["tsumma"] * means["tsumma"] +
  mod$coefficients["I(tsumma^2)"] * means["tsumma"]^2
# Plot
age = seq(from = min(test.dat$age, na.rm = TRUE), to = 150, length = 100)
lin = const + mod$coefficients["age"] * age
Pr = exp(lin) / (exp(lin) + 1)
par(mar = c(4, 4, 1.5, 0.3))
plot(age, Pr, type = "l", ylim = c(0, 0.5), las = 1,
     main = "Probability of hotspot",
     ylab = "Probability of occurrence", xlab = "Forest age (years)")
You can use the sjPlot package, indicating the term to plot while holding the others constant:
library(sjPlot)
set.seed(888)
data = mtcars
data$vs = data$vs + rnorm(nrow(data))
mod = glm(am ~ disp + vs + carb + I(vs^2), data = data, family = "binomial")
plot_model(mod, type = "pred", terms = "disp")
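(Note that plot_model(type = "pred") holds the non-focal terms at typical values, continuous covariates at their means as far as I know, so it automates the "covariates at their means" construction from the question.)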
Or derive it as you did, except that I think you need to create an explicit column for the squared term, so that you can hold the other terms at their means, and then use predict():
data$vs2 = data$vs^2
mod = glm(am ~ disp + vs + carb + vs2, data = data, family = "binomial")
varMeans = colMeans(mod$model)[c("vs", "carb", "vs2")]
pred_disp = seq(min(data$disp), max(data$disp), length.out = 100)
df = data.frame(
  disp = pred_disp,
  t(replicate(length(pred_disp), varMeans))
)
pred = predict(mod, df, se = TRUE)  # on the link scale; plogis() maps back
plot(df$disp, plogis(pred$fit), type = "l")
lines(df$disp, plogis(pred$fit + 1.96 * pred$se.fit), col = "blue", lty = 2)
lines(df$disp, plogis(pred$fit - 1.96 * pred$se.fit), col = "blue", lty = 2)
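If you specifically want the shaded band from the question, polygon() works on the same predictions (this assumes mod, df and pred from the code above):
upper <- plogis(pred$fit + 1.96 * pred$se.fit)
lower <- plogis(pred$fit - 1.96 * pred$se.fit)
plot(df$disp, plogis(pred$fit), type = "l",
     ylim = range(lower, upper), xlab = "disp", ylab = "P(am = 1)")
# Overlay the band as a closed polygon: x values forward, then reversed.
polygon(c(df$disp, rev(df$disp)), c(upper, rev(lower)),
        col = adjustcolor("blue", alpha.f = 0.2), border = NA)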

Drawing 95% credible intervals for my Bayesian predictions along with points for the actual observed values of the response variable

The response variable in my dataset comprises observations Y[1], Y[2], ..., Y[49]. I came up with a Bayesian hierarchical model to make Bayesian predictions for Y[50]. I also have MCMC samples for Y[1], ..., Y[49], which I can use to assess the overall fit of the model by comparing them with the actual values of Y[1], Y[2], ..., Y[49].
Is there any way I can draw caterpillar plots of my Bayesian predictions from the MCMC object of the hierarchical model, together with points for the actual observed Y's from my original dataset, in R?
Thank you,
First you need to extract the credible intervals for each $Y_i$ (usually with the quantile function, if you're not working with a standard S3 object).
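For example, assuming draws is a matrix of MCMC samples with one column per Y[i] (e.g. from as.matrix() on your mcmc object):
q1       <- apply(draws, 2, quantile, probs = 0.025)  # lower 95% bound
q2       <- apply(draws, 2, quantile, probs = 0.975)  # upper 95% bound
estimate <- apply(draws, 2, median)                   # posterior point estimate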
Then you create the following df:
df <- data.frame(
  obs      = 1:49,
  lower    = q1,        # lower bound of each credible interval
  upper    = q2,        # upper bound
  estimate = estimate,  # posterior point estimate
  actual   = actual     # observed Y values
)
Then you go:
library(dplyr)
library(ggplot2)
df %>% ggplot(aes(x = obs)) +
  geom_line(aes(y = actual)) +
  geom_pointrange(aes(ymin = lower, ymax = upper, y = estimate)) +
  coord_flip()
If you're doing hierarchical models, I really recommend the rstanarm package, which is compatible with the tidybayes library (and tidybayes produces caterpillar plots almost automatically).
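An untested sketch of that workflow (the model and variable names here are illustrative; see the tidybayes vignettes for the exact conventions):
library(rstanarm)
library(tidybayes)
library(ggplot2)
fit <- stan_glmer(y ~ x + (1 | unit), data = mydata)
fit %>%
  spread_draws(b[term, group]) %>%  # tidy draws of the group-level effects
  median_qi() %>%                   # posterior medians with 95% intervals
  ggplot(aes(x = b, y = group, xmin = .lower, xmax = .upper)) +
  geom_pointinterval()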

Interpreting interactions in a regression model

A simple question I hope.
I have an experimental design where I measure some response (let's say blood pressure) from two groups: a control group and an affected group, where both are given three treatments: t1, t2, t3. The data are not paired in any sense.
Here is some example data:
set.seed(1)
df <- data.frame(
  response  = c(rnorm(5, 10, 1), rnorm(5, 10, 1), rnorm(5, 10, 1),
                rnorm(5, 7, 1),  rnorm(5, 5, 1),  rnorm(5, 10, 1)),
  group     = as.factor(c(rep("control", 15), rep("affected", 15))),
  treatment = as.factor(rep(c(rep("t1", 5), rep("t2", 5), rep("t3", 5)), 2))
)
What I am interested in is quantifying the effect each treatment has on the affected group relative to the control group. How would I model this, say with a linear model (for example lm in R)?
Am I wrong in thinking that:
lm(response ~ 0 + treatment * group, data = df)
which is equivalent to:
lm(response ~ 0 + treatment + group + treatment:group, data = df)
is not what I need? I think that in this model the treatment:group interaction terms are measured relative to the mean of the baseline group and baseline treatment.
I therefore thought that this model:
lm(response ~ 0 + treatment:group, data = df)
is what I need, but it quantifies every combination of treatment and group: treatmentt1:groupcontrol, treatmentt1:groupaffected, treatmentt2:groupcontrol, treatmentt2:groupaffected, treatmentt3:groupcontrol, treatmentt3:groupaffected.
So perhaps this model:
lm(response ~ 0 + treatment + treatment:group, data = df)
is the correct one?
Although, in addition to quantifying each treatment:groupaffected interaction term, it also quantifies the effect of each treatment, and I'm not sure what baseline each of the treatment:groupaffected interaction terms is compared against in this model.
Help would be appreciated.
Also, let's say I ran a fourth treatment that is actually the combination of two treatments, say t1+t3, where I don't know whether their combined effect is additive, subtractive, or synergistic. Is there any way this can be incorporated into the model?
The interaction term tells you that the difference between groups depends on treatment, that is, that the difference between affected and control is not the same for t1, t2 and t3.
I would keep the intercept in the model, though.
lm(response ~ group + treatment + group:treatment, data = df)
After getting a significant interaction term, I would use t-tests to investigate further and to help with interpretation.
As can be seen below, the interaction is driven by the larger effect of t2 relative to the others.
library(ggplot2)
set.seed(1)
df <- data.frame(
  response  = c(rnorm(5, 10, 1), rnorm(5, 10, 1), rnorm(5, 10, 1),
                rnorm(5, 7, 1),  rnorm(5, 5, 1),  rnorm(5, 10, 1)),
  group     = as.factor(c(rep("control", 15), rep("affected", 15))),
  treatment = as.factor(rep(c(rep("t1", 5), rep("t2", 5), rep("t3", 5)), 2))
)
# t-tests of the desired comparisons, giving differences and 95% confidence intervals
t.test(df$response[df$treatment == "t1"] ~ df$group[df$treatment == "t1"])
t.test(df$response[df$treatment == "t2"] ~ df$group[df$treatment == "t2"])
t.test(df$response[df$treatment == "t3"] ~ df$group[df$treatment == "t3"])
# collect the 95% confidence intervals for plotting
# (lower bounds go into lci, upper bounds into uci)
ci_plot <- data.frame(treatment = c("t1", "t2", "t3"), lci = NA, uci = NA)
ci_plot$lci <- c(t.test(df$response[df$treatment == "t1"] ~ df$group[df$treatment == "t1"])$conf.int[1],
                 t.test(df$response[df$treatment == "t2"] ~ df$group[df$treatment == "t2"])$conf.int[1],
                 t.test(df$response[df$treatment == "t3"] ~ df$group[df$treatment == "t3"])$conf.int[1])
ci_plot$uci <- c(t.test(df$response[df$treatment == "t1"] ~ df$group[df$treatment == "t1"])$conf.int[2],
                 t.test(df$response[df$treatment == "t2"] ~ df$group[df$treatment == "t2"])$conf.int[2],
                 t.test(df$response[df$treatment == "t3"] ~ df$group[df$treatment == "t3"])$conf.int[2])
ggplot(ci_plot, aes(x = treatment, y = uci)) +
  geom_errorbar(aes(ymin = lci, ymax = uci), width = 0.5) +
  xlab("Treatment") +
  ylab("Change in mean relative to control (95% C.I.)") +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(angle = 90, hjust = 1))
Your first specification is fine.
lm(response ~ 0 + treatment * group, data = df)
Call:
lm(formula = response ~ 0 + treatment * group, data = df)
Coefficients:
treatmentt1 treatmentt2 treatmentt3
7.460 5.081 9.651
groupcontrol treatmentt2:groupcontrol treatmentt3:groupcontrol
2.670 2.384 -2.283
With no intercept, each treatment coefficient is the mean response for that treatment in the baseline group level (here "affected"). So the first coefficient, 7.460, is the mean for participants treated with t1 and affected; the second, 5.081, is the mean for t2 and affected, and so on.
For a participant treated with t2 in the control group, the predicted mean is 5.081 + 2.670 + 2.384 (treatmentt2 + groupcontrol + treatmentt2:groupcontrol).
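You can verify this arithmetic against the raw cell means:
# Each cell mean should match the corresponding sum of coefficients.
aggregate(response ~ treatment + group, data = df, FUN = mean)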
If I were doing this analysis, I would keep the intercept.
Call:
lm(formula = response ~ treatment * group, data = df)
Coefficients:
(Intercept) treatmentt2 treatmentt3
7.460 -2.378 2.192
groupcontrol treatmentt2:groupcontrol treatmentt3:groupcontrol
2.670 2.384 -2.283
Now the second coefficient, going from left to right, is the effect of t2 relative to t1 within the affected group. To see this, notice that 7.460 - 2.378 = 5.081 (the second coefficient in the first specification, up to rounding). I like this approach because it makes the relative effects easier to interpret.
That all being said, @MrFlick is right: this is really a question for Cross Validated.
