Predict Weibull instant hazard - r

I am trying to model baseline hazard using different survival models.
For example I can model the cumulative baseline hazard using Cox as follows:
library(survival)
df <- aml
cox <- coxph(Surv(time, status) ~ 1, df)
cox.summary <- basehaz(cox, centered = TRUE)
plot(cox.summary)
However Cox is not ideal for this and I would like to model instant hazards rather than cumulative, so I am trying to use Weibull as follows:
weibull <- survreg(Surv(time, status) ~ 1, data=df, dist="weibull")
However I cannot work out how to predict the instant hazard from this. For example if I try the following code I just get a constant value (38.18681) at each time point.
predict(weibull, newdata=data.frame(time=df$time))
Where am I going wrong here ? In case I wasn't clear my aim is to visualize the instantaneous hazard in a plot of hazard vs time.

Related

Predictions from a Cox model with splines and confidence interval

I have to make a prediction tool from a Cox model.
Normally I could have used predictCox from the riskRegression package to get the predicted survival and confidence interval.
But for that I would have to upload my data with the tool. But I don't have the right to do that.
The alternative is to estimate the basis risk at a given instant, obtain the estimated coefficients to build a simple regression model which can be put online.
However, I'm not sure about the procedure and the packages I can use to get it.
I have read that some packages center linear predictors etc.
Also, I don't know how to get the confidence interval.
Can someone confirm me in a few words what would be the right procedure and the packages to use in this case?
library(survival)
library(riskRegression)
mfit <- coxph(Surv(futime, death) ~ sex +ns(age,2), data=mgus, x=T, y=T)
prediction_surv2 <- predictCox(mfit, newdata=newdata, times=365.25*5 , type = "survival",
se = TRUE, iid = TRUE, band = TRUE)
prediction_surv2
Thanks a lot !

Making sense of gbm survival prediction model

I am a newbie in using and making sense of ML methods and currently doing survival analysis using gbm package in R.
I have difficulty understanding some of the output of the survival prediction model. I have checked this tutorial and this post but still, find trouble in making sense of the outputted survival prediction model.
Here is my code for analysis based on example data:
rm(list=ls(all=TRUE))
library(randomForestSRC)
library(gbm)
library(survival)
library(Hmisc)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
set.seed(9741)
model <- gbm(Surv(days, status)~.,
data.train,
interaction.depth=2,
shrinkage=0.01,
n.trees=500,
distribution="coxph",
cv.folds = 5)
summary(model)
best.iter <- gbm.perf(model, plot.it = TRUE, method = 'cv',
overlay = TRUE) #to get the optimal number of Boosting iterations
best.iter
#Us the best number of tree to produce predicted values for each observation in newdata
# return a vector of prediction on n.trees indicting log hazard scale.f(x)
# By default the predictions are on log hazard scale for coxph
# proportional hazard model assumes h(t|x)=lambda(t)*exp(f(x)).
# estimate the f(x) component of the hazard function
pred.train <- predict(object=model, newdata=data.train, n.trees = best.iter)
pred.test <- predict(object=model, newdata=data.test, n.trees = best.iter)
#trainig set
Hmisc::rcorr.cens(-pred.train, Surv(data.train$days, data.train$status))
#val set
Hmisc::rcorr.cens(-pred.test, Surv(data.test$days, data.test$status))
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.train$days, #The survival times.
delta=data.train$status, #The censoring indicator
f.x=pred.train, #The predicted values of the regression model on the log hazard scale.
t.eval = data.train$days, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
basehaz.cum
#Estimation of survival rate of all:
surv.rate <- exp(-exp(pred.train)*basehaz.cum)
surv.rate
res_train <- data.train
# predicted outcome for train set
res_train$pred <- pred.train
res_train$survival_rate <- surv.rate
res_train
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.test$days, #The survival times.
delta=data.test$status, #The censoring indicator
f.x=pred.test, #The predicted values of the regression model on the log hazard scale.
t.eval = data.test$days, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
basehaz.cum
#Estimation of survival rate of all at specified time is:
surv.rate <- exp(-exp(pred.test)*basehaz.cum)
surv.rate
res_test <- data.test
# predicted outcome for test set
res_test$pred <- pred.test
res_test$survival_rate <- surv.rate
res_test
#--------------------------------------------------
#Estimate survival rate at time of interest
# Specify time of interest
time.interest <- sort(unique(data.train$days[data.train$status==1]))
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.train$days, #The survival times.
delta=data.train$status, #The censoring indicator
f.x=pred.train, #The predicted values of the regression model on the log hazard scale.
t.eval = time.interest, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
#For individual $i$ in test set, estimation of survival function is:
surf.i <- exp(-exp(pred.test[1])*basehaz.cum) #survival rate
#Estimation of survival rate of all at specified time is:
specif.time <- time.interest[10]
surv.rate <- exp(-exp(pred.test)*basehaz.cum[10])
cat("Survival Rate of all at time", specif.time, "\n")
print(surv.rate)
The output returned from the predict function represents the f(x) component of the hazard function ( h(t|x)=lambda(t)*exp(f(x)) ).
My questions:
• A bit confused about whether hazard ratios can be calculated here?
• Wondering how can I divide the population into low-risk and high-risk groups? Can I rely on the estimated f(x) component of the hazard function to do the scoring system for the training set? I aim from this to have a scoring system where I show KM plots for low and high-risk groups for training and test sets.
• How can I construct calibration curve plots where I can plot observed survival vs. predicted survival for the training set and test set?
Amer. Thx for your reading of my tutorial!
As you mentioned that "The output returned from the predict function represents the f(x) component of the hazard function ( h(t|x)=lambda(t)*exp(f(x)) )", maybe we need to understand the hazard function, i.e. h(t|x).
Before this, please sure that you have the basic knowledge of survival analysis. if not, it's recommended to read the great post. I think the post would help you solve the questions.
Back to your questions:
Exactly, we can get the hazard ratios of log scale by invoking the predict function. Therefore, the hazard ratio can be calculated by exp() .
Sure! Relying on the values of hazard ratio, we can divide the population into low-risk and high-risk groups. Alternatively, you can use the median of hazard ratios as the cutoff value. I think the cutoff value should be derived from the training set, and then test in the test set. If your model is effective, KM plots for low and high-risk groups would have a significant difference (measured by log-rank test statistically).
Calibration curve plots are often used to evaluated the performance of model that outputs probabilities or likelihoods ranged from [0.0, 1.0]. We can calculate the survival function, and then specify a time point of interest, e.g. 5-Year. At last, we compare the survival probabilities with the actual survival state at the specified time, which is just the same as we do evaluating a binary classification model. More details of obtaining survival function can refer to my tutorial, and the principles can be found in that post aforementioned.

Predicted Survival Curves using Corrected Group Prognosis Method

How can I plot predicted survival curves of a continuous covariate (let's say 20th and 80th percentile of the value) using the corrected group prognosis method as implemented in R by Therneau
For example,
library(survival)
library(survminer)
fit <- coxph( Surv(stop, event) ~ size + strata(rx), data = bladder )
ggadjustedcurves(fit, data=bladder, method = "conditional", strata=rx)
Now, this is useful because I am given two survival curves that are stratified by rx (either 0 or 1) and the conditional method is being acted upon the bladder data set. However, let's say I would like to use the marginal method but not stratify and instead plot my continuous covariate at 20th and 80th value but also re-balance the subpopulation. Would like any step in the right direction.
To re-state, I have a Cox model with continuous predictors. I would like to build a Cox model but not stratify on rx but have this in the model. Then, I want to pass the created Cox object into ggadjustedcurves() function with uses "subpopulation re-balancing" when given a reference data set. And then, instead of showing two survival curves stratified on a categorical variable, I want to plot two representative survival curves at the 20th and 80th percentile.
EDIT
My first attempt
fit2 <- coxph( Surv(stop, event) ~ size + rx, data = bladder ) #remove strata
fit2
# CGP
pred<- data.frame("rx" = 1, "size" = 3.2)
ggadjustedcurves(fit2, data = pred , method = "conditional", reference = bladder)
Is this what I think it is? Conditional re-balancing has been applied to the reference data set and then the predicted curves are generated for an individual with rx=1 and size of 3.2.
It is difficult to understand what you are truly looking for, but I think I have a rough idea. I think you want to plot the survival curve that would have been observed if every person in your sample had received a specific value for the continuous covariate. If there is no confounding, you can simply use a Cox model that includes only the continuous covariate and use the predict() function for a range of points in time and plot the results. If you need to adjust for confounding, you can include the confounders in the Cox model and use g-computation to obtain the desired probabilities. I describe this in a recent preprint: https://arxiv.org/pdf/2208.04644.pdf
This can be done in R using the contsurvplot package (also developed by me). First, install the package using:
devtools::install_github("RobinDenz1/contsurvplot")
Afterwards, fit your Cox model, but use x=TRUE in the coxph call:
library(survival)
library(contsurvplot)
library(riskRegression)
library(ggplot2)
fit2 <- coxph(Surv(stop, event) ~ size + rx, data=bladder, x=TRUE)
You can now call the plot_surv_lines function to obtain the causal survival curves for specific values of size, given the model. Using the horizon argument you can tell the function for which values you want to plot the survival curves. I choose the 20% and 80% quantile of size as you described:
plot_surv_lines(time="stop",
status="event",
variable="size",
data=bladder,
model=fit2,
horizon=quantile(bladder$size, probs=c(0.2, 0.8)))
The package contains a lot more plotting routines to visualize the causal effect of a continuous variable on a time-to-event outcome that might be more suitable for what you actually want.

Calculate the Survival prediction using Cox Proportional Hazard model in R

I'm trying to calculate the Survival prediction using Cox Proportional Hazard model in R.
library(survival)
data(lung)
model<-coxph(Surv(time,status ==2)~age + sex + ph.karno + wt.loss, data=lung)
predict(model, data=lung, type ="expected")
When I use the above code, I get the Cumulative hazard's prediction corresponding to the formula
h^i(t)=h^0(t)exp(x′iβ^)
But my concern is all about predicting the Survival corresponding to the formula,
S^i(t)=S^0(t)exp(x′iβ^)
How do I predict the Survival in R?
Thanks in Advance.
You can use either predict or survfit. With predict you need to give the newdata argument a list with values for all the variables in the model:
predict(model,
newdata=list(time=100,status=1,age=60,sex=1, ph.karno=60,wt.loss=15),
type ="expected")
[1] 0.2007497
There's a plot method for survfit objects:
?survreg
png(); plot(survfit(model)); dev.off()

Survival Analysis for Telecom Churn using R

I am working on Telecom Churn problem and here is my dataset.
http://www.sgi.com/tech/mlc/db/churn.data
Names - http://www.sgi.com/tech/mlc/db/churn.names
I'm new to survival analysis.Given the training data,my idea to build a survival model to estimate the survival time along with predicting churn/non churn on test data based on the independent factors.Could anyone help me with the code or pointers on how to go about this problem.
To be precise,say my train data has got
customer call usage details,plan details,tenure of his account etc and whether did he churn or not.
Using general classification models,I can predict churn or not on test data.Now using Survival analysis,I want to predict the tenure of the survival in test data.
Thanks,
Maddy
If you're still interested (or for the benefit of those coming later), I've written a few guides specifically for conducting survival analysis on customer churn data using R. They cover a bunch of different analytical techniques, all with sample data and R code.
Basic survival analysis: http://daynebatten.com/2015/02/customer-churn-survival-analysis/
Basic cox regression: http://daynebatten.com/2015/02/customer-churn-cox-regression/
Time-dependent covariates in cox regression: http://daynebatten.com/2015/12/survival-analysis-customer-churn-time-varying-covariates/
Time-dependent coefficients in cox regression: http://daynebatten.com/2016/01/customer-churn-time-dependent-coefficients/
Restricted mean survival time (quantify the impact of churn in dollar terms): http://daynebatten.com/2015/03/customer-churn-restricted-mean-survival-time/
Pseudo-observations (quantify dollar gain/loss associated with the churn effects of variables): http://daynebatten.com/2015/03/customer-churn-pseudo-observations/
Please forgive the goofy images.
Here is some code to get you started:
First, read the data
nm <- read.csv("http://www.sgi.com/tech/mlc/db/churn.names",
skip=4, colClasses=c("character", "NULL"), header=FALSE, sep=":")[[1]]
dat <- read.csv("http://www.sgi.com/tech/mlc/db/churn.data", header=FALSE, col.names=c(nm, "Churn"))
Use Surv() to set up a survival object for modeling
library(survival)
s <- with(dat, Surv(account.length, as.numeric(Churn)))
Fit a cox proportional hazards model and plot the result
model <- coxph(s ~ total.day.charge + number.customer.service.calls, data=dat[, -4])
summary(model)
plot(survfit(model))
Add a stratum:
model <- coxph(s ~ total.day.charge + strata(number.customer.service.calls <= 3), data=dat[, -4])
summary(model)
plot(survfit(model), col=c("blue", "red"))

Resources