I am trying to predict and plot the (estimated) survival curve for a new observation in R. Using the "survival" library and the "lung" data set, I first fit a Cox proportional hazards model to the data. Then I tried to predict and plot the survival curve for a hypothetical new observation (I entered the details for this observation in the "list" call). However, this is not working.
I have attached my code below:
#load library
library(survival)
data(lung)
#create survival object
s <- with(lung,Surv(time,status))
#create model
modelA <- coxph(s ~ as.factor(sex)+age+ph.ecog+wt.loss+ph.karno,data=lung, model=TRUE)
summary(modelA)
#plot
plot(survfit(modelA), ylab="Probability of Survival",
xlab="Time", col=c("red", "black", "black"))
#predict for a hypothetical NEW observation (here is where the error is)
lines(predict(modelA, newdata=list(sex=1,
age = 56,
ph.ecog = 1,
ph.karno = 50,
wt.loss = 11),
type="quantile",
p=seq(.01,.99,by=.01)),
seq(.99,.01,by=-.01),
col="blue")
## Error in match.arg(type) :
## 'arg' should be one of “lp”, “risk”, “expected”, “terms”, “survival”
Does anyone know what I am doing wrong? Thanks
This is what the survfit function is for. In your example, you plot the survfit for the model, but you can feed a newdata argument into this function and it will produce the estimated survival for these data.
If we reproduce your example:
library(survival)
s <- with(lung, Surv(time, status))
modelA <- coxph(s ~ as.factor(sex) + age + ph.ecog + wt.loss + ph.karno,
data = lung, model = TRUE)
plot(survfit(modelA), ylab = "Probability of Survival",
xlab = "Time", col = c("red", "black", "black"))
Then we can create a survival curve given your specified covariates like this:
est <- survfit(modelA, newdata = data.frame(sex = 1,
age = 56,
ph.ecog = 1,
ph.karno = 50,
wt.loss = 11))
Now est is an S3 object (of class survfit) with members that include time and surv, so we can plot a blue line tracking the estimated survival of individuals with the given covariates like this:
lines(est$time, est$surv, col = 'blue', type = 's')
Or plot it on its own with a 95% confidence interval:
plot(est, ylab = "Probability of Survival",
xlab = "Time", col = c("red", "black", "black"))
Created on 2022-05-26 by the reprex package (v2.0.1)
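If you need numbers rather than a plot, the same kind of survfit object can be summarised at chosen follow-up times. The following is a minimal sketch, not part of the original answer: the names cox_fit, new_obs, est_new and surv_table are made up for illustration, the time points 180 and 365 days are arbitrary, and sex is entered directly rather than through as.factor() to keep the example short.

```r
library(survival)

# Cox model on the built-in lung data (sex entered as a plain 1/2 covariate)
cox_fit <- coxph(Surv(time, status) ~ sex + age + ph.ecog + wt.loss + ph.karno,
                 data = lung)

# Hypothetical new observation, same covariate values as in the question
new_obs <- data.frame(sex = 1, age = 56, ph.ecog = 1, ph.karno = 50, wt.loss = 11)

# Estimated survival curve for that observation
est_new <- survfit(cox_fit, newdata = new_obs)

# Survival probability and 95% confidence limits at 180 and 365 days
at_times <- summary(est_new, times = c(180, 365))
surv_table <- data.frame(time  = at_times$time,
                         surv  = as.numeric(at_times$surv),
                         lower = as.numeric(at_times$lower),
                         upper = as.numeric(at_times$upper))
print(surv_table)
```

Each row of surv_table gives the estimated probability of surviving beyond that time for the specified covariates.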
See the description of the predict() function (you can open it in R help by running ?predict.coxph, or here for example):
type - the type of predicted value. Choices are the linear predictor
("lp"), the risk score exp(lp) ("risk"), the expected number of events
given the covariates and follow-up time ("expected"), and the terms of
the linear predictor ("terms"). The survival probability for a subject
is equal to exp(-expected).
You can see that your type="quantile" does not match any of the expected values. If you call predict() without the type argument, it defaults to "lp" (the linear predictor).
When you call predict() on your modelA object, R determines that it is of class coxph, so predict.coxph() is dispatched. The arguments type="quantile" and p=seq(.01,.99,by=.01) are not accepted by predict.coxph() (p is ignored, and type raises an error). They belong to another method, predict.survreg(). For that method to be called, modelA must be of class survreg, i.e. it must be created with survreg() instead of coxph().
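To illustrate the difference, here is a rough sketch of the quantile approach with a survreg() fit. Note that this is a parametric (Weibull by default) model, not the Cox model from the question, so the resulting curve will differ; the names wfit, new_obs and q_time are made up for this example.

```r
library(survival)

# Parametric survival model; survreg() uses a Weibull distribution by default
wfit <- survreg(Surv(time, status) ~ sex + age + ph.ecog + wt.loss + ph.karno,
                data = lung)

# Hypothetical new observation, same covariate values as in the question
new_obs <- data.frame(sex = 1, age = 56, ph.ecog = 1, ph.karno = 50, wt.loss = 11)

# Predicted survival-time quantiles: q_time[i] is the time by which a
# fraction p[i] of such subjects is expected to have had the event
p <- seq(.01, .99, by = .01)
q_time <- as.numeric(predict(wfit, newdata = new_obs, type = "quantile", p = p))

# Survival at the p-th quantile is 1 - p, which gives the survival curve
plot(q_time, 1 - p, type = "l", col = "blue",
     xlab = "Time", ylab = "Probability of Survival")
```

Here type="quantile" and p are accepted because predict() dispatches to predict.survreg().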
Related
For my homework, I am working with a dataset titled Default. I split my data into training and test sets, and ran a logistic regression for the relationship of default1 and the other 3 predictors(income (continuous), balance(continuous), student(0/1)).
I am supposed to plot the regression model, but it keeps showing a straight horizontal line on the graph and I don't think that's correct.
How can I graph multiple predictors with a singular binary outcome using my Default_train_logistic glm?
Also, how can I obtain those coefficients and error rates of the model?
TIA!
set.seed(1234)
Default$subsample <- runif(nrow(Default))
Default$test <- ifelse(Default$subsample < 0.80, "train", "test")
Default_train <- filter(Default, test == "train")
Default_test <- filter(Default, test == "test")
###Q1 Part B: b. Construct a logistic regression to predict if an individual will default based on all of the provided predictors, and visualize your final predicted model.
#Immediately after loading data, I created default1 to use default as a numerical binary variable for logistic regression.
Default_train_logistic <- glm(default1 ~ ., data = Default_train %>% select(-test), family = "binomial")
summary(Default_train_logistic)
plot(Default_train_logistic)
G1 <- ggplot(Default_train_logistic, aes(balance + income + student1, default1)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE)
print(G1)
I am using R. I am following this tutorial over here (https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/ ) and I am trying to adapt the code for a similar problem.
In this tutorial, a statistical model is developed on a dataset, and this model is then used to predict 3 new observations. We then plot the results for these 3 observations:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a = na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new = a[1:3,]
#create a training set by removing first three rows
a = a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time,status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <-data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients
#each row of pred is one new patient; unlist() turns the row into a numeric vector
plot(r_fit$unique.death.times, unlist(pred[1,]), type = "l", col = "red", ylim = c(0, 1))
lines(r_fit$unique.death.times, unlist(pred[2,]), type = "l", col = "green")
lines(r_fit$unique.death.times, unlist(pred[3,]), type = "l", col = "blue")
From here, I would like to add confidence intervals (confidence regions) to each of these 3 curves.
I found a previous Stack Overflow post ("survfit() Shade 95% confidence interval survival plot") that shows how to do something similar, but I am not sure how to extend the results from that post to each individual observation.
Does anyone know if there is a direct way to add these confidence intervals?
Thanks
If you create your plot using ggplot, you can use the geom_ribbon function to draw confidence intervals as follows:
ggplot(data=...)+
geom_line(aes(x=..., y=...),color=...)+
geom_ribbon(aes(x=.. ,ymin =.., ymax =..), fill=.. , alpha =.. )+
geom_line(aes(x=..., y=...),color=...)+
geom_ribbon(aes(x=.. ,ymin =.., ymax =..), fill=.. , alpha =.. )
You can put + after geom_line and repeat the same steps for each observation.
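As a concrete illustration of the geom_ribbon pattern, here is a minimal sketch. It deliberately swaps in a Cox model instead of the random survival forest from the question, because survfit() on a coxph fit returns lower and upper confidence limits directly. The names train, cox_fit, est and curves are made up for this example.

```r
library(survival)
library(ggplot2)

# Same split as in the question: first three complete rows are the "new" patients
a <- na.omit(lung)
new <- a[1:3, ]
train <- a[-(1:3), ]

# Cox model (a stand-in for the random forest, chosen because it provides CIs)
cox_fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog + ph.karno + wt.loss,
                 data = train)

# One survival curve per row of new; surv, lower and upper are
# matrices with one column per patient
est <- survfit(cox_fit, newdata = new)

# Reshape to long format: one row per (time, patient) pair
curves <- data.frame(
  time    = rep(est$time, times = ncol(est$surv)),
  surv    = as.vector(est$surv),
  lower   = as.vector(est$lower),
  upper   = as.vector(est$upper),
  patient = factor(rep(seq_len(ncol(est$surv)), each = length(est$time)))
)

# Shaded 95% bands plus step curves, one colour per patient
p <- ggplot(curves, aes(x = time, y = surv, colour = patient, fill = patient)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, colour = NA) +
  geom_step()
print(p)
```

Mapping colour and fill to a grouping column avoids repeating geom_line and geom_ribbon once per observation. The ribbon interpolates linearly between time points, so it only approximates the step-shaped band.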
You can also check the question "Having trouble plotting multiple data sets and their confidence intervals on the same GGplot. Data Frame included" and
https://bookdown.org/ripberjt/labbook/appendix-guide-to-data-visualization.html
I have set up a logistic regression model in R and successfully plotted the points of the model to show a relationship in the dataset. I am having trouble showing the line graph of the prediction. The model predicts readmission rates of a hospital based on the length of the initial stay (in days). Here is my code:
mydata <- read.csv(file = 'C:\\Users\\nickg\\Downloads\\3kfid8emf9rkc9ek30sf\\medical_clean.csv', header=TRUE)[,c("Initial_days","ReAdmis")]
head(mydata)
mydata$ReAdmis.f <- factor(mydata$ReAdmis)
logfit <- glm(mydata$ReAdmis.f ~ mydata$Initial_days, data = mydata, family = binomial)
summary(logfit)
range(mydata$Initial_days)
xweight <- seq(0, 79.992, .008)
yweight <- predict(logfit, list(xweight), type = "response")
plot(mydata$Initial_days, mydata$ReAdmis.f, pch = 16, xlab = "Initial Days", ylab = "ReAdmission Y/N")
lines(xweight, yweight)
As you can see I have the model set up and ranges described by xweight and yweight, but nothing shows up for the line.
Always use curve for this:
plot(ReAdmis.f ~ Initial_days, data = mydata,
pch = 16, xlab = "Initial Days", ylab = "ReAdmission Y/N")
curve(predict(logfit, newdata = data.frame(Initial_days = x),
#x is created by the curve function based on the plot's x limits
#note that newdata must contain the x variable with exactly the same name as in the original data
type = "response"),
add = TRUE)
However, the issue here could be that your y variable is a factor variable (internally that's values of 1 and 2 if you have two levels) whereas logistic regression predictions are always in the interval [0, 1]. You should convert ReAdmis.f into 0/1 integer values before running the code.
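Putting both points together, a self-contained sketch might look like the following. The data are simulated purely for illustration (the original CSV is not available), and the names sim, ReAdmis01, logfit_sim and grid_pred are hypothetical. The model is fitted with bare column names in the formula, which is what allows newdata to be matched by name.

```r
set.seed(1)

# Simulated stand-in for the hospital data: stay length and a 0/1 outcome
sim <- data.frame(Initial_days = runif(200, 0, 80))
sim$ReAdmis01 <- rbinom(200, 1, plogis(-4 + 0.1 * sim$Initial_days))

# Use bare column names in the formula (not sim$...), so that predict()
# can find Initial_days in newdata
logfit_sim <- glm(ReAdmis01 ~ Initial_days, data = sim, family = binomial)

# Predicted probabilities on a grid, to check that they lie in [0, 1]
grid_pred <- predict(logfit_sim,
                     newdata = data.frame(Initial_days = seq(0, 80, by = 1)),
                     type = "response")

# Points plus the fitted probability curve
plot(ReAdmis01 ~ Initial_days, data = sim,
     pch = 16, xlab = "Initial Days", ylab = "ReAdmission (0/1)")
curve(predict(logfit_sim, newdata = data.frame(Initial_days = x),
              type = "response"),
      add = TRUE)
```

With a formula written as mydata$ReAdmis.f ~ mydata$Initial_days, the predictor is stored under the name mydata$Initial_days, so predict() cannot match it to a column called Initial_days in newdata.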
I want to add an at-risk table and 95% confidence intervals to adjusted survival curves, something like the survival curves in the linked example. I saw some relevant code here, but it doesn't cover what I need.
Another question: would it be correct to use the weight variable obtained from the iptw package in the adjustment? I have a column named weight in my actual dataset.
Here is my code:
library(survival);library(survminer)
data(lung);names(lung)
#fit <- coxph( Surv(time, status==2) ~ ph.karno + strata(sex), data = lung )
lung$sex <- ifelse(lung$sex == 1, "Male", "Female")
fit <- coxph(Surv(time, status) ~ ph.ecog + age +strata(sex), data = lung)
ggadjustedcurves(fit,
variable = "sex",
data = lung,
method = "average",
palette = c("#E69F00", "#56B4E9"),
size = 1.3,
legend = "right",
legend.title = expression(bold("Legend title")),
xlab = "Time",font.legend = 12) + theme(legend.text.align = 0.5)
Short answer: Currently there is no R package that allows you to directly plot confounder-adjusted survival curves with confidence intervals. However, there is a function called ate in the riskRegression R package that can be used to calculate adjusted survival probability estimates at specific points in time, with 95% confidence intervals. With some additional coding you can definitely use it to get what you want.
Alternatively you can wait a little longer. I am currently working on an R-Package that implements various methods of confounder-adjusted survival curves (and cumulative incidence functions when there are competing risks) complete with confidence intervals, hypothesis tests, risk tables and so on.
UPDATE (09.05.2022):
The adjustedCurves package can now be used to obtain adjusted survival curves with confidence intervals. For your example:
library(survival)
library(devtools)
# install adjustedCurves from github, load it
devtools::install_github("RobinDenz1/adjustedCurves")
library(adjustedCurves)
# fit required model
lung$sex <- as.factor(ifelse(lung$sex == 1, "Male", "Female"))
lung$status <- lung$status - 1
fit <- coxph(Surv(time, status) ~ ph.ecog + age + strata(sex), data=lung,
x=TRUE)
# calculate and plot curves
adj <- adjustedsurv(data=lung, variable="sex", ev_time="time",
event="status", method="direct",
outcome_model=fit, conf_int=TRUE)
plot(adj, conf_int=TRUE)
I have an issue with creating a ROC curve for my survival tree created by the rpart package. My goal is to evaluate my survival tree through the area under the curve (AUC) of the ROC curve. I have tried many ways to plot a ROC curve but failed. How should I approach plotting the ROC curve?
Here is the R code I have so far:
library(survival)
library("rpart")
library("partykit")
library(rattle)
library(rpart.plot)
temp = coxph(Surv(pgtime, pgstat) ~ age+eet+g2+grade+gleason+ploidy, stagec)
newtime = predict(temp, type = 'expected')
fit <- rpart(Surv(pgtime, pgstat) ~ age+eet+g2+grade+gleason+ploidy, data = stagec)
fancyRpartPlot(fit)
tfit <- as.party(fit) #Transfer "rpart" to "party"
predtree<-predict(tfit,newdata=stagec,type="prob") #Prediction
Here is the R code I have tried so far:
1.
library("ROCR")
predROCR <- prediction(predict(tfit, newdata = stagec, type = "prob")[, 2],labels=Surv(stagec$pgtime, stagec$pgstat))
Error in predict(tfit, newdata = stagec, type = "prob")[, 2] :
incorrect number of dimensions
It doesn't work. I checked the result of predict() and found that it is a 'Survival' object (code and results are shown below). I guess this method fails because it is not suitable for a 'Survival' object?
predict(tfit, newdata = stagec, type = "prob")[[1]]
Call: survfit(formula = y ~ 1, weights = w, subset = w > 0)
n events median 0.95LCL 0.95UCL
33 1 NA NA NA
I then tried to derive the survival function value of each terminal node and use these values to draw the ROC curve. Is this correct? The ROC curve drawn in this way seems to treat the predicted classification results as continuous variables rather than categorical variables.
Here's the R code I tried:
tree2 = fit
tree2$frame$yval = as.numeric(rownames(tree2$frame))
#Get the survival function value of each sample
Surv_value = data.frame(predict(tree2, newdata=stagec,type = "matrix"))[,1]
Out=data.frame()
Out=cbind(stagec,Surv_value)
#ROC
library(survivalROC)
roc=survivalROC(Stime=Out$pgtime, status=Out$pgstat, marker = Out$Surv_value, predict.time =5, method="KM")
roc$AUC #Get the AUC of ROC plot
#Plot ROC
aucText=c()
par(oma=c(0.5,1,0,1),font.lab=1.5,font.axis=1.5)
plot(roc$FP, roc$TP, type="l", xlim=c(0,1), ylim=c(0,1),col="#f8766d",
xlab="False positive rate", ylab="True positive rate",
lwd = 2, cex.main=1.3, cex.lab=1.2, cex.axis=1.2, font=1.2)
aucText=c(aucText,paste0("498"," (AUC=",sprintf("%.3f",roc$AUC),")"))
legend("bottomright", aucText,lwd=2,bty="n",col=c("#f8766d","#00bfc4","blue","green"))
abline(0,1)