I am befuddled by the format to perform a simple prediction using R's survival package
library(survival)
lung.surv <- survfit(Surv(time,status) ~ 1, data = lung)
So fitting a simple exponential regression (for example purposes only) is:
lung.reg <- survreg(Surv(time,status) ~ 1, data = lung, dist="exponential")
How would I predict the percent survival at time=400?
When I use the following:
myPredict400 <- predict(lung.reg, newdata=data.frame(time=400), type="response")
I get the following:
myPredict400
1
421.7758
I was expecting something like 37% so I am missing something pretty obvious
The point with this survival function is to find an empirical distribution that fits the survival times. Essentially you are associating a survival time with a probability. Once you have that distribution, you can pick out the survival rate for a given time.
Try this:
library(survival)
lung.reg <- survreg(Surv(time,status) ~ 1, data = lung) # because you want a distribution
pct <- 1:99/100 # this creates the empirical survival probabilities
myPredict400 <- predict(lung.reg, newdata=data.frame(time=400),type='quantile', p=pct)
indx = which(abs(myPredict400 - 400) == min(abs(myPredict400 - 400))) # find the closest survival time to 400
print(1 - pct[indx]) # 0.39
Straight from the help docs, here's a plot of it:
matplot(myPredict400, 1-pct, xlab="Months", ylab="Survival", type='l', lty=c(1,2,2), col=1)
Edited
You're basically fitting a regression to a distribution of probabilities (hence 1...99 out of 100). If you make it go to 100, then the last value of your prediction is inf because the survival rate in the 100th percentile is infinite. This is what the quantile and pct arguments do.
For example, setting pct = 1:999/1000 you get much more precise values for the prediction (myPredict400). Also, if you set pct to be some value that's not a proper probability (i.e. less than 0 or more than 1) you'll get an error. I suggest you play with these values and see how they impact your survival rates.
Related
I would like to
obtain the predicted time of the event, given a set of covariates
obtain the time at which the risk is equal to my specified threshold,
given covariates obtain the risk, given time and covariates
All this using ic_par (parametric) or ic_npar (non-parametric) or ic_sp (semi-parametric) models (not bayesian models) from icenReg
There are 3 functions in icenReg (https://cran.r-project.org/web/packages/icenReg/icenReg.pdf) that I believe do at least two of those things:
sampleSurv
getFitEsts
getSCurves
Can someone explain what those three functions do? Especially the difference between sampleSurv and getFitEsts?
From what I understand, the time to the event is modelled as a probability curve.
So you do not obtain a defined predicted time to the event, but rather a probability of this event occuring through time.
Thus, you can obtain the probability of the event to occur after X days, or you can obtain the time at which the event has a probability of X % to have occured.
getFitEsts() will provide these 2 estimates from an object previously fitted by ic_sp(), ic_par() or ic_bayes()
Here is an example of how to obtain these estimates, with an example from icenReg package :
data("IR_diabetes")
flatPrior_fit <- ic_bayes(cbind(left, right) ~ gender, data = IR_diabetes, model = "po", dist = "gamma")
newdata <- data.frame(gender = c(unique(IR_diabetes$gender)))
rownames(newdata) <- c(as.character(unique(IR_diabetes$gender)))
# plot the survival probability curve
plot(flatPrior_fit)
# plot the same curve according to each factor
plot(flatPrior_fit,newdata)
maleCovs <- data.frame(gender = c("male"))
femaleCovs <- data.frame(gender = c("female"))
# median survival time as calculated by the model if males and females are considered together
# = 50 % probability of the event occurring
getFitEsts(flatPrior_fit , p = 0.5)
# median survival time for males
getFitEsts(flatPrior_fit, newdata = maleCovs, p = 0.5)
# median survival time for females
getFitEsts(flatPrior_fit, newdata = femaleCovs, p = 0.5)
# Probability that males died at day 15 ( = 1 - probability that they survived )
getFitEsts(flatPrior_fit, newdata = maleCovs, q = 15)
# Probability that males died at day 15 ( = 1 - probability that they survived )
getFitEsts(flatPrior_fit, newdata = femaleCovs, q = 15)
getScurves() work only for semi parametric models.
It allows to obtain the interval of the survival probability for each time step, as plotted on the curve :
data("IR_diabetes")
# fit a semi parametric model (proportional odds)
sp_fit <- ic_sp(cbind(left, right) ~ gender, data = IR_diabetes, model = "po")
# plot the survival curve
plot(sp_fit, newdata)
# obtain the intervals and associated survival probability of this survival curve for each time step
getSCurves(sp_fit,newdata)
Finally, sampleSurv() draw samples from the probability curve you fitted, contained between the intervals computed, according to the quantile you need. These results are variable because there is multiple possibilities between these intervals.
I hope it helped a bit to understand these functions
I encountered some issues when calculating restricted mean survival time (RMST) in R and I made some attempts.
Here is the idea that I tried to calculate the RMST by myself.
i) I fitted a cox regression model to get estimated function of h(t), and I deploy individual covariables to calculate individual h(t);
ii) I derived individual survival curve S(t) by the above individual h(t);
iii) I then calculated individual RMST by the above individual S(t) with the following formula: RMST = integrate(S(t)) by 0 to tau. (I don't know how to put a formal formula here and I am sure you can understand what I am saying).
I have tried the above method to calculate individual RMST with the following R code:
# load R package
library(survRM2)
library(survival)
# generate example
D <- rmst2.sample.data()
time <- D$time
status <- D$status
x <- D[,c(4,6,7)]
# fit cox regression model with weibull baseline
fit<-survreg(Surv(time,status)~ x[[1]] + x[[2]]+ x[[3]],data = D,dist = "weibull")
# get cox regression coefficients of covariables
beta=fit$coefficients
# get paramaters within baseline hazard
gamma.weibull=fit$scale
# cutomize a function to calculate individual hazard
hazard <- function(u,x1,x2,x3) {
gamma.weibull*u^(gamma.weibull-1)*exp(beta[1]+beta[2]*x1+beta[3]*x2+beta[4]*x3)
}
# cutomize a function to calculate individual survival
surv <-function(t,x1,x2,x3) {
sapply(t,function(z){
exp(-integrate(hazard,lower=0,upper=z,x1=x1,x2=x2,x3=x3)$value)
}
)
}
rmst <- c() # genrate a empty vector
for(i in 1:312) { # 312 is the sample size
rmst[i]=integrate(surv,0,5,x1=x[[1]][i],x2=x[[2]][i],x3=x[[3]][i])$value
}
# Error in integrate(surv, 0, 5, x1 = x[[1]][i], x2 = x[[2]][i], x3 = x[[3]][i]) :
# the integral is probably divergent
I have three questions:
1) Is there anything wrong about my idea or computational process?
2) In the step iii), there are some cases that integrals are non-integrable (that is, integrals do not converge). Is there any solution, or should I use approximate evaluation?
3) One last shoot, is there any better method to calculate this individual RMST?
My aim is to study how the inclusion of several risk factors improve a clinical model predicting the incidence of stroke in a survival analysis. I want to use the NRI to compare two models (1 baseline model vs baseline model+new risk factor). However, I would prefer to stratify the probabilities of these models in categorical variables (i.e., risk 0-3%, risk 4-6%, risk >7%...).
For that purpose I have prepared the following code with R, but I am not sure whether it is correct or not and I would prefer to clarify it =).
I will use the open survival data “lung” in order to show a reproducible example. IMPORTANT: I will not consider censored data in this analysis, but it should be analyzed as appropriate in further analyses:
library(survival)
library(pec)##To calculate probabilities of event at a point of time.
library(Hmisc)##To calculate NRI
data(lung)
lung <- na.omit(lung)
lung$status <- lung$status-1##necessary for NRI package
stats <- lung$status
tempo <- lung$time
##Create two models, 1 baseline and 1 baseline+new predictor
model1 <- coxph(Surv(tempo,stats)~age+sex, data=lung, x=T)
model2 <- coxph(Surv(tempo,stats)~age+sex+factor(ph.ecog), data=lung,
x=T)
##Estimate the survival probability at time= 500.
lung$x <- predictSurvProb(model1, newdata=lung, times=c(500))
lung$y <- predictSurvProb(model2, newdata=lung, times=c(500))
##Calculate the probability of event (1-survival) and we categorize it.
lung$x <- cut(1-lung$x, c(0, 0.25, 0.5, 0.75, 1), include.lowest=TRUE,
right=FALSE)
lung$y <- cut(1-lung$y, c(0, 0.25, 0.5, 0.75, 1), include.lowest=TRUE,
right=FALSE)
##Confusion matrix
cases <- subset(lung, stats==1)
controls <- subset(lung, stats==0)
table(lung$x, lung$y)##All patients
table(cases$x, cases$y)##cases
table(controls$x, controls$y)##controls
##Calculate NRI
x <- as.numeric(lung$x)/4##Predictions should be ranged between 0-1
y <- as.numeric(lung$y)/4
improveProb(x, y, stats)
My questions:
1) Is this mode to calculate the risk of event at t=500 correct?
2) Is it correct to calculate NRI with this function? I have tried other packages (i.e., NRIcens) and I obtained the same results…
Thanks in advance for your help!
I'm attempting to use the "rpart" package in R to build a survival tree, and I'm hoping to use this tree to then make predictions for other observations.
I know there have been a lot of SO questions involving rpart and prediction; however, I have not been able to find any that address a problem that (I think) is specific to using rpart with a "Surv" object.
My particular problem involves interpreting the results of the "predict" function. An example is helpful:
library(rpart)
library(OIsurv)
# Make Data:
set.seed(4)
dat = data.frame(X1 = sample(x = c(1,2,3,4,5), size = 1000, replace=T))
dat$t = rexp(1000, rate=dat$X1)
dat$t = dat$t / max(dat$t)
dat$e = rbinom(n = 1000, size = 1, prob = 1-dat$t )
# Survival Fit:
sfit = survfit(Surv(t, event = e) ~ 1, data=dat)
plot(sfit)
# Tree Fit:
tfit = rpart(formula = Surv(t, event = e) ~ X1 , data = dat, control=rpart.control(minsplit=30, cp=0.01))
plot(tfit); text(tfit)
# Survival Fit, Broken by Node in Tree:
dat$node = as.factor(tfit$where)
plot( survfit(Surv(dat$t, event = dat$e)~dat$node) )
So far so good. My understanding of what's going on here is that rpart is attempting to fit exponential survival curves to subsets of my data. Based on this understanding, I believe that when I call predict(tfit), I get, for each observation, a number corresponding to the parameter for the exponential curve for that observation. So, for example, if predict(fit)[1] is .46, then this means for the first observation in my original dataset, the curve is given by the equation P(s) = exp(−λt), where λ=.46.
This seems like exactly what I'd want. For each observation (or any new observation), I can get the predicted probability that this observation will be alive/dead for a given time point. (EDIT: I'm realizing this is probably a misconception— these curves don't give the probability of alive/dead, but the probability of surviving an interval. This doesn't change the problem described below, though.)
However, when I try and use the exponential formula...
# Predict:
# an attempt to use the rates extracted from the tree to
# capture the survival curve formula in each tree node.
rates = unique(predict(tfit))
for (rate in rates) {
grid= seq(0,1,length.out = 100)
lines(x= grid, y= exp(-rate*(grid)), col=2)
}
What I've done here is split the dataset in the same way the survival tree did, then used survfit to plot a non-parametric curve for each of these partitions. That's the black lines. I've also drawn lines corresponding to the result of plugging in (what I thought was) the 'rate' parameter into (what I thought was) the survival exponential formula.
I understand that the non-parametric and the parametric fit shouldn't necessarily be identical, but this seems more than that: it seems like I need to scale my X variable or something.
Basically, I don't seem to understand the formula that rpart/survival is using under the hood. Can anyone help me get from (1) rpart model to (2) a survival equation for any arbitrary observation?
The survival data are scaled internally exponentially so that the predicted rate in the root node is always fixed to 1.000. The predictions reported by the predict() method are then always relative to the survival in the root node, i.e., higher or lower by a certain factor. See Section 8.4 in vignette("longintro", package = "rpart") for more details. In any case, the Kaplan-Meier curves you are reported correspond exactly to what is also reported in the rpart vignette.
If you want to obtain directly the plots of the Kaplan-Meier curves in the tree and get predicted median survival times, you can coerce the rpart tree to a constparty tree as provided by the partykit package:
library("partykit")
(tfit2 <- as.party(tfit))
## Model formula:
## Surv(t, event = e) ~ X1
##
## Fitted party:
## [1] root
## | [2] X1 < 2.5
## | | [3] X1 < 1.5: 0.192 (n = 213)
## | | [4] X1 >= 1.5: 0.082 (n = 213)
## | [5] X1 >= 2.5: 0.037 (n = 574)
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
##
plot(tfit2)
The print output shows the median survival time and the visualization the corresponding Kaplan-Meier curve. Both can also be obtained with the predict() method setting the type argument to "response" and "prob" respectively.
predict(tfit2, type = "response")[1]
## 5
## 0.03671885
predict(tfit2, type = "prob")[[1]]
## Call: survfit(formula = y ~ 1, weights = w, subset = w > 0)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## 574.0000 574.0000 574.0000 542.0000 0.0367 0.0323 0.0408
As an alternative to the rpart survival trees you might also consider the non-parametric survival trees based on conditional inference in ctree() (using logrank scores) or fully parametric survival trees using the general mob() infrastructure from the partykit package.
#Achim Zeileis's answer is very helpful, but it seems that the exact #jwdink's question was not answered. I understood it as "If RPart tree splits by best exponential survival fit, what are the Lambdas for these fits in absolute terms, so we can use these exponential survival functions to make predictions". The RPart summary does show the estimated rate, but only in relative terms assuming that the entire population has rate of 1. To overcome, one can fit an exponential survreg, take the referenced lambda from there and then multiply RPart predicted rates by that number (see code below).
That said, this is not how survival rates in RPart are predicted out of a tree. I did not find survival prediction function directly in RPart, however as Achim pointed above, partykit uses Kaplan-Meier estimates, i.e. non-parametric survival from those ending up in a respective final leaf. I think it is the same in survival random forest trees, where K-M curves are used in the final leaves.
The simulated data in this question uses exponential distribution, so K-M and exponential survival curves will be similar by design, however for a different simulated or real-life distribution estimated exponential rates by RPart tree and using K-M curves in the final leaves (of the same tree) will give different survival rates.
sfit = survfit(Surv(t, event = e) ~ 1, data=dat)
tfit = rpart(formula = Surv(t, event = e) ~ X1 , data = dat, control=rpart.control(minsplit=30, cp=0.01))
plot(tfit); text(tfit)
# Survival Fit, Broken by Node in Tree:
dat$node = as.factor(tfit$where)
table(dat$node)
s0 = survreg(Surv(t,e)~ 1, data = dat, dist = "exponential") #-0.6175
e0 = exp(-summary(s0)$coefficients[1]); e0 #1.854
rates = unique(predict(tfit))
#1) plot K-M curves by node (black):
plot( survfit(Surv(dat$t, event = dat$e)~dat$node) )
#2) plot exponential survival with rates = e0 * RPart rates (red):
for (rate in rates) {
grid= seq(0,1,length.out = 100)
lines(x= grid, y= exp(-e0*rate*(grid)), col=2)
}
#3) plot partykit survival curves based on RPart tree (green)
library(partykit)
tfit2 <- as.party(tfit)
col_n = 1
for (node in names(table(dat$node))){
predict_curve = predict(tfit2, newdata = dat[dat$node == node, ], type = "prob")
surv_esitmated = approxfun(predict_curve[[1]]$time, predict_curve[[1]]$surv)
lines(x= grid, y= surv_esitmated(grid), col = 2+col_n)
col_n=+1
}
Let me use UCLA example on multinominal logit as a running example---
library(nnet)
library(foreign)
ml <- read.dta("http://www.ats.ucla.edu/stat/data/hsbdemo.dta")
ml$prog2 <- relevel(ml$prog, ref = "academic")
test <- multinom(prog2 ~ ses + write, data = ml)
dses <- data.frame(ses = c("low", "middle", "high"), write = mean(ml$write))
predict(test, newdata = dses, "probs")
I wonder how can I get 95% confidence interval?
This can be accomplished with the effects package, which I showcased for another question at Cross Validated here.
Let's look at your example.
library(nnet)
library(foreign)
ml <- read.dta("http://www.ats.ucla.edu/stat/data/hsbdemo.dta")
ml$prog2 <- relevel(ml$prog, ref = "academic")
test <- multinom(prog2 ~ ses + write, data = ml)
Instead of using the predict() from base, we use Effect() from effects
require(effects)
fit.eff <- Effect("ses", test, given.values = c("write" = mean(ml$write)))
data.frame(fit.eff$prob, fit.eff$lower.prob, fit.eff$upper.prob)
prob.academic prob.general prob.vocation L.prob.academic L.prob.general L.prob.vocation U.prob.academic
1 0.4396845 0.3581917 0.2021238 0.2967292 0.23102295 0.10891758 0.5933996
2 0.4777488 0.2283353 0.2939159 0.3721163 0.15192359 0.20553211 0.5854098
3 0.7009007 0.1784939 0.1206054 0.5576661 0.09543391 0.05495437 0.8132831
U.prob.general U.prob.vocation
1 0.5090244 0.3442749
2 0.3283014 0.4011175
3 0.3091388 0.2444031
If we want to, we can also plot the predicted probabilities with their respective confidence intervals using the facilities in effects.
plot(fit.eff)
Simply use the confint function on your model object.
ci <- confint(test, level=0.95)
Note that confint is a generic function and a specific version is run for multinom, as you can see by running
> methods(confint)
[1] confint.default confint.glm* confint.lm* confint.multinom*
[5] confint.nls*
EDIT:
as for the matter of calculating confidence interval for the predicted probabilities, I quote from: https://stat.ethz.ch/pipermail/r-help/2004-April/048917.html
Is there any possibility to estimate confidence intervalls for the
probabilties with the multinom function?
No, as confidence intervals (sic) apply to single parameters not
probabilities (sic). The prediction is a probability distribution, so
the uncertainty would have to be some region in Kd space, not an interval.
Why do you want uncertainty statements about predictions (often called
tolerance intervals/regions)? In this case you have an event which
happens or not and the meaningful uncertainty is the probability
distribution. If you really have need of a confidence region, you could
simulate from the uncertainty in the fitted parameters, predict and
summarize somehow the resulting empirical distribution.