I would like to
obtain the predicted time of the event, given a set of covariates
obtain the time at which the risk is equal to my specified threshold,
given covariates obtain the risk, given time and covariates
All this using ic_par (parametric) or ic_npar (non-parametric) or ic_sp (semi-parametric) models (not bayesian models) from icenReg
There are 3 functions in icenReg (https://cran.r-project.org/web/packages/icenReg/icenReg.pdf) that I believe do at least two of those things:
sampleSurv
getFitEsts
getSCurves
Can someone explain what those three functions do? Especially the difference between sampleSurv and getFitEsts?
From what I understand, the time to the event is modelled as a probability curve.
So you do not obtain a defined predicted time to the event, but rather a probability of this event occuring through time.
Thus, you can obtain the probability of the event to occur after X days, or you can obtain the time at which the event has a probability of X % to have occured.
getFitEsts() will provide these 2 estimates from an object previously fitted by ic_sp(), ic_par() or ic_bayes()
Here is an example of how to obtain these estimates, with an example from icenReg package :
data("IR_diabetes")
flatPrior_fit <- ic_bayes(cbind(left, right) ~ gender, data = IR_diabetes, model = "po", dist = "gamma")
newdata <- data.frame(gender = c(unique(IR_diabetes$gender)))
rownames(newdata) <- c(as.character(unique(IR_diabetes$gender)))
# plot the survival probability curve
plot(flatPrior_fit)
# plot the same curve according to each factor
plot(flatPrior_fit,newdata)
maleCovs <- data.frame(gender = c("male"))
femaleCovs <- data.frame(gender = c("female"))
# median survival time as calculated by the model if males and females are considered together
# = 50 % probability of the event occurring
getFitEsts(flatPrior_fit , p = 0.5)
# median survival time for males
getFitEsts(flatPrior_fit, newdata = maleCovs, p = 0.5)
# median survival time for females
getFitEsts(flatPrior_fit, newdata = femaleCovs, p = 0.5)
# Probability that males died at day 15 ( = 1 - probability that they survived )
getFitEsts(flatPrior_fit, newdata = maleCovs, q = 15)
# Probability that males died at day 15 ( = 1 - probability that they survived )
getFitEsts(flatPrior_fit, newdata = femaleCovs, q = 15)
getScurves() work only for semi parametric models.
It allows to obtain the interval of the survival probability for each time step, as plotted on the curve :
data("IR_diabetes")
# fit a semi parametric model (proportional odds)
sp_fit <- ic_sp(cbind(left, right) ~ gender, data = IR_diabetes, model = "po")
# plot the survival curve
plot(sp_fit, newdata)
# obtain the intervals and associated survival probability of this survival curve for each time step
getSCurves(sp_fit,newdata)
Finally, sampleSurv() draw samples from the probability curve you fitted, contained between the intervals computed, according to the quantile you need. These results are variable because there is multiple possibilities between these intervals.
I hope it helped a bit to understand these functions
Related
I used the survey package to draw a weighted Kaplan Meier survival plot, like this:
library(survey)
data(pbc, package="survival")
pbc$randomized <- with(pbc, !is.na(trt) & trt>0)
biasmodel <- glm(randomized~age*edema,data=pbc)
pbc$randprob <- fitted(biasmodel)
dpbc<-svydesign(id=~1, prob=~randprob, strata=~edema, data=subset(pbc,randomized))
s2 <-svykm(Surv(time,status>0) ~ sex, design = dpbc)
svyjskm(s2, pval = T, table = T, design = dpbc)
Now I would like to calculate the survival rates in the two groups, similar to what I would get when using summary(x, times = c(1:5)) on a survfit object. Does anybody know how I can extract these values?
Many thanks in advance!
Saving survival probabilities to data frames
The time and survival probability values can be extracted from the svykm object ("s2") and saved in separate data frames for females ("s2_data_f") and males ("s2_data_m") using the following codes:
s2_data_f <- data.frame(time = s2[["f"]][["time"]], surv = s2[["f"]][["surv"]])
s2_data_m <- data.frame(time = s2[["m"]][["time"]], surv = s2[["m"]][["surv"]])
Note that not all time values will be available in the data (e.g. no one has time = 5) and for these values the nearest smaller value should be taken (e.g. for time = 5 the value for time = 0 should be used which is 100% survival).
Functions to extract survival probabilities or time values from data frames
Below is a function that locates the row in "s2_data_f" with a specified time value (or the nearest smaller time value) and returns the corresponding survival probability value.
return_surv_f <- function(x) {
time <- max(s2_data_f$time[s2_data_f$time <= x])
return(s2_data_f$surv[s2_data_f$time==time])
}
Similarly, for males ("s2_data_m") the function would be:
return_surv_m <- function(x) {
time <- max(s2_data_m$time[s2_data_m$time <= x])
return(s2_data_m$surv[s2_data_m$time==time])
}
These functions can then be used with "sapply" to return survival probability results for one or more chosen time values.
sapply(c(1:5), return_surv_f)
sapply(c(1:5), return_surv_m)
If you need to get the results in reverse (i.e. find the time corresponding with a specific survival probability) the "quantile" function can be used. For example, if you want to know at what time 75% of participants were alive then:
quantile(s2[["f"]], probs = 0.75)
quantile(s2[["m"]], probs = 0.75)
Confidence intervals or standard errors for survival probabilities
If you wish to calculate confidence intervals or standard errors for the survival probabilities then "se = TRUE" must be added to the svykm function.
s2 <- svykm(Surv(time,status>0) ~ sex, design = dpbc, se = TRUE)
Note however that this changes the statistical method with the R survey package documentation stating that:
"When standard errors are computed, the survival curve is actually the
Aalen (hazard-based) estimator rather than the Kaplan-Meier estimator."
Confidence intervals can then be obtained using "confint" with one or more time values specified in "parm =".
confint(s2[["f"]], parm = c(1000:1005), level = 0.95)
confint(s2[["m"]], parm = c(1000:1005), level = 0.95)
Standard errors can be obtained from the "varlog" values.
s2_data_f <- data.frame(varlog = s2[["f"]][["varlog"]])
s2_data_m <- data.frame(varlog = s2[["m"]][["varlog"]])
Example of a data frame including survival probabilities and their confidence intervals:
s2_results_f <- data.frame(
time = c(1000:1005),
surv = sapply(c(1000:1005), return_surv_f),
CI = confint(s2[["f"]], parm = c(1000:1005), level = 0.95)
)
s2_results_f[2:4] <- round(s2_results_f[2:4], 2)*100
s2_results_f[2:4] <- paste0(unlist(s2_results_f[2:4]), "%")
names(s2_results_f)[1:4] <- c("Follow-up time", "Survival probability", "95% CI lower", "95% CI upper")
I have data for two populations that are binned by age, with different bins for each population.
Age bins in population 1: 18-24, 25-29, 30-34, 35-45, 46-60, 61+
Age bins in population 2: 15-19, 20-24, 25-29, 30-34 ... 85-89, 90+
I want to infer a continuous distribution from these binned data in order to compare the two populations more directly. I tried fitting an untruncated negative binomial distribution but it was underestimating the lower bins:
So, now I want to try a truncated negative binomial distribution. I did the following:
library(truncdist)
library(fitdistrplus)
dtruncated_nbinom <- function(x)
dtrunc(x, "nbinom", a=18, b=100)
ptruncated_nbinom <- function(q)
ptrunc(q, "nbinom", a=18, b=100)
pop1_nbinom <- fitdistcens(pop1_dt, "truncated_nbinom")
But I got the following error:
Error in computing default starting values.
Error in manageparam(start.arg = start, fix.arg = fix.arg, obs = pseudodata, :
Error in start.arg.default(obs, distname) :
Unknown starting values for distribution truncated_nbinom.
Any advice on how to approach/resolve this?
Here's the pop 1 data:
pop1 <- data.table(left = c(18,25,30,35,46,61), right = c(25,30,35,46,61,100), counts = c(2745,3115,2726,3433,1368,204))
pop1_dt <- pop1[rep(1:nrow(pop1), pop1[,counts]), .(left, right)]
I am struggling to transform the log odds ratio profile confidence intervals obtained from a logit model into probabilities. I would like to know how to calculate the confidence intervals of the difference between two groups.
If the p-value is > 0.05, the 95% CI of the difference should span from below zero to above zero. However, I don’t know how negative values can be obtained when the log ratios have to be exponentiated. Therefore I tried to calculate the CI of one of the groups (B) and see what the difference of the lower and the upper end of the CI to the estimate of group A is. I believe this is not the correct way to calculate the CI of the difference because the estimate of A is also uncertain.
I would be happy if anyone could help me out.
library(lme4)
# Example data:
set.seed(11)
treatment = c(rep("A",30), rep("B", 40))
site = rep(1:14, each = 5)
presence = c(rbinom(30, 1, 0.6),rbinom(40, 1, 0.8))
df = data.frame(presence, treatment, site)
# Likelihood ratio test
M0 = glmer(presence ~ 1 + (1|site), family = "binomial", data = df)
M1 = glmer(presence ~ treatment + (1|site), family = "binomial", data = df)
anova(M1, M0)
# Calculating confidence intervals
cc <- confint(M1, parm = "beta_")
ctab <- cbind(est = fixef(M1), cc)
cdat = as.data.frame(ctab)
# Function to back-transform to probability (0-1)
unlogit = function(y){
y_retransfromed = exp(y)/(1+exp(y))
y_retransfromed
}
# Getting estimates
A_est = unlogit(cdat$est[1])
B_est = unlogit(cdat$est[1] + cdat$est[2])
B_lwr = unlogit(cdat$est[1] + cdat[2,2])
B_upr = unlogit(cdat$est[1] + cdat[2,3])
Difference_est = B_est - A_est
# This is how I tried to calculate the CI of the difference
Difference_lwr = B_lwr - A_est
Difference_upr = B_upr - A_est
# However, I believe this is wrong because A_est is also “uncertain”
How to get the confidence interval of the difference of the probability of presence?
We can calculate the average treatment effect in the following way. From the original data, create two new datasets, one in which all units receive treatment A, and one in which all units receive treatment B. Now, based on your model estimates (in your case, M1), we compute predicted outcomes for units in each of these two datasets. We then compute the mean difference in the outcomes between the two datasets to get our estimated average treatment effect. Here, we can write a function that takes a glmer object and computes the average treatment effect:
ate <- function(.) {
treat_A <- treat_B <- df
treat_A$treatment <- "A"
treat_B$treatment <- "B"
c("ate" = mean(predict(., newdata = treat_B, type = "response") -
predict(., newdata = treat_A, type = "response")))
}
ate(M1)
# ate
# 0.09478276
How do we get the uncertainty interval? We can use the bootstrap, i.e. re-estimate the model many times using randomly generated samples from your original data, calculating the average treatment effect each time. We can then use the distribution of the bootstrapped average treatment effects to compute our uncertainty interval. Here we generate 100 simulations using the bootMer function
out <- bootMer(M1, ate, seed = 1234, nsim = 100)
and inspect the distribution of the effect:
quantile(out$t, c(0.025, 0.5, 0.975))
# 2.5% 50% 97.5%
# -0.06761338 0.10508751 0.26907504
I am befuddled by the format to perform a simple prediction using R's survival package
library(survival)
lung.surv <- survfit(Surv(time,status) ~ 1, data = lung)
So fitting a simple exponential regression (for example purposes only) is:
lung.reg <- survreg(Surv(time,status) ~ 1, data = lung, dist="exponential")
How would I predict the percent survival at time=400?
When I use the following:
myPredict400 <- predict(lung.reg, newdata=data.frame(time=400), type="response")
I get the following:
myPredict400
1
421.7758
I was expecting something like 37% so I am missing something pretty obvious
The point with this survival function is to find an empirical distribution that fits the survival times. Essentially you are associating a survival time with a probability. Once you have that distribution, you can pick out the survival rate for a given time.
Try this:
library(survival)
lung.reg <- survreg(Surv(time,status) ~ 1, data = lung) # because you want a distribution
pct <- 1:99/100 # this creates the empirical survival probabilities
myPredict400 <- predict(lung.reg, newdata=data.frame(time=400),type='quantile', p=pct)
indx = which(abs(myPredict400 - 400) == min(abs(myPredict400 - 400))) # find the closest survival time to 400
print(1 - pct[indx]) # 0.39
Straight from the help docs, here's a plot of it:
matplot(myPredict400, 1-pct, xlab="Months", ylab="Survival", type='l', lty=c(1,2,2), col=1)
Edited
You're basically fitting a regression to a distribution of probabilities (hence 1...99 out of 100). If you make it go to 100, then the last value of your prediction is inf because the survival rate in the 100th percentile is infinite. This is what the quantile and pct arguments do.
For example, setting pct = 1:999/1000 you get much more precise values for the prediction (myPredict400). Also, if you set pct to be some value that's not a proper probability (i.e. less than 0 or more than 1) you'll get an error. I suggest you play with these values and see how they impact your survival rates.
I'm attempting to use the "rpart" package in R to build a survival tree, and I'm hoping to use this tree to then make predictions for other observations.
I know there have been a lot of SO questions involving rpart and prediction; however, I have not been able to find any that address a problem that (I think) is specific to using rpart with a "Surv" object.
My particular problem involves interpreting the results of the "predict" function. An example is helpful:
library(rpart)
library(OIsurv)
# Make Data:
set.seed(4)
dat = data.frame(X1 = sample(x = c(1,2,3,4,5), size = 1000, replace=T))
dat$t = rexp(1000, rate=dat$X1)
dat$t = dat$t / max(dat$t)
dat$e = rbinom(n = 1000, size = 1, prob = 1-dat$t )
# Survival Fit:
sfit = survfit(Surv(t, event = e) ~ 1, data=dat)
plot(sfit)
# Tree Fit:
tfit = rpart(formula = Surv(t, event = e) ~ X1 , data = dat, control=rpart.control(minsplit=30, cp=0.01))
plot(tfit); text(tfit)
# Survival Fit, Broken by Node in Tree:
dat$node = as.factor(tfit$where)
plot( survfit(Surv(dat$t, event = dat$e)~dat$node) )
So far so good. My understanding of what's going on here is that rpart is attempting to fit exponential survival curves to subsets of my data. Based on this understanding, I believe that when I call predict(tfit), I get, for each observation, a number corresponding to the parameter for the exponential curve for that observation. So, for example, if predict(fit)[1] is .46, then this means for the first observation in my original dataset, the curve is given by the equation P(s) = exp(−λt), where λ=.46.
This seems like exactly what I'd want. For each observation (or any new observation), I can get the predicted probability that this observation will be alive/dead for a given time point. (EDIT: I'm realizing this is probably a misconception— these curves don't give the probability of alive/dead, but the probability of surviving an interval. This doesn't change the problem described below, though.)
However, when I try and use the exponential formula...
# Predict:
# an attempt to use the rates extracted from the tree to
# capture the survival curve formula in each tree node.
rates = unique(predict(tfit))
for (rate in rates) {
grid= seq(0,1,length.out = 100)
lines(x= grid, y= exp(-rate*(grid)), col=2)
}
What I've done here is split the dataset in the same way the survival tree did, then used survfit to plot a non-parametric curve for each of these partitions. That's the black lines. I've also drawn lines corresponding to the result of plugging in (what I thought was) the 'rate' parameter into (what I thought was) the survival exponential formula.
I understand that the non-parametric and the parametric fit shouldn't necessarily be identical, but this seems more than that: it seems like I need to scale my X variable or something.
Basically, I don't seem to understand the formula that rpart/survival is using under the hood. Can anyone help me get from (1) rpart model to (2) a survival equation for any arbitrary observation?
The survival data are scaled internally exponentially so that the predicted rate in the root node is always fixed to 1.000. The predictions reported by the predict() method are then always relative to the survival in the root node, i.e., higher or lower by a certain factor. See Section 8.4 in vignette("longintro", package = "rpart") for more details. In any case, the Kaplan-Meier curves you are reported correspond exactly to what is also reported in the rpart vignette.
If you want to obtain directly the plots of the Kaplan-Meier curves in the tree and get predicted median survival times, you can coerce the rpart tree to a constparty tree as provided by the partykit package:
library("partykit")
(tfit2 <- as.party(tfit))
## Model formula:
## Surv(t, event = e) ~ X1
##
## Fitted party:
## [1] root
## | [2] X1 < 2.5
## | | [3] X1 < 1.5: 0.192 (n = 213)
## | | [4] X1 >= 1.5: 0.082 (n = 213)
## | [5] X1 >= 2.5: 0.037 (n = 574)
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
##
plot(tfit2)
The print output shows the median survival time and the visualization the corresponding Kaplan-Meier curve. Both can also be obtained with the predict() method setting the type argument to "response" and "prob" respectively.
predict(tfit2, type = "response")[1]
## 5
## 0.03671885
predict(tfit2, type = "prob")[[1]]
## Call: survfit(formula = y ~ 1, weights = w, subset = w > 0)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## 574.0000 574.0000 574.0000 542.0000 0.0367 0.0323 0.0408
As an alternative to the rpart survival trees you might also consider the non-parametric survival trees based on conditional inference in ctree() (using logrank scores) or fully parametric survival trees using the general mob() infrastructure from the partykit package.
#Achim Zeileis's answer is very helpful, but it seems that the exact #jwdink's question was not answered. I understood it as "If RPart tree splits by best exponential survival fit, what are the Lambdas for these fits in absolute terms, so we can use these exponential survival functions to make predictions". The RPart summary does show the estimated rate, but only in relative terms assuming that the entire population has rate of 1. To overcome, one can fit an exponential survreg, take the referenced lambda from there and then multiply RPart predicted rates by that number (see code below).
That said, this is not how survival rates in RPart are predicted out of a tree. I did not find survival prediction function directly in RPart, however as Achim pointed above, partykit uses Kaplan-Meier estimates, i.e. non-parametric survival from those ending up in a respective final leaf. I think it is the same in survival random forest trees, where K-M curves are used in the final leaves.
The simulated data in this question uses exponential distribution, so K-M and exponential survival curves will be similar by design, however for a different simulated or real-life distribution estimated exponential rates by RPart tree and using K-M curves in the final leaves (of the same tree) will give different survival rates.
sfit = survfit(Surv(t, event = e) ~ 1, data=dat)
tfit = rpart(formula = Surv(t, event = e) ~ X1 , data = dat, control=rpart.control(minsplit=30, cp=0.01))
plot(tfit); text(tfit)
# Survival Fit, Broken by Node in Tree:
dat$node = as.factor(tfit$where)
table(dat$node)
s0 = survreg(Surv(t,e)~ 1, data = dat, dist = "exponential") #-0.6175
e0 = exp(-summary(s0)$coefficients[1]); e0 #1.854
rates = unique(predict(tfit))
#1) plot K-M curves by node (black):
plot( survfit(Surv(dat$t, event = dat$e)~dat$node) )
#2) plot exponential survival with rates = e0 * RPart rates (red):
for (rate in rates) {
grid= seq(0,1,length.out = 100)
lines(x= grid, y= exp(-e0*rate*(grid)), col=2)
}
#3) plot partykit survival curves based on RPart tree (green)
library(partykit)
tfit2 <- as.party(tfit)
col_n = 1
for (node in names(table(dat$node))){
predict_curve = predict(tfit2, newdata = dat[dat$node == node, ], type = "prob")
surv_esitmated = approxfun(predict_curve[[1]]$time, predict_curve[[1]]$surv)
lines(x= grid, y= surv_esitmated(grid), col = 2+col_n)
col_n=+1
}