cumulative survival rate after weighting using survey::svykm() - r

I used the survey package to draw a weighted Kaplan Meier survival plot, like this:
library(survey)
data(pbc, package="survival")
pbc$randomized <- with(pbc, !is.na(trt) & trt>0)
biasmodel <- glm(randomized~age*edema,data=pbc)
pbc$randprob <- fitted(biasmodel)
dpbc<-svydesign(id=~1, prob=~randprob, strata=~edema, data=subset(pbc,randomized))
s2 <-svykm(Surv(time,status>0) ~ sex, design = dpbc)
svyjskm(s2, pval = T, table = T, design = dpbc)
Now I would like to calculate the survival rates in the two groups, similar to what I would get when using summary(x, times = c(1:5)) on a survfit object. Does anybody know how I can extract these values?
Many thanks in advance!

Saving survival probabilities to data frames
The time and survival probability values can be extracted from the svykm object ("s2") and saved in separate data frames for females ("s2_data_f") and males ("s2_data_m") using the following codes:
s2_data_f <- data.frame(time = s2[["f"]][["time"]], surv = s2[["f"]][["surv"]])
s2_data_m <- data.frame(time = s2[["m"]][["time"]], surv = s2[["m"]][["surv"]])
Note that not all time values will be available in the data (e.g. no one has time = 5) and for these values the nearest smaller value should be taken (e.g. for time = 5 the value for time = 0 should be used which is 100% survival).
Functions to extract survival probabilities or time values from data frames
Below is a function that locates the row in "s2_data_f" with a specified time value (or the nearest smaller time value) and returns the corresponding survival probability value.
return_surv_f <- function(x) {
time <- max(s2_data_f$time[s2_data_f$time <= x])
return(s2_data_f$surv[s2_data_f$time==time])
}
Similarly, for males ("s2_data_m") the function would be:
return_surv_m <- function(x) {
time <- max(s2_data_m$time[s2_data_m$time <= x])
return(s2_data_m$surv[s2_data_m$time==time])
}
These functions can then be used with "sapply" to return survival probability results for one or more chosen time values.
sapply(c(1:5), return_surv_f)
sapply(c(1:5), return_surv_m)
If you need to get the results in reverse (i.e. find the time corresponding with a specific survival probability) the "quantile" function can be used. For example, if you want to know at what time 75% of participants were alive then:
quantile(s2[["f"]], probs = 0.75)
quantile(s2[["m"]], probs = 0.75)
Confidence intervals or standard errors for survival probabilities
If you wish to calculate confidence intervals or standard errors for the survival probabilities then "se = TRUE" must be added to the svykm function.
s2 <- svykm(Surv(time,status>0) ~ sex, design = dpbc, se = TRUE)
Note however that this changes the statistical method with the R survey package documentation stating that:
"When standard errors are computed, the survival curve is actually the
Aalen (hazard-based) estimator rather than the Kaplan-Meier estimator."
Confidence intervals can then be obtained using "confint" with one or more time values specified in "parm =".
confint(s2[["f"]], parm = c(1000:1005), level = 0.95)
confint(s2[["m"]], parm = c(1000:1005), level = 0.95)
Standard errors can be obtained from the "varlog" values.
s2_data_f <- data.frame(varlog = s2[["f"]][["varlog"]])
s2_data_m <- data.frame(varlog = s2[["m"]][["varlog"]])
Example of a data frame including survival probabilities and their confidence intervals:
s2_results_f <- data.frame(
time = c(1000:1005),
surv = sapply(c(1000:1005), return_surv_f),
CI = confint(s2[["f"]], parm = c(1000:1005), level = 0.95)
)
s2_results_f[2:4] <- round(s2_results_f[2:4], 2)*100
s2_results_f[2:4] <- paste0(unlist(s2_results_f[2:4]), "%")
names(s2_results_f)[1:4] <- c("Follow-up time", "Survival probability", "95% CI lower", "95% CI upper")

Related

differences between sampleSurv, getFitEsts and getSCurves in icenReg

I would like to
obtain the predicted time of the event, given a set of covariates
obtain the time at which the risk is equal to my specified threshold,
given covariates obtain the risk, given time and covariates
All this using ic_par (parametric) or ic_npar (non-parametric) or ic_sp (semi-parametric) models (not bayesian models) from icenReg
There are 3 functions in icenReg (https://cran.r-project.org/web/packages/icenReg/icenReg.pdf) that I believe do at least two of those things:
sampleSurv
getFitEsts
getSCurves
Can someone explain what those three functions do? Especially the difference between sampleSurv and getFitEsts?
From what I understand, the time to the event is modelled as a probability curve.
So you do not obtain a defined predicted time to the event, but rather a probability of this event occuring through time.
Thus, you can obtain the probability of the event to occur after X days, or you can obtain the time at which the event has a probability of X % to have occured.
getFitEsts() will provide these 2 estimates from an object previously fitted by ic_sp(), ic_par() or ic_bayes()
Here is an example of how to obtain these estimates, with an example from icenReg package :
data("IR_diabetes")
flatPrior_fit <- ic_bayes(cbind(left, right) ~ gender, data = IR_diabetes, model = "po", dist = "gamma")
newdata <- data.frame(gender = c(unique(IR_diabetes$gender)))
rownames(newdata) <- c(as.character(unique(IR_diabetes$gender)))
# plot the survival probability curve
plot(flatPrior_fit)
# plot the same curve according to each factor
plot(flatPrior_fit,newdata)
maleCovs <- data.frame(gender = c("male"))
femaleCovs <- data.frame(gender = c("female"))
# median survival time as calculated by the model if males and females are considered together
# = 50 % probability of the event occurring
getFitEsts(flatPrior_fit , p = 0.5)
# median survival time for males
getFitEsts(flatPrior_fit, newdata = maleCovs, p = 0.5)
# median survival time for females
getFitEsts(flatPrior_fit, newdata = femaleCovs, p = 0.5)
# Probability that males died at day 15 ( = 1 - probability that they survived )
getFitEsts(flatPrior_fit, newdata = maleCovs, q = 15)
# Probability that males died at day 15 ( = 1 - probability that they survived )
getFitEsts(flatPrior_fit, newdata = femaleCovs, q = 15)
getScurves() work only for semi parametric models.
It allows to obtain the interval of the survival probability for each time step, as plotted on the curve :
data("IR_diabetes")
# fit a semi parametric model (proportional odds)
sp_fit <- ic_sp(cbind(left, right) ~ gender, data = IR_diabetes, model = "po")
# plot the survival curve
plot(sp_fit, newdata)
# obtain the intervals and associated survival probability of this survival curve for each time step
getSCurves(sp_fit,newdata)
Finally, sampleSurv() draw samples from the probability curve you fitted, contained between the intervals computed, according to the quantile you need. These results are variable because there is multiple possibilities between these intervals.
I hope it helped a bit to understand these functions

R: Plot Individual Predictions

I am using the R programming language. I am trying to follow this tutorial :https://rdrr.io/cran/randomForestSRC/man/plot.competing.risk.rfsrc.html
This tutorial shows how to use the "survival random forest" algorithm - an algorithm used to analyze survival data. In this example, the "follic" data set is used, the survival random forest algorithm is used to analyze the instant hazard of observation experiencing "status 1" vs "status 2" (this is called "competing risks).
In the code below, the survival random forest model is trained on the follic data set using all observations except the last two observations. Then, this model is used to predict the hazards of the last two observations:
#load library
library(randomForestSRC)
#load data
data(follic, package = "randomForestSRC")
#train model on all observations except the last 2 observations
follic.obj <- rfsrc(Surv(time, status) ~ ., follic[c(1:539),], nsplit = 3, ntree = 100)
#use model to predict the last two observations
f <- predict(follic.obj, follic[540:541, ])
#plot individual curves - does not work
plot.competing.risk(f)
However, this seems to produce the average hazards for the last two observations experiencing "status 1 vs status 2".
Is there a way to plot the individual hazards of the first observation and the second observation?
Thanks
EDIT1:
I know how to do this for other functions in this package, e.g. here you can plot these curves for 7 observations at once:
data(veteran, package = "randomForestSRC")
plot.survival(rfsrc(Surv(time, status)~ ., veteran), cens.model = "rfsrc")
## pbc data
data(pbc, package = "randomForestSRC")
pbc.obj <- rfsrc(Surv(days, status) ~ ., pbc)
## use subset to focus on specific individuals
plot.survival(pbc.obj, subset = c(3, 10))
This example seems to show the predicted survival curves for 7 observations (plus the confidence intervals - the red line is the average) at once. But I still do not know how to do this for the "plot.competing.risk" function.
EDIT2:
I think there might be an indirect way to solve this - you can predict each observation individually:
#use model to predict the last two observations individually
f1 <- predict(follic.obj, follic[540, ])
f2 <- predict(follic.obj, follic[541, ])
#plot individual curves
plot.competing.risk(f1)
plot.competing.risk(f2)
But I was hoping there was a more straightforward way to do this. Does anyone know how?
One possible way is to modify the function plot.competing.risk for individual line, and plot over a for loop for overlapping individual lines, as shown below.
#use model to predict the last three observations
f <- predict(follic.obj, follic[539:541, ])
x <- f
par(mfrow = c(2, 2))
for (k in 1:3) { #k for type of plot
for (i in 1:dim(x$chf)[1]) { #i for all individuals in x
#cschf <- apply(x$chf, c(2, 3), mean, na.rm = TRUE) #original group mean
cschf = x$chf[i,,] #individual values
#cif <- apply(x$cif, c(2, 3), mean, na.rm = TRUE) #original group mean
cif = x$cif[i,,] #individual values
cpc <- do.call(cbind, lapply(1:ncol(cif), function(j) {
cif[, j]/(1 - rowSums(cif[, -j, drop = FALSE]))
}))
if (k==1)
{matx = cschf
range = range(x$chf)
}
if (k==2)
{matx = cif
range = range(x$cif)
}
if (k==3)
{matx = cpc
range = c(0,1) #manually assign, for now
}
ylab = c("Cause-Specific CHF","Probability (%)","Probability (%)")[k]
matplot(x$time.interest, matx, type='l', lty=1, lwd=3, col=1:2,
add=ifelse(i==1,F,T), ylim=range, xlab="Time", ylab=ylab) #ADD tag for overlapping individual lines
}
legend <- paste(c("CSCHF","CIF","CPC")[k], 1:2, " ")
legend("bottomright", legend = legend, col = (1:2), lty = 1, lwd = 3)
}

Similar to cox regression hazard model, can we get survival curves and hazard ratios using survivalsvm?

I am a beginner, trying to do survival analysis using machine learning on the lung cancer dataset. I know how to do the survival analysis using the Cox proportional hazard model. Cox proportional hazard model provides us the hazard ratios, which are nothing but the exponential of the regression coefficients. I wonder if, we can do the same thing using machine learning. As a beginner, I am trying survivalsvm from the R language. Please see the link for this. I am using the inbuilt cancer data for doing survival analysis. Following is the R code, given at this link.
library(survival)
library(survivalsvm)
set.seed(123)
n <- nrow(veteran)
train.index <- sample(1:n, 0.7 * n, replace = FALSE)
test.index <- setdiff(1:n, train.index)
survsvm.reg <- survivalsvm(Surv(diagtime, status) ~ .,
subset = train.index, data = veteran,
type = "regression", gamma.mu = 1,
opt.meth = "quadprog", kernel = "add_kernel")
print(survsvm.reg)
pred.survsvm.reg <- predict(object = survsvm.reg,
newdata = veteran, subset = test.index)
print(pred.survsvm.reg)
Can anyone help me to get the hazard ratios or survival curve for this dataset? Also, how to interpret the output of this function
This question is kind of old now but I'm going to answer anyway because this is a difficult problem and I struggled with {survivalsvm} when I first used it.
So depending on the type argument you get different outputs. In your case type = "regression" means you are plotting Shivaswamy's (hope i spelt correctly) SVCR which predicts the time until an event takes place, so these are survival time predictions.
In order to convert this to a survival curve you have to make some assumptions about the shape of the survival distribution. So for example, let's say you think the survival time is Normally distributed with N(mu, sigma). Then you can use your predicted survival time as mu and either predict or make an assumption about sigma.
Below is an example using your code and my {distr6} package, which enables quick computation of many distributions and printing and plotting of functions:
library(survival)
library(survivalsvm)
set.seed(123)
n <- nrow(veteran)
train.index <- sample(1:n, 0.7 * n, replace = FALSE)
test.index <- setdiff(1:n, train.index)
survsvm.reg <- survivalsvm(Surv(diagtime, status) ~ .,
subset = train.index, data = veteran,
type = "regression", gamma.mu = 1,
opt.meth = "quadprog", kernel = "add_kernel")
print(survsvm.reg)
pred.survsvm.reg <- predict(object = survsvm.reg,
newdata = veteran, subset = test.index)
# load distr6
library(distr6)
# create a vector of normal distributions each with
# mean as the predicted time and with variance 1
# `decorators = "ExoticStatistics"` adds survival function
v = VectorDistribution$new(distribution = "Normal",
params = data.frame(mean = as.numeric(pred.survsvm.reg$predicted)),
shared_params = list(var = 1),
decorators = "ExoticStatistics")
# survival function evaluated at times = 1:10
v$survival(1:10)
# plot survival function for first individual
plot(v[1], fun = "survival")
# plot hazard function for first individual
plot(v[1], fun = "hazard")

How to obtain profile confidence intervals of the difference in probability of success between two groups from a logit model (glmer)?

I am struggling to transform the log odds ratio profile confidence intervals obtained from a logit model into probabilities. I would like to know how to calculate the confidence intervals of the difference between two groups.
If the p-value is > 0.05, the 95% CI of the difference should span from below zero to above zero. However, I don’t know how negative values can be obtained when the log ratios have to be exponentiated. Therefore I tried to calculate the CI of one of the groups (B) and see what the difference of the lower and the upper end of the CI to the estimate of group A is. I believe this is not the correct way to calculate the CI of the difference because the estimate of A is also uncertain.
I would be happy if anyone could help me out.
library(lme4)
# Example data:
set.seed(11)
treatment = c(rep("A",30), rep("B", 40))
site = rep(1:14, each = 5)
presence = c(rbinom(30, 1, 0.6),rbinom(40, 1, 0.8))
df = data.frame(presence, treatment, site)
# Likelihood ratio test
M0 = glmer(presence ~ 1 + (1|site), family = "binomial", data = df)
M1 = glmer(presence ~ treatment + (1|site), family = "binomial", data = df)
anova(M1, M0)
# Calculating confidence intervals
cc <- confint(M1, parm = "beta_")
ctab <- cbind(est = fixef(M1), cc)
cdat = as.data.frame(ctab)
# Function to back-transform to probability (0-1)
unlogit = function(y){
y_retransfromed = exp(y)/(1+exp(y))
y_retransfromed
}
# Getting estimates
A_est = unlogit(cdat$est[1])
B_est = unlogit(cdat$est[1] + cdat$est[2])
B_lwr = unlogit(cdat$est[1] + cdat[2,2])
B_upr = unlogit(cdat$est[1] + cdat[2,3])
Difference_est = B_est - A_est
# This is how I tried to calculate the CI of the difference
Difference_lwr = B_lwr - A_est
Difference_upr = B_upr - A_est
# However, I believe this is wrong because A_est is also “uncertain”
How to get the confidence interval of the difference of the probability of presence?
We can calculate the average treatment effect in the following way. From the original data, create two new datasets, one in which all units receive treatment A, and one in which all units receive treatment B. Now, based on your model estimates (in your case, M1), we compute predicted outcomes for units in each of these two datasets. We then compute the mean difference in the outcomes between the two datasets to get our estimated average treatment effect. Here, we can write a function that takes a glmer object and computes the average treatment effect:
ate <- function(.) {
treat_A <- treat_B <- df
treat_A$treatment <- "A"
treat_B$treatment <- "B"
c("ate" = mean(predict(., newdata = treat_B, type = "response") -
predict(., newdata = treat_A, type = "response")))
}
ate(M1)
# ate
# 0.09478276
How do we get the uncertainty interval? We can use the bootstrap, i.e. re-estimate the model many times using randomly generated samples from your original data, calculating the average treatment effect each time. We can then use the distribution of the bootstrapped average treatment effects to compute our uncertainty interval. Here we generate 100 simulations using the bootMer function
out <- bootMer(M1, ate, seed = 1234, nsim = 100)
and inspect the distribution of the effect:
quantile(out$t, c(0.025, 0.5, 0.975))
# 2.5% 50% 97.5%
# -0.06761338 0.10508751 0.26907504

Format in R for point prediction of survival analysis

I am befuddled by the format to perform a simple prediction using R's survival package
library(survival)
lung.surv <- survfit(Surv(time,status) ~ 1, data = lung)
So fitting a simple exponential regression (for example purposes only) is:
lung.reg <- survreg(Surv(time,status) ~ 1, data = lung, dist="exponential")
How would I predict the percent survival at time=400?
When I use the following:
myPredict400 <- predict(lung.reg, newdata=data.frame(time=400), type="response")
I get the following:
myPredict400
1
421.7758
I was expecting something like 37% so I am missing something pretty obvious
The point with this survival function is to find an empirical distribution that fits the survival times. Essentially you are associating a survival time with a probability. Once you have that distribution, you can pick out the survival rate for a given time.
Try this:
library(survival)
lung.reg <- survreg(Surv(time,status) ~ 1, data = lung) # because you want a distribution
pct <- 1:99/100 # this creates the empirical survival probabilities
myPredict400 <- predict(lung.reg, newdata=data.frame(time=400),type='quantile', p=pct)
indx = which(abs(myPredict400 - 400) == min(abs(myPredict400 - 400))) # find the closest survival time to 400
print(1 - pct[indx]) # 0.39
Straight from the help docs, here's a plot of it:
matplot(myPredict400, 1-pct, xlab="Months", ylab="Survival", type='l', lty=c(1,2,2), col=1)
Edited
You're basically fitting a regression to a distribution of probabilities (hence 1...99 out of 100). If you make it go to 100, then the last value of your prediction is inf because the survival rate in the 100th percentile is infinite. This is what the quantile and pct arguments do.
For example, setting pct = 1:999/1000 you get much more precise values for the prediction (myPredict400). Also, if you set pct to be some value that's not a proper probability (i.e. less than 0 or more than 1) you'll get an error. I suggest you play with these values and see how they impact your survival rates.

Resources