My aim is to study how the inclusion of several risk factors improves a clinical model predicting the incidence of stroke in a survival analysis. I want to use the NRI to compare two models (a baseline model vs. the baseline model plus a new risk factor). However, I would prefer to stratify the predicted probabilities of these models into categories (e.g., risk 0-3%, risk 4-6%, risk >7%, ...).
For that purpose I have prepared the following R code, but I am not sure whether it is correct and I would like to clarify it.
I will use the open “lung” survival dataset to show a reproducible example. IMPORTANT: I will not account for censoring in this analysis, but it should be handled appropriately in further analyses:
library(survival)
library(pec) ## to calculate survival probabilities at a given time point
library(Hmisc) ## to calculate the NRI (improveProb)
data(lung)
lung <- na.omit(lung)
lung$status <- lung$status - 1 ## recode status from 1/2 to 0/1 for the NRI calculation
stats <- lung$status ## event indicator
tempo <- lung$time ## follow-up time
## Create two models: baseline and baseline + new predictor
model1 <- coxph(Surv(tempo, stats) ~ age + sex, data = lung, x = TRUE)
model2 <- coxph(Surv(tempo, stats) ~ age + sex + factor(ph.ecog), data = lung, x = TRUE)
##Estimate the survival probability at time= 500.
lung$x <- predictSurvProb(model1, newdata=lung, times=c(500))
lung$y <- predictSurvProb(model2, newdata=lung, times=c(500))
## Calculate the probability of event (1 - survival) and categorize it
lung$x <- cut(1-lung$x, c(0, 0.25, 0.5, 0.75, 1), include.lowest=TRUE,
right=FALSE)
lung$y <- cut(1-lung$y, c(0, 0.25, 0.5, 0.75, 1), include.lowest=TRUE,
right=FALSE)
##Confusion matrix
cases <- subset(lung, stats==1)
controls <- subset(lung, stats==0)
table(lung$x, lung$y)##All patients
table(cases$x, cases$y)##cases
table(controls$x, controls$y)##controls
##Calculate NRI
x <- as.numeric(lung$x)/4 ## improveProb expects predictions between 0 and 1
y <- as.numeric(lung$y)/4
improveProb(x, y, stats)
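As a cross-check of improveProb, this is how I would compute the categorical NRI by hand from the reclassification tables above (a rough sketch that, like the rest of this example, ignores censoring):
up <- as.numeric(lung$y) > as.numeric(lung$x) ## reclassified to a higher risk category
down <- as.numeric(lung$y) < as.numeric(lung$x) ## reclassified to a lower risk category
nri_events <- mean(up[stats == 1]) - mean(down[stats == 1])
nri_nonevents <- mean(down[stats == 0]) - mean(up[stats == 0])
nri_events + nri_nonevents ## overall categorical NRI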
My questions:
1) Is this way of calculating the risk of the event at t = 500 correct?
2) Is it correct to calculate the NRI with this function? I have tried other packages (e.g., nricens) and obtained the same results…
Thanks in advance for your help!
I would like to do the following analyses on a dataset with missing values. Because the mice and MatchThem packages do not support pooling the results of a Kaplan-Meier analysis, I am trying to do it manually as follows:
1. Do multiple imputation using mice.
2. Calculate inverse probability weights in each imputed dataset using WeightIt.
3. Estimate IPW-adjusted Kaplan-Meier curves in each imputed dataset using survfit.
4. Pool the results of #3 and plot the pooled IPW-adjusted KM curves.
5. Calculate the difference in IPW-adjusted restricted mean survival time (the area under the KM curve up to a specific time point) within each imputed dataset, following akm-rmst (https://github.com/s-conner/akm-rmst).
6. Pool the results of #5.
7. Get descriptive statistics of the baseline characteristics in the imputed datasets using tbl_summary from the gtsummary package.
Here is my code:
pacman::p_load(survival, survey, survminer, WeightIt, tidyverse, mice)
df # sample dataset
m <- 10 # number of imputations
dimp <- mice::mice(df, m = m, seed = 123)
for (i in 1:m) {
  dcomp <- mice::complete(dimp, i) # extract the i-th imputed dataset
  # estimate inverse probability weights (propensity score model)
  wgt <- weightit(
    treatment ~ age + sex + smoking,
    data = dcomp, method = "ps", estimand = "ATE", stabilize = TRUE
  )
  # add weights and propensity scores to the imputed dataset
  dat <- tibble(dcomp, wgt = wgt[["weights"]], pscores = wgt[["ps"]])
  assign(paste0("df", i), dat) # save the i-th imputed dataset
  # estimate the IPW-adjusted Kaplan-Meier curves
  surv <- survival::survfit(Surv(time, event) ~ treatment, data = dat, weights = wgt)
  assign(paste0("surv", i), surv) # save the i-th IPW-adjusted KM curves
}
This code covers steps #1 to #3. Although I read the reference (https://stefvanbuuren.name/fimd/sec-pooling.html), I could not find how to do steps #4 to #7. Can anyone give me some advice regarding #4 to #7?
I believe this is not a duplicate of any posted question, so any assistance you can provide would be greatly appreciated.
Regarding your point #7: when the imputation presumably requires a large number of datasets (m = 20, 40, or >50), you cannot simply pick one dataset at random; you risk a type I error and lose the benefit of the imputation. I had the same concerns as you. This thread could help you (though only for summarizing imputed descriptive data): Björn's answer on StackExchange.
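Regarding #6 (pooling the RMST differences), you could apply Rubin's rules to the per-imputation estimates. A minimal sketch, assuming est and se are hypothetical vectors holding the RMST-difference estimate and its standard error returned by the akm-rmst code in each of the m imputed datasets:
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)              # pooled point estimate
  ubar <- mean(se^2)             # within-imputation variance
  b    <- var(est)               # between-imputation variance
  tvar <- ubar + (1 + 1/m) * b   # total variance (Rubin's rules)
  c(estimate = qbar, se = sqrt(tvar))
}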
I have around 250 series of annual maximum rainfall measurements stored in a matrix maxima, and I want to apply quantile regression to all series at once and obtain the significance of each regression model in R.
library(quantreg)
qmag <- array(NA, c(250,4))
taus <- c(0.05, 0.1, 0.95, 0.975)
for(igau in 1:250){
qure <- rq(maxima[,igau+1]~maxima[,1], tau=taus)
qmag[igau,] <- coef(qure)[2,]
}
I've tried
summary(qure, se="boot")$p.value
ci(qure)
and other similar variations but get NULL values. Is it actually possible to automatically extract the p-values from quantreg to a table, rather than just viewing them individually in summary() for each model?
Have a look at the structure of the summary object by running str() on it:
require(quantreg)
data(engel)
mod <- rq(foodexp ~ income, data = engel)
summ <- summary(mod, se = "boot")
summ
str(summ)
summ$coefficients[,4] # the fourth coefficient column holds the p-values (Pr(>|t|))
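Applied to the loop in your question, something along these lines should collect the slope p-values into a table (a sketch, assuming the maxima matrix and taus defined there):
pvals <- array(NA, c(250, 4)) # one row per gauge, one column per tau
for (igau in 1:250) {
  qure <- rq(maxima[, igau + 1] ~ maxima[, 1], tau = taus)
  summs <- summary(qure, se = "boot") # with several taus this is a list, one summary per tau
  # row 2 is the slope; the fourth coefficient column is Pr(>|t|)
  pvals[igau, ] <- sapply(summs, function(s) s$coefficients[2, 4])
}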
I am using the survival package to do competing risk analysis and would like to use the predictions to do simulation.
I plan to sample a uniform random x from [0, 1], find the termination time at which the survival curve intersects x, and then sample another random y from [0, hazard1 + hazard2] to decide which end state to choose.
I can extract the survival curve but don't know what the hazards are for the individual competing states. I copied the example from the "compete" vignette of the survival package below:
data(mgus2)
etime <- with(mgus2, ifelse(pstat==0, futime, ptime))
event <- with(mgus2, ifelse(pstat==0, 2*death, 1))
event <- factor(event, 0:2, labels=c("censor", "pcm", "death"))
cfit1 <- coxph(Surv(etime, event=="pcm") ~ age + sex + mspike, mgus2)
cfit2 <- coxph(Surv(etime, event=="death") ~ age + sex + mspike, mgus2)
newdata <- expand.grid(sex=c("F", "M"), age=c(60, 80), mspike=1.2)
newdata
temp <- matrix(list(), 3,3)
dimnames(temp) <- list(from=c("Entry", "PCM", "Death"),
to =c("Entry", "PCM", "Death"))
temp[1,2] <- list(survfit(cfit1, newdata, std.err=FALSE))
temp[1,3] <- list(survfit(cfit2, newdata, std.err=FALSE))
csurv <- survfit(temp, p0 =c(1,0,0))
It looks like csurv$pstate contains the survival curve and the cumulative hazards, but I don't understand their relationship. I tried to back out survival by adding up the two cumulative hazards and taking exp, as below, but the results differ from the survival curve.
exp(-(csurv$pstate[,2]+csurv$pstate[,3]))
csurv$pstate[,1]
How are they related? How do I use them for simulation?
The documentation says there is a prev component of a survfit object, which is the "prevalence in state". (Perhaps Terry forgot to update the documentation when he changed the name to pstate?) Since these are probabilities, at any given time they should sum to 1 across states, and therefore you should not be exponentiating them. When I tabulate the rowSums of the pstate component, all the sums (three values per row) are within numerical error of 1.0, so exponentiating such a sum would just give e, not what you expected. I'd drop the exp operation; it has already been done for you "behind the scenes."
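A quick check along these lines, reusing the objects from your question (and mirroring its two-dimensional indexing of pstate, which may need adjusting if pstate comes back as an array when newdata has several rows):
range(rowSums(csurv$pstate)) ## every row sums to 1 within numerical error
head(csurv$pstate[, 1]) ## probability of still being in the "Entry" state
head(1 - csurv$pstate[, 2] - csurv$pstate[, 3]) ## identical to column 1; no exp() needed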
I am befuddled by the syntax needed to perform a simple prediction using R's survival package.
library(survival)
lung.surv <- survfit(Surv(time,status) ~ 1, data = lung)
So fitting a simple exponential regression (for example purposes only) is:
lung.reg <- survreg(Surv(time,status) ~ 1, data = lung, dist="exponential")
How would I predict the percent survival at time=400?
When I use the following:
myPredict400 <- predict(lung.reg, newdata=data.frame(time=400), type="response")
I get the following:
myPredict400
1
421.7758
I was expecting something like 37%, so I am missing something pretty obvious.
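For reference, the nonparametric Kaplan-Meier estimate from lung.surv at t = 400 is roughly the value I have in mind:
summary(lung.surv, times = 400)$surv # Kaplan-Meier survival probability at day 400, around the 37% I expect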
The point of this survival regression is to fit a distribution to the survival times. Essentially you are associating each survival time with a probability. Once you have that distribution, you can pick out the survival rate for a given time.
Try this:
library(survival)
lung.reg <- survreg(Surv(time,status) ~ 1, data = lung) # because you want a distribution
pct <- 1:99/100 # a grid of probabilities at which to evaluate the fitted quantiles
myPredict400 <- predict(lung.reg, newdata=data.frame(time=400), type='quantile', p=pct)
indx <- which.min(abs(myPredict400 - 400)) # index of the predicted survival time closest to 400
print(1 - pct[indx]) # 0.39
Straight from the help docs, here's a plot of it:
matplot(myPredict400, 1-pct, xlab="Months", ylab="Survival", type='l', lty=c(1,2,2), col=1)
Edited
You're basically evaluating the fitted distribution over a grid of probabilities (hence 1...99 out of 100). If you make it go to 100, the last value of the prediction is Inf, because the survival time at the 100th percentile is infinite. This is what the quantile type and the p (here pct) argument control.
For example, setting pct = 1:999/1000 gives much more precise values for the prediction (myPredict400). Also, if you set pct to a value that is not a proper probability (i.e., less than 0 or more than 1), you'll get an error. I suggest you play with these values and see how they impact your survival rates.
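If you prefer to avoid the lookup over a grid of quantiles, a more direct route (a sketch based on the same lung.reg fit) is to evaluate the fitted distribution at t = 400 with psurvreg():
lp <- predict(lung.reg, type = "lp")[1] # intercept-only model, so the linear predictor is the same for everyone
1 - psurvreg(400, mean = lp, scale = lung.reg$scale, distribution = lung.reg$dist) # should agree with the ~0.39 above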
The aim is to find the causal relations of breakdown (0/1) for a washing machine. The commands below, which simulate data for a logistic model, created randomized predictor variables, but the dependent variable came out as all 1s. A similar simulation of another logistic model did produce randomized 0 and 1 values.
The variables Hrs2, WashCap, and SpinSp are created as normal draws with n = 3000. I altered the mean and SD of each to fit the desired interval on the x-axis of its histogram.
# Hours2
set.seed(600)
Hrs2 <- rnorm(3000, mean=300, sd=100)
#WashCap
set.seed(5)
WashCap <- rnorm(3000, mean=2.5, sd=1)
#SpinSp
set.seed(1100)
SpinSp <- rnorm(3000, mean=550, sd=250)
The logistic part is then built from a linear combination of the variables plus a bias term:
z=1 + 2*Hrs2 + 3*WashCap + 4*SpinSp
pr = 1/(1+exp(-z))
y <- rbinom(3000,1,pr)
WashMa = data.frame(y=y, Hrs2=Hrs2, WashCap=WashCap, SpinSp=SpinSp)
glm( y~Hrs2+WashCap+SpinSp,data=WashMa, family=binomial)
Any thoughts on how to fix the dependent variable so that it shows variation?
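For what it's worth, with the means above z is in the thousands, so 1/(1+exp(-z)) is numerically 1 for every row and rbinom() can only return 1s. One way to get variation (a sketch, not part of the original setup) is to keep z on a modest scale, e.g. by standardizing the predictors before forming the linear combination:
z <- as.vector(1 + 2*scale(Hrs2) + 3*scale(WashCap) + 4*scale(SpinSp)) # linear predictor now near 0
pr <- 1/(1 + exp(-z))
y <- rbinom(3000, 1, pr)
table(y) # now contains both 0s and 1s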