I am using the survival package to do competing risk analysis and would like to use the prediction to do simulation.
I plan to sample a uniform random x from [0, 1], find the termination time from survival curve that intersect with x; and then sample another random y from [0, hazard1 + hazard2] to decide which end state to choose.
I can extract the survival curve, but I don't know what the hazards are for the individual competing states. I copied the example below from the "compete" vignette of the survival package:
data(mgus2)
etime <- with(mgus2, ifelse(pstat==0, futime, ptime))
event <- with(mgus2, ifelse(pstat==0, 2*death, 1))
event <- factor(event, 0:2, labels=c("censor", "pcm", "death"))
cfit1 <- coxph(Surv(etime, event=="pcm") ~ age + sex + mspike, mgus2)
cfit2 <- coxph(Surv(etime, event=="death") ~ age + sex + mspike, mgus2)
newdata <- expand.grid(sex=c("F", "M"), age=c(60, 80), mspike=1.2)
newdata
temp <- matrix(list(), 3, 3)
dimnames(temp) <- list(from = c("Entry", "PCM", "Death"),
                       to   = c("Entry", "PCM", "Death"))
temp[1,2] <- list(survfit(cfit1, newdata, std.err=FALSE))
temp[1,3] <- list(survfit(cfit2, newdata, std.err=FALSE))
csurv <- survfit(temp, p0 =c(1,0,0))
It looks like csurv$pstate contains the survival curve and the cumulative hazards, but I don't understand their relationship. I tried to back out the survival curve by adding up the two cumulative hazards and taking exp, as below, but the results differ from the survival curve.
exp(-(csurv$pstate[,2]+csurv$pstate[,3]))
csurv$pstate[,1]
How are they related? How do I use them for simulation?
The documentation says there is a prev component of a survfit object, which is the "prevalence in state". (Perhaps Terry forgot to update the documentation when he changed the name?) Since it is a probability, at any given time the probabilities should sum to 1 across states, and therefore you would not be exponentiating them. When I tabulate the rowSums of the pstate component, all the sums (three values per row) are within numerical error of 1.0. So exponentiating the sums would give e, not what you expected. I'd drop the exp operation; it's already been done for you "behind the scenes."
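As a quick sanity check (a sketch using the objects above, not something from the vignette): the columns of pstate are state-occupancy probabilities rather than cumulative hazards, so each row sums to 1 and the "Entry" column is the survival curve itself.
summary(rowSums(csurv$pstate))  # every row sums to 1, up to numerical error
# column 1 ("Entry")  = probability of no event yet, i.e. the survival curve
# columns 2 and 3     = probability of currently being in the "PCM" / "Death" state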
My aim is to study how the inclusion of several risk factors improves a clinical model predicting the incidence of stroke in a survival analysis. I want to use the NRI to compare two models (a baseline model vs. the baseline model + a new risk factor). However, I would prefer to stratify the probabilities from these models into categorical variables (e.g., risk 0-3%, risk 4-6%, risk >7%, ...).
For that purpose I have prepared the following R code, but I am not sure whether it is correct or not and I would like to clarify it =).
I will use the open "lung" survival dataset in order to show a reproducible example. IMPORTANT: I will not consider censored data in this analysis, but censoring should be handled appropriately in further analyses:
library(survival)
library(pec)##To calculate probabilities of event at a point of time.
library(Hmisc)##To calculate NRI
data(lung)
lung <- na.omit(lung)
lung$status <- lung$status-1##necessary for NRI package
stats <- lung$status
tempo <- lung$time
##Create two models, 1 baseline and 1 baseline+new predictor
model1 <- coxph(Surv(tempo,stats)~age+sex, data=lung, x=T)
model2 <- coxph(Surv(tempo,stats)~age+sex+factor(ph.ecog), data=lung,
x=T)
##Estimate the survival probability at time= 500.
lung$x <- predictSurvProb(model1, newdata=lung, times=c(500))
lung$y <- predictSurvProb(model2, newdata=lung, times=c(500))
##Calculate the probability of event (1-survival) and we categorize it.
lung$x <- cut(1-lung$x, c(0, 0.25, 0.5, 0.75, 1), include.lowest=TRUE,
right=FALSE)
lung$y <- cut(1-lung$y, c(0, 0.25, 0.5, 0.75, 1), include.lowest=TRUE,
right=FALSE)
##Confusion matrix
cases <- subset(lung, stats==1)
controls <- subset(lung, stats==0)
table(lung$x, lung$y)##All patients
table(cases$x, cases$y)##cases
table(controls$x, controls$y)##controls
##Calculate NRI
x <- as.numeric(lung$x)/4##Predictions should be ranged between 0-1
y <- as.numeric(lung$y)/4
improveProb(x, y, stats)
My questions:
1) Is this the correct way to calculate the risk of the event at t=500?
2) Is it correct to calculate the NRI with this function? I have tried other packages (e.g., NRIcens) and I obtained the same results…
Thanks in advance for your help!
I am building a factor model to estimate future equity returns. I'd like to include an autoregressive residual term in this model: yesterday's error (the difference between yesterday's predicted and actual return) should enter the regression as an independent variable. What type of autoregressive model is this called? I've searched through various time series econometrics texts and have not found this particular model described. My current solution in R is to rerun the regression at every discrete time step (t) and manually include yesterday's residual, but I am curious whether there is a more efficient method or a package that does this.
Below is some sample code without the residual term included:
Data:
# fake data
set.seed(333)
df <- data.frame(seq(as.Date("2017/1/1"), as.Date("2017/2/19"), "days"),
matrix(runif(50*506), nrow = 50, ncol = 506))
names(df) <- c("Date", paste0("var", 1:503), c("mktrf", "smb", "hml"))
Then I store my necessary variables for regression:
1. All the independent variables (the three factors):
x <- df[, 505:507]
2. All the dependent variables (the 503 return series):
y <- df[, 2:504]
3. Fit all the models:
list_models_AR <- lapply(y, function(y)
  with(x, lm(y ~ mktrf + smb + hml, na.action = na.exclude)))
It’s an ARIMA(0, 0, 1) model with regressors, i.e. a regression with MA(1) errors.
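For what it's worth, here is a minimal sketch (my own addition, not part of the original answer) of fitting that model in one step with base R's arima(), using the fake df built above and just the first series:
# regression with MA(1) errors: the three factors enter as external regressors
xreg <- as.matrix(df[, c("mktrf", "smb", "hml")])
fit_ma1 <- arima(df$var1, order = c(0, 0, 1), xreg = xreg)
fit_ma1
# the same call can be wrapped in lapply() over var1:var503, as in the lm() version above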
I am a complete newbie to R.
I have the following logit equation I am estimating:
allAM <- glm (AM ~ VS + Prom + LS_Exp + Sex + Age + Age2 + Jpart + X2004LS + X2009LS + X2014LS + factor(State), family = binomial(link = "logit"), data = mydata)
AM is a standard binary (happened/didn’t happen). The three “X****LS” variables are dummies indicating different sessions of congress and “factor(State)” is used to generate fixed effects/dummies for each state.
VS is the key independent variable of interest, and I want to generate the predicted probability that AM=1 for each value of VS between 0 and 60, holding everything else at its mean.
I am running into trouble, however, generating and plotting the predicted probabilities because “State” is a factor. I want to be able to show the average effects, not 50 different charts/effects for each state.
Per Hanmer and Kalkan (2013) (http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2012.00602.x/abstract), I was advised to do the following to plot the predicted probabilities:
pred.seq <- seq(from=0, to=60, by=0.01)
pred.out <- c()
for(i in 1:length(pred.seq)){
  mydata.c <- mydata
  mydata.c$VS <- pred.seq[i]
  pred.out[i] <- mean(predict(allAM, newdata=mydata.c, type="response"))
}
plot(pred.out ~ pred.seq, type="l")
This approach seems to work, though I don’t really understand it.
I want to add the upper and lower 95% confidence intervals to the plot, but when I attempt to do it by hand the way I know how:
lower <- pred.out$fit - (1.96*pred.out$se.fit)
upper <- pred.out$fit + (1.96*pred.out$se.fit)
I get the following error:
Error in pred.out$fit : $ operator is invalid for atomic vectors
Can anyone advise how I can plot the confidence intervals and how I can specify different levels of VS so that I can report some specific predicted probabilities?
I am befuddled by the syntax needed to perform a simple prediction using R's survival package.
library(survival)
lung.surv <- survfit(Surv(time,status) ~ 1, data = lung)
So fitting a simple exponential regression (for example purposes only) is:
lung.reg <- survreg(Surv(time,status) ~ 1, data = lung, dist="exponential")
How would I predict the percent survival at time=400?
When I use the following:
myPredict400 <- predict(lung.reg, newdata=data.frame(time=400), type="response")
I get the following:
myPredict400
1
421.7758
I was expecting something like 37%, so I am missing something pretty obvious.
The point with this survival function is to find an empirical distribution that fits the survival times. Essentially you are associating a survival time with a probability. Once you have that distribution, you can pick out the survival rate for a given time.
Try this:
library(survival)
lung.reg <- survreg(Surv(time,status) ~ 1, data = lung) # because you want a distribution
pct <- 1:99/100 # this creates the empirical survival probabilities
myPredict400 <- predict(lung.reg, newdata=data.frame(time=400),type='quantile', p=pct)
indx = which(abs(myPredict400 - 400) == min(abs(myPredict400 - 400))) # find the closest survival time to 400
print(1 - pct[indx]) # 0.39
Straight from the help docs, here's a plot of it:
matplot(myPredict400, 1-pct, xlab="Months", ylab="Survival", type='l', lty=c(1,2,2), col=1)
Edited
You're basically fitting a regression to a distribution of probabilities (hence 1...99 out of 100). If you make it go to 100, the last value of the prediction is Inf, because the survival time at the 100th percentile is infinite. This is what the type='quantile' and p=pct arguments do.
For example, setting pct <- 1:999/1000 gives much more precise values for the prediction (myPredict400). Also, if you set pct to a value that is not a proper probability (i.e., less than 0 or greater than 1) you'll get an error. I suggest you play with these values and see how they affect the estimated survival rates.
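As a cross-check, here is a sketch of my own (not from the original answer) that uses the exponential fit from the question directly: for an intercept-only exponential survreg model, the value returned by type="response" is the fitted mean survival time exp(mu), and the survival probability at time t is exp(-t / exp(mu)).
lung.exp <- survreg(Surv(time, status) ~ 1, data = lung, dist = "exponential")
mean.time <- exp(coef(lung.exp))  # fitted mean survival time; the ~421.8 seen in the question
exp(-400 / mean.time)             # S(400) under the exponential model, roughly 0.39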
I performed a stratification on the propensity score via the MatchIt package:
install.packages("MatchIt")
library(MatchIt)
# Stratification on the propensity score
stra <- matchit(treat ~ X1 + X2 + X3, data = data.mc, method = "subclass")
smry.stra <- summary(stra, standardize = TRUE)
data.stra <- match.data(stra)
Now I would like to calculate the variance ratio of the propensity scores between the treatment and the control group. Could I just calculate the variances for the total treatment and control group?
stra.ratio <- var(data.stra$distance[data.stra$treat == 1]) /
var(data.stra$distance[data.stra$treat == 0])
Or would I have to consider the strata and the weights of the strata somehow?
Thank you very much in advance!
From the package documentation, the description of the weights value:
"Each matched control unit has weight proportional to the number of treatment units to which it was matched, and the sum of the control weights is equal to the number of uniquely matched control units."
This indicates to me that yes, we should take the weighted variance. It wouldn't surprise me if the weighted treatment variance always equals the unweighted one, since the treatment weights should all be one, but we will definitely see a difference in the control variance.
Here is a reproducible example using the data that came with the MatchIt package:
library(MatchIt)
library(SDMTools)
data(lalonde)
stra <- matchit(treat ~ age + educ + black + hispan + married + nodegree, data = lalonde, method = "subclass")
data.stra <- match.data(stra)
treatment <- data.stra[data.stra$treat == 1, ]
control <- data.stra[data.stra$treat == 0, ]
stra.ratio <- wt.var(treatment$distance, treatment$weights) / wt.var(control$distance, control$weights)
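A small follow-up of my own (not in the original answer): printing the weighted ratio next to the unweighted one from the question makes the effect of the weights visible.
stra.ratio                                        # weighted variance ratio
var(treatment$distance) / var(control$distance)   # unweighted ratio, as in the question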