I want to use IPTW (inverse probability of treatment weighting) to find the effect of a medication on cardiovascular death (1), which competes with non-cardiovascular death (2) and survival (0).
After the IPTW, I would like to run a competing-risks analysis to find the effect of the medication on cardiovascular death and plot the resulting Kaplan-Meier curve.
This is my start
library(tableone)
library(cmprsk)  # crr() is in the cmprsk package; there is no crr package
library(ipw)
library(sandwich)
library(survey)
treatment<-as.numeric(df$treatment==1)
#propensity score model
psmodel <- glm(treatment ~ age + sex, family = binomial, data = df)  # logistic regression for the propensity score
ps<-predict(psmodel, type="response")
#weights
weight<-ifelse(treatment==1,1/(ps),1/(1-ps))
age<-as.numeric(df$age)
sex<-as.numeric(df$sex==1)
cov1 <- cbind(age, sex, weight, treatment)
ftime <- df$Survival
fstatus <- df$Outcome
competingrisk <- crr(ftime, fstatus, cov1, failcode = 1)  # cardiovascular death is coded 1
summary(competingrisk)
My dataset looks like this, but with 500 rows.
structure(list(Outcome = c(1, 1, 1, 2, 2, 2, 2, 0, 0, 1, 1, 0,
0, 1, 1, 0, 0, 2), Survival = c(7, 13, 14, 8, 9, 15, 14, 16,
14, 3, 7, 13, 14, 8, 9, 15, 16, 4), treatment = c(1, 0, 0, 1,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0), age = c(59, 58, 57,
56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42),
BMI = c(25, 24, 23, 22, 21, 20, 29, 28, 27, 26, 25, 24, 25,
24, 23, 22, 21, 20), sex = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -18L))
A Kaplan-Meier plot is naive to multiple covariates and to competing risks. I highly doubt it is the best way to analyse your data in this instance; however, it can be useful for exploration, so here is the solution.
If you want the four Kaplan-Meier curves (one per treatment arm for each of the two event types), you can use the following code:
library(survival)
# create a KM model for cardiovascular death
mod1 <- survfit(Surv(Survival, Outcome == 1) ~ treatment, data = df)
# plot it
plot(mod1,
     conf.int = TRUE,          # show confidence intervals
     col = c("red", "green"))  # treatment=0 is red
# see ?plot.survfit for more parameters
# and the same for non-cardiovascular death
mod2 <- survfit(Surv(Survival, Outcome == 2) ~ treatment, data = df)
# plot it
plot(mod2,
     conf.int = TRUE,          # show confidence intervals
     col = c("red", "green"))  # treatment=0 is red
However, as stated above, KM plots are naive to competing risks: the probabilities do not take the other event into consideration, so at certain time points the estimated probabilities of the two events can sum to more than 100%. You would likely be better off plotting the cumulative incidence function (CIF), which does take the competing risks into account. The cmprsk package has the cuminc() function for this.
library(cmprsk)
cif <- cuminc(df$Survival,df$Outcome,df$treatment)
plot(cif)
# See ?plot.cuminc for more parameters
However, I find these base plots quite unappealing, so I would highly recommend the survminer package, which utilises ggplot2 to create better plots:
library(survminer)
ggsurvplot(mod1)
ggsurvplot(mod2)
ggcompetingrisks(cif)
If you wish to analyse the data and find the effect of the treatment using the IPTW, you can use the crr() function as in your question; this returns the subdistribution proportional hazards regression coefficients (i.e. the results of a Fine & Gray model). These can be interpreted in much the same way as the coefficients of a Cox proportional hazards model, whilst accounting for the competing risks. Bear in mind that IPTW is not a panacea, so these results may still be non-causal. Note also that crr() has no weights argument, so passing the weight as a covariate (as in your cov1) adjusts for it rather than weighting by it; see the sketch after the prediction code below.
Once you have the Fine & Gray model results from crr(), you can create a plot of the predicted subdistribution curves for the two treatments by using the following code (assuming you have made the competingrisk object above). predict.crr() needs a value for every covariate used in the fit, so the other covariates are held at their means here:
# one row per profile to predict: treatment 0 vs 1
cov2 <- cbind(age = mean(age), sex = mean(sex), weight = mean(weight), treatment = c(0, 1))
cr_pred <- predict(competingrisk, cov2)
plot(cr_pred)
# see ?plot.predict.crr for parameters
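As noted above, crr() cannot apply the IPTW weights directly. Here is a minimal sketch of an alternative route, using survival::finegray() to expand the data and then fitting a weighted Cox model on the subdistribution data. This is my suggestion rather than part of the original answer; it assumes the df and weight objects created in the question, and robust = TRUE is used because model-based standard errors are not valid with weighting:
library(survival)
# finegray() wants the status as a factor whose first level means censoring
df$event <- factor(df$Outcome, levels = c(0, 1, 2),
                   labels = c("censored", "cvd", "other"))
df$iptw <- weight
fgdat <- finegray(Surv(Survival, event) ~ ., data = df, etype = "cvd")
# combine the finegray() case weights with the IPTW weights
coxph(Surv(fgstart, fgstop, fgstatus) ~ treatment,
      weights = fgwt * iptw, robust = TRUE, data = fgdat)
The exponentiated treatment coefficient is then an IPTW-weighted subdistribution hazard ratio for cardiovascular death.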
A few good resources are:
This blogpost by Emily Zabor
Tutorial in Biostatistics by Putter et al.
I'm struggling to set up a data frame for multistate survival analysis. Here is a reproducible example with only 3 individuals (ID). This is only a part of the multistate model.
f <- structure(list(ID = c(3, 4, 5), time_to_end = c(30, 36, 36)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
f.long <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5), resp_pois = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), time = c(6, 12, 18, 24, 30, 36, 42, 48, 6, 12, 18, 24, 30, 36, 42, 48, 6,
12, 18, 24, 30, 36, 42, 48)), row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"))
f includes 3 individual taking a new drug and observed for time_to_end.
f.long includes the biochemical response at different time points.
E.g., ID 3 responds to the drug immediately and up to 24 months, but at 30 months there is evidence of loss of biochemical response, while ID 4 shows only an isolated response at 24 months, never before or after.
I'm trying to manage this reversible condition between response and no response with tmerge, as follows:
library(survival)
library(dplyr)
f.merge <- tmerge(f %>% select(ID), f, id = ID, tstart = 0, tstop = time_to_end)
f.merge <- tmerge(f.merge, f.long, id = ID, response = event(time, resp_pois))
survfit(Surv(tstart, tstop, response) ~ 1, data = f.merge)
The problem is that tmerge interprets every resp_pois == 1 as a new event, so in the end the survfit function gives 5 events instead of 2.
Can someone suggest a solution? Am I misusing tmerge?
I got this answer from Prof Terry Therneau by e-mail
Okay,
0. When debugging a purported issue with the survival package, my first step is to run in an R session with the survival package loaded, and NOTHING else.
I can't read your data set: I'm getting a syntax error somewhere. But by cutting and pasting, I was able to create a "de-tibbled" version of the data.
But what exactly is the data? The variable name "resp_pois" means nothing to me. You gave some information on where you want to go, but none about where you are starting. I will hazard a guess that you are dealing with panel data, i.e., subjects come in at regular intervals and you measure their state at each visit?
Now, tmerge has no way to distinguish a data set with multiple heart attacks and one row per attack from a panel-study data set.
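The e-mail excerpt ends there. For what it's worth, here is a minimal sketch of one way to act on that diagnosis (my own suggestion, not from the e-mail): collapse the panel rows so that tmerge() only sees changes in response status rather than every visit.
library(survival)
library(dplyr)
# keep only visits where the response status differs from the previous
# visit, so repeated 1s within a response episode are not re-counted
f.changes <- f.long %>%
    group_by(ID) %>%
    arrange(time, .by_group = TRUE) %>%
    filter(resp_pois != lag(resp_pois, default = 0)) %>%
    ungroup()
f.merge <- tmerge(f %>% select(ID), f, id = ID, tstart = 0, tstop = time_to_end)
f.merge <- tmerge(f.merge, f.changes, id = ID, response = event(time, resp_pois))
survfit(Surv(tstart, tstop, response) ~ 1, data = f.merge)  # 2 events with this toy data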
I have a dataset consisting of observations of the developmental time and survival of an insect.
Developmental time is the time in days between egg hatching (eclosion) and adult emergence, while survival of adults is the time between adult emergence and death (or censoring).
In my analysis, I am plotting a Kaplan-Meier reversed survival curve (ggsurvplot(survfit_obj, fun = "event")) for developmental time, while for adult survival I am using a classical Kaplan-Meier curve.
Since I have these two time-to-event variables for every unique individual, I am wondering if it is possible to horizontally merge the two resulting Kaplan-Meier curves into a single comprehensive one, possibly with a different estimator.
Actually, I am not even sure whether this has a solid statistical meaning; it just makes sense graphically (see the last figure).
Keep in mind that the two time-to-events variables are substantially different, they are not representing a single recurrent event.
Below you can find my reproducible example.
Variable legend:
treat -> treatment
days2emerge -> days from egg to adult
new.ad = 1 -> successful adult emergence
days2event -> days from adult emergence to death/censoring
death = 1 -> adult death
death = 0 -> adult censoring
library(tidyverse)
library(survival)
library(survminer)
#import database
db <- matrix(c("BRA20", "BA84", "BRA20", "BRA20", "BRA20", "BRA20", "BRA20", "BRA20", "BA84", "BRA20", "BA84", "BA84", "BRA20", "BA84", "BRA20", "BA84", "BA84", "BA84", "BRA20", "BA84", "BA84", "BA84", "BRA20", "BRA20", "BA84", "BA84", "BRA20", "BA84", 45, 27, 34, 45, 45, 56, 59, 45, 27, 42, 56, 31, 52, 27, 56, 27, 31, 59, 42, 52, 27, 34, 49, 38, 34, 63, 52, 31, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 29, 50, 11, 14, 17, 35, 14, 39, 35, 14, 35, 7, 43, 35, 50, 21, 32, 17, 11, 11, 25, 51, 28, 15, 7, 25, 14, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0), ncol = 5, byrow = F)
colnames(db) <- c("treat", "days2emerge", "new.ad", "days2event", "death")
db <- as_tibble(db)
db$days2emerge <- as.numeric(db$days2emerge)
db$new.ad <- as.numeric(db$new.ad)
db$days2event <- as.numeric(db$days2event)
db$death <- as.numeric(db$death)
#nymph developmental time
db_devtfit <- survfit(Surv(days2emerge, new.ad) ~ treat, data = db)
np <- ggsurvplot(db_devtfit, data = db, fun = "event", linetype = c("strata"), legend.title = "Groups")
#adult survival
db_survfit <- survfit(Surv(days2event, death) ~ treat, data = db)
sp <- ggsurvplot(db_survfit, data = db, censor.shape = c("X"), linetype = c("strata"), legend.title = "Groups")
Considering this example, is there a way to horizontally merge the two different Kaplan-Meier curves, taking into account the statistics behind it?
A picture to somehow graphically represent what I mean:
I don't think you can do this natively in ggsurvplot, but you can strip the data out of your plots and easily build a new one:
np_dat <- np$data.survplot                     # data behind the development-time plot
np_dat$time <- np_dat$time - max(np_dat$time)  # shift so adult emergence sits at time 0
np_dat$surv <- 1 - np_dat$surv                 # flip the cumulative-event curve back to a survival scale
df <- rbind(np_dat, sp$data.survplot)          # append the adult-survival curves
ggplot(df, aes(time, surv, color = strata)) +
geom_step() +
geom_vline(xintercept = 0, linetype = 2) +
theme_classic() +
labs(y = 'probability of being live adult',
x = 'time from emergence')
Context
Often in dose-response models we regress a range of doses against a response variable, but we are really interested in identifying the dose required to elicit a particular response. Typically this is done with inverse regression techniques (i.e. after-fitting / reparameterisation). EDIT: To clarify, this is commonly done when you need to estimate the dose required to kill, say, 50%, or 99.99% for quarantine protocols. To derive these estimates, people employ inverse regression techniques; the above link goes through this more carefully (see page 9).
Question
How can I carry out these inverse regression procedures using methods like robust linear models, quantile regression, or machine learning models (e.g. neural networks or support vector machines)? EDIT: To clarify, I want a programming solution for estimating the dose required to elicit a response of 99.99 when the fitted model is one of those mentioned above. I have fitted example models below to these ends.
My data looks like this:
df <- structure(list(Response = c(100, 91.1242603550296, 86.9822485207101,
100, 0, 0, 90.5325443786982, 95.8579881656805, 88.7573964497041,
96.4497041420118, 82.2485207100592, 99.4082840236686, 99.4082840236686,
98.8165680473373, 91.7159763313609, 59.1715976331361, 44.9704142011834,
0, 100, 95.2662721893491, 100, 82.8402366863905, 7.69230769230769,
81.6568047337278, 62.7218934911243, 97.6331360946746, 73.9644970414201,
8.87573964497041, 0, 98.8165680473373, 78.1065088757396, 98.2248520710059,
52.6627218934911, 96.4497041420118, 52.0710059171598, 0, 62.043795620438,
84.6715328467153, 97.8102189781022, 4.37956204379562, 89.051094890511,
99.2700729927007, 99.2700729927007, 97.0802919708029, 81.7518248175183,
80.2919708029197, 90.5109489051095, 99.2700729927007, 96.3503649635037,
0, 0, 94.8905109489051, 79.5620437956204, 67.8832116788321, 73.7226277372263,
100, 97.0802919708029, 93.4306569343066, 86.8613138686131, 33.5766423357664,
32.1167883211679, 46.7153284671533, 98.5401459854015, 95.6204379562044,
86.1313868613139, 14.5985401459854, 92.7007299270073, 86.1313868613139,
0, 77.3722627737226, 89.051094890511, 80.2919708029197, 98.1818181818182,
96.3636363636364, 30.9090909090909, 0, 60.9090909090909, 100,
0, 83.6363636363636, 88.1818181818182, 97.2727272727273, 0, 0,
99.0909090909091, 100, 100, 91.8181818181818, 88.1818181818182,
46.3636363636364, 50.9090909090909, 99.0909090909091, 97.2727272727273,
100, 0, 92.7272727272727, 60.9090909090909, 90.9090909090909,
57.2727272727273, 76.3636363636364, 94.5454545454545, 50, 98.1818181818182,
16.3636363636364, 87.2727272727273, 92.7272727272727, 87.2727272727273,
88.1818181818182, 10.7438016528926, 91.7355371900827, 98.3471074380165,
60.3305785123967, 95.8677685950413, 0, 63.6363636363636, 71.900826446281,
0, 74.3801652892562, 76.8595041322314, 0, 61.9834710743802, 0,
0, 0, 84.297520661157, 47.1074380165289, 69.4214876033058, 97.5206611570248,
100, 61.1570247933884, 90.0826446280992, 78.5123966942149, 10.7438016528926,
100, 98.3471074380165, 100, 98.3471074380165, 93.3884297520661,
90.9090909090909, 57.8512396694215, 57.8512396694215, 92.5619834710744,
77.6859504132231, 69.4214876033058), Covariate = c(20, 14, 14,
20, 0, 0, 14, 14, 14, 16, 10, 20, 20, 20, 16, 10, 10, 0, 16,
16, 16, 10, 0, 12, 10, 12, 12, 0, 0, 20, 12, 16, 10, 12, 12,
0, 14, 14, 16, 0, 14, 20, 16, 20, 14, 12, 12, 20, 20, 0, 0, 14,
12, 10, 10, 20, 16, 16, 14, 10, 10, 10, 20, 16, 10, 0, 12, 12,
0, 12, 16, 14, 16, 14, 0, 0, 12, 20, 0, 12, 14, 14, 0, 0, 20,
20, 20, 14, 14, 10, 10, 20, 16, 16, 0, 12, 10, 10, 10, 16, 16,
12, 20, 10, 12, 12, 16, 14, 0, 16, 20, 12, 14, 10, 10, 0, 0,
12, 12, 10, 10, 0, 0, 0, 14, 12, 12, 20, 20, 14, 14, 14, 12,
20, 20, 20, 16, 16, 14, 10, 10, 16, 16, 16)), row.names = 433:576, class = "data.frame")
with my formula usually being something like:
Response ~ Covariate + I(Covariate^2)
Here is an example of the models I have fitted:
#Robust linear model
MASS::rlm(Response ~ Covariate + I(Covariate^2), data = df)
#Quantile regression
quantreg::rq(Response ~ Covariate + I(Covariate^2), data = df, tau = c(0.5, 0.95))
# In this case I want to predict the specified quantiles for the dose required
# to elicit a given response, although I realised this code doesn't do that...
#Machine learning algorithms were trained with caret
library(caret)
TRControl <- trainControl(method = "cv")
#Neural Network
caret::train(Response ~ Covariate, data = df, method = "neuralnet", trControl = TRControl)
#Support Vector Machine (caret's method name is "svmPoly", not "polySVM")
caret::train(Response ~ Covariate, data = df, method = "svmPoly", trControl = TRControl)
Further to my comments above, your data doesn't really resemble that of a typical dose-response measurement:
library(ggplot2)
ggplot(df, aes(Covariate, log10(Response))) +
geom_point()
Here I assume that Covariate is the dose/concentration.
Do the different measurements for every Covariate relate to different experiments/groups? Do you plan on fitting multiple dose-response curves to different groups in order to compare them?
A possible analysis strategy
Here is something that might give you some ideas. I'm using drc here because it allows me to fit a "sensible" dose-response curve to your data. A sensible dose-response model has horizontal asymptotes for dose → 0 and dose → ∞.
In this particular example we fit a four-parameter Weibull function to your data.
library(drc)
model <- drm(Response ~ Covariate, data = df, fct = W2.4())
Let's plot original data and model predictions (including confidence interval)
library(tidyverse)
df.pred <- data.frame(Covariate = seq(min(df$Covariate), 1.1 * max(df$Covariate), length.out = 20))
df.pred <- df.pred %>%
    bind_cols(as.data.frame(predict(model, newdata = df.pred, interval = "confidence"))) %>%
    rename(Response = Prediction)
ggplot(df, aes(Covariate, Response)) +
geom_point() +
geom_line(data = df.pred, aes(Covariate, Response), color = "blue") +
geom_ribbon(data = df.pred, aes(x = Covariate, ymin = Lower, ymax = Upper), fill = "blue", alpha = 0.2)
We can now use uniroot() to determine specific LDx values, defined here as the dose at which the predicted response equals x * 100% of the maximum predicted response.
getLDx <- function(model, x = 0.5) {
    # maximum predicted response over the observed doses
    maxResponse <- max(predict(model))
    uniroot(
        function(Covariate) predict(model, newdata = data.frame(Covariate = Covariate)) - x * maxResponse,
        interval = range(df$Covariate))$root
}
This is basically an inversion of the model, so perhaps this is what the authors of the papers you link to in your original post refer to as "inverse regression techniques".
Let's calculate the LD50 value (i.e. the dose at which the predicted response is 50% of the maximum):
getLDx(model, x = 0.5)
#[1] 9.465188
From an inspection of the plot you can see that this value indeed corresponds to the dose where the response is 50% of the maximum response value.
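The same numerical inversion carries over to the other model classes in your question, because uniroot() only needs a predict() method. Here is a minimal sketch with the robust linear model from your post (my own addition; invert_fit() is a hypothetical helper, and it assumes the fitted curve crosses the target response exactly once within the observed dose range, so extreme targets such as 99.99 may have no root inside that range):
library(MASS)
rlm_fit <- rlm(Response ~ Covariate + I(Covariate^2), data = df)
# find the dose at which a fitted model predicts `target`,
# searching over the observed dose range only (no extrapolation)
invert_fit <- function(fit, target, doses = df$Covariate) {
    uniroot(
        function(d) predict(fit, newdata = data.frame(Covariate = d)) - target,
        interval = range(doses))$root
}
invert_fit(rlm_fit, target = 50)
The same helper should work for a quantreg fit with a single tau and for caret models, since both provide predict() methods.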
I want to use predict() with a polr() model to predict variable z, as per the following code. First comes the data frame used to train the model, then the test data.
df <- data.frame(x=c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2),
y=c(32, 67, 12, 89, 45, 78, 43, 47, 14, 67, 16, 36, 25, 23, 56, 26, 35, 79, 13, 44),
z=as.factor(c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2, 3, 2, 3, 2, 1, 2, 1, 2, 1, 2)))
test <- data.frame(x=c(1, 2, 1, 1, 2, 1, 2, 2, 1, 1),
y=c(34, NA, 78, NA, 89, 17, 27, 83, 23, 48),
z=c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1))
This is the polr() model:
library(MASS)
mod <- polr(z ~ x + y, data = df, Hess = TRUE)
And this is the predict() function with its outcome:
predict(mod, newdata = test)
[1] 2 <NA> 2 <NA> 2 2 2 2 2 2
My problem is that I want the model to make predictions even when there are NAs, as in the 2nd and 4th cases. I have tried the following, with the same result:
predict(mod, newdata = test, na.action = "na.exclude")
predict(mod, newdata = test, na.action = "na.pass")
predict(mod, newdata = test, na.action = "na.omit")
predict(mod, newdata = test, na.rm=T)
[1] 2 <NA> 2 <NA> 2 2 2 2 2 2
How can I get the model to make predictions even when there's some missing data?
This is more of a statistical or mathematical problem than a programming problem. To simplify things a little bit (and to show that the issue is general), I'll illustrate with a linear regression, but the concept extends to ordinal regression as well.
Suppose I've estimated a linear relationship, say z = 1 + 2*x + 3*y, and I want to predict a response when the predictors are {x=3, y=NA}. I get 1 + 2*3 + 3*NA, which is clearly NA.
If you want predictions when some of the predictor variables are unknown, you have to make some kind of assumption or decision about what to do; this is a question of interpretation, not mathematics. For example, you could set unknown values of y to the mean of the original data set, or the mean of the new data set, or some sensible reference value; or you could do multiple imputation, i.e., make several predictions based on several different draws from a reasonable distribution, then average the results. (For a linear regression model this will give you the same point estimate as using the mean of the distribution, but (1) the results will differ if you have an effectively nonlinear model such as an ordinal or generalized linear regression, and (2) multiple imputation will allow you to get sensible standard errors on the prediction.)
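A minimal sketch of the simplest option above, filling the missing y values in the test data with the mean of y from the training data (whether that is a sensible reference value is a substantive judgement, not a programming one):
test_imp <- test
test_imp$y[is.na(test_imp$y)] <- mean(df$y)  # single mean imputation
predict(mod, newdata = test_imp)             # now returns a prediction for every row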
I am trying to use auto.arima() on a time series and need to know the order of the ARIMA model that was selected. The return value is of class ARIMA, which doesn't seem to hold the order anywhere (or am I missing it?). The code snippet and the output attributes are given below (the output is the same as in the R documentation).
double[] list1 = {0, 0, 2, 1, 2, 10, 21, 0, 0, 3, 6, 5, 11, 51, 0, 11, 8, 6, 24, 25, 104, 0, 0, 6, 4, 5, 25, 71};
rconnection.assign("myData1", list1);
rconnection.eval("timeSeries1 <- ts(myData1,start=1,frequency="+staticBookingStage+")");
REXP fc = rconnection.eval("fitModel1 <- auto.arima(timeSeries1)");
System.out.println( fc.asList().names);
Output
[coef, sigma2, var.coef, mask, loglik, aic, arma, residuals, call, series, code, n.cond, nobs, model, bic, aicc, x, fitted]
Use the arimaorder() function:
library(forecast)
fit <- auto.arima(WWWusage)
arimaorder(fit)
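arimaorder() returns a named vector c(p, d, q), plus P, D, Q and the period m for seasonal models. The same information is also in the arma component you saw listed in the output, though in an unusual order:
# fit$arma holds c(p, q, P, Q, m, d, D); see the "arma" entry in ?arima
fit$arma
In the Rserve setup from your question, you could likewise evaluate arimaorder(fitModel1) in the R session and read the result back as a double array.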