How can I plot predicted survival curves for a continuous covariate (say, the 20th and 80th percentiles of its value) using the corrected group prognosis method as implemented in R by Therneau?
For example,
library(survival)
library(survminer)
fit <- coxph( Surv(stop, event) ~ size + strata(rx), data = bladder )
ggadjustedcurves(fit, data=bladder, method = "conditional", strata=rx)
Now, this is useful because I am given two survival curves that are stratified by rx (either 0 or 1), and the conditional method is applied to the bladder data set. However, suppose I would like to use the marginal method, not stratify, and instead plot my continuous covariate at its 20th and 80th percentile values while still re-balancing the subpopulation. I would appreciate any step in the right direction.
To restate: I have a Cox model with continuous predictors. I would like to build a Cox model that does not stratify on rx but still includes it in the model. Then I want to pass the resulting Cox object into the ggadjustedcurves() function, which uses "subpopulation re-balancing" when given a reference data set. Finally, instead of showing two survival curves stratified on a categorical variable, I want to plot two representative survival curves at the 20th and 80th percentile.
EDIT
My first attempt
fit2 <- coxph( Surv(stop, event) ~ size + rx, data = bladder ) #remove strata
fit2
# CGP
pred <- data.frame(rx = 1, size = 3.2)
ggadjustedcurves(fit2, data = pred, method = "conditional", reference = bladder)
Is this what I think it is? Has conditional re-balancing been applied to the reference data set, with the predicted curves then generated for an individual with rx = 1 and size 3.2?
It is difficult to understand what you are truly looking for, but I think I have a rough idea: you want to plot the survival curve that would have been observed if every person in your sample had received a specific value of the continuous covariate. If there is no confounding, you can simply use a Cox model that includes only the continuous covariate, use the predict() function for a range of points in time, and plot the results. If you need to adjust for confounding, you can include the confounders in the Cox model and use g-computation to obtain the desired probabilities. I describe this in a recent preprint: https://arxiv.org/pdf/2208.04644.pdf
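For the no-confounding case, here is a minimal sketch (using the bladder data from the question and base survival functionality; survfit() on the fitted model handles the prediction over the time grid for you):
library(survival)
# Cox model with only the continuous covariate
fit0 <- coxph(Surv(stop, event) ~ size, data = bladder)
# Predicted curves at the 20th and 80th percentiles of size
newd <- data.frame(size = quantile(bladder$size, probs = c(0.2, 0.8)))
sf <- survfit(fit0, newdata = newd)
plot(sf, col = c("blue", "red"), xlab = "Time", ylab = "Survival probability")
legend("topright", legend = c("20th percentile", "80th percentile"),
       col = c("blue", "red"), lty = 1)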
The confounder-adjusted version can be done in R using the contsurvplot package (also developed by me). First, install the package using:
devtools::install_github("RobinDenz1/contsurvplot")
Afterwards, fit your Cox model, but use x=TRUE in the coxph call:
library(survival)
library(contsurvplot)
library(riskRegression)
library(ggplot2)
fit2 <- coxph(Surv(stop, event) ~ size + rx, data=bladder, x=TRUE)
You can now call the plot_surv_lines function to obtain the causal survival curves for specific values of size, given the model. Using the horizon argument, you can tell the function which values to plot survival curves for. I chose the 20th and 80th percentiles of size, as you described:
plot_surv_lines(time="stop",
status="event",
variable="size",
data=bladder,
model=fit2,
horizon=quantile(bladder$size, probs=c(0.2, 0.8)))
The package contains a lot more plotting routines to visualize the causal effect of a continuous variable on a time-to-event outcome that might be more suitable for what you actually want.
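For example (a hedged sketch: I am assuming plot_surv_area is exported with the same time/status/variable/data/model arguments as plot_surv_lines), a shaded-area plot over the whole range of size:
# Assumed to share the argument names of plot_surv_lines above
plot_surv_area(time = "stop",
               status = "event",
               variable = "size",
               data = bladder,
               model = fit2)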
I have panel data on the external assets of 102 countries over roughly 20-40 years, depending on the country.
I tried predicting the probability of a financial crisis from log(total_liabilities) to see whether an increase in foreign investment and other capital positions can help predict a crisis.
plm1 <- plm(crisis ~ log_total_liabilities + lag1_log_tot_lia + lag2_log_tot_lia +
              lag3_log_tot_lia + factor(year) + factor(country),
            data = dt2, index = c("year", "country"), model = "pooling")
summary(plm1)
I started by estimating a plm model, regressing my crisis dummy on the liability terms. To assess the predictive ability, I wanted to generate a ROC curve and an AUC value from the regression:
# Plot of True Positive Rate Against the False Positive Rate
pred1 <- predict(plm1)
pred2 <- prediction(pred1,as.numeric(plm1$crisis))
plot(performance(pred2,"tpr","fpr"), las=0, main="plm1")
I get errors like:
"not fitting arguments/variables" (translated from German) or
"all arguments/variables need to have the same length" (translated from German).
Another approach to obtaining ROC values would start by changing the prediction call. When changing to pred1 <- predict(plm1, dt2) (dt2 is my data frame, which also contains some variables I did not use in the plm1 regression), the error differs:
The format of predictions is invalid. It couldn't be coerced to a list.
Are PLMs simply not made for ROC calculations? If so, how come the attached paper presents AUROC values for a linear probability model with fixed effects (see the second-to-last row)? And if not, what am I doing wrong?
I attached a screenshot of the paper and my dataset.
CSV file with dataset
Screenshot of paper with OLS AUROC value
AUC-ROC only works for binary classification problems. Since you used a fixed-effects regression, the predicted values produced by plm1 (pred1) are continuous, and they must be paired with the binary crisis outcome of exactly the same length.
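A minimal sketch of how to line the two up (hedged: it assumes the ROCR package and relies on the plm object keeping its model frame in $model, which holds exactly the rows used in estimation):
library(ROCR)
# The model frame contains the rows plm actually used, so outcome and
# fitted values have matching lengths
y <- as.numeric(plm1$model$crisis)
p <- y - as.numeric(residuals(plm1))  # fitted values = outcome - residuals
pred <- prediction(p, y)
plot(performance(pred, "tpr", "fpr"), las = 0, main = "plm1")
as.numeric(performance(pred, "auc")@y.values)  # AUROC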
I am trying to create a GLMM in R. I want to find out how the emergence time of bats depends on different factors. I take the time difference between the departure of the respective bat and the sunset of that day as the dependent variable (metric). As fixed factors I would like to include various weather data (metric) as well as the reproductive state (categorical) of the bats. Additionally, the transponder number (an individual identification code) enters as a random factor to control for inter-individual differences between the bats.
I first worked in R with a linear mixed model (package lme4), but the QQ plot of the residuals deviates very strongly from the normal distribution, and a histogram of the data suggests a gamma distribution instead. As a result, I implemented a GLMM with a gamma distribution. Here is an example with one weather parameter:
model <- glmer(formula = difference_in_min ~ repro + precipitation + (1 + repro | transponder_number),
               data = trip, control = ctrl, family = Gamma(link = "log"))
However, since this did not change the QQ plot, I looked at the residual diagnostics from the DHARMa package. The distributional assumption still doesn't seem to be correct, because the data deviate strongly in this QQ plot, too.
Residual diagnostics from DHARMa
But if the data also do not correspond to a gamma distribution, what alternative is there? Or maybe the problem lies somewhere else entirely.
Does anyone have an idea where the error might lie?
But if the data also do not correspond to a gamma distribution, what alternative is there?
One alternative is the lognormal distribution (https://en.wikipedia.org/wiki/Log-normal_distribution).
Gaussian (or normal) distributions are typically used for data whose residuals are normally distributed around zero, which it sounds like you do not have. The lognormal distribution does not carry the same requirement. Following your previous code, you would fit it like this:
model <- glmer(formula = log(difference_in_min) ~ repro + precipitation + (1 + repro | transponder_number),
               data = trip, control = ctrl, family = gaussian(link = "identity"))
Or, instead of glmer, you can call lmer directly, where you don't need to specify the distribution (which glmer may tell you to do in a warning message anyway):
model <- lmer(formula = log(difference_in_min) ~ repro + precipitation + (1 + repro | transponder_number),
              data = trip, control = ctrl)
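To check whether the lognormal model actually fixed the misfit, you can rerun the same DHARMa diagnostics you used before (a quick sketch using the model object from above):
library(DHARMa)
# Simulate scaled residuals from the refitted model and redraw the
# diagnostic panels (QQ plot plus residuals vs. predicted)
sim_res <- simulateResiduals(fittedModel = model, n = 250)
plot(sim_res)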
I have built a binary logistic regression for churn prediction in RStudio. Due to the unbalanced data used for this model, I also included weights. I first tried to find the optimal cutoff by trial and error, but to complete my research I have to use ROC curves to find the optimal cutoff. Below is the script I used to build the model (fit2). The weight is stored in W; it states that the cost of wrongly identifying a churner is 14 times as large as the cost of wrongly identifying a non-churner.
#CH1 logistic regression
library(caret)
W = 14
lvl = levels(trainingset$CH1)
print(lvl)
#if positive we give it the defined weight, otherwise set it to 1
fit_wts = ifelse(trainingset$CH1==lvl[2],W,1)
fit2 = glm(CH1 ~ RET + ORD + LVB + REVA + OPEN + REV2KF + CAL + PSIZEF + COM_P_C + PEN + SHOP,
           data = trainingset, weights = fit_wts, family = binomial(link = 'logit'))
# we test it on the test set
predlog1 = ifelse(predict(fit2, testset, type = "response") > 0.5, lvl[2], lvl[1])
predlog1 = factor(predlog1, levels = lvl)
predlog1
confusionMatrix(predlog1, testset$CH1, positive = lvl[2])
For this research I have also built ROC curves for decision trees using the pROC package. However, the same script does not carry over directly to a logistic regression, so I created a ROC curve for the logistic regression using the script below.
prob <- predict(fit2, testset, type = "response")
testset$prob <- prob
library(pROC)
g <- roc(CH1 ~ prob, data = testset)
g
plot(g)
Which resulted in the ROC curve below.
How do I get the optimum cut off from this ROC curve?
Getting the "optimal" cutoff is totally independent of the type of model, so you can get it like you would for any other type of model with pROC. With the coords function:
coords(g, "best", transpose = FALSE)
Or directly on a plot:
plot(g, print.thres=TRUE)
Now, the above simply maximizes the sum of sensitivity and specificity. This is often too simplistic, and you probably need a clear definition of "optimal" that is adapted to your use case. That's mostly beyond the scope of this question, but as a starting point you should take a look at the Best thresholds section of the documentation of the coords function for some basic options.
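Since your question mentions a 14:1 cost ratio, one of those options is directly relevant: coords can weight the threshold search with best.weights = c(cost, prevalence), where cost is the relative cost of a false negative versus a false positive (here I am reading your 14:1 ratio as exactly that, which is an interpretation you should verify):
# Proportion of churners (positive class) in the test set
prev <- mean(testset$CH1 == lvl[2])
# Cost-weighted "best" threshold
coords(g, "best", best.method = "youden",
       best.weights = c(14, prev), transpose = FALSE)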
I'm trying to plot a Kaplan-Meier survival plot in R, but I'm having some trouble.
I'm quite new to R, so forgive my terrible code.
library(survival)
data_time = c(0.19,0.75,0.27,0.26,0.22,0.91,0.21,0.091,0.19,0.37,0.093,0.92,0.046,0.93,042)
data_event = c(1,1,1,1,0,0,1,1,0,0,0,1,1,1,0)
surv_object = Surv(time = data_time, event = data_event)
survfit(surv_object)
This of course gives me an error: "The survfit function requires a formula as its first argument".
I've split the data into two vectors: the first holds the life-lengths, and the second indicates whether that specific data point was censored, with 0 meaning not censored and 1 meaning censored.
I thought the Surv function was supposed to produce the formula required for the survfit function, with the default being the Kaplan-Meier.
The survfit function, as the name suggests, serves to fit a survival model, i.e. to predict survival based on some variables. The "formula" is the (non-linear) y = f(x) model that is fitted, expressed as Surv(...) ~ x1 + ... + xn.
However, it is definitely possible to do a Kaplan-Meier survival plot without any predictors. Just fitting the model on a constant (i.e. 1) should do the trick. Then, I like to use the ggsurvplot function from the survminer package.
install.packages("survminer")
library(survminer)
library(survival)
data_time = c(0.19,0.75,0.27,0.26,0.22,0.91,0.21,0.091,0.19,0.37,0.093,0.92,0.046,0.93,0.42)
data_event = c(1,1,1,1,0,0,1,1,0,0,0,1,1,1,0)
surv_object = Surv(time = data_time, event = data_event)
# Regress on a constant
fit <- survfit(surv_object ~ 1)
# Plot the fit
ggsurvplot(fit, data.frame(time=data_time, event=data_event), conf.int=FALSE)
Of course, the plot will be a lot more interesting if you're fitting some strata.
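For illustration, with a made-up grouping variable (not part of your data, purely to show the syntax):
# Hypothetical grouping variable, purely for illustration
df <- data.frame(time = data_time, event = data_event,
                 group = rep(c("A", "B"), length.out = 15))
fit_strata <- survfit(Surv(time, event) ~ group, data = df)
ggsurvplot(fit_strata, df, conf.int = FALSE)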
Note: I assume you missed a period in the last event time (042), and fixed it to 0.42.
I am using random forests for a regression problem, predicting the label values of Test-Y for a given set of Test-X (new values of the features). The model has been trained on given Train-X (features) and Train-Y (labels). The randomForest package in R serves me very well for predicting the numerical values of Test-Y. But this is not all I want.
Instead of only a number, I want to use random forests to produce a probability density function. I searched for a solution for several days, and here is what I found so far:
"randomForest" doesn't produce probabilities for regression, but only in classification. (via "predict" and setting type=prob).
Using "quantregForest" provides a nice way to make and visualize prediction intervals. But still not the probability density function!
Any other thought on this?
Please see the predict.all parameter of the predict.randomForest function.
library("ggplot2")
library("randomForest")
data(mpg)
rf = randomForest(cty ~ displ + cyl + trans, data = mpg)
# Predict the first car in the dataset
pred = predict(rf, newdata = mpg[1, ], predict.all = TRUE)
hist(pred$individual)
The histogram of 500 "elementary" predictions looks like this:
You can also use quantregForest with a very fine grid of quantiles, convert them into a cumulative distribution function (CDF) with the R function ecdf, and turn this CDF into a density estimate with a kernel density estimator.
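A rough sketch of that route (hedged: it assumes the quantregForest package with its x/y interface and a what argument for the quantile grid, and it applies the kernel density estimator directly to the equi-probable quantile values, which amounts to the same thing as going through ecdf):
library(quantregForest)
# quantregForest uses an x/y interface, so build a numeric design matrix
X <- model.matrix(cty ~ displ + cyl + trans, data = mpg)[, -1]
qrf <- quantregForest(x = X, y = mpg$cty)
# Fine grid of quantiles for the first car, approximating a sample from
# its predictive distribution
qs <- predict(qrf, newdata = X[1, , drop = FALSE],
              what = seq(0.01, 0.99, by = 0.01))
# Kernel density estimate over the quantile values
plot(density(as.numeric(qs)), main = "Approximate predictive density")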