R cdplot() - does the right axis show probability or density?

Reproducible data:
## NASA space shuttle o-ring failures
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
                 1, 2, 1, 1, 1, 1, 1),
               levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
## CD plot
cdplot(fail ~ temperature)
The documentation for cdplot says:
cdplot computes the conditional densities of x given the levels of y weighted by the marginal distribution of y. The densities are derived cumulatively over the levels of y. The conditional probabilities are not derived by discretization (as in the spinogram), but using a smoothing approach via density. The conditional density functions (cumulative over the levels of y) are returned invisibly.
So on the plot where x = 63, y = 0.4 (approximately). Is this probability, or probability density? I am confused by the documentation as to what is calculated, what is returned and what is plotted.

The plot shows the probability of an outcome for a given temperature.
What the docs are saying is that a kernel density estimate is calculated for all temperature measurements, and a second density is estimated for the temperatures where fail is "no". If we divide the density of the "no" temperatures by the density of all temperatures, and weight this ratio by the overall proportion of "no" outcomes, we get an estimate of the probability of drawing a "no" at a given temperature.
To show this is the case, let's see the cdplot:
cdplot(fail ~ temperature)
Now let's calculate the probabilities from the marginal densities manually and plot them. We should get a near-identical shape to the cdplot curve:
## density of all temperatures, and of the "no" temperatures only
all <- density(temperature, from = min(temperature), to = max(temperature))
no <- density(temperature[fail == "no"], from = min(temperature),
              to = max(temperature))
## P(no | temperature): ratio of densities, weighted by the overall proportion of "no"
probs <- no$y / all$y * proportions(table(fail))[1]
## plot 1 - probs so the curve traces the boundary as cdplot() draws it
plot(all$x, 1 - probs, type = "l", ylim = c(0, 1))
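As a quick sanity check (a sketch assuming the same objects as above), the conditional density function that cdplot() returns invisibly can be evaluated on the same grid. It will not match the manual calculation exactly, because cdplot() re-uses the bandwidth of the overall density for the group density, but the two should be close.
cdp <- cdplot(fail ~ temperature)   # list of functions, returned invisibly
head(cbind(manual = probs, from_cdplot = cdp$no(all$x)))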


How to extract the x- and y-values of a regression curve in R?
Generate some example data and combine in data frame:
x <- c(54, 54, 54, 54, 54, 54, 54, 72, 72, 72, 90, 90, 90, 90, 90, 90, 90, 90, 90, 72, 72, 72, 72, 72, 54, 54, 54, 54, 54, 54, 54, 72, 90, 90, 90, 90, 90, 90, 90, 90, 108, 126, 144, 144, 144, 144)
y <- c(15, 15, 15, 7.50, 7.50, 7.50, 25, 15, 15, 7.50, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 15, 0, 25, 15, 7.50, 7.50, 0, 7.50, 7.50, 7.50, 7.50, 7.50, 0, 3, 3, 3, 3, 15, 3, 15, 3, 3, 7.50)
df_xy <- as.data.frame(cbind(x, y))
df_xy
Visualize the data:
plot(df_xy[, 2], df_xy[, 1])  # y on the horizontal axis, x on the vertical axis
Regression analysis (code from here)
Generate training and test data:
library(tidyverse)
library(caret)
set.seed(123)
training.samples <- df_xy[,1] %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- df_xy[training.samples, ]
test.data <- df_xy[-training.samples, ]
Apply spline regression:
library(splines)
knots <- quantile(train.data$y, probs = c(0.25, 0.5, 0.75))
model <- lm(x ~ bs(y, knots = knots), data = train.data)
Visualize:
ggplot(train.data, aes(y, x)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 3))
Is there a way to extract the x- and respective y-values of that curve (blue line)?
Furthermore, might there be a base R option to visualize the spline regression?
Addition to the question:
Which uncertainty/error bars are shown in this display? And how can this be specified in the respective code?
How can the x- and respective y-values of the uncertainty band (grey area) be extracted?
If you save the plot as gg1, then
ggplot_build(gg1)$data[[2]][,c("x","y")]
will get you the x and y coordinates of the curve.
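For the follow-up questions, a hedged sketch: by default stat_smooth(method = lm) draws a 95% confidence interval around the fitted curve (the level argument changes it), and the same layer data frame from ggplot_build() carries the band in its ymin and ymax columns. A base R alternative is to predict from the spline model fitted above over a grid of y values and draw the band with polygon(); note that model uses the quantile knots, so it is not numerically identical to the df = 3 smooth in the ggplot call.
# confidence band of the smooth layer
band <- ggplot_build(gg1)$data[[2]][, c("x", "ymin", "ymax")]

# base R visualisation of the spline fit from `model`
grid <- data.frame(y = seq(min(train.data$y), max(train.data$y), length.out = 200))
pred <- predict(model, newdata = grid, se.fit = TRUE)
plot(train.data$y, train.data$x, xlab = "y", ylab = "x")
polygon(c(grid$y, rev(grid$y)),
        c(pred$fit - 1.96 * pred$se.fit, rev(pred$fit + 1.96 * pred$se.fit)),
        col = "grey85", border = NA)
lines(grid$y, pred$fit, col = "blue", lwd = 2)
points(train.data$y, train.data$x)   # redraw the points over the band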

Kaplan Meier curve after IPTW

I want to use IPTW to find the effects of a medication on cardiovascular death (1), which competes with non-cardiovascular death (2) and survival (0).
After the IPTW, I would like to do a competing risk analysis to find the effect of the medication on cardiovascular death and plot the resulting Kaplan Meier curve.
This is my start
library(tableone)
library(cmprsk)   # crr() lives in cmprsk; there is no package called "crr"
library(ipw)
library(sandwich)
library(survey)
treatment <- as.numeric(df$treatment == 1)
# propensity score model (logistic regression)
psmodel <- glm(treatment ~ age + sex, family = binomial, data = df)
ps <- predict(psmodel, type = "response")
# inverse probability of treatment weights
weight <- ifelse(treatment == 1, 1/ps, 1/(1 - ps))
age <- as.numeric(df$age)
sex <- as.numeric(df$sex == 1)
cov1 <- cbind(age, sex, weight, treatment)
ftime <- df$Survival
fstatus <- df$Outcome
competingrisk <- crr(ftime, fstatus, cov1, failcode = 2)
summary(competingrisk)
My dataset looks like this but with 500 lines.
structure(list(Outcome = c(1, 1, 1, 2, 2, 2, 2, 0, 0, 1, 1, 0,
0, 1, 1, 0, 0, 2), Survival = c(7, 13, 14, 8, 9, 15, 14, 16,
14, 3, 7, 13, 14, 8, 9, 15, 16, 4), treatment = c(1, 0, 0, 1,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0), age = c(59, 58, 57,
56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42),
BMI = c(25, 24, 23, 22, 21, 20, 29, 28, 27, 26, 25, 24, 25,
24, 23, 22, 21, 20), sex = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -18L))
A Kaplan-Meier plot takes no account of additional covariates or of competing risks. I highly doubt it is the best way to analyse your data in this instance; however, it can be used for exploration, so here is the solution.
If you want the four Kaplan-Meier curves for the two events (one curve per treatment arm for each event), you can use the following code:
library(survival)
# create a KM model for cardiovascular death (Outcome == 1)
mod1 <- survfit(Surv(Survival, Outcome == 1) ~ treatment, data = df)
# plot it
plot(mod1,
     conf.int = TRUE,          # show confidence intervals
     col = c("red", "green"))  # treatment = 0 is red
# see ?plot.survfit for more parameters
# same for non-cardiovascular death (Outcome == 2)
mod2 <- survfit(Surv(Survival, Outcome == 2) ~ treatment, data = df)
# plot it
plot(mod2,
     conf.int = TRUE,          # show confidence intervals
     col = c("red", "green"))  # treatment = 0 is red
However, as stated above, KM plots are naive to competing risks (the probabilities do not take the other event into consideration, so at certain time points the probabilities of the two events can sum to more than 100%). You would likely be better off plotting the cumulative incidence function (CIF), which does take the competing risks into account. The cmprsk package has the cuminc() function for this.
library(cmprsk)
cif <- cuminc(df$Survival,df$Outcome,df$treatment)
plot(cif)
# See ?plot.cuminc for more parameters
However, I find these base plots to be quite unappealing, and so would highly recommend using the survminer package which utilises ggplot2 to create better plots:
library(survminer)
ggsurvplot(mod1)
ggsurvplot(mod2)
ggcompetingrisks(cif)
If you wish to analyse the data and find the effect of the treatment using the IPTW, you can use the crr() function as in your question; this returns the subdistribution proportional hazards regression coefficients (i.e. the results of a Fine & Gray model). These can be interpreted in much the same way as the coefficients of a Cox proportional hazards analysis (while accounting for the competing risks). Bear in mind that IPTW is not a panacea, and therefore these results may still be non-causal.
Once you have the Fine & Gray model results from crr(), you can create a plot of the subdistributions across the two treatments by using the following code (assuming you have made the competingrisk object above):
cov2 <- matrix(c(0,1),ncol=1)
colnames(cov2) <- "treatment"
cr_pred <- predict(competingrisk,cov2)
plot(cr_pred)
# see ?plot.predict.crr for parameters
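If you also want the plotted values rather than just the picture, the object returned by predict.crr() is (per its documentation) a matrix whose first column holds the unique failure times and whose remaining columns hold the predicted cumulative incidence for each row of cov2; a quick sketch:
head(cr_pred)   # column 1 = time, columns 2-3 = CIF for treatment 0 and 1
matplot(cr_pred[, 1], cr_pred[, -1], type = "s",
        xlab = "Time", ylab = "Cumulative incidence",
        col = c("red", "green"), lty = 1)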
A few good resources are:
This blogpost by Emily Zabor
Tutorial in Biostatistics by Putter et al.

How to make predictions even with NAs using predict()?

I want to use predict() with a polr() model to predict variable z, as per the following code. The first block is the data frame used to train the model; the second is the test data.
df <- data.frame(x = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2),
                 y = c(32, 67, 12, 89, 45, 78, 43, 47, 14, 67, 16, 36, 25, 23, 56, 26, 35, 79, 13, 44),
                 z = as.factor(c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2, 3, 2, 3, 2, 1, 2, 1, 2, 1, 2)))
test <- data.frame(x = c(1, 2, 1, 1, 2, 1, 2, 2, 1, 1),
                   y = c(34, NA, 78, NA, 89, 17, 27, 83, 23, 48),
                   z = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1))
This is the polr() model (polr() comes from the MASS package):
library(MASS)
mod <- polr(z ~ x + y, data = df, Hess = TRUE)
And this is the predict() function with its outcome:
predict(mod, newdata = test)
[1] 2 <NA> 2 <NA> 2 2 2 2 2 2
My problem is that I want the model to make predictions even when there are NAs, as in the 2nd and 4th cases. I have tried the following, with the same result:
predict(mod, newdata = test, na.action = "na.exclude")
predict(mod, newdata = test, na.action = "na.pass")
predict(mod, newdata = test, na.action = "na.omit")
predict(mod, newdata = test, na.rm=T)
[1] 2 <NA> 2 <NA> 2 2 2 2 2 2
How can I get the model to make predictions even when there's some missing data?
This is more of a statistical or mathematical problem than a programming problem. To simplify things a little bit (and to show that the issue is general), I'll illustrate with a linear regression, but the concept extends to ordinal regression as well.
Suppose I've estimated a linear relationship, say z = 1 + 2*x + 3*y, and I want to predict a response when the predictors are {x=3, y=NA}. I get 1 + 2*3 + 3*NA, which is clearly NA.
If you want predictions when some of the predictor variables are unknown, you have to make some kind of assumption or decision about what to do; this is a question of interpretation, not mathematics. For example, you could set unknown values of y to the mean of the original data set, the mean of the new data set, or some sensible reference value, or you could do multiple imputation, i.e. make several predictions based on several different draws from a reasonable distribution and then average the results. For a linear regression model this gives the same point estimate as plugging in the mean of the distribution, but (1) the results will differ for an effectively nonlinear model such as an ordinal or generalized linear regression, and (2) multiple imputation lets you attach sensible standard errors to the prediction.
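As a minimal sketch of the mean-imputation option (one assumption among several reasonable ones, and it ignores the extra uncertainty that multiple imputation would capture), you can fill the missing y values with the training-set mean before calling predict():
test_imp <- test
test_imp$y[is.na(test_imp$y)] <- mean(df$y)   # impute with the mean of y in the training data
predict(mod, newdata = test_imp)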

In R, how do I find the x value that will equate to a certain y value on a cdplot? And vice-versa?

I have a cdplot where I'm trying to find the x value at which the distribution (the y value) equals 0.5, and I couldn't find a method that works. Additionally, I want to find the y value when my x value is 0, and would like help finding that too, if the approach is different.
I can't really provide my code as it relies on a saved workspace with a large dataframe. I'll give this as an example:
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1),
               levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
cdplot(fail ~ temperature)
So I don't need a quick and dirty way to solve this specific example, I need a code I can apply to my own workspace.
If you capture the return value of cdplot, you get a list of functions that you can use to find these values.
CDP <- cdplot(fail ~ temperature)
uniroot(function(x) { CDP$no(x) - 0.5 }, c(55, 80))
$root
[1] 62.34963
$f.root
[1] 3.330669e-16
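For the second part of the question (the y value at a given x), you can simply evaluate the returned function at that x. One caveat worth checking on your own data: the returned functions only describe the range of observed x values, so evaluating far outside that range (e.g. at x = 0 here, when temperatures run from 53 to 81) does not give a meaningful probability.
CDP$no(65)   # estimated P(fail == "no") at temperature 65
CDP$no(0)    # 0 is far outside the observed range; don't trust this value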

Extract x and y values from cdplot()

How can I extract the exact probabilities for each level of factor y at any value of x with cdplot(y ~ x)?
Thanks
Following the example from the help file of ?cdplot you can do...
## NASA space shuttle o-ring failures
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
                 1, 2, 1, 1, 1, 1, 1),
               levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
## CD plot
result <- cdplot(fail ~ temperature)
And this is a simple way to obtain the probabilities from the cdplot output.
# Getting the probabilities for each group.
lapply(split(temperature, fail), result[[1]])
$no
[1] 0.8166854 0.8209055 0.8209055 0.8209055 0.8090438 0.7901473 0.7718317 0.7718317 0.7579343
[10] 0.7664731 0.8062898 0.8326761 0.8326761 0.8905854 0.9185472 0.9626185
$yes
[1] 3.656304e-05 6.273653e-03 1.910046e-02 6.007471e-01 7.718317e-01 7.718317e-01 8.062898e-01
Note that result is a list of conditional density functions (cumulative over the levels of fail, so here a single function named "no") returned invisibly by cdplot; therefore we can split temperature by fail and apply the returned function to those values using lapply.
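To get the probabilities at arbitrary values of x (rather than only at the observed temperatures), you can call the returned function directly; with two levels, the probability of "yes" is just the complement. A short sketch using the objects above:
temps <- c(60, 65, 70)
p_no <- result$no(temps)   # P(fail == "no" | temperature)
p_yes <- 1 - p_no          # two levels, so "yes" is the complement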
Here is a simplified version of getS3method('cdplot', 'default') that returns the conditional probability functions:
get.props <- function(x, y, n = 512) {
  ny <- nlevels(y)
  yprop <- cumsum(prop.table(table(y)))
  dx <- density(x, n = n)                        # overall density of x
  y1 <- matrix(rep(0, n * (ny - 1L)), nrow = ny - 1L)
  rval <- list()
  for (i in seq_len(ny - 1L)) {
    # density of x for the first i levels of y, on the same grid and bandwidth
    dxi <- density(x[y %in% levels(y)[seq_len(i)]],
                   bw = dx$bw, n = n, from = min(dx$x), to = max(dx$x))
    y1[i, ] <- dxi$y / dx$y * yprop[i]            # cumulative conditional probability
    rval[[i]] <- approxfun(dx$x, y1[i, ], rule = 2)
  }
  names(rval) <- levels(y)[seq_len(ny - 1L)]
  rval
}
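A quick usage check, assuming the helper above is defined in the same session as fail and temperature:
props <- get.props(temperature, fail)
props$no(63)   # should be close to result$no(63)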
