Plot Kaplan-Meier for Cox regression - r

I have a Cox proportional hazards model, set up using the following code in R, that predicts mortality. Covariates A, B and C (age, sex, race) are included only to adjust for confounding; the predictor we are really interested in is X, which is a continuous variable.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I'm having trouble plotting a Kaplan-Meier curve for this. I've been searching for how to create this figure but haven't had much luck. I'm not sure whether plotting a Kaplan-Meier curve for a Cox model is even possible. Does the Kaplan-Meier curve adjust for my covariates, or does it not need them?
What I did try is below, but I've been told this isn't right.
plot(survfit(cox.model), xlab = 'Time (years)', ylab = 'Survival Probabilities')
I also tried to plot a figure showing the cumulative hazard of mortality. I don't know if I'm doing it right, since I've tried it a few different ways and get different results. Ideally, I would like to plot two lines: one showing the risk of mortality for the 75th percentile of X and one for the 25th percentile of X. How can I do this?
I could list everything else I've tried, but I don't want to confuse anyone!
Many thanks.

Here is an example taken from this paper.
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi <- read.table(url, header=TRUE)
Rossi[1:5, 1:10]
#   week arrest fin age  race wexp         mar paro prio educ
# 1   20      1  no  27 black   no not married  yes    3    3
# 2   17      1  no  18 black   no not married  yes    8    4
# 3   25      1  no  19 other  yes not married  yes   13    3
# 4   52      0 yes  23 black  yes     married  yes    1    5
# 5   52      0  no  19 other  yes not married  yes    3    3
mod.allison <- coxph(Surv(week, arrest) ~
                       fin + age + race + wexp + mar + paro + prio,
                     data = Rossi)
mod.allison
# Call:
# coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
#     mar + paro + prio, data = Rossi)
#
#                    coef exp(coef) se(coef)      z      p
# finyes          -0.3794     0.684   0.1914 -1.983 0.0470
# age             -0.0574     0.944   0.0220 -2.611 0.0090
# raceother       -0.3139     0.731   0.3080 -1.019 0.3100
# wexpyes         -0.1498     0.861   0.2122 -0.706 0.4800
# marnot married   0.4337     1.543   0.3819  1.136 0.2600
# paroyes         -0.0849     0.919   0.1958 -0.434 0.6600
# prio             0.0915     1.096   0.0286  3.194 0.0014
#
# Likelihood ratio test=33.3  on 7 df, p=2.36e-05  n= 432, number of events= 114
Note that the model uses fin, age, race, wexp, mar, paro, prio to predict arrest. As mentioned in this document, survfit() estimates the survival curve; applied to a coxph fit it returns the model's predicted curve (evaluated at the mean covariate values by default) rather than a raw Kaplan-Meier estimate.
plot(survfit(mod.allison), ylim = c(0.7, 1), xlab = "Weeks",
     ylab = "Proportion Not Rearrested")
We get a plot (with a 95% confidence interval) for the survival rate. For the cumulative hazard rate you can do
plot(survfit(mod.allison)$cumhaz)
but this doesn't give confidence intervals. However, no worries! We know that H(t) = -ln(S(t)) and we have confidence intervals for S(t). All we need to do is
sfit <- survfit(mod.allison)
cumhaz.upper <- -log(sfit$upper)
cumhaz.lower <- -log(sfit$lower)
cumhaz <- sfit$cumhaz # same as -log(sfit$surv)
Then just plot these
plot(sfit$time, cumhaz, type = "s", xlab = "weeks ahead", ylab = "cumulative hazard",
     ylim = c(min(cumhaz.lower), max(cumhaz.upper)))
lines(sfit$time, cumhaz.lower, type = "s", lty = 2)   # lower confidence limit
lines(sfit$time, cumhaz.upper, type = "s", lty = 2)   # upper confidence limit
You'll want to use survfit(..., conf.int=0.50) to get bands for 75% and 25% instead of 97.5% and 2.5%.
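For instance, a minimal sketch of that tweak, reusing the mod.allison fit from above:
# 50% bands: the limits sit at the 25th and 75th percentiles of the
# sampling distribution instead of 2.5% and 97.5%
sfit50 <- survfit(mod.allison, conf.int = 0.50)
plot(sfit50, ylim = c(0.7, 1), xlab = "Weeks", ylab = "Proportion Not Rearrested")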

The request for an estimated survival curve at the 25th and 75th percentiles of X first requires determining those percentiles and specifying values for all the other covariates in a data frame to be used as the newdata argument to survfit():
We can use the data suggested by the other respondent from Fox's website, although on my machine it required building a url connection object first:
url <- url("http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt")
Rossi <- read.table(url, header=TRUE)
It's probably not the best example for this question, but it does have a numeric variable for which we can calculate the quartiles:
> summary(Rossi$prio)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.000   2.000   2.984   4.000  18.000
So this would be the model fit and survfit calls:
mod.allison <- coxph(Surv(week, arrest) ~
                       fin + age + race + prio,
                     data = Rossi)
prio.fit <- survfit(mod.allison,
                    newdata = data.frame(fin = "yes", age = 30, race = "black", prio = c(1, 4)))
plot(prio.fit, col=c("red","blue"))

Setting the confounders to fixed values and plotting the predicted survival probabilities at multiple points in time for given values of X (as @IRTFM suggested in their answer) results in a conditional effect estimate. That is not what a standard Kaplan-Meier estimator is used for, and I don't think it is what the original poster wanted. Usually we are interested in average causal effects. In other words: what would the survival probability be if X had been set to some specific value x in the entire sample?
We can obtain this probability using the Cox model that was fit plus g-computation. In g-computation, we set the value of X to x for the entire sample and then use the Cox model to predict the survival probability at time t for each individual, using their observed covariate values in the process. We then simply take the average of those predictions to obtain our final estimate. By repeating this process for a range of points in time and a range of possible values of X, we obtain a three-dimensional survival surface, which can be visualized using color scales.
This can be done using the contsurvplot R package I developed, as discussed in this previous answer (Converting survival analysis by a continuous variable to categorical) or in the documentation of the package. More information about this strategy in general can be found in the preprint version of my article on the topic: https://arxiv.org/pdf/2208.04644.pdf
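If you prefer to stay with base survival functions, here is a minimal hand-rolled g-computation sketch along the lines described above, written for the hypothetical df, A, B, C and X from the original question (the column names and the time grid are assumptions):
library(survival)

cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)

# average predicted survival when everyone is assigned X = xval,
# keeping each individual's own observed A, B and C
g_comp_surv <- function(model, data, xval, times) {
  data_x <- data
  data_x$X <- xval
  sf <- survfit(model, newdata = data_x)
  surv_mat <- summary(sf, times = times, extend = TRUE)$surv  # rows = times, cols = individuals
  rowMeans(surv_mat)
}

times <- seq(0, 10, by = 0.5)   # assumed follow-up grid in years
s25 <- g_comp_surv(cox.model, df, quantile(df$X, 0.25), times)
s75 <- g_comp_surv(cox.model, df, quantile(df$X, 0.75), times)

plot(times, s75, type = "s", col = "blue", ylim = c(0, 1),
     xlab = "Time (years)", ylab = "Average survival probability")
lines(times, s25, type = "s", col = "red")
legend("bottomleft", c("X at 25th percentile", "X at 75th percentile"),
       col = c("red", "blue"), lty = 1)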

Related

How to calculate Odds ratio and 95% confidence interval for decile

I have run a logistic regression; part of the result looks like this:
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.9056     0.4967  -3.837 0.000125 ***
GWAS$value     0.4474     0.1157   3.868 0.000110 ***
This is the data I used for the logistic regression:
   ID Phenotype    value
1 128         0 1.510320
2 193         1 1.956477
3 067         0 2.038308
4 034         1 2.058739
5 159         0 2.066371
6 013         0 2.095866
I would like to know how to calculate the odds ratio and 95% confidence interval for each decile of the value. My goal is to produce a plot where the y axis is the OR (95% CI) and the x axis is the decile of value in my data. Can anyone please tell me how I can calculate this in R?
Here is an example of the kind of figure I want.
I don't have your data, so I cannot fit the exact model for you. The trick is to bin the predictor into groups, regress your response variable on those groups, and then plot the CI for each group, joining the points with lines if need be. Below I use an example dataset; if you follow the same steps, you should get a similar plot:
library(tidyverse)
ldata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# break the gre column into quintiles (use cut_number(gre, 10) for deciles)
ldata <- ldata %>% mutate(GRE = cut_number(gre, 5))
# regression like you did, estimating the log odds for each quintile
fit <- glm(admit ~ 0 + GRE, data = ldata, family = "binomial")
# results table like you have
results <- coefficients(summary(fit))
# rename second column to SE for plotting
colnames(results)[2] <- "SE"
# use ggplot
data.frame(results) %>%
  mutate(X = 1:n()) %>%
  ggplot(aes(x = X, y = Estimate)) +
  geom_point() +
  geom_line() +
  # a 95% interval is the estimate +/- 1.96*SE
  geom_errorbar(aes(ymin = Estimate - 1.96 * SE, ymax = Estimate + 1.96 * SE), width = 0.2) +
  scale_x_continuous(breaks = 1:5, labels = rownames(results)) +
  xlab("GRE quintiles") + ylab("Log Odds Ratio")

R quantreg model does not reproduce quantiles: Why?

I am using the quantreg package to predict quantiles and their confidence intervals. I can't understand why the predicted quantiles are different from the quantiles calculated directly from the data using quantile().
library(tidyverse)
library(quantreg)
data <- tibble(data=runif(10)*10)
qr1 <- rq(formula=data ~ 1, tau=0.9, data=data) # quantile regression
yqr1<- predict(qr1, newdata=tibble(data=c(1)), interval='confidence', level=0.95, se='boot') # predict quantile
q90 <- quantile(data$data, 0.9) # quantile of sample
> yqr1
       fit    lower   higher
1 6.999223 3.815588 10.18286
> q90
     90%
7.270891
You should realize that predicting the 90th percentile for a dataset with only 10 items is really determined by just the two highest values. You should also review the help page for quantile(), where you will find multiple definitions of that term.
When I run this, I see:
yqr1<- predict(qr1, newdata=tibble(data=c(1)) )
yqr1
1
8.525812
And when I look at the data I see:
data
# A tibble: 10 x 1
          data
         <dbl>
 1  8.52581158
 2  7.73959380
 3  4.53000680
 4  0.03431813
 5  2.13842058
 6  5.60713159
 7  6.17525537
 8  8.76262959
 9  5.30750304
10  4.61817190
So the rq function is estimating the second highest value as the 90th percentile, which seems perfectly reasonable. The quantile result is not actually estimated that way:
quantile(data$data, .9)
# 90%
#8.549493
?quantile
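To make that concrete with the data shown above: quantile() type 1 (the inverse of the empirical CDF) returns an order statistic and matches the rq() fit here, while the default type 7 interpolates between the two largest values.
quantile(data$data, 0.9, type = 1)   # order statistic, same value rq() returned
#      90%
# 8.525812
quantile(data$data, 0.9, type = 7)   # default, interpolated between the top two
#      90%
# 8.549493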

Comparing survival at specific time points

I have the following survival data:
library(survival)
data(pbc)
#model to be plotted and analyzed, convert time to years
fit <- survfit(Surv(time/365.25, status) ~ edema, data = pbc)
#visualize overall survival Kaplan-Meier curve
plot(fit)
Here is what the resulting Kaplan-Meier plot looks like.
I am further calculating survival at 1, 2, 3 years in this manner:
> summary(fit, times = c(1, 2, 3))
Call: survfit(formula = Surv(time/365.25, status) ~ edema, data = pbc)

232 observations deleted due to missingness
                edema=0
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1    126      12    0.913  0.0240        0.867        0.961
    2    112      12    0.825  0.0325        0.764        0.891
    3     80      26    0.627  0.0420        0.550        0.714

                edema=0.5
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     22       7    0.759  0.0795        0.618        0.932
    2     17       5    0.586  0.0915        0.432        0.796
    3     11       4    0.448  0.0923        0.299        0.671

                edema=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1      8      11    0.421  0.1133       0.2485        0.713
    2      5       3    0.263  0.1010       0.1240        0.558
    3      3       2    0.158  0.0837       0.0559        0.446
As you can see, the resulting output shows me 95% confidence intervals for the different levels of edema but no actual p-values. From whether or not the confidence intervals overlap I get a pretty good idea of whether survival at these time points is significantly different, but I would like exact p-values. How can I do that?
I think the following code does what you are looking for:
library(survival)
data(pbc)
#model to be plotted and analyzed, convert time to years
fit <- survfit(Surv(time/365.25, status) ~ edema, data = pbc)
#visualize overall survival Kaplan-Meier curve
plot(fit)
threeYr <- summary(fit,times=3)
#difference in survival at 3 years between edema=0 and edema=1 (for example) is
threeYr$surv[1] - threeYr$surv[3]
#the standard error of this is
diffSE <- sqrt(threeYr$std.err[3]^2 + threeYr$std.err[1]^2)
#a 95% CI for the diff is
threeYr$surv[1] - threeYr$surv[3] - 1.96 *diffSE
threeYr$surv[1] - threeYr$surv[3] + 1.96 *diffSE
#a z-test test statistic is
zStat <- (threeYr$surv[1] - threeYr$surv[3])/diffSE
#and a two-sided p-value testing that the diff. is 0 is
2*pnorm(abs(zStat), lower.tail=FALSE)
Alternatively one could make the comparison by estimating the risk ratio or odds ratio based on the estimated probabilities, and perform the inference/test on the log risk ratio or log odds ratio scale. In general I would expect this to perform better (in terms of test size and confidence interval coverage) since the normal approximation will be better on these scales than on the risk difference scale.
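A minimal sketch of that alternative, reusing the threeYr object from the code above; the only new ingredient is the delta-method step se(log(1 - S)) ≈ se(S) / (1 - S):
risk0 <- 1 - threeYr$surv[1]   # 3-year risk of the event for edema=0
risk1 <- 1 - threeYr$surv[3]   # 3-year risk of the event for edema=1
logRR <- log(risk1 / risk0)
seLogRR <- sqrt((threeYr$std.err[1] / risk0)^2 + (threeYr$std.err[3] / risk1)^2)
exp(logRR)                                  # estimated risk ratio
exp(logRR + c(-1.96, 1.96) * seLogRR)       # 95% CI for the risk ratio
2 * pnorm(abs(logRR / seLogRR), lower.tail = FALSE)   # p-value for H0: RR = 1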
Your question is 'are x-year survival rates different for the different categories of edema'.
For example, if you're interested in 3-year survival rates, you only need to focus on that portion of the curve (the first 3 years of follow-up), as shown in the figure. Follow-up beyond 3 years is administratively censored at 3 years (the maximum follow-up time in this analysis); note that this means recoding events that occur after 3 years as censored, not just truncating the times:
pbc$status[pbc$time > 3*365.25] <- 0
pbc$time[pbc$time > 3*365.25] <- 3*365.25
Calculating a log-rank test using coxph from the 'survival' package (the same package you are already using) on this truncated data set will give you a p-value telling you whether survival at three years differs between the three groups (highly significant in this example). You can also use the same model to obtain p-values and hazard ratios for the association of edema with cause-specific survival; a sketch is given below.
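A minimal sketch of that approach, assuming death (status == 2 in pbc) is the event of interest; anything observed beyond 3 years is censored at 3 years before testing:
library(survival)
data(pbc)

cut <- 3 * 365.25
pbc3 <- pbc
pbc3$status[pbc3$time > cut] <- 0   # recode events after 3 years as censored
pbc3$time[pbc3$time > cut] <- cut   # truncate follow-up at 3 years

# classic log-rank test comparing 3-year survival across the edema groups
survdiff(Surv(time / 365.25, status == 2) ~ factor(edema), data = pbc3)

# Cox model on the same truncated data: hazard ratios plus a score (log-rank) test
summary(coxph(Surv(time / 365.25, status == 2) ~ factor(edema), data = pbc3))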

Dummy variables for Logistic regression in R

I am running a logistic regression on three factors that are all binary.
My data
table1 <- expand.grid(Crime = factor(c("Shoplifting", "Other Theft Acts")),
                      Gender = factor(c("Men", "Women")),
                      Priorconv = factor(c("N", "P")))
table1 <- data.frame(table1, Yes = c(24, 52, 48, 22, 17, 60, 15, 4),
                     No = c(1, 9, 3, 2, 6, 34, 6, 3))
and the model
fit4<-glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
summary(fit4)
R seems to take 1 for prior conviction P and 1 for crime shoplifting. As a result the interaction effect is only 1 if both of the above are 1. I would now like to try different combinations for the interaction term, for example I would like to see what it would be if prior conviction is P and crime is not shoplifting.
Is there a way to make R take different cases for the 1s and the 0s? It would facilitate my analysis greatly.
Thank you.
You're already getting all four combinations of the two categorical variables in your regression. You can see this as follows:
Here's the output of your regression:
Call:
glm(formula = cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime,
    family = binomial, data = table1)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                   1.9062     0.3231   5.899 3.66e-09 ***
PriorconvP                   -1.3582     0.3835  -3.542 0.000398 ***
CrimeShoplifting              0.9842     0.6069   1.622 0.104863
PriorconvP:CrimeShoplifting  -0.5513     0.7249  -0.761 0.446942
So, for Priorconv, the reference category (the one with dummy value = 0) is N. And for Crime the reference category is Other. So here's how to interpret the regression results for each of the four possibilities (where log(p/(1-p)) is the log of the odds of a Yes result):
1. PriorConv = N and Crime = Other. This is just the case where both dummies are
zero, so your regression is just the intercept:
log(p/(1-p)) = 1.90
2. PriorConv = P and Crime = Other. So the Priorconv dummy equals 1 and the
Crime dummy is still zero:
log(p/(1-p)) = 1.90 - 1.36
3. PriorConv = N and Crime = Shoplifting. So the Priorconv dummy is 0 and the
Crime dummy is now 1:
log(p/(1-p)) = 1.90 + 0.98
4. PriorConv = P and Crime = Shoplifting. Now both dummies are 1:
log(p/(1-p)) = 1.90 - 1.36 + 0.98 - 0.55
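To double-check these hand-computed sums, here is a small sketch that reads the same four linear predictors off the fitted model with predict(); the rows of newcases correspond to cases 1 through 4 above (this uses the fit4 object from the question):
newcases <- expand.grid(Priorconv = c("N", "P"),
                        Crime = c("Other Theft Acts", "Shoplifting"))
cbind(newcases,
      log_odds = predict(fit4, newdata = newcases, type = "link"),
      prob_yes = predict(fit4, newdata = newcases, type = "response"))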
You can reorder the factor values of the two predictor variables, but that will just change which combinations of variables fall into each of the four cases above.
Update: Regarding the issue of regression coefficients relative to ordering of the factors. Changing the reference level will change the coefficients, because the coefficients will represent contrasts between different combinations of categories, but it won't change the predicted probabilities of a Yes or No outcome. (Regression modeling wouldn't be all that credible if you could change the predictions just by changing the reference category.) Note, for example, that the predicted probabilities are the same even if we switch the reference category for Priorconv:
m1 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
predict(m1, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
table2 = table1
table2$Priorconv = relevel(table2$Priorconv, ref = "P")
m2 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table2,family=binomial)
predict(m2, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
I agree with the interpretation provided by @eipi10. You can also use relevel() to change the reference level before fitting the model:
levels(table1$Priorconv)
## [1] "N" "P"
table1$Priorconv <- relevel(table1$Priorconv, ref = "P")
levels(table1$Priorconv)
## [1] "P" "N"
m <- glm(cbind(Yes, No) ~ Priorconv*Crime, data = table1, family = binomial)
summary(m)
Note that I changed the formula argument of glm() to include Priorconv*Crime which is more compact.

Odds ratio and confidence intervals from glmer output

I have made a model that looks at a number of variables and the effect that has on pregnancy outcome. The outcome is a grouped binary. A mob of animals will have 34 pregnant and 3 empty, the next will have 20 pregnant and 4 empty and so on.
I have modelled this data using the glmer function where y is the pregnancy outcome (pregnant or empty).
mclus5 <- glmer(y ~ adg + breed + bw_start + year + (1 | farm),
                data = dat, family = binomial)
I get all the usual output with coefficients etc. but for interpretation I would like to transform this into odds ratios and confidence intervals for each of the coefficients.
In past logistic regression models I have used the following code
round(exp(cbind(OR=coef(mclus5),confint(mclus5))),3)
This would very nicely provide what I want, but it does not seem to work with the model I have run.
Does anyone know a way that I can get this output for my model through R?
The only real difference is that you have to use fixef() rather than coef() to extract the fixed-effect coefficients (coef() gives you the estimated coefficients for each group).
I'll illustrate with a built-in example from the lme4 package.
library("lme4")
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)
Fixed-effect coefficients and confidence intervals, log-odds scale:
cc <- confint(gm1,parm="beta_") ## slow (~ 11 seconds)
ctab <- cbind(est=fixef(gm1),cc)
(If you want faster-but-less-accurate Wald confidence intervals you can use confint(gm1, parm="beta_", method="Wald") instead; this will be equivalent to @Gorka's answer but marginally more convenient.)
Exponentiate to get odds ratios:
rtab <- exp(ctab)
print(rtab,digits=3)
## est 2.5 % 97.5 %
## (Intercept) 0.247 0.149 0.388
## period2 0.371 0.199 0.665
## period3 0.324 0.165 0.600
## period4 0.206 0.082 0.449
A marginally simpler/more general solution:
library(broom.mixed)
tidy(gm1,conf.int=TRUE,exponentiate=TRUE,effects="fixed")
for Wald intervals, or add conf.method="profile" for profile confidence intervals.
I believe there is another, much faster way (if you are OK with a less accurate result).
From: http://www.ats.ucla.edu/stat/r/dae/melogit.htm
First we get the confidence intervals for the Estimates
se <- sqrt(diag(vcov(mclus5)))
# table of estimates with 95% CI
tab <- cbind(Est = fixef(mclus5),
             LL = fixef(mclus5) - 1.96 * se,
             UL = fixef(mclus5) + 1.96 * se)
Then the odds ratios with 95% CI
print(exp(tab), digits=3)
Another option, I believe, is to just use the emmeans package:
library(emmeans)
data.frame(confint(pairs(emmeans(fit, ~ factor_name,type="response"))))
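As a concrete check of this route, it can be applied to the gm1 example fitted above (a sketch; the pairwise contrasts come back as odds ratios because they are formed on the log-odds scale and then back-transformed):
library(emmeans)
emm <- emmeans(gm1, ~ period, type = "response")
confint(emm)          # estimated probability for each period with 95% CI
confint(pairs(emm))   # pairwise odds ratios between periods with 95% CI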
