I'm not sure... this can't be that difficult, I think, but I can't work it out. If you run:
library(survival)
leukemia.surv <- survfit(Surv(time, status) ~ 1, data = aml)
plot(leukemia.surv, lty = 2:3)
you see the survival curve and its 95% confidence interval. Instead of showing two lines for the upper and lower 95% CI, I'd like to shade the area between the upper and lower 95% boundaries.
Does this have to be done by something like polygon()? All coordinates can be found in the summary...
> summary(leukemia.surv)
Call: survfit(formula = Surv(time, status) ~ 1, data = aml)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 23 2 0.9130 0.0588 0.8049 1.000
8 21 2 0.8261 0.0790 0.6848 0.996
9 19 1 0.7826 0.0860 0.6310 0.971
12 18 1 0.7391 0.0916 0.5798 0.942
13 17 1 0.6957 0.0959 0.5309 0.912
18 14 1 0.6460 0.1011 0.4753 0.878
23 13 2 0.5466 0.1073 0.3721 0.803
27 11 1 0.4969 0.1084 0.3240 0.762
30 9 1 0.4417 0.1095 0.2717 0.718
31 8 1 0.3865 0.1089 0.2225 0.671
33 7 1 0.3313 0.1064 0.1765 0.622
34 6 1 0.2761 0.1020 0.1338 0.569
43 5 1 0.2208 0.0954 0.0947 0.515
45 4 1 0.1656 0.0860 0.0598 0.458
48 2 1 0.0828 0.0727 0.0148 0.462
Is there an existing function to shade the 95% CI area?
You can use the data from summary() to make your own plot, with the confidence interval drawn as a polygon.
First, save the summary() as an object. The data for plotting are located in the variables time, surv, upper and lower.
mod <- summary(leukemia.surv)
Now use plot() to define the plotting region. Then draw the confidence interval with polygon(): supply the x values followed by the x values in reverse order, and for the y values use the lower values followed by the reversed upper values. Finally, add the survival line with lines(); the argument type = "s" draws the line as steps.
# empty plot to set up the plotting region
with(mod, plot(time, surv, type = "n", xlim = c(5, 50), ylim = c(0, 1)))
# shaded CI: x forward then reversed, y = lower then reversed upper
with(mod, polygon(c(time, rev(time)), c(lower, rev(upper)),
                  col = "grey75", border = FALSE))
# survival curve drawn as steps
with(mod, lines(time, surv, type = "s"))
I've developed a function to plot shaded confidence intervals for survival curves. You can find it here: Plotting survival curves in R with ggplot2
Maybe you'll find it useful.
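If you just want the shaded band in ggplot2 without an extra function, here is a minimal sketch, assuming only the survival and ggplot2 packages; note that geom_ribbon interpolates with straight segments, so a true stepped band would need something like pammtools::geom_stepribbon.

library(survival)
library(ggplot2)

fit <- survfit(Surv(time, status) ~ 1, data = aml)

# survfit stores everything needed: event times, estimates, and CI bounds
df <- data.frame(time  = fit$time,  surv  = fit$surv,
                 lower = fit$lower, upper = fit$upper)

ggplot(df, aes(time, surv)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "grey75") +
  geom_step() +                      # survival curve as steps
  coord_cartesian(ylim = c(0, 1))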
Related
Hey guys, so I taught myself time-to-event analysis recently and I need some help understanding it. I made some Kaplan-Meier survival curves.
Sure, the number of observations within each node is small but let's pretend that I have plenty.
library(dplyr)  # needed for %>% and filter()

K <- HF %>%
  filter(serum_creatinine <= 1.8, ejection_fraction <= 25)
## Call: survfit(formula = Surv(time, DEATH_EVENT) ~ 1, data = K)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 20 36 5 0.881 0.0500 0.788 0.985
## 45 33 3 0.808 0.0612 0.696 0.937
## 60 31 3 0.734 0.0688 0.611 0.882
## 80 23 6 0.587 0.0768 0.454 0.759
## 100 17 1 0.562 0.0776 0.429 0.736
## 110 17 0 0.562 0.0776 0.429 0.736
## 120 16 1 0.529 0.0798 0.393 0.711
## 130 14 0 0.529 0.0798 0.393 0.711
## 140 14 0 0.529 0.0798 0.393 0.711
## 150 13 1 0.488 0.0834 0.349 0.682
If someone were to ask me about the third node, would the following statements be valid?
For any new patient who walks into this hospital with serum_creatinine <= 1.8 and ejection_fraction <= 25, their probability of survival is 53% after 140 days.
What about:
The survival distributions for the samples analyzed, and no other future incoming samples, are visualized above.
I want to make sure these statements are correct.
I would also like to know whether logistic regression could be used to predict the binary variable DEATH_EVENT. Since the TIME variable determines how much weight one patient's death at 20 days carries relative to another patient's death at 175 days, I understand that this needs to be accounted for.
If logistic regression can be used, does that imply anything about keeping or removing the variable TIME?
Here are some thoughts:
Logistic regression is not appropriate in your case, as it is not the correct method for time-to-event analysis.
If the clinical outcome observed is “either-or,” such as if a patient suffers an MI or not, logistic regression can be used.
However, if the information on the time to MI is the observed outcome, data are analyzed using statistical methods for survival analysis.
Text from here
If you want to use a regression model in survival analysis, then you should use a Cox proportional hazards model. To understand the difference between a Kaplan-Meier analysis and a Cox proportional hazards model, you should understand both of them.
The next step would be to understand the difference between a univariable and a multivariable Cox proportional hazards model.
Once you understand all three methods (Kaplan-Meier, univariable Cox, and multivariable Cox), you can answer your question of whether this is a valid statement:
For any new patient who walks into this hospital with serum_creatinine <= 1.8 and ejection_fraction <= 25, their probability of survival is 53% after 140 days.
There is nothing wrong with stating the results of a subgroup from a Kaplan-Meier analysis, but the statement carries different weight if it comes from a multivariable Cox regression analysis.
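For illustration, a minimal sketch of both model types, assuming HF is the full heart-failure data frame from the question (age is an assumed extra column, included only to show adjustment):

library(survival)

# Univariable Cox model: one predictor at a time
cox_uni <- coxph(Surv(time, DEATH_EVENT) ~ ejection_fraction, data = HF)

# Multivariable Cox model: each hazard ratio is adjusted for the others
# (age is assumed here, purely for illustration)
cox_multi <- coxph(Surv(time, DEATH_EVENT) ~ ejection_fraction +
                     serum_creatinine + age, data = HF)
summary(cox_multi)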
I have a dataset that looks like:
Treatment Surface ex.time excision antib.time antibiotic inf.time infection
1 0 15 12 0 12 0 12 0
2 0 20 9 0 9 0 9 0
3 0 15 13 0 13 0 7 1
4 0 20 11 1 29 0 29 0
5 0 70 28 1 31 0 4 1
6 0 20 11 0 11 0 8 1
The variables represented in the dataset are as follows:
Observation number
Treatment
0 = routine bathing, 1 = body cleansing
Surface
Percentage of total surface area burned
Ex.time
Time to excision or on-study time
Excision
indicator: 1 = yes, 0 = no
Antib.time
Time to prophylactic antibiotic treatment or on-study time
Antibiotic
indicator: 1 = yes, 0 = no
Inf.time
Time to Staphylococcus aureus infection or on-study time
Infection
indicator: 1 = yes, 0 = no
I want to model the time until infection as a function of treatment, surface, time until antibiotic treatment, and time until excision. According to other posts, this dataset must be transformed from wide to long, but I am not sure how to do that. Then, once the data is in the right format, I would use this formula:
coxph(Surv(start, stop, event) ~ m, data=times)
Until now I have run just a normal Cox regression, but I guess this is not correct because the time dependency is not accounted for? (One possible wide-to-long transformation is sketched after the output below.)
coxph(formula = Surv(inf.time, infection) ~ Treatment + Surface +
ex.time + antib.time, data = BurnData)
n= 154, number of events= 48
coef exp(coef) se(coef) z Pr(>|z|)
Treatment -0.453748 0.635243 0.300805 -1.508 0.131
Surface 0.006932 1.006956 0.007333 0.945 0.345
ex.time 0.013503 1.013595 0.018841 0.717 0.474
antib.time 0.009546 1.009592 0.009560 0.999 0.318
exp(coef) exp(-coef) lower .95 upper .95
Treatment 0.6352 1.5742 0.3523 1.145
Surface 1.0070 0.9931 0.9926 1.022
ex.time 1.0136 0.9866 0.9768 1.052
antib.time 1.0096 0.9905 0.9909 1.029
Concordance= 0.576 (se = 0.046 )
Rsquare= 0.041 (max possible= 0.942 )
Likelihood ratio test= 6.5 on 4 df, p=0.1648
Wald test = 6.55 on 4 df, p=0.1618
Score (logrank) test = 6.71 on 4 df, p=0.1519
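One common way to build the counting-process (start, stop) format is survival::tmerge. A minimal sketch, assuming each time column holds the event time when its indicator is 1 and the censoring time otherwise (the id column is created here from the row number):

library(survival)

BurnData$id <- seq_len(nrow(BurnData))

# First call sets the follow-up window and the outcome
long <- tmerge(BurnData, BurnData, id = id,
               infect = event(inf.time, infection))

# Time-dependent covariates: switch from 0 to 1 at the treatment time;
# NA times (indicator = 0) are skipped by tmerge
long <- tmerge(long, BurnData, id = id,
               excised = tdc(ifelse(excision == 1, ex.time, NA)),
               antib   = tdc(ifelse(antibiotic == 1, antib.time, NA)))

fit <- coxph(Surv(tstart, tstop, infect) ~ Treatment + Surface +
               excised + antib, data = long)
summary(fit)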
Good morning,
I am having trouble understanding some of the outputs from my Kaplan-Meier analyses.
I have managed to produce the following plots and outputs using ggsurvplot and survfit.
I first made a plot of the survival time of 55 nests over time, and then did the same with the top predictors for nest failure, one being microtopography, as seen in this example.
Call: npsurv(formula = (S) ~ 1, data = nestdata, conf.type = "log-log")
26 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
55 45 0 13 29 2 NA
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
Microtopography=0 14 13 0 1 NA NA NA
Microtopography=1 26 21 0 7 NA 29 NA
Microtopography=2 12 8 0 5 3 2 NA
So, I have two primary questions.
1. The survival curves are for a ground-nesting bird with an egg incubation time of 21-23 days. Incubation time is the number of days the hen sits on the eggs before they hatch. Knowing that, how is it possible that the median survival time in plot #1 is 29 days? It seems to fit with the literature I have read on this same species; however, I assume it has something to do with the left censoring in my models, but I am honestly at a loss. If anyone has any insight or even any literature that could help me understand this concept, I would really appreciate it.
2. I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point, the median survival times returned are NA. I understand I can choose another interval, such as 0.75, but in this example that still wouldn't help me because microtopography 0 never drops below 0.9 or so. How would one go about reporting this data? Would the workaround be to choose a survival interval, using:
summary(s,times=c(7,14,21,29))
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
Microtopography=0
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 3 0 0 1.000 0.0000 1.000 1.000
14 7 0 0 1.000 0.0000 1.000 1.000
21 13 0 0 1.000 0.0000 1.000 1.000
29 8 1 5 0.909 0.0867 0.508 0.987
Microtopography=1
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 9 0 0 1.000 0.0000 1.000 1.000
14 17 1 0 0.933 0.0644 0.613 0.990
21 21 3 0 0.798 0.0909 0.545 0.919
29 15 3 7 0.655 0.1060 0.409 0.819
Microtopography=2
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 1 2 0 0.333 0.272 0.00896 0.774
14 7 1 0 0.267 0.226 0.00968 0.686
21 8 1 0 0.233 0.200 0.00990 0.632
29 3 1 5 0.156 0.148 0.00636 0.504
Late to the party...
The median survival time of 29 days is the median incubation time that birds of this species are expected to be in the egg until they hatch, based on your data. Your median of 21-23 days (based on what?) is probably based on many experiments/studies of eggs that have hatched, ignoring those that haven't hatched yet (those that failed?).
From your overall survival curve, it is clear that some eggs have not yet hatched, even after more than 35 days. These are taken into account when calculating the expected survival times. If you think that these eggs will fail, then omit them. Otherwise, the software cannot possibly know that they will eventually fail. But how can anyone know for sure if an egg is going to fail, even after 30 days? Is there a known maximum hatching time? The record-breaker of all hatched eggs?
These are not really R questions, so this question might be more appropriate for the statistics site. But the following might help.
how is it possible that the median survival time in plot #1 is 29 days?
The median survival is where the survival curve passes the 50% mark. Eyeballing it, 29 days looks right.
I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point.
Given your data, you cannot compare the median. You can compare the 75% or 90%, if you must. You can compare the point survival at, say, 30 days. You can compare the truncated average survival in the first 30 days.
In order to compare the medians, you would have to make an assumption. A reasonable assumption would be an exponential decay after some tenure point that includes at least one failure.
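For example, a minimal sketch of the fixed-time comparison, assuming fit is the npsurv/survfit object from the question:

library(survival)

# Point survival (with CIs) for each microtopography group at day 30
summary(fit, times = 30)

# Restricted mean survival time over the first 30 days, per group
print(fit, rmean = 30)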
My data is as follows
# A tibble: 24 x 3
time OD600 strain
<dbl> <dbl> <chr>
1 0.0001 0.0001 M12-611020
2 1.0000 0.0880 M12-611020
3 3.0000 0.2110 M12-611020
4 4.0000 0.2780 M12-611020
5 4.5000 0.4040 M12-611020
6 5.0000 0.6060 M12-611020
7 5.5000 0.7780 M12-611020
8 6.0000 0.9020 M12-611020
9 6.5000 1.0240 M12-611020
10 8.0000 1.1000 M12-611020
11 0.0001 0.0001 M12-611025
12 1.0000 0.0770 M12-611025
13 3.0000 0.0880 M12-611025
14 4.0000 0.1250 M12-611025
15 5.0000 0.3040 M12-611025
16 5.5000 0.4210 M12-611025
17 6.0000 0.5180 M12-611025
18 6.5000 0.6160 M12-611025
19 7.0000 0.7180 M12-611025
20 7.5000 0.8520 M12-611025
21 8.0000 0.9400 M12-611025
22 8.5000 0.9500 M12-611025
23 9.0000 0.9680 M12-611025
I have two "strains" in the data frame, each with their own sets of values for "time" and "OD600".
I have so far been plotting with ggplot as follows (removing aesthetics for simplicity), using "loess" to fit a curve:
growth_curve_SE <- growth_curve +
stat_smooth(aes(group=strain,fill=strain, colour = strain) ,method = "loess", se = T, alpha=0.2 , span = 0.8) +
geom_point(aes(fill=factor(strain)),alpha=0.5 , size=3,shape = 21,colour = "black", stroke = 1)
What I ultimately want to achieve is fitting a 5-parameter logistic regression rather than "loess" as the method, since it is a better model for the data and fits a more accurate curve.
I used the package "nplr" to fit the regression for multiple strains, using a list split by strain:
strain_list <- split(multi_strain, multi_strain$strain)
np2 <- lapply(strain_list, function(tmp) {nplr(tmp$time, tmp$OD600, useLog = F)})
Which fits the regression:
$`M12-611020`
Instance of class nplr
Call:
nplr(x = tmp$time, y = tmp$OD600, useLog = F)
weights method: residuals
5-P logistic model
Bottom asymptote: 0.03026607
Top asymptote: 1.104278
Inflexion point at (x, y): 5.297454 0.6920488
Goodness of fit: 0.9946967
Weighted Goodness of fit: 0.9998141
Standard error: 0.0308006 0.01631115
$`M12-611025`
Instance of class nplr
Call:
nplr(x = tmp$time, y = tmp$OD600, useLog = F)
weights method: residuals
5-P logistic model
Bottom asymptote: -0.0009875526
Top asymptote: 0.9902298
Inflexion point at (x, y): 6.329304 0.5919818
Goodness of fit: 0.9956551
Weighted Goodness of fit: 0.9998606
Standard error: 0.02541948 0.01577407
Any ideas how I can achieve the same in ggplot using the "stat_smooth" command with a 5-parameter logistic regression, either with or without the "nplr" package?
For anyone interested, I found a workaround for this.
The nplr package allows you to output the fitted curve as a series of x and y coordinates as follows:
x <- getXcurve(data)
y <- getYcurve(data)
From this I used the "geom_line" function with these x and y parameters, and that gave me the n-parameter logistic regression I was after in the form of a line. Layer on the "geom_point" as above and you get a good-looking graph.
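Putting it together, a minimal sketch of that workaround on the data from the question (multi_strain and the per-strain fits are as defined above):

library(ggplot2)
library(nplr)

strain_list <- split(multi_strain, multi_strain$strain)
fits <- lapply(strain_list, function(tmp) nplr(tmp$time, tmp$OD600, useLog = FALSE))

# Stack the fitted-curve coordinates from each model into one data frame
curve_df <- do.call(rbind, lapply(names(fits), function(s) {
  data.frame(time = getXcurve(fits[[s]]),
             OD600 = getYcurve(fits[[s]]),
             strain = s)
}))

ggplot(multi_strain, aes(time, OD600, colour = strain)) +
  geom_point() +
  geom_line(data = curve_df)   # the 5-parameter logistic fits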
I would like a second-order (is that the right term?) regression line plotted through zero, and crucially I need the equation for this relationship.
Here's my data:
ecoli_ug_ml A420 rpt
1 0 0.000 1
2 10 0.129 1
3 20 0.257 1
4 30 0.379 1
5 40 0.479 1
6 50 0.579 1
7 60 0.673 1
8 70 0.758 1
9 80 0.838 1
10 90 0.912 1
11 100 0.976 1
12 0 0.000 2
13 10 0.126 2
14 20 0.257 2
15 30 0.382 2
16 40 0.490 2
17 50 0.592 2
18 60 0.684 2
19 70 0.772 2
20 80 0.847 2
21 90 0.917 2
22 100 0.977 2
23 0 0.000 3
24 10 0.125 3
25 20 0.258 3
26 30 0.376 3
27 40 0.488 3
28 50 0.582 3
29 60 0.681 3
30 70 0.768 3
31 80 0.846 3
32 90 0.915 3
33 100 0.977 3
My plot looks like this: (sci2 is just some axis and text formatting, can include if necessary)
ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x - 1,2)) +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u."))) +
sci2
When I view this, the fit of the line to the points is spectacularly good.
When I check out coef, I think there is a non-zero y-intercept (which is unacceptable for my purposes), but to be honest I don't really understand these lines:
coef(lm(A420 ~ poly(ecoli_ug_ml, 2, raw=TRUE), data = calib))
(Intercept) poly(ecoli_ug_ml, 2, raw = TRUE)1
-1.979021e-03 1.374789e-02
poly(ecoli_ug_ml, 2, raw = TRUE)2
-3.964258e-05
Therefore, I assume the plot is actually not quite right either.
So, what I need is to generate a regression line forced through zero, get the equation for it, and understand what that equation is saying. If I could annotate the plot area with the equation directly, I would be incredibly stoked.
I have spent approximately 8 hours trying to work this out now; I checked Excel and got a formula in 8 seconds, but I would really like to get into using R for this. Thanks!
To clarify: the primary purpose of this plot is not to demonstrate the distribution of these data, but rather to provide a visual confirmation that the equation I generate from these points fits the readings well.
summary(lm(A420~poly(ecoli_ug_ml,2,raw=T),data=calib))
# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T), data = calib)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.979e-03 1.926e-03 -1.028 0.312
# poly(ecoli_ug_ml, 2, raw = T)1 1.375e-02 8.961e-05 153.419 <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.964e-05 8.631e-07 -45.932 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.004379 on 30 degrees of freedom
# Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
# F-statistic: 8.343e+04 on 2 and 30 DF, p-value: < 2.2e-16
So the intercept is not exactly 0 but it is small compared to the Std. Error. In other words, the intercept is not significantly different from 0.
You can force a fit without the intercept this way (note the -1 in the formula):
summary(lm(A420~poly(ecoli_ug_ml,2,raw=T)-1,data=calib))
# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T) - 1, data = calib)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# poly(ecoli_ug_ml, 2, raw = T)1 1.367e-02 5.188e-05 263.54 <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.905e-05 6.396e-07 -61.05 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.004383 on 31 degrees of freedom
# Multiple R-squared: 1, Adjusted R-squared: 1
# F-statistic: 3.4e+05 on 2 and 31 DF, p-value: < 2.2e-16
Note that the coefficients do not change appreciably.
EDIT (Response to OP's comment)
The formula specified in stat_smooth(...) is just passed directly to the lm(...) function, so you can specify in stat_smooth(...) any formula that works in lm(...). The point of the results above is that, even without forcing the intercept to 0, it is extremely small (-2e-3) compared to the range in y (0-1), so plotting curves with and without will give nearly indistinguishable results. You can see this for yourself by running this code:
ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x,2,raw=T),colour="red") +
stat_smooth(method="lm", formula=y~-1+poly(x,2,raw=T),colour="blue") +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))
The blue and red curves are nearly, but not quite on top of each other (you may have to open up your plot window to see it). And no, you do not have to do this "outside of ggplot."
The problem you reported relates to using the default raw=F. This causes poly(...) to use orthogonal polynomials, which by definition have constant terms. So using y~-1+poly(x,2) doesn't really make sense, whereas using y~-1+poly(x,2,raw=T) does make sense.
Finally, if all this business of using poly(...) with or without raw=T is causing confusion, you can achieve the exact same result using formula = y ~ -1 + x + I(x^2). This fits a second-order polynomial (a*x + b*x^2) and suppresses the constant term.
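For instance, a minimal sketch of that formula both in the plot and for extracting the coefficients of the equation (calib and its column names are from the question):

library(ggplot2)

# Zero-intercept quadratic directly in stat_smooth
ggplot(calib, aes(ecoli_ug_ml, A420)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ -1 + x + I(x^2))

# The same model fitted explicitly; coef() returns a and b in y = a*x + b*x^2
coef(lm(A420 ~ -1 + ecoli_ug_ml + I(ecoli_ug_ml^2), data = calib))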
I think you are misinterpreting that intercept, and also how stat_smooth works. Polynomial fits done by statisticians typically do not use the raw=TRUE parameter. The default is FALSE, and the polynomials are constructed to be orthogonal, which allows proper statistical assessment of the fit improvement when looking at the standard errors. It is instructive to look at what happens if you attempt to eliminate the intercept by using -1 or 0+ in the formula. Try it with your data and code:
....+
stat_smooth(method="lm", formula=y~0+poly(x - 1,2)) + ...
You will see the fitted line intercepting the y axis at -0.5 and change. Now look at the non-raw value of the intercept.
coef(lm(A420~poly(ecoli_ug_ml,2),data=ecoli))
(Intercept) poly(ecoli_ug_ml, 2)1 poly(ecoli_ug_ml, 2)2
0.5466667 1.7772858 -0.2011251
So the intercept is shifting the whole curve upward to let the polynomial fit have the best fitting curvature. If you want to draw a line with ggplot2 that meets some different specification you should calculate it outside of ggplot2 and then plot it without the error bands because it really won't have the proper statistical properties.
Nonetheless, this is the way to apply what in this case is a trivial amount of adjustment, and I am offering it only as an illustration of how to add an externally derived set of values. I think ad hoc adjustments like this are dangerous in practice:
mod <- lm(A420 ~ poly(ecoli_ug_ml, 2), data = ecoli)
# shift the predictions so the fitted curve passes through zero at x = 0
ecoli$shifted_pred <- predict(mod) - predict(mod, newdata = list(ecoli_ug_ml = 0))

ggplot(ecoli, aes(ecoli_ug_ml, A420)) +
  geom_point(shape = ecoli$rpt) +
  scale_x_continuous(expression(paste(italic("E. coli"), ~"concentration, ", mu, g~mL^-1))) +
  scale_y_continuous(expression(paste(Absorbance["420nm"], ~", a.u."))) +
  # the shifted fit, drawn without error bands
  geom_line(data = ecoli, aes(x = ecoli_ug_ml, y = shifted_pred))