Kaplan-Meier survival plot - R

Good morning,
I am having trouble understanding some of my outputs for my Kaplan Meier analyses.
I have managed to produce the following plots and outputs using ggsurvplot and survfit.
I first plotted the survival of 55 nests over time, and then did the same for the top predictors of nest failure, one of which is microtopography, as seen in this example.
Call: npsurv(formula = (S) ~ 1, data = nestdata, conf.type = "log-log")
26 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
55 45 0 13 29 2 NA
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
Microtopography=0 14 13 0 1 NA NA NA
Microtopography=1 26 21 0 7 NA 29 NA
Microtopography=2 12 8 0 5 3 2 NA
So, I have two primary questions.
1. The survival curves are for a ground-nesting bird with an egg incubation time of 21-23 days. Incubation time is the number of days the hen sits on the eggs before they hatch. Knowing that, how is it possible that the median survival time in plot #1 is 29 days? It seems to fit with the literature I have read on this same species; however, I assume it has something to do with the left censoring in my models, but I am honestly at a loss. If anyone has any insight, or even any literature that could help me understand this concept, I would really appreciate it.
2. I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point, the median survival times returned are NA. I understand I can choose another quantile, such as 0.75, but in this example that still wouldn't help me, because microtopography 0 never drops below 0.9 or so. How would one go about reporting these data? Would the workaround be to choose a set of survival times, using:
summary(s, times = c(7, 14, 21, 29))
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
Microtopography=0
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 3 0 0 1.000 0.0000 1.000 1.000
14 7 0 0 1.000 0.0000 1.000 1.000
21 13 0 0 1.000 0.0000 1.000 1.000
29 8 1 5 0.909 0.0867 0.508 0.987
Microtopography=1
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 9 0 0 1.000 0.0000 1.000 1.000
14 17 1 0 0.933 0.0644 0.613 0.990
21 21 3 0 0.798 0.0909 0.545 0.919
29 15 3 7 0.655 0.1060 0.409 0.819
Microtopography=2
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 1 2 0 0.333 0.272 0.00896 0.774
14 7 1 0 0.267 0.226 0.00968 0.686
21 8 1 0 0.233 0.200 0.00990 0.632
29 3 1 5 0.156 0.148 0.00636 0.504

Late to the party...
The median survival time of 29 days is the median time that birds of this species are expected to remain in the egg until they hatch, based on your data. Your median of 21-24 (based on ?) is probably based on many experiments/studies of eggs that hatched, ignoring those that haven't hatched yet (those that failed?).
From your overall survival curve, it is clear that some eggs have not yet hatched, even after more than 35 days. These are taken into account when calculating the expected survival times. If you think that these eggs will fail, then omit them. Otherwise, the software cannot possibly know that they will eventually fail. But how can anyone know for sure if an egg is going to fail, even after 30 days? Is there a known maximum hatching time? The record-breaker of all hatched eggs?
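To make that concrete, here is a minimal sketch with toy data (not the poster's data) using the survival package: eggs still under observation are censored, and the reported median can change, or only become estimable at all, once they are assumed to have failed.
library(survival)
time  <- c(10, 21, 25, 30, 35, 35, 35, 35)   # days each nest was followed
event <- c( 1,  1,  0,  0,  0,  0,  0,  0)   # 1 = event observed, 0 = censored (still incubating)
survfit(Surv(time, event) ~ 1)               # median is NA: the curve never reaches 0.5
event2 <- ifelse(time == 35, 1, event)       # now assume the day-35 eggs failed
survfit(Surv(time, event2) ~ 1)              # a finite median is reported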

These are not really R questions, so this question might be more appropriate for the statistics site. But the following might help.
how is it possible that the median survival time in plot #1 is 29 days?
The median survival is where the survival curve passes the 50% mark. Eyeballing it, 29 days looks right.
I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point.
Given your data, you cannot compare the medians. You can compare the 75% or 90% quantiles, if you must. You can compare the point survival at, say, 30 days. You can compare the truncated average (restricted mean) survival over the first 30 days.
In order to compare the medians, you would have to make an assumption. A reasonable assumption would be exponential decay after some time point that includes at least one failure.
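For what it's worth, those alternatives map onto standard survival-package calls; this is a sketch assuming fit2 is a plain survfit version of the grouped model in the question:
library(survival)
fit2 <- survfit(S ~ Microtopography, data = nestdata, conf.type = "log-log")
summary(fit2, times = 30)      # point survival for each group at day 30
print(fit2, rmean = 30)        # truncated (restricted) mean survival over days 0-30
quantile(fit2, probs = 0.25)   # time at which survival drops to 75%, where defined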

Related

Do results of survival analysis only pertain to the observations analyzed?

Hey guys, so I taught myself time-to-event analysis recently and I need some help understanding it. I made some Kaplan-Meier survival curves.
Sure, the number of observations within each node is small but let's pretend that I have plenty.
library(dplyr); library(survival)
K <- HF %>% filter(serum_creatinine <= 1.8, ejection_fraction <= 25)
fit <- survfit(Surv(time, DEATH_EVENT) ~ 1, data = K)
## Call: survfit(formula = Surv(time, DEATH_EVENT) ~ 1, data = K)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 20 36 5 0.881 0.0500 0.788 0.985
## 45 33 3 0.808 0.0612 0.696 0.937
## 60 31 3 0.734 0.0688 0.611 0.882
## 80 23 6 0.587 0.0768 0.454 0.759
## 100 17 1 0.562 0.0776 0.429 0.736
## 110 17 0 0.562 0.0776 0.429 0.736
## 120 16 1 0.529 0.0798 0.393 0.711
## 130 14 0 0.529 0.0798 0.393 0.711
## 140 14 0 0.529 0.0798 0.393 0.711
## 150 13 1 0.488 0.0834 0.349 0.682
If someone were to ask me about the third node, would the following statement be valid?
For any new patient that walks into this hospital with <= 1.8 in Serum_Creatine & <= 25 in Ejection Fraction, their probability of survival is 53% after 140 days.
What about:
The survival distributions for the samples analyzed, and no other future incoming samples, are visualized above.
I want to make sure these statements are correct.
I would also like to know whether logistic regression could be used to predict the binary variable DEATH_EVENT. Since the TIME variable determines how much weight one patient's death at 20 days carries relative to another patient's death at 175 days, I understand that this needs to be accounted for.
If logistic regression can be used, does that imply anything about keeping or removing the TIME variable?
Here are some thoughts:
Logistic regression is not appropriate in your case, as it is not the correct method for time-to-event analysis.
If the clinical outcome observed is “either-or,” such as if a patient suffers an MI or not, logistic regression can be used.
However, if the information on the time to MI is the observed outcome, data are analyzed using statistical methods for survival analysis.
Text from here
If you want to use a regression model in survival analysis, then you should use a Cox proportional hazards model. To understand the difference between a Kaplan-Meier analysis and a Cox proportional hazards model, you need to understand both of them.
The next step would be to understand what a univariable Cox proportional hazards model is, in contrast to a multivariable one.
Once you understand all three methods (Kaplan-Meier, univariable Cox, and multivariable Cox), you can answer your own question of whether this is a valid statement:
For any new patient that walks into this hospital with <= 1.8 in Serum_Creatine & <= 25 in Ejection Fraction, their probability of survival is 53% after 140 days.
There is nothing wrong with stating the results for a subgroup from a Kaplan-Meier analysis, but such a statement carries a different weight when it comes from a multivariable Cox regression analysis.
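As a concrete illustration, a minimal multivariable Cox sketch for these data might look as follows, assuming the column names that appear in the filter() call above:
library(survival)
cox <- coxph(Surv(time, DEATH_EVENT) ~ serum_creatinine + ejection_fraction,
             data = HF)
summary(cox)    # hazard ratios with 95% confidence intervals
cox.zph(cox)    # test the proportional-hazards assumption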

Fitting logistic curve to data with SSlogis: error NA/NaN/Inf

I am trying to fit a logistic curve to my data for two separate years. The data are very similar for the two years, in exactly the same format; I removed all observations with NAs or NaNs and made sure the columns are numeric. However, for some reason the script works perfectly for the first year and does not work for the second.
Here is a bit of what my 2013 data looks like (101 observations total):
x y
1 0.070 95.392
2 0.079 100.000
3 0.109 100.000
4 0.072 100.000
5 -0.005 100.000
6 0.014 100.000
7 -0.008 100.000
8 0.307 52.523
9 0.696 0.000
10 -0.045 100.000
And the 2018 data (116 observations total):
x y
1 0.133 100.000
2 0.139 100.000
3 0.152 100.000
4 0.124 100.000
5 0.051 100.000
6 0.062 100.000
7 -0.050 100.000
8 0.356 80.282
9 0.545 0.000
10 -0.029 62.857
Here is my script:
## 2013 data
x <- veg13$`Elevation`
y <- veg13$`pclowspecies`
## predict parameters for the logistic curve and fit it
fit <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = data.frame(x, y))
summary(fit)
It works fine for 2013, but when I repeat with 2018 data I get the following error:
Error in qr.default(.swts * gr) :
NA/NaN/Inf in foreign function call (arg 1)
I have read about other people who had the same issue, but their solutions do not work for me, since I don't have any NAs or NaNs in my data.
Thank you for your help!
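One common line of attack for this error, offered as a hedged sketch rather than a confirmed fix, is to inspect the 2018 inputs and bypass the self-starter with explicit starting values; veg18 and its columns are assumed to mirror the 2013 objects, and the start values are rough guesses read off the data preview above.
x <- veg18$`Elevation`
y <- veg18$`pclowspecies`
summary(x); summary(y)        # look for Inf, constant columns, or stray values
plot(x, y)                    # check that a logistic shape is plausible
## same mean function as SSlogis, but with user-supplied start values
## (negative scal because y decreases as x increases in this data)
fit18 <- nls(y ~ Asym / (1 + exp((xmid - x) / scal)),
             start = list(Asym = 100, xmid = 0.4, scal = -0.05),
             data = data.frame(x, y))
summary(fit18)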

Logistic regression detection probability

I'm attempting to assess the key covariates in detection probability.
I'm currently using this code
model1 <- glm(P ~ Width + MBL + DFT + SGP + SGC + Depth,
              family = binomial("logit"),
              data = dframe2, na.action = na.exclude)
summary.lm(model1)
My data is structured like this:
Site Transect Q ID P Width DFT Depth Substrate SGP SGC MBL
1 Vr1 Q1 1 0 NA NA 0.5 Sand 0 0 0.00000
2 Vr1 Q2 2 0 NA NA 1.4 Sand&Searass 1 30 19.14286
3 Vr1 Q3 3 0 NA NA 1.7 Sand&Searass 1 15 16.00000
4 Vr1 Q4 4 1 17 0 2.0 Sand&Searass 1 95 35.00000
5 Vr1 Q5 5 0 NA NA 2.4 Sand 0 0 0.00000
6 Vr1 Q6 6 0 NA NA 2.9 Sand&Searass 1 50 24.85714
My sample size is really small (n=12) and I only have ~70 rows of data.
When I run the code, it returns:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.457e+01 4.519e+00 5.437 0.00555 **
Width 1.810e-08 1.641e-01 0.000 1.00000
MBL -2.827e-08 9.906e-02 0.000 1.00000
DFT 2.905e-07 1.268e+00 0.000 1.00000
SGP 1.064e-06 2.691e+00 0.000 1.00000
SGC -2.703e-09 3.289e-02 0.000 1.00000
Depth 1.480e-07 9.619e-01 0.000 1.00000
SubstrateSand&Searass -8.516e-08 1.626e+00 0.000 1.00000
Does this mean my data set is just too small to assess detection probability, or am I doing something wrong?
According to Hair (author of the book Multivariate Data Analysis), you need at least 15 examples for each feature (column) of your data. If you have 12, you can only select one feature.
So, run a t-test comparing the means of each feature across the two classes (0 and 1 of the target, i.e., the dependent variable) and choose the feature (independent variable) whose mean difference between the classes is largest. That variable is best placed to create a boundary splitting the two classes.
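A rough sketch of that screening step (column names taken from the question's data frame; Width and DFT look like they are recorded only when P = 1, so a two-group t-test is not meaningful for them):
num_vars <- c("MBL", "SGP", "SGC", "Depth")
pvals <- sapply(num_vars, function(v)
  t.test(dframe2[[v]] ~ dframe2$P)$p.value)
sort(pvals)   # the smallest p-value flags the strongest single separator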

nMDS with seemingly poor Shepard's Plot but good ANOSIM/ADONIS

Having produced a Bray-Curtis dissimilarity from my Hellinger-transformed data (26 samples, 3000+ species/OTUs), I went on to build an nMDS plot.
I got the following metrics:
Dimensions: 2
Stress: 0.111155
Stress type 1, weak ties
Two convergent solutions found after 2 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘ALG_Hellinger’
However, the corresponding Shepard plot (not shown) looked as if, despite the good fit statistics, the BC dissimilarity does not have enough resolution to differentiate across samples. Is this right?
Testing it with ANOSIM, I got the following:
ANOSIM statistic R: 1
Significance: 0.001
Permutation: free
Number of permutations: 999
Upper quantiles of permutations (null model):
90% 95% 97.5% 99%
0.123 0.166 0.203 0.249
Dissimilarity ranks between and within classes:
0% 25% 50% 75% 100% N
Between 97 154.0 212.0 266.50 325 229
Cliona celata complex 19 32.0 46.0 59.00 66 21
Cliona viridis 3 26.5 37.5 48.50 60 6
Dysidea fragilis 56 56.5 57.0 59.50 62 3
Phorbas fictitius 1 18.5 48.5 79.75 96 66
And ADONIS told me the same:
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
SCIE_NAME 3 7.8738 2.62461 43.049 0.85445 0.001 ***
Residuals 22 1.3413 0.06097 0.14555
Total 25 9.2151 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
That is, the differences among the samples are significant, but the MDS ordination seems somewhat misleading.
How can I test other aspects of the MDS, or change anything about this analysis, if changes are even needed?
Thank you in advance!
André
I don't think that the Shepard plot is poor. Rather, it shows that your data are strongly clustered. This is consistent with adonis, which says that most (85%) of the variation is between clusters. It is also consistent with anosim, which shows that within-cluster distances are much shorter than between-cluster distances.
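If you want to probe this further, a short vegan sketch (assuming mds is your metaMDS object) shows the Shepard plot and the per-sample goodness of fit:
library(vegan)
stressplot(mds)                            # the Shepard plot with its fit statistics
gof <- goodness(mds)                       # per-sample goodness of fit
plot(mds, display = "sites", type = "n")
points(mds, display = "sites", cex = 2 * gof / mean(gof))   # larger symbol = worse fit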

survfit() Shade 95% confidence interval survival plot

I'm not sure... this can't be that difficult, I think, but I can't work it out. If you run:
library(survival)
leukemia.surv <- survfit(Surv(time, status) ~ 1, data = aml)
plot(leukemia.surv, lty = 2:3)
you see the survival curve and its 95% confidence interval. Instead of showing two lines for the upper and lower 95% CI, I'd like to shade the area between the upper and lower 95% boundaries.
Does this have to be done by something like polygon()? All coordinates can be found in the summary...
> summary(leukemia.surv)
Call: survfit(formula = Surv(time, status) ~ 1, data = aml)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 23 2 0.9130 0.0588 0.8049 1.000
8 21 2 0.8261 0.0790 0.6848 0.996
9 19 1 0.7826 0.0860 0.6310 0.971
12 18 1 0.7391 0.0916 0.5798 0.942
13 17 1 0.6957 0.0959 0.5309 0.912
18 14 1 0.6460 0.1011 0.4753 0.878
23 13 2 0.5466 0.1073 0.3721 0.803
27 11 1 0.4969 0.1084 0.3240 0.762
30 9 1 0.4417 0.1095 0.2717 0.718
31 8 1 0.3865 0.1089 0.2225 0.671
33 7 1 0.3313 0.1064 0.1765 0.622
34 6 1 0.2761 0.1020 0.1338 0.569
43 5 1 0.2208 0.0954 0.0947 0.515
45 4 1 0.1656 0.0860 0.0598 0.458
48 2 1 0.0828 0.0727 0.0148 0.462
Is there an existing function to shade the 95% CI area?
You can use data from the summary() to make your own plot with the confidence interval drawn as a polygon.
First, save the summary() as an object. The data for plotting are located in the variables time, surv, upper and lower.
mod <- summary(leukemia.surv)
Now use plot() to define the plotting region, then draw the confidence interval with polygon(). You have to supply the x values followed by the x values in reverse order, and for the y values the lower values followed by the reversed upper values. Finally, add the survival line with lines(); adding the argument type = "s" draws the line as steps.
with(mod, plot(time, surv, type = "n", xlim = c(5, 50), ylim = c(0, 1)))
with(mod, polygon(c(time, rev(time)), c(lower, rev(upper)),
                  col = "grey75", border = FALSE))
with(mod, lines(time, surv, type = "s"))
I've developed a function to plot shaded confidence intervals in survival curves. You can find it here: Plotting survival curves in R with ggplot2
Maybe you can find it useful.
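For completeness, survminer's ggsurvplot() will also shade the confidence band in a single call; this is an alternative to the hand-rolled polygon, not the function linked above:
library(survminer)
ggsurvplot(leukemia.surv, data = aml, conf.int = TRUE)   # shaded 95% CI band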
