Do results of survival analysis only pertain to the observations analyzed? - r

Hey guys, so I taught myself time-to-event analysis recently and I need some help understanding it. I made some Kaplan-Meier survival curves.
Sure, the number of observations within each node is small but let's pretend that I have plenty.
K <- HF %>%
filter(serum_creatinine <= 1.8, ejection_fraction <= 25)
## Call: survfit(formula = Surv(time, DEATH_EVENT) ~ 1, data = K)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 20 36 5 0.881 0.0500 0.788 0.985
## 45 33 3 0.808 0.0612 0.696 0.937
## 60 31 3 0.734 0.0688 0.611 0.882
## 80 23 6 0.587 0.0768 0.454 0.759
## 100 17 1 0.562 0.0776 0.429 0.736
## 110 17 0 0.562 0.0776 0.429 0.736
## 120 16 1 0.529 0.0798 0.393 0.711
## 130 14 0 0.529 0.0798 0.393 0.711
## 140 14 0 0.529 0.0798 0.393 0.711
## 150 13 1 0.488 0.0834 0.349 0.682
If someone were to ask me about the third node, would the following statements be valid?:
For any new patient that walks into this hospital with <= 1.8 in Serum_Creatine & <= 25 in Ejection Fraction, their probability of survival is 53% after 140 days.
What about:
The survival distributions for the samples analyzed, and no other future incoming samples, are visualized above.
I want to make sure these statements are correct.
I would also like to know if logistic regression could be used to predict the binary variable DEATH_EVENT? Since the TIME variable contributes to how much weight one patient's death at 20 days has over another patient's death at 175 days, I understand that this needs to be accounted for.
If logistic regression can be used, does that imply anything over keeping/removing variable TIME?

Here are some thoughts:
Logistic regression is not appropriate in your case, because it is not the correct method for time-to-event analysis.
If the clinical outcome observed is “either-or,” such as if a patient suffers an MI or not, logistic regression can be used.
However, if the information on the time to MI is the observed outcome, data are analyzed using statistical methods for survival analysis.
Text from here
If you want to use a regression model in survival analysis, you should use a Cox proportional hazards model. To understand the difference between a Kaplan-Meier analysis and a Cox proportional hazards model, you need to understand both of them.
The next step would be to understand how a univariable Cox proportional hazards model differs from a multivariable one.
In the end you should understand all three methods (Kaplan-Meier, univariable Cox and multivariable Cox); then you can answer your own question of whether this is a valid statement:
For any new patient that walks into this hospital with <= 1.8 in Serum_Creatine & <= 25 in Ejection Fraction, their probability of survival is 53% after 140 days.
There is nothing wrong with stating the result for a subgroup from a Kaplan-Meier analysis. But the statement carries a different weight if it comes from a multivariable Cox regression analysis.
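To make the Cox suggestion concrete, here is a minimal sketch of fitting a multivariable Cox model and reading off a predicted survival probability for one covariate pattern. It is only an illustration: the built-in lung data from the survival package stand in for the heart-failure data, and the covariates (age, sex) and the "new patient" are assumptions, not anything from the question.

```r
library(survival)  # recommended package, ships with standard R installations

# Assumption: the built-in `lung` data stand in for the heart-failure data.
# status is coded 1 = censored, 2 = dead, which Surv() understands.
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Hypothetical new patient: 60 years old, sex coded 1 (male) as in `lung`.
new_patient <- data.frame(age = 60, sex = 1)

# Predicted survival curve for that covariate pattern.
pred_curve <- survfit(cox_fit, newdata = new_patient)

# Estimated probability of surviving beyond 140 days, with its 95% CI.
at_140 <- summary(pred_curve, times = 140)
surv_140 <- c(estimate = at_140$surv,
              lower    = at_140$lower,
              upper    = at_140$upper)
print(round(surv_140, 3))
```

Reporting the interval alongside the point estimate is probably the safer way to phrase a statement about a future patient, since a single percentage hides how uncertain the estimate is.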

Related

Why do I keep getting the "The number of variables in newx must be 8" error when I'm trying to predict on the test set in R?

I'm trying to build a logistic regression with a dataset containing 9 variables and 3000 observations. This is what it looks like with the head() function:
Sex Length Diameter Height WholeWeight ShuckedWeight VisceraWeight ShellWeight Age
1516 3 0.655 0.510 0.215 1.7835 0.8885 0.4095 0.4195 1
529 1 0.570 0.450 0.160 0.9715 0.3965 0.2550 0.2600 1
1244 2 0.385 0.280 0.085 0.2175 0.0970 0.0380 0.0670 2
1880 2 0.545 0.430 0.140 0.6870 0.2615 0.1405 0.2500 2
1311 2 0.545 0.405 0.135 0.5945 0.2700 0.1185 0.1850 2
1759 3 0.735 0.590 0.215 1.7470 0.7275 0.4030 0.5570 1
My membership class is Age and I want to build a lasso model with it, which I have done. The problem is that the predict() function returns this error and I have no idea what to do about it: "The number of variables in newx must be 8" .
The code I have used is below:
myabalone<-read.table(file.choose(), header = T, stringsAsFactors = T)
myabalone$Sex<-as.numeric(myabalone$Sex)
myabalone$Age<-as.numeric(myabalone$Age)
set.seed(69)
mysampleabalone<-myabalone[sample(nrow(myabalone)),]
train<-mysampleabalone[1:floor(nrow(mysampleabalone)*0.7),]
test<-mysampleabalone[(floor(nrow(mysampleabalone)*0.7)+1):nrow(mysampleabalone),]
set.seed(300)
x<-model.matrix(Age~., train)[,-1]
y<-ifelse(train$Age=="1", "1", "0")
cv.lasso<-cv.glmnet(x,y, alpha=1, family="binomial")
model2<-glmnet(x,y, family="binomial", alpha=1, lambda=cv.lasso$lambda.1se)
set.seed(123)
predicted<-predict(model2, test, type = "response")
This is where the "The number of variables in newx must be 8" error occurs.
Why should there be 8 variables and not 9, if the training data I used to build the model also has 9 variables? Other posts suggested passing the test set through as.data.frame(), because there might be some issue with the column names, but I tried that and nothing changed. Plus, when I call head() on it, it returns exactly the same column names as the training set used for building the model.
Does anybody have any idea how I can fix this?
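A likely explanation, sketched below with made-up data: the matrix x passed to glmnet comes from model.matrix(Age ~ ., train)[, -1], which holds only the 8 predictors (Age is the response and the intercept column is dropped), whereas test is still a 9-column data frame. The toy data frame and the names toy, train_toy, test_toy, x_train and x_test are hypothetical; only the column layout mirrors the question.

```r
# Hypothetical toy data with the same shape as the abalone table:
# 8 predictors plus the response Age (9 columns in total).
set.seed(1)
n <- 20
toy <- data.frame(
  Sex = sample(1:3, n, replace = TRUE),
  Length = runif(n), Diameter = runif(n), Height = runif(n),
  WholeWeight = runif(n), ShuckedWeight = runif(n),
  VisceraWeight = runif(n), ShellWeight = runif(n),
  Age = sample(1:2, n, replace = TRUE)
)
train_toy <- toy[1:14, ]
test_toy  <- toy[15:20, ]

# The training matrix drops the intercept column and never contains Age,
# so it has 8 columns, not 9.
x_train <- model.matrix(Age ~ ., train_toy)[, -1]

# Build the test matrix in exactly the same way.
x_test <- model.matrix(Age ~ ., test_toy)[, -1]

print(c(data_frame_cols = ncol(test_toy), matrix_cols = ncol(x_train)))
print(identical(colnames(x_train), colnames(x_test)))

# With glmnet loaded, the prediction call from the question would then be:
# predicted <- predict(model2, newx = x_test, type = "response")
```

In other words, predict() on a glmnet fit expects newx to be a numeric matrix with the same columns as the x used for fitting, so the test set has to go through the same model.matrix() step rather than being passed as a data frame.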

GlmmTMB model and emmeans

I am new to glmmTMB models and have run into a problem.
I built a model, and based on AICtab and DHARMa this was the best one:
Insecticide_2<- glmmTMB(Insect_abundace~field_element+land_distance+sampling_time+year+treatment_day+(1|field_id),
data=Insect_002,
family= nbinom2)
After fitting the glmmTMB model I ran Anova() (from car) and then emmeans(), but the p-values of the contrasts from emmeans are identical in both years (lower.CL and upper.CL of the means are not). What might be the problem? Is the model overfitted? Am I using emmeans incorrectly?
Anova also showed that land_distance, sampling_time and treatment_day were significant; year was almost significant (p = 0.07).
comp_emmeans1<-emmeans(Insect_002, pairwise ~ land_distance|year , type = "response")
> comp_emmeans1
$emmeans
Year = 2018:
land_distance response SE df lower.CL upper.CL
30m 2.46 0.492 474 1.658 3.64
50m 1.84 0.369 474 1.241 2.73
80m 1.36 0.283 474 0.906 2.05
110m 1.25 0.259 474 0.836 1.88
Year = 2019:
land_distance response SE df lower.CL upper.CL
30m 3.42 0.593 474 2.434 4.81
50m 2.56 0.461 474 1.799 3.65
80m 1.90 0.335 474 1.343 2.68
110m 1.75 0.317 474 1.222 2.49
Results are averaged over the levels of: field_element, sampling_time, treatment_day
Confidence level used: 0.95
Intervals are back-transformed from the log scale
$contrasts
year = 2018:
contrast ratio SE df null t.ratio p.value
30m / 50m 1.34 0.203 474 1 1.906 0.2268
30m / 80m 1.80 0.279 474 1 3.798 0.0009
30m / 110m 1.96 0.311 474 1 4.239 0.0002
50m / 80m 1.35 0.213 474 1 1.896 0.2311
50m / 110m 1.47 0.234 474 1 2.405 0.0776
80m / 110m 1.09 0.176 474 1 0.516 0.9552
year = 2019:
contrast ratio SE df null t.ratio p.value
30m / 50m 1.34 0.203 474 1 1.906 0.2268
30m / 80m 1.80 0.279 474 1 3.798 0.0009
30m / 110m 1.96 0.311 474 1 4.239 0.0002
50m / 80m 1.35 0.213 474 1 1.896 0.2311
50m / 110m 1.47 0.234 474 1 2.405 0.0776
80m / 110m 1.09 0.176 474 1 0.516 0.9552
Results are averaged over the levels of: field_element, sampling_time, treatment_day
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log scale
Should I use a different comparison method? I saw that some people use poly ~; I tried that, and the picture is the same. Also, am I comparing the right things?
A last, also important, question: how should I report the glmmTMB, Anova and emmeans results?
I don't recall seeing this question before, but it's been 8 months, and maybe I just forgot.
Anyway, I am not sure exactly what the question is, but there are three things going on that might possibly have caused some confusion:
1. The emmeans() call has the specification pairwise ~ land_distance|year, which causes it to compute both means and pairwise comparisons thereof. I think users are almost always better served by separating those steps, because estimating means and estimating contrasts are two different things.
2. The default way in which means are summarized (estimates, SEs, and confidence intervals) is different from the default for comparisons or other contrasts (estimates, SEs, t ratios, and adjusted P values). That's because, as I said before, these are two different things, and usually people want CIs for means and P values for contrasts. See below.
3. There is a log link in this model, and that has special properties when it comes to contrasts, because the difference on a log scale is the log of the ratio. So we display a ratio when we have type = "response". (With most other link functions, there is no way to back-transform the differences of transformed values.)
What I suggest, per (1), is to get the means (and not comparisons) first:
EMM <- emmeans(Insect_002, ~ land_distance|year , type = "response")
EMM # see the estimates
You can get pairwise comparisons next:
CON <- pairs(EMM) # or contrast(EMM, "pairwise")
CON # see the ratios as shown in the OP
confint(CON) # see confidence intervals instead of tests
confint(CON, type = "link") # See the pairwise differences on the log scale
If you actually want differences on the response scale rather than ratios, that's possible too:
pairs(regrid(EMM)) # tests
confint(pairs(regrid(EMM))) # CIs
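As for why the contrasts are identical in 2018 and 2019: one plausible reason is that the model formula contains no land_distance:year interaction, so on the log scale the distance effects are the same in every year and the ratios (and their tests) repeat. The sketch below illustrates this with a simulated Poisson glm in base R. That is an assumption made to keep the example self-contained, not the negative binomial glmmTMB model from the question, and all names in it (d, dist, fit, grid) are hypothetical.

```r
# Simulated counts from an additive log-link model (no dist:year interaction).
set.seed(1)
d <- expand.grid(dist = factor(c("30m", "50m", "80m")),
                 year = factor(c(2018, 2019)),
                 rep  = 1:20)
d$y <- rpois(nrow(d),
             lambda = exp(1 + 0.3 * (d$dist == "30m") + 0.4 * (d$year == "2019")))

# Stand-in for the real model: Poisson glm with a log link, additive terms only.
fit <- glm(y ~ dist + year, family = poisson, data = d)

# Predicted means on the response scale for every dist/year combination.
grid <- expand.grid(dist = levels(d$dist), year = levels(d$year),
                    stringsAsFactors = FALSE)
grid$pred <- predict(fit, newdata = grid, type = "response")

# Ratio 30m / 50m within each year.
ratio_2018 <- with(grid, pred[dist == "30m" & year == "2018"] /
                         pred[dist == "50m" & year == "2018"])
ratio_2019 <- with(grid, pred[dist == "30m" & year == "2019"] /
                         pred[dist == "50m" & year == "2019"])

# The means differ between years, but the ratios agree.
print(c(ratio_2018 = ratio_2018, ratio_2019 = ratio_2019))
print(isTRUE(all.equal(ratio_2018, ratio_2019)))
```

If the distance effect is expected to differ between years, the model would need an interaction term for the by-year contrasts to be able to differ.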

Obtain importance of individual trees in a RandomForest

Question: Is there a way to extract the variable importance for each individual CART model from a randomForest object?
rf_mod$forest doesn't seem to have this information, and the docs don't mention it.
In R's randomForest package, the average variable importance for the entire forest of CART models is given by importance(rf_mod).
library(randomForest)
df <- mtcars
set.seed(1)
rf_mod = randomForest(mpg ~ .,
data = df,
importance = TRUE,
ntree = 200)
importance(rf_mod)
%IncMSE IncNodePurity
cyl 6.0927875 111.65028
disp 8.7730959 261.06991
hp 7.8329831 212.74916
drat 2.9529334 79.01387
wt 7.9015687 246.32633
qsec 0.7741212 26.30662
vs 1.6908975 31.95701
am 2.5298261 13.33669
gear 1.5512788 17.77610
carb 3.2346351 35.69909
We can also extract individual tree structure with getTree. Here's the first tree.
head(getTree(rf_mod, k = 1, labelVar = TRUE))
left daughter right daughter split var split point status prediction
1 2 3 wt 2.15 -3 18.91875
2 0 0 <NA> 0.00 -1 31.56667
3 4 5 wt 3.16 -3 17.61034
4 6 7 drat 3.66 -3 21.26667
5 8 9 carb 3.50 -3 15.96500
6 0 0 <NA> 0.00 -1 19.70000
One workaround is to grow many CARTs (i.e. - ntree = 1), get the variable importance of each tree, and average the resulting %IncMSE:
# number of trees to grow
nn <- 200
# function to run nn CART models
run_rf <- function(rand_seed){
set.seed(rand_seed)
one_tr = randomForest(mpg ~ .,
data = df,
importance = TRUE,
ntree = 1)
return(one_tr)
}
# run the nn single-tree models and collect them in a list
l <- lapply(1:nn, run_rf)
The extraction, averaging, and comparison step.
# extract importance of each CART model
library(dplyr); library(purrr)
map(l, importance) %>%
map(as.data.frame) %>%
map( ~ { .$var = rownames(.); rownames(.) <- NULL; return(.) } ) %>%
bind_rows() %>%
group_by(var) %>%
summarise(`%IncMSE` = mean(`%IncMSE`)) %>%
arrange(-`%IncMSE`)
# A tibble: 10 x 2
var `%IncMSE`
<chr> <dbl>
1 wt 8.52
2 cyl 7.75
3 disp 7.74
4 hp 5.53
5 drat 1.65
6 carb 1.52
7 vs 0.938
8 qsec 0.824
9 gear 0.495
10 am 0.355
# compare to the RF model above
importance(rf_mod)
%IncMSE IncNodePurity
cyl 6.0927875 111.65028
disp 8.7730959 261.06991
hp 7.8329831 212.74916
drat 2.9529334 79.01387
wt 7.9015687 246.32633
qsec 0.7741212 26.30662
vs 1.6908975 31.95701
am 2.5298261 13.33669
gear 1.5512788 17.77610
carb 3.2346351 35.69909
I'd like to be able to extract the variable importance of each tree directly from a randomForest object, without this roundabout method that involves completely re-running the RF in order to facilitate reproducible cumulative variable importance plots like this one, and the one below shown for mtcars. Minimal example here.
I'm aware that a single tree's variable importance is not statistically meaningful, and it's not my intention to interpret trees in isolation. I want them for the purpose of visualization and communicating that as trees increase in a forest, the variable importance measures jump around before stabilizing.
When training a randomForest model, the importance scores are computed for the entire forest and stored directly inside the object. Tree-specific scores are not kept and so cannot be directly retrieved from a randomForest object.
Unfortunately, you are correct about having to incrementally construct a forest. The good news is that a randomForest object is self-contained, and you don't need to implement your own run_rf. Instead, you can use stats::update to re-fit the random forest model with a single tree and randomForest::grow to add additional trees one at a time:
## Starting with a random forest having a single tree,
## grow it 9 times, one tree at a time
rfs <- purrr::accumulate( .init = update(rf_mod, ntree=1),
rep(1,9), randomForest::grow )
## Retrieve the importance scores from each random forest
imp <- purrr::map( rfs, ~importance(.x)[,"%IncMSE"] )
## Combine all results into a single data frame
dplyr::bind_rows( !!!imp )
# # A tibble: 10 x 10
# cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 18.8 8.63 1.05 0 1.17 0 0 0 0.194
# 2 0 10.0 46.4 0.561 0 -0.299 0 0 0.543 2.05
# 3 0 22.4 31.2 0.955 0 -0.199 0 0 0.362 5.1
# 4 1.55 24.1 23.4 0.717 0 -0.150 0 0 0.272 5.28
# 5 1.24 22.8 23.6 0.573 0 -0.178 0 0 -0.0259 4.98
# 6 1.03 26.2 22.3 0.478 1.25 0.775 0 0 -0.0216 4.1
# 7 0.887 22.5 22.5 0.406 1.79 -0.101 0 0 -0.0185 3.56
# 8 0.776 19.7 21.3 0.944 1.70 0.105 0 0.0225 -0.0162 3.11
# 9 0.690 18.4 19.1 0.839 1.51 1.24 1.01 0.02 -0.0144 2.77
# 10 0.621 18.4 21.2 0.937 1.32 1.11 0.910 0.0725 -0.114 2.49
The data frame shows how feature importance changes with each additional tree. This is the right panel of your plot example. The trees themselves (for the left panel) can be retrieved from the final forest, which is given by dplyr::last( rfs ).
Disclaimer: This is not really an answer, but too long to post as a comment. Will remove if deemed not appropriate.
While I (think I) understand your question, to be honest I am unsure whether your question makes sense from a statistics/ML point-of-view. The following is based on my obviously limited understanding of RF and CART. Perhaps my comment-post will lead to some insights.
Let's start with some general random forest (RF) theory on variable importance from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, p. 593 (bold-face mine):
At each split in each tree, the improvement in the split-criterion is the
importance measure attributed to the splitting variable, and is accumulated
over all the trees in the forest separately for each variable. [...]
Random forests also use the oob samples to construct a different variable-importance measure, apparently to measure the prediction strength of each variable.
So the variable importance measure in RF is defined as a measure accumulated over all trees.
In traditional single classification trees (CARTs), variable importance is characterised through the Gini index that measures node impurity (see e.g. How to measure/rank “variable importance” when using CART? (specifically using {rpart} from R) and Carolin Strobl's PhD thesis)
More complex measures to characterise variable importance in CART-like models exist; for example in rpart:
An overall measure of variable importance is the sum of the goodness of split
measures for each split for which it was the primary variable, plus goodness * (adjusted
agreement) for all splits in which it was a surrogate. In the printout these are scaled to sum
to 100 and the rounded values are shown, omitting any variable whose proportion is less
than 1%.
So the bottom line here is the following: At the very least it won't be easy (and in the worst case it won't make sense) to compare variable importance measures from single classification trees with variable importance measures applied to ensemble-based methods like RF.
Which leads me to ask: Why do you want to extract variable importance measures for individual trees from an RF model? Even if you came up with a method to calculate variable importances from individual trees, I believe they wouldn't be very meaningful, and they wouldn't have to "converge" to the ensemble-accumulated values.
We can simplify it by
library(tidyverse)
out <- map(seq_len(nn), ~
run_rf(.x) %>%
importance) %>%
reduce(`+`) %>%
magrittr::divide_by(nn)

Kaplan Meier survival plot

Good morning,
I am having trouble understanding some of my outputs for my Kaplan Meier analyses.
I have managed to produce the following plots and outputs using ggsurvplot and survfit.
I first made a plot of survival over time for 55 nests and then did the same for the top predictors of nest failure, one being microtopography, as seen in this example.
Call: npsurv(formula = (S) ~ 1, data = nestdata, conf.type = "log-log")
26 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
55 45 0 13 29 2 NA
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
Microtopography=0 14 13 0 1 NA NA NA
Microtopography=1 26 21 0 7 NA 29 NA
Microtopography=2 12 8 0 5 3 2 NA
So, I have two primary questions.
1. The survival curves are for a ground-nesting bird with an egg incubation time of 21-23 days. Incubation time is the number of days the hen sits on the eggs before they hatch. Knowing that, how is it possible that the median survival time in plot #1 is 29 days? It seems to fit with the literature I have read on this same species. I assume it has something to do with the left censoring in my models, but I am honestly at a loss. If anyone has any insight, or even any literature that could help me understand this concept, I would really appreciate it.
2. I am also wondering how I can compare median survival times for the 2nd plot. Because the survival curves for microtopography 0 and 1 never cross the 0.5 point, the median survival times returned are NA. I understand I can choose another level, such as 0.75, but in this example that still wouldn't help me, because microtopography 0 never drops below 0.9 or so. How would one go about reporting these data? Would the workaround be to report survival at chosen time points, using:
summary(s,times=c(7,14,21,29))
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata,
conf.type =
"log-log")
29 observations deleted due to missingness
Microtopography=0
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 3 0 0 1.000 0.0000 1.000 1.000
14 7 0 0 1.000 0.0000 1.000 1.000
21 13 0 0 1.000 0.0000 1.000 1.000
29 8 1 5 0.909 0.0867 0.508 0.987
Microtopography=1
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 9 0 0 1.000 0.0000 1.000 1.000
14 17 1 0 0.933 0.0644 0.613 0.990
21 21 3 0 0.798 0.0909 0.545 0.919
29 15 3 7 0.655 0.1060 0.409 0.819
Microtopography=2
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 1 2 0 0.333 0.272 0.00896 0.774
14 7 1 0 0.267 0.226 0.00968 0.686
21 8 1 0 0.233 0.200 0.00990 0.632
29 3 1 5 0.156 0.148 0.00636 0.504
Late to the party...
The median survival time of 29 days is the median time that eggs of this species are expected to stay in the egg before they hatch, based on your data. Your figure of 21-23 days (based on ?) probably comes from many experiments/studies of eggs that have hatched, ignoring those that had not hatched yet (those that failed?).
From your overall survival curve, it is clear that some eggs have not yet hatched, even after more than 35 days. These are taken into account when calculating the expected survival times. If you think that these eggs will fail, then omit them. Otherwise, the software cannot possibly know that they will eventually fail. But how can anyone know for sure if an egg is going to fail, even after 30 days? Is there a known maximum hatching time? The record-breaker of all hatched eggs?
These are not really R questions, so this question might be more appropriate for the statistics site. But the following might help.
how is it possible that the median survival time in plot #1 is 29 days?
The median survival is where the survival curve passes the 50% mark. Eyeballing it, 29 days looks right.
I am also wondering how I can compare median survival times for the 2nd plot. Because the survival curves for microtopography 0 and 1 never cross the 0.5 point.
Given your data, you cannot compare the median. You can compare the 75% or 90%, if you must. You can compare the point survival at, say, 30 days. You can compare the truncated average survival in the first 30 days.
In order to compare the median, you would have to make an assumption. A reasonable assumption would be an exponential decay after some tenure point that includes at least one failure.
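The two simplest alternatives named above, point survival at a fixed time and a higher survival level than the median, can be sketched as follows. This is only an illustration: the built-in aml data from the survival package (two groups in column x) stand in for the nest data, and day 30 and the 0.75 level are arbitrary choices.

```r
library(survival)

# Assumption: the built-in `aml` data (two groups in column x) stand in
# for the nest data grouped by Microtopography.
fit <- survfit(Surv(time, status) ~ x, data = aml, conf.type = "log-log")

# Point survival (with 95% CI) in each group at a fixed time, here day 30.
at_30 <- summary(fit, times = 30)
point_surv <- data.frame(group = at_30$strata,
                         surv  = at_30$surv,
                         lower = at_30$lower,
                         upper = at_30$upper)
print(point_surv)

# Time by which 25% of each group has failed (survival drops to 0.75).
# An entry is NA for any group whose curve never falls that far.
q25 <- quantile(fit, probs = 0.25)$quantile
print(q25)
```

A group whose curve stays above the chosen level still returns NA from quantile(), so for a group like microtopography 0 the fixed-time survival with its interval is probably the more reportable number.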

survfit() Shade 95% confidence interval survival plot

I'm not sure... this can't be that difficult, I think, but I can't work it out. If you run:
library(survival)
leukemia.surv <- survfit(Surv(time, status) ~ 1, data = aml)
plot(leukemia.surv, lty = 2:3)
you see the survival curve and its 95% confidence interval. Instead of showing two lines for the upper and lower 95% CI, I'd like to shade the area between the upper and lower 95% boundaries.
Does this have to be done by something like polygon()? All coordinates can be found in the summary...
> summary(leukemia.surv)
Call: survfit(formula = Surv(time, status) ~ 1, data = aml)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 23 2 0.9130 0.0588 0.8049 1.000
8 21 2 0.8261 0.0790 0.6848 0.996
9 19 1 0.7826 0.0860 0.6310 0.971
12 18 1 0.7391 0.0916 0.5798 0.942
13 17 1 0.6957 0.0959 0.5309 0.912
18 14 1 0.6460 0.1011 0.4753 0.878
23 13 2 0.5466 0.1073 0.3721 0.803
27 11 1 0.4969 0.1084 0.3240 0.762
30 9 1 0.4417 0.1095 0.2717 0.718
31 8 1 0.3865 0.1089 0.2225 0.671
33 7 1 0.3313 0.1064 0.1765 0.622
34 6 1 0.2761 0.1020 0.1338 0.569
43 5 1 0.2208 0.0954 0.0947 0.515
45 4 1 0.1656 0.0860 0.0598 0.458
48 2 1 0.0828 0.0727 0.0148 0.462
Is there an existing function to shade the 95% CI area?
You can use data from the summary() to make your own plot with the confidence interval as polygon.
First, save the summary() as an object. Data for plotting are located in variables time, surv, upper and lower.
mod<-summary(leukemia.surv)
Now you can use the function plot() to define the plotting region. Then draw the confidence interval with polygon(). Here you have to provide the x values followed by the x values in reverse order, and for the y values the lower limits followed by the reversed upper limits. With the function lines(), add the survival line. By adding the argument type="s" to lines() you will get the line as steps.
with(mod,plot(time,surv,type="n",xlim=c(5,50),ylim=c(0,1)))
with(mod,polygon(c(time,rev(time)),c(lower,rev(upper)),
col = "grey75", border = FALSE))
with(mod,lines(time,surv,type="s"))
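The polygon above joins the confidence limits with straight sloping edges, while the survival line itself is a step function. If a step-shaped band is preferred, one possible variation is sketched below. The helper shade_km() is a hypothetical name, not part of the survival package, and it assumes a survfit object with a single curve (no strata).

```r
library(survival)

# Hypothetical helper: draws a step-shaped confidence band for a
# single-curve survfit object and returns the band coordinates invisibly.
shade_km <- function(fit, col = "grey75") {
  # Start at time 0 with survival 1, then drop points with missing limits.
  t  <- c(0, fit$time)
  s  <- c(1, fit$surv)
  lo <- c(1, fit$lower)
  up <- c(1, fit$upper)
  ok <- !is.na(lo) & !is.na(up)
  t <- t[ok]; s <- s[ok]; lo <- lo[ok]; up <- up[ok]

  # Step coordinates: hold each value until the next time point.
  step_x <- c(t[1], rep(t[-1], each = 2))
  step_y <- function(y) c(rep(y[-length(y)], each = 2), y[length(y)])

  plot(t, s, type = "n", ylim = c(0, 1), xlab = "Time", ylab = "Survival")
  polygon(c(step_x, rev(step_x)),
          c(step_y(lo), rev(step_y(up))),
          col = col, border = NA)
  lines(t, s, type = "s")

  invisible(data.frame(x = step_x, lower = step_y(lo), upper = step_y(up)))
}
```

Calling shade_km(leukemia.surv) with the object from the question draws the shaded band with the stepped survival line on top; unlike the summary()-based version, it uses every time point stored in the fit, including censoring times.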
I've developed a function to plot shaded confidence intervals in survival curves. You can find it here: Plotting survival curves in R with ggplot2
Maybe you can find it useful.
