I'm new to R and survival analysis, and I am interested to export into a dataframe the results from survfit where there is strata.
This site has provided an excellent solution but not to one with strata (https://stat.ethz.ch/pipermail/r-help/2014-October/422348.html). How can i append (or stack) each strata with an extra column which contains the strata type.
solution in the link offered is not applicable to strata groupings
library(survival)
data(lung)
mod <- with(lung, survfit(Surv(time, status)~ 1))
res <- summary(mod)
str(res)
# Extract the columns you want
cols <- lapply(c(2:6, 8:10) , function(x) res[x])
# Combine the columns into a data frame
tbl <- do.call(data.frame, cols)
str(tbl)
Thank you in advanced,
R newbie
It is basically the same as you have there, just an extra column
res <- summary( survfit( Surv(futime, fustat)~rx, data=ovarian))
cols <- lapply(c(2:6, 8:11) , function(x) res[x])
tbl <- do.call(data.frame, cols)
head(tbl)
# time n.risk n.event n.censor surv strata std.err upper lower
# 1 59 13 1 0 0.9230769 rx=1 0.0739053 1.0000000 0.7890186
# 2 115 12 1 0 0.8461538 rx=1 0.1000683 1.0000000 0.6710952
# 3 156 11 1 0 0.7692308 rx=1 0.1168545 1.0000000 0.5711496
# 4 268 10 1 0 0.6923077 rx=1 0.1280077 0.9946869 0.4818501
# 5 329 9 1 0 0.6153846 rx=1 0.1349320 0.9457687 0.4004132
# 6 431 8 1 0 0.5384615 rx=1 0.1382642 0.8906828 0.3255265
Another option is to use the ggfortify library.
library(survival)
library(ggfortify)
# fit a survival model
mod <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)
# extract results to a data.frame
res <- fortify(mod)
str(res)
'data.frame': 26 obs. of 9 variables:
$ time : int 59 115 156 268 329 431 448 477 638 803 ...
$ n.risk : num 13 12 11 10 9 8 7 6 5 4 ...
$ n.event : num 1 1 1 1 1 1 0 0 1 0 ...
$ n.censor: num 0 0 0 0 0 0 1 1 0 1 ...
$ surv : num 0.923 0.846 0.769 0.692 0.615 ...
$ std.err : num 0.0801 0.1183 0.1519 0.1849 0.2193 ...
$ upper : num 1 1 1 0.995 0.946 ...
$ lower : num 0.789 0.671 0.571 0.482 0.4 ...
$ strata : Factor w/ 2 levels "rx=1","rx=2": 1 1 1 1 1 1 1 1 1 1 ...
The advantage of this method is that you get the complete data (i.e 26 observations instead of 12) and you can plot your survival curves with ggplot.
library(ggplot2)
ggplot(data = res, aes(x = time, y = surv, color = strata)) +
geom_line() +
# plot censor marks
geom_point(aes(shape = factor(ifelse(n.censor >= 1, 1, NA)))) +
# format censor shape as "+"
scale_shape_manual(values = 3) +
# hide censor legend
guides(shape = "none")
The easiest way I see is to use the tidy() function from the broom package.
library(survival)
library(dplyr)
#>
#> Attache Paket: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(broom)
survfit( Surv(futime, fustat)~rx, data=ovarian) %>%
tidy()
#> # A tibble: 26 x 9
#> time n.risk n.event n.censor estimate std.error conf.high conf.low strata
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 59 13 1 0 0.923 0.0801 1 0.789 rx=1
#> 2 115 12 1 0 0.846 0.118 1 0.671 rx=1
#> 3 156 11 1 0 0.769 0.152 1 0.571 rx=1
#> 4 268 10 1 0 0.692 0.185 0.995 0.482 rx=1
#> 5 329 9 1 0 0.615 0.219 0.946 0.400 rx=1
#> 6 431 8 1 0 0.538 0.257 0.891 0.326 rx=1
#> 7 448 7 0 1 0.538 0.257 0.891 0.326 rx=1
#> 8 477 6 0 1 0.538 0.257 0.891 0.326 rx=1
#> 9 638 5 1 0 0.431 0.340 0.840 0.221 rx=1
#> 10 803 4 0 1 0.431 0.340 0.840 0.221 rx=1
#> # … with 16 more rows
Created on 2021-08-04 by the reprex package (v0.3.0)
Related
I have a dataset let say it df which has 100 columns of different type of column vectors (class), i.e character, integer,double etc.
How I can use dplyr or any other tidyverse like functions in order to change the class of columns? I want to mutate_if across all columns that are integer to double.
And then keep only the as.double columns and drop the character columns?
Is there a way in R ?
Any help ?
In both cases you can use tidy select calls to choose the correct columns:
library(dplyr)
df <- tibble(
a = runif(100),
b = sample(1:1000, 100),
c = sample(letters, 100, replace = TRUE),
d = rnorm(100),
e = 101:200
)
df |>
mutate(across(where(is.integer), as.double)) |>
select(where(is.double))
#> # A tibble: 100 × 4
#> a b d e
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.196 468 -1.35 101
#> 2 0.373 865 0.123 102
#> 3 0.0250 534 0.131 103
#> 4 0.622 388 0.426 104
#> 5 0.354 670 0.625 105
#> 6 0.806 474 -1.15 106
#> 7 0.282 318 -1.27 107
#> 8 0.813 331 1.05 108
#> 9 0.360 165 -0.765 109
#> 10 0.0929 645 -0.0232 110
#> # … with 90 more rows
I'm doing some analysis with the BaseballR package and want to be able to combine dataframes by using a loop.
For example, the following code using the standings_on_date_bref function gives me a table of division standings for the specified day (plus manually adding a column for the date of those standings):
library("baseballr")
library("dplyr")
standings_on_date_bref(date = "04-28-2021", division = "NL West") %>%
mutate(date = "04-28-2021")
Tm
W-L%
date
SFG
0.640
04-28-2021
LAD
0.640
04-28-2021
SDP
0.538
04-28-2021
ARI
0.500
04-28-2021
COL
0.375
04-28-2021
However, I'm interested in getting the standings for a whole range of days (which would end up being a dataframe with rows = 5 teams * x number of days) for example for 04-28-2021 to 04-29-2021, I'm hoping it would look something like this:
Tm
W-L%
date
SFG
0.640
04-28-2021
LAD
0.640
04-28-2021
SDP
0.538
04-28-2021
ARI
0.500
04-28-2021
COL
0.375
04-28-2021
SFG
0.640
04-29-2021
LAD
0.615
04-29-2021
SDP
0.538
04-29-2021
ARI
0.520
04-29-2021
COL
0.360
04-29-2021
I have tried to do so by implementing some sort of loop. This is what I've come up with so far, but in the end it just gives me the standings for the end date.
start <- as.Date("04-01-21",format="%m-%d-%y")
end <- as.Date("04-03-21",format="%m-%d-%y")
theDate <- start
while (theDate <= end)
{
all_standings <- standings_on_date_bref(date = theDate, division = "NL West") %>%
mutate(date = theDate)
theDate <- theDate + 1
}
You can try purrr which would do it quite nicely with map_dfr function
library(baseballr)
library(dplyr)
library(purrr)
date_seq <- seq(as.Date("04-01-21",format="%m-%d-%y"),
as.Date("04-03-21",format="%m-%d-%y"), by = "1 day")
map_dfr(.x = date_seq,
.f = function(x) {
standings_on_date_bref(date = x, division = "NL West") %>%
mutate(date = x)
})
#> # A tibble: 15 x 9
#> Tm W L `W-L%` GB RS RA `pythW-L%` date
#> <chr> <int> <int> <dbl> <chr> <int> <int> <dbl> <date>
#> 1 SDP 1 0 1 -- 8 7 0.561 2021-04-01
#> 2 COL 1 0 1 -- 8 5 0.703 2021-04-01
#> 3 ARI 0 1 0 1.0 7 8 0.439 2021-04-01
#> 4 SFG 0 1 0 1.0 7 8 0.439 2021-04-01
#> 5 LAD 0 1 0 1.0 5 8 0.297 2021-04-01
#> 6 SDP 2 0 1 -- 12 9 0.629 2021-04-02
#> 7 COL 1 1 0.5 1.0 14 16 0.439 2021-04-02
#> 8 SFG 1 1 0.5 1.0 13 11 0.576 2021-04-02
#> 9 LAD 1 1 0.5 1.0 16 14 0.561 2021-04-02
#> 10 ARI 0 2 0 2.0 9 12 0.371 2021-04-02
#> 11 SDP 3 0 1 -- 19 9 0.797 2021-04-03
#> 12 LAD 2 1 0.667 1.0 22 19 0.567 2021-04-03
#> 13 COL 1 2 0.333 2.0 19 22 0.433 2021-04-03
#> 14 SFG 1 2 0.333 2.0 13 15 0.435 2021-04-03
#> 15 ARI 0 3 0 3.0 9 19 0.203 2021-04-03
Created on 2022-01-02 by the reprex package (v2.0.1)
I am doing a linear regression by group and want to extract the residuals of the regression
library(dplyr)
set.seed(124)
dat <- data.frame(ID = sample(111:503, 18576, replace = T),
ID2 = sample(11:50, 18576, replace = T),
ID3 = sample(1:14, 18576, replace = T),
yearRef = sample(1998:2014, 18576, replace = T),
value = rnorm(18576))
resid <- dat %>% dplyr::group_by(ID3) %>%
do(augment(lm(value ~ yearRef, data=.))) %>% ungroup()
How do I retain the ID, ID2 as well in the resid. At the moment, it only retains the ID3 in the final data frame
Use group_split then loop through each group using map_dfr to bind ID, ID2 and augment output using bind_cols
library(dplyr)
library(purrr)
dat %>% group_split(ID3) %>%
map_dfr(~bind_cols(select(.x,ID,ID2), augment(lm(value~yearRef, data=.x))), .id = "ID3")
# A tibble: 18,576 x 12
ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma .cooksd
<chr> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 196 16 -0.385 2009 -0.0406 0.0308 -0.344 1.00e-3 0.973 6.27e-5
2 1 372 47 -0.793 2012 -0.0676 0.0414 -0.726 1.81e-3 0.973 5.05e-4
3 1 470 15 -0.496 2011 -0.0586 0.0374 -0.438 1.48e-3 0.973 1.50e-4
4 1 242 40 -1.13 2010 -0.0496 0.0338 -1.08 1.21e-3 0.973 7.54e-4
5 1 471 34 1.28 2006 -0.0135 0.0262 1.29 7.26e-4 0.972 6.39e-4
6 1 434 35 -1.09 1998 0.0586 0.0496 -1.15 2.61e-3 0.973 1.82e-3
7 1 467 45 -0.0663 2011 -0.0586 0.0374 -0.00769 1.48e-3 0.973 4.64e-8
8 1 334 27 -1.37 2003 0.0135 0.0305 -1.38 9.86e-4 0.972 9.92e-4
9 1 186 25 -0.0195 2003 0.0135 0.0305 -0.0331 9.86e-4 0.973 5.71e-7
10 1 114 34 1.09 2014 -0.0857 0.0500 1.18 2.64e-3 0.973 1.94e-3
# ... with 18,566 more rows, and 1 more variable: .std.resid <dbl>
Taking the "many models" approach, you can nest the data on ID3 and use purrr::map to create a list-column of the broom::augment data frames. The data list-column has all the original columns aside from ID3; map into that and select just the ones you want. Here I'm assuming you want to keep any column that starts with "ID", but you can change this. Then unnest both the data and the augment data frames.
library(dplyr)
library(tidyr)
dat %>%
group_by(ID3) %>%
nest() %>%
mutate(aug = purrr::map(data, ~broom::augment(lm(value ~ yearRef, data = .))),
data = purrr::map(data, select, starts_with("ID"))) %>%
unnest(c(data, aug))
#> # A tibble: 18,576 x 12
#> # Groups: ID3 [14]
#> ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma
#> <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 11 431 15 0.619 2002 0.0326 0.0346 0.586 1.21e-3 0.995
#> 2 11 500 21 -0.432 2000 0.0299 0.0424 -0.462 1.82e-3 0.995
#> 3 11 392 28 -0.246 1998 0.0273 0.0515 -0.273 2.67e-3 0.995
#> 4 11 292 40 -0.425 1998 0.0273 0.0515 -0.452 2.67e-3 0.995
#> 5 11 175 36 -0.258 1999 0.0286 0.0468 -0.287 2.22e-3 0.995
#> 6 11 419 23 3.13 2005 0.0365 0.0273 3.09 7.54e-4 0.992
#> 7 11 329 17 -0.0414 2007 0.0391 0.0274 -0.0806 7.57e-4 0.995
#> 8 11 284 23 -0.450 2006 0.0378 0.0268 -0.488 7.25e-4 0.995
#> 9 11 136 28 -0.129 2006 0.0378 0.0268 -0.167 7.25e-4 0.995
#> 10 11 118 17 -1.55 2013 0.0470 0.0470 -1.60 2.24e-3 0.995
#> # … with 18,566 more rows, and 2 more variables: .cooksd <dbl>,
#> # .std.resid <dbl>
In R, I have a dataframe that looks like this:
sample value gene tag isPTV
1 1120 3.4 arx1 1120|arx1 0
2 2123 2.3 mnf2 2123|mnf2 0
3 1129 1.9 trf4 1129|trf4 0
4 2198 0.2 brc1 2198|brc1 0
5 1120 2.1 arx1 1120|arx1 1
6 2123 0.4 mnf2 2123|mnf2 1
7 1129 1.2 trf4 1129|trf4 1
8 2198 0.9 brc1 2198|brc1 1
Such that 0 means false and 1 means true. What I'm ultimately trying to do is create a dataframe that, for each tag, finds the absolute value between the value numbers.
For instance, for 1129|trf4 occurs in two separate rows. There's a value for when it isPTV and when it is not, so the absolute value would be 1.9 - 1.2 = 0.7.
I started out by trying to write a function to do these for a given tag value, such that, for a given tag, it would return both rows containing the tag:
getExprValue <- function(dataframe, tag){
return(dataframe[tag,])
}
But this is not working, and I'm not very familiar with how you index dataframes in R.
What is the right way to do this?
UPDATE:
Solution 1 Attempt:
m_diff <- m %>% group_by(tag) %>% mutate(absDiff = abs(diff(value)))
Response:
Error in mutate_impl(.data, dots) : ColumnabsDiffmust be length 1 (the group size), not 0
Solution 2 Attempt:
with(df1, abs(ave(value, tag, FUN = diff)))
Response:
Error in x[i] <- value[[j]] : replacement has length zero
Edit: I just noticed that #akrun had a much simpler solution
Create data with a structure similar to yours:
library(tidyverse)
dat <- tibble(
sample = rep(sample(1000:3000, 10), 2),
value = rnorm(20, 5, 1),
gene = rep(letters[1:10], 2),
tag = paste(sample, gene, sep = "|"),
isPTV = rep(0:1, each = 10)
)
dat
#> # A tibble: 20 x 5
#> sample value gene tag isPTV
#> <int> <dbl> <chr> <chr> <int>
#> 1 2149 5.90 a 2149|a 0
#> 2 1027 5.46 b 1027|b 0
#> 3 1103 5.65 c 1103|c 0
#> 4 1884 4.86 d 1884|d 0
#> 5 2773 5.58 e 2773|e 0
#> 6 2948 6.98 f 2948|f 0
#> 7 2478 5.17 g 2478|g 0
#> 8 2724 6.71 h 2724|h 0
#> 9 1927 5.06 i 1927|i 0
#> 10 1081 4.39 j 1081|j 0
#> 11 2149 4.60 a 2149|a 1
#> 12 1027 2.97 b 1027|b 1
#> 13 1103 6.17 c 1103|c 1
#> 14 1884 5.83 d 1884|d 1
#> 15 2773 4.23 e 2773|e 1
#> 16 2948 6.48 f 2948|f 1
#> 17 2478 5.06 g 2478|g 1
#> 18 2724 5.32 h 2724|h 1
#> 19 1927 7.32 i 1927|i 1
#> 20 1081 4.73 j 1081|j 1
#akrun solution (much better than mine):
dat %>%
group_by(tag) %>%
mutate(absDiff = abs(diff(value)))
#> # A tibble: 20 x 6
#> # Groups: tag [10]
#> sample value gene tag isPTV absDiff
#> <int> <dbl> <chr> <chr> <int> <dbl>
#> 1 2149 5.90 a 2149|a 0 1.30
#> 2 1027 5.46 b 1027|b 0 2.49
#> 3 1103 5.65 c 1103|c 0 0.520
#> 4 1884 4.86 d 1884|d 0 0.974
#> 5 2773 5.58 e 2773|e 0 1.34
#> 6 2948 6.98 f 2948|f 0 0.502
#> 7 2478 5.17 g 2478|g 0 0.114
#> 8 2724 6.71 h 2724|h 0 1.39
#> 9 1927 5.06 i 1927|i 0 2.26
#> 10 1081 4.39 j 1081|j 0 0.337
#> 11 2149 4.60 a 2149|a 1 1.30
#> 12 1027 2.97 b 1027|b 1 2.49
#> 13 1103 6.17 c 1103|c 1 0.520
#> 14 1884 5.83 d 1884|d 1 0.974
#> 15 2773 4.23 e 2773|e 1 1.34
#> 16 2948 6.48 f 2948|f 1 0.502
#> 17 2478 5.06 g 2478|g 1 0.114
#> 18 2724 5.32 h 2724|h 1 1.39
#> 19 1927 7.32 i 1927|i 1 2.26
#> 20 1081 4.73 j 1081|j 1 0.337
My initial suggestion (unnecessarily complicated):
nested <- dat %>%
group_by(tag) %>%
nest()
nested %>%
mutate(difference = map(data, ~ abs(diff(.$value)))) %>%
select(- data) %>%
unnest()
#> # A tibble: 10 x 2
#> tag difference
#> <chr> <dbl>
#> 1 2149|a 1.30
#> 2 1027|b 2.49
#> 3 1103|c 0.520
#> 4 1884|d 0.974
#> 5 2773|e 1.34
#> 6 2948|f 0.502
#> 7 2478|g 0.114
#> 8 2724|h 1.39
#> 9 1927|i 2.26
#> 10 1081|j 0.337
For a publication in a peer-reviewed scientific journal (http://www.redjournal.org), we would like to prepare Kaplan-Meier plots. The journal has the following specific guidelines for these plots:
"If your figures include curves generated from analyses using the Kaplan-Meier method or the cumulative incidence method, the following are now requirements for the presentation of these curves:
That the number of patients at risk is indicated;
That censoring marks are included;
That curves be truncated when there are fewer than 10 patients at risk; and
An estimate of the confidence interval should be included either in the figure itself or the text.”
Here, I illustrate my problem with the veteran dataset (https://github.com/tidyverse/reprex is great!).
We can adress 1, 2 and 4 easily with the survminer package:
library(survival)
library(survminer)
#> Warning: package 'survminer' was built under R version 3.4.3
#> Loading required package: ggplot2
#> Loading required package: ggpubr
#> Warning: package 'ggpubr' was built under R version 3.4.3
#> Loading required package: magrittr
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
ggsurvplot(fit.obj,
conf.int = T,
risk.table ="absolute",
tables.theme = theme_cleantable())
I have, however, a problem with requirement 3 (truncate curves when there are fewer than 10 patients at risk). I see that all the required information is available in the survfit object:
library(survival)
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
summary(fit.obj)
#> Call: survfit(formula = Surv(time, status) ~ celltype, data = veteran)
#>
#> celltype=squamous
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 1 35 2 0.943 0.0392 0.8690 1.000
#> 8 33 1 0.914 0.0473 0.8261 1.000
#> 10 32 1 0.886 0.0538 0.7863 0.998
#> 11 31 1 0.857 0.0591 0.7487 0.981
#> 15 30 1 0.829 0.0637 0.7127 0.963
#> 25 29 1 0.800 0.0676 0.6779 0.944
#> 30 27 1 0.770 0.0713 0.6426 0.924
#> 33 26 1 0.741 0.0745 0.6083 0.902
#> 42 25 1 0.711 0.0772 0.5749 0.880
#> 44 24 1 0.681 0.0794 0.5423 0.856
#> 72 23 1 0.652 0.0813 0.5105 0.832
#> 82 22 1 0.622 0.0828 0.4793 0.808
#> 110 19 1 0.589 0.0847 0.4448 0.781
#> 111 18 1 0.557 0.0861 0.4112 0.754
#> 112 17 1 0.524 0.0870 0.3784 0.726
#> 118 16 1 0.491 0.0875 0.3464 0.697
#> 126 15 1 0.458 0.0876 0.3152 0.667
#> 144 14 1 0.426 0.0873 0.2849 0.636
#> 201 13 1 0.393 0.0865 0.2553 0.605
#> 228 12 1 0.360 0.0852 0.2265 0.573
#> 242 10 1 0.324 0.0840 0.1951 0.539
#> 283 9 1 0.288 0.0820 0.1650 0.503
#> 314 8 1 0.252 0.0793 0.1362 0.467
#> 357 7 1 0.216 0.0757 0.1088 0.429
#> 389 6 1 0.180 0.0711 0.0831 0.391
#> 411 5 1 0.144 0.0654 0.0592 0.351
#> 467 4 1 0.108 0.0581 0.0377 0.310
#> 587 3 1 0.072 0.0487 0.0192 0.271
#> 991 2 1 0.036 0.0352 0.0053 0.245
#> 999 1 1 0.000 NaN NA NA
#>
#> celltype=smallcell
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 2 48 1 0.9792 0.0206 0.93958 1.000
#> 4 47 1 0.9583 0.0288 0.90344 1.000
#> 7 46 2 0.9167 0.0399 0.84172 0.998
#> 8 44 1 0.8958 0.0441 0.81345 0.987
#> 10 43 1 0.8750 0.0477 0.78627 0.974
#> 13 42 2 0.8333 0.0538 0.73430 0.946
#> 16 40 1 0.8125 0.0563 0.70926 0.931
#> 18 39 2 0.7708 0.0607 0.66065 0.899
#> 20 37 2 0.7292 0.0641 0.61369 0.866
#> 21 35 2 0.6875 0.0669 0.56812 0.832
#> 22 33 1 0.6667 0.0680 0.54580 0.814
#> 24 32 1 0.6458 0.0690 0.52377 0.796
#> 25 31 2 0.6042 0.0706 0.48052 0.760
#> 27 29 1 0.5833 0.0712 0.45928 0.741
#> 29 28 1 0.5625 0.0716 0.43830 0.722
#> 30 27 1 0.5417 0.0719 0.41756 0.703
#> 31 26 1 0.5208 0.0721 0.39706 0.683
#> 51 25 2 0.4792 0.0721 0.35678 0.644
#> 52 23 1 0.4583 0.0719 0.33699 0.623
#> 54 22 2 0.4167 0.0712 0.29814 0.582
#> 56 20 1 0.3958 0.0706 0.27908 0.561
#> 59 19 1 0.3750 0.0699 0.26027 0.540
#> 61 18 1 0.3542 0.0690 0.24171 0.519
#> 63 17 1 0.3333 0.0680 0.22342 0.497
#> 80 16 1 0.3125 0.0669 0.20541 0.475
#> 87 15 1 0.2917 0.0656 0.18768 0.453
#> 95 14 1 0.2708 0.0641 0.17026 0.431
#> 99 12 2 0.2257 0.0609 0.13302 0.383
#> 117 9 1 0.2006 0.0591 0.11267 0.357
#> 122 8 1 0.1755 0.0567 0.09316 0.331
#> 139 6 1 0.1463 0.0543 0.07066 0.303
#> 151 5 1 0.1170 0.0507 0.05005 0.274
#> 153 4 1 0.0878 0.0457 0.03163 0.244
#> 287 3 1 0.0585 0.0387 0.01600 0.214
#> 384 2 1 0.0293 0.0283 0.00438 0.195
#> 392 1 1 0.0000 NaN NA NA
#>
#> celltype=adeno
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 3 27 1 0.9630 0.0363 0.89430 1.000
#> 7 26 1 0.9259 0.0504 0.83223 1.000
#> 8 25 2 0.8519 0.0684 0.72786 0.997
#> 12 23 1 0.8148 0.0748 0.68071 0.975
#> 18 22 1 0.7778 0.0800 0.63576 0.952
#> 19 21 1 0.7407 0.0843 0.59259 0.926
#> 24 20 1 0.7037 0.0879 0.55093 0.899
#> 31 19 1 0.6667 0.0907 0.51059 0.870
#> 35 18 1 0.6296 0.0929 0.47146 0.841
#> 36 17 1 0.5926 0.0946 0.43344 0.810
#> 45 16 1 0.5556 0.0956 0.39647 0.778
#> 48 15 1 0.5185 0.0962 0.36050 0.746
#> 51 14 1 0.4815 0.0962 0.32552 0.712
#> 52 13 1 0.4444 0.0956 0.29152 0.678
#> 73 12 1 0.4074 0.0946 0.25850 0.642
#> 80 11 1 0.3704 0.0929 0.22649 0.606
#> 84 9 1 0.3292 0.0913 0.19121 0.567
#> 90 8 1 0.2881 0.0887 0.15759 0.527
#> 92 7 1 0.2469 0.0850 0.12575 0.485
#> 95 6 1 0.2058 0.0802 0.09587 0.442
#> 117 5 1 0.1646 0.0740 0.06824 0.397
#> 132 4 1 0.1235 0.0659 0.04335 0.352
#> 140 3 1 0.0823 0.0553 0.02204 0.307
#> 162 2 1 0.0412 0.0401 0.00608 0.279
#> 186 1 1 0.0000 NaN NA NA
#>
#> celltype=large
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 12 27 1 0.9630 0.0363 0.89430 1.000
#> 15 26 1 0.9259 0.0504 0.83223 1.000
#> 19 25 1 0.8889 0.0605 0.77791 1.000
#> 43 24 1 0.8519 0.0684 0.72786 0.997
#> 49 23 1 0.8148 0.0748 0.68071 0.975
#> 52 22 1 0.7778 0.0800 0.63576 0.952
#> 53 21 1 0.7407 0.0843 0.59259 0.926
#> 100 20 1 0.7037 0.0879 0.55093 0.899
#> 103 19 1 0.6667 0.0907 0.51059 0.870
#> 105 18 1 0.6296 0.0929 0.47146 0.841
#> 111 17 1 0.5926 0.0946 0.43344 0.810
#> 133 16 1 0.5556 0.0956 0.39647 0.778
#> 143 15 1 0.5185 0.0962 0.36050 0.746
#> 156 14 1 0.4815 0.0962 0.32552 0.712
#> 162 13 1 0.4444 0.0956 0.29152 0.678
#> 164 12 1 0.4074 0.0946 0.25850 0.642
#> 177 11 1 0.3704 0.0929 0.22649 0.606
#> 200 9 1 0.3292 0.0913 0.19121 0.567
#> 216 8 1 0.2881 0.0887 0.15759 0.527
#> 231 7 1 0.2469 0.0850 0.12575 0.485
#> 250 6 1 0.2058 0.0802 0.09587 0.442
#> 260 5 1 0.1646 0.0740 0.06824 0.397
#> 278 4 1 0.1235 0.0659 0.04335 0.352
#> 340 3 1 0.0823 0.0553 0.02204 0.307
#> 378 2 1 0.0412 0.0401 0.00608 0.279
#> 553 1 1 0.0000 NaN NA NA
But I have no idea how I can manipulate this list. I would very much appreciate any advice on how to filter out all lines with n.risk < 10 from fit.obj.
I can't quite seem to get this all the way there. But I see that you can pass a data.frame rather than a fit object to the plotting function. You can do this and clip the values. For example
ss <- subset(surv_summary(fit.obj), n.risk>=10)
ggsurvplot(ss,
conf.int = T)
But it seems in this mode it does not automatically print the table. There is a function to draw just the table with
ggrisktable(fit.obj, tables.theme = theme_cleantable())
So I guess you could just combine them. Maybe i'm missing an easier way to draw the table when using a data.frame in the same plot.
As a slight variation on the above answers, if you want to truncate each group individually when less than 10 patients are at risk in that group, I found this to work and not require plotting the figure and table separately:
library(survival)
library(survminer)
# truncate each line when fewer than 10 at risk
atrisk <- 10
# KM fit
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
# subset each stratum separately
maxcutofftime = 0 # for plotting
strata <- rep(names(fit.obj$strata), fit.obj$strata)
for (i in names(fit.obj$strata)){
cutofftime <- min(fit.obj$time[fit.obj$n.risk < atrisk & strata == i])
maxcutofftime = max(maxcutofftime, cutofftime)
cutoffs <- which(fit.obj$n.risk < atrisk & strata == i)
fit.obj$lower[cutoffs] <- NA
fit.obj$upper[cutoffs] <- NA
fit.obj$surv[cutoffs] <- NA
}
# plot
ggsurvplot(fit.obj, data = veteran, risk.table = TRUE, conf.int = T, pval = F,
tables.theme = theme_cleantable(), xlim = c(0,maxcutofftime), break.x.by = 90)
edited to add: note that if we had used pval = T above, that would give the p-value for the truncated data, not the full data. It doesn't make much of a difference in this example as both are p<0.0001, but be careful :)
I'm following up on MrFlick's great answer.
I'd interpret 3) to mean that there should be at least 10 at risk total - i.e., not per group. So we have to create an ungrouped Kaplan-Meier fit first and determine the time cutoff from there.
Subset the surv_summary object w/r/t this cutoff.
Plot KM-curve and risk table separately. Crucially, function survminer::ggrisktable() (a minimal front end for ggsurvtable()) accepts options xlim and break.time.by. However, the function can currently only extend the upper time limit, not reduce it. I assume this is a bug. I created function ggsurvtable_mod() to change this.
Turn ggplot objects into grobs and use ggExtra::grid.arrange() to put both plots together. There is probably a more elegant way to do this based on options widths and heights.
Admittedly, this is a bit of a hack and needs tweaking to get the correct alignment between survival plot and risk table.
library(survival)
library(survminer)
# ungrouped KM estimate to determine cutoff
fit1_ss <- surv_summary(survfit(Surv(time, status) ~ 1, data=veteran))
# time cutoff with fewer than 10 at risk
cutoff <- min(fit1_ss$time[fit1_ss$n.risk < 10])
# KM fit and subset to cutoff
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
fit_ss <- subset(surv_summary(fit.obj), time < cutoff)
# KM survival plot and risk table as separate plots
p1 <- ggsurvplot(fit_ss, conf.int=TRUE)
# note options xlim and break.time.by
p2 <- ggsurvtable_mod(fit.obj,
survtable="risk.table",
tables.theme=theme_cleantable(),
xlim=c(0, cutoff),
break.time.by=100)
# turn ggplot objects into grobs and arrange them (needs tweaking)
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
lom <- rbind(c(NA, rep(1, 14)),
c(NA, rep(1, 14)),
c(rep(2, 15)))
gridExtra::grid.arrange(grobs=list(g1, g2), layout_matrix=lom)