Finding the mean of selected variables in R

I'm trying to build a table that summarizes the median of each isoform (ADA1, ADA2, and Total ADA) for each of the four visits, totalling at least 12 medians. I'm struggling to put the select and median functions together without error. I'm also unsure whether I accidentally lost all of the Visit 1 variables during my manipulations of the dataframe. Will someone please help me check that the Visit 1 variables still exist, and then create a table listing all of the medians?

I'll start with sample data, somewhat similar to your picture.
library(dplyr)
set.seed(42)
dat <- tibble(
  Vaccine = sample(c("HBV", "BCG"), size = 100, replace = TRUE),
  isoform = sample(c("ADA1", "ADA2", "Total ADA"), size = 100, replace = TRUE),
  fc = rexp(100),
  `log(fc)` = log1p(fc)
)
dat
# # A tibble: 100 x 4
# Vaccine isoform fc `log(fc)`
# <chr> <chr> <dbl> <dbl>
# 1 HBV ADA2 1.10 0.741
# 2 HBV ADA1 1.66 0.978
# 3 HBV ADA2 0.910 0.647
# 4 HBV ADA1 3.59 1.52
# 5 BCG ADA2 0.0600 0.0583
# 6 BCG Total ADA 1.38 0.867
# 7 BCG ADA1 2.16 1.15
# 8 BCG ADA1 0.123 0.116
# 9 HBV Total ADA 2.21 1.17
# 10 BCG ADA2 2.08 1.12
# # ... with 90 more rows
We can simply group and summarize with:
dat %>%
  group_by(Vaccine, isoform) %>%
  summarize_at(vars(fc, "log(fc)"), list(mu = ~ mean(.), median = ~ median(.))) %>%
  ungroup()
# # A tibble: 6 x 6
# Vaccine isoform fc_mu `log(fc)_mu` fc_median `log(fc)_median`
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 BCG ADA1 0.970 0.571 0.590 0.459
# 2 BCG ADA2 1.44 0.795 1.13 0.757
# 3 BCG Total ADA 1.22 0.731 1.27 0.819
# 4 HBV ADA1 1.26 0.739 0.964 0.674
# 5 HBV ADA2 1.56 0.798 1.10 0.741
# 6 HBV Total ADA 0.876 0.582 0.787 0.581
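As an aside, on dplyr 1.0 or later summarize_at is superseded by across(); a minimal equivalent sketch (with your real data, you would likely add Visit to the group_by) is:
dat %>%
  group_by(Vaccine, isoform) %>%
  summarize(across(c(fc, `log(fc)`), list(mu = mean, median = median))) %>%
  ungroup()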
If you want base R, aggregate works too; note that the pair of statistics comes back as matrix columns, printed as fc.mu, fc.med, and so on:
aggregate(cbind(fc, `log(fc)`) ~ Vaccine + isoform, data = dat,
          FUN = function(z) c(mu = mean(z), med = median(z)))
# Vaccine isoform fc.mu fc.med log(fc).mu log(fc).med
# 1 BCG ADA1 0.9704116 0.5899109 0.5714296 0.4594965
# 2 HBV ADA1 1.2644729 0.9636581 0.7386704 0.6740528
# 3 BCG ADA2 1.4422983 1.1322882 0.7949414 0.7571957
# 4 HBV ADA2 1.5551761 1.0975749 0.7977444 0.7407819
# 5 BCG Total ADA 1.2164525 1.2682549 0.7308947 0.8190108
# 6 HBV Total ADA 0.8756976 0.7872367 0.5820488 0.5806707
or with data.table:
library(data.table)
as.data.table(dat)[
  , unlist(lapply(.SD, function(z) list(mu = mean(z), med = median(z))),
           recursive = FALSE),
  by = .(Vaccine, isoform), .SDcols = c("fc", "log(fc)")][]
# Vaccine isoform fc.mu fc.med log(fc).mu log(fc).med
# 1: HBV ADA2 1.5551761 1.0975749 0.7977444 0.7407819
# 2: HBV ADA1 1.2644729 0.9636581 0.7386704 0.6740528
# 3: BCG ADA2 1.4422983 1.1322882 0.7949414 0.7571957
# 4: BCG Total ADA 1.2164525 1.2682549 0.7308947 0.8190108
# 5: BCG ADA1 0.9704116 0.5899109 0.5714296 0.4594965
# 6: HBV Total ADA 0.8756976 0.7872367 0.5820488 0.5806707
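If you want the groups sorted as in the dplyr output, keyby sorts while grouping; the call is otherwise identical:
as.data.table(dat)[
  , unlist(lapply(.SD, function(z) list(mu = mean(z), med = median(z))),
           recursive = FALSE),
  keyby = .(Vaccine, isoform), .SDcols = c("fc", "log(fc)")][]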

Related

Average a multiple number of rows for every column, multiple times

Here I have a snippet of my dataset. The rows indicate different days of the year.
The Substations represent individuals, there are over 500 individuals.
The 10 minute time periods run all the way through 24 hours.
I need to find an average value for each 10 minute interval for each individual in this dataset. This should result in single row for each individual substation, with the respective average value for each time interval.
I have tried:
meanbygroup <- stationgroup %>%
  group_by(Substation) %>%
  summarise(means = colMeans(tenminintervals[sapply(tenminintervals, is.numeric)]))
But this averages the entire column and I am left with the same average values for each individual substation.
So for each individual substation, I need an average for each individual time interval.
Please help!
Try using summarize(across()), like this:
df %>%
  group_by(Substation) %>%
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)))
Output:
Substation `00:00` `00:10` `00:20`
<chr> <dbl> <dbl> <dbl>
1 A -0.233 0.110 -0.106
2 B 0.203 -0.0997 -0.128
3 C -0.0733 0.196 -0.0205
4 D 0.0905 -0.0449 -0.0529
5 E 0.401 0.152 -0.0957
6 F 0.0368 0.120 -0.0787
7 G 0.0323 -0.0792 -0.278
8 H 0.132 -0.0766 0.157
9 I -0.0693 0.0578 0.0732
10 J 0.0776 -0.176 -0.0192
# … with 16 more rows
Input:
set.seed(123)
df <- bind_cols(
  tibble(Substation = sample(LETTERS, size = 1000, replace = TRUE)),
  as_tibble(setNames(lapply(1:3, function(x) rnorm(1000)),
                     c("00:00", "00:10", "00:20")))
) %>%
  arrange(Substation)
# A tibble: 1,000 × 4
Substation `00:00` `00:10` `00:20`
<chr> <dbl> <dbl> <dbl>
1 A 0.121 -1.94 0.137
2 A -0.322 1.05 0.416
3 A -0.158 -1.40 0.192
4 A -1.85 1.69 -0.0922
5 A -1.16 -0.455 0.754
6 A 1.95 1.06 0.732
7 A -0.132 0.655 -1.84
8 A 1.08 -0.329 -0.130
9 A -1.21 2.82 -0.0571
10 A -1.04 0.237 -0.328
# … with 990 more rows
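If you prefer base R, a minimal sketch with aggregate should give the same per-substation means (this assumes Substation is the first column and everything else is numeric):
# group the numeric columns by Substation and average each one
aggregate(df[-1], by = list(Substation = df$Substation), FUN = mean, na.rm = TRUE)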

Coalesce column values based on mapping to a vector of values

If I have the following two objects:
> set.seed(100)
> lookup <- sample(1:3, 20, replace=T)
> lookup
[1] 2 3 2 3 1 2 2 3 2 2 3 2 2 3 3 3 3 2 1 3
and
> tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
> tb
# A tibble: 20 × 3
A B C
<dbl> <dbl> <dbl>
1 0.770 0.780 0.456
2 0.882 0.884 0.445
3 0.549 0.208 0.245
4 0.278 0.307 0.694
5 0.488 0.331 0.412
6 0.929 0.199 0.328
7 0.349 0.236 0.573
8 0.954 0.275 0.967
9 0.695 0.591 0.662
10 0.889 0.253 0.625
11 0.180 0.123 0.857
12 0.629 0.230 0.775
13 0.990 0.598 0.834
14 0.130 0.211 0.0915
15 0.331 0.464 0.460
16 0.865 0.647 0.599
17 0.778 0.961 0.920
18 0.827 0.676 0.983
19 0.603 0.445 0.0378
20 0.491 0.358 0.578
How do I use lookup to select the value of the corresponding row/column from tb?
i.e.
if the first element of lookup = 1 then I would like to select the value in A from the first row of tb
if the second element of lookup = 2 then I would like to select the value in B from the second row of tb
So I should end up with a 1d vector that is the same size as lookup. It will look like this:
> new data
> [1] 0.780 0.445 0.208 0.694 0.488 ... 0.578
Thanks!
data.frame (but not tibble or data.table) supports indexing on a matrix, so with this data,
set.seed(42)
lookup <- sample(1:3, 20, replace=T)
lookup
# [1] 1 1 1 1 2 2 2 1 3 3 1 1 2 2 2 3 3 1 1 3
tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
head(tb)
# # A tibble: 6 x 3
# A B C
# <dbl> <dbl> <dbl>
# 1 0.514 0.958 0.189
# 2 0.390 0.888 0.271
# 3 0.906 0.640 0.828
# 4 0.447 0.971 0.693
# 5 0.836 0.619 0.241
# 6 0.738 0.333 0.0430
We can do
as.data.frame(tb)[cbind(seq_along(lookup), lookup)]
# [1] 0.514211784 0.390203467 0.905738131 0.446969628 0.618838207 0.333427211 0.346748248 0.388108283 0.479398564
# [10] 0.197410342 0.832916080 0.007334147 0.171264330 0.261087964 0.514412935 0.581604003 0.157905208 0.037431033
# [19] 0.973539914 0.775823363
A less-efficient method can be done without as.data.frame:
mapply(`[[`, list(tb), seq_along(lookup), lookup)
# [1] 0.514211784 0.390203467 0.905738131 0.446969628 0.618838207 0.333427211 0.346748248 0.388108283 0.479398564
# [10] 0.197410342 0.832916080 0.007334147 0.171264330 0.261087964 0.514412935 0.581604003 0.157905208 0.037431033
# [19] 0.973539914 0.775823363
## also works with `list(as.data.table(tb))`
Though it does take a big hit in performance (not a surprise):
bench::mark(
  sindri_baldur1 = unlist(tb, use.names = FALSE)[seq_along(lookup) + (lookup - 1L) * nrow(tb)],
  sindri_baldur2 = unlist(tb)[seq_along(lookup) + (lookup - 1L) * nrow(tb)],
  base = as.data.frame(tb)[cbind(seq_along(lookup), lookup)],
  mapply = mapply(`[[`, list(tb), seq_along(lookup), lookup),
  paulsmith2 = {
    tb %>%
      mutate(lookup = lookup) %>%
      rowwise %>%
      mutate(new = c_across(A:C)[lookup]) %>%
      pull(new)
  },
  check = FALSE)
# # A tibble: 5 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 sindri_baldur1 4.5us 5.3us 159430. 736B 15.9 9999 1 62.7ms <NULL> <Rprof~ <benc~ <tibb~
# 2 sindri_baldur2 13.2us 14.7us 56723. 1.44KB 0 10000 0 176.3ms <NULL> <Rprof~ <benc~ <tibb~
# 3 base 78.3us 91.6us 7334. 944B 8.59 3414 4 465.5ms <NULL> <Rprof~ <benc~ <tibb~
# 4 mapply 612.4us 779.45us 942. 720B 6.39 442 3 469.4ms <NULL> <Rprof~ <benc~ <tibb~
# 5 paulsmith2 4.37ms 5.85ms 147. 20.3KB 6.51 68 3 461.1ms <NULL> <Rprof~ <benc~ <tibb~
(I have to use check = FALSE to accommodate the names introduced in sindri_baldur2; otherwise all results are numerically identical.)
You could:
unlist(tb, use.names = FALSE)[seq_along(lookup) + (lookup - 1L)*nrow(tb)]
# [1] 0.78035851 0.44541398 0.20771390 0.69435071 0.48830599 0.19867907 0.23569430 0.96699908 0.59132105
# [10] 0.25339065 0.85665304 0.22990589 0.59757529 0.09151028 0.45952549 0.59939816 0.91972191 0.67639817
# [19] 0.60332436 0.57793740
You could also keep use.names = TRUE and use the names to track the original locations:
unlist(tb)[seq_along(lookup) + (lookup - 1L)*nrow(tb)] |> head()
# B1 C2 B3 C4 A5 B6
# 0.7803585 0.4454140 0.2077139 0.6943507 0.4883060 0.1986791
A base R solution (note this relies on every column being numeric: apply() flattens each row into a single vector, so a non-numeric column would break the indexing):
tb$lookup <- lookup
tb$new <- apply(tb, 1, function(x) x[x[4]])
new <- tb$new
new
#> [1] 0.78035851 0.44541398 0.20771390 0.69435071 0.48830599 0.19867907
#> [7] 0.23569430 0.96699908 0.59132105 0.25339065 0.85665304 0.22990589
#> [13] 0.59757529 0.09151028 0.45952549 0.59939816 0.91972191 0.67639817
#> [19] 0.60332436 0.57793740
Another possible solution, based on tidyverse:
library(tidyverse)
set.seed(100)
lookup <- sample(1:3, 20, replace=T)
tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
tb %>%
  mutate(lookup = lookup) %>%
  rowwise %>%
  mutate(new = c_across(A:C)[lookup]) %>%
  pull(new)
#> [1] 0.78035851 0.44541398 0.20771390 0.69435071 0.48830599 0.19867907
#> [7] 0.23569430 0.96699908 0.59132105 0.25339065 0.85665304 0.22990589
#> [13] 0.59757529 0.09151028 0.45952549 0.59939816 0.91972191 0.67639817
#> [19] 0.60332436 0.57793740
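For completeness, a short purrr sketch of the same row/column lookup (likely slower than the matrix-indexing approach above, but easy to read):
# for each row i, pull column lookup[i] and take its i-th element
purrr::map2_dbl(seq_along(lookup), lookup, ~ tb[[.y]][.x])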

Summing up Certain Sequences of a Dataframe in R

I have several data frames of daily rates of different regions by age-groups:
Date        0-14 Rate  15-29 Rate  30-44 Rate  45-64 Rate  65-79 Rate  80+ Rate
2020-23-12          0       33.54       45.68       88.88       96.13     41.28
2020-24-12          0       25.14       35.28       66.14       90.28     38.41
The data begin on a Wednesday (2020-23-12) and run from then up to the present.
I want to obtain weekly row sums of the rates, from each Wednesday to the following Tuesday.
There should be a clever way to combine the aggregate, seq, and rowsum functions to do this in a few lines; otherwise I'll end up writing something much longer.
I created some minimal data: three weeks with arbitrary columns and numeric values (no missings). You can use tidyverse verbs to sum over columns, create groups per week, and then sum the row sums by week:
# Minimal data
MWE <- data.frame(date = c(outer(as.Date("12/23/20", "%m/%d/%y"), 0:20, `+`)),
                  column1 = runif(21, 0, 1),
                  column2 = runif(21, 0, 1))
library(tidyverse)
MWE %>%
  # calculate row sums across all numeric columns
  mutate(sum = rowSums(across(where(is.numeric)))) %>%
  # create week groups
  group_by(week = ceiling(row_number() / 7)) %>%
  # sum over all row sums per group
  summarise(rowSums_by_week = sum(sum))
# Groups: week [3]
date column1 column2 sum week
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-12-23 0.449 0.759 1.21 1
2 2020-12-24 0.423 0.0956 0.519 1
3 2020-12-25 0.974 0.592 1.57 1
4 2020-12-26 0.798 0.250 1.05 1
5 2020-12-27 0.870 0.487 1.36 1
6 2020-12-28 0.952 0.345 1.30 1
7 2020-12-29 0.349 0.817 1.17 1
8 2020-12-30 0.227 0.727 0.954 2
9 2020-12-31 0.292 0.209 0.501 2
10 2021-01-01 0.678 0.276 0.954 2
# ... with 11 more rows
# A tibble: 3 x 2
week rowSums_by_week
<dbl> <dbl>
1 1 8.16
2 2 6.02
3 3 6.82
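One hedge worth noting: ceiling(row_number() / 7) assumes exactly one row per day with no gaps. A sketch that groups on the dates themselves (assuming the series really does start on Wednesday 2020-12-23) stays correct even if some days are missing:
MWE %>%
  mutate(sum = rowSums(across(where(is.numeric)))) %>%
  # days since the first Wednesday, integer-divided into 7-day blocks
  group_by(week = as.numeric(date - as.Date("2020-12-23")) %/% 7 + 1) %>%
  summarise(rowSums_by_week = sum(sum))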

Iterate through groups of variables in a survey - R

As indicated here, if you want to calculate proportions of a categorical variable in the amazing srvyr package, you first have to group by the variables as factors and then call an empty srvyr::survey_mean(), as in this example.
My goal is to iterate over the second grouping variable (cname and sch.wide) while keeping the first grouping variable stype, to avoid duplicating the code.
library(survey)
library(srvyr)
data(api)
df <- apiclus1 %>%
  mutate(cname = as.factor(cname)) %>%
  select(pw, stype, cname, sch.wide) %>%
  as_survey_design(weights = pw)
# proportions of sch.wide
df %>%
  group_by(stype, sch.wide) %>%
  summarise(prop = srvyr::survey_mean())
#> # A tibble: 6 x 4
#> stype sch.wide prop prop_se
#> <fct> <fct> <dbl> <dbl>
#> 1 E No 0.0833 0.0231
#> 2 E Yes 0.917 0.0231
#> 3 H No 0.214 0.110
#> 4 H Yes 0.786 0.110
#> 5 M No 0.32 0.0936
#> 6 M Yes 0.68 0.0936
# proportions of cname
df %>%
  group_by(stype, cname) %>%
  summarise(prop = srvyr::survey_mean())
#> # A tibble: 33 x 4
#> stype cname prop prop_se
#> <fct> <fct> <dbl> <dbl>
#> 1 E Alameda 0.0556 0.0191
#> 2 E Fresno 0.0139 0.00978
#> 3 E Kern 0.00694 0.00694
#> 4 E Los Angeles 0.0833 0.0231
#> 5 E Mendocino 0.0139 0.00978
#> 6 E Merced 0.0139 0.00978
#> 7 E Orange 0.0903 0.0239
#> 8 E Plumas 0.0278 0.0137
#> 9 E San Diego 0.347 0.0398
#> 10 E San Joaquin 0.208 0.0339
#> # ... with 23 more rows
Created on 2019-11-28 by the reprex package (v0.3.0)
Maybe the way to go here is creating lists that keep the first grouping variable and split the data by another group of variables, and then calculate the proportions.
I would like to find a solution that involves purrr::map or the tidyverse.
Thanks in advance for the help, or for pointing to the answer!
There are multiple ways. If we pass the variables as strings, one option is group_by_at, which takes strings as arguments:
library(purrr)
library(dplyr)
library(survey)
library(srvyr)
map(c('sch.wide', 'cname'), ~
  df %>%
    group_by_at(vars("stype", .x)) %>%
    summarise(prop = srvyr::survey_mean()))
#[[1]]
# A tibble: 6 x 4
# stype sch.wide prop prop_se
# <fct> <fct> <dbl> <dbl>
#1 E No 0.0833 0.0231
#2 E Yes 0.917 0.0231
#3 H No 0.214 0.110
#4 H Yes 0.786 0.110
#5 M No 0.32 0.0936
#6 M Yes 0.68 0.0936
#[[2]]
# A tibble: 30 x 4
# stype cname prop prop_se
# <fct> <fct> <dbl> <dbl>
# 1 E Alameda 0.0556 0.0191
# 2 E Fresno 0.0139 0.00978
# 3 E Kern 0.00694 0.00694
# 4 E Los Angeles 0.0833 0.0231
# 5 E Mendocino 0.0139 0.00978
# 6 E Merced 0.0139 0.00978
# 7 E Orange 0.0903 0.0239
# 8 E Plumas 0.0278 0.0137
# 9 E San Diego 0.347 0.0398
#10 E San Joaquin 0.208 0.0339
# … with 20 more rows
Or another option is to wrap the variables with quos to create a quosure list and evaluate (!!) each one in group_by:
map(quos(sch.wide, cname), ~
  df %>%
    group_by(stype, !!.x) %>%
    summarise(prop = srvyr::survey_mean()))
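On current dplyr, where group_by_at is superseded, a sketch of the same iteration that converts the strings to symbols with rlang::syms and unquotes them should work the same way:
map(rlang::syms(c("sch.wide", "cname")), ~
  df %>%
    group_by(stype, !!.x) %>%
    summarise(prop = srvyr::survey_mean()))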

How to truncate Kaplan Meier curves when number at risk is < 10

For a publication in a peer-reviewed scientific journal (http://www.redjournal.org), we would like to prepare Kaplan-Meier plots. The journal has the following specific guidelines for these plots:
"If your figures include curves generated from analyses using the Kaplan-Meier method or the cumulative incidence method, the following are now requirements for the presentation of these curves:
That the number of patients at risk is indicated;
That censoring marks are included;
That curves be truncated when there are fewer than 10 patients at risk; and
An estimate of the confidence interval should be included either in the figure itself or the text."
Here, I illustrate my problem with the veteran dataset (https://github.com/tidyverse/reprex is great!).
We can address 1, 2, and 4 easily with the survminer package:
library(survival)
library(survminer)
#> Warning: package 'survminer' was built under R version 3.4.3
#> Loading required package: ggplot2
#> Loading required package: ggpubr
#> Warning: package 'ggpubr' was built under R version 3.4.3
#> Loading required package: magrittr
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
ggsurvplot(fit.obj,
           conf.int = TRUE,
           risk.table = "absolute",
           tables.theme = theme_cleantable())
I have, however, a problem with requirement 3 (truncate curves when there are fewer than 10 patients at risk). I see that all the required information is available in the survfit object:
library(survival)
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
summary(fit.obj)
#> Call: survfit(formula = Surv(time, status) ~ celltype, data = veteran)
#>
#> celltype=squamous
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 1 35 2 0.943 0.0392 0.8690 1.000
#> 8 33 1 0.914 0.0473 0.8261 1.000
#> 10 32 1 0.886 0.0538 0.7863 0.998
#> 11 31 1 0.857 0.0591 0.7487 0.981
#> 15 30 1 0.829 0.0637 0.7127 0.963
#> 25 29 1 0.800 0.0676 0.6779 0.944
#> 30 27 1 0.770 0.0713 0.6426 0.924
#> 33 26 1 0.741 0.0745 0.6083 0.902
#> 42 25 1 0.711 0.0772 0.5749 0.880
#> 44 24 1 0.681 0.0794 0.5423 0.856
#> 72 23 1 0.652 0.0813 0.5105 0.832
#> 82 22 1 0.622 0.0828 0.4793 0.808
#> 110 19 1 0.589 0.0847 0.4448 0.781
#> 111 18 1 0.557 0.0861 0.4112 0.754
#> 112 17 1 0.524 0.0870 0.3784 0.726
#> 118 16 1 0.491 0.0875 0.3464 0.697
#> 126 15 1 0.458 0.0876 0.3152 0.667
#> 144 14 1 0.426 0.0873 0.2849 0.636
#> 201 13 1 0.393 0.0865 0.2553 0.605
#> 228 12 1 0.360 0.0852 0.2265 0.573
#> 242 10 1 0.324 0.0840 0.1951 0.539
#> 283 9 1 0.288 0.0820 0.1650 0.503
#> 314 8 1 0.252 0.0793 0.1362 0.467
#> 357 7 1 0.216 0.0757 0.1088 0.429
#> 389 6 1 0.180 0.0711 0.0831 0.391
#> 411 5 1 0.144 0.0654 0.0592 0.351
#> 467 4 1 0.108 0.0581 0.0377 0.310
#> 587 3 1 0.072 0.0487 0.0192 0.271
#> 991 2 1 0.036 0.0352 0.0053 0.245
#> 999 1 1 0.000 NaN NA NA
#>
#> celltype=smallcell
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 2 48 1 0.9792 0.0206 0.93958 1.000
#> 4 47 1 0.9583 0.0288 0.90344 1.000
#> 7 46 2 0.9167 0.0399 0.84172 0.998
#> 8 44 1 0.8958 0.0441 0.81345 0.987
#> 10 43 1 0.8750 0.0477 0.78627 0.974
#> 13 42 2 0.8333 0.0538 0.73430 0.946
#> 16 40 1 0.8125 0.0563 0.70926 0.931
#> 18 39 2 0.7708 0.0607 0.66065 0.899
#> 20 37 2 0.7292 0.0641 0.61369 0.866
#> 21 35 2 0.6875 0.0669 0.56812 0.832
#> 22 33 1 0.6667 0.0680 0.54580 0.814
#> 24 32 1 0.6458 0.0690 0.52377 0.796
#> 25 31 2 0.6042 0.0706 0.48052 0.760
#> 27 29 1 0.5833 0.0712 0.45928 0.741
#> 29 28 1 0.5625 0.0716 0.43830 0.722
#> 30 27 1 0.5417 0.0719 0.41756 0.703
#> 31 26 1 0.5208 0.0721 0.39706 0.683
#> 51 25 2 0.4792 0.0721 0.35678 0.644
#> 52 23 1 0.4583 0.0719 0.33699 0.623
#> 54 22 2 0.4167 0.0712 0.29814 0.582
#> 56 20 1 0.3958 0.0706 0.27908 0.561
#> 59 19 1 0.3750 0.0699 0.26027 0.540
#> 61 18 1 0.3542 0.0690 0.24171 0.519
#> 63 17 1 0.3333 0.0680 0.22342 0.497
#> 80 16 1 0.3125 0.0669 0.20541 0.475
#> 87 15 1 0.2917 0.0656 0.18768 0.453
#> 95 14 1 0.2708 0.0641 0.17026 0.431
#> 99 12 2 0.2257 0.0609 0.13302 0.383
#> 117 9 1 0.2006 0.0591 0.11267 0.357
#> 122 8 1 0.1755 0.0567 0.09316 0.331
#> 139 6 1 0.1463 0.0543 0.07066 0.303
#> 151 5 1 0.1170 0.0507 0.05005 0.274
#> 153 4 1 0.0878 0.0457 0.03163 0.244
#> 287 3 1 0.0585 0.0387 0.01600 0.214
#> 384 2 1 0.0293 0.0283 0.00438 0.195
#> 392 1 1 0.0000 NaN NA NA
#>
#> celltype=adeno
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 3 27 1 0.9630 0.0363 0.89430 1.000
#> 7 26 1 0.9259 0.0504 0.83223 1.000
#> 8 25 2 0.8519 0.0684 0.72786 0.997
#> 12 23 1 0.8148 0.0748 0.68071 0.975
#> 18 22 1 0.7778 0.0800 0.63576 0.952
#> 19 21 1 0.7407 0.0843 0.59259 0.926
#> 24 20 1 0.7037 0.0879 0.55093 0.899
#> 31 19 1 0.6667 0.0907 0.51059 0.870
#> 35 18 1 0.6296 0.0929 0.47146 0.841
#> 36 17 1 0.5926 0.0946 0.43344 0.810
#> 45 16 1 0.5556 0.0956 0.39647 0.778
#> 48 15 1 0.5185 0.0962 0.36050 0.746
#> 51 14 1 0.4815 0.0962 0.32552 0.712
#> 52 13 1 0.4444 0.0956 0.29152 0.678
#> 73 12 1 0.4074 0.0946 0.25850 0.642
#> 80 11 1 0.3704 0.0929 0.22649 0.606
#> 84 9 1 0.3292 0.0913 0.19121 0.567
#> 90 8 1 0.2881 0.0887 0.15759 0.527
#> 92 7 1 0.2469 0.0850 0.12575 0.485
#> 95 6 1 0.2058 0.0802 0.09587 0.442
#> 117 5 1 0.1646 0.0740 0.06824 0.397
#> 132 4 1 0.1235 0.0659 0.04335 0.352
#> 140 3 1 0.0823 0.0553 0.02204 0.307
#> 162 2 1 0.0412 0.0401 0.00608 0.279
#> 186 1 1 0.0000 NaN NA NA
#>
#> celltype=large
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 12 27 1 0.9630 0.0363 0.89430 1.000
#> 15 26 1 0.9259 0.0504 0.83223 1.000
#> 19 25 1 0.8889 0.0605 0.77791 1.000
#> 43 24 1 0.8519 0.0684 0.72786 0.997
#> 49 23 1 0.8148 0.0748 0.68071 0.975
#> 52 22 1 0.7778 0.0800 0.63576 0.952
#> 53 21 1 0.7407 0.0843 0.59259 0.926
#> 100 20 1 0.7037 0.0879 0.55093 0.899
#> 103 19 1 0.6667 0.0907 0.51059 0.870
#> 105 18 1 0.6296 0.0929 0.47146 0.841
#> 111 17 1 0.5926 0.0946 0.43344 0.810
#> 133 16 1 0.5556 0.0956 0.39647 0.778
#> 143 15 1 0.5185 0.0962 0.36050 0.746
#> 156 14 1 0.4815 0.0962 0.32552 0.712
#> 162 13 1 0.4444 0.0956 0.29152 0.678
#> 164 12 1 0.4074 0.0946 0.25850 0.642
#> 177 11 1 0.3704 0.0929 0.22649 0.606
#> 200 9 1 0.3292 0.0913 0.19121 0.567
#> 216 8 1 0.2881 0.0887 0.15759 0.527
#> 231 7 1 0.2469 0.0850 0.12575 0.485
#> 250 6 1 0.2058 0.0802 0.09587 0.442
#> 260 5 1 0.1646 0.0740 0.06824 0.397
#> 278 4 1 0.1235 0.0659 0.04335 0.352
#> 340 3 1 0.0823 0.0553 0.02204 0.307
#> 378 2 1 0.0412 0.0401 0.00608 0.279
#> 553 1 1 0.0000 NaN NA NA
But I have no idea how I can manipulate this list. I would very much appreciate any advice on how to filter out all lines with n.risk < 10 from fit.obj.
I can't quite seem to get this all the way there, but I see that you can pass a data.frame rather than a fit object to the plotting function. You can do this and clip the values. For example:
ss <- subset(surv_summary(fit.obj), n.risk >= 10)
ggsurvplot(ss,
           conf.int = TRUE)
But it seems that in this mode it does not automatically print the table. There is a function to draw just the table:
ggrisktable(fit.obj, tables.theme = theme_cleantable())
So I guess you could just combine them. Maybe I'm missing an easier way to draw the table when using a data.frame in the same plot.
As a slight variation on the above answers, if you want to truncate each group individually when less than 10 patients are at risk in that group, I found this to work and not require plotting the figure and table separately:
library(survival)
library(survminer)
# truncate each line when fewer than 10 at risk
atrisk <- 10
# KM fit
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
# subset each stratum separately
maxcutofftime <- 0  # for plotting
strata <- rep(names(fit.obj$strata), fit.obj$strata)
for (i in names(fit.obj$strata)) {
  cutofftime <- min(fit.obj$time[fit.obj$n.risk < atrisk & strata == i])
  maxcutofftime <- max(maxcutofftime, cutofftime)
  cutoffs <- which(fit.obj$n.risk < atrisk & strata == i)
  fit.obj$lower[cutoffs] <- NA
  fit.obj$upper[cutoffs] <- NA
  fit.obj$surv[cutoffs] <- NA
}
# plot
ggsurvplot(fit.obj, data = veteran, risk.table = TRUE, conf.int = TRUE, pval = FALSE,
           tables.theme = theme_cleantable(), xlim = c(0, maxcutofftime), break.x.by = 90)
Edited to add: note that if we had used pval = TRUE above, that would give the p-value for the truncated data, not the full data. It doesn't make much of a difference in this example, as both are p < 0.0001, but be careful :)
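If you do need the p-value for the full (untruncated) data, one option, as a sketch, is to run the log-rank test separately and report that value:
# log-rank test on the complete data, independent of any plot truncation
survdiff(Surv(time, status) ~ celltype, data = veteran)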
I'm following up on MrFlick's great answer.
I'd interpret 3) to mean that there should be at least 10 at risk in total - i.e., not per group. So we have to create an ungrouped Kaplan-Meier fit first and determine the time cutoff from there.
Subset the surv_summary object with respect to this cutoff.
Plot the KM curve and the risk table separately. Crucially, the function survminer::ggrisktable() (a minimal front end for ggsurvtable()) accepts the options xlim and break.time.by. However, the function can currently only extend the upper time limit, not reduce it. I assume this is a bug. I created the function ggsurvtable_mod() to change this.
Turn the ggplot objects into grobs and use gridExtra::grid.arrange() to put both plots together. There is probably a more elegant way to do this based on the options widths and heights.
Admittedly, this is a bit of a hack and needs tweaking to get the correct alignment between survival plot and risk table.
library(survival)
library(survminer)
# ungrouped KM estimate to determine cutoff
fit1_ss <- surv_summary(survfit(Surv(time, status) ~ 1, data = veteran))
# time cutoff with fewer than 10 at risk
cutoff <- min(fit1_ss$time[fit1_ss$n.risk < 10])
# KM fit and subset to cutoff
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
fit_ss <- subset(surv_summary(fit.obj), time < cutoff)
# KM survival plot and risk table as separate plots
p1 <- ggsurvplot(fit_ss, conf.int = TRUE)
# note options xlim and break.time.by
p2 <- ggsurvtable_mod(fit.obj,
                      survtable = "risk.table",
                      tables.theme = theme_cleantable(),
                      xlim = c(0, cutoff),
                      break.time.by = 100)
# turn ggplot objects into grobs and arrange them (needs tweaking)
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
lom <- rbind(c(NA, rep(1, 14)),
             c(NA, rep(1, 14)),
             c(rep(2, 15)))
gridExtra::grid.arrange(grobs = list(g1, g2), layout_matrix = lom)
