Let's say I am passing several function calls to a function through its ... argument:
distributions <- function(...){
dist_list <- list(...)
}
Now if I run distributions(rnorm(50), TidyDensity::tidy_normal()) then I get back a list with a vector and a data.frame.
My question is: how can I get the name of the function called, i.e. rnorm(), and the parameters passed to it?
Using something like dist_list %>% map(formalArgs) gives NULL and the warning "In formals(fun) : argument is not a function".
Are you looking for match.call()?
distributions <- function(...){
as.list(match.call())[-1]
}
distributions(rnorm(50), TidyDensity::tidy_normal())
#> [[1]]
#> rnorm(50)
#>
#> [[2]]
#> TidyDensity::tidy_normal()
Or perhaps, if you want access to both the evaluated and unevaluated expressions:
distributions <- function(...){
setNames(list(...), sapply(as.list(match.call())[-1], deparse))
}
distributions(rnorm(50), TidyDensity::tidy_normal())
#> $`rnorm(50)`
#> [1] -0.52410930 -0.48754350 -0.31346114 1.11142888 -0.16829168 0.14389782
#> [7] 1.87285979 0.22663043 -1.18221292 -0.65343574 -0.36147761 -1.03521579
#> [13] 1.33469895 0.21420578 1.22697541 -0.39742602 0.57371164 1.36802888
#> [19] -0.46048771 -1.40676587 0.38244090 -0.74532223 -0.10575884 0.88656441
#> [25] 1.03761952 0.11923645 -1.25080762 0.04605158 1.13500076 -0.45793246
#> [31] -0.74270252 -0.35263243 1.51000758 0.02781866 1.80205985 -1.13545504
#> [37] 1.21807981 -0.52062922 -0.54958956 0.54630736 0.22934998 -1.57051922
#> [43] 0.52189051 -0.01885723 -1.59054477 0.57197369 -1.44277344 -0.64757076
#> [49] -1.76299781 0.64173935
#>
#> $`TidyDensity::tidy_normal()`
#> # A tibble: 50 x 7
#> sim_number x y dx dy p q
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 -0.895 -3.63 0.000224 0.5 -Inf
#> 2 1 2 0.648 -3.49 0.000616 0.508 -2.05
#> 3 1 3 0.153 -3.34 0.00149 0.516 -1.74
#> 4 1 4 1.36 -3.20 0.00318 0.524 -1.54
#> 5 1 5 0.632 -3.05 0.00597 0.533 -1.39
#> 6 1 6 0.830 -2.91 0.00990 0.541 -1.27
#> 7 1 7 -0.428 -2.77 0.0146 0.549 -1.16
#> 8 1 8 0.435 -2.62 0.0193 0.557 -1.07
#> 9 1 9 1.25 -2.48 0.0233 0.565 -0.981
#> 10 1 10 -0.701 -2.33 0.0267 0.573 -0.901
#> # ... with 40 more rows
Created on 2022-04-04 by the reprex package (v2.0.1)
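To go one step further and split each captured call into the function name and the arguments it was written with, a call object can be indexed like a list: its first element is the function and the remaining elements are the arguments. The following is a minimal sketch of that idea, not part of the original answer; the helper name describe_calls() and the element names fn and args are made up for illustration.

```r
# Hypothetical helper: capture the calls passed through ... without evaluating them
describe_calls <- function(...) {
  calls <- as.list(match.call())[-1]   # drop the name of describe_calls itself
  lapply(calls, function(cl) {
    list(
      fn   = deparse(cl[[1]]),         # function part of the call, as text
      args = as.list(cl)[-1]           # arguments exactly as the caller wrote them
    )
  })
}

res <- describe_calls(rnorm(50), runif(10, min = 0, max = 2))
res[[1]]$fn     # function name of the first call
res[[2]]$args   # arguments of the second call, named where the caller named them
```

This assumes every argument is written as a function call; a bare variable name such as x is a symbol, not a call, and cl[[1]] would fail on it.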
Trying to mutate and pivot_longer a dataframe column raises the error
non-numeric argument to binary operator.
Any assistance will be greatly appreciated.
Below is the code in use.
DF <- mydata4 %>%
mutate('S-25-(OH)-D3' = 'S-25-(OH)-D3 (nmol/L)'/1000) %>%
pivot_longer(.,-`Date of examination`, names_to = "variable",values_to = "value")
You have your variable names in quotes, so R treats them as character strings and tries to divide a string by 1000. See 10.2 here re non-syntactic names.
You could use backticks instead per example two below, or clean the names first, e.g. make them snake_case per example three:
library(tidyverse)
library(janitor)
tibble(`S-25-(OH)-D3 (nmol/L)` = c(1000:1010)) |>
mutate('S-S-25-(OH)-D3' = 'S-25-(OH)-D3 (nmol/L)' / 1000)
#> Error in `mutate()`:
#> ! Problem while computing `S-S-25-(OH)-D3 = "S-25-(OH)-D3
#> (nmol/L)"/1000`.
#> Caused by error in `"S-25-(OH)-D3 (nmol/L)" / 1000`:
#> ! non-numeric argument to binary operator
tibble(`S-25-(OH)-D3 (nmol/L)` = c(1000:1010)) |>
mutate(`S-S-25-(OH)-D3` = `S-25-(OH)-D3 (nmol/L)` / 1000)
#> # A tibble: 11 × 2
#> `S-25-(OH)-D3 (nmol/L)` `S-S-25-(OH)-D3`
#> <int> <dbl>
#> 1 1000 1
#> 2 1001 1.00
#> 3 1002 1.00
#> 4 1003 1.00
#> 5 1004 1.00
#> 6 1005 1.00
#> 7 1006 1.01
#> 8 1007 1.01
#> 9 1008 1.01
#> 10 1009 1.01
#> 11 1010 1.01
tibble(`S-25-(OH)-D3 (nmol/L)` = c(1000:1010)) |>
clean_names() |>
mutate(s_25_oh_d3 = s_25_oh_d3_nmol_l / 1000)
#> # A tibble: 11 × 2
#> s_25_oh_d3_nmol_l s_25_oh_d3
#> <int> <dbl>
#> 1 1000 1
#> 2 1001 1.00
#> 3 1002 1.00
#> 4 1003 1.00
#> 5 1004 1.00
#> 6 1005 1.00
#> 7 1006 1.01
#> 8 1007 1.01
#> 9 1008 1.01
#> 10 1009 1.01
#> 11 1010 1.01
Created on 2022-05-31 by the reprex package (v2.0.1)
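Applied to the pipeline from the question, the backtick fix might look like the sketch below. The contents of mydata4 are invented here (only the column names come from the question), so treat this as an illustration under that assumption.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the asker's mydata4; only the column names come from the question
mydata4 <- tibble(
  `Date of examination`   = as.Date("2022-01-01") + 0:2,
  `S-25-(OH)-D3 (nmol/L)` = c(1000, 1500, 2000)
)

DF <- mydata4 %>%
  # backticks refer to columns; quotes would create character strings
  mutate(`S-25-(OH)-D3` = `S-25-(OH)-D3 (nmol/L)` / 1000) %>%
  pivot_longer(-`Date of examination`, names_to = "variable", values_to = "value")

DF
```

The leading `.,` in the original pivot_longer() call is not needed, because the pipe already supplies the data as the first argument.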
If I have the following two objects:
> set.seed(100)
> lookup <- sample(1:3, 20, replace=T)
> lookup
[1] 2 3 2 3 1 2 2 3 2 2 3 2 2 3 3 3 3 2 1 3
and
> tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
> tb
# A tibble: 20 × 3
A B C
<dbl> <dbl> <dbl>
1 0.770 0.780 0.456
2 0.882 0.884 0.445
3 0.549 0.208 0.245
4 0.278 0.307 0.694
5 0.488 0.331 0.412
6 0.929 0.199 0.328
7 0.349 0.236 0.573
8 0.954 0.275 0.967
9 0.695 0.591 0.662
10 0.889 0.253 0.625
11 0.180 0.123 0.857
12 0.629 0.230 0.775
13 0.990 0.598 0.834
14 0.130 0.211 0.0915
15 0.331 0.464 0.460
16 0.865 0.647 0.599
17 0.778 0.961 0.920
18 0.827 0.676 0.983
19 0.603 0.445 0.0378
20 0.491 0.358 0.578
How do I use lookup to select the value of the corresponding row/column from tb?
i.e.
if the first element of lookup = 1 then I would like to select the value in A from the first row of tb
if the second element of lookup = 2 then I would like to select the value in B from the second row of tb
So I should end up with a 1d vector that is the same size as lookup. It will look like this:
> new data
> [1] 0.780 0.445 0.208 0.694 0.488 ... 0.578
Thanks!
data.frame (but not tibble or data.table) supports indexing on a matrix, so with this data,
set.seed(42)
lookup <- sample(1:3, 20, replace=T)
lookup
# [1] 1 1 1 1 2 2 2 1 3 3 1 1 2 2 2 3 3 1 1 3
tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
head(tb)
# # A tibble: 6 x 3
# A B C
# <dbl> <dbl> <dbl>
# 1 0.514 0.958 0.189
# 2 0.390 0.888 0.271
# 3 0.906 0.640 0.828
# 4 0.447 0.971 0.693
# 5 0.836 0.619 0.241
# 6 0.738 0.333 0.0430
We can do
as.data.frame(tb)[cbind(seq_along(lookup), lookup)]
# [1] 0.514211784 0.390203467 0.905738131 0.446969628 0.618838207 0.333427211 0.346748248 0.388108283 0.479398564
# [10] 0.197410342 0.832916080 0.007334147 0.171264330 0.261087964 0.514412935 0.581604003 0.157905208 0.037431033
# [19] 0.973539914 0.775823363
A less-efficient method can be done without as.data.frame:
mapply(`[[`, list(tb), seq_along(lookup), lookup)
# [1] 0.514211784 0.390203467 0.905738131 0.446969628 0.618838207 0.333427211 0.346748248 0.388108283 0.479398564
# [10] 0.197410342 0.832916080 0.007334147 0.171264330 0.261087964 0.514412935 0.581604003 0.157905208 0.037431033
# [19] 0.973539914 0.775823363
## also works with `list(as.data.table(tb))`
Though it does take a big hit in performance (not a surprise):
bench::mark(
sindri_baldur1 = unlist(tb, use.names = FALSE)[seq_along(lookup) + (lookup - 1L)*nrow(tb)],
sindri_baldur2 = unlist(tb)[seq_along(lookup) + (lookup - 1L)*nrow(tb)],
base = as.data.frame(tb)[cbind(seq_along(lookup), lookup)],
mapply = mapply(`[[`, list(tb), seq_along(lookup), lookup),
paulsmith2 = {
tb %>%
mutate(lookup = lookup) %>%
rowwise %>%
mutate(new = c_across(A:C)[lookup]) %>%
pull(new)
},
check = FALSE)
# # A tibble: 5 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 sindri_baldur1 4.5us 5.3us 159430. 736B 15.9 9999 1 62.7ms <NULL> <Rprof~ <benc~ <tibb~
# 2 sindri_baldur2 13.2us 14.7us 56723. 1.44KB 0 10000 0 176.3ms <NULL> <Rprof~ <benc~ <tibb~
# 3 base 78.3us 91.6us 7334. 944B 8.59 3414 4 465.5ms <NULL> <Rprof~ <benc~ <tibb~
# 4 mapply 612.4us 779.45us 942. 720B 6.39 442 3 469.4ms <NULL> <Rprof~ <benc~ <tibb~
# 5 paulsmith2 4.37ms 5.85ms 147. 20.3KB 6.51 68 3 461.1ms <NULL> <Rprof~ <benc~ <tibb~
(I have to use check=FALSE to work with the names introduced in sindri_baldur2, otherwise all results are numerically identical.)
You could:
unlist(tb, use.names = FALSE)[seq_along(lookup) + (lookup - 1L)*nrow(tb)]
# [1] 0.78035851 0.44541398 0.20771390 0.69435071 0.48830599 0.19867907 0.23569430 0.96699908 0.59132105
# [10] 0.25339065 0.85665304 0.22990589 0.59757529 0.09151028 0.45952549 0.59939816 0.91972191 0.67639817
# [19] 0.60332436 0.57793740
You could also keep the names (the default, use.names = TRUE) to keep track of each value's original location:
unlist(tb)[seq_along(lookup) + (lookup - 1L)*nrow(tb)] |> head()
# B1 C2 B3 C4 A5 B6
# 0.7803585 0.4454140 0.2077139 0.6943507 0.4883060 0.1986791
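The index arithmetic works because unlist() stacks the columns one after another, so the value in row i of column j sits at position i + (j - 1) * nrow. A small sketch with invented toy data (the names df, lin, by_linear and by_matrix are only for illustration) shows the positions and checks them against (row, column) matrix indexing:

```r
# Toy data (invented): 3 rows, 3 columns
df <- data.frame(A = c(1, 2, 3), B = c(10, 20, 30), C = c(100, 200, 300))
lookup <- c(2L, 3L, 1L)   # column to take from rows 1, 2 and 3

# unlist() stacks the columns: positions 1-3 are A, 4-6 are B, 7-9 are C
lin <- seq_along(lookup) + (lookup - 1L) * nrow(df)
by_linear <- unlist(df, use.names = FALSE)[lin]

# the same selection with a two-column (row, column) index matrix
by_matrix <- df[cbind(seq_along(lookup), lookup)]

lin
by_linear
by_matrix
```

Here lin is 4, 8, 3, which picks B from row 1, C from row 2 and A from row 3, i.e. 10, 200 and 3 with either method.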
A base R solution:
tb$lookup <- lookup
tb$new <- apply(tb, 1, function(x) x[x[4]])
new <- tb$new
new
#> [1] 0.78035851 0.44541398 0.20771390 0.69435071 0.48830599 0.19867907
#> [7] 0.23569430 0.96699908 0.59132105 0.25339065 0.85665304 0.22990589
#> [13] 0.59757529 0.09151028 0.45952549 0.59939816 0.91972191 0.67639817
#> [19] 0.60332436 0.57793740
Another possible solution, based on tidyverse:
library(tidyverse)
set.seed(100)
lookup <- sample(1:3, 20, replace=T)
tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
tb %>%
mutate(lookup = lookup) %>%
rowwise %>%
mutate(new = c_across(A:C)[lookup]) %>%
pull(new)
#> [1] 0.78035851 0.44541398 0.20771390 0.69435071 0.48830599 0.19867907
#> [7] 0.23569430 0.96699908 0.59132105 0.25339065 0.85665304 0.22990589
#> [13] 0.59757529 0.09151028 0.45952549 0.59939816 0.91972191 0.67639817
#> [19] 0.60332436 0.57793740
I have created an stm topic model and I am having trouble with summary.estimateEffect: I have around 150 days, yet it only prints regression estimates for 10 days.
parlPrevFit<- stm(document = out$documents, vocab = out$vocab, K = 0, prevalence =~s(day),
max.em.its = 150, data = out$meta, init.type = "Spectral")
prep<- estimateEffect(c(14, 40, 5, 41)~s(day), parlPrevFit, meta = meta, uncertainty = "Global")
summary(prep, topics = c(14, 40, 5, 41))
Topic 14 Coefficients- https://prnt.sc/105pg1a
Could anyone suggest how to print more than 10 days, please?
Instead of using summary(), which you don't have much control over, load the tidytext package and use tidy() instead.
Let's walk through an example where we train a topic model on Jane Austen's novels, with the documents being each chapter:
library(tidyverse)
library(tidytext)
library(stm)
#> stm v1.3.6 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at structuraltopicmodel.com
library(janeaustenr)
books <- austen_books() %>%
group_by(book) %>%
mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, book, chapter, remove = FALSE)
austen_sparse <- books %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(document, word) %>%
cast_sparse(document, word, n)
#> Joining, by = "word"
Let's train a topic model with 6 topics (there are 6 books):
topic_model <- stm(
austen_sparse,
K = 6,
init.type = "Spectral",
verbose = FALSE
)
Let's make a data set to use in estimateEffect():
chapters <- books %>%
group_by(document) %>%
summarize(text = str_c(text, collapse = " ")) %>%
ungroup() %>%
inner_join(books %>%
distinct(document, book))
#> Joining, by = "document"
chapters
#> # A tibble: 269 x 3
#> document text book
#> <chr> <chr> <fct>
#> 1 Emma_1 "CHAPTER I Emma Woodhouse, handsome, clever, and rich, with… Emma
#> 2 Emma_10 "CHAPTER X Though now the middle of December, there had yet… Emma
#> 3 Emma_11 "CHAPTER XI Mr. Elton must now be left to himself. It was n… Emma
#> 4 Emma_12 "CHAPTER XII Mr. Knightley was to dine with them--rather ag… Emma
#> 5 Emma_13 "CHAPTER XIII There could hardly be a happier creature in t… Emma
#> 6 Emma_14 "CHAPTER XIV Some change of countenance was necessary for e… Emma
#> 7 Emma_15 "CHAPTER XV Mr. Woodhouse was soon ready for his tea; and w… Emma
#> 8 Emma_16 "CHAPTER XVI The hair was curled, and the maid sent away, a… Emma
#> 9 Emma_17 "CHAPTER XVII Mr. and Mrs. John Knightley were not detained… Emma
#> 10 Emma_18 "CHAPTER XVIII Mr. Frank Churchill did not come. When the t… Emma
#> # … with 259 more rows
Now let's estimate regressions from our topic model, for our first three topics and our data set of "chapter" documents:
effects <- estimateEffect(1:3 ~ book, topic_model, chapters)
summary(effects)
#>
#> Call:
#> estimateEffect(formula = 1:3 ~ book, stmobj = topic_model, metadata = chapters)
#>
#>
#> Topic 1:
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.018033 0.023726 0.760 0.448
#> bookPride & Prejudice 0.799555 0.037140 21.528 <2e-16 ***
#> bookMansfield Park -0.006387 0.032662 -0.196 0.845
#> bookEmma 0.003188 0.033393 0.095 0.924
#> bookNorthanger Abbey 0.002535 0.039017 0.065 0.948
#> bookPersuasion 0.025725 0.044281 0.581 0.562
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#>
#> Topic 2:
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.015289 0.016478 0.928 0.354
#> bookPride & Prejudice 0.001785 0.023489 0.076 0.939
#> bookMansfield Park 0.001616 0.024664 0.066 0.948
#> bookEmma 0.892516 0.037833 23.591 <2e-16 ***
#> bookNorthanger Abbey 0.006032 0.031530 0.191 0.848
#> bookPersuasion -0.001142 0.030052 -0.038 0.970
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#>
#> Topic 3:
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.0196151 0.0225115 0.871 0.3844
#> bookPride & Prejudice -0.0004909 0.0286302 -0.017 0.9863
#> bookMansfield Park 0.0148960 0.0341272 0.436 0.6628
#> bookEmma -0.0004006 0.0301741 -0.013 0.9894
#> bookNorthanger Abbey 0.8730570 0.0457994 19.063 <2e-16 ***
#> bookPersuasion 0.1030537 0.0495148 2.081 0.0384 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This example doesn't run into the printing limitation you mentioned, but you can avoid any such problem by using tidy() instead, which gives you the actual content of the regressions as a tibble:
tidy(effects)
#> # A tibble: 18 x 6
#> topic term estimate std.error statistic p.value
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 0.0179 0.0238 0.753 4.52e- 1
#> 2 1 bookPride & Prejudice 0.799 0.0373 21.4 1.09e-59
#> 3 1 bookMansfield Park -0.00614 0.0325 -0.189 8.50e- 1
#> 4 1 bookEmma 0.00350 0.0336 0.104 9.17e- 1
#> 5 1 bookNorthanger Abbey 0.00323 0.0394 0.0820 9.35e- 1
#> 6 1 bookPersuasion 0.0253 0.0443 0.571 5.68e- 1
#> 7 2 (Intercept) 0.0153 0.0165 0.925 3.56e- 1
#> 8 2 bookPride & Prejudice 0.00165 0.0234 0.0707 9.44e- 1
#> 9 2 bookMansfield Park 0.00167 0.0246 0.0680 9.46e- 1
#> 10 2 bookEmma 0.892 0.0381 23.4 2.84e-66
#> 11 2 bookNorthanger Abbey 0.00606 0.0317 0.191 8.49e- 1
#> 12 2 bookPersuasion -0.00107 0.0298 -0.0359 9.71e- 1
#> 13 3 (Intercept) 0.0197 0.0228 0.864 3.89e- 1
#> 14 3 bookPride & Prejudice -0.000835 0.0288 -0.0290 9.77e- 1
#> 15 3 bookMansfield Park 0.0147 0.0342 0.428 6.69e- 1
#> 16 3 bookEmma -0.000707 0.0305 -0.0232 9.82e- 1
#> 17 3 bookNorthanger Abbey 0.873 0.0461 18.9 4.93e-51
#> 18 3 bookPersuasion 0.103 0.0496 2.08 3.85e- 2
Created on 2021-02-26 by the reprex package (v1.0.0)
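One thing to watch for with a larger model: a tibble with more than 20 rows prints only its first 10 rows by default, so a long coefficient table can look truncated again. A minimal sketch of two ways around that follows; the coefs table is a made-up stand-in for the output of tidy(effects), not real model output.

```r
library(tibble)

# Made-up stand-in for the output of tidy(effects): 4 topics x 10 terms
coefs <- tibble(
  topic    = rep(1:4, each = 10),
  term     = rep(paste0("term", 1:10), times = 4),
  estimate = (1:40) / 100
)

# Option 1: ask print() for every row instead of the first 10
print(coefs, n = Inf)

# Option 2: convert to a plain data.frame, which is not truncated to 10 rows
coefs_df <- as.data.frame(coefs)
coefs_df
```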
In R, I have a dataframe that looks like this:
sample value gene tag isPTV
1 1120 3.4 arx1 1120|arx1 0
2 2123 2.3 mnf2 2123|mnf2 0
3 1129 1.9 trf4 1129|trf4 0
4 2198 0.2 brc1 2198|brc1 0
5 1120 2.1 arx1 1120|arx1 1
6 2123 0.4 mnf2 2123|mnf2 1
7 1129 1.2 trf4 1129|trf4 1
8 2198 0.9 brc1 2198|brc1 1
Here 0 means false and 1 means true. What I'm ultimately trying to do is create a dataframe that, for each tag, gives the absolute difference between its two value numbers.
For instance, 1129|trf4 occurs in two separate rows: there's a value for when it isPTV and one for when it is not, so the absolute difference would be |1.9 - 1.2| = 0.7.
I started out by trying to write a function to do these for a given tag value, such that, for a given tag, it would return both rows containing the tag:
getExprValue <- function(dataframe, tag){
return(dataframe[tag,])
}
But this is not working, and I'm not very familiar with how you index dataframes in R.
What is the right way to do this?
UPDATE:
Solution 1 Attempt:
m_diff <- m %>% group_by(tag) %>% mutate(absDiff = abs(diff(value)))
Response:
Error in mutate_impl(.data, dots) : Column `absDiff` must be length 1 (the group size), not 0
Solution 2 Attempt:
with(df1, abs(ave(value, tag, FUN = diff)))
Response:
Error in x[i] <- value[[j]] : replacement has length zero
Edit: I just noticed that @akrun had a much simpler solution.
Create data with a structure similar to yours:
library(tidyverse)
dat <- tibble(
sample = rep(sample(1000:3000, 10), 2),
value = rnorm(20, 5, 1),
gene = rep(letters[1:10], 2),
tag = paste(sample, gene, sep = "|"),
isPTV = rep(0:1, each = 10)
)
dat
#> # A tibble: 20 x 5
#> sample value gene tag isPTV
#> <int> <dbl> <chr> <chr> <int>
#> 1 2149 5.90 a 2149|a 0
#> 2 1027 5.46 b 1027|b 0
#> 3 1103 5.65 c 1103|c 0
#> 4 1884 4.86 d 1884|d 0
#> 5 2773 5.58 e 2773|e 0
#> 6 2948 6.98 f 2948|f 0
#> 7 2478 5.17 g 2478|g 0
#> 8 2724 6.71 h 2724|h 0
#> 9 1927 5.06 i 1927|i 0
#> 10 1081 4.39 j 1081|j 0
#> 11 2149 4.60 a 2149|a 1
#> 12 1027 2.97 b 1027|b 1
#> 13 1103 6.17 c 1103|c 1
#> 14 1884 5.83 d 1884|d 1
#> 15 2773 4.23 e 2773|e 1
#> 16 2948 6.48 f 2948|f 1
#> 17 2478 5.06 g 2478|g 1
#> 18 2724 5.32 h 2724|h 1
#> 19 1927 7.32 i 1927|i 1
#> 20 1081 4.73 j 1081|j 1
@akrun's solution (much better than mine):
dat %>%
group_by(tag) %>%
mutate(absDiff = abs(diff(value)))
#> # A tibble: 20 x 6
#> # Groups: tag [10]
#> sample value gene tag isPTV absDiff
#> <int> <dbl> <chr> <chr> <int> <dbl>
#> 1 2149 5.90 a 2149|a 0 1.30
#> 2 1027 5.46 b 1027|b 0 2.49
#> 3 1103 5.65 c 1103|c 0 0.520
#> 4 1884 4.86 d 1884|d 0 0.974
#> 5 2773 5.58 e 2773|e 0 1.34
#> 6 2948 6.98 f 2948|f 0 0.502
#> 7 2478 5.17 g 2478|g 0 0.114
#> 8 2724 6.71 h 2724|h 0 1.39
#> 9 1927 5.06 i 1927|i 0 2.26
#> 10 1081 4.39 j 1081|j 0 0.337
#> 11 2149 4.60 a 2149|a 1 1.30
#> 12 1027 2.97 b 1027|b 1 2.49
#> 13 1103 6.17 c 1103|c 1 0.520
#> 14 1884 5.83 d 1884|d 1 0.974
#> 15 2773 4.23 e 2773|e 1 1.34
#> 16 2948 6.48 f 2948|f 1 0.502
#> 17 2478 5.06 g 2478|g 1 0.114
#> 18 2724 5.32 h 2724|h 1 1.39
#> 19 1927 7.32 i 1927|i 1 2.26
#> 20 1081 4.73 j 1081|j 1 0.337
My initial suggestion (unnecessarily complicated):
nested <- dat %>%
group_by(tag) %>%
nest()
nested %>%
mutate(difference = map(data, ~ abs(diff(.$value)))) %>%
select(- data) %>%
unnest(difference) # name the list-column explicitly; tidyr >= 1.0 warns without it
#> # A tibble: 10 x 2
#> tag difference
#> <chr> <dbl>
#> 1 2149|a 1.30
#> 2 1027|b 2.49
#> 3 1103|c 0.520
#> 4 1884|d 0.974
#> 5 2773|e 1.34
#> 6 2948|f 0.502
#> 7 2478|g 0.114
#> 8 2724|h 1.39
#> 9 1927|i 2.26
#> 10 1081|j 0.337
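The two errors reported in the question ("must be length 1 (the group size), not 0" and "replacement has length zero") are what one would expect if at least one tag occurs in only one row, because diff() of a single value returns a zero-length vector. A hedged base R sketch that tolerates such unpaired tags is below; the extra unpaired row and the helper abs_diff() are invented for illustration, while the first eight rows follow the question's data.

```r
# Data from the question
df1 <- data.frame(
  sample = c(1120, 2123, 1129, 2198, 1120, 2123, 1129, 2198),
  value  = c(3.4, 2.3, 1.9, 0.2, 2.1, 0.4, 1.2, 0.9),
  gene   = c("arx1", "mnf2", "trf4", "brc1", "arx1", "mnf2", "trf4", "brc1"),
  isPTV  = rep(0:1, each = 4)
)
df1$tag <- paste(df1$sample, df1$gene, sep = "|")

# Invented extra row whose tag has no partner
df1 <- rbind(df1, data.frame(sample = 3001, value = 1.0, gene = "xyz9",
                             isPTV = 0, tag = "3001|xyz9"))

length(diff(1.0))   # 0: nothing to subtract with a single value

# Hypothetical helper: absolute difference for pairs, NA otherwise
abs_diff <- function(v) if (length(v) == 2) abs(diff(v)) else NA_real_

res <- tapply(df1$value, df1$tag, abs_diff)
res
```

Checking table(df1$tag) first is a quick way to see whether every tag really has exactly two rows.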
For a publication in a peer-reviewed scientific journal (http://www.redjournal.org), we would like to prepare Kaplan-Meier plots. The journal has the following specific guidelines for these plots:
"If your figures include curves generated from analyses using the Kaplan-Meier method or the cumulative incidence method, the following are now requirements for the presentation of these curves:
1. That the number of patients at risk is indicated;
2. That censoring marks are included;
3. That curves be truncated when there are fewer than 10 patients at risk; and
4. An estimate of the confidence interval should be included either in the figure itself or the text.”
Here, I illustrate my problem with the veteran dataset (https://github.com/tidyverse/reprex is great!).
We can address requirements 1, 2 and 4 easily with the survminer package:
library(survival)
library(survminer)
#> Warning: package 'survminer' was built under R version 3.4.3
#> Loading required package: ggplot2
#> Loading required package: ggpubr
#> Warning: package 'ggpubr' was built under R version 3.4.3
#> Loading required package: magrittr
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
ggsurvplot(fit.obj,
conf.int = T,
risk.table ="absolute",
tables.theme = theme_cleantable())
I have, however, a problem with requirement 3 (truncate curves when there are fewer than 10 patients at risk). I see that all the required information is available in the survfit object:
library(survival)
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
summary(fit.obj)
#> Call: survfit(formula = Surv(time, status) ~ celltype, data = veteran)
#>
#> celltype=squamous
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 1 35 2 0.943 0.0392 0.8690 1.000
#> 8 33 1 0.914 0.0473 0.8261 1.000
#> 10 32 1 0.886 0.0538 0.7863 0.998
#> 11 31 1 0.857 0.0591 0.7487 0.981
#> 15 30 1 0.829 0.0637 0.7127 0.963
#> 25 29 1 0.800 0.0676 0.6779 0.944
#> 30 27 1 0.770 0.0713 0.6426 0.924
#> 33 26 1 0.741 0.0745 0.6083 0.902
#> 42 25 1 0.711 0.0772 0.5749 0.880
#> 44 24 1 0.681 0.0794 0.5423 0.856
#> 72 23 1 0.652 0.0813 0.5105 0.832
#> 82 22 1 0.622 0.0828 0.4793 0.808
#> 110 19 1 0.589 0.0847 0.4448 0.781
#> 111 18 1 0.557 0.0861 0.4112 0.754
#> 112 17 1 0.524 0.0870 0.3784 0.726
#> 118 16 1 0.491 0.0875 0.3464 0.697
#> 126 15 1 0.458 0.0876 0.3152 0.667
#> 144 14 1 0.426 0.0873 0.2849 0.636
#> 201 13 1 0.393 0.0865 0.2553 0.605
#> 228 12 1 0.360 0.0852 0.2265 0.573
#> 242 10 1 0.324 0.0840 0.1951 0.539
#> 283 9 1 0.288 0.0820 0.1650 0.503
#> 314 8 1 0.252 0.0793 0.1362 0.467
#> 357 7 1 0.216 0.0757 0.1088 0.429
#> 389 6 1 0.180 0.0711 0.0831 0.391
#> 411 5 1 0.144 0.0654 0.0592 0.351
#> 467 4 1 0.108 0.0581 0.0377 0.310
#> 587 3 1 0.072 0.0487 0.0192 0.271
#> 991 2 1 0.036 0.0352 0.0053 0.245
#> 999 1 1 0.000 NaN NA NA
#>
#> celltype=smallcell
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 2 48 1 0.9792 0.0206 0.93958 1.000
#> 4 47 1 0.9583 0.0288 0.90344 1.000
#> 7 46 2 0.9167 0.0399 0.84172 0.998
#> 8 44 1 0.8958 0.0441 0.81345 0.987
#> 10 43 1 0.8750 0.0477 0.78627 0.974
#> 13 42 2 0.8333 0.0538 0.73430 0.946
#> 16 40 1 0.8125 0.0563 0.70926 0.931
#> 18 39 2 0.7708 0.0607 0.66065 0.899
#> 20 37 2 0.7292 0.0641 0.61369 0.866
#> 21 35 2 0.6875 0.0669 0.56812 0.832
#> 22 33 1 0.6667 0.0680 0.54580 0.814
#> 24 32 1 0.6458 0.0690 0.52377 0.796
#> 25 31 2 0.6042 0.0706 0.48052 0.760
#> 27 29 1 0.5833 0.0712 0.45928 0.741
#> 29 28 1 0.5625 0.0716 0.43830 0.722
#> 30 27 1 0.5417 0.0719 0.41756 0.703
#> 31 26 1 0.5208 0.0721 0.39706 0.683
#> 51 25 2 0.4792 0.0721 0.35678 0.644
#> 52 23 1 0.4583 0.0719 0.33699 0.623
#> 54 22 2 0.4167 0.0712 0.29814 0.582
#> 56 20 1 0.3958 0.0706 0.27908 0.561
#> 59 19 1 0.3750 0.0699 0.26027 0.540
#> 61 18 1 0.3542 0.0690 0.24171 0.519
#> 63 17 1 0.3333 0.0680 0.22342 0.497
#> 80 16 1 0.3125 0.0669 0.20541 0.475
#> 87 15 1 0.2917 0.0656 0.18768 0.453
#> 95 14 1 0.2708 0.0641 0.17026 0.431
#> 99 12 2 0.2257 0.0609 0.13302 0.383
#> 117 9 1 0.2006 0.0591 0.11267 0.357
#> 122 8 1 0.1755 0.0567 0.09316 0.331
#> 139 6 1 0.1463 0.0543 0.07066 0.303
#> 151 5 1 0.1170 0.0507 0.05005 0.274
#> 153 4 1 0.0878 0.0457 0.03163 0.244
#> 287 3 1 0.0585 0.0387 0.01600 0.214
#> 384 2 1 0.0293 0.0283 0.00438 0.195
#> 392 1 1 0.0000 NaN NA NA
#>
#> celltype=adeno
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 3 27 1 0.9630 0.0363 0.89430 1.000
#> 7 26 1 0.9259 0.0504 0.83223 1.000
#> 8 25 2 0.8519 0.0684 0.72786 0.997
#> 12 23 1 0.8148 0.0748 0.68071 0.975
#> 18 22 1 0.7778 0.0800 0.63576 0.952
#> 19 21 1 0.7407 0.0843 0.59259 0.926
#> 24 20 1 0.7037 0.0879 0.55093 0.899
#> 31 19 1 0.6667 0.0907 0.51059 0.870
#> 35 18 1 0.6296 0.0929 0.47146 0.841
#> 36 17 1 0.5926 0.0946 0.43344 0.810
#> 45 16 1 0.5556 0.0956 0.39647 0.778
#> 48 15 1 0.5185 0.0962 0.36050 0.746
#> 51 14 1 0.4815 0.0962 0.32552 0.712
#> 52 13 1 0.4444 0.0956 0.29152 0.678
#> 73 12 1 0.4074 0.0946 0.25850 0.642
#> 80 11 1 0.3704 0.0929 0.22649 0.606
#> 84 9 1 0.3292 0.0913 0.19121 0.567
#> 90 8 1 0.2881 0.0887 0.15759 0.527
#> 92 7 1 0.2469 0.0850 0.12575 0.485
#> 95 6 1 0.2058 0.0802 0.09587 0.442
#> 117 5 1 0.1646 0.0740 0.06824 0.397
#> 132 4 1 0.1235 0.0659 0.04335 0.352
#> 140 3 1 0.0823 0.0553 0.02204 0.307
#> 162 2 1 0.0412 0.0401 0.00608 0.279
#> 186 1 1 0.0000 NaN NA NA
#>
#> celltype=large
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 12 27 1 0.9630 0.0363 0.89430 1.000
#> 15 26 1 0.9259 0.0504 0.83223 1.000
#> 19 25 1 0.8889 0.0605 0.77791 1.000
#> 43 24 1 0.8519 0.0684 0.72786 0.997
#> 49 23 1 0.8148 0.0748 0.68071 0.975
#> 52 22 1 0.7778 0.0800 0.63576 0.952
#> 53 21 1 0.7407 0.0843 0.59259 0.926
#> 100 20 1 0.7037 0.0879 0.55093 0.899
#> 103 19 1 0.6667 0.0907 0.51059 0.870
#> 105 18 1 0.6296 0.0929 0.47146 0.841
#> 111 17 1 0.5926 0.0946 0.43344 0.810
#> 133 16 1 0.5556 0.0956 0.39647 0.778
#> 143 15 1 0.5185 0.0962 0.36050 0.746
#> 156 14 1 0.4815 0.0962 0.32552 0.712
#> 162 13 1 0.4444 0.0956 0.29152 0.678
#> 164 12 1 0.4074 0.0946 0.25850 0.642
#> 177 11 1 0.3704 0.0929 0.22649 0.606
#> 200 9 1 0.3292 0.0913 0.19121 0.567
#> 216 8 1 0.2881 0.0887 0.15759 0.527
#> 231 7 1 0.2469 0.0850 0.12575 0.485
#> 250 6 1 0.2058 0.0802 0.09587 0.442
#> 260 5 1 0.1646 0.0740 0.06824 0.397
#> 278 4 1 0.1235 0.0659 0.04335 0.352
#> 340 3 1 0.0823 0.0553 0.02204 0.307
#> 378 2 1 0.0412 0.0401 0.00608 0.279
#> 553 1 1 0.0000 NaN NA NA
But I have no idea how I can manipulate this list. I would very much appreciate any advice on how to filter out all lines with n.risk < 10 from fit.obj.
I can't quite seem to get this all the way there. But I see that you can pass a data.frame rather than a fit object to the plotting function. You can do this and clip the values. For example
ss <- subset(surv_summary(fit.obj), n.risk>=10)
ggsurvplot(ss,
conf.int = T)
But it seems in this mode it does not automatically print the table. There is a function to draw just the table with
ggrisktable(fit.obj, tables.theme = theme_cleantable())
So I guess you could just combine them. Maybe I'm missing an easier way to draw the table in the same plot when using a data.frame.
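If you only need the numbers behind the printed summary, the same filtering can be sketched with the survival package alone, since the object returned by summary() on a survfit object stores its columns as vectors. The data frame names km and km10 are made up for illustration, and this only produces a filtered table, not a plot.

```r
library(survival)

fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
s <- summary(fit.obj)

# collect the per-event-time columns of the summary into one data frame
km <- data.frame(
  strata = s$strata,
  time   = s$time,
  n.risk = s$n.risk,
  surv   = s$surv,
  lower  = s$lower,
  upper  = s$upper
)

# keep only the rows with at least 10 patients at risk
km10 <- subset(km, n.risk >= 10)
head(km10)
```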
As a slight variation on the above answers, if you want to truncate each group individually when fewer than 10 patients are at risk in that group, I found that this works and does not require plotting the figure and table separately:
library(survival)
library(survminer)
# truncate each line when fewer than 10 at risk
atrisk <- 10
# KM fit
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
# subset each stratum separately
maxcutofftime = 0 # for plotting
strata <- rep(names(fit.obj$strata), fit.obj$strata)
for (i in names(fit.obj$strata)){
cutofftime <- min(fit.obj$time[fit.obj$n.risk < atrisk & strata == i])
maxcutofftime = max(maxcutofftime, cutofftime)
cutoffs <- which(fit.obj$n.risk < atrisk & strata == i)
fit.obj$lower[cutoffs] <- NA
fit.obj$upper[cutoffs] <- NA
fit.obj$surv[cutoffs] <- NA
}
# plot
ggsurvplot(fit.obj, data = veteran, risk.table = TRUE, conf.int = T, pval = F,
tables.theme = theme_cleantable(), xlim = c(0,maxcutofftime), break.x.by = 90)
Edited to add: note that if we had used pval = T above, that would give the p-value for the truncated data, not the full data. It doesn't make much of a difference in this example, as both are p<0.0001, but be careful :)
I'm following up on MrFlick's great answer.
I'd interpret 3) to mean that there should be at least 10 at risk in total, i.e., not per group. The steps are:
1. Create an ungrouped Kaplan-Meier fit first and determine the time cutoff from there.
2. Subset the surv_summary object with respect to this cutoff.
3. Plot the KM curve and the risk table separately. Crucially, the function survminer::ggrisktable() (a minimal front end for ggsurvtable()) accepts the options xlim and break.time.by. However, the function can currently only extend the upper time limit, not reduce it. I assume this is a bug. I created the function ggsurvtable_mod() to change this.
4. Turn the ggplot objects into grobs and use gridExtra::grid.arrange() to put both plots together. There is probably a more elegant way to do this based on the options widths and heights.
Admittedly, this is a bit of a hack and needs tweaking to get the correct alignment between the survival plot and the risk table.
library(survival)
library(survminer)
# ungrouped KM estimate to determine cutoff
fit1_ss <- surv_summary(survfit(Surv(time, status) ~ 1, data=veteran))
# time cutoff with fewer than 10 at risk
cutoff <- min(fit1_ss$time[fit1_ss$n.risk < 10])
# KM fit and subset to cutoff
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
fit_ss <- subset(surv_summary(fit.obj), time < cutoff)
# KM survival plot and risk table as separate plots
p1 <- ggsurvplot(fit_ss, conf.int=TRUE)
# note options xlim and break.time.by
p2 <- ggsurvtable_mod(fit.obj,
survtable="risk.table",
tables.theme=theme_cleantable(),
xlim=c(0, cutoff),
break.time.by=100)
# turn ggplot objects into grobs and arrange them (needs tweaking)
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
lom <- rbind(c(NA, rep(1, 14)),
c(NA, rep(1, 14)),
c(rep(2, 15)))
gridExtra::grid.arrange(grobs=list(g1, g2), layout_matrix=lom)
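To see how much the two readings of requirement 3 differ (fewer than 10 at risk in total versus per group), both cutoffs can be computed with the survival package alone. This is a rough sketch for comparison only; the names cutoff_total and cutoff_by_group are made up, and it assumes every group does drop below 10 at risk at some point, as in the veteran data.

```r
library(survival)

# cutoff when fewer than 10 patients are at risk in total
fit_all <- survfit(Surv(time, status) ~ 1, data = veteran)
cutoff_total <- min(fit_all$time[fit_all$n.risk < 10])

# cutoff per group: first time each stratum has fewer than 10 at risk
fit_grp <- survfit(Surv(time, status) ~ celltype, data = veteran)
strata  <- rep(names(fit_grp$strata), fit_grp$strata)
below   <- fit_grp$n.risk < 10
cutoff_by_group <- tapply(fit_grp$time[below], strata[below], min)

cutoff_total
cutoff_by_group
```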