I know that data.table vs dplyr comparisons are a perennial favourite on SO. (Full disclosure: I like and use both packages.)
However, in trying to provide some comparisons for a class that I'm teaching, I ran into something surprising w.r.t. memory usage. My expectation was that dplyr would perform especially poorly with operations that require (implicit) filtering or slicing of data. But that's not what I'm finding. Compare:
First dplyr.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DF = tibble(x = rep(1:10, times = 1e5),
y = sample(LETTERS[1:10], 10e5, replace = TRUE),
z = rnorm(1e6))
DF %>% filter(x > 7) %>% group_by(y) %>% summarise(mean(z))
#> # A tibble: 10 x 2
#> y `mean(z)`
#> * <chr> <dbl>
#> 1 A -0.00336
#> 2 B -0.00702
#> 3 C 0.00291
#> 4 D -0.00430
#> 5 E -0.00705
#> 6 F -0.00568
#> 7 G -0.00344
#> 8 H 0.000553
#> 9 I -0.00168
#> 10 J 0.00661
bench::bench_process_memory()
#> current max
#> 585MB 611MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Then data.table.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DT = data.table(x = rep(1:10, times = 1e5),
y = sample(LETTERS[1:10], 10e5, replace = TRUE),
z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372
bench::bench_process_memory()
#> current max
#> 948.47MB 1.17GB
Created on 2020-04-22 by the reprex package (v0.3.0)
So, basically, data.table appears to use nearly twice the memory that dplyr does for this simple filtering + grouping operation. Note that I'm essentially replicating a use case that @Arun suggested here would be much more memory-efficient on the data.table side. (data.table is still a lot faster, though.)
Any ideas, or am I just missing something obvious?
P.S. Comparing memory usage ends up being more complicated than it first seems, because R's standard memory profiling tools (Rprofmem and co.) all ignore operations that occur outside of R (e.g. calls to the C++ stack). Luckily, the bench package now provides a bench_process_memory() function that also tracks memory outside of R's GC heap, which is why I use it here.
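For contrast, here is a minimal base-R sketch of the GC-heap-only view; the gc() counters below see allocations made on R's heap but are blind to memory allocated by compiled code:

```r
gc(reset = TRUE)              # reset the "max used" high-water counters
x <- rnorm(1e6)               # an allocation that R's GC tracks
gc()["Vcells", "max used"]    # peak R-heap vector cells; C/C++ allocations are invisible here
```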
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.9.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.12.8 dplyr_0.8.99.9002 bench_1.1.1.9000
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 knitr_1.28 magrittr_1.5 tidyselect_1.0.0
#> [5] R6_2.4.1 rlang_0.4.5.9000 stringr_1.4.0 highr_0.8
#> [9] tools_3.6.3 xfun_0.13 htmltools_0.4.0 ellipsis_0.3.0
#> [13] yaml_2.2.1 digest_0.6.25 tibble_3.0.1 lifecycle_0.2.0
#> [17] crayon_1.3.4 purrr_0.3.4 vctrs_0.2.99.9011 glue_1.4.0
#> [21] evaluate_0.14 rmarkdown_2.1 stringi_1.4.6 compiler_3.6.3
#> [25] pillar_1.4.3 generics_0.0.2 pkgconfig_2.0.3
Created on 2020-04-22 by the reprex package (v0.3.0)
UPDATE: Following @jangorecki's suggestion, I redid the analysis using the cgmemtime shell utility. The numbers are far closer — even with multithreading enabled — and data.table now edges out dplyr w.r.t. high-water RSS+CACHE memory usage.
dplyr
$ ./cgmemtime Rscript ~/mem-comp-dplyr.R
Child user: 0.526 s
Child sys : 0.033 s
Child wall: 0.455 s
Child high-water RSS : 128952 KiB
Recursive and acc. high-water RSS+CACHE : 118516 KiB
data.table
$ ./cgmemtime Rscript ~/mem-comp-dt.R
Child user: 0.510 s
Child sys : 0.056 s
Child wall: 0.464 s
Child high-water RSS : 129032 KiB
Recursive and acc. high-water RSS+CACHE : 118320 KiB
Bottom line: Accurately measuring memory usage from within R is complicated.
I'll leave my original answer below because I think it still has value.
ORIGINAL ANSWER:
Okay, so in the process of writing this out I realised that data.table's default multi-threading behaviour appears to be the major culprit. If I re-run the latter chunk, but this time turn off multi-threading, the two results are much more comparable:
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
setDTthreads(1) ## TURN OFF MULTITHREADING
DT = data.table(x = rep(1:10, times = 1e5),
y = sample(LETTERS[1:10], 10e5, replace = TRUE),
z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372
bench::bench_process_memory()
#> current max
#> 589MB 612MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Still, I'm surprised that they're this close. The data.table memory performance actually gets comparatively worse if I try a larger data set — despite using a single thread — which makes me suspicious that I'm still not measuring memory usage correctly...
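For what it's worth, a quick sanity check on the threading hypothesis is to query and restore data.table's thread count around the benchmark (getDTthreads() and setDTthreads() are the exported helpers):

```r
library(data.table)
old_threads <- getDTthreads()   # how many threads data.table currently uses
setDTthreads(1)                 # force single-threaded execution
## ... run the memory benchmark here ...
setDTthreads(old_threads)       # restore the previous setting
```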
Related
When calculating the average by group in a data.table I get distinct results:
qty <- c(1:6)
name <- c("a", "b","a", "a", "c","b" )
type <- c("i", "i", "i", "f", "f", "f")
DT <- data.table(qty,name,type)
DT[, avg_mean := mean(qty) , by = .(name, type)]
DT[, avg_sum_N := sum(qty)/.N , by = .(name, type)]
> DT
qty name type avg_mean avg_sum_N
<int> <char> <char> <num> <num>
1: 1 a i 2 2
2: 2 b i 4 2
3: 3 a i 2 2
4: 4 a f 2 4
5: 5 c f 6 5
6: 6 b f 5 6
I would expect avg_mean and avg_sum_N to be exactly the same, since mean(qty) should equal sum(qty)/.N.
Why are they different? Thank you.
Please find below session info.
> packageVersion('data.table')
[1] ‘1.14.3’
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] zoo_1.8-10 lubridate_1.8.0 RPostgres_1.4.3 DBI_1.1.2 stringi_1.7.6 readxl_1.4.0
[7] gsubfn_0.7 proto_1.0.0 stringr_1.4.0 magrittr_2.0.3 stringdist_0.9.8 fuzzyjoin_0.1.6
[13] data.table_1.14.3
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 pillar_1.7.0 compiler_4.1.0 cellranger_1.1.0 tools_4.1.0 bit_4.0.4
[7] lattice_0.20-44 lifecycle_1.0.1 tibble_3.1.6 pkgconfig_2.0.3 rlang_1.0.2 cli_3.2.0
[13] rstudioapi_0.13 writexl_1.4.0 parallel_4.1.0 dplyr_1.0.8 hms_1.1.1 generics_0.1.2
[19] vctrs_0.4.1 grid_4.1.0 bit64_4.0.5 tidyselect_1.1.2 glue_1.6.2 R6_2.5.1
[25] fansi_1.0.3 tcltk_4.1.0 blob_1.2.3 purrr_0.3.4 ellipsis_0.3.2 assertthat_0.2.1
[31] utf8_1.2.2 crayon_1.5.1
The problem was a bug in the development version of data.table.
data.table::update.dev.pkg() fixed the problem.
It is related to the GForce optimization for sum and mean, which uses gsum and gmean. GForce can be disabled with options:
options(datatable.optimize=1)
Or one can specifically use base::mean and base::sum:
DT[, avg_mean := base::mean(qty) , by = .(name, type)]
DT[, avg_sum_N := base::sum(qty)/.N , by = .(name, type)]
The difference is revealed with verbose = TRUE:
> DT[, avg_mean := mean(qty) , by = .(name, type), verbose = TRUE]
Argument 'by' after substitute: .(name, type)
Detected that j uses these columns: [avg_mean, qty]
Finding groups using forderv ... forder.c received 6 rows and 2 columns
0.001s elapsed (0.001s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.000s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 5
0.001s elapsed (0.001s cpu)
lapply optimization is on, j unchanged as 'mean(qty)'
GForce optimized j to 'gmean(qty)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.000
This gmean took (narm=FALSE) ... gather took ... 0.000s
0.000s
gforce eval took 0.000
0.001s elapsed (0.002s cpu)
Assigning to 6 row subset of 6 rows
RHS_list_of_columns == false
> DT[, avg_mean := base::mean(qty) , by = .(name, type), verbose = TRUE]
Argument 'by' after substitute: .(name, type)
Detected that j uses these columns: [avg_mean, qty]
Finding groups using forderv ... forder.c received 6 rows and 2 columns
0.002s elapsed (0.001s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.001s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 5
0.001s elapsed (0.001s cpu)
lapply optimization is on, j unchanged as 'base::mean(qty)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
collecting discontiguous groups took 0.000s for 5 groups
eval(j) took 0.000s for 5 calls
0.001s elapsed (0.001s cpu)
If we want to make use of the GForce optimization, we may need to order by the grouping columns first, i.e. using setkey:
setkey(DT, name, type)
DT[, avg_sum_N := sum(qty)/.N , by = .(name, type)]
DT[, avg_mean := mean(qty) , by = .(name, type)]
> DT
Key: <name, type>
qty name type avg_sum_N avg_mean
<int> <char> <char> <num> <num>
1: 4 a f 4 4
2: 1 a i 2 2
3: 3 a i 2 2
4: 6 b f 6 6
5: 2 b i 2 2
6: 5 c f 5 5
I would like an object that gives me a date range for every month (or quarter) from 1990-01-01 to 2021-12-31, separated by a colon. So, for example, in the monthly case the first object would be 1990-01-01:1990-01-31, the second object would be 1990-02-01:1990-02-28, and so on.
The issue I am having trouble with is making sure that the date range is exclusive, i.e., that no date gets repeated.
start_date1 <- as.Date("1990-01-01", "%Y-%m-%d")
end_date1 <- as.Date("2021-12-01", "%Y-%m-%d")
first_date <- format(seq(start_date1,end_date1,by="month"),"%Y-%m-%d")
start_date2 <- as.Date("1990-02-01", "%Y-%m-%d")
end_date2 <- as.Date("2022-01-01", "%Y-%m-%d")
second_date <- format(seq(start_date2,end_date2,by="month"),"%Y-%m-%d")
date<-paste0(first_date, ":")
finaldate<-paste0(date, second_date)
This code works, except that the first date of each month gets repeated ("1990-01-01:1990-02-01", "1990-02-01:1990-03-01"), and the last range is "2021-12-01:2022-01-01" (including Jan 1, 2022, rather than stopping at Dec 31, 2021).
If I go by 30 days instead, it doesn't work either, because not every month has 30 days.
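The 30-day drift is easy to see in base R:

```r
seq(as.Date("1990-01-01"), by = "30 days", length.out = 4)
#> [1] "1990-01-01" "1990-01-31" "1990-03-02" "1990-04-01"
```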
What's the best way to get an exclusive date range?
You could do:
dates <- seq(as.Date("1990-01-01"), as.Date("2022-01-01"), by = "month")
dates <- paste(head(dates, -1), tail(dates - 1, -1), sep = ":")
resulting in:
dates
#> [1] "1990-01-01:1990-01-31" "1990-02-01:1990-02-28" "1990-03-01:1990-03-31"
#> [4] "1990-04-01:1990-04-30" "1990-05-01:1990-05-31" "1990-06-01:1990-06-30"
#> [7] "1990-07-01:1990-07-31" "1990-08-01:1990-08-31" "1990-09-01:1990-09-30"
#> [10] "1990-10-01:1990-10-31" "1990-11-01:1990-11-30" "1990-12-01:1990-12-31"
#> [13] "1991-01-01:1991-01-31" "1991-02-01:1991-02-28" "1991-03-01:1991-03-31"
#> [16] "1991-04-01:1991-04-30" "1991-05-01:1991-05-31" "1991-06-01:1991-06-30"
#> [19] "1991-07-01:1991-07-31" "1991-08-01:1991-08-31" "1991-09-01:1991-09-30"
#> [22] "1991-10-01:1991-10-31" "1991-11-01:1991-11-30" "1991-12-01:1991-12-31"
#> [25] "1992-01-01:1992-01-31" "1992-02-01:1992-02-29" "1992-03-01:1992-03-31"
#> [28] "1992-04-01:1992-04-30" "1992-05-01:1992-05-31" "1992-06-01:1992-06-30"
#> [31] "1992-07-01:1992-07-31" "1992-08-01:1992-08-31" "1992-09-01:1992-09-30"
#> [34] "1992-10-01:1992-10-31" "1992-11-01:1992-11-30" "1992-12-01:1992-12-31"
#> [37] "1993-01-01:1993-01-31" "1993-02-01:1993-02-28" "1993-03-01:1993-03-31"
#> [40] "1993-04-01:1993-04-30" "1993-05-01:1993-05-31" "1993-06-01:1993-06-30"
#> [43] "1993-07-01:1993-07-31" "1993-08-01:1993-08-31" "1993-09-01:1993-09-30"
#> [46] "1993-10-01:1993-10-31" "1993-11-01:1993-11-30" "1993-12-01:1993-12-31"
#> [49] "1994-01-01:1994-01-31" "1994-02-01:1994-02-28" "1994-03-01:1994-03-31"
#> [52] "1994-04-01:1994-04-30" "1994-05-01:1994-05-31" "1994-06-01:1994-06-30"
#> [55] "1994-07-01:1994-07-31" "1994-08-01:1994-08-31" "1994-09-01:1994-09-30"
#> [58] "1994-10-01:1994-10-31" "1994-11-01:1994-11-30" "1994-12-01:1994-12-31"
#> [61] "1995-01-01:1995-01-31" "1995-02-01:1995-02-28" "1995-03-01:1995-03-31"
#> [64] "1995-04-01:1995-04-30" "1995-05-01:1995-05-31" "1995-06-01:1995-06-30"
#> [67] "1995-07-01:1995-07-31" "1995-08-01:1995-08-31" "1995-09-01:1995-09-30"
#> [70] "1995-10-01:1995-10-31" "1995-11-01:1995-11-30" "1995-12-01:1995-12-31"
#> [73] "1996-01-01:1996-01-31" "1996-02-01:1996-02-29" "1996-03-01:1996-03-31"
#> [76] "1996-04-01:1996-04-30" "1996-05-01:1996-05-31" "1996-06-01:1996-06-30"
#> [79] "1996-07-01:1996-07-31" "1996-08-01:1996-08-31" "1996-09-01:1996-09-30"
#> [82] "1996-10-01:1996-10-31" "1996-11-01:1996-11-30" "1996-12-01:1996-12-31"
#> [85] "1997-01-01:1997-01-31" "1997-02-01:1997-02-28" "1997-03-01:1997-03-31"
#> [88] "1997-04-01:1997-04-30" "1997-05-01:1997-05-31" "1997-06-01:1997-06-30"
#> [91] "1997-07-01:1997-07-31" "1997-08-01:1997-08-31" "1997-09-01:1997-09-30"
#> [94] "1997-10-01:1997-10-31" "1997-11-01:1997-11-30" "1997-12-01:1997-12-31"
#> [97] "1998-01-01:1998-01-31" "1998-02-01:1998-02-28" "1998-03-01:1998-03-31"
#> [100] "1998-04-01:1998-04-30" "1998-05-01:1998-05-31" "1998-06-01:1998-06-30"
#> [103] "1998-07-01:1998-07-31" "1998-08-01:1998-08-31" "1998-09-01:1998-09-30"
#> [106] "1998-10-01:1998-10-31" "1998-11-01:1998-11-30" "1998-12-01:1998-12-31"
#> [109] "1999-01-01:1999-01-31" "1999-02-01:1999-02-28" "1999-03-01:1999-03-31"
#> [112] "1999-04-01:1999-04-30" "1999-05-01:1999-05-31" "1999-06-01:1999-06-30"
#> [115] "1999-07-01:1999-07-31" "1999-08-01:1999-08-31" "1999-09-01:1999-09-30"
#> [118] "1999-10-01:1999-10-31" "1999-11-01:1999-11-30" "1999-12-01:1999-12-31"
#> [121] "2000-01-01:2000-01-31" "2000-02-01:2000-02-29" "2000-03-01:2000-03-31"
#> [124] "2000-04-01:2000-04-30" "2000-05-01:2000-05-31" "2000-06-01:2000-06-30"
#> [127] "2000-07-01:2000-07-31" "2000-08-01:2000-08-31" "2000-09-01:2000-09-30"
#> [130] "2000-10-01:2000-10-31" "2000-11-01:2000-11-30" "2000-12-01:2000-12-31"
#> [133] "2001-01-01:2001-01-31" "2001-02-01:2001-02-28" "2001-03-01:2001-03-31"
#> [136] "2001-04-01:2001-04-30" "2001-05-01:2001-05-31" "2001-06-01:2001-06-30"
#> [139] "2001-07-01:2001-07-31" "2001-08-01:2001-08-31" "2001-09-01:2001-09-30"
#> [142] "2001-10-01:2001-10-31" "2001-11-01:2001-11-30" "2001-12-01:2001-12-31"
#> [145] "2002-01-01:2002-01-31" "2002-02-01:2002-02-28" "2002-03-01:2002-03-31"
#> [148] "2002-04-01:2002-04-30" "2002-05-01:2002-05-31" "2002-06-01:2002-06-30"
#> [151] "2002-07-01:2002-07-31" "2002-08-01:2002-08-31" "2002-09-01:2002-09-30"
#> [154] "2002-10-01:2002-10-31" "2002-11-01:2002-11-30" "2002-12-01:2002-12-31"
#> [157] "2003-01-01:2003-01-31" "2003-02-01:2003-02-28" "2003-03-01:2003-03-31"
#> [160] "2003-04-01:2003-04-30" "2003-05-01:2003-05-31" "2003-06-01:2003-06-30"
#> [163] "2003-07-01:2003-07-31" "2003-08-01:2003-08-31" "2003-09-01:2003-09-30"
#> [166] "2003-10-01:2003-10-31" "2003-11-01:2003-11-30" "2003-12-01:2003-12-31"
#> [169] "2004-01-01:2004-01-31" "2004-02-01:2004-02-29" "2004-03-01:2004-03-31"
#> [172] "2004-04-01:2004-04-30" "2004-05-01:2004-05-31" "2004-06-01:2004-06-30"
#> [175] "2004-07-01:2004-07-31" "2004-08-01:2004-08-31" "2004-09-01:2004-09-30"
#> [178] "2004-10-01:2004-10-31" "2004-11-01:2004-11-30" "2004-12-01:2004-12-31"
#> [181] "2005-01-01:2005-01-31" "2005-02-01:2005-02-28" "2005-03-01:2005-03-31"
#> [184] "2005-04-01:2005-04-30" "2005-05-01:2005-05-31" "2005-06-01:2005-06-30"
#> [187] "2005-07-01:2005-07-31" "2005-08-01:2005-08-31" "2005-09-01:2005-09-30"
#> [190] "2005-10-01:2005-10-31" "2005-11-01:2005-11-30" "2005-12-01:2005-12-31"
#> [193] "2006-01-01:2006-01-31" "2006-02-01:2006-02-28" "2006-03-01:2006-03-31"
#> [196] "2006-04-01:2006-04-30" "2006-05-01:2006-05-31" "2006-06-01:2006-06-30"
#> [199] "2006-07-01:2006-07-31" "2006-08-01:2006-08-31" "2006-09-01:2006-09-30"
#> [202] "2006-10-01:2006-10-31" "2006-11-01:2006-11-30" "2006-12-01:2006-12-31"
#> [205] "2007-01-01:2007-01-31" "2007-02-01:2007-02-28" "2007-03-01:2007-03-31"
#> [208] "2007-04-01:2007-04-30" "2007-05-01:2007-05-31" "2007-06-01:2007-06-30"
#> [211] "2007-07-01:2007-07-31" "2007-08-01:2007-08-31" "2007-09-01:2007-09-30"
#> [214] "2007-10-01:2007-10-31" "2007-11-01:2007-11-30" "2007-12-01:2007-12-31"
#> [217] "2008-01-01:2008-01-31" "2008-02-01:2008-02-29" "2008-03-01:2008-03-31"
#> [220] "2008-04-01:2008-04-30" "2008-05-01:2008-05-31" "2008-06-01:2008-06-30"
#> [223] "2008-07-01:2008-07-31" "2008-08-01:2008-08-31" "2008-09-01:2008-09-30"
#> [226] "2008-10-01:2008-10-31" "2008-11-01:2008-11-30" "2008-12-01:2008-12-31"
#> [229] "2009-01-01:2009-01-31" "2009-02-01:2009-02-28" "2009-03-01:2009-03-31"
#> [232] "2009-04-01:2009-04-30" "2009-05-01:2009-05-31" "2009-06-01:2009-06-30"
#> [235] "2009-07-01:2009-07-31" "2009-08-01:2009-08-31" "2009-09-01:2009-09-30"
#> [238] "2009-10-01:2009-10-31" "2009-11-01:2009-11-30" "2009-12-01:2009-12-31"
#> [241] "2010-01-01:2010-01-31" "2010-02-01:2010-02-28" "2010-03-01:2010-03-31"
#> [244] "2010-04-01:2010-04-30" "2010-05-01:2010-05-31" "2010-06-01:2010-06-30"
#> [247] "2010-07-01:2010-07-31" "2010-08-01:2010-08-31" "2010-09-01:2010-09-30"
#> [250] "2010-10-01:2010-10-31" "2010-11-01:2010-11-30" "2010-12-01:2010-12-31"
#> [253] "2011-01-01:2011-01-31" "2011-02-01:2011-02-28" "2011-03-01:2011-03-31"
#> [256] "2011-04-01:2011-04-30" "2011-05-01:2011-05-31" "2011-06-01:2011-06-30"
#> [259] "2011-07-01:2011-07-31" "2011-08-01:2011-08-31" "2011-09-01:2011-09-30"
#> [262] "2011-10-01:2011-10-31" "2011-11-01:2011-11-30" "2011-12-01:2011-12-31"
#> [265] "2012-01-01:2012-01-31" "2012-02-01:2012-02-29" "2012-03-01:2012-03-31"
#> [268] "2012-04-01:2012-04-30" "2012-05-01:2012-05-31" "2012-06-01:2012-06-30"
#> [271] "2012-07-01:2012-07-31" "2012-08-01:2012-08-31" "2012-09-01:2012-09-30"
#> [274] "2012-10-01:2012-10-31" "2012-11-01:2012-11-30" "2012-12-01:2012-12-31"
#> [277] "2013-01-01:2013-01-31" "2013-02-01:2013-02-28" "2013-03-01:2013-03-31"
#> [280] "2013-04-01:2013-04-30" "2013-05-01:2013-05-31" "2013-06-01:2013-06-30"
#> [283] "2013-07-01:2013-07-31" "2013-08-01:2013-08-31" "2013-09-01:2013-09-30"
#> [286] "2013-10-01:2013-10-31" "2013-11-01:2013-11-30" "2013-12-01:2013-12-31"
#> [289] "2014-01-01:2014-01-31" "2014-02-01:2014-02-28" "2014-03-01:2014-03-31"
#> [292] "2014-04-01:2014-04-30" "2014-05-01:2014-05-31" "2014-06-01:2014-06-30"
#> [295] "2014-07-01:2014-07-31" "2014-08-01:2014-08-31" "2014-09-01:2014-09-30"
#> [298] "2014-10-01:2014-10-31" "2014-11-01:2014-11-30" "2014-12-01:2014-12-31"
#> [301] "2015-01-01:2015-01-31" "2015-02-01:2015-02-28" "2015-03-01:2015-03-31"
#> [304] "2015-04-01:2015-04-30" "2015-05-01:2015-05-31" "2015-06-01:2015-06-30"
#> [307] "2015-07-01:2015-07-31" "2015-08-01:2015-08-31" "2015-09-01:2015-09-30"
#> [310] "2015-10-01:2015-10-31" "2015-11-01:2015-11-30" "2015-12-01:2015-12-31"
#> [313] "2016-01-01:2016-01-31" "2016-02-01:2016-02-29" "2016-03-01:2016-03-31"
#> [316] "2016-04-01:2016-04-30" "2016-05-01:2016-05-31" "2016-06-01:2016-06-30"
#> [319] "2016-07-01:2016-07-31" "2016-08-01:2016-08-31" "2016-09-01:2016-09-30"
#> [322] "2016-10-01:2016-10-31" "2016-11-01:2016-11-30" "2016-12-01:2016-12-31"
#> [325] "2017-01-01:2017-01-31" "2017-02-01:2017-02-28" "2017-03-01:2017-03-31"
#> [328] "2017-04-01:2017-04-30" "2017-05-01:2017-05-31" "2017-06-01:2017-06-30"
#> [331] "2017-07-01:2017-07-31" "2017-08-01:2017-08-31" "2017-09-01:2017-09-30"
#> [334] "2017-10-01:2017-10-31" "2017-11-01:2017-11-30" "2017-12-01:2017-12-31"
#> [337] "2018-01-01:2018-01-31" "2018-02-01:2018-02-28" "2018-03-01:2018-03-31"
#> [340] "2018-04-01:2018-04-30" "2018-05-01:2018-05-31" "2018-06-01:2018-06-30"
#> [343] "2018-07-01:2018-07-31" "2018-08-01:2018-08-31" "2018-09-01:2018-09-30"
#> [346] "2018-10-01:2018-10-31" "2018-11-01:2018-11-30" "2018-12-01:2018-12-31"
#> [349] "2019-01-01:2019-01-31" "2019-02-01:2019-02-28" "2019-03-01:2019-03-31"
#> [352] "2019-04-01:2019-04-30" "2019-05-01:2019-05-31" "2019-06-01:2019-06-30"
#> [355] "2019-07-01:2019-07-31" "2019-08-01:2019-08-31" "2019-09-01:2019-09-30"
#> [358] "2019-10-01:2019-10-31" "2019-11-01:2019-11-30" "2019-12-01:2019-12-31"
#> [361] "2020-01-01:2020-01-31" "2020-02-01:2020-02-29" "2020-03-01:2020-03-31"
#> [364] "2020-04-01:2020-04-30" "2020-05-01:2020-05-31" "2020-06-01:2020-06-30"
#> [367] "2020-07-01:2020-07-31" "2020-08-01:2020-08-31" "2020-09-01:2020-09-30"
#> [370] "2020-10-01:2020-10-31" "2020-11-01:2020-11-30" "2020-12-01:2020-12-31"
#> [373] "2021-01-01:2021-01-31" "2021-02-01:2021-02-28" "2021-03-01:2021-03-31"
#> [376] "2021-04-01:2021-04-30" "2021-05-01:2021-05-31" "2021-06-01:2021-06-30"
#> [379] "2021-07-01:2021-07-31" "2021-08-01:2021-08-31" "2021-09-01:2021-09-30"
#> [382] "2021-10-01:2021-10-31" "2021-11-01:2021-11-30" "2021-12-01:2021-12-31"
Created on 2022-03-19 by the reprex package (v2.0.1)
I used lubridate for the simplicity of its ymd() function.
require(lubridate)
You start with creating a vector of first days of the month:
start <- seq(ymd("1990-01-01"), ymd("2021-12-01"), by = "month")
Then you create another vector subtracting 1 day to obtain the last day of each month:
b <- start - 1
You remove the first element of that vector
end <- b[-1]
You join them all
paste0(start, ":", end)
There is one easily (manually) fixable issue: end is one element shorter than start, so the very last interval is incorrect.
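One way to fix it (a sketch using the same ymd() approach): generate the ends from a sequence shifted one month forward, so every start gets its true month-end, including December 2021:

```r
library(lubridate)
start <- seq(ymd("1990-01-01"), ymd("2021-12-01"), by = "month")
# first-of-month, shifted one month ahead, minus one day = last day of each month
end   <- seq(ymd("1990-02-01"), ymd("2022-01-01"), by = "month") - 1
paste0(start, ":", end)
```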
1) yearmon/yearqtr. Create a monthly sequence using the yearmon class and then convert that to the start and end dates; similarly for quarters with the yearqtr class. Internally, both represent dates as a year plus a fraction of a year, so use 1/12 and 1/4 in by=. Also note that as.Date gives the date at the start of the month or quarter, while adding the frac = 1 argument gives the end.
library(zoo)
# input
st <- as.Date("1990-01-01")
en <- as.Date("2021-12-01")
# by month
mon <- seq(as.yearmon(st), as.yearmon(en), 1/12)
paste(as.Date(mon), as.Date(mon, frac = 1), sep = ":")
# by quarter
qtr <- seq(as.yearqtr(st), as.yearqtr(en), 1/4)
paste(as.Date(qtr), as.Date(qtr, frac = 1), sep = ":")
There is some question of what the end date should be. The above gives an end date of 2021-12-31 on the last interval, but if the end date should be 2021-12-01, so that no interval extends past en, then replace the two paste lines with these, respectively:
paste(as.Date(mon), pmin(as.Date(mon, frac = 1), en), sep = ":")
paste(as.Date(qtr), pmin(as.Date(qtr, frac = 1), en), sep = ":")
2) Base R. A base R alternative is to use the expressions involving cut shown below to get the end of period. (1) seems less tricky, but this might be useful if using only base R is desired. A similar approach with pmin, as in (1), could be used if we want to ensure that no range extends beyond en.
This and the remaining solutions, but not (1), assume that st is the first of the month; however, that could readily be handled if needed.
mon <- seq(st, en, by = "month")
paste(mon, as.Date(cut(mon + 31, "month")) - 1, sep = ":")
qtr <- seq(st, en, by = "quarter")
paste(as.Date(qtr), as.Date(cut(qtr + 93, "month")) - 1, sep = ":")
3) lubridate. Using various functions from this package, we can write the following. A similar approach using pmin, as in (1), could be used if the ranges may not extend beyond en.
library(lubridate)
mon <- seq(st, en, by = "month")
paste(mon, mon + months(1) - 1, sep = ":")
qtr <- seq(st, en, by = "quarter")
paste(qtr, qtr + months(3) - 1, sep = ":")
4) IDate. We can use the IDate class from data.table, in which case we can make use of cut.IDate, which returns another IDate object rather than a character string (as in base R).
st <- as.IDate("1990-01-01")
en <- as.IDate("2021-12-01")
mon <- seq(st, en, by = "month")
paste(mon, cut(mon + 31, "month") - 1, sep = ":")
qtr <- seq(st, en, by = "quarter")
paste(qtr, cut(qtr + 93, "month") - 1, sep = ":")
Here is a sample of my tibble
protein patient value
<chr> <chr> <dbl>
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71
In the "patient" column the "d" as in "Case-x-d" represents the a number of days. What I would like to do is create a new column stating whether the strings in the "patient" column contain values less than 14d.
I have managed to do this using the following command:
under14 <- "-1d|-2d|-3d|-4d|-4d|-5d|-6d|-7d|-8d|-9d|-11d|-12d|-13d|-14d"
data <- data %>%
mutate(case=ifelse(grepl(under14,data$patient),'under14days','over14days'))
However, this seems extremely clunky and actually took way too long to type. I will have to change my search term many times, so I would like a quicker way to do this. Perhaps some kind of regex is the best option, but I don't really know where to start with this.
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readxl_1.1.0 Rmisc_1.5 plyr_1.8.4 lattice_0.20-35 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5 purrr_0.2.5
[9] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 cellranger_1.1.0 pillar_1.2.3 compiler_3.5.0 bindr_0.1.1 tools_3.5.0 lubridate_1.7.4
[8] jsonlite_1.5 nlme_3.1-137 gtable_0.2.0 pkgconfig_2.0.1 rlang_0.2.1 psych_1.8.4 cli_1.0.0
[15] rstudioapi_0.7 yaml_2.1.19 parallel_3.5.0 haven_1.1.1 bindrcpp_0.2.2 xml2_1.2.0 httr_1.3.1
[22] hms_0.4.2 grid_3.5.0 tidyselect_0.2.4 glue_1.2.0 R6_2.2.2 foreign_0.8-70 modelr_0.1.2
[29] reshape2_1.4.3 magrittr_1.5 scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2
[36] utf8_1.1.4 stringi_1.2.3 lazyeval_0.2.1 munsell_0.5.0 broom_0.4.4 crayon_1.3.4
>
One possibility is to use tidyr::separate
library(tidyverse)
df %>%
separate(patient, into = c("ID1", "Days", "ID2"), sep = "-", extra = "merge", remove = F) %>%
mutate(case = ifelse(as.numeric(Days) <= 14, "under14days", "over14days")) %>%
select(-ID1, -ID2)
# protein patient Days value case
#1 BOD1L2 RF0064_Case-9-d- 9 10.40 under14days
#2 PPFIA2 RF0064_Case-20-d- 20 7.83 over14days
#3 STAT4 RF0064_Case-11-d- 11 11.00 under14days
#4 TOM1L2 RF0064_Case-29-d- 29 13.00 over14days
#5 SH2D2A RF0064_Case-2-d- 2 8.28 under14days
#6 TIGD4 RF0064_Case-49-d- 49 9.71 over14days
Sample data
df <-read.table(text =
" protein patient value
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71 ", header = T, row.names = 1)
Since the format of patient is clearly defined, a possible base-R solution is to use gsub to extract the days and check them against the threshold:
df$case <- ifelse(as.integer(gsub("RF0064_Case-(\\d+)-d-","\\1", df$patient)) <= 14,
"under14days", "over14days")
In exactly the same way, the OP can modify the code used in mutate as:
library(dplyr)
df <- df %>%
mutate(case = ifelse(as.integer(gsub("RF0064_Case-(\\d+)-d-","\\1", patient)) <= 14,
"under14days", "over14days"))
df
# protein patient value case
# 1 BOD1L2 RF0064_Case-9-d- 10.40 under14days
# 2 PPFIA2 RF0064_Case-20-d- 7.83 over14days
# 3 STAT4 RF0064_Case-11-d- 11.00 under14days
# 4 TOM1L2 RF0064_Case-29-d- 13.00 over14days
# 5 SH2D2A RF0064_Case-2-d- 8.28 under14days
# 6 TIGD4 RF0064_Case-49-d- 9.71 over14days
Data:
df <- read.table(text =
"protein patient value
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71",
header = TRUE, stringsAsFactors = FALSE)
We can also extract the number directly with a regex. (?<=-) is a lookbehind, which matches at a position immediately preceded by "-":
library(tidyverse)
dat2 <- dat %>%
mutate(Day = as.numeric(str_extract(patient, pattern = "(?<=-)[0-9]*"))) %>%
mutate(case = ifelse(Day <= 14,'under14days','over14days'))
dat2
# protein patient value Day case
# 1 BOD1L2 RF0064_Case-9-d- 10.40 9 under14days
# 2 PPFIA2 RF0064_Case-20-d- 7.83 20 over14days
# 3 STAT4 RF0064_Case-11-d- 11.00 11 under14days
# 4 TOM1L2 RF0064_Case-29-d- 13.00 29 over14days
# 5 SH2D2A RF0064_Case-2-d- 8.28 2 under14days
# 6 TIGD4 RF0064_Case-49-d- 9.71 49 over14days
DATA
dat <- read.table(text = " protein patient value
1 BOD1L2 'RF0064_Case-9-d-' 10.4
2 PPFIA2 'RF0064_Case-20-d-' 7.83
3 STAT4 'RF0064_Case-11-d-' 11.0
4 TOM1L2 'RF0064_Case-29-d-' 13.0
5 SH2D2A 'RF0064_Case-2-d-' 8.28
6 TIGD4 'RF0064_Case-49-d-' 9.71",
header = TRUE, stringsAsFactors = FALSE)
I want to create a survival dataset featuring multiple-record ids. The existing event data consists of one row observations with the date formatted as dd/mm/yy. The idea is to count the number of consecutive months where there is at least one event/month (there are multiple years, so this has to be accounted for somehow). In other words, I want to create episodes that capture such monthly streaks, including periods of inactivity. To give an example, the code should transform something like this:
df1
id event.date
group1 01/01/16
group1 05/02/16
group1 07/03/16
group1 10/06/16
group1 12/09/16
to this:
df2
id t0 t1 ep.no ep.t ep.type
group1 1 3 1 3 1
group1 4 5 2 2 0
group1 6 6 3 1 1
group1 7 8 4 2 0
group1 9 9 5 1 1
group1 10 ... ... ... ...
where t0 and t1 are the start and end months, ep.no is the episode counter for the particular id, ep.t is the length of that particular episode, and ep.type indicates the type of episode (active/inactive). In the example above, there is an initial three-months of activity, then a two-month break, followed by a single-month episode of relapse etc.
I am mostly concerned about the transformation that produces t0 and t1 from df1 to df2, as the other variables in df2 can be constructed afterwards based on them (e.g. ep.no is a counter, ep.t is simple arithmetic, and ep.type always starts at 1 and alternates). Given the complexity of the problem (at least for me), I get the need to provide the actual data, but I am not sure if that is allowed? I will see what I can do if a mod chimes in.
I think this does what you want. The trick is identifying the sequence of observations that need to be treated together, and using dplyr::lag with cumsum is the way to go.
# Convert to date objects, summarize by month, insert missing months
library(tidyverse)
library(lubridate)
# added rows of data to demonstrate that it works with
# > id and > 1 event per month and rolls across year end
df1 <- read_table("id event.date
group1 01/01/16
group1 02/01/16
group1 05/02/16
group1 07/03/16
group1 10/06/16
group1 12/09/16
group1 01/02/17
group2 01/01/16
group2 05/02/16
group2 07/03/16",col_types="cc")
# need to get rid of extra whitespace, but automatically converts to date
# summarize by month to count events per month
df1.1 <- mutate(df1, event.date=dmy(event.date),
yr=year(event.date),
mon=month(event.date))
# get down to one row per event and complete data
df2 <- group_by(df1.1,id,yr,mon) %>%
summarize(events=n()) %>%
complete(id, yr, mon=1:12, fill=list(events=0)) %>%
group_by(id) %>%
mutate(event = as.numeric(events >0),
is_start=lag(event,default=-1)!=event,
episode=cumsum(is_start),
episode.date=ymd(paste(yr,mon,1,sep="-"))) %>%
group_by(id, episode) %>%
summarize(t0 = first(episode.date),
t1 = last(episode.date) %m+% months(1),
ep.length = as.numeric((last(episode.date) %m+% months(1)) - first(episode.date)),
ep.type = first(event))
Gives
Source: local data frame [10 x 6]
Groups: id [?]
id episode t0 t1 ep.length ep.type
<chr> <int> <dttm> <dttm> <dbl> <dbl>
1 group1 1 2016-01-01 2016-04-01 91 1
2 group1 2 2016-04-01 2016-06-01 61 0
3 group1 3 2016-06-01 2016-07-01 30 1
4 group1 4 2016-07-01 2016-09-01 62 0
5 group1 5 2016-09-01 2016-10-01 30 1
6 group1 6 2016-10-01 2017-02-01 123 0
7 group1 7 2017-02-01 2017-03-01 28 1
8 group1 8 2017-03-01 2018-01-01 306 0
9 group2 1 2016-01-01 2016-04-01 91 1
10 group2 2 2016-04-01 2017-01-01 275 0
Using complete() with mon = 1:12 will always make the last episode stretch to the end of that year. The solution would be to insert a filter() on yr and mon after complete().
The advantage of keeping t0 and t1 as Date-time objects is that they work correctly across year boundaries, which using month numbers won't.
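A sketch of that trimming, reusing df1.1 from above (the month_start helper column is my own addition, not part of the original answer):

```r
library(tidyverse)
library(lubridate)
# After complete(), drop the padding months that fall after each id's last real event
df2_trimmed <- group_by(df1.1, id, yr, mon) %>%
  summarize(events = n()) %>%
  complete(id, yr, mon = 1:12, fill = list(events = 0)) %>%
  mutate(month_start = ymd(paste(yr, mon, 1, sep = "-"))) %>%
  group_by(id) %>%
  filter(month_start <= max(month_start[events > 0])) %>%
  ungroup()
```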
Session information:
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] lubridate_1.3.3 dplyr_0.5.0 purrr_0.2.2
[4] readr_0.2.2 tidyr_0.6.0 tibble_1.2
[7] ggplot2_2.2.0 tidyverse_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 knitr_1.15.1 magrittr_1.5
[4] munsell_0.4.2 colorspace_1.2-6 R6_2.1.3
[7] stringr_1.1.0 highr_0.6 plyr_1.8.4
[10] tools_3.3.2 grid_3.3.2 gtable_0.2.0
[13] DBI_0.5 lazyeval_0.2.0 assertthat_0.1
[16] digest_0.6.10 memoise_1.0.0 evaluate_0.10
[19] stringi_1.1.2 scales_0.4.1
I'm not able to left_join with dplyr 0.3 when trying to use the by argument.
First, I installed v0.3 following Hadley's suggestion on GitHub:
if (packageVersion("devtools") < 1.6) {
install.packages("devtools")
}
devtools::install_github("hadley/lazyeval")
devtools::install_github("hadley/dplyr")
sessionInfo()
# R version 3.1.1 (2014-07-10)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#
#locale:
#[1] LC_COLLATE=Portuguese_Portugal.1252 LC_CTYPE=Portuguese_Portugal.1252
#[3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
#[5] LC_TIME=Portuguese_Portugal.1252
#
#attached base packages:
#[1] stats graphics grDevices utils datasets methods base
#
#other attached packages:
#[1] dplyr_0.3
#
#loaded via a namespace (and not attached):
#[1] assertthat_0.1 DBI_0.3.1 magrittr_1.0.1 parallel_3.1.1 Rcpp_0.11.2 tools_3.1.1
Then taking some data
library(dplyr)
df1 <- as.tbl(data.frame('var1' = LETTERS[1:10], 'value1' = sample(1:100, 10), stringsAsFactors = FALSE))
df2 <- as.tbl(data.frame('var2' = LETTERS[1:10], 'value2' = sample(1:100, 10), stringsAsFactors = FALSE))
And finally trying to left_join
left_join(df1, df2, by = c('var1' = 'var2'))
# Error: cannot join on column 'var1'
But it works with
df2$var1 <- df2$var2
left_join(df1, df2, by = c('var1' = 'var2'))
Source: local data frame [10 x 4]
var1 value1 var2 value2
1 A 37 A 48
2 B 90 B 18
3 C 13 C 36
4 D 94 D 75
5 E 14 E 12
6 F 95 F 52
7 G 60 G 55
8 H 69 H 72
9 I 25 I 49
10 J 47 J 10
My questions
Is dplyr ignoring the by argument in the second example?
Can this be a bug?