I want to create a survival dataset with multiple-record ids. The existing event data consists of one row per event, with the date formatted as dd/mm/yy. The idea is to count the number of consecutive months in which there is at least one event per month (the data span multiple years, so this has to be accounted for somehow). In other words, I want to create episodes that capture such monthly streaks, as well as the periods of inactivity between them. For example, the code should transform something like this:
df1
id event.date
group1 01/01/16
group1 05/02/16
group1 07/03/16
group1 10/06/16
group1 12/09/16
to this:
df2
id t0 t1 ep.no ep.t ep.type
group1 1 3 1 3 1
group1 4 5 2 2 0
group1 6 6 3 1 1
group1 7 8 4 2 0
group1 9 9 5 1 1
group1 10 ... ... ... ...
where t0 and t1 are the start and end months, ep.no is the episode counter for the particular id, ep.t is the length of that particular episode, and ep.type indicates the type of episode (active/inactive). In the example above, there is an initial three-month episode of activity, then a two-month break, followed by a single-month episode of relapse, and so on.
I am mostly concerned about the transformation that produces t0 and t1 from df1, as the other variables in df2 can be constructed from them afterwards (e.g. ep.no is a counter, ep.t is simple arithmetic, and ep.type always starts at 1 and alternates). Given the complexity of the problem (at least for me), I understand the need to provide the actual data, but I am not sure whether that is allowed; I will see what I can do if a mod chimes in.
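For what it's worth, a minimal sketch of that "afterwards" step, assuming a data frame df2 that already holds the integer month columns t0 and t1 per id:
library(dplyr)
df2 %>%
  group_by(id) %>%
  mutate(ep.no = row_number(),                  # episode counter within each id
         ep.t = t1 - t0 + 1,                    # episode length in months
         ep.type = as.integer(ep.no %% 2 == 1)) # starts at 1 and alternates 1/0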
I think this does what you want. The trick is identifying the sequence of observations that need to be treated together, and using dplyr::lag with cumsum is the way to go.
# Convert to date objects, summarize by month, insert missing months
library(tidyverse)
library(lubridate)
# added rows of data to demonstrate that it works with
# > id and > 1 event per month and rolls across year end
df1 <- read_table("id event.date
group1 01/01/16
group1 02/01/16
group1 05/02/16
group1 07/03/16
group1 10/06/16
group1 12/09/16
group1 01/02/17
group2 01/01/16
group2 05/02/16
group2 07/03/16",col_types="cc")
# dmy() copes with the extra whitespace and converts the strings to Date objects
# summarize by month to count events per month
df1.1 <- mutate(df1,
                event.date = dmy(event.date),
                yr = year(event.date),
                mon = month(event.date))
# get down to one row per month and fill in the missing months
df2 <- group_by(df1.1, id, yr, mon) %>%
  summarize(events = n()) %>%
  complete(id, yr, mon = 1:12, fill = list(events = 0)) %>%
  group_by(id) %>%
  mutate(event = as.numeric(events > 0),
         is_start = lag(event, default = -1) != event,
         episode = cumsum(is_start),
         episode.date = ymd(paste(yr, mon, 1, sep = "-"))) %>%
  group_by(id, episode) %>%
  summarize(t0 = first(episode.date),
            t1 = last(episode.date) %m+% months(1),
            ep.length = as.numeric((last(episode.date) %m+% months(1)) - first(episode.date)),
            ep.type = first(event))
Gives
Source: local data frame [10 x 6]
Groups: id [?]
id episode t0 t1 ep.length ep.type
<chr> <int> <dttm> <dttm> <dbl> <dbl>
1 group1 1 2016-01-01 2016-04-01 91 1
2 group1 2 2016-04-01 2016-06-01 61 0
3 group1 3 2016-06-01 2016-07-01 30 1
4 group1 4 2016-07-01 2016-09-01 62 0
5 group1 5 2016-09-01 2016-10-01 30 1
6 group1 6 2016-10-01 2017-02-01 123 0
7 group1 7 2017-02-01 2017-03-01 28 1
8 group1 8 2017-03-01 2018-01-01 306 0
9 group2 1 2016-01-01 2016-04-01 91 1
10 group2 2 2016-04-01 2017-01-01 275 0
One caveat: using complete() with mon=1:12 will always make the last episode stretch to the end of that year. The solution is to insert a filter() on yr and mon right after complete(), dropping the filled-in months that fall after the last observed event.
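A hedged sketch of one way to do that (the last_event helper and the df2.trimmed name are mine, not part of the answer above):
last_event <- df1.1 %>%
  group_by(id) %>%
  summarize(last_month = ymd(paste(year(max(event.date)), month(max(event.date)), 1, sep = "-")))
df2.trimmed <- group_by(df1.1, id, yr, mon) %>%
  summarize(events = n()) %>%
  complete(id, yr, mon = 1:12, fill = list(events = 0)) %>%
  mutate(episode.date = ymd(paste(yr, mon, 1, sep = "-"))) %>%
  left_join(last_event, by = "id") %>%
  filter(episode.date <= last_month)
# ... then continue with the lag()/cumsum() episode logic shown above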
The advantage of keeping t0 and t1 as Date-time objects is that they work correctly across year boundaries, which using month numbers won't.
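If the integer month indices of the desired df2 are preferred, they can still be derived from these dates afterwards. A minimal sketch using lubridate's interval(); treating each id's first month as its origin is an assumption:
df2 %>%
  group_by(id) %>%
  mutate(origin = min(t0),
         t0 = interval(origin, t0) %/% months(1) + 1,  # 1-based start month index
         t1 = interval(origin, t1) %/% months(1)) %>%  # end month index
  select(-origin)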
Session information:
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] lubridate_1.3.3 dplyr_0.5.0 purrr_0.2.2
[4] readr_0.2.2 tidyr_0.6.0 tibble_1.2
[7] ggplot2_2.2.0 tidyverse_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 knitr_1.15.1 magrittr_1.5
[4] munsell_0.4.2 colorspace_1.2-6 R6_2.1.3
[7] stringr_1.1.0 highr_0.6 plyr_1.8.4
[10] tools_3.3.2 grid_3.3.2 gtable_0.2.0
[13] DBI_0.5 lazyeval_0.2.0 assertthat_0.1
[16] digest_0.6.10 memoise_1.0.0 evaluate_0.10
[19] stringi_1.1.2 scales_0.4.1
I know that data.table vs dplyr comparisons are a perennial favourite on SO. (Full disclosure: I like and use both packages.)
However, in trying to provide some comparisons for a class that I'm teaching, I ran into something surprising w.r.t. memory usage. My expectation was that dplyr would perform especially poorly with operations that require (implicit) filtering or slicing of data. But that's not what I'm finding. Compare:
First dplyr.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DF = tibble(x = rep(1:10, times = 1e5),
            y = sample(LETTERS[1:10], 10e5, replace = TRUE),
            z = rnorm(1e6))
DF %>% filter(x > 7) %>% group_by(y) %>% summarise(mean(z))
#> # A tibble: 10 x 2
#> y `mean(z)`
#> * <chr> <dbl>
#> 1 A -0.00336
#> 2 B -0.00702
#> 3 C 0.00291
#> 4 D -0.00430
#> 5 E -0.00705
#> 6 F -0.00568
#> 7 G -0.00344
#> 8 H 0.000553
#> 9 I -0.00168
#> 10 J 0.00661
bench::bench_process_memory()
#> current max
#> 585MB 611MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Then data.table.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372
bench::bench_process_memory()
#> current max
#> 948.47MB 1.17GB
Created on 2020-04-22 by the reprex package (v0.3.0)
So, basically, data.table appears to be using nearly twice the memory that dplyr does for this simple filtering + grouping operation. Note that I'm essentially replicating a use-case that @Arun suggested here would be much more memory efficient on the data.table side. (data.table is still a lot faster, though.)
Any ideas, or am I just missing something obvious?
P.S. As an aside, comparing memory usage ends up being more complicated than it first seems because R's standard memory profiling tools (Rprofmem and co.) all ignore operations that occur outside R (e.g. calls to the C++ stack). Luckily, the bench package now provides a bench_process_memory() function that also tracks memory outside of R’s GC heap, which is why I use it here.
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.9.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.12.8 dplyr_0.8.99.9002 bench_1.1.1.9000
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 knitr_1.28 magrittr_1.5 tidyselect_1.0.0
#> [5] R6_2.4.1 rlang_0.4.5.9000 stringr_1.4.0 highr_0.8
#> [9] tools_3.6.3 xfun_0.13 htmltools_0.4.0 ellipsis_0.3.0
#> [13] yaml_2.2.1 digest_0.6.25 tibble_3.0.1 lifecycle_0.2.0
#> [17] crayon_1.3.4 purrr_0.3.4 vctrs_0.2.99.9011 glue_1.4.0
#> [21] evaluate_0.14 rmarkdown_2.1 stringi_1.4.6 compiler_3.6.3
#> [25] pillar_1.4.3 generics_0.0.2 pkgconfig_2.0.3
Created on 2020-04-22 by the reprex package (v0.3.0)
UPDATE: Following @jangorecki's suggestion, I redid the analysis using the cgmemtime shell utility. The numbers are far closer, even with multithreading enabled, and data.table now edges out dplyr w.r.t. high-water RSS+CACHE memory usage.
dplyr
$ ./cgmemtime Rscript ~/mem-comp-dplyr.R
Child user: 0.526 s
Child sys : 0.033 s
Child wall: 0.455 s
Child high-water RSS : 128952 KiB
Recursive and acc. high-water RSS+CACHE : 118516 KiB
data.table
$ ./cgmemtime Rscript ~/mem-comp-dt.R
Child user: 0.510 s
Child sys : 0.056 s
Child wall: 0.464 s
Child high-water RSS : 129032 KiB
Recursive and acc. high-water RSS+CACHE : 118320 KiB
Bottom line: Accurately measuring memory usage from within R is complicated.
I'll leave my original answer below because I think it still has value.
ORIGINAL ANSWER:
Okay, so in the process of writing this out I realised that data.table's default multi-threading behaviour appears to be the major culprit. If I re-run the latter chunk, but this time turn off multi-threading, the two results are much more comparable:
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
setDTthreads(1) ## TURN OFF MULTITHREADING
DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372
bench::bench_process_memory()
#> current max
#> 589MB 612MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Still, I'm surprised that they're this close. The data.table memory performance actually gets comparatively worse if I try a larger data set, despite using a single thread, which makes me suspicious that I'm still not measuring memory usage correctly...
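For reference, a sketch of how that larger-data check might look (the size below is an assumption, not the one I actually used):
setDTthreads(1)
n <- 1e7
DT_big <- data.table(x = rep(1:10, times = n / 10),
                     y = sample(LETTERS[1:10], n, replace = TRUE),
                     z = rnorm(n))
DT_big[x > 7, mean(z), by = y]   # same filter + group + aggregate as above
bench::bench_process_memory()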
Here is a sample of my tibble
protein patient value
<chr> <chr> <dbl>
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71
In the "patient" column the "d" as in "Case-x-d" represents the a number of days. What I would like to do is create a new column stating whether the strings in the "patient" column contain values less than 14d.
I have managed to do this using the following command:
under14 <- "-1d|-2d|-3d|-4d|-4d|-5d|-6d|-7d|-8d|-9d|-11d|-12d|-13d|-14d"
data <- data %>%
mutate(case=ifelse(grepl(under14,data$patient),'under14days','over14days'))
However, this seems extremely clunky and actually took way too long to type. I will have to change my search term many times, so I would like a quicker way to do this. Perhaps using some kind of regex is the best option, but I don't really know where to start with this.
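(At the very least, I suppose the alternation could be built programmatically so the cut-off is a single number; a rough sketch, assuming the same pattern form as my under14 above:)
threshold <- 14
under_pattern <- paste0("-", seq_len(threshold), "d", collapse = "|")  # builds "-1d|-2d|...|-14d"
data <- data %>%
  mutate(case = ifelse(grepl(under_pattern, patient), "under14days", "over14days"))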
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readxl_1.1.0 Rmisc_1.5 plyr_1.8.4 lattice_0.20-35 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5 purrr_0.2.5
[9] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 cellranger_1.1.0 pillar_1.2.3 compiler_3.5.0 bindr_0.1.1 tools_3.5.0 lubridate_1.7.4
[8] jsonlite_1.5 nlme_3.1-137 gtable_0.2.0 pkgconfig_2.0.1 rlang_0.2.1 psych_1.8.4 cli_1.0.0
[15] rstudioapi_0.7 yaml_2.1.19 parallel_3.5.0 haven_1.1.1 bindrcpp_0.2.2 xml2_1.2.0 httr_1.3.1
[22] hms_0.4.2 grid_3.5.0 tidyselect_0.2.4 glue_1.2.0 R6_2.2.2 foreign_0.8-70 modelr_0.1.2
[29] reshape2_1.4.3 magrittr_1.5 scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2
[36] utf8_1.1.4 stringi_1.2.3 lazyeval_0.2.1 munsell_0.5.0 broom_0.4.4 crayon_1.3.4
One possibility is to use tidyr::separate
library(tidyverse)
df %>%
  separate(patient, into = c("ID1", "Days", "ID2"), sep = "-", extra = "merge", remove = F) %>%
  mutate(case = ifelse(as.numeric(Days) <= 14, "under14days", "over14days")) %>%
  select(-ID1, -ID2)
# protein patient Days value case
#1 BOD1L2 RF0064_Case-9-d- 9 10.40 under14days
#2 PPFIA2 RF0064_Case-20-d- 20 7.83 over14days
#3 STAT4 RF0064_Case-11-d- 11 11.00 under14days
#4 TOM1L2 RF0064_Case-29-d- 29 13.00 over14days
#5 SH2D2A RF0064_Case-2-d- 2 8.28 under14days
#6 TIGD4 RF0064_Case-49-d- 49 9.71 over14days
Sample data
df <-read.table(text =
" protein patient value
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71 ", header = T, row.names = 1)
Since the format of the patient column is clearly defined, a possible base-R solution is to extract the days with gsub and check them against the 14-day threshold:
df$case <- ifelse(as.integer(gsub("RF0064_Case-(\\d+)-d-","\\1", df$patient)) <= 14,
"under14days", "over14days")
In exactly the same way, the OP can modify the code used inside mutate:
library(dplyr)
df <- df %>%
mutate(case = ifelse(as.integer(gsub("RF0064_Case-(\\d+)-d-","\\1", patient)) <= 14,
"under14days", "over14days"))
df
# protein patient value case
# 1 BOD1L2 RF0064_Case-9-d- 10.40 under14days
# 2 PPFIA2 RF0064_Case-20-d- 7.83 over14days
# 3 STAT4 RF0064_Case-11-d- 11.00 under14days
# 4 TOM1L2 RF0064_Case-29-d- 13.00 over14days
# 5 SH2D2A RF0064_Case-2-d- 8.28 under14days
# 6 TIGD4 RF0064_Case-49-d- 9.71 over14days
Data:
df <- read.table(text =
"protein patient value
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71",
header = TRUE, stringsAsFactors = FALSE)
We can also extract the number directly with a regular expression. (?<=-) is a lookbehind, which anchors the match to the position immediately after a "-".
library(tidyverse)
dat2 <- dat %>%
  mutate(Day = as.numeric(str_extract(patient, pattern = "(?<=-)[0-9]*"))) %>%
  mutate(case = ifelse(Day <= 14, 'under14days', 'over14days'))
dat2
# protein patient value Day case
# 1 BOD1L2 RF0064_Case-9-d- 10.40 9 under14days
# 2 PPFIA2 RF0064_Case-20-d- 7.83 20 over14days
# 3 STAT4 RF0064_Case-11-d- 11.00 11 under14days
# 4 TOM1L2 RF0064_Case-29-d- 13.00 29 over14days
# 5 SH2D2A RF0064_Case-2-d- 8.28 2 under14days
# 6 TIGD4 RF0064_Case-49-d- 9.71 49 over14days
DATA
dat <- read.table(text = " protein patient value
1 BOD1L2 'RF0064_Case-9-d-' 10.4
2 PPFIA2 'RF0064_Case-20-d-' 7.83
3 STAT4 'RF0064_Case-11-d-' 11.0
4 TOM1L2 'RF0064_Case-29-d-' 13.0
5 SH2D2A 'RF0064_Case-2-d-' 8.28
6 TIGD4 'RF0064_Case-49-d-' 9.71",
header = TRUE, stringsAsFactors = FALSE)
I have just started learning the sparklyr package using the sparklyr reference documentation.
I did what was written in the documentation.
When using the following code:
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect
Warning messages:
1: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
2: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
> delay
# A tibble: 2,961 x 4
tailnum count dist delay
<chr> <dbl> <dbl> <dbl>
1 N14228 111 1547 3.71
2 N24211 130 1330 7.70
3 N668DN 49.0 1028 2.62
4 N39463 107 1588 2.16
5 N516JB 288 1249 12.0
6 N829AS 230 228 17.2
7 N3ALAA 63.0 1078 3.59
8 N793JB 283 1529 4.72
9 N657JB 285 1286 5.03
10 N53441 102 1661 0.941
# ... with 2,951 more rows
In a similar way, I want to apply the same operations to the nycflights13::flights dataset using the dplyr package:
nycflights13::flights %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay))
# A tibble: 1,319 x 4
tailnum count dist delay
<chr> <int> <dbl> <dbl>
1 N102UW 48 536 2.94
2 N103US 46 535 - 6.93
3 N105UW 45 525 - 0.267
4 N107US 41 529 - 5.73
5 N108UW 60 534 - 1.25
6 N109UW 48 536 - 2.52
7 N110UW 40 535 2.80
8 N111US 30 536 - 0.467
9 N11206 111 1414 12.7
10 N112US 38 535 - 0.947
# ... with 1,309 more rows
My problem is: why am I getting different results?
As mentioned in the documentation, sparklyr provides a complete dplyr backend.
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 dplyr_0.7.4 sparklyr_0.7.0
loaded via a namespace (and not attached):
[1] DBI_0.7 readr_1.1.1 withr_2.1.1
[4] nycflights13_0.2.2 rprojroot_1.3-2 lattice_0.20-35
[7] foreign_0.8-69 pkgconfig_2.0.1 config_0.2
[10] utf8_1.1.3 compiler_3.4.0 stringr_1.3.0
[13] parallel_3.4.0 xtable_1.8-2 Rcpp_0.12.15
[16] cli_1.0.0 shiny_1.0.5 plyr_1.8.4
[19] httr_1.3.1 tools_3.4.0 openssl_1.0
[22] nlme_3.1-131.1 broom_0.4.3 R6_2.2.2
[25] dbplyr_1.2.1 bindr_0.1 purrr_0.2.4
[28] assertthat_0.2.0 curl_3.1 digest_0.6.15
[31] mime_0.5 stringi_1.1.6 rstudioapi_0.7
[34] reshape2_1.4.3 hms_0.4.1 backports_1.1.2
[37] htmltools_0.3.6 grid_3.4.0 glue_1.2.0
[40] httpuv_1.3.5 rlang_0.2.0 psych_1.7.8
[43] magrittr_1.5 rappdirs_0.3.1 lazyeval_0.2.1
[46] yaml_2.1.16 crayon_1.3.4 tidyr_0.8.0
[49] pillar_1.1.0 base64enc_0.1-3 mnormt_1.5-5
[52] jsonlite_1.5 tibble_1.4.2 Lahman_6.0-0
The key difference is that in the non-sparklyr (plain dplyr) version we are not using na.rm = TRUE in mean(), so any group that has an NA in 'distance' or 'arr_delay' gets an NA mean. In sparklyr, the NA values are always removed by SQL's AVG, so that argument is not needed.
We can check the NA elements in 'distance' and 'arr_delay'
nycflights13::flights %>%
summarise_at(vars(distance, arr_delay), funs(sum(is.na(.))))
# A tibble: 1 x 2
# distance arr_delay
# <int> <int>
#1 0 9430 #### number of NAs
So, if we correct for that, then the output will be the same
res <- nycflights13::flights %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  arrange(tailnum)
res
# A tibble: 2,961 x 4
# tailnum count dist delay
# <chr> <int> <dbl> <dbl>
# 1 N0EGMQ 371 676 9.98
# 2 N10156 153 758 12.7
# 3 N102UW 48 536 2.94
# 4 N103US 46 535 - 6.93
# 5 N104UW 47 535 1.80
# 6 N10575 289 520 20.7
# 7 N105UW 45 525 - 0.267
# 8 N107US 41 529 - 5.73
# 9 N108UW 60 534 - 1.25
#10 N109UW 48 536 - 2.52
# ... with 2,951 more rows
Using sparklyr
library(sparklyr)
library(dplyr)
library(nycflights13)
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  arrange(tailnum) %>%
  collect
delay
# A tibble: 2,961 x 4
# tailnum count dist delay
# <chr> <dbl> <dbl> <dbl>
# 1 N0EGMQ 371 676 9.98
# 2 N10156 153 758 12.7
# 3 N102UW 48.0 536 2.94
# 4 N103US 46.0 535 - 6.93
# 5 N104UW 47.0 535 1.80
# 6 N10575 289 520 20.7
# 7 N105UW 45.0 525 - 0.267
# 8 N107US 41.0 529 - 5.73
# 9 N108UW 60.0 534 - 1.25
#10 N109UW 48.0 536 - 2.52
# ... with 2,951 more rows
I'm not able to left_join with dplyr 0.3 when trying to use the by argument.
First, I installed v0.3 following Hadley's suggestion on GitHub:
if (packageVersion("devtools") < 1.6) {
  install.packages("devtools")
}
devtools::install_github("hadley/lazyeval")
devtools::install_github("hadley/dplyr")
sessionInfo()
# R version 3.1.1 (2014-07-10)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#
#locale:
#[1] LC_COLLATE=Portuguese_Portugal.1252 LC_CTYPE=Portuguese_Portugal.1252
#[3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
#[5] LC_TIME=Portuguese_Portugal.1252
#
#attached base packages:
#[1] stats graphics grDevices utils datasets methods base
#
#other attached packages:
#[1] dplyr_0.3
#
#loaded via a namespace (and not attached):
#[1] assertthat_0.1 DBI_0.3.1 magrittr_1.0.1 parallel_3.1.1 Rcpp_0.11.2 tools_3.1.1
Then, after loading dplyr (needed for as.tbl()), I create some data:
library(dplyr)
df1 <- as.tbl(data.frame('var1' = LETTERS[1:10], 'value1' = sample(1:100, 10), stringsAsFactors = F))
df2 <- as.tbl(data.frame('var2' = LETTERS[1:10], 'value2' = sample(1:100, 10), stringsAsFactors = F))
And finally trying to left_join
left_join(df1, df2, by = c('var1' = 'var2'))
# Error: cannot join on column 'var1'
But it works with
df2$var1 <- df2$var2
left_join(df1, df2, by = c('var1' = 'var2'))
Source: local data frame [10 x 4]
var1 value1 var2 value2
1 A 37 A 48
2 B 90 B 18
3 C 13 C 36
4 D 94 D 75
5 E 14 E 12
6 F 95 F 52
7 G 60 G 55
8 H 69 H 72
9 I 25 I 49
10 J 47 J 10
My questions
Is dplyr ignoring the by argument in the second example?
Can this be a bug?
This is related to the previous question on SO: roll data.table with rollends
Given the data...
library(data.table)
dt1 = data.table(Date = seq(from = as.Date("2013-01-03"),
                            to = as.Date("2013-06-27"), by = "1 week"),
                 key = "Date")[, ind := .I]
dt2 = data.table(Date = seq(from = as.Date("2013-01-01"),
                            to = as.Date("2013-06-30"), by = "1 day"),
                 key = "Date")
I try to carry the weekly datapoints one day forward and backward ...
dt1[dt2, roll=1][dt2, roll=-1]
...but only the first roll join (forward) seems to work and roll=-1 is ignored:
Date ind
1: 2013-01-01 NA
2: 2013-01-02 NA
3: 2013-01-03 1
4: 2013-01-04 1
5: 2013-01-05 NA
---
177: 2013-06-26 NA
178: 2013-06-27 26
179: 2013-06-28 26
180: 2013-06-29 NA
181: 2013-06-30 NA
The same effect when I reverse the order:
dt1[dt2, roll=-1][dt2, roll=1]
Date ind
1: 2013-01-01 NA
2: 2013-01-02 1
3: 2013-01-03 1
4: 2013-01-04 NA
5: 2013-01-05 NA
---
177: 2013-06-26 26
178: 2013-06-27 26
179: 2013-06-28 NA
180: 2013-06-29 NA
181: 2013-06-30 NA
I would like to achieve:
Date ind
1: 2013-01-01 NA
2: 2013-01-02 1
3: 2013-01-03 1
4: 2013-01-04 1
5: 2013-01-05 NA
---
177: 2013-06-26 26
178: 2013-06-27 26
179: 2013-06-28 26
180: 2013-06-29 NA
181: 2013-06-30 NA
EDIT:
I am using the fresh data.table version 1.8.11, note session details:
sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.11
loaded via a namespace (and not attached):
[1] plyr_1.8 reshape2_1.2.2 stringr_0.6.2 tools_3.0.0
Thanks
There is nothing left to roll after the first join: every row of dt2 already has a corresponding row in dt1[dt2, roll = .], so the second roll has nothing to fill. Instead, do the two rolls separately and combine them, e.g.:
dt1[dt2, roll = 1][, ind := ifelse(is.na(ind), dt1[dt2, roll = -1]$ind, ind)]
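The same idea written out in two explicit steps, as a minimal sketch on the data above:
fwd <- dt1[dt2, roll = 1]   # weekly values carried one day forward
bwd <- dt1[dt2, roll = -1]  # weekly values carried one day backward
fwd[, ind := ifelse(is.na(ind), bwd$ind, ind)]  # prefer the forward roll, fall back to the backward one
fwd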