I have a data.table where each row is an event with a start date and end date, but the number of days between each start and end is variable.
Therefore, I am attempting to count how many other events have already ended at the time each one begins.
I can do this using lapply, but when I try to use data.table with the by functionality I don't get the expected output. Sample code below:
library(data.table)
DT <- data.table(
start = as.Date(c("2018-07-01","2018-07-03","2018-07-06","2018-07-08","2018-07-12","2018-07-15")),
end = as.Date(c("2018-07-10","2018-07-04","2018-07-09","2018-07-20","2018-07-14","2018-07-27")),
group_id = c("a", "a", "a", "b", "b", "b"))
# This produces the expected output (0,0,1,1,3,4):
lapply(DT$start, function(x) sum(x > DT$end))
# This also works using data.table:
DT[, count := lapply(DT$start, function(x) sum(x > DT$end))]
# However, I don't get the expected output (0,0,1,0,0,1) when I attempt to do this by group_id:
DT[, count_by_group := lapply(DT$start, function(x) sum(x > DT$end)), by = group_id]
This gives the following output, where count_by_group is not the expected result:
start end group_id count count_by_group
1: 2018-07-01 2018-07-10 a 0 0
2: 2018-07-03 2018-07-04 a 0 0
3: 2018-07-06 2018-07-09 a 1 0
4: 2018-07-08 2018-07-20 b 1 0
5: 2018-07-12 2018-07-14 b 3 0
6: 2018-07-15 2018-07-27 b 4 0
Can someone help me understand how by changes the behavior? I've also tried to use different versions of the .SD feature, but wasn't able to get that to work either.
To refer to start and end by group, we need to leave the DT$ prefix out:
DT[, count_by_group := vapply(start, function(x) sum(x > end), integer(1)), by = group_id]
We use vapply() rather than lapply() because if the right-hand side of := is a list, it is interpreted as a list of columns (and since only one column is expected, only the first element, a 0, is taken into account and recycled).
unlist()
unlist() works as well:
DT[, count_by_group := unlist(lapply(start, function(x) sum(x > end))), by = group_id]
Non-equi join
Alternatively, this can also be solved by aggregating in a non-equi self join: for each row of DT (taken as i), the join counts the rows of the same group whose end lies strictly before that row's start, and by = .EACHI returns one count per row:
DT[, count_by_group := DT[DT, on = .(group_id, end < start), .N, by = .EACHI]$N]
DT
start end group_id count_by_group
1: 2018-07-01 2018-07-10 a 0
2: 2018-07-03 2018-07-04 a 0
3: 2018-07-06 2018-07-09 a 1
4: 2018-07-08 2018-07-20 b 0
5: 2018-07-12 2018-07-14 b 0
6: 2018-07-15 2018-07-27 b 1
Benchmark
The non-equi join is also the fastest method for cases with more than a few hundred rows:
library(bench)
bm <- press(
n_grp = c(2L, 5L, 10L),
n_row = 10^(2:4),
{
set.seed(1L)
DT = data.table(
group_id = sample.int(n_grp, n_row, TRUE),
start = as.Date("2018-07-01") + rpois(n_row, 20L))
DT[, end := start + rpois(n_row, 10L)]
setorder(DT, group_id, start, end)
mark(
unlist = copy(DT)[, count_by_group := unlist(lapply(start, function(x) sum(x > end))), by = group_id],
sapply = copy(DT)[, count_by_group := sapply(start, function(x) sum(x > end)), by = group_id],
vapply = copy(DT)[, count_by_group := vapply(start, function(x) sum(x > end), integer(1)), by = group_id],
nej = copy(DT)[, count_by_group := DT[DT, on = .(group_id, end < start), .N, by = .EACHI]$N]
)
}
)
ggplot2::autoplot(bm)
For 10000 rows, the non-equi join is about 10 times faster than the other methods.
As DT is being updated, copy() is used to create a fresh, unmodified copy of DT for each benchmark run.
Related
I have a data.table on which I'd like to perform the same operation on certain columns. The names of these columns are given in a character vector. In this particular example, I'd like to multiply all of these columns by -1.
Some toy data and a vector specifying relevant columns:
library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
Right now I'm doing it this way, looping over the character vector:
for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
}
Is there a way to do this directly without the for loop?
This seems to work:
dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
The result is
a b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
There are a few tricks here:
Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols" (see the sketch after this list).
.SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).
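As a quick sketch of the first point, on a throwaway table (tmp and the reassignment of cols are illustrative, not from the question):
library(data.table)
tmp <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
tmp[, cols := -1]   # no parentheses: creates one new column literally named "cols"
tmp[, (cols) := -1] # parentheses: assigns -1 to the columns a and b
names(tmp)
#> [1] "a"    "b"    "d"    "cols"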
EDIT: Here's another way that is probably faster, as #Arun mentioned:
for (j in cols) set(dt, j = j, value = -dt[[j]])
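set() is data.table's low-overhead, loopable version of :=; it skips the overhead of [.data.table on every iteration, which is why looping over the columns this way stays cheap.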
I would like to add an answer for when you would like to change the names of the columns as well. This comes in quite handy if you want to calculate the logarithm of multiple columns, which is often the case in empirical work.
cols <- c("a", "b")
out_cols = paste("log", cols, sep = ".")
dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols]
UPDATE: Following is a neat way to do it without a for loop:
dt[,(cols):= - dt[,..cols]]
It is neat and easy to read. As for performance, however, it stays behind Frank's solution, according to the microbenchmark results below.
library(microbenchmark)
mbm = microbenchmark(
base_solution = for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
},
franks_solution1 = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
franks_solution2 = for (j in cols) set(dt, j = j, value = -dt[[j]]),
hannes_solution = dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols],
orhans_solution = for (j in cols) dt[,(j):= -1 * dt[, ..j]],
orhans_solution2 = dt[,(cols):= - dt[,..cols]],
times=1000
)
mbm
Unit: microseconds
expr min lq mean median uq max neval
base_solution 3874.048 4184.4070 5205.8782 4452.5090 5127.586 69641.789 1000
franks_solution1 313.846 349.1285 448.4770 379.8970 447.384 5654.149 1000
franks_solution2 1500.306 1667.6910 2041.6134 1774.3580 1961.229 9723.070 1000
hannes_solution 326.154 405.5385 561.8263 495.1795 576.000 12432.400 1000
orhans_solution 3747.690 4008.8175 5029.8333 4299.4840 4933.739 35025.202 1000
orhans_solution2 752.000 831.5900 1061.6974 897.6405 1026.872 9913.018 1000
My Previous Answer:
The following also works
for (j in cols)
dt[,(j):= -1 * dt[, ..j]]
None of the above solutions seems to work with calculation by group. The following is the best I got:
for(col in cols)
{
DT[, (col) := scale(.SD[[col]], center = TRUE, scale = TRUE), g]
}
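If a loop-free version is preferred, the same by-group scaling can be sketched with .SD and .SDcols (an alternative under the same assumptions of numeric columns in cols and grouping column g; as.vector() is needed because scale() returns a one-column matrix):
DT[, (cols) := lapply(.SD, function(x) as.vector(scale(x))), by = g, .SDcols = cols]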
dplyr functions work on data.tables, so here's a dplyr solution that also "avoids the for-loop" :)
dt %>% mutate(across(all_of(cols), ~ -1 * .))
I benchmarked it using orhan's code (adding rows and columns), and you'll see that dplyr::mutate with across generally executes faster than most of the other solutions, though slower than the data.table solution using lapply.
library(data.table); library(dplyr)
dt <- data.table(a = 1:100000, b = 1:100000, d = 1:100000) %>%
mutate(a2 = a, a3 = a, a4 = a, a5 = a, a6 = a)
cols <- c("a", "b", "a2", "a3", "a4", "a5", "a6")
dt %>% mutate(across(all_of(cols), ~ -1 * .))
#> a b d a2 a3 a4 a5 a6
#> 1: -1 -1 1 -1 -1 -1 -1 -1
#> 2: -2 -2 2 -2 -2 -2 -2 -2
#> 3: -3 -3 3 -3 -3 -3 -3 -3
#> 4: -4 -4 4 -4 -4 -4 -4 -4
#> 5: -5 -5 5 -5 -5 -5 -5 -5
#> ---
#> 99996: -99996 -99996 99996 -99996 -99996 -99996 -99996 -99996
#> 99997: -99997 -99997 99997 -99997 -99997 -99997 -99997 -99997
#> 99998: -99998 -99998 99998 -99998 -99998 -99998 -99998 -99998
#> 99999: -99999 -99999 99999 -99999 -99999 -99999 -99999 -99999
#> 100000: -100000 -100000 100000 -100000 -100000 -100000 -100000 -100000
library(microbenchmark)
mbm = microbenchmark(
base_with_forloop = for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
},
franks_soln1_w_lapply = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
franks_soln2_w_forloop = for (j in cols) set(dt, j = j, value = -dt[[j]]),
orhans_soln_w_forloop = for (j in cols) dt[,(j):= -1 * dt[, ..j]],
orhans_soln2 = dt[,(cols):= - dt[,..cols]],
dplyr_soln = (dt %>% mutate(across(all_of(cols), ~ -1 * .))),
times=1000
)
library(ggplot2)
ggplot(mbm) +
geom_violin(aes(x = expr, y = time)) +
coord_flip()
Created on 2020-10-16 by the reprex package (v0.3.0)
To add an example of creating new columns based on a string vector of columns, building on Jfly's answer:
dt <- data.table(a = rnorm(1:100), b = rnorm(1:100), c = rnorm(1:100), g = c(rep(1:10, 10)))
col0 <- c("a", "b", "c")
col1 <- paste0("max.", col0)
for(i in seq_along(col0)) {
dt[, (col1[i]) := max(get(col0[i])), g]
}
dt[,.N, c("g", col1)]
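A possible loop-free variant of the same computation, sketched with .SDcols (assuming the dt, col0, col1, and g defined above):
dt[, (col1) := lapply(.SD, max), by = g, .SDcols = col0]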
library(data.table)
(dt <- data.table(a = 1:3, b = 1:3, d = 1:3))
which prints:
a b d
1: 1 1 1
2: 2 2 2
3: 3 3 3
Whereas (dt*(-1)) yields:
a b d
1: -1 -1 -1
2: -2 -2 -2
3: -3 -3 -3
How to do a special type of lookup join in R data.table?
Suppose there are two tables in R as under:
library(data.table)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
They look like:
> dt1
a b
1: p 1,2
2: q 1,2,3
3: r 4,5
> dt2
code desc
1: 1 good
2: 2 better
3: 3 best
4: 4 bad
5: 5 worst
The goal is to join dt1 and dt2 in such a way that the result looks like:
> result
a b desc
1: p 1,2 good,better
2: q 1,2,3 good,better,best
3: r 4,5 bad,worst
Can anyone show how this type of join can be accomplished in R?
That's not really a join, but as dt1$b contains comma-separated values anyway, here is my ugly hack:
dt2[, code := as.character(code)]
dt1[, desc := b]
for (i in seq_along(dt2$code))
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a b desc
1: p 1,2 good,better
2: q 1,2,3 good,better,best
3: r 4,5 bad,worst
Edit:
The replacement has to be done from the longest to the shortest code (by string length, i.e., number of characters), and desc must not contain any digits.
So, with setorder(dt2, -code) added to the code and the new use case provided by the OP in the comment:
dt1 <- data.table(a = c("p", "q", "r"), b = c("1,21", "23,11,36", "11,36"))
dt2 <- data.table(code = c(1,11,21,23,36), desc = c("good", "better", "best", "bad", "worst"))
setorder(dt2, -code) # set order first (descending numeric value)
dt2[, code := as.character(code)] # then convert to character
dt1[, desc := b]
for (i in seq_along(dt2$code))
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a b desc
1: p 1,21 good,best
2: q 23,11,36 bad,better,worst
3: r 11,36 better,worst
Edit 2:
According to the OP's comment, the ugly hack's requirement that desc must not contain any digits isn't fulfilled in the production data. (As almost always happens when a quick & dirty solution meets real-world data :-) )
So here is a concise data.table solution which does what all the other answers do as well: split column b, join or look up the matching desc, and recombine:
dt2[, code := as.character(code)][
dt1[, strsplit(b, ","), by = .(a, b)], on = "code==V1"][
, .(desc = paste(desc, collapse = ",")), by = .(a, b)]
Using the OP's new use case:
a b desc
1: p 1,21 good,best
2: q 23,11,36 bad,better,worst
3: r 11,36 better,worst
Note that grouping uses both columns a and b for two reasons: 1) convenience (to keep both columns in the final result), and 2) in case a is not a unique identifier.
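To see what the innermost step feeds into the join, the strsplit() part can be run on its own; with the OP's new use case it returns one row per individual code (a sketch of the intermediate result):
dt1[, strsplit(b, ","), by = .(a, b)]
#>    a        b V1
#> 1: p     1,21  1
#> 2: p     1,21 21
#> 3: q 23,11,36 23
#> 4: q 23,11,36 11
#> 5: q 23,11,36 36
#> 6: r    11,36 11
#> 7: r    11,36 36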
The idea is to get column b as a list of integers and then use it to subset column desc in dt2 (note that code happens to equal the row number here; in the general case, use match(), as below).
library(purrr)
library(stringr)
dt1[, b := map(b, ~str_split(.x, ",") %>% unlist() %>% as.integer())]
dt1[, desc := map(b, ~dt2$desc[match(.x, dt2$code)])]
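Both b and desc then become list columns; data.table prints their elements comma-separated, which matches the desired output.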
library(data.table)
library(magrittr)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
dt1 <- dt1[, list(b = unlist(strsplit(x = b, split = ","))), by = "a"] %>%
.[, b := type.convert(b)]
dt2[dt1, on = c("code == b")] %>%
.[, lapply(.SD, toString), by = "a"]
#> a code desc
#> 1: p 1, 2 good, better
#> 2: q 1, 2, 3 good, better, best
#> 3: r 4, 5 bad, worst
Created on 2021-07-27 by the reprex package (v2.0.0)
You can split the string on comma and do a join.
library(dplyr)
library(tidyr)
dt1 %>%
separate_rows(b, sep = ',\\s*', convert = TRUE) %>%
left_join(dt2, by = c('b' = 'code')) %>%
group_by(a) %>%
summarise(desc = toString(desc))
# a desc
# <chr> <chr>
#1 p good, better
#2 q good, better, best
#3 r bad, worst
So I'm new to data.table and don't understand how I can modify by reference at the same time that I perform an operation on chosen columns using the .SD symbol. I have two examples.
Example 1
> DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
> DT
group1:1 group1:2 group2:1
1: 1 1 1
Let's say, for example, I simply want to choose only the columns which contain "group1:" in the name. I know it's pretty straightforward to just reassign the result of the operation to the same object, like so:
cols1 <- names(DT)[grep("group1:", names(DT))]
DT <- DT[, .SD, .SDcols = cols1]
From reading the data.table vignette on reference semantics, my understanding is that the above does not modify by reference, whereas a similar operation using := would do so. Is this accurate? If so, is there a better way to do this operation that does modify by reference? In trying to figure this out I got stuck on how to combine the .SD symbol and the := operator. I tried
DT[, c(cols1) := .SD, .SDcols = cols1]
DT[, c(cols1) := lapply(.SD,function(x)x), .SDcols = cols1]
neither of which gave the result I wanted.
Example 2
Say I want to perform a different operation, dcast, that uses .SD as input. Example data.table:
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> DT
x y z
1: 1 A 5
2: 2 A 6
3: 1 B 7
4: 2 B 8
Again, I know I can just reassign like so:
> DT <- dcast(DT, x ~ y, value.var = "z")
> DT
x A B
1: 1 5 7
2: 2 6 8
But I don't understand why the following does not work (or whether it would be preferable in some circumstances):
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> cols <- c("x", unique(DT$y))
> DT[, cols := dcast(.SD, x ~ y, value.var = "z")]
In your example,
cols1 <- names(DT)[grep("group1:", names(DT))]
DT[, c(cols1) := .SD, .SDcols = cols1] # not this
DT[, (cols1) := .SD, .SDcols = cols1] # this will work
Below is another example, which fills NA values with 0 in the numeric columns (selected via .SDcols) by reference.
The trick is to build the vector of column names first and use it on the left of :=.
colnames = DT[, names(.SD), .SDcols = is.numeric] # column name vector
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols= is.numeric]
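For instance, on a small hypothetical table (a sketch; nafill() and function-valued .SDcols require a reasonably recent data.table):
DT <- data.table(x = c(1, NA, 3), y = c(NA, 2, NA), z = letters[1:3])
colnames = DT[, names(.SD), .SDcols = is.numeric] # column name vector
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols = is.numeric]
DT
#>    x y z
#> 1: 1 0 a
#> 2: 0 2 b
#> 3: 3 0 c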
I usually work with dplyr, but I'm facing a rather large data set and my approach is very slow. I basically need to filter a df, group it by dates, and count the occurrences within each group.
Sample data (I've already turned everything into data.tables):
library(data.table)
library(dplyr)
set.seed(123)
df <- data.table(startmonth = seq(as.Date("2014-07-01"),as.Date("2014-11-01"),by="months"),
endmonth = seq(as.Date("2014-08-01"),as.Date("2014-12-01"),by="months")-1)
df2 <- data.table(id = sample(1:10, 5, replace = T),
start = sample(seq(as.Date("2014-07-01"),as.Date("2014-10-01"),by="days"),5),
end = df$startmonth + sample(10:90,5, replace = T)
)
#cross joining
res <- setkey(df2[,c(k=1,.SD)],k)[df[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
My dplyr approach works but is slow:
res %>% filter(start <=endmonth & end>= startmonth) %>%
group_by(startmonth,endmonth) %>%
summarise(countmonth=n())
My data.table knowledge is limited, but I guess we would setkey() on the date columns and use something like res[, `:=`(COUNT = .N, IDX = 1:.N), by = .(startmonth, endmonth)] to get the counts by group, but I'm not sure how the filter goes in there.
Appreciate your help!
You could do the counting inside the join:
df2[df, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]
start end N
1: 2014-07-31 2014-07-01 1
2: 2014-08-31 2014-08-01 4
3: 2014-09-30 2014-09-01 5
4: 2014-10-31 2014-10-01 3
5: 2014-11-30 2014-11-01 3
or add it as a new column in df:
df[, n :=
df2[.SD, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]$N
]
How it works: the syntax is x[i, on=, allow.cartesian=, j, by=.EACHI]. Each row of i is used to look up values in x. The symbol .EACHI indicates that the aggregation (j = .N) will be done for each row of i.
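As a minimal sketch of the difference (reusing df and df2 from the question), j = .N alone returns a single count for the whole join result, whereas by = .EACHI returns one count per row of i:
# one overall count of all matching pairs
df2[df, on = .(start <= endmonth, end >= startmonth), allow.cartesian = TRUE, .N]
# one count per row of df
df2[df, on = .(start <= endmonth, end >= startmonth), allow.cartesian = TRUE, .N, by = .EACHI]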
data.table chaining is graceful and intuitive; everything lines up like a machine. But sometimes we have to introduce an operation like dcast or melt.
How can I integrate every operation into the []? Simply because it's more graceful, I admit.
DT <- data.table(A = rep(letters[1:3],4), B = rep(1:4,3), C = rep(c("OK", "NG"),6))
DT.1 <- DT[,.N, by = .(B,C)] %>% dcast(B~C)
DT.2 <- DT.1[,.N, by = .(NG)]
# NG N
#1: NA 2
#2: 3 2
#same
DT <- data.table(A = rep(letters[1:3],4), B = rep(1:4,3), C = rep(c("OK", "NG"),6))[,.N, by = .(B, C)] %>%
dcast(B~C) %>% .[,.N, by =.(NG)]
Can I remove the %>% and integrate into the []?
Thanks
What about using .SD to this end:
DT[, .N, by = .(B, C)
][, dcast(.SD, B ~ C)
][, .N, by = .(NG)]
NG N
1: NA 2
2: 3 2
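This works because .SD inside [] is itself a data.table holding the (subset of) data, so functions that accept a data.table, such as dcast() or melt(), can be applied to it directly within the chain.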