Data.Table rolling join by group - r

How can I find the last value, prior to test.day, for each (loc.x, loc.y) pair?
dt <- data.table(
loc.x = as.integer(c(1, 1, 3, 1, 3, 1)),
loc.y = as.integer(c(1, 2, 1, 2, 1, 2)),
time = as.IDate(c("2015-03-11", "2015-05-10", "2015-09-27",
"2015-11-25", "2014-09-13", "2015-08-19")),
value = letters[1:6]
)
setkey(dt, loc.x, loc.y, time)
test.day <- as.IDate("2015-10-01")
Required output:
loc.x loc.y value
1: 1 1 a
2: 1 2 f
3: 3 1 c

Another option is to use the last function:
dt[, last(value[time < test.day]), by = .(loc.x, loc.y)]
which gives:
loc.x loc.y V1
1: 1 1 a
2: 1 2 f
3: 3 1 c

You can first subset the rows where time < test.day (which should be quite efficient because it is not done by group) and then select the last value per group. To do that you can either use tail(value, 1L) or, as suggested by Floo0, value[.N], resulting in:
dt[time < test.day, tail(value, 1L), by = .(loc.x, loc.y)]
# loc.x loc.y V1
#1: 1 1 a
#2: 1 2 f
#3: 3 1 c
or
dt[time < test.day, value[.N], by = .(loc.x, loc.y)]
Note that this works because the data is sorted due to setkey(dt, loc.x, loc.y, time).

Here's another option using a rolling join after creating a lookup table
indx <- data.table(unique(dt[ ,.(loc.x, loc.y)]), time = test.day)
dt[indx, roll = TRUE, on = names(indx)]
# loc.x loc.y time value
# 1: 1 1 2015-10-01 a
# 2: 1 2 2015-10-01 f
# 3: 3 1 2015-10-01 c
Or a very similar option suggested by #eddi
dt[dt[, .(time = test.day), by = .(loc.x, loc.y)], roll = T, on = c('loc.x', 'loc.y', 'time')]
Or a one liner which will be less efficient as it will call [.data.table by group
dt[,
.SD[data.table(test.day), value, roll = TRUE, on = c(time = "test.day")],
by = .(loc.x, loc.y)
]
# loc.x loc.y V1
# 1: 1 1 a
# 2: 1 2 f
# 3: 3 1 c

Related

How to join data to only the first matching row with {data.table} in R

I have a look-up table of "firsts" in column d. For example, the first time the patient was admitted because of a specific disease. I would like to join this back into the main data frame via data.table on multiple other conditions.
My problem is that, unfortunately, the main data.table could have multiple records with identical joining criteria that results in multiple "firsts" per patient after the join. Real world data is messy, people!
Is it possible to do a {data.table} join on only the first matching record?
This is similar to this question, but the multiple-matches are on the main data table. I think that mult only works on when there are several entries on the table being joined in.
repex:
library(data.table)
set.seed(1724)
d1 <- data.table(a = c(1, 1, 1),
b = c(1, 1, 2),
c = sample(1:10, 3))
d2 <- data.table(a = 1, b = 1, d = TRUE)
d2[d1, on = c("a", "b")]
a b d c
1: 1 1 TRUE 4
2: 1 1 TRUE 8
3: 1 2 NA 2
desired output
a b d c
1: 1 1 TRUE 4
2: 1 1 NA 8
3: 1 2 NA 2
library(data.table)
set.seed(1724)
d1 = data.table(a = c(1, 1, 1), b = c(1, 1, 2), c = sample(1:10, 3))
d2 = data.table(a = 1, b = 1, d = TRUE)
d1[, i1:=seq_len(.N), by=c("a","b")]
d2[, i2:=seq_len(.N), by=c("a","b")]
d2[d1, on = c("a","b","i2==i1")][, "i2":=NULL][]
# a b d c
# <num> <num> <lgcl> <int>
#1: 1 1 TRUE 4
#2: 1 1 NA 8
#3: 1 2 NA 2
One way would be to turn the values to NA after join.
library(data.table)
d3 <- d2[d1, on = c("a", "b")]
d3[, d:= replace(d, seq_len(.N) != 1, NA), .(a, b)]
d3
# a b d c
#1: 1 1 TRUE 4
#2: 1 1 NA 8
#3: 1 2 NA 2
The easy solution would be to index every row and join on this also (the d2 is a filtered version of d1):
library(data.table)
set.seed(1724)
d1 <- data.table(a = c(1, 1, 1),
b = c(1, 1, 2),
c = sample(1:10, 3))
d1[, rid := seq(to = .N)]
d2 <- d1[, .SD[1], by = c("a"), .SDcols = c("b", "rid")][, d := TRUE] # UPDATE
d2[d1, on = c("a", "b", "rid")]

Nearest "n" rolling join in R data table

With data.table, we can join a value in one data set with the nearest value in another by using roll = "nearest". Some example data:
dt1 <- data.table(x = c(15,101), id1 = c("x", "y"))
dt2 <- data.table(x = c(10,50,100,200), id2 = c("a","b","c","d"))
Using roll = "nearest", I can join each 'x' in 'dt1' with the 'x' in dt2 which is nearest:
dt2[dt1, roll = "nearest", on = "x"]
# x id2 id1
# 1: 15 a x
# 2: 101 c y
E.g. for x = 15 in 'dt1', the nearest x value in 'dt2' is x = 10, and we get the corresponding 'id2' which is "a".
But what if instead of getting one nearest value, I want to get n nearest values? For example, if I want the 2 nearest x values, the result would be:
x id2 id1 roll
1: 15 a x nr1
2: 15 b x nr2
3: 101 c y nr1
4: 101 b y nr2
("nr" stands for "nearest")
I want a general approach that I can apply to any "n" (e.g. 2 nearest points, 3 nearest points, etc).
EDIT
I wonder if it is possible to also apply this to multi columns join where the join will match on the preceding column before getting the nearest on the last join column. For example:
dt1 <- data.table(group=c(1,2), x=(c(15,101)), id1=c("x","y"))
dt2 <- data.table(group=c(1,2,2,3), x=c(10,50,100,200),id2=c("a","b","c","d"))
If I join on=c("group","x"), the join will first match on "group" and then get the nearest on "x", so I would expect the result to be something like:
x group id2 id1 roll
1: 15 1 a x nr1
2: 101 2 c y nr1
3: 101 2 b y nr2
Here is something very raw (we go row by row):
n <- 2L
sen <- 1L:n
for (i in 1:nrow(dt1)) {
set(dt1, i, j = "nearest", list(which(frank(abs(dt1$x[i] - dt2$x)) %in% sen)))
}
dt1[, .(id1, nearest = unlist(nearest)), by = x
][, id2 := dt2$id2[nearest]
][, roll := paste0("nr", frank(abs(dt2$x[nearest] - x))), by = x][]
# x id1 nearest id2 roll
# 1: 15 x 1 a nr1
# 2: 15 x 2 b nr2
# 3: 101 y 2 b nr2
# 4: 101 y 3 c nr1
Slightly cleaner:
dt1[,
{
nrank <- frank(abs(x - dt2$x), ties.method="first")
nearest <- which(nrank %in% sen)
.(x = x, id2 = dt2$id2[nearest], roll = paste0("nr", nrank[nearest]))
},
by = id1] # assumes unique ids.
Data:
dt1 <- data.table(x = c(15, 101), id1 = c("x", "y"))
dt2 <- data.table(x = c(10, 50, 100, 200), id2 = c("a", "b", "c", "d"))
EDIT (as suggested/written by OP)
Joining with multiple keys:
dt1[,
{
g <- group
dt_tmp <- dt2[dt2$group == g]
nrank <- frank(abs(x - dt_tmp$x), ties.method="first")
nearest <- which(nrank %in% sen)
.(x = x, id2 = dt_tmp$id2[nearest], roll = paste0("nr", nrank[nearest]))
},
by = id1]
Edited for corrected ordering.
I don't know that roll= is going to allow nearest-n, but here's a possible workaround:
dt1[, id2 := lapply(x, function(z) { r <- head(order(abs(z - dt2$x)), n = 2); dt2[ r, .(id2, nr = order(r)) ]; }) ]
as.data.table(tidyr::unnest(dt1, id2))
# x id1 id2 nr
# 1: 15 x a 1
# 2: 15 x b 2
# 3: 101 y c 2
# 4: 101 y b 1
(I'm using tidyr::unnest because I think it fits and works well here, and data.table/#3672 is still open.)
Second batch of data:
dt1 = data.table(x = c(1, 5, 7), id1 = c("x", "y", "z"))
dt2 = data.table(x = c(2, 5, 6, 10), id2 = c(2, 5, 6, 10))
dt1[, id2 := lapply(x, function(z) { r <- head(order(abs(z - dt2$x)), n = 2); dt2[ r, .(id2, nr = order(r)) ]; }) ]
as.data.table(tidyr::unnest(dt1, id2))
# x id1 id2 nr
# 1: 1 x 2 1
# 2: 1 x 5 2
# 3: 5 y 5 1
# 4: 5 y 6 2
# 5: 7 z 6 2
# 6: 7 z 5 1
Here is another option using rolling join without an additional grouping key (an improvement on my initial naive cross join idea):
#for differentiating rows from both data.tables
dt1[, ID := .I]
dt2[, rn := .I]
#perform rolling join to find closest and
#then retrieve the +-n rows around that index from dt2
n <- 2L
adjacent <- dt2[dt1, on=.(x), roll="nearest", nomatch=0L, by=.EACHI,
c(.(ID=ID, id1=i.id1, val=i.x), dt2[unique(pmin(pmax(0L, seq(x.rn-n, x.rn+n, by=1L)), .N))])][,
(1L) := NULL]
#extract nth nearest
adjacent[order(abs(val-x)), head(.SD, n), keyby=ID]
output:
ID id1 val x id2 rn
1: 1 x 15 10 a 1
2: 1 x 15 50 b 2
3: 2 y 101 100 c 3
4: 2 y 101 50 b 2
And using Henrik's dataset:
dt1 = data.table(x = c(1, 5, 7), id1 = c("x", "y", "z"))
dt2 = data.table(x = c(2, 5, 6, 10), id2 = c(2, 5, 6, 10))
output:
ID id1 val x id2 rn
1: 1 x 1 2 2 1
2: 1 x 1 5 5 2
3: 2 y 5 5 5 2
4: 2 y 5 6 6 3
5: 3 z 7 6 6 3
6: 3 z 7 5 5 2
And also Henrik's 2nd dataset:
dt1 = data.table(x = 3L, id1="x")
dt2 = data.table(x = 1:2, id2=c("a","b"))
output:
ID id1 val x id2 rn
1: 1 x 3 2 b 2
2: 1 x 3 1 a 1
And also joining on an additional grouping key:
dt2[, rn := .I]
#perform rolling join to find closest and
#then retrieve the +-n rows around that index from dt2
n <- 2L
adjacent <- dt2[dt1, on=.(group, x), roll="nearest", by=.EACHI, {
xrn <- unique(pmax(0L, seq(x.rn-n, x.rn+n, by=1L)), .N)
c(.(id1=id1, x1=i.x),
dt2[.(group=i.group, rn=xrn), on=.(group, rn), nomatch=0L])
}][, (1L:2L) := NULL]
#extract nth nearest
adjacent[order(abs(x1-x)), head(.SD, 2L), keyby=id1] #use id1 to identify rows if its unique, otherwise create ID column like prev section
output:
id1 x1 group x id2 rn
1: x 15 1 10 a 1
2: y 101 2 100 c 3
3: y 101 2 50 b 2
data:
library(data.table)
dt1 <- data.table(group=c(1,2), x=(c(15,101)), id1=c("x","y"))
dt2 <- data.table(group=c(1,2,2,3), x=c(10,50,100,200), id2=c("a","b","c","d"))
A k nearest neighbour alternative using nabor::knn:
library(nabor)
k = 2L
dt1[ , {
kn = knn(dt2$x2, x, k)
c(.SD[rep(seq.int(.N), k)],
dt2[as.vector(kn$nn.idx),
.(x2 = x, id2, nr = rep(seq.int(k), each = dt1[ ,.N]))])
}]
# x id1 x2 id2 nr
# 1: 15 x 10 a 1
# 2: 101 y 100 c 1
# 3: 15 x 50 b 2
# 4: 101 y 50 b 2
In common with the answers by #sindri_baldur and #r2evans, an actual join (on = ) is not performed, we "only" do something in j.
Timings
On data of rather modest size (nrow(dt1): 1000; nrow(dt2): 10000), knn seems faster:
# Unit: milliseconds
# expr min lq mean median uq max neval
# henrik 8.09383 10.19823 10.54504 10.2835 11.00029 13.72737 20
# chinsoon 2140.48116 2154.15559 2176.94620 2171.5824 2192.54536 2254.20244 20
# r2evans 4496.68625 4562.03011 4677.35214 4680.0699 4751.35237 4935.10655 20
# sindri 4194.93867 4397.76060 4406.29278 4402.7913 4432.76463 4490.82789 20
I also tried one evaluation on 10 times larger data, and the differences were then even more pronounced.
Code for the timing:
v = 1:1e7
n1 = 10^3
n2 = n1 * 10
set.seed(1)
dt1_0 = data.table(x = sample(v, n1))
dt2_0 = data.table(x = sample(v, n2))
setorder(dt1_0, x)
setorder(dt2_0, x)
# unique row id
dt1_0[ , id1 := 1:.N]
# To make it easier to see which `x` values are joined in `dt1` and `dt2`
dt2_0[ , id2 := x]
bm = microbenchmark(
henrik = {
dt1 = copy(dt1_0)
dt2 = copy(dt2_0)
k = 2L
d_henrik = dt1[ , {
kn = knn(dt2$x, x, k)
c(.SD[as.vector(row(kn$nn.idx))],
dt2[as.vector(kn$nn.idx),
.(id2, nr = as.vector(col(kn$nn.idx)))])
}]
},
chinsoon = {
dt1 = copy(dt1_0)
dt2 = copy(dt2_0)
dt1[, ID := .I]
dt2[, rn := .I]
n <- 2L
adjacent <- dt2[dt1, on=.(x), roll="nearest", nomatch=0L, by=.EACHI,
c(.(ID=ID, id1=i.id1, val=i.x),
dt2[unique(pmin(pmax(0L, seq(x.rn-n, x.rn+n, by=1L)), .N))])][,(1L) := NULL]
d_chinsoon = adjacent[order(abs(val-x)), head(.SD, n), keyby=ID]
},
r2evans = {
dt1 = copy(dt1_0)
dt2 = copy(dt2_0)
dt1[, id2 := lapply(x, function(z) { r <- head(order(abs(z - dt2$x)), n = 2); dt2[ r, .(id2, nr = order(r)) ]; }) ]
d_r2evans = as.data.table(tidyr::unnest(dt1, id2))
},
sindri = {
dt1 = copy(dt1_0)
dt2 = copy(dt2_0)
n <- 2L
sen <- 1:n
d_sindri = dt1[ ,
{
nrank <- frank(abs(x - dt2$x), ties.method="first")
nearest <- which(nrank %in% sen)
.(x = x, id2 = dt2$id2[nearest], roll = paste0("nr", nrank[nearest]))
}, by = id1]
}
, times = 20L)
# Unit: milliseconds
# expr min lq mean median uq max neval
# henrik 8.09383 10.19823 10.54504 10.2835 11.00029 13.72737 20
# chinsoon 2140.48116 2154.15559 2176.94620 2171.5824 2192.54536 2254.20244 20
# r2evans 4496.68625 4562.03011 4677.35214 4680.0699 4751.35237 4935.10655 20
# sindri 4194.93867 4397.76060 4406.29278 4402.7913 4432.76463 4490.82789 20
Check for equality, after some sorting:
setorder(d_henrik, x)
all.equal(d_henrik$id2, d_chinsoon$id2)
# TRUE
all.equal(d_henrik$id2, d_r2evans$id2)
# TRUE
setorder(d_sindri, x, roll)
all.equal(d_henrik$id2, d_sindri$id2)
# TRUE
Additional grouping variable
A quick and dirty work-around for an additional join variable; the knn is done by group:
d1 = data.table(g = 1:2, x = c(1, 5))
d2 = data.table(g = c(1L, 1L, 2L, 2L, 2L, 3L),
x = c(2, 5, 2, 3, 6, 10))
d1
# g x
# 1: 1 4
# 2: 2 4
d2
# g x
# 1: 1 2
# 2: 1 4 # nr 1
# 3: 1 5 # nr 2
# 4: 2 0
# 5: 2 1 # nr 2
# 6: 2 6 # nr 1
# 7: 3 10
d1[ , {
gg = g
kn = knn(d2[g == gg, x], x, k)
c(.SD[rep(seq.int(.N), k)],
d2[g == gg][as.vector(kn$nn.idx),
.(x2 = x, nr = rep(seq.int(k), each = d1[g == gg, .N]))])
}, by = g]
# g x x2 nr
# 1: 1 4 4 1
# 2: 1 4 5 2
# 3: 2 4 6 1
# 4: 2 4 1 2
You can use the package distances to get n nearest neighbours:
library(distances)
n <- 2
x <- nearest_neighbor_search(distances(c(dt2$x, dt1$x)), n
, nrow(dt2) + seq_len(nrow(dt1)), seq_len(nrow(dt2)))
x[] <- dt2$id2[x]
x <- t(x)
colnames(x) <- paste0("nr", seq_len(n))
cbind(dt1, x)
# x id1 nr1 nr2
#1: 15 x a b
#2: 101 y c b

Ranking multiple columns by different orders using data table

Using my example below, how can I rank multiple columns using different orders, so for example rank y as descending and z as ascending?
require(data.table)
dt <- data.table(x = c(rep("a", 5), rep("b", 5)),
y = abs(rnorm(10)) * 10, z = abs(rnorm(10)) * 10)
cols <- c("y", "z")
dt[, paste0("rank_", cols) := lapply(.SD, function(x) frankv(x, ties.method = "min")), .SDcols = cols, by = .(x)]
data.table's frank() function has some useful features which aren't available in base R's rank() function (see ?frank). E.g., we can reverse the order of the ranking by prepending the variable with a minus sign:
library(data.table)
# create reproducible data
set.seed(1L)
dt <- data.table(x = c(rep("a", 5), rep("b", 5)),
y = abs(rnorm(10)) * 10, z = abs(rnorm(10)) * 10)
# rank y descending, z ascending
dt[, rank_y := frank(-y), x][, rank_z := frank(z), x][]
x y z rank_y rank_z
1: a 6.264538 15.1178117 3 4
2: a 1.836433 3.8984324 5 1
3: a 8.356286 6.2124058 2 2
4: a 15.952808 22.1469989 1 5
5: a 3.295078 11.2493092 4 3
6: b 8.204684 0.4493361 1 2
7: b 4.874291 0.1619026 4 1
8: b 7.383247 9.4383621 2 5
9: b 5.757814 8.2122120 3 4
10: b 3.053884 5.9390132 5 3
If there are many columns which are to be ranked individually, some descending, some ascending, we can do this in two steps
# first rank all columns in descending order
cols_desc <- c("y")
dt[, paste0("rank_", cols_desc) := lapply(.SD, frankv, ties.method = "min", order = -1L),
.SDcols = cols_desc, by = x][]
# then rank all columns in ascending order
cols_asc <- c("z")
dt[, paste0("rank_", cols_asc) := lapply(.SD, frankv, ties.method = "min", order = +1L),
.SDcols = cols_asc, by = x][]
x y z rank_y rank_z
1: a 6.264538 15.1178117 3 4
2: a 1.836433 3.8984324 5 1
3: a 8.356286 6.2124058 2 2
4: a 15.952808 22.1469989 1 5
5: a 3.295078 11.2493092 4 3
6: b 8.204684 0.4493361 1 2
7: b 4.874291 0.1619026 4 1
8: b 7.383247 9.4383621 2 5
9: b 5.757814 8.2122120 3 4
10: b 3.053884 5.9390132 5 3

Get number of same individuals for different groups

I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to know the sum of identical individuals for 2 groups.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.
Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanantion
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
We self-join with the same dataset on 'ID', subset the rows where the 'Group' columns are different, get the nrows (.N), grouped by the 'Group' columns, sort the 'Group.1' and 'Group.2' columns by row using pmin/pmax and get the unique value of 'N'.
library(data.table)#v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or as mentioned in the comments by #MichaelChirico and #Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group', 'i.Group' and get the nrow (.N)
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]
Great answers above.
Just an alternative using dplyr in case you, or someone else, is interested.
library(dplyr)
cmb = combn(unique(dt$Group),2)
data.frame(g1 = cmb[1,],
g2 = cmb[2,]) %>%
group_by(g1,g2) %>%
summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
DT[DT$Group==g2,]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3
yet another solution (base R):
tmp <- split(DT, DT[, 'Group'])
ans <- apply(combn(LETTERS[1 : 3], 2), 2, FUN = function(ind){
out <- length(intersect(tmp[[ind[1]]][, 1], tmp[[ind[2]]][, 1]))
c(group1 = ind[1], group2 = ind[2], sum_ = out)
}
)
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
first split data into list of groups, then for each unique pairwise combinations of two groups see how many subjects in common they have, using length(intersect(....

data.table aggregation with rolling subset on date

I have a set of data along these lines
d1 <- data.frame(
cat1 = sample(c('a', 'b', 'c'), 100, replace = TRUE),
date = rep(Sys.Date() - sample(1:100)),
val = rnorm(100, 50, 5)
)
require(data.table)
d2 <- data.table(d1)
I can get a daily sum without problem
d2[ , list(.N, sum(val)), by = c("cat1", "date")]
I want to get a sum over 2 days (and then 7 days)
This works:
d.list <- sort(unique(d2$date))
o.list <- list()
for(i in seq_along(d.list)){
o.list[[i]] <- d2[d2$date >= d.list[i] - 1 & d2$date <= d.list[i], list(.N, sum(val), max(date)), by = c("cat1")]
}
do.call(rbind, o.list)
But slows down on a bigger data set, and doesn't seem to be the best use of data.table.
Is there a more efficient way?
This is a bit faster:
First we join for exact matches and obtain the last index (in case of multiple matches)
setkey(d2, cat1, date)
tmp1 = d2[unique(d2, by=key(d2)), which=TRUE, mult="last", allow.cartesian=TRUE]
Then, we construct a copy of d2 and change date to date-1 by reference. Then, we perform a join with roll=-Inf - which is next observation carried backwards. In other words, if there's no exact match, it'll fill the next available value.
d3 = copy(d2)[, date := date-1]
setkey(d3, cat1, date)
tmp2 = d2[unique(d3, by=key(d2)), roll=-Inf, which=TRUE, allow.cartesian=TRUE]
From here, we put together the indices:
idx1 = tmp1-tmp2+1L
idx2 = data.table:::vecseq(tmp2, idx1, sum(idx1))
Subset d2 from idx2 and generate unique ids from idx1:
ans1 = d2[idx2][, grp := rep(seq_along(idx1), idx1)]
Finally aggregate by grp and get the desired result:
ans1 = ans1[, list(cat1=cat1[1L], date=date[.N],
N = .N, val=sum(val)), by=grp][, grp:=NULL]
> head(ans1, 10L)
# cat1 date N val
# 1: a 2014-01-20 1 47.69178
# 2: a 2014-01-25 1 52.01006
# 3: a 2014-02-01 1 46.82132
# 4: a 2014-02-06 1 44.62404
# 5: a 2014-02-11 1 49.63218
# 6: a 2014-02-14 1 48.80676
# 7: a 2014-02-22 1 49.27800
# 8: a 2014-02-23 2 96.17617
# 9: a 2014-02-26 1 49.20623
# 10: a 2014-02-28 1 46.72708
The results are identical as in your solution. This one took 0.02 seconds on my laptop, where as yours took 0.58 seconds.
For 7 days, just change:
d3 = copy(d2)[, date := date-1]
to
d3 = copy(d2)[, date := date-6]
It's very poorly explained in the OP what you want, but this seems to be it:
# generate the [date-1,date] sequences for each date
# adjust length.out to suit your needs
dates = d2[, list(date.seq = seq(date, by = -1, length.out = 2)), by = date]
setkey(dates, date.seq)
setkey(d2, date)
# merge and extract info needed
dates[d2][, list(.N, sum(val), date.seq[.N]), by = list(date, cat1)][, !"date"]
# cat1 N V2 V3
# 1: a 1 38.95774 2014-01-21
# 2: a 1 38.95774 2014-01-21
# 3: c 1 55.68445 2014-01-22
# 4: c 2 102.20806 2014-01-23
# 5: c 1 46.52361 2014-01-23
# ---
#164: c 1 50.17986 2014-04-27
#165: b 1 51.43489 2014-04-28
#166: b 2 100.91982 2014-04-29
#167: b 1 49.48493 2014-04-29
#168: c 1 54.93311 2014-04-30
Would it be possible to set up a binned date, and then do by on that?
d2$day7 <- as.integer(d2$date) %/% 7
d2[ , list(.N, sum(val)), by = c("cat1", "day7")]
That would give a binned value - if you want a sliding 7 day window, I'd need to think again. Also, for a binned approach, you might need to subtract an offset before doing the %/% if you want to chose the day of the week the groups start at.

Resources