For each column, sum scores by group over prior window of time - r

I have a large panel dataset (10,000,000 x 53) with about 50 columns of scores. I have aggregated each score by group (there are about 15,000) and date.
Now I want to calculate a rolling sum of three values including the prior two dates' and the current date's scores, creating a new corresponding sum column.
The sums should be calculated for each score column by date and group.
For the 1st and 2nd dates within a group, fewer than 3 values are allowed.
     GROUP DATE       LAGGED     SCORE1 SUM1    SCORE2 SUM2    ... SCORE50 SUM50
#1   A     2017-04-01 2017-03-30 1      1|1     2      2|2         4       4|4
#2   A     2017-04-02 2017-03-31 1      1+1|2   3      3+2|5       3       3+4|7
#3   A     2017-04-04 2017-04-02 2      2+1+1|4 4      4+3+2|9     2       2+3+4|9
#5   B     2017-04-02 2017-03-31 2      2|2     3      3|3         1       1|1
#6   B     2017-04-05 2017-04-03 2      2+2|4   2      2+3|5       1       1+1|2
#7   B     2017-04-08 2017-04-06 3      3+2+2|7 1      1+2+3|6     3       3+1+1|5
#8   C     2017-04-02 2017-03-31 3      3|3     1      1|1         1       1|1
#9   C     2017-04-03 2017-04-01 2      2+3|5   3      3+1|4       2       2+1|3
:    :     :          :          :      :       :      :           :       :
#10M XX    2018-03-30 2018-03-28 2      2       1      1       ... 1       1
David's answer from this post covered most of my questions on summing rolling windows by groups but I'm still missing a couple pieces.
library(data.table) #v1.10.4
## Convert to a proper date class, and add another column
## in order to define the range
setDT(input)[, c("Date", "Date2") := {
Date = as.IDate(Date)
Date2 = Date - 2L
.(Date, Date2)
}]
## Run a non-equi join against the unique Date/Group combination in input
## Sum the Scores on the fly
## You can ignore the second Date column
input[unique(input, by = c("Date", "Group")), ## This removes the dupes
on = .(Group, Date <= Date, Date >= Date2), ## The join condition
.(Score = sum(Score)), ## sum the scores
keyby = .EACHI] ## Run the sum by each row in
## unique(input, by = c("Date", "Group"))
My question has two parts:
What code should replace "Score" to calculate time window sums for each column in a range of columns?
Is the solution provided the most efficient version for fast calculation on a large dataset?

A possible solution:
cols <- grep('^SCORE', names(input), value = TRUE)
input[, gsub('SCORE','SUM',cols) := lapply(.SD, cumsum)
, by = GROUP
, .SDcols = cols][]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2
1: A 2017-04-01 2017-03-30 1 2 1 2
2: A 2017-04-02 2017-03-31 1 3 2 5
3: A 2017-04-04 2017-04-02 2 4 4 9
4: B 2017-04-02 2017-03-31 2 3 2 3
5: B 2017-04-05 2017-04-03 2 2 4 5
6: B 2017-04-08 2017-04-06 3 1 7 6
7: C 2017-04-02 2017-03-31 3 1 3 1
8: C 2017-04-03 2017-04-01 2 3 5 4
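Note that cumsum reproduces the requested three-value sum here only because no group in the sample has more than three rows. Below is a minimal sketch of a strict three-row window, assuming the rows are already sorted by date within each group. The table demo, its four-row group A, and the helper roll3 are made up for illustration and are not part of the original answer.

```r
library(data.table)

# Hypothetical sample: group A has four rows, so cumsum and a 3-row window differ
demo <- data.table(
  GROUP  = c("A", "A", "A", "A", "B", "B"),
  SCORE1 = c(1, 1, 2, 5, 2, 2),
  SCORE2 = c(2, 3, 4, 1, 3, 2)
)

cols <- grep("^SCORE", names(demo), value = TRUE)

# Current value plus the two previous ones; missing lags count as 0,
# so the first and second rows of a group use fewer than three values
roll3 <- function(x) Reduce(`+`, shift(x, 0:2, fill = 0))

demo[, (sub("SCORE", "SUM", cols)) := lapply(.SD, roll3),
     by = GROUP, .SDcols = cols]
print(demo)
```

In the fourth row of group A, SUM1 is 5 + 2 + 1 = 8, whereas cumsum would give 9.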
When you want to take a time window into account as well, you could do (assuming LAGGED is the start of the time-window):
input[input[input[, .(GROUP, DATE, LAGGED)]
, on = .(GROUP, DATE >= LAGGED, DATE <= DATE)
][, setNames(lapply(.SD, sum), gsub('SCORE','SUM',cols))
, by = .(GROUP, DATE = DATE.1)
, .SDcols = cols]
, on = .(GROUP, DATE)]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2
1: A 2017-04-01 2017-03-30 1 2 1 2
2: A 2017-04-02 2017-03-31 1 3 2 5
3: A 2017-04-04 2017-04-02 2 4 3 7
4: B 2017-04-02 2017-03-31 2 3 2 3
5: B 2017-04-05 2017-04-03 2 2 2 2
6: B 2017-04-08 2017-04-06 3 1 3 1
7: C 2017-04-02 2017-03-31 3 1 3 1
8: C 2017-04-03 2017-04-01 2 3 5 4
Used data:
input <- fread(' GROUP DATE LAGGED SCORE1 SCORE2
A 2017-04-01 2017-03-30 1 2
A 2017-04-02 2017-03-31 1 3
A 2017-04-04 2017-04-02 2 4
B 2017-04-02 2017-03-31 2 3
B 2017-04-05 2017-04-03 2 2
B 2017-04-08 2017-04-06 3 1
C 2017-04-02 2017-03-31 3 1
C 2017-04-03 2017-04-01 2 3')

Related

R conditional count of unique value over date range/window

In R, how can you count the number of observations fulfilling a condition over a time range?
Specifically, I want to count the number of distinct ids by country over the last 8 months, but only if an id occurs at least twice during these 8 months. Hence, for the count, it does not matter whether an id occurs 2x or 100x (doing this in 2 steps is maybe easier). NA exists in both id and country. Since this could otherwise be taken care of, accounting for it is not necessary but still helpful.
My current best attempt is below. It does not account for the restriction (an id must appear at least twice in the previous 8 months), and I also find its counting odd for date "2017-12-12", where desired_unrestricted should equal 4 by my count but the code gives 2.
dt[, date := as.Date(date)][
, totalids := sapply(date,
function(x) length(unique(id[between(date, x - lubridate::month(8), x)]))),
by = country]
Data
library(data.table)
library(lubridate)
ID <- c("1","1","1","1","1","1","2","2","2","3","3",NA,"4")
Date <- c("2017-01-01","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2018-05-02","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2017-05-01","2017-12-12","2017-12-12" )
Value <- c(2,4,3,5,2,5,8,17,17,3,7,5,3)
Country <- c("UK","UK","US","US",NA,"US","UK","UK","US","US","US","US","US")
Desired <- c(1,1,0,2,NA,0,1,2,2,2,2,1,1)
Desired_unrestricted <- c(2,2,1,3,NA,1,2,2,3,3,3,4,4)
dt <- data.frame(id=ID, date=Date, value=Value, country=Country, desired_output=Desired, desired_unrestricted=Desired_unrestricted)
setDT(dt)
Thanks in advance.
This data.table-only answer is motivated by a comment:
dt[, date := as.Date(date)] # if not already `Date`-class
dt[, date8 := do.call(c, lapply(dt$date, function(z) seq(z, length=2, by="-8 months")[2]))
][, results := dt[dt, on = .(country, date > date8, date <= date),
length(Filter(function(z) z > 1, table(id))), by = .EACHI]$V1
][, date8 := NULL ]
# id date value country desired_output desired_unrestricted results
# <char> <Date> <num> <char> <num> <num> <int>
# 1: 1 2017-01-01 2 UK 1 2 1
# 2: 1 2017-01-01 4 UK 1 2 1
# 3: 1 2017-01-05 3 US 0 1 0
# 4: 1 2017-05-01 5 US 1 3 2
# 5: 1 2017-05-01 2 <NA> NA NA 0
# 6: 1 2018-05-02 5 US 0 1 0
# 7: 2 2017-01-01 8 UK 1 2 1
# 8: 2 2017-01-05 17 UK 2 2 2
# 9: 2 2017-05-01 17 US 1 3 2
# 10: 3 2017-05-01 3 US 2 3 2
# 11: 3 2017-05-01 7 US 2 3 2
# 12: <NA> 2017-12-12 5 US 2 4 1
# 13: 4 2017-12-12 3 US 2 4 1
That's a lot to absorb.
Quick walk-through:
"8 months ago":
seq(z, length=2, by="-8 months")[2]
seq.Date (inferred by calling seq with a Date-class first argument) starts at z (current date for each row) and produces a sequence of length 2 with 8 months between them. seq always starts at the first argument, so length=1 won't work (it'll only return z); length=2 guarantees that the second value in the returned vector will be the "8 months before date" that we need.
Date subtraction:
[, date8 := do.call(c, lapply(dt$date, function(z) seq(...)[2])) ]
A simple base-R method for subtracting 8 months is seq(date, length=2, by="-8 months")[2]. seq.Date requires its first argument to be length-1, so we need to sapply or lapply it; unfortunately, sapply drops the class, so we lapply it and then programmatically combine them with do.call(c, ...) (since c(..) creates a list-column, and unlist will de-class it). (Perhaps this part can be improved.)
We need that in dt first since we do a non-equi (range-based) join based on this value.
Counting id with 2 or more visits:
length(Filter(function(z) z > 1, table(id)))
We produce a table(id), which gives us the count of each id within the join-period. Filter(fun, ...) allows us to reduce those that have a count below 2, and we're left with a named-vector of ids that had 2 or more visits. Retrieving the length is what we need.
Self non-equi join:
dt[dt, on = .(country, date > date8, date <= date), ... ]
Relatively straightforward. This is an open/closed range (date8 is excluded, date is included); it can be changed to closed on both ends if you prefer.
Self non-equi join but count ids by-row: by=.EACHI.
Retrieve the results of that and assign into the original dt:
[, results := dt[...]$V1 ]
Since the non-equi join included a value (length(Filter(...))) without a name, it's named V1, and all we want is that. (To be honest, I don't know exactly why assigning it more directly doesn't work ... but the counts are all wrong. Perhaps it's backwards by-row tallying.)
Cleanup:
[, date8 := NULL ]
(Nothing fancy here, just proper data-stewardship :-)
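Two of the building blocks from the walk-through can be tried on their own. This is a small base-R sketch with made-up vectors (dates, ids) rather than the question's dt; it only illustrates the seq() trick and the table()/Filter() count.

```r
# "8 months before" a single date: second element of a length-2 sequence
z <- as.Date("2017-12-12")
eight_back <- seq(z, length.out = 2, by = "-8 months")[2]
format(eight_back)   # "2017-04-12"

# Same idea over a vector; do.call(c, ...) keeps the Date class
dates <- as.Date(c("2017-01-05", "2017-05-01", "2017-12-12"))
back <- do.call(c, lapply(dates, function(d) {
  seq(d, length.out = 2, by = "-8 months")[2]
}))
class(back)          # "Date"

# Count ids that occur at least twice; table() drops NA by default
ids <- c("1", "1", "2", "3", "3", "3", NA)
n_repeat <- length(Filter(function(n) n > 1, table(ids)))
n_repeat             # 2 (ids "1" and "3")
```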
There are some discrepancies in my counts versus your desired_output, I wonder if those are just typos in the OP; I think the math is right ...
Here is another option:
setkey(dt, country, date, id)
dt[, date := as.IDate(date)][,
eightmthsago := as.IDate(sapply(as.IDate(date), function(x) seq(x, by="-8 months", length.out=2L)[2L]))]
dt[, c("out", "out_unres") :=
dt[dt, on=.(country, date>=eightmthsago, date<=date),
by=.EACHI, {
v <- id[!is.na(id)]
.(uniqueN(v[duplicated(v)]), uniqueN(v))
}][,1L:3L := NULL]
]
dt
output (like r2evans, I am also getting different output from desired as there seems to be a miscount in the desired output):
id date value country desired_output desired_unrestricted eightmthsago out out_unres
1: 1 2017-05-01 2 <NA> NA NA 2016-09-01 0 1
2: 1 2017-01-01 2 UK 1 2 2016-05-01 1 2
3: 1 2017-01-01 4 UK 1 2 2016-05-01 1 2
4: 2 2017-01-01 8 UK 1 2 2016-05-01 1 2
5: 2 2017-01-05 17 UK 2 2 2016-05-05 2 2
6: 1 2017-01-05 3 US 0 1 2016-05-05 0 1
7: 1 2017-05-01 5 US 1 3 2016-09-01 2 3
8: 2 2017-05-01 17 US 1 3 2016-09-01 2 3
9: 3 2017-05-01 3 US 2 3 2016-09-01 2 3
10: 3 2017-05-01 7 US 2 3 2016-09-01 2 3
11: <NA> 2017-12-12 5 US 2 4 2017-04-12 1 4
12: 4 2017-12-12 3 US 2 4 2017-04-12 1 4
13: 1 2018-05-02 5 US 0 1 2017-09-02 0 2
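The counting step inside the join can be checked in isolation. A small sketch on a made-up id vector v: duplicated() flags every occurrence after the first, so the unique values among the flagged elements are exactly the ids seen at least twice.

```r
library(data.table)

# Hypothetical ids within one join window, NAs already removed
v <- c("1", "1", "1", "2", "3", "3")

n_at_least_twice <- uniqueN(v[duplicated(v)])  # ids "1" and "3"
n_distinct       <- uniqueN(v)                 # ids "1", "2", "3"

c(restricted = n_at_least_twice, unrestricted = n_distinct)
```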
Although this question is tagged with data.table, here is a dplyr::rowwise solution to the problem. Is this what you had in mind? The output looks valid to me: the number of ids in the last 8 months that have a count greater than 1.
library(dplyr)
library(lubridate)
dt <- dt %>% mutate(date = as.Date(date))
dt %>%
group_by(country) %>%
group_modify(~ .x %>%
rowwise() %>%
mutate(totalids = .x %>%
filter(date <= .env$date, date >= .env$date %m-% months(8)) %>%
pull(id) %>%
table() %>%
`[`(. >1) %>%
length
))
#> # A tibble: 13 x 7
#> # Groups: country [3]
#> country id date value desired_output desired_unrestricted totalids
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <int>
#> 1 UK 1 2017-01-01 2 1 2 1
#> 2 UK 1 2017-01-01 4 1 2 1
#> 3 UK 2 2017-01-01 8 1 2 1
#> 4 UK 2 2017-01-05 17 2 2 2
#> 5 US 1 2017-01-05 3 0 1 0
#> 6 US 1 2017-05-01 5 1 3 2
#> 7 US 1 2018-05-02 5 0 1 0
#> 8 US 2 2017-05-01 17 1 3 2
#> 9 US 3 2017-05-01 3 2 3 2
#> 10 US 3 2017-05-01 7 2 3 2
#> 11 US <NA> 2017-12-12 5 2 4 1
#> 12 US 4 2017-12-12 3 2 4 1
#> 13 <NA> 1 2017-05-01 2 NA NA 0
Created on 2021-09-02 by the reprex package (v2.0.1)
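One thing worth knowing when comparing the answers: lubridate's %m-% and base seq() agree on the dates in this question but can differ when the starting day does not exist in the target month. A small sketch with a made-up date (2017-10-31 is not in the question's data):

```r
library(lubridate)

d <- as.Date("2017-10-31")

# lubridate rolls back to the last valid day of the target month
via_lubridate <- d %m-% months(8)

# base seq() overflows "February 31" into March
via_seq <- seq(d, length.out = 2, by = "-8 months")[2]

format(via_lubridate)  # "2017-02-28"
format(via_seq)        # "2017-03-03"
```

Which behaviour is right depends on how "the last 8 months" should be defined for month-end dates.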

Match rows with the same or close start and end date in data.table r

I have the following data.table:
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
start_date=c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24"),
end_date=c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24"),
variable1=c("a","c","c","d","a",NA,"a","a","b"))
df
id start_date end_date variable1
1: 1 2019-05-08 2019-09-08 a
2: 2 2019-08-01 2019-12-01 c
3: 2 2019-07-12 2019-07-30 c
4: 2 2017-05-24 2017-11-24 d
5: 3 2016-05-08 2017-07-25 a
6: 3 2017-08-01 2018-08-01 <NA>
7: 4 2019-06-12 2019-12-12 a
8: 4 2017-02-24 2017-08-24 a
9: 4 2017-08-24 2018-08-24 b
Within the same ID, I want to compare the start_date and end_date. If the end_date of one row is within 30 days of the start_date of another row, I want to combine the rows. So that it looks like this:
id start_date end_date variable1
1: 1 2019-05-08 2019-09-08 a
2: 2 2019-07-12 2019-12-01 c
3: 2 2017-05-24 2017-11-24 d
4: 3 2016-05-08 2018-08-01 a
5: 4 2019-06-12 2019-12-12 a
6: 4 2017-02-24 2017-08-24 a
7: 4 2017-08-24 2018-08-24 b
If the other variables of the rows are the same, the rows should be combined, keeping the earliest start_date and the latest end_date, as for id 2. If variable1 is NA, it should be replaced with the value from the matching row, as for id 3. If variable1 has different values, the rows should remain separate, as for id 4.
The data.table contains more variables and objects than displayed here. Preferably a solution in data.table.
Not clear what happens if an id has 3 overlapping rows with variable1 = c('a', NA, 'b'), what should the variable1 be for the NA for this case? a or b?
If we just choose the first variable1 when there are multiple matches, here is an option that first fills the NA and then borrows the idea from David Arenburg's solution here
setorder(df, id, start_date, end_date)
df[, end_d := end_date + 30L]
df[is.na(variable1), variable1 :=
df[!is.na(variable1)][.SD, on=.(id, start_date<=start_date, end_d>=start_date), mult="first", x.variable1]]
df[, g:= c(0L, cumsum(shift(start_date, -1L) > cummax(as.integer(end_d)))[-.N]), id][,
.(start_date=min(start_date), end_date=max(end_date)), .(id, variable1, g)]
output:
id variable1 g start_date end_date
1: 1 a 0 2019-05-08 2019-09-08
2: 2 d 0 2017-05-24 2017-11-24
3: 2 c 1 2019-07-12 2019-12-01
4: 3 a 0 2016-05-08 2018-08-01
5: 4 a 0 2017-02-24 2017-08-24
6: 4 b 0 2017-08-24 2018-08-24
7: 4 a 1 2019-06-12 2019-12-12
data:
library(data.table)
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
start_date=as.IDate(c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24")),
end_date=as.IDate(c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24")),
variable1=c("a","c","c","d","a",NA,"a","a","b"))
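The grouping expression is the dense part of this answer, so here is a minimal sketch of it on its own, using only the three rows of id 2 (the table name iv is made up). A new group starts whenever the next start_date lies beyond the running maximum of end_date + 30 days.

```r
library(data.table)

# Rows of id 2 from the question, without variable1
iv <- data.table(
  id         = c(2, 2, 2),
  start_date = as.IDate(c("2017-05-24", "2019-07-12", "2019-08-01")),
  end_date   = as.IDate(c("2017-11-24", "2019-07-30", "2019-12-01"))
)
setorder(iv, id, start_date, end_date)

# Extend each interval by the 30-day tolerance
iv[, end_d := end_date + 30L]

# Compare each row's successor with the furthest extended end seen so far;
# TRUE marks a gap, and cumsum turns the gaps into group numbers
iv[, g := c(0L, cumsum(shift(start_date, -1L) > cummax(as.integer(end_d)))[-.N]),
   by = id]

merged <- iv[, .(start_date = min(start_date), end_date = max(end_date)),
             by = .(id, g)]
print(merged)
```

The 2019 rows end up in the same group and collapse to 2019-07-12 through 2019-12-01, while the 2017 row stays separate.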

Wrong column on data.table merge

Let's say I have these two tables:
library(data.table)
x <- data.table(Date = as.Date(c("1990-01-29", "1990-01-30",
"1990-01-31", "1990-02-01",
"1990-02-02", "1990-02-05",
"1990-02-06", "1990-02-07",
"1990-02-08", "1990-02-09")),
a = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55))
y <- data.table(Date1 = as.Date(c("1990-01-31", "1990-02-06", "1990-02-07")),
Date2 = as.Date(c("1990-02-06", "1990-02-07", "1990-02-09")),
b = c(5, 2, 4))
Table y is really a descriptor of different "periods" starting at Date1 and ending at Date2 (such that one row's Date2 is the next row's Date1), with a (non-unique) descriptor of that period.
I'd now like to merge these tables so that, for each date in x, I have both a and the respective y$b (dates outside of the periods should be dropped). I tried the following, but it's not right:
x[y, on = .(Date > Date1, Date <= Date2)]
# Date x Date.1 y
# 1: 1990-01-31 3 1990-02-06 5
# 2: 1990-01-31 5 1990-02-06 5
# 3: 1990-01-31 8 1990-02-06 5
# 4: 1990-01-31 13 1990-02-06 5
# 5: 1990-02-06 21 1990-02-07 2
# 6: 1990-02-07 34 1990-02-09 4
# 7: 1990-02-07 55 1990-02-09 4
Specifically, the Date column isn't x$Date, but actually y$Date1, repeated as necessary, while the Date.1 column is Date2.
Meanwhile, the expected output would be
# Date x y
# 1: 1990-02-01 3 5
# 2: 1990-02-02 5 5
# 3: 1990-02-05 8 5
# 4: 1990-02-06 13 5
# 5: 1990-02-07 21 2
# 6: 1990-02-08 34 4
# 7: 1990-02-09 55 4
It may be better to create a duplicate column
x[,.(Daten = Date, Date, a)][y,
on = .(Date > Date1, Date <= Date2)][, .(Date = Daten, a, b)]
# Date a b
#1: 1990-02-01 3 5
#2: 1990-02-02 5 5
#3: 1990-02-05 8 5
#4: 1990-02-06 13 5
#5: 1990-02-07 21 2
#6: 1990-02-08 34 4
#7: 1990-02-09 55 4
You can refer to the columns of each table using x. and i.
x[y,
on = .(Date > Date1, Date <= Date2),
.(Date = x.Date, x = x.a, y = i.b)]
Date x y
1: 1990-02-01 3 5
2: 1990-02-02 5 5
3: 1990-02-05 8 5
4: 1990-02-06 13 5
5: 1990-02-07 21 2
6: 1990-02-08 34 4
7: 1990-02-09 55 4
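To confirm that x.Date really carries the original dates, one can keep the period bounds next to it and check that every date falls inside its period. A minimal self-contained sketch, assuming the second date in x was meant to be 1990-01-30; the column names from and to are made up:

```r
library(data.table)

x <- data.table(
  Date = as.Date(c("1990-01-29", "1990-01-30", "1990-01-31", "1990-02-01",
                   "1990-02-02", "1990-02-05", "1990-02-06", "1990-02-07",
                   "1990-02-08", "1990-02-09")),
  a = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)
)
y <- data.table(
  Date1 = as.Date(c("1990-01-31", "1990-02-06", "1990-02-07")),
  Date2 = as.Date(c("1990-02-06", "1990-02-07", "1990-02-09")),
  b = c(5, 2, 4)
)

# x. pulls columns from x, i. pulls columns from y (the i table)
res <- x[y, on = .(Date > Date1, Date <= Date2),
         .(Date = x.Date, a = x.a, b = i.b, from = i.Date1, to = i.Date2)]

# Every kept date lies in its period (open at from, closed at to)
all(res$Date > res$from & res$Date <= res$to)
```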

creating a unique variable based on row differences of another variable considering groups

By using the data below, I want to create a new unique customer id by considering their contact date.
Rule: After every two days, I want each customer to get a new unique customer id and preserve it on the following record if the following contact date for the same customer is within the following two days if not assign a new id to this same customer.
I couldn't go any further than calculating date differences.
The original dataset I work with is bigger; therefore, I prefer a data.table solution if possible.
library(data.table)
treshold <- 2
dt <- structure(list(customer_id = c('10','20','20','20','20','20','30','30','30','30','30','40','50','50'),
contact_date = as.Date(c("2019-01-05","2019-01-01","2019-01-01","2019-01-02",
"2019-01-08","2019-01-09","2019-02-02","2019-02-05",
"2019-02-05","2019-02-09","2019-02-12","2019-02-01",
"2019-02-01","2019-02-05")),
desired_output = c(1,2,2,2,3,3,4,5,5,6,7,8,9,10)),
class = "data.frame",
row.names = 1:14)
setDT(dt)
setorder(dt, customer_id, contact_date)
dt[, date_diff_in_days:=contact_date - shift(contact_date, type = c("lag")), by=customer_id]
dt[, date_diff_in_days:=as.numeric(date_diff_in_days)]
dt
customer_id contact_date desired_output date_diff_in_days
1: 10 2019-01-05 1 NA
2: 20 2019-01-01 2 NA
3: 20 2019-01-01 2 0
4: 20 2019-01-02 2 1
5: 20 2019-01-08 3 6
6: 20 2019-01-09 3 1
7: 30 2019-02-02 4 NA
8: 30 2019-02-05 5 3
9: 30 2019-02-05 5 0
10: 30 2019-02-09 6 4
11: 30 2019-02-12 7 3
12: 40 2019-02-01 8 NA
13: 50 2019-02-01 9 NA
14: 50 2019-02-05 10 4
Rule: After every two days, I want each customer to get a new unique customer id and preserve it on the following record if the following contact date for the same customer is within the following two days if not assign a new id to this same customer.
When creating a new ID, if you set up the by= vectors correctly to capture the rule, the auto-counter .GRP can be used:
thresh <- 2
dt[, g := .GRP, by=.(
customer_id,
cumsum(contact_date - shift(contact_date, fill=first(contact_date)) > thresh)
)]
dt[, any(g != desired_output)]
# [1] FALSE
I think the code above is correct since it works on the example, but you might want to check on your actual data (comparing against results from, eg, Gregor's approach) to be sure.
We use cumsum to increment whenever date_diff_in_days is NA or when the threshold is exceeded.
dt[, result := cumsum(is.na(date_diff_in_days) | date_diff_in_days > treshold)]
# customer_id contact_date desired_output date_diff_in_days result
# 1: 10 2019-01-05 1 NA 1
# 2: 20 2019-01-01 2 NA 2
# 3: 20 2019-01-01 2 0 2
# 4: 20 2019-01-02 2 1 2
# 5: 20 2019-01-08 3 6 3
# 6: 20 2019-01-09 3 1 3
# 7: 30 2019-02-02 4 NA 4
# 8: 30 2019-02-05 5 3 5
# 9: 30 2019-02-05 5 0 5
# 10: 30 2019-02-09 6 4 6
# 11: 30 2019-02-12 7 3 7
# 12: 40 2019-02-01 8 NA 8
# 13: 50 2019-02-01 9 NA 9
# 14: 50 2019-02-05 10 4 10
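The cumsum line depends on the gap being NA on each customer's first row, which only holds when the lag is computed by customer. A minimal end-to-end sketch on a made-up subset of the data (the column names gap and new_id are illustrative):

```r
library(data.table)

dt <- data.table(
  customer_id  = c("20", "20", "20", "30", "30"),
  contact_date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-08",
                           "2019-02-02", "2019-02-05"))
)
thresh <- 2

setorder(dt, customer_id, contact_date)

# Days since the previous contact of the same customer; NA on the first row
dt[, gap := as.numeric(contact_date - shift(contact_date)), by = customer_id]

# A new id starts at each customer's first row or after a gap above the threshold
dt[, new_id := cumsum(is.na(gap) | gap > thresh)]
print(dt)
```

The gaps here are NA, 1, 6, NA, 3, so the ids come out as 1, 1, 2, 3, 4.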

Aggregate one data frame by time intervals from another data frame

I'm trying to aggregate two data frames (df1 and df2).
The first contains 3 variables: ID, Date1 and Date2.
df1
ID Date1 Date2
1 2016-03-01 2016-04-01
1 2016-04-01 2016-05-01
2 2016-03-14 2016-04-15
2 2016-04-15 2016-05-17
3 2016-05-01 2016-06-10
3 2016-06-10 2016-07-15
The second also contains 3 variables: ID, Date3 and Value.
df2
ID Date3 Value
1 2016-03-15 5
1 2016-04-04 7
1 2016-04-28 7
2 2016-03-18 3
2 2016-03-27 5
2 2016-04-08 9
2 2016-04-20 2
3 2016-05-05 6
3 2016-05-25 8
3 2016-06-13 3
The idea is to get, for each df1 row, the sum of df2$Value that have the same ID and for which Date3 is between Date1 and Date2:
ID Date1 Date2 SumValue
1 2016-03-01 2016-04-01 5
1 2016-04-01 2016-05-01 14
2 2016-03-14 2016-04-15 17
2 2016-04-15 2016-05-17 2
3 2016-05-01 2016-06-10 14
3 2016-06-10 2016-07-15 3
I know how to do this with a loop, but the data frames are huge! Does anyone have an efficient solution? I have explored data.table, plyr and dplyr but could not find one.
A couple of data.table solutions that should scale well (and a good stop-gap until non-equi joins are implemented):
Do the comparison in j using by=.EACHI.
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df1[ df2,
{
idx = Date1 <= i.Date3 & i.Date3 <= Date2
.(Date1 = Date1[idx],
Date2 = Date2[idx],
Date3 = i.Date3,
Value = i.Value)
},
on=c("ID"),
by=.EACHI][, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
foverlaps join (as suggested in the comments)
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df2[, Date4 := Date3]
setkey(df1, ID, Date1, Date2)
foverlaps(df2,
df1,
by.x=c("ID", "Date3", "Date4"),
type="within")[, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
Further reading
Rolling join on data.table with duplicate keys
foverlap joins in data.table
With the recently implemented non-equi joins feature in the current development version of data.table, v1.9.7, this can be done as follows:
df2[df1, .(sum = sum(Value)), on=.(ID, Date3>=Date1, Date3<=Date2), by=.EACHI]
# ID Date3 Date3 sum
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
The column names need some fixing; I will work on that later.
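A minimal sketch of one way to get tidy names, assuming a data.table release with non-equi joins (1.9.8 or later): run the join with by=.EACHI, which returns one row per interval in df1's order, and copy the summed column back onto df1. The object name sums is made up.

```r
library(data.table)

df1 <- data.table(
  ID    = c(1L, 1L, 2L, 2L, 3L, 3L),
  Date1 = as.Date(c("2016-03-01", "2016-04-01", "2016-03-14",
                    "2016-04-15", "2016-05-01", "2016-06-10")),
  Date2 = as.Date(c("2016-04-01", "2016-05-01", "2016-04-15",
                    "2016-05-17", "2016-06-10", "2016-07-15"))
)
df2 <- data.table(
  ID    = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  Date3 = as.Date(c("2016-03-15", "2016-04-04", "2016-04-28", "2016-03-18",
                    "2016-03-27", "2016-04-08", "2016-04-20", "2016-05-05",
                    "2016-05-25", "2016-06-13")),
  Value = c(5, 7, 7, 3, 5, 9, 2, 6, 8, 3)
)

# One result row per row of df1, in df1's order
sums <- df2[df1, on = .(ID, Date3 >= Date1, Date3 <= Date2),
            .(SumValue = sum(Value)), by = .EACHI]

# Attach the sums to the interval table, which keeps its own column names
df1[, SumValue := sums$SumValue]
print(df1)
```

An interval with no matching df2 rows would get NA here rather than 0.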
Here's a base R solution using sapply():
df1 <- data.frame(ID=c(1L,1L,2L,2L,3L,3L),Date1=as.Date(c('2016-03-01','2016-04-01','2016-03-14','2016-04-15','2016-05-01','2016-06-01')),Date2=as.Date(c('2016-04-01','2016-05-01','2016-04-15','2016-05-17','2016-06-15','2016-07-15')));
df2 <- data.frame(ID=c(1L,1L,1L,2L,2L,2L,2L,3L,3L,3L),Date3=as.Date(c('2016-03-15','2016-04-04','2016-04-28','2016-03-18','2016-03-27','2016-04-08','2016-04-20','2016-05-05','2016-05-25','2016-06-13')),Value=c(5L,7L,7L,3L,5L,9L,2L,6L,8L,3L));
cbind(df1,SumValue=sapply(seq_len(nrow(df1)),function(ri) sum(df2$Value[df1$ID[ri]==df2$ID & df1$Date1[ri]<=df2$Date3 & df1$Date2[ri]>df2$Date3])));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3
Note that your df1 and expected output have slightly different dates in some cases; I used the df1 dates.
Here's another approach that attempts to be more vectorized: Precompute a cartesian product of indexes into the two frames, then perform a single vectorized conditional expression using the index vectors to get matching pairs of indexes, and finally use the matching indexes to aggregate the desired result:
cbind(df1,SumValue=with(expand.grid(i1=seq_len(nrow(df1)),i2=seq_len(nrow(df2))),{
x <- df1$ID[i1]==df2$ID[i2] & df1$Date1[i1]<=df2$Date3[i2] & df1$Date2[i1]>df2$Date3[i2];
tapply(df2$Value[i2[x]],i1[x],sum);
}));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3
