Calculate number of distinct instances occurring in a given time period - r

I have dummy data
structure(list(id = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 7, 7),
  policy_num = c(41551662L, 50966414L, 43077202L, 46927463L,
    57130236L, 57050065L, 26196559L, 33545119L, 52304024L, 73953064L,
    50340507L, 50491162L, 76577511L, 108067534L),
  product = c("apple", "apple", "pear", "apple", "apple", "apple",
    "plum", "apple", "pear", "apple", "apple", "apple", "pear", "pear"),
  start_date = structure(c(13607, 15434, 14276, 15294, 15660, 15660,
    10547, 15117, 15483, 16351, 15429, 15421, 16474, 17205), class = "Date"),
  end_date = structure(c(15068, 16164, 17563, 15660, 15660, 16390,
    13834, 16234, 17674, 17447, 15794, 15786, 17205, 17570), class = "Date")),
  .Names = c("id", "policy_num", "product", "start_date", "end_date"),
  row.names = c(NA, -14L), class = c("data.table", "data.frame"))
id policy_num product start_date end_date
1 41551662 apple 2007-04-04 2011-04-04
1 50966414 apple 2012-04-04 2014-04-04
2 43077202 pear 2009-02-01 2018-02-01
3 46927463 apple 2011-11-16 2012-11-16
3 57130236 apple 2012-11-16 2012-11-16
3 57050065 apple 2012-11-16 2014-11-16
4 26196559 plum 1998-11-17 2007-11-17
5 33545119 apple 2011-05-23 2014-06-13
5 52304024 pear 2012-05-23 2018-05-23
5 73953064 apple 2014-10-08 2017-10-08
6 50340507 apple 2012-03-30 2013-03-30
7 50491162 apple 2012-03-22 2013-03-22
7 76577511 pear 2015-02-08 2017-02-08
7 108067534 pear 2017-02-08 2018-02-08
Based on it, I'd like to calculate the following variables (grouped by id):
1) Number of currently held products (no_prod_now) - number of distinct products whose end_date > currently evaluated start_date. Simply, the number of products held by the id at the time of start_date
2) Number of currently held active policies (no_policies_now) - as above, but applied to policy_num
3) Number of policies opened within 3 months prior to the current start_date (policies_open_3mo)
4) policies_closed_3mo - as above, but the number of policies closed in the past 3 months
The desired output would look like this:
id policy_num product start_date   end_date no_prod_now no_policies_now policies_closed_3mo policies_open_3mo
 1   41551662   apple 2007-04-04 2011-04-04           1               1                   0                 0
 1   50966414   apple 2012-04-04 2014-04-04           1               1                   0                 0
 2   43077202    pear 2009-02-01 2018-02-01           1               1                   0                 0
 3   46927463   apple 2011-11-16 2012-11-16           1               1                   0                 0
 3   57130236   apple 2012-11-16 2012-11-16           1               1                   1                 0
 3   57050065   apple 2012-11-16 2014-11-16           1               1                   2                 1
 4   26196559    plum 1998-11-17 2007-11-17           1               1                   0                 0
 5   33545119   apple 2011-05-23 2014-06-13           1               1                   0                 0
 5   52304024    pear 2012-05-23 2018-05-23           2               2                   0                 1
 5   73953064   apple 2014-10-08 2017-10-08           2               2                   0                 0
 6   50340507   apple 2012-03-30 2013-03-30           1               1                   0                 0
 7   50491162   apple 2012-03-22 2013-03-22           1               1                   0                 0
 7   76577511    pear 2015-02-08 2017-02-08           1               1                   0                 0
 7  108067534    pear 2017-02-08 2018-02-08           1               1                   1                 0
I'm looking for a solution ideally implemented in data.table, as I'm going to apply it to big data volumes, but base R or dplyr solutions, which I could always convert to data.table, would also be valuable. Thanks!

This is quite tricky but can be solved with a number of non-equi self-joins.
Edit: It turned out that update on join doesn't work together with non-equi self-joins as I had expected (see here), so I had to revise the code completely to avoid updates in place.
Instead, the four additional columns are created by three separate non-equi self-joins and are combined for the final result.
library(data.table)
library(lubridate)

result <-
  # create helper column for previous three-months periods;
  # lubridate's month arithmetic avoids NAs at end of month, e.g., February
  DT[, start_date_3mo := start_date %m-% period(month = 3L)][
    # start "cbind()" with original columns
    , c(.SD,
        # count number of products and policies held at time of start_date
        DT[DT, on = c("id", "start_date<=start_date", "end_date>start_date"),
           .(no_prod_now = uniqueN(product), no_pols_now = uniqueN(policy_num)),
           by = .EACHI][, c("no_prod_now", "no_pols_now")],
        # policies closed within previous 3 months of start_date
        DT[DT, on = c("id", "end_date>=start_date_3mo", "end_date<=start_date"),
           .(pols_closed_3mo = .N), by = .EACHI][, "pols_closed_3mo"],
        # additional policies opened within previous 3 months of start_date
        DT[DT, on = c("id", "start_date>=start_date_3mo", "start_date<=start_date"),
           .(pols_opened_3mo = .N - 1L), by = .EACHI][, "pols_opened_3mo"])][
    # omit helper column
    , -"start_date_3mo"]
result
id policy_num product start_date end_date no_prod_now no_pols_now pols_closed_3mo pols_opened_3mo
1: 1 41551662 apple 2007-04-04 2011-04-04 1 1 0 0
2: 1 50966414 apple 2012-04-04 2014-04-04 1 1 0 0
3: 2 43077202 pear 2009-02-01 2018-02-01 1 1 0 0
4: 3 46927463 apple 2011-11-16 2012-11-16 1 1 0 0
5: 3 57130236 apple 2012-11-16 2012-11-16 1 1 2 1
6: 3 57050065 apple 2012-11-16 2014-11-16 1 1 2 1
7: 4 26196559 plum 1998-11-17 2007-11-17 1 1 0 0
8: 5 33545119 apple 2011-05-23 2014-06-13 1 1 0 0
9: 5 52304024 pear 2012-05-23 2018-05-23 2 2 0 0
10: 5 73953064 apple 2014-10-08 2017-10-08 2 2 0 0
11: 6 50340507 apple 2012-03-30 2013-03-30 1 1 0 0
12: 7 50491162 apple 2012-03-22 2013-03-22 1 1 0 0
13: 7 76577511 pear 2015-02-08 2017-02-08 1 1 0 0
14: 7 108067534 pear 2017-02-08 2018-02-08 1 1 1 0
Note that the counts of policies opened within the 3 months before start_date differ between the OP's expected result and the result here. For id == 3, two policies start on 2012-11-16, so for each of those rows there is one additional policy to count. For id == 5, all start_date values differ by more than 3 months, so there shouldn't be any overlap.
Also, rows 5 and 6 both show a value of 2 for policies closed within the 3 months before start_date, because id == 3 has two policies ending on 2012-11-16.
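As a quick illustration of why the helper column uses lubridate's %m-% rather than plain period subtraction (the dates below are just an example, not taken from the question):
library(lubridate)
ymd("2018-05-31") - months(3)     # NA, because "2018-02-31" does not exist
ymd("2018-05-31") %m-% months(3)  # "2018-02-28", rolled back to the last valid day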

Related

Find un-arrangeable consecutive time intervals with exactly n days difference

I have data as follows, and I need to group the rows based on dates such that time_right + 1 = time_left (in another row). The group id is equal to the minimum id of the records that satisfy this condition.
input = data.frame(id = c(1:6),
time_left = c("2016-01-01", "2016-09-05", "2016-09-06","2016-09-08", "2016-09-12","2016-09-15"),
time_right = c("2016-09-07", "2016-09-11", "2016-09-12", "2016-09-14", "2016-09-18","2016-09-21"))
Input
id time_left time_right
1 1 2016-01-01 2016-09-07
2 2 2016-09-05 2016-09-11
3 3 2016-09-06 2016-09-12
4 4 2016-09-08 2016-09-14
5 5 2016-09-12 2016-09-18
6 6 2016-09-15 2016-09-21
Output:
id time_left time_right group_id
1 1 2016-01-01 2016-09-07 1
2 2 2016-09-05 2016-09-11 2
3 3 2016-09-06 2016-09-12 3
4 4 2016-09-08 2016-09-14 1
5 5 2016-09-12 2016-09-18 2
6 6 2016-09-15 2016-09-21 1
Is there any way to do it with dplyr?
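A minimal sketch of one way to do this (the grouping step is a base R loop rather than pure dplyr; it assumes the rows are sorted by time_left, so that a row's predecessor is always processed before the row itself):
# convert the date columns first
input$time_left  <- as.Date(input$time_left)
input$time_right <- as.Date(input$time_right)
# each row inherits the group of the row whose time_right + 1 equals
# this row's time_left; otherwise it keeps its own id as the group
group_id <- input$id
for (i in seq_len(nrow(input))) {
  j <- which(input$time_right + 1 == input$time_left[i])
  if (length(j) > 0) group_id[i] <- min(group_id[j])
}
input$group_id <- group_id
input
#   id  time_left time_right group_id
# 1  1 2016-01-01 2016-09-07        1
# 2  2 2016-09-05 2016-09-11        2
# 3  3 2016-09-06 2016-09-12        3
# 4  4 2016-09-08 2016-09-14        1
# 5  5 2016-09-12 2016-09-18        2
# 6  6 2016-09-15 2016-09-21        1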

Referring to the row above when using mutate() in R

I want to create a new variable in a dataframe that refers to the value of the same new variable in the row above. Here's an example of what I want to do:
A horse is in a field divided into four zones. The horse is wearing a beacon that signals every minute, and the signal is picked up by one of four sensors, one for each zone. The field has a fence that runs most of the way down the middle, such that the horse can pass easily between zones 2 and 3, but to get between zones 1 and 4 it has to go via 2 and 3. The horse cannot jump over the fence.
          ________________
         |                |
sensor 2 |   X            | sensor 3
         |                |
         |        |       |
         |        |       |
sensor 1 |      Y |       | sensor 4
         |________|_______|
In the schematic above, if the horse is at position X, it will be picked up by sensor 2. If the horse is near the middle fence at position Y, however, it may be picked up by either sensor 1 or sensor 4, the ranges of which overlap slightly.
In the toy example below, I have a dataframe where I have location data each minute for 20 minutes. In most cases, the horse moves one zone at a time, but in several instances, it switches back and forth between zone 1 and 4. This should be impossible: the horse cannot jump the fence, and neither can it run around in the space of a minute.
I therefore want to calculate a new variable in the dataset that provides the "true" location of the animal, accounting for the impossibility of travelling between 1 and 4.
Here's the data:
library(tidyverse)
library(reshape2)
example <- data.frame(time = seq(as.POSIXct("2022-01-01 09:00:00"),
                                 as.POSIXct("2022-01-01 09:20:00"),
                                 by = "1 mins"),
                      location = c(1,1,1,1,2,3,3,3,4,4,4,3,3,2,1,1,4,1,4,1,4))
example
Create two new variables: "prevloc" is where the animal was in the previous minute, and "diffloc" is the number differences between the animal's current and previous location.
example <- example %>% mutate(prevloc = lag(location),
                              diffloc = abs(location - prevloc))
example
Next, just change the first value of "diffloc" from NA to zero:
example <- example %>% mutate(diffloc = ifelse(is.na(diffloc), 0, diffloc))
example
Now we have a dataframe where diffloc is either 0 (animal didn't move), 1 (animal moved one zone), or 3 (animal apparently moved from zone 1 to zone 4 or vice versa). Where diffloc = 3, I want to create a "true" location taking account of the fact that such a change in location is impossible.
In my example, the animal went from zone 1 -> 4 -> 1 -> 4 -> 1 -> 4. Based on the fact that the animal started in zone 1, my assumption is that the animal just stayed in zone 1 the whole time.
My attempt to solve this is below, but it doesn't work:
example <- example %>%
  mutate(returnloc = ifelse(diffloc < 3, location, lag(returnloc)))
I wonder whether anyone can help me to solve this? I've been trying for a couple of days and haven't even got close...
Best wishes,
Adam
One possible solution is to, when diffloc == 3, look at the previous value that is not 1 nor 4. If it is 2, then the horse is certainly in 1 afterwards, if it is 3, then the horse is certainly in 4.
example %>%
  mutate(trueloc = case_when(
    diffloc == 3 & sapply(seq(row_number()),
                          \(i) tail(location[1:i][!location %in% c(1, 4)], 1) == 2) ~ 1,
    diffloc == 3 & sapply(seq(row_number()),
                          \(i) tail(location[1:i][!location %in% c(1, 4)], 1) == 3) ~ 4,
    TRUE ~ location))
time location prevloc diffloc trueloc
1 2022-01-01 09:00:00 1 NA 0 1
2 2022-01-01 09:01:00 1 1 0 1
3 2022-01-01 09:02:00 1 1 0 1
4 2022-01-01 09:03:00 1 1 0 1
5 2022-01-01 09:04:00 2 1 1 2
6 2022-01-01 09:05:00 3 2 1 3
7 2022-01-01 09:06:00 3 3 0 3
8 2022-01-01 09:07:00 3 3 0 3
9 2022-01-01 09:08:00 4 3 1 4
10 2022-01-01 09:09:00 4 4 0 4
11 2022-01-01 09:10:00 4 4 0 4
12 2022-01-01 09:11:00 3 4 1 3
13 2022-01-01 09:12:00 3 3 0 3
14 2022-01-01 09:13:00 2 3 1 2
15 2022-01-01 09:14:00 1 2 1 1
16 2022-01-01 09:15:00 1 1 0 1
17 2022-01-01 09:16:00 4 1 3 1
18 2022-01-01 09:17:00 1 4 3 1
19 2022-01-01 09:18:00 4 1 3 1
20 2022-01-01 09:19:00 1 4 3 1
21 2022-01-01 09:20:00 4 1 3 1
Here is an approach using a function containing a for-loop.
You cannot rely on diff, because this will not pick up sequences of (wrong) zone 4's.
c(1,1,4,4,4,1,1,1) should be converted to c(1,1,1,1,1,1,1,1) if I understand your question correctly.
So, you need to iterate (I think).
library(data.table)
# custom sample data set
example <- data.frame(time = seq(as.POSIXct("2022-01-01 09:00:00"),
                                 as.POSIXct("2022-01-01 09:20:00"),
                                 by = "1 mins"),
                      location = c(1,1,1,1,2,3,3,3,4,4,4,3,3,2,1,1,4,4,4,1,4))
# Make it a data.table, make sure the time is ordered
setDT(example, key = "time")
# function
fixLocations <- function(x) {
  for (i in 2:length(x)) {
    if (abs(x[i] - x[i-1]) > 1) x[i] <- x[i-1]
  }
  return(x)
}
NB that this function only works if the location in the first row is correct. If it starts with (wrong) zone 4's, it will go awry.
example[, locationNew := fixLocations(location)][]
# time location locationNew
# 1: 2022-01-01 09:00:00 1 1
# 2: 2022-01-01 09:01:00 1 1
# 3: 2022-01-01 09:02:00 1 1
# 4: 2022-01-01 09:03:00 1 1
# 5: 2022-01-01 09:04:00 2 2
# 6: 2022-01-01 09:05:00 3 3
# 7: 2022-01-01 09:06:00 3 3
# 8: 2022-01-01 09:07:00 3 3
# 9: 2022-01-01 09:08:00 4 4
#10: 2022-01-01 09:09:00 4 4
#11: 2022-01-01 09:10:00 4 4
#12: 2022-01-01 09:11:00 3 3
#13: 2022-01-01 09:12:00 3 3
#14: 2022-01-01 09:13:00 2 2
#15: 2022-01-01 09:14:00 1 1
#16: 2022-01-01 09:15:00 1 1
#17: 2022-01-01 09:16:00 4 1
#18: 2022-01-01 09:17:00 4 1
#19: 2022-01-01 09:18:00 4 1
#20: 2022-01-01 09:19:00 1 1
#21: 2022-01-01 09:20:00 4 1
# time location locationNew
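The same forward pass can also be written without an explicit loop using base R's Reduce() with accumulate = TRUE. This is just an alternative sketch of the answer's logic, with the same caveat about the first row (locationNew2 is a hypothetical column name, not part of the original answer):
# carry the previous accepted location forward whenever a jump of
# more than one zone appears
example[, locationNew2 := Reduce(
  function(prev, cur) if (abs(cur - prev) > 1) prev else cur,
  location, accumulate = TRUE)]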

Calculate maximum date interval - R

The challenge is a data.frame with one group variable (id) and two date variables (start and stop). The date intervals are irregular, and I'm trying to calculate the uninterrupted interval in days starting from the first start date per group.
Example data:
data <- data.frame(
id = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 5),
start = as.Date(c("2016-02-18", "2016-12-07", "2016-12-12", "2015-04-10",
"2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14",
"2010-12-08", "2011-03-09")),
stop = as.Date(c("2016-02-19", "2016-12-12", "2016-12-13", "2015-04-13",
"2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15",
"2010-12-10", "2011-03-11"))
)
> data
id start stop
1 1 2016-02-18 2016-02-19
2 2 2016-12-07 2016-12-12
3 2 2016-12-12 2016-12-13
4 3 2015-04-10 2015-04-13
5 3 2015-04-12 2015-04-22
6 3 2015-04-14 2015-05-13
7 3 2015-05-15 2015-07-13
8 3 2015-07-14 2015-07-15
9 4 2010-12-08 2010-12-10
10 5 2011-03-09 2011-03-11
The aim would be a data.frame like this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-12 7
3 2 2016-12-12 2016-12-13 7
4 3 2015-04-10 2015-04-13 34
5 3 2015-04-12 2015-04-22 34
6 3 2015-04-14 2015-05-13 34
7 3 2015-05-15 2015-07-13 34
8 3 2015-07-14 2015-07-15 34
9 4 2010-12-08 2010-12-10 3
10 5 2011-03-09 2011-03-11 3
Or this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-13 7
3 3 2015-04-10 2015-05-13 34
4 4 2010-12-08 2010-12-10 3
5 5 2011-03-09 2011-03-11 3
It's important to identify the gap from row 6 to row 7 and to take this point as the end of the maximum interval (34 days). An interval like 2018-10-01 to 2018-10-01 would be counted as 1.
My usual lubridate approaches (e.g., interval %within% lag(interval)) don't work with this example.
Any idea?
library(magrittr)
library(data.table)
setDT(data)
first_int <- function(start, stop) {
  ind <- rleid((start - shift(stop, fill = Inf)) > 0) == 1
  list(start = min(start[ind]),
       stop = max(stop[ind]))
}

newdata <-
  data[, first_int(start, stop), by = id] %>%
  .[, duration := stop - start + 1]
# id start stop duration
# 1: 1 2016-02-18 2016-02-19 2 days
# 2: 2 2016-12-07 2016-12-13 7 days
# 3: 3 2015-04-10 2015-05-13 34 days
# 4: 4 2010-12-08 2010-12-10 3 days
# 5: 5 2011-03-09 2011-03-11 3 days
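For reference, a rough dplyr equivalent of the same idea (a sketch assuming the same gap rule, i.e., the uninterrupted interval ends as soon as a start lies strictly after the preceding stop):
library(dplyr)

data %>%
  group_by(id) %>%
  arrange(start, .by_group = TRUE) %>%
  # keep only the rows before the first gap within each id
  filter(cumsum(start - lag(stop, default = first(start)) > 0) == 0) %>%
  summarise(start = min(start), stop = max(stop),
            duration_from_start = as.integer(stop - start) + 1)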

Calculate average number of individuals present on each date in R

I have a dataset that contains the residence period (start.date to end.date) of marked individuals (ID) at different sites. My goal is to generate a column that tells me the average number of other individuals per day that were also present at the same site (across the total residence period of each individual).
To do this, I need to determine the total number of individuals that were present per site on each date, summed across the total residence period of each individual. Ultimately, I will divide this sum by the total residence days of each individual to calculate the average. Can anyone help me accomplish this?
I calculated the total number of residence days (total.days) using lubridate and dplyr
mutate(total.days = end.date - start.date + 1)
site ID start.date end.date total.days
1 1 16 5/24/17 6/5/17 13
2 1 46 4/30/17 5/20/17 21
3 1 26 4/30/17 5/23/17 24
4 1 89 5/5/17 5/13/17 9
5 1 12 5/11/17 5/14/17 4
6 2 14 5/4/17 5/10/17 7
7 2 18 5/9/17 5/29/17 21
8 2 19 5/24/17 6/10/17 18
9 2 39 5/5/17 5/18/17 14
First of all, it is always advisable to give a sample of your data in a friendlier format using dput(yourData), so that others can easily regenerate it. Here is the dput() output you could have shared:
> dput(dat)
structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  ID = c(16, 46, 26, 89, 12, 14, 18, 19, 39),
  start.date = structure(c(17310, 17286, 17286, 17291, 17297,
    17290, 17295, 17310, 17291), class = "Date"),
  end.date = structure(c(17322, 17306, 17309, 17299, 17300,
    17296, 17315, 17327, 17304), class = "Date")),
  class = "data.frame", row.names = c(NA, -9L))
To do this easily we first need to unpack the start.date and end.date to individual dates:
newDat <- data.frame()
for (i in 1:nrow(dat)) {
  expand <- data.frame(site = dat$site[i],
                       ID = dat$ID[i],
                       Dates = seq.Date(dat$start.date[i], dat$end.date[i], 1))
  newDat <- rbind(newDat, expand)
}
newDat
site ID Dates
1 1 16 2017-05-24
2 1 16 2017-05-25
3 1 16 2017-05-26
4 1 16 2017-05-27
5 1 16 2017-05-28
6 1 16 2017-05-29
7 1 16 2017-05-30
. . .
. . .
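Side note: growing newDat with rbind() inside a loop gets slow as the data grows. A loop-free sketch of the same expansion (assuming reasonably recent dplyr and tidyr versions that support list columns in rowwise mutates and unnest()):
library(tidyr)

newDat <- dat %>%
  rowwise() %>%
  # one row per ID holding the full sequence of residence dates
  mutate(Dates = list(seq.Date(start.date, end.date, by = "day"))) %>%
  ungroup() %>%
  select(site, ID, Dates) %>%
  unnest(Dates)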
Then we calculate the number of other individuals present in each site in each day:
individualCount = newDat %>%
  group_by(site, Dates) %>%
  summarise(individuals = n_distinct(ID) - 1)
individualCount
# A tibble: 75 x 3
# Groups: site [?]
site Dates individuals
<dbl> <date> <int>
1 1 2017-04-30 1
2 1 2017-05-01 1
3 1 2017-05-02 1
4 1 2017-05-03 1
5 1 2017-05-04 1
6 1 2017-05-05 2
7 1 2017-05-06 2
8 1 2017-05-07 2
9 1 2017-05-08 2
10 1 2017-05-09 2
# ... with 65 more rows
Then, we augment our data with the new information using left_join() and calculate the required average:
newDat <- left_join(newDat, individualCount, by = c("site", "Dates")) %>%
group_by(site, ID) %>%
summarise(duration = max(Dates) - min(Dates)+1,
av.individuals = mean(individuals))
newDat
# A tibble: 9 x 4
# Groups: site [?]
site ID duration av.individuals
<dbl> <dbl> <time> <dbl>
1     1    12        4         2.75
2     1    16       13         0
3     1    26       24         1.42
4     1    46       21         1.62
5     1    89        9         2.33
6     2    14        7         1.14
7     2    18       21         0.857
8     2    19       18         0.333
9     2    39       14         1.14
The final step is to add the required column to the original dataset (dat) again with left_join():
dat <- dat %>% left_join(newDat, by = c("site", "ID"))
dat
site ID start.date end.date duration av.individuals
1 1 16 2017-05-24 2017-06-05 13 days 0.000000
2 1 46 2017-04-30 2017-05-20 21 days 1.619048
3 1 26 2017-04-30 2017-05-23 24 days 1.416667
4 1 89 2017-05-05 2017-05-13 9 days 2.333333
5 1 12 2017-05-11 2017-05-14 4 days 2.750000
6 2 14 2017-05-04 2017-05-10 7 days 1.142857
7 2 18 2017-05-09 2017-05-29 21 days 0.857143
8 2 19 2017-05-24 2017-06-10 18 days 0.333333
9 2 39 2017-05-05 2017-05-18 14 days 1.142857

For each column, sum scores by group over prior window of time

I have a large panel dataset (10,000,000 x 53) with about 50 columns of scores. I have aggregated each score by group (there are about 15,000) and date.
Now I want to calculate a rolling sum of three values including the prior two dates' and the current date's scores, creating a new corresponding sum column.
The sums should be calculated for each score column by date and group.
For 1st and 2nd dates within a group, fewer than 3 values is allowed.
GROUP DATE LAGGED SCORE1 SUM1 SCORE2 SUM2 ... SCORE50 SUM50
#1 A 2017-04-01 2017-03-30 1 1|1 2 2|2 4 4|4
#2 A 2017-04-02 2017-03-31 1 1+1|2 3 3+2|5 3 3+4|7
#3 A 2017-04-04 2017-04-02 2 2+1+1|4 4 4+3+2|9 2 2+3+4|9
#5 B 2017-04-02 2017-03-31 2 2|2 3 3|3 1 1|1
#6 B 2017-04-05 2017-04-03 2 2+2|4 2 2+3|5 1 1+1|2
#7 B 2017-04-08 2017-04-06 3 3+2+2|7 1 1+2+3|6 3 3+1+1|5
#8 C 2017-04-02 2017-03-31 3 3|3 1 1|1 1 1|1
#9 C 2017-04-03 2017-04-01 2 2+3|5 3 3+1|4 2 2+1|3
: : : : : : : : : :
#10M XX 2018-03-30 2018-03-28 2 2 1 1 ... 1 1
David's answer from this post covered most of my questions on summing rolling windows by groups but I'm still missing a couple pieces.
library(data.table) #v1.10.4
## Convert to a proper date class, and add another column
## in order to define the range
setDT(input)[, c("Date", "Date2") := {
Date = as.IDate(Date)
Date2 = Date - 2L
.(Date, Date2)
}]
## Run a non-equi join against the unique Date/Group combination in input
## Sum the Scores on the fly
## You can ignore the second Date column
input[unique(input, by = c("Date", "Group")), ## This removes the dupes
on = .(Group, Date <= Date, Date >= Date2), ## The join condition
.(Score = sum(Score)), ## sum the scores
keyby = .EACHI] ## Run the sum by each row in
## unique(input, by = c("Date", "Group"))
My question has two parts:
What code should replace "Score" to calculate time window sums for each column in a range of columns?
Is the provided solution the most efficient version for fast calculation on a large dataset?
A possible solution:
cols <- grep('^SCORE', names(input), value = TRUE)

input[, gsub('SCORE', 'SUM', cols) := lapply(.SD, cumsum)
      , by = GROUP
      , .SDcols = cols][]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2
1: A 2017-04-01 2017-03-30 1 2 1 2
2: A 2017-04-02 2017-03-31 1 3 2 5
3: A 2017-04-04 2017-04-02 2 4 4 9
4: B 2017-04-02 2017-03-31 2 3 2 3
5: B 2017-04-05 2017-04-03 2 2 4 5
6: B 2017-04-08 2017-04-06 3 1 7 6
7: C 2017-04-02 2017-03-31 3 1 3 1
8: C 2017-04-03 2017-04-01 2 3 5 4
When you want to take a time window into account as well, you could do (assuming LAGGED is the start of the time-window):
input[input[input[, .(GROUP, DATE, LAGGED)]
            , on = .(GROUP, DATE >= LAGGED, DATE <= DATE)
            ][, setNames(lapply(.SD, sum), gsub('SCORE', 'SUM', cols))
              , by = .(GROUP, DATE = DATE.1)
              , .SDcols = cols]
      , on = .(GROUP, DATE)]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2
1: A 2017-04-01 2017-03-30 1 2 1 2
2: A 2017-04-02 2017-03-31 1 3 2 5
3: A 2017-04-04 2017-04-02 2 4 3 7
4: B 2017-04-02 2017-03-31 2 3 2 3
5: B 2017-04-05 2017-04-03 2 2 2 2
6: B 2017-04-08 2017-04-06 3 1 3 1
7: C 2017-04-02 2017-03-31 3 1 3 1
8: C 2017-04-03 2017-04-01 2 3 5 4
Used data:
input <- fread(' GROUP DATE LAGGED SCORE1 SCORE2
A 2017-04-01 2017-03-30 1 2
A 2017-04-02 2017-03-31 1 3
A 2017-04-04 2017-04-02 2 4
B 2017-04-02 2017-03-31 2 3
B 2017-04-05 2017-04-03 2 2
B 2017-04-08 2017-04-06 3 1
C 2017-04-02 2017-03-31 3 1
C 2017-04-03 2017-04-01 2 3')
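If the window is meant as the last three rows per group (which is what the expected output in the question shows, with fewer than three values allowed at the start of a group), frollsum() with adaptive window sizes is a fast alternative. A sketch, assuming a data.table version of at least 1.12.0, where frollsum() was introduced:
input[, gsub('SCORE', 'SUM', cols) :=
        # window size grows 1, 2, 3, 3, ... within each group
        lapply(.SD, frollsum, n = pmin(seq_len(.N), 3L), adaptive = TRUE),
      by = GROUP, .SDcols = cols][]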
