Aggregate one data frame by time intervals from another data frame - r

I'm trying to aggregate two data frames (df1 and df2).
The first contains 3 variables: ID, Date1 and Date2.
df1
ID Date1 Date2
1 2016-03-01 2016-04-01
1 2016-04-01 2016-05-01
2 2016-03-14 2016-04-15
2 2016-04-15 2016-05-17
3 2016-05-01 2016-06-10
3 2016-06-10 2016-07-15
The second also contains 3 variables: ID, Date3 and Value.
df2
ID Date3 Value
1 2016-03-15 5
1 2016-04-04 7
1 2016-04-28 7
2 2016-03-18 3
2 2016-03-27 5
2 2016-04-08 9
2 2016-04-20 2
3 2016-05-05 6
3 2016-05-25 8
3 2016-06-13 3
The idea is to get, for each df1 row, the sum of df2$Value that have the same ID and for which Date3 is between Date1 and Date2:
ID Date1 Date2 SumValue
1 2016-03-01 2016-04-01 5
1 2016-04-01 2016-05-01 14
2 2016-03-14 2016-04-15 17
2 2016-04-15 2016-05-17 2
3 2016-05-01 2016-06-10 14
3 2016-06-10 2016-07-15 3
I know how to do this with a loop, but the data frames are huge! Does anyone have an efficient solution? I have explored data.table, plyr and dplyr but could not find one.

A couple of data.table solutions that should scale well (and a good stop-gap until non-equi joins are implemented):
Do the comparison in j using by=.EACHI.
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df1[df2,
    {
      idx = Date1 <= i.Date3 & i.Date3 <= Date2
      .(Date1 = Date1[idx],
        Date2 = Date2[idx],
        Date3 = i.Date3,
        Value = i.Value)
    },
    on = "ID",
    by = .EACHI][, .(sumValue = sum(Value)), by = .(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
foverlaps() join (as suggested in the comments):
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df2[, Date4 := Date3]
setkey(df1, ID, Date1, Date2)
foverlaps(df2,
          df1,
          by.x = c("ID", "Date3", "Date4"),
          type = "within")[, .(sumValue = sum(Value)), by = .(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
Further reading
Rolling join on data.table with duplicate keys
foverlaps joins in data.table

With the non-equi joins feature implemented in data.table v1.9.7 (the development version at the time), this can be done as follows:
df2[df1, .(sum = sum(Value)), on = .(ID, Date3 >= Date1, Date3 <= Date2), by = .EACHI]
# ID Date3 Date3 sum
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
The column names need some fixing; I'll work on it later.
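In the meantime, one workaround is to rename the duplicated join columns by position with setnames(). A self-contained sketch (it rebuilds the example data as data.tables dt1/dt2; the positional renaming is my addition, not part of the original answer):

```r
library(data.table)

dt1 <- data.table(ID = c(1L, 1L, 2L, 2L, 3L, 3L),
                  Date1 = as.Date(c("2016-03-01", "2016-04-01", "2016-03-14",
                                    "2016-04-15", "2016-05-01", "2016-06-10")),
                  Date2 = as.Date(c("2016-04-01", "2016-05-01", "2016-04-15",
                                    "2016-05-17", "2016-06-10", "2016-07-15")))
dt2 <- data.table(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
                  Date3 = as.Date(c("2016-03-15", "2016-04-04", "2016-04-28",
                                    "2016-03-18", "2016-03-27", "2016-04-08",
                                    "2016-04-20", "2016-05-05", "2016-05-25",
                                    "2016-06-13")),
                  Value = c(5, 7, 7, 3, 5, 9, 2, 6, 8, 3))

res <- dt2[dt1, .(sum = sum(Value)),
           on = .(ID, Date3 >= Date1, Date3 <= Date2), by = .EACHI]
# both join columns come back named "Date3", so rename by position
setnames(res, 2:3, c("Date1", "Date2"))
res$sum
# [1]  5 14 17  2 14  3
```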

Here's a base R solution using sapply():
df1 <- data.frame(ID=c(1L,1L,2L,2L,3L,3L),
                  Date1=as.Date(c('2016-03-01','2016-04-01','2016-03-14','2016-04-15','2016-05-01','2016-06-01')),
                  Date2=as.Date(c('2016-04-01','2016-05-01','2016-04-15','2016-05-17','2016-06-15','2016-07-15')));
df2 <- data.frame(ID=c(1L,1L,1L,2L,2L,2L,2L,3L,3L,3L),
                  Date3=as.Date(c('2016-03-15','2016-04-04','2016-04-28','2016-03-18','2016-03-27','2016-04-08','2016-04-20','2016-05-05','2016-05-25','2016-06-13')),
                  Value=c(5L,7L,7L,3L,5L,9L,2L,6L,8L,3L));
cbind(df1,SumValue=sapply(seq_len(nrow(df1)),function(ri)
  sum(df2$Value[df1$ID[ri]==df2$ID & df1$Date1[ri]<=df2$Date3 & df1$Date2[ri]>df2$Date3])));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3
Note that your df1 and expected output have slightly different dates in some cases; I used the df1 dates.
Here's another approach that is more vectorized: precompute a Cartesian product of row indexes into the two frames, evaluate a single vectorized conditional over those index vectors to find matching pairs of indexes, and finally use the matching indexes to aggregate the desired result:
cbind(df1,SumValue=with(expand.grid(i1=seq_len(nrow(df1)),i2=seq_len(nrow(df2))),{
  x <- df1$ID[i1]==df2$ID[i2] & df1$Date1[i1]<=df2$Date3[i2] & df1$Date2[i1]>df2$Date3[i2];
  tapply(df2$Value[i2[x]],i1[x],sum);
}));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3


Match rows with the same or close start and end date in data.table r

I have the following data.table:
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
                 start_date=c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24"),
                 end_date=c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24"),
                 variable1=c("a","c","c","d","a",NA,"a","a","b"))
df
id start_date end_date variable1
1: 1 2019-05-08 2019-09-08 a
2: 2 2019-08-01 2019-12-01 c
3: 2 2019-07-12 2019-07-30 c
4: 2 2017-05-24 2017-11-24 d
5: 3 2016-05-08 2017-07-25 a
6: 3 2017-08-01 2018-08-01 <NA>
7: 4 2019-06-12 2019-12-12 a
8: 4 2017-02-24 2017-08-24 a
9: 4 2017-08-24 2018-08-24 b
Within the same ID, I want to compare the start_date and end_date. If the end_date of one row is within 30 days of the start_date of another row, I want to combine the rows. So that it looks like this:
id start_date end_date variable1
1: 1 2019-05-08 2019-09-08 a
2: 2 2019-07-12 2019-12-01 c
3: 2 2017-05-24 2017-11-24 d
4: 3 2016-05-08 2018-08-01 a
5: 4 2019-06-12 2019-12-12 a
6: 4 2017-02-24 2017-08-24 a
7: 4 2017-08-24 2018-08-24 b
If the other variables of the rows are the same, the rows should be combined with the earliest start_date and the latest end_date, as for id number 2. If variable1 is NA, it should be replaced with the value from the matching row, as for id number 3. If variable1 has different values, the rows should remain separate, as for id number 4.
The data.table contains more variables and objects than displayed here. Preferable a function in data.table.
It's not clear what happens if an id has 3 overlapping rows with variable1 = c('a', NA, 'b'): what should variable1 be for the NA row in that case, a or b?
If we just choose the first variable1 when there are multiple matches, here is an option: first fill the NAs, then borrow the idea from David Arenburg's solution here:
setorder(df, id, start_date, end_date)
df[, end_d := end_date + 30L]
df[is.na(variable1), variable1 :=
     df[!is.na(variable1)][.SD, on = .(id, start_date <= start_date, end_d >= start_date),
                           mult = "first", x.variable1]]
df[, g := c(0L, cumsum(shift(start_date, -1L) > cummax(as.integer(end_d)))[-.N]), id][,
   .(start_date = min(start_date), end_date = max(end_date)), .(id, variable1, g)]
output:
id variable1 g start_date end_date
1: 1 a 0 2019-05-08 2019-09-08
2: 2 d 0 2017-05-24 2017-11-24
3: 2 c 1 2019-07-12 2019-12-01
4: 3 a 0 2016-05-08 2018-08-01
5: 4 a 0 2017-02-24 2017-08-24
6: 4 b 0 2017-08-24 2018-08-24
7: 4 a 1 2019-06-12 2019-12-12
data:
library(data.table)
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
start_date=as.IDate(c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24")),
end_date=as.IDate(c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24")),
variable1=c("a","c","c","d","a",NA,"a","a","b"))
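A note on the grouping trick in the last step: within each id (sorted by start_date), cummax() tracks the latest end date seen so far, and the cumulative sum bumps the group index whenever the next row's start_date jumps past that running maximum. A minimal sketch of just that step, using made-up dates:

```r
library(data.table)

dt <- data.table(id = c(1L, 1L, 1L),
                 start_date = as.IDate(c("2019-01-01", "2019-02-01", "2019-06-01")),
                 end_d      = as.IDate(c("2019-02-10", "2019-03-01", "2019-07-01")))
# rows 1-2 overlap (the second start precedes the running max end date);
# row 3 starts after it, so it opens a new group
dt[, g := c(0L, cumsum(shift(start_date, -1L) > cummax(as.integer(end_d)))[-.N]), by = id]
dt$g
# [1] 0 0 1
```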

Wrong column on data.table merge

Let's say I have these two tables:
library(data.table)
x <- data.table(Date = as.Date(c("1990-01-29", "1990-01-30",
                                 "1990-01-31", "1990-02-01",
                                 "1990-02-02", "1990-02-05",
                                 "1990-02-06", "1990-02-07",
                                 "1990-02-08", "1990-02-09")),
                a = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55))
y <- data.table(Date1 = as.Date(c("1990-01-31", "1990-02-06", "1990-02-07")),
                Date2 = as.Date(c("1990-02-06", "1990-02-07", "1990-02-09")),
                b = c(5, 2, 4))
Table y is really a descriptor of different "periods" starting at Date1 and ending at Date2 (such that one row's Date2 is the next row's Date1), with a (non-unique) descriptor of that period.
I'd now like to merge these tables so that for each date of x I have both a and the respective y$b (dates outside of any period should be dropped). I tried the following, but it's not right:
x[y, on = .(Date > Date1, Date <= Date2)]
# Date a Date.1 b
# 1: 1990-01-31 3 1990-02-06 5
# 2: 1990-01-31 5 1990-02-06 5
# 3: 1990-01-31 8 1990-02-06 5
# 4: 1990-01-31 13 1990-02-06 5
# 5: 1990-02-06 21 1990-02-07 2
# 6: 1990-02-07 34 1990-02-09 4
# 7: 1990-02-07 55 1990-02-09 4
Specifically, the Date column isn't x$Date, but actually y$Date1, repeated as necessary, while the Date.1 column is Date2.
Meanwhile, the expected output would be
# Date x y
# 1: 1990-02-01 3 5
# 2: 1990-02-02 5 5
# 3: 1990-02-05 8 5
# 4: 1990-02-06 13 5
# 5: 1990-02-07 21 2
# 6: 1990-02-08 34 4
# 7: 1990-02-09 55 4
It may be better to create a duplicate column
x[,.(Daten = Date, Date, a)][y,
on = .(Date > Date1, Date <= Date2)][, .(Date = Daten, a, b)]
# Date a b
#1: 1990-02-01 3 5
#2: 1990-02-02 5 5
#3: 1990-02-05 8 5
#4: 1990-02-06 13 5
#5: 1990-02-07 21 2
#6: 1990-02-08 34 4
#7: 1990-02-09 55 4
You can refer to the columns of each table using the x. and i. prefixes:
x[y,
on = .(Date > Date1, Date <= Date2),
.(Date = x.Date, x = x.a, y = i.b)]
Date x y
1: 1990-02-01 3 5
2: 1990-02-02 5 5
3: 1990-02-05 8 5
4: 1990-02-06 13 5
5: 1990-02-07 21 2
6: 1990-02-08 34 4
7: 1990-02-09 55 4

Assign values between dates to a dataframe in r

How do I use a date in one dataframe, let's say dataframe 1, as a reference for selecting a value from another dataframe, dataframe 2, when my date in dataframe 1 falls between a start date variable and an end date variable in dataframe 2?
For example, I have two dataframes. The first one only has dates; we will call it "date".
library(lubridate)
date <- ymd(c("2017-06-01", "2013-01-01", "2014-03-01", "2008-01-01","2011-03-01","2009-03-01","2012-03-01","2015-08-01","2008-08-01"))
date <- as.data.frame(date)
> date
date
1 2017-06-01
2 2013-01-01
3 2014-03-01
4 2008-01-01
5 2011-03-01
6 2009-03-01
7 2012-03-01
8 2015-08-01
9 2008-08-01
My other dataframe, "df2", contains the start and end dates and a value that is to be assigned to the dataframe "date" whenever date$date falls between the start date and the end date in "df2".
start_date <- dmy(c("1/6/2001","1/6/2002","1/6/2003","1/10/2011","1/11/2015","1/1/2016","1/1/2017","1/1/2018"))
end_date <-dmy(c("1/5/2002","1/5/2003","1/9/2011","1/10/2015","1/12/2015","1/12/2016","1/12/2017","1/12/2018"))
value <- c(2400,3600,4800,7000,7350,7717.5,8103.38,8508.54)
df2 <- data.frame(start_date, end_date, value)
> df2
start_date end_date value
1 2001-06-01 2002-05-01 2400.00
2 2002-06-01 2003-05-01 3600.00
3 2003-06-01 2011-09-01 4800.00
4 2011-10-01 2015-10-01 7000.00
5 2015-11-01 2015-12-01 7350.00
6 2016-01-01 2016-12-01 7717.50
7 2017-01-01 2017-12-01 8103.38
8 2018-01-01 2018-12-01 8508.54
In the end I would have this result:
date value
1 2017-06-01 8103.38
2 2013-01-01 7000.00
3 2014-03-01 7000.00
4 2008-01-01 4800.00
5 2011-03-01 4800.00
6 2009-03-01 4800.00
7 2012-03-01 7000.00
8 2015-08-01 7000.00
9 2008-08-01 4800.00
Using data.table, you can specify the join condition on the fly:
library(data.table)
setDT(date)   # the "date" data frame
setDT(df2)
date[df2, on = .(date >= start_date, date <= end_date), value := i.value]
print(date)
         date   value
1: 2017-06-01 8103.38
2: 2013-01-01 7000.00
3: 2014-03-01 7000.00
4: 2008-01-01 4800.00
5: 2011-03-01 4800.00
6: 2009-03-01 4800.00
7: 2012-03-01 7000.00
8: 2015-08-01 7000.00
9: 2008-08-01 4800.00
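For comparison, the same lookup can be sketched in base R with sapply(), assuming each date falls into at most one period. This is illustrative only and uses a trimmed-down version of the question's data:

```r
# trimmed-down versions of the question's data (illustrative only)
date <- as.Date(c("2017-06-01", "2013-01-01", "2008-01-01"))
df2 <- data.frame(start_date = as.Date(c("2003-06-01", "2011-10-01", "2017-01-01")),
                  end_date   = as.Date(c("2011-09-01", "2015-10-01", "2017-12-01")),
                  value      = c(4800, 7000, 8103.38))

vals <- sapply(seq_along(date), function(k) {
  # which period (if any) contains this date?
  hit <- df2$start_date <= date[k] & date[k] <= df2$end_date
  if (any(hit)) df2$value[which(hit)[1]] else NA_real_
})
vals
# [1] 8103.38 7000.00 4800.00
```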

Splitting time stamps into two columns

I have a column with ID and, for each ID, several event dates. I want to create two columns with rows for each id: one column with the date and the other with the next consecutive date. The next row for that ID should have, in its first column, the entry from the previous row's second column, paired with the next consecutive date for the ID. An example:
This is the data I have
id date
1 1 2015-01-01
2 1 2015-01-18
3 1 2015-08-02
4 2 2015-01-01
5 2 2015-01-13
6 3 2015-01-01
This is data I want
id date1 date2
1 1 2015-01-01 2015-01-18
2 1 2015-01-18 2015-08-02
3 1 2015-08-02 NA
4 2 2015-01-01 2015-01-13
5 2 2015-01-13 NA
6 3 2015-01-01 NA
Using dplyr:
library(dplyr)
df %>% group_by(id) %>%
mutate(date2 = lead(date))
id date date2
(int) (fctr) (fctr)
1 1 2015-01-01 2015-01-18
2 1 2015-01-18 2015-08-02
3 1 2015-08-02 NA
4 2 2015-01-01 2015-01-13
5 2 2015-01-13 NA
6 3 2015-01-01 NA
Using data.table, you can do as follows:
require(data.table)
DT[, .(date1 = date, date2 = shift(date, type = "lead")), by = id]
Or simply (also mentioned by @docendodiscimus):
DT[, date2 := shift(date, type = "lead"), by = id]
Also, if you are interested in creating n lead columns recursively (edited, taking advantage of @docendodiscimus's comment to simplify the code):
i = 1:5
DT[, paste0("date", i+1) := shift(date, i, type = "lead"), by = id]
Base R solution using transform() and ave():
transform(df,date1=date,date2=ave(date,id,FUN=function(x) c(x[-1L],NA)),date=NULL);
## id date1 date2
## 1 1 2015-01-01 2015-01-18
## 2 1 2015-01-18 2015-08-02
## 3 1 2015-08-02 <NA>
## 4 2 2015-01-01 2015-01-13
## 5 2 2015-01-13 <NA>
## 6 3 2015-01-01 <NA>
The above line of code produces a copy of the data.frame. The return value can be assigned over the original df, assigned to a new variable, or passed as an argument/operand to a function/operator. If you want to modify it in-place, which would be a more efficient way to overwrite df, you can do this:
df$date2 <- ave(df$date,df$id,FUN=function(x) c(x[-1L],NA));
colnames(df)[colnames(df)=='date'] <- 'date1';
df;
## id date1 date2
## 1 1 2015-01-01 2015-01-18
## 2 1 2015-01-18 2015-08-02
## 3 1 2015-08-02 <NA>
## 4 2 2015-01-01 2015-01-13
## 5 2 2015-01-13 <NA>
## 6 3 2015-01-01 <NA>
df$date2 = ifelse(df$id==c(df$id[-1],-1), c(df$date[-1],NA), NA)
# note: ifelse() drops the Date class, so you may need as.Date(df$date2, origin="1970-01-01") afterwards

"Unnesting" a dataframe in R

I have the following data.frame:
df <- data.frame(id=c(1,2,3),
                 first.date=as.Date(c("2014-01-01", "2014-03-01", "2014-06-01")),
                 second.date=as.Date(c("2015-01-01", "2015-03-01", "2015-06-1")),
                 third.date=as.Date(c("2016-01-01", "2017-03-01", "2018-06-1")),
                 fourth.date=as.Date(c("2017-01-01", "2018-03-01", "2019-06-1")))
> df
id first.date second.date third.date fourth.date
1 1 2014-01-01 2015-01-01 2016-01-01 2017-01-01
2 2 2014-03-01 2015-03-01 2017-03-01 2018-03-01
3 3 2014-06-01 2015-06-01 2018-06-01 2019-06-01
Each row represents three timespans; i.e. the time spans between first.date and second.date, second.date and third.date, and third.date and fourth.date respectively.
I would like to, for lack of a better word, unnest the dataframe to obtain this instead:
id StartDate EndDate
1 1 2014-01-01 2015-01-01
2 1 2015-01-01 2016-01-01
3 1 2016-01-01 2017-01-01
4 2 2014-03-01 2015-03-01
5 2 2015-03-01 2017-03-01
6 2 2017-03-01 2018-03-01
7 3 2014-06-01 2015-06-01
8 3 2015-06-01 2018-06-01
9 3 2018-06-01 2019-06-01
I have been playing around with the unnest function from the tidyr package, but I came to the conclusion that I don't think it's what I'm really looking for.
Any suggestions?
You can try tidyr/dplyr as follows:
library(tidyr)
library(dplyr)
df %>%
  gather(DateType, StartDate, -id) %>%
  select(-DateType) %>%
  arrange(id) %>%
  group_by(id) %>%
  mutate(EndDate = lead(StartDate))
You can eliminate the last row in each id group by appending %>% slice(-4) to the above pipeline.
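Note that slice(-4) relies on every id having exactly four date columns. A variant that doesn't depend on the group size (sketched here on a hypothetical smaller frame) simply drops the rows whose EndDate is NA:

```r
library(tidyr)
library(dplyr)

df <- data.frame(id = c(1, 2),
                 first.date  = as.Date(c("2014-01-01", "2014-03-01")),
                 second.date = as.Date(c("2015-01-01", "2015-03-01")),
                 third.date  = as.Date(c("2016-01-01", "2017-03-01")))

res <- df %>%
  gather(DateType, StartDate, -id) %>%
  select(-DateType) %>%
  arrange(id) %>%
  group_by(id) %>%
  mutate(EndDate = lead(StartDate)) %>%
  filter(!is.na(EndDate))   # drops the trailing row of each group, whatever the group size
```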
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), melt the dataset to long format, use shift with type='lead' grouped by 'id', and then remove the NA elements.
library(data.table)
na.omit(melt(setDT(df), id.var='id')[, shift(value, 0:1, type='lead'), id])
# id V1 V2
#1: 1 2014-01-01 2015-01-01
#2: 1 2015-01-01 2016-01-01
#3: 1 2016-01-01 2017-01-01
#4: 2 2014-03-01 2015-03-01
#5: 2 2015-03-01 2017-03-01
#6: 2 2017-03-01 2018-03-01
#7: 3 2014-06-01 2015-06-01
#8: 3 2015-06-01 2018-06-01
#9: 3 2018-06-01 2019-06-01
The column names can be changed by using either setnames or earlier in the shift step.
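For example, both renaming variants side by side (a sketch on a reduced copy of the question's data):

```r
library(data.table)

df <- data.frame(id = c(1, 2),
                 first.date  = as.Date(c("2014-01-01", "2014-03-01")),
                 second.date = as.Date(c("2015-01-01", "2015-03-01")),
                 third.date  = as.Date(c("2016-01-01", "2017-03-01")))

# option 1: rename the default V1/V2 result columns afterwards with setnames()
res <- na.omit(melt(setDT(df), id.var = "id")[, shift(value, 0:1, type = "lead"), by = id])
setnames(res, c("V1", "V2"), c("StartDate", "EndDate"))

# option 2: name the columns inside j in the shift step itself
res2 <- na.omit(melt(df, id.var = "id")[
  , .(StartDate = value, EndDate = shift(value, 1L, type = "lead")), by = id])
```

Both produce the same four StartDate/EndDate rows.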
