Merging multiple dataframes based on date range - R

I have this data
id<-c("1","3")
Outcome<-c("Balanced","Balanced")
FromDate<-as.Date(c("2016-01-01","2016-01-01"),'%Y-%m-%d')
ToDate<-as.Date(c("2017-01-01","2017-01-01"),'%Y-%m-%d')
type<-c("ccc")
data<-data.frame(id,Outcome,FromDate,ToDate,type)
R> data
id Outcome FromDate ToDate type
1 Balanced 2016-01-01 2017-01-01 ccc
3 Balanced 2016-01-01 2017-01-01 ccc
refno<-c("1","2","1","1")
sedolnumber<-c("ABC123","XYZ12","ABC123","ZZZ123")
order_placement_date<-as.Date(c("2016-02-01","2017-02-05","2017-02-01","2016-04-01"),'%Y-%m-%d')
units_buyed<-c("1000","200","1000","1000")
buy<-data.frame(refno,sedolnumber,order_placement_date,units_buyed)
R> buy
refno sedolnumber order_placement_date units_buyed
1 ABC123 2016-02-01 1000
2 XYZ12 2017-02-05 200
1 ABC123 2017-02-01 1000
1 ZZZ123 2016-04-01 1000
refno<-c("1","1")
sedolnumber<-c("ABC123","ABC123")
sell_placement_date<-as.Date(c("2016-05-01","2017-05-01"),'%Y-%m-%d')
units_sold<-c("500","500")
sell<-data.frame(refno,sedolnumber,sell_placement_date,units_sold)
R> sell
refno sedolnumber sell_placement_date units_sold
1 ABC123 2016-05-01 500
1 ABC123 2017-05-01 500
I want to join all three tables based on the conditions below and add one more column, Units_Retained, which would be the difference buy.units_buyed - sell.units_sold:
data.id=buy.refno
buy.order_placement_date >= data.FromDate AND buy.order_placement_date < data.ToDate
data.id=sell.refno
buy.sedolnumber=sell.sedolnumber
sell.sell_placement_date >= data.FromDate AND sell.sell_placement_date < data.ToDate
The Units_Retained column should show the value of units_buyed if units_sold is null, and 0 if both units_buyed and units_sold are null.
I am doing this with the sqldf package. Are there any functions in R to achieve this without using sqldf? The data table is my parent table and the result should show blank values if matching records in the buy and sell tables are not found.
R>sqldf("SELECT a.id,a.outcome,a.FromDate,a.ToDate,a.type,b.sedolnumber,b.order_placement_date,b.units_buyed,c.units_sold,c.sell_placement_date,(b.units_buyed-c.units_sold) as Units_Retained
FROM data a LEFT JOIN buy b ON (a.id=b.refno AND b.order_placement_date>=a.FromDate AND b.order_placement_date<a.ToDate)
LEFT JOIN sell c ON(a.id=c.refno AND c.sell_placement_date>=a.FromDate AND c.sell_placement_date<a.ToDate AND b.sedolnumber=c.sedolnumber) ")
id Outcome FromDate ToDate type sedolnumber order_placement_date units_buyed units_sold sell_placement_date Units_Retained
1 Balanced 2016-01-01 2017-01-01 ccc ABC123 2016-02-01 1000 500 2016-05-01 500
1 Balanced 2016-01-01 2017-01-01 ccc ZZZ123 2016-04-01 1000 <NA> <NA> NA
3 Balanced 2016-01-01 2017-01-01 ccc <NA> <NA> <NA> <NA> <NA> NA

You can use dplyr verbs
library(dplyr)
get_units_retained <- function(units_buyed, units_sold) {
  units_buyed <- as.numeric(as.character(units_buyed))
  units_sold <- as.numeric(as.character(units_sold))
  if_else(is.na(units_buyed), 0, units_buyed) - if_else(is.na(units_sold), 0, units_sold)
}

left_join(data, buy, by = c("id" = "refno")) %>%
  left_join(sell, by = c("id" = "refno", "sedolnumber")) %>%
  filter(
    (order_placement_date >= FromDate & order_placement_date < ToDate) | is.na(order_placement_date),
    (sell_placement_date >= FromDate & sell_placement_date < ToDate) | is.na(sell_placement_date)
  ) %>%
  mutate(Units_Retained = get_units_retained(units_buyed, units_sold))

You can use data.table if you are working with really large data:
require(data.table)

data <- as.data.table(data)
buy  <- as.data.table(buy)
sell <- as.data.table(sell)

setkey(data, id)
setkey(buy, refno)
setkey(sell, refno, sedolnumber)

dd <- setkey(data[buy, nomatch = 0],
             id,
             sedolnumber
)[
  sell, nomatch = 0
][
  order_placement_date >= FromDate & order_placement_date < ToDate &
    sell_placement_date >= FromDate & sell_placement_date < ToDate,
][,
  Units_Retained := as.numeric(as.character(units_buyed)) - as.numeric(as.character(units_sold))
]

Related

Filter data by last 12 Months of the total data available in R

I have a dataset with sales values for N products, from some yyyy-mm-dd to some yyyy-mm-dd, and I just want to filter the data for the last 12 months for each product in the dataset.
E.g.: say I have values from 2016-01-01 to 2020-02-01; I now want to filter the sales values for the last 12 months, that is, from 2019-02-01 to 2020-02-01.
I cannot simply write filter(Month >= as.Date("2019-04-01") & Month <= as.Date("2020-04-01")) because the end date keeps changing as every month passes, so I need to automate this.
You can use:
library(dplyr)
library(lubridate)

data %>%
  group_by(Product) %>%
  filter(between(date, max(date) - years(1), max(date)))
  #filter(date >= (max(date) - years(1)) & date <= max(date))
You can test whether the date is greater than or equal to the maximal date per product minus 365 days:
library(dplyr)
df %>%
  group_by(Products) %>%
  filter(Date >= max(Date) - 365)
# A tibble: 6 x 2
# Groups: Products [3]
Products Date
<dbl> <date>
1 1 2002-01-21
2 1 2002-02-10
3 2 2002-02-24
4 2 2002-02-10
5 2 2001-07-01
6 3 2005-03-10
Data
df <- data.frame(
Products = c(1,1,1,1,2,2,2,3,3,3),
Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
"2000-06-01", "2002-02-24", "2002-02-10",
"2001-07-01", "2003-01-02", "2005-03-10",
"2002-05-01")))
If your aim is to just capture entries from today back to the same day last year, then:
The function Sys.Date() returns the current date as an object of type Date. You can then convert that to POSIXlt form to adjust the year and get the start date. For example:
end.date <- Sys.Date()
end.date.lt <- as.POSIXlt(end.date)
start.date.lt <- end.date.lt
start.date.lt$year <- start.date.lt$year - 1
start.date <- as.POSIXct(start.date.lt)
Now this does have one potential fail-state: if today is February 29th. One way to deal with that would be to write a "today.last.year" function to do the above conversion, but give an explicit treatment for leap years - possibly including an option to count "today last year" as either February 28th or March 1st, depending on which gives you the desired behaviour.
Alternatively, if you wanted to filter based on a start-of-month date, you can make your function also set start.date.lt$mday <- 1, and so on if you need to adjust in other ways.
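For instance, a minimal sketch of such a helper for a single reference date (the name today.last.year and the feb29 / first_of_month options are purely illustrative, not an existing function), returning a Date you can then use in a filter:
today.last.year <- function(date = Sys.Date(),
                            feb29 = c("feb28", "mar1"),
                            first_of_month = FALSE) {
  # assumed policy for Feb 29: map it to Feb 28 or to Mar 1 of the previous year
  feb29 <- match.arg(feb29)
  lt <- as.POSIXlt(date)
  lt$year <- lt$year - 1
  # POSIXlt months are 0-based, so mon == 1 is February
  if (lt$mon == 1 && lt$mday == 29) {
    if (feb29 == "feb28") {
      lt$mday <- 28
    } else {
      lt$mon <- 2   # March
      lt$mday <- 1
    }
  }
  if (first_of_month) lt$mday <- 1   # snap to the start of the month if requested
  as.Date(lt)
}
start.date <- today.last.year()
# e.g. df %>% filter(Date >= start.date)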
Input:
product date
1: a 2017-01-01
2: b 2017-04-01
3: a 2017-07-01
4: b 2017-10-01
5: a 2018-01-01
6: b 2018-04-01
7: a 2018-07-01
8: b 2018-10-01
9: a 2019-01-01
10: b 2019-04-01
11: a 2019-07-01
12: b 2019-10-01
Code:
library(lubridate)
library(data.table)
DT <- data.table(
  product = rep(c("a", "b"), 6),
  date = seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "quarter")
)

yearBefore <- function(x){
  year(x) <- year(x) - 1
  x
}

date_DT <- DT[, .(last_date = last(date)), by = product]
date_DT[, year_before := yearBefore(last_date)]
result <- DT[, date_DT[DT, on = .(product, year_before <= date), nomatch = 0]]
result[, last_date := NULL]
setnames(result, "year_before", "date")
Output:
product date
1: a 2018-07-01
2: b 2018-10-01
3: a 2019-01-01
4: b 2019-04-01
5: a 2019-07-01
6: b 2019-10-01
Is this what you are looking for?

efficient way of selecting rows with a minimum time spacing between dates while grouping

I want to select rows of data with dates such that the dates have a minimum time difference of 3 months.
Here is an example:
patient numsermed date
1: 1 numser1 2020-01-08
2: 2 numser2 2015-01-02
3: 2 numser2 2019-12-12
4: 2 numser2 2020-01-05
5: 2 numser2 2020-01-08
6: 2 numser2 2020-01-20
7: 2 numser2 2020-03-15
8: 2 numser2 2020-03-18
9: 2 numser3 2020-03-13
10: 2 numser3 2020-03-18
11: 3 numser3 2020-01-22
12: 4 numser4 2018-01-02
I want, by patient and numsermed, to keep the dates that are at least 3 months apart. I cannot simply use the successive differences. The expected result is:
patient numsermed date
1: 1 numser1 2020-01-08
2: 2 numser2 2015-01-02
3: 2 numser2 2019-12-12
4: 2 numser2 2020-03-15
5: 2 numser3 2020-03-13
6: 3 numser3 2020-01-22
7: 4 numser4 2018-01-02
Here, for patient 2 and numser2, after 2019-12-12 the next date at least 3 months later is 2020-03-15, which I keep. I thus remove 2020-01-05, 2020-01-08 and 2020-01-20.
I then remove 2020-03-18, which is 3 days after 2020-03-15.
Here is my solution with data.table:
library(data.table)
library(lubridate)

setkeyv(test, c("numsermed", "patient", "date"))
test[, indx := .I - .I[1] + 1, by = .(numsermed, patient)]
test[, N := .N, by = .(numsermed, patient)]   # group size, used in the loop below
Nmax <- max(test[, .N, by = .(numsermed, patient)]$N)
test[, supp := 0]
for(i in 1:Nmax){
  test[N > 1, supp := ifelse(i < indx & date < date[i] + 90, 1, 0),
       by = .(numsermed, patient)]
  test <- test[supp != 1]
  test[, indx := .I - .I[1] + 1, by = .(numsermed, patient)]
}
The idea is, for each row, to test the condition and then perform the subset. It seems to work, but on a million-row table it is rather slow (a few hours). I am sure there is an efficient way with a semi or non-equi join, or a rolling join, in data.table, but I did not manage to write it. Could someone come up with a more efficient solution? dplyr solutions are of course welcome too.
The data:
library(data.table)
library(lubridate)
test <- setDT(list(patient = c(1:3, 2), numsermed = c(paste0("numser", 1:3), "numser2"), date = as_date(c("2020-01-08", "2020-01-20", "2020-01-22", "2019-12-12"))))
test<-rbind(test,data.table(patient=4,numsermed="numser4",date=as_date("2018-01-02")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2015-01-02")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-03-15")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-01-05")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-01-08")))
test<-rbind(test,data.table(patient=2,numsermed="numser3",date=as_date("2020-03-13")))
test<-rbind(test,data.table(patient=2,numsermed="numser3",date=as_date("2020-03-18")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-03-18")))
Edit
I propose a comparison of the proposed solutions: #Ben's, #chinsoon12's and #astrofunkswag's.
Here is the test data:
library(data.table)
library(lubridate)
library(magrittr)
set.seed(1234)
origin <- "1970-01-01"
dt <- data.table(numsermed = sample(paste0("numsermed",1:30),10000,replace = T))
dt[,patient := sample(1:10000,.N,replace = T),by = numsermed]
dt[,date := sample((dmy("01.01.2019") %>% as.numeric()):(dmy("01.01.2020") %>% as.numeric()),.N),by = .(patient)]
and here are the 4 functions, including mine:
ben <- function(dt){
  dt[, c("idx", "date2") := list(.I, date - 90L)]
  dt_final <- unique(dt[dt, on = c(patient = "patient", numsermed = "numsermed", date = "date2"),
                        roll = -Inf][order(i.date)], by = "idx")
  setorderv(dt_final, c("patient", "numsermed", "i.date"))
  return(dt_final[, .(patient, numsermed, date = i.date)])
}

chinson <- function(dt){
  dt[, d := as.integer(date)]
  setkey(dt, date)
  return(dt[dt[, g := findInterval(d, seq(d[1L], d[.N] + 90L, by = 90L)), .(patient, numsermed)][,
            .I[1L], .(patient, numsermed, g)]$V1][, .(patient, numsermed, date)])
}

sum_reset_at <- function(thresh) {
  function(x) {
    accumulate(x, ~if_else(.x >= thresh, .y, .x + .y))
  }
}

mon_diff <- function(d1, d2){
  12 * as.numeric((as.yearmon(d1) - as.yearmon(d2)))
}

library(tidyverse); library(zoo)

astrofun <- function(dt){
  return(
    dt %>%
      group_by(patient, numsermed) %>%
      mutate(diff1 = mon_diff(date, lag(date)),
             diff1 = if_else(is.na(diff1), 300, diff1)) %>%
      mutate(diff2 = sum_reset_at(3)(diff1)) %>%
      filter(diff2 >= 3) %>%
      select(-contains('diff'))
  )
}

denis <- function(dt){
  df <- copy(dt)
  setkeyv(dt, c("numsermed", "patient", "date"))
  df[, indx := .I - .I[1] + 1, by = .(numsermed, patient)]
  df[, N := .N, by = .(numsermed, patient)]
  Nmax <- max(df[, N])
  df[, supp := 0]
  for(i in 1:Nmax){
    df[N > 1, supp := ifelse(i < indx & date < date[i] + 90, 1, 0),
       by = .(numsermed, patient)]
    df <- df[supp != 1]
    df[, indx := .I - .I[1] + 1, by = .(numsermed, patient)]
  }
  return(df[, .(patient, numsermed, date)])
}
First, none of them produce the same result! denis(dt) outputs 9833 rows, ben(dt) 9928, chinson(dt) 9929, and #astrofunkswag's astrofun(dt) outputs 9990 rows. I am not sure why they do not all produce the same output, nor which solution is the correct one (I would say mine just to be pretentious, but I am not even sure).
Then a benchmark to compare efficiency:
library(microbenchmark)
microbenchmark(ben(dt),
chinson(dt),
astrofun(dt),
denis(dt),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
ben(dt) 17.3841 19.8321 20.88349 20.9609 21.8815 23.5125 10
chinson(dt) 230.8868 232.6298 275.16637 236.8482 239.0144 544.2292 10
astrofun(dt) 4460.2159 4565.9120 4795.98600 4631.3251 5007.8055 5687.7717 10
denis(dt) 68.0480 68.4170 88.88490 80.9636 90.0514 142.9553 10
#Ben's solution with the rolling join is of course the fastest. Mine is not that bad, and #astrofunkswag's solution is super slow, because of the cumulative sum I guess.
With data.table you could try the following. This would involve creating a second date 90 days prior and then doing a rolling join.
library(data.table)
setDT(test[, c("idx", "date2") := list(.I, date - 90L)])
test_final <- unique(test[test, on = c(patient = "patient", numsermed = "numsermed", date = "date2"),
roll = -Inf][order(i.date)], by = "idx")
setorderv(test_final, c("patient", "numsermed", "i.date"))
test_final
Output
(i.date has the final date desired)
patient numsermed date idx date2 i.date i.idx
1: 1 numser1 2019-10-10 1 2019-10-10 2020-01-08 1
2: 2 numser2 2014-10-04 6 2014-10-04 2015-01-02 6
3: 2 numser2 2019-09-13 4 2019-09-13 2019-12-12 4
4: 2 numser2 2019-12-16 8 2019-10-07 2020-03-15 7
5: 2 numser3 2019-12-14 10 2019-12-14 2020-03-13 10
6: 3 numser3 2019-10-24 3 2019-10-24 2020-01-22 3
7: 4 numser4 2017-10-04 5 2017-10-04 2018-01-02 5
Here is a solution with dplyr and purrr. I use 2 helper functions, one to calculate month difference and one to calculate a cumulative sum that resets when a threshold is reached, credit to this post.
I calculate the month difference with the lagged date value, but you want to include the first one, which will be NA. One weird part is that, to include the NA, the easiest approach for me was to convert NA to some value 3 or greater; I arbitrarily made it 300. You could likely modify the sum_reset_at function to handle NA the way you want. You might also want to condense the code in some way, since I do multiple mutate calls and then deselect those columns, but I did it all in separate lines to make it clearer what was happening. I think this functional-programming solution will be quicker, but I haven't tested it on a large dataset against your current solution.
test <- test %>% arrange(patient, numsermed, date)

library(tidyverse); library(zoo)

mon_diff <- function(d1, d2){
  12 * as.numeric((as.yearmon(d1) - as.yearmon(d2)))
}

sum_reset_at <- function(thresh) {
  function(x) {
    accumulate(x, ~if_else(.x >= thresh, .y, .x + .y))
  }
}

test %>%
  group_by(patient, numsermed) %>%
  mutate(diff1 = mon_diff(date, lag(date)),
         diff1 = if_else(is.na(diff1), 300, diff1)) %>%
  mutate(diff2 = sum_reset_at(3)(diff1)) %>%
  filter(diff2 >= 3) %>%
  select(-contains('diff'))
  patient numsermed date
    <dbl> <chr>     <date>
1 1 numser1 2020-01-08
2 2 numser2 2015-01-02
3 2 numser2 2019-12-12
4 2 numser2 2020-03-15
5 2 numser3 2020-03-13
6 3 numser3 2020-01-22
7 4 numser4 2018-01-02
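As a side note, if you would rather not hard-code the 300 placeholder, one possible variant (an assumption about the behaviour you want: an NA gap, i.e. the first row of a group, always starts a new window) folds the NA handling into the helper itself:
sum_reset_at_na <- function(thresh) {
  function(x) {
    x[is.na(x)] <- thresh   # treat an NA gap as already large enough to reset
    accumulate(x, ~if_else(.x >= thresh, .y, .x + .y))
  }
}
# then e.g.: mutate(diff2 = sum_reset_at_na(3)(mon_diff(date, lag(date))))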
Another option using findInterval to group:
library(data.table)
DT[, d := as.integer(date)]
DT[DT[, g := findInterval(d, seq(d[1L], d[.N]+90L, by=90L)), .(patient, numsermed)][,
.I[1L], .(patient, numsermed, g)]$V1]
output:
patient numsermed date d g
1: 1 numser1 2020-01-08 18269 1
2: 2 numser2 2015-01-02 16437 1
3: 2 numser2 2019-12-12 18242 21
4: 2 numser2 2020-03-15 18336 22
5: 2 numser3 2020-03-13 18334 1
6: 3 numser3 2020-01-22 18283 1
7: 4 numser4 2018-01-02 17533 1
If you have many groups of patient and numsermed, Ben's solution using rolling join will be faster. And another way of coding the rolling join by chaining:
DT[, .(patient, numsermed, date=date+90L)][
DT, on=.NATURAL, roll=-Inf, .(patient, numsermed, x.date, i.date)][,
.(date=i.date[1L]), .(patient, numsermed, x.date)][,
x.date := NULL][]
Or more succinctly:
DT[, c("rn", "qtrago") := .(.I, date - 90L)]
DT[DT[DT, on=.(patient, numsermed, date=qtrago), roll=-Inf, unique(rn)]]
data:
library(data.table)
DT <- fread("patient numsermed date
1 numser1 2020-01-08
2 numser2 2015-01-02
2 numser2 2019-12-12
2 numser2 2020-01-05
2 numser2 2020-01-08
2 numser2 2020-01-20
2 numser2 2020-03-15
2 numser2 2020-03-18
2 numser3 2020-03-13
2 numser3 2020-03-18
3 numser3 2020-01-22
4 numser4 2018-01-02")
DT[, date := as.IDate(date, format="%Y-%m-%d")]

R data.table if then sumif lookup using join

I am looking to look up the individual id in events_table and calculate the total_duration as the sum of the duration of all events prior to date.
The duration is the time between the date_start and date (table1), unless the event ended (i.e. has a date_end), in which case if date_end < date, duration = date_end - date_start.
In pseudo code:
IF (date > date_start) THEN {
    IF (date_end < date & date_end != NA) THEN {
        duration = date_end - date_start
    } ELSE IF (date_start < date) {
        duration = date - date_start
    }
}
Then sum all the durations separately for each "individual_id" and "date" combo
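For concreteness, a literal row-wise translation of that pseudocode into plain R (only a sketch of the per-event duration, before any join or grouping) would be something like:
event_duration <- function(date, date_start, date_end) {
  ifelse(date > date_start,
         ifelse(!is.na(date_end) & date_end < date,
                as.numeric(date_end - date_start),   # event already ended before `date`
                as.numeric(date - date_start)),      # event still open at `date`
         0)                                          # event starts after `date`, contributes nothing
}
Summing this per "individual id" and date is the part I need the join for.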
I am using data.tables as I have large tables (>1m rows).
My data looks a bit like this:
table1 <- fread(
"individual id | date
1 | 2019-01-02
1 | 2019-01-03
2 | 2019-01-02
2 | 2019-01-03",
sep ="|"
)
events_table<- fread(
"individual id | date_start | date_end
1 | 2018-01-02 | NA
1 | 2018-01-04 | 2018-07-01
1 | 2018-01-05 | NA
2 | 2018-01-01 | NA
2 | 2018-01-02 | NA
2 | 2018-01-05 | 2018-11-21",
sep = "|"
)
The output should be the following:
table1 <- fread(
"individual id | date | total_duration
1 | 2019-01-02 | 905
1 | 2019-01-03 | 907
2 | 2019-01-02 | 1051
2 | 2019-01-03 | 1053",
sep ="|"
)
My best guess at starting the query comes from:
table1[, total_duration:= events_table[table1,
on = .(`individual id`, date>date_start),
sum(date-date_start),
by = .EACHI][["V1"]]]
But I don't know the syntax for including the if condition.
Thanks for any help.
# formatting
table1[, date := as.IDate(date)]
events_table[, `:=`(date_start = as.IDate(date_start), date_end = as.IDate(date_end))]
# list max dur
events_table[, dur := date_end - date_start]
# add up completed events
table1[, v1 :=
events_table[.SD, on=.(`individual id`, date_end <= date), sum(x.dur, na.rm = TRUE), by=.EACHI]$V1
]
# add on incomplete events
table1[, v2 :=
events_table[!is.na(date_end)][.SD, on=.(`individual id`, date_start <= date, date_end > date), sum(i.date - x.date_start, na.rm = TRUE), by=.EACHI]$V1
]
# add on ill-defined events
table1[, v3 :=
events_table[is.na(date_end)][.SD, on=.(`individual id`, date_start <= date), sum(i.date - x.date_start, na.rm = TRUE), by=.EACHI]$V1
]
table1[, v := v1 + v2 + v3]
individual id date total_duration v1 v2 v3 v
1: 1 2019-01-02 905 178 0 727 905
2: 1 2019-01-03 907 178 0 729 907
3: 2 2019-01-02 1051 320 0 731 1051
4: 2 2019-01-03 1053 320 0 733 1053
You don't have to define three distinct columns, though it is easier for debugging. Instead, you could initialize table1[, v := 0] and for each step do table1[, v := v + ...].
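A sketch of that accumulate-in-place variant (the same three joins as above, just added into a single v column; the as.numeric() calls keep v numeric when difftime sums are added):
table1[, v := 0]
# completed events
table1[, v := v +
  events_table[.SD, on = .(`individual id`, date_end <= date),
               as.numeric(sum(x.dur, na.rm = TRUE)), by = .EACHI]$V1]
# events whose end date is still in the future at `date`
table1[, v := v +
  events_table[!is.na(date_end)][.SD, on = .(`individual id`, date_start <= date, date_end > date),
               as.numeric(sum(i.date - x.date_start, na.rm = TRUE)), by = .EACHI]$V1]
# open-ended events
table1[, v := v +
  events_table[is.na(date_end)][.SD, on = .(`individual id`, date_start <= date),
               as.numeric(sum(i.date - x.date_start, na.rm = TRUE)), by = .EACHI]$V1]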

Excel's AVERAGEIFS() in R

I'm trying to duplicate what you'd do with AVERAGEIFS function in Excel on my dataset:
EG_df <- data.frame(id = c("red_blue", "white_blue", "red_yellow","white_yellow", "brown_blue", "brown_yellow"),
StartDate = as.Date(c('2019-1-1','2019-3-1','2019-7-1','2018-1-1','2018-3-1','2018-7-1')),
EndDate = as.Date(c('2019-6-1','2019-12-1','2019-8-1','2018-1-1','2018-3-1','2018-7-1')),
avg_Value = NA
)
source <- data.frame(source.id = c("red_blue", "red_blue", "red_blue","brown_yellow", "brown_yellow", "brown_yellow"),
source.Date = as.Date(c('2019-1-1','2019-2-1','2019-3-1','2018-7-1','2018-8-1','2018-9-1')),
source.Value = c(22,56,32,31,14,7)
)
Logic I need to fill in EG_df$avg_Value:
For each row in EG_df, return the average value of source.value when source.Date is between StartDate and EndDate.
The Excel Formula, for clarification:
=AVERAGEIFS(source.value, source.id, id, source.Date, ">="&StartDate, source.Date, "<="&EndDate)
Any help would be greatly appreciated!
You can do this pretty efficiently with a non-equi join:
library(data.table)
setDT(source); setDT(EG_df)
EG_df[, avg_Value :=
source[copy(.SD), on=.(source.id = id, source.Date >= StartDate, source.Date <= EndDate), mean(x.source.Value), by=.EACHI]$V1
]
id StartDate EndDate avg_Value
1: red_blue 2019-01-01 2019-06-01 36.66667
2: white_blue 2019-03-01 2019-12-01 NA
3: red_yellow 2019-07-01 2019-08-01 NA
4: white_yellow 2018-01-01 2018-01-01 NA
5: brown_blue 2018-03-01 2018-03-01 NA
6: brown_yellow 2018-07-01 2018-07-01 31.00000
(There are NAs since I'm just using the excerpt source provided rather than the full table.)
How it works
x[i, j] subsets using i and then evaluates j, inside of which .SD refers to the Subset of Data.
When x and i are both tables, x[i, on=, j, by=.EACHI] is a join, with on= specifying the join conditions, and j evaluated for each row of i.
Because j = mean(x.source.Value) returns an unnamed column, it gets the default name of V1.
Inside j of x[i, j], v := val creates or modifies column v by assigning val to it.
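A tiny self-contained toy example of the same pattern (tables x and i are hypothetical, made up only to show the mechanics):
library(data.table)
x <- data.table(id = c("a", "a", "b"), val = c(1, 5, 10))
i <- data.table(id = c("a", "b"), lo = c(2, 0))

# for each row of i (by = .EACHI), sum x$val over the x rows that satisfy the
# join conditions; the unnamed result is called V1 and is assigned back into i
i[, total := x[copy(.SD), on = .(id, val >= lo), sum(x.val), by = .EACHI]$V1]
i
#    id lo total
# 1:  a  2     5
# 2:  b  0    10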
Using the dplyr library:
library(dplyr)
df <- EG_df %>%
  left_join(source, by = c('id' = 'source.id')) %>%
  filter((StartDate <= source.Date) & (source.Date <= EndDate)) %>%
  group_by(id, StartDate, EndDate) %>%
  summarise(value = mean(source.Value))
Using the tidyverse:
dplyr::inner_join(source, EG_df, by = c("source.id" = "id")) %>%
  dplyr::filter(source.Date >= StartDate,
                source.Date <= EndDate) %>%
  dplyr::group_by(source.id, StartDate, EndDate) %>%
  dplyr::summarise(avg_Value = mean(source.Value))
Consider base R, running a merge > subset > aggregate for the averages by id group and date range, then merging this result set back to the original dataset.
# MERGE > SUBSET > AGGREGATE
agg_df <- aggregate(cbind(avgValue=source.Value) ~ id + StartDate + EndDate,
subset(merge(EG_df, source, by.x="id", by.y="source.id", all.x=TRUE),
source.Date >= StartDate & source.Date <= EndDate),
FUN=mean)
# MERGE WITH ORIGINAL DATASET
merge(EG_df, agg_df, by=c("id", "StartDate", "EndDate"), all.x=TRUE)
# id StartDate EndDate avgValue
# 1 brown_blue 2018-03-01 2018-03-01 NA
# 2 brown_yellow 2018-07-01 2018-07-01 31.00000
# 3 red_blue 2019-01-01 2019-06-01 36.66667
# 4 red_yellow 2019-07-01 2019-08-01 NA
# 5 white_blue 2019-03-01 2019-12-01 NA
# 6 white_yellow 2018-01-01 2018-01-01 NA
Aside: this is similar to SQL's greatest-n-per-group problem (an official StackOverflow tag), where agg_df would be a subquery or CTE joined back to the original table.

Week Range Output in R

I'm working on an R script that would display the weekday range and week the dates fall in, in a data frame.
Output I'm trying to display:
DateRange             | Week
----------------------|-----
1/7/2018 - 1/13/2018  | 2
1/14/2018 - 1/20/2018 | 3
1/21/2018 - 1/26/2018 | 4
library(data.table)
dd <- seq(as.IDate("2018-01-01"), as.IDate("2018-04-10"), 1)
dt <- data.table(i = 1:length(dd),
                 day = dd,
                 weekday = weekdays(dd),
                 day_rounded = round(dd, "weeks"))
## Now let's add the weekdays for the "rounded" date
dt[ , weekday_rounded := weekdays(day_rounded)]
## This seems to make internal sense with the "week" calculation
dt[ , weeknumber := week(day)]
dt$weekday_rounded <- NULL
dt$day_rounded <- NULL
dt
If I am not mistaken, then lubridate can help:
library(lubridate)
library(data.table)
dd <- seq(as.IDate("2018-01-01"), as.IDate("2018-04-10"), 1)
dt <- data.table(i = 1:length(dd), day = dd)
dt[, week := week(day),]
dt[, week_start := floor_date(day, unit = "week"),]
dt[, week_range := interval(week_start, week_start + days(6))]
dt[, week_start := NULL,]
Output:
i day week week_range
1: 1 2018-01-01 1 2017-12-31 UTC--2018-01-06 UTC
2: 2 2018-01-02 1 2017-12-31 UTC--2018-01-06 UTC
3: 3 2018-01-03 1 2017-12-31 UTC--2018-01-06 UTC
4: 4 2018-01-04 1 2017-12-31 UTC--2018-01-06 UTC
5: 5 2018-01-05 1 2017-12-31 UTC--2018-01-06 UTC
6: 6 2018-01-06 1 2017-12-31 UTC--2018-01-06 UTC
7: 7 2018-01-07 1 2018-01-07 UTC--2018-01-13 UTC
8: 8 2018-01-08 2 2018-01-07 UTC--2018-01-13 UTC
9: 9 2018-01-09 2 2018-01-07 UTC--2018-01-13 UTC
10: 10 2018-01-10 2 2018-01-07 UTC--2018-01-13 UTC
.......
If you have some table dt with a day column and other arbitrary columns, you can add the i, weekday, weeknumber and WeekRange in a single call in data.table:
dt[, ':='(
  i = .I,
  weekday = weekdays(day),
  WeekRange = paste(min(day), max(day), sep = ' - ')
), .(weeknumber = week(day))]
The way this works is to group the days by weeknumber, or week(day), and then to paste the min date and max date in those groups together to create a date range for every week.
Here's a mock table to experiment with:
n <- 100
dt <- data.table(
  day = seq.Date(as.Date('2018-01-01'), by = 'day', length.out = n),
  a = runif(n),
  b = runif(n)
)
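If what you ultimately want is the one-row-per-week summary from the question (DateRange and Week) rather than a per-day annotation, a minimal sketch built on the same grouping idea, assuming dt has a Date column named day as in the mock table above, could be:
week_ranges <- dt[, .(DateRange = paste(min(day), max(day), sep = " - ")),
                  by = .(Week = week(day))]
week_ranges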
