Bi-weekly binning with data.table in R

Looking to add a bi-weekly date column to a data.table. I have a working solution, but it seems messy. I also have the feeling rolling joins should do the trick, but I'm not sure how. Are there any better ways to create a grouping for bi-weekly dates?
library(data.table)
# Mock data table
dt <- data.table(value = runif(20), date = seq(as.Date("2015-01-01"), as.Date("2015-01-20"), by = "days"))
# Bi-weekly dates starting with most recent date and working backwards
bidates <- data.table(bi = seq(dt[, max(date)], dt[, min(date)], by = -14))
# Expand out bi-weekly dates to match up with every date in that range
bidates <- bidates[, seq(bi - 13, bi, by = "days"), by = bi]
# Key and merge
setkey(dt, date)
setkey(bidates, V1)
dt[bidates, bi := i.bi]

Here's how you can use rolling joins:
bis = dt[, .(date = seq(max(date), min(date), by = -14))][, bi := date]
setkey(bis, date)
setkey(dt, date)
bis[dt, roll = -Inf]
# date bi value
# 1: 2015-01-01 2015-01-06 0.2433854
# 2: 2015-01-02 2015-01-06 0.5454916
# 3: 2015-01-03 2015-01-06 0.3334531
# 4: 2015-01-04 2015-01-06 0.9134877
# 5: 2015-01-05 2015-01-06 0.4557901
# 6: 2015-01-06 2015-01-06 0.3459536
# 7: 2015-01-07 2015-01-20 0.8024527
# 8: 2015-01-08 2015-01-20 0.1833166
# 9: 2015-01-09 2015-01-20 0.1024560
#10: 2015-01-10 2015-01-20 0.4052751
#11: 2015-01-11 2015-01-20 0.9564279
#12: 2015-01-12 2015-01-20 0.6413953
#13: 2015-01-13 2015-01-20 0.7614291
#14: 2015-01-14 2015-01-20 0.2176500
#15: 2015-01-15 2015-01-20 0.3352939
#16: 2015-01-16 2015-01-20 0.4847095
#17: 2015-01-17 2015-01-20 0.8450636
#18: 2015-01-18 2015-01-20 0.8513685
#19: 2015-01-19 2015-01-20 0.2012410
#20: 2015-01-20 2015-01-20 0.3847956
Starting from version 1.9.5+ you don't need to set the keys and can do:
bis[dt, roll = -Inf, on = 'date']
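If you would rather keep dt and just tag each row with its bin, the same rolling join also works as an update join (a sketch along the same lines; the x.bi prefix for referring to the bi column of bis inside j assumes data.table 1.9.8+):
# add the bi-weekly bin to dt by reference, without materialising the joined table
dt[, bi := bis[dt, x.bi, roll = -Inf, on = "date"]]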

Related

R data.table apply date by variable last

I have a data.table in R and need to fill in a date column that decrements from the last row within each group. In the example below, the date "2012-01-21" should be on the 10th row for id = "A" and then decrement back to the 1st row; for id = "B" the date should be "2012-01-21" on the 5th row and then decrement by 1 day back to the first row. Basically, the decrementing dates should end at the last row of each "id". How can I accomplish this with data.table in R?
The code below does the opposite: the date starts at the 1st row and decrements from there. How would I make the date start at the last row instead?
library(data.table)
end <- as.Date("2012-01-21")
dt <- data.table(id = c(rep("A",10), rep("B",5)), sales = 10 + rnorm(15))
dtx <- dt[, date := seq(end, by = -1, length.out = .N), by = list(id)]
> dtx
id sales date
1: A 12.008514 2012-01-21
2: A 10.904740 2012-01-20
3: A 9.627039 2012-01-19
4: A 11.363810 2012-01-18
5: A 8.533913 2012-01-17
6: A 10.041074 2012-01-16
7: A 11.006845 2012-01-15
8: A 10.775066 2012-01-14
9: A 9.978509 2012-01-13
10: A 8.743829 2012-01-12
11: B 8.434640 2012-01-21
12: B 9.489433 2012-01-20
13: B 10.011354 2012-01-19
14: B 8.681002 2012-01-18
15: B 9.264915 2012-01-17
We could reverse the sequence generated above.
library(data.table)
dt[,date := rev(seq(end,by = -1,length.out = .N)),id]
dt
# id sales date
# 1: A 10.886312 2012-01-12
# 2: A 9.803543 2012-01-13
# 3: A 9.063694 2012-01-14
# 4: A 9.762628 2012-01-15
# 5: A 8.764109 2012-01-16
# 6: A 11.095826 2012-01-17
# 7: A 8.735148 2012-01-18
# 8: A 9.227285 2012-01-19
# 9: A 12.024336 2012-01-20
#10: A 9.976514 2012-01-21
#11: B 8.488753 2012-01-17
#12: B 9.141837 2012-01-18
#13: B 11.435365 2012-01-19
#14: B 10.817839 2012-01-20
#15: B 8.427098 2012-01-21
Similarly,
dt[,date := seq(end - .N + 1,by = 1,length.out = .N),id]
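Both of these amount to plain Date arithmetic, which may read even more directly (a small sketch, same ascending result):
# each row's date is `end` minus the number of rows that follow it within the group
dt[, date := end - (.N - seq_len(.N)), by = id]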

R: time series monthly max adjusted by group

I have a df like this (head shown):
date Value
1: 2016-12-31 169361280
2: 2017-01-01 169383153
3: 2017-01-02 169494585
4: 2017-01-03 167106852
5: 2017-01-04 166750164
6: 2017-01-05 164086438
I would like to calculate a ratio, and for that I need the max of every period. The max is normally on the last day of the month, but sometimes it can fall a few days before or after (the 28th, 29th, 30th, 31st, 1st or 2nd).
To calculate it properly, I would like to assign to my reference date (the last day of the month) the max value of this group of days, so that the ratio reflects what it is supposed to.
This could be a reproducible example:
library(data.table)
library(zoo)
Start <- as.Date("2016-12-31")
End <- Sys.Date()
window <- data.table(seq(Start, End, by = '1 day'))
dt <- cbind(window, rep(rnorm(nrow(window))))
colnames(dt) <- c("date", "value")
# Create a sequence of month-end reference dates
DateSeq <- function(st, en, freq) {
  st <- as.Date(as.yearmon(st))
  en <- as.Date(as.yearmon(en))
  as.Date(as.yearmon(seq(st, en, by = paste(as.character(12/freq), "months"))), frac = 1)
}
# data.frame to be filled with the group max
Value.Max.Month <- data.frame(DateSeq(Start, End, 12))
colnames(Value.Max.Month) <- c("date")
date
1 2016-12-31
2 2017-01-31
3 2017-02-28
4 2017-03-31
5 2017-04-30
6 2017-05-31
7 2017-06-30
8 2017-07-31
9 2017-08-31
10 2017-09-30
11 2017-10-31
12 2017-11-30
13 2017-12-31
14 2018-01-31
15 2018-02-28
16 2018-03-31
You could use data.table:
library(lubridate)
library(zoo)
Start <- as.Date("2016-12-31")
End <- Sys.Date()
window <- data.table(seq(Start,End,by='1 day'))
dt <- cbind(window,rep(rnorm(nrow(window))))
colnames(dt) <- c("date","value")
dt <- data.table(dt)
dt[, period := as.Date(as.yearmon(date)) %m+% months(1) - 1][
  , maximum := max(value), by = period][
  , unique(maximum), by = period]
In the first expression we create a new column called period. Then we group by this new column and look for the maximum in value. In the last expression we just output these unique rows.
Notice that to get the last day of each period we add one month with lubridate's %m+% and then subtract 1 day.
The output is:
period V1
1: 2016-12-31 -0.7832116
2: 2017-01-31 2.1988660
3: 2017-02-28 1.6644812
4: 2017-03-31 1.2464980
5: 2017-04-30 2.8268820
6: 2017-05-31 1.7963104
7: 2017-06-30 1.3612476
8: 2017-07-31 1.7325457
9: 2017-08-31 2.7503439
10: 2017-09-30 2.4369036
11: 2017-10-31 2.4544802
12: 2017-11-30 3.1477730
13: 2017-12-31 2.8461506
14: 2018-01-31 1.8862944
15: 2018-02-28 1.8946470
16: 2018-03-31 0.7864341

aggregate/merge over date range using data.table

Suppose I have two data.tables:
library(data.table)
library(lubridate)
summary <- data.table(period = c("A","B","C","D"),
                      from_date = ymd(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
                      to_date = ymd(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01")))
log <- data.table(date = ymd(c("2017-01-03","2017-01-20","2017-02-01","2017-03-03",
                               "2017-03-15","2017-03-28","2017-04-03","2017-04-23")),
                  event1 = c(4,8,8,4,3,4,7,3), event2 = c(1,8,7,3,8,4,6,3))
which look like this:
> summary
period from_date to_date
1: A 2017-01-01 2017-01-31
2: B 2017-01-03 2017-04-01
3: C 2017-02-08 2017-03-08
4: D 2017-03-07 2017-05-01
> log
date event1 event2
1: 2017-01-03 4 1
2: 2017-01-20 8 8
3: 2017-02-01 8 7
4: 2017-03-03 4 3
5: 2017-03-15 3 8
6: 2017-03-28 4 4
7: 2017-04-03 7 6
8: 2017-04-23 3 3
I would like to get the sum of event1 and event2 for each time period in the table summary.
I know I can do this:
summary[, c("event1","event2") := .(sum(log[date >= from_date & date <= to_date, event1]),
                                    sum(log[date >= from_date & date <= to_date, event2])),
        by = period][]
to get the desired result:
period from_date to_date event1 event2
1: A 2017-01-01 2017-01-31 12 9
2: B 2017-01-03 2017-04-01 31 31
3: C 2017-02-08 2017-03-08 4 3
4: D 2017-03-07 2017-05-01 17 21
Now, in my real-life problem, I have about 30 columns to be summed, which I may want to change later; summary has ~35,000 rows and log has ~40,000,000 rows. Is there an efficient way to achieve this?
Note: This is my first post here. I hope my question is clear and specific enough, please do make suggestions if there is anything I should do to improve the question. Thanks!
Yes, you can perform a non-equi join.
(Note I've changed log and summary to Log and Summary as the originals are already functions in R.)
Log[Summary,
    on = c("date>=from_date", "date<=to_date"),
    nomatch = 0L,
    allow.cartesian = TRUE][, .(from_date = date[1],
                                to_date = date.1[1],
                                event1 = sum(event1),
                                event2 = sum(event2)),
                            keyby = "period"]
To sum over a pattern of columns, use lapply with .SD:
joined_result <-
  Log[Summary,
      on = c("date>=from_date", "date<=to_date"),
      nomatch = 0L,
      allow.cartesian = TRUE]
cols <- grep("event[a-z]?[0-9]", names(joined_result), value = TRUE)
joined_result[, lapply(.SD, sum),
              .SDcols = cols,
              keyby = .(period,
                        from_date = date,
                        to_date = date.1)]
With data.table, it is possible to aggregate during a non-equi join using by = .EACHI.
log[summary, on = .(date >= from_date, date <= to_date), nomatch=0L,
lapply(.SD, sum), by = .EACHI]
date date event1 event2
1: 2017-01-01 2017-01-31 12 9
2: 2017-01-03 2017-04-01 31 31
3: 2017-02-08 2017-03-08 4 3
4: 2017-03-07 2017-05-01 17 21
With some additional clean-up:
log[summary, on = .(date >= from_date, date <= to_date), nomatch=0L,
c(period = period, lapply(.SD, sum)), by = .EACHI][
, setnames(.SD, 1:2, c("from_date", "to_date"))]
from_date to_date period event1 event2
1: 2017-01-01 2017-01-31 A 12 9
2: 2017-01-03 2017-04-01 B 31 31
3: 2017-02-08 2017-03-08 C 4 3
4: 2017-03-07 2017-05-01 D 17 21
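For the ~30-column case, the same by = .EACHI pattern can be combined with .SDcols so the summed columns are selected by a name pattern instead of written out one by one (a sketch; the "^event" pattern is only an example and assumes your real columns share a common prefix):
cols <- grep("^event", names(log), value = TRUE)
log[summary, on = .(date >= from_date, date <= to_date), nomatch = 0L,
    c(list(period = period), lapply(.SD, sum)), by = .EACHI, .SDcols = cols][
  , setnames(.SD, 1:2, c("from_date", "to_date"))]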

Calculate cumulative sum between two dates with a data.table in R

I have a data.table with the following shape:
date_from date_until value
2015-01-01 2015-01-03 100
2015-01-02 2015-01-05 50
2015-01-02 2015-01-04 10
...
What I want to do is: for every date in the year, calculate the sum of all values that are relevant on that date. For the first row, the value 100 is relevant for every day from 2015-01-01 until 2015-01-03. I want to add up all values which are relevant for a given date.
So, in the end there would be a data.table like this:
date value
2015-01-01 100
2015-01-02 160
2015-01-03 160
2015-01-04 60
2015-01-05 50
Is there an easy way to do this with data.table?
dt[, .(date = seq(as.Date(date_from, '%Y-%m-%d'),
                  as.Date(date_until, '%Y-%m-%d'),
                  by = '1 day'),
       value),
   by = 1:nrow(dt)][, sum(value), by = date]
# date V1
#1: 2015-01-01 100
#2: 2015-01-02 160
#3: 2015-01-03 160
#4: 2015-01-04 60
#5: 2015-01-05 50
And another option using foverlaps:
# convert to Date for ease
dt[, date_from := as.Date(date_from, '%Y-%m-%d')]
dt[, date_until := as.Date(date_until, '%Y-%m-%d')]
# all of the dates
alldates = dt[, do.call(seq, c(as.list(range(c(date_from, date_until))), by = '1 day'))]
# foverlaps to find the intersections
foverlaps(dt, data.table(date_from = alldates, date_until = alldates,
                         key = c('date_from', 'date_until')))[, sum(value), by = date_from]
# date_from V1
#1: 2015-01-01 100
#2: 2015-01-02 160
#3: 2015-01-03 160
#4: 2015-01-04 60
#5: 2015-01-05 50
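A non-equi join is a third option and avoids building the expanded helper table explicitly (a sketch, reusing the Date columns and alldates built above; note that a day covered by no interval would come out as NA rather than 0 here):
dt[data.table(day = alldates),
   on = .(date_from <= day, date_until >= day),
   .(value = sum(value)), by = .EACHI][, .(date = date_from, value)]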

merge data tables by time intervals overlap

Suppose I have two tables: one with appointments and a second with receptions. Each table has a filial ID, a medic ID, start and end times (planned for appointments, actual for receptions) and some other data. I want to count how many appointments have a reception within the appointment's time interval. The actual reception can begin before the appointment start time, after it, lie entirely inside the appointment interval, etc.
Below I build the two tables, one for appointments and one for receptions. I wrote a nested loop, but it works very slowly; my tables contain approximately 50 million rows each, so I need a fast solution. How can I do this without a loop? Thanks in advance!
library(data.table)
date <- as.POSIXct('2015-01-01 14:30:00')
# appointments data table
app <- data.table(med.id = 1:10,
filial.id = rep(c(100,200), each = 5),
start.time = rep(seq(date, length.out = 5, by = "hours"),2),
end.time = rep(seq(date+3599, length.out = 5, by = "hours"),2),
A = rnorm(10))
# receptions data table
re <- data.table(med.id = c(1,11,3,4,15,6,7),
filial.id = c(rep(100, 5), 200,200),
start.time = as.POSIXct(paste(rep('2015-01-01 ',7), c('14:25:00', '14:25:00','16:32:00', '17:25:00', '16:10:00', '15:35:00','15:50:00'))),
end.time = as.POSIXct(paste(rep('2015-01-01 ',7), c('15:25:00', '15:20:00','17:36:00', '18:40:00', '16:10:00', '15:49:00','16:12:00'))),
B = rnorm(7))
app$count <- 0
for (i in 1:dim(app)[1]){
  for (j in 1:dim(re)[1]){
    if ((app$med.id[i] == re$med.id[j]) & # med.id is equal and
        app$filial.id[i] == re$filial.id[j]) { # filial.id is equal
      if ((re$start.time[j] < app$start.time[i]) & (re$end.time[j] > app$start.time[i])) { # reception starts before appointment start time and ends after appointment start time OR
        app$count[i] <- app$count[i] + 1
      } else if ((re$start.time[j] < app$end.time[i]) & (re$start.time[j] > app$start.time[i])) { # reception starts before appointment end time and after app. start time
        app$count[i] <- app$count[i] + 1
      }
    }
  }
}
Using foverlaps():
setkey(re, med.id, filial.id, start.time, end.time)
olaps = foverlaps(app, re, which=TRUE, nomatch=0L)[, .N, by=xid]
app[, count := 0L][olaps$xid, count := olaps$N]
app
# med.id filial.id start.time end.time A count
# 1: 1 100 2015-01-01 14:30:00 2015-01-01 15:29:59 0.60878560 1
# 2: 2 100 2015-01-01 15:30:00 2015-01-01 16:29:59 -0.11545284 0
# 3: 3 100 2015-01-01 16:30:00 2015-01-01 17:29:59 0.68992084 1
# 4: 4 100 2015-01-01 17:30:00 2015-01-01 18:29:59 0.04703938 1
# 5: 5 100 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.95315419 0
# 6: 6 200 2015-01-01 14:30:00 2015-01-01 15:29:59 0.26193554 0
# 7: 7 200 2015-01-01 15:30:00 2015-01-01 16:29:59 1.55206077 1
# 8: 8 200 2015-01-01 16:30:00 2015-01-01 17:29:59 0.44517362 0
# 9: 9 200 2015-01-01 17:30:00 2015-01-01 18:29:59 0.11475881 0
# 10: 10 200 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.66139828 0
PS: please go through the vignettes and learn to use data tables effectively.
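If you prefer join syntax over foverlaps(), a non-equi join with by = .EACHI can produce the same per-appointment count (a sketch; the overlap condition mirrors the closed intervals foverlaps() uses by default, and counting non-NA B values gives 0 for appointments with no matching reception):
cnt <- re[app,
          on = .(med.id, filial.id, start.time <= end.time, end.time >= start.time),
          .(n = sum(!is.na(B))), by = .EACHI]
app[, count := cnt$n]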
I actually don't think you need to merge by time overlap at all: your code is really just merging by med.id and filial.id and then performing a simple comparison.
First, for clarity, let's rename the start.time and end.time fields:
setnames(app, c("start.time", "end.time"), c("app.start.time", "app.end.time"))
setnames(re, c("start.time", "end.time"), c("re.start.time", "re.end.time"))
You should then merge the two data.tables on the keys med.id and filial.id, like this:
app_re <- re[app, on=c("med.id", "filial.id")]
# med.id filial.id re.start.time re.end.time B
# 1: 1 100 2015-01-01 14:25:00 2015-01-01 15:25:00 0.4307760
# 2: 2 100 <NA> <NA> NA
# 3: 3 100 2015-01-01 16:32:00 2015-01-01 17:36:00 -1.2933755
# 4: 4 100 2015-01-01 17:25:00 2015-01-01 18:40:00 -1.2374469
# 5: 5 100 <NA> <NA> NA
# 6: 6 200 2015-01-01 15:35:00 2015-01-01 15:49:00 -0.8054822
# 7: 7 200 2015-01-01 15:50:00 2015-01-01 16:12:00 2.5742241
# 8: 8 200 <NA> <NA> NA
# 9: 9 200 <NA> <NA> NA
# 10: 10 200 <NA> <NA> NA
# app.start.time app.end.time A
# 1: 2015-01-01 14:30:00 2015-01-01 15:29:59 -0.26828337
# 2: 2015-01-01 15:30:00 2015-01-01 16:29:59 0.24246341
# 3: 2015-01-01 16:30:00 2015-01-01 17:29:59 1.55824948
# 4: 2015-01-01 17:30:00 2015-01-01 18:29:59 1.25829302
# 5: 2015-01-01 18:30:00 2015-01-01 19:29:59 1.14244558
# 6: 2015-01-01 14:30:00 2015-01-01 15:29:59 -0.41234563
# 7: 2015-01-01 15:30:00 2015-01-01 16:29:59 0.07710022
# 8: 2015-01-01 16:30:00 2015-01-01 17:29:59 -1.46421985
# 9: 2015-01-01 17:30:00 2015-01-01 18:29:59 1.21682394
# 10: 2015-01-01 18:30:00 2015-01-01 19:29:59 1.11197318
You can then create your count variable with the same conditions as before:
app_re[, count :=
         as.numeric((re.start.time < app.start.time & re.end.time > app.start.time) |
                      (re.start.time < app.end.time & re.start.time > app.start.time))]
# Convert the NAs to 0
app_re[, count := ifelse(is.na(count), 0, count)]
This should be much faster than the for loops.
