I have a large dataset that includes date periods with different disease states per id and reference date. I would like to add a 'healthy' state for all missing date periods within +/- 5 years from the reference date per id.
I have tried to modify the solution here: Fill in missing date ranges but failed. Preferably, I would like to keep to the data.table framework. Any advice is greatly appreciated!
Sample data:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
Desired output:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2005-03-16 2008-10-10 0
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2008-10-13 2014-11-04 0
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2008-05-10 2012-01-31 0
2 2013-05-10 2012-02-01 2012-03-01 2
2 2013-05-10 2012-03-02 2018-05-10 0
2 2014-06-11 2009-06-11 2012-01-31 0
2 2014-06-11 2012-02-01 2012-03-01 2
2 2014-06-11 2012-03-02 2019-06-11 0
3 2011-08-14 2006-08-14 2016-08-14 0
")
Comment:
For the first row, the +/-5 year date interval is from 2005-01-10 to 2015-01-10. However, because of the ongoing disease state that ends 2005-03-15, the "healthy" period starts at 2005-03-16. Because there can be several reference dates per id, duplicate date periods (as observed for id 2: 2012-02-01 to 2012-03-01) will be present and are OK. Finally, ids with no disease states are represented by NA (as with id 3).
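The window bounds themselves can be computed with seq(), which steps by calendar years; for example, for the first reference date:
library(data.table)
ref <- as.IDate("2010-01-10")
# length.out = 2 gives c(ref, ref -/+ 5 years); take the second element
seq(ref, by = "-5 years", length.out = 2L)[2L] # "2005-01-10"
seq(ref, by = "5 years", length.out = 2L)[2L]  # "2015-01-10"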
EDIT: I had some problems with the real data, so I tweaked the solution a bit; I also added a step so that the status is collapsed per date interval:
DT2 <- DT[, {
  # +/-5 years from t0
  sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
  edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
  if (is.na(period_start[1L])) {
    # replace NA with full time interval for 'healthy'
    .(period_start=sdt, period_end=edt, status='notsick')
  } else {
    # add date for -5 years if it is the minimum, otherwise use existing minimum
    if (sdt < period_start[1L]) {
      period_start <- c(sdt, period_start)
    }
    # add date for +5 years if it is the maximum, otherwise use existing maximum
    if (edt > period_end[.N]) {
      period_end <- c(period_end, edt)
    }
    dates <- unique(sort(c(period_start, period_end + 1L)))
    .(period_start=dates[-length(dates)], period_end=dates[-1L] - 1L, status='')
  }
},
.(id, reference_date)]
## (c). Collapse status for overlapping periods
DT <- DT[DT2, on = .(id, reference_date, period_start <= period_start, period_end >= period_end), {
  status <- paste(status, collapse = ";")
  .(status=status)},
  by = .EACHI, allow.cartesian = TRUE]
Here is an option:
interweave <- function(x, y) c(rbind(x, y)) # see ref
ans <- DT[, {
  # +/-5 years from the reference date
  sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
  edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
  if (is.na(period_start[1L])) {
    # no disease periods: the whole window is healthy
    .(period_start=sdt, period_end=edt, Status=0L)
  } else {
    if (sdt < period_start[1L]) {
      period_start <- c(sdt, period_start)
    }
    # breakpoints: period starts interleaved with the days after period ends;
    # unique() guards against the duplicate that rbind() recycling creates
    # when sdt was prepended and the two vectors differ in length
    ps <- as.IDate(unique(sort(interweave(period_start, period_end + 1L))))
    if (period_end[.N] > edt) {
      ps <- ps[-length(ps)]
      pe <- period_end[.N]
    } else {
      pe <- edt
    }
    .(period_start=ps, period_end=c(ps[-1L] - 1, pe), Status=0L)
  }
},
.(id, reference_date)]
ans[DT, on=setdiff(names(DT), "Status"), Status := i.Status]
ans
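The last line is an update join: rows of ans that match an original DT row on id, reference_date, period_start and period_end get their placeholder Status = 0 overwritten with i.Status; id 3's NA row matches nothing, so its Status stays 0.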
data:
library(data.table)
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
cols <- c("reference_date","period_start","period_end")
DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
Reference:
Alternate, interweave or interlace two vectors
I have this function to generate monthly ranges; it should handle years where February has 28 or 29 days:
starts ends
1 2017-01-01 2017-01-31
2 2017-02-01 2017-02-28
3 2017-03-01 2017-03-31
It works with:
make_date_ranges(as.Date("2017-01-01"), Sys.Date())
But gives error with:
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Why?
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Error in data.frame(starts, ends) :
arguments imply differing number of rows: 38, 36
add_months <- function(date, n){
  seq(date, by = paste(n, "months"), length = 2)[2]
}
make_date_ranges <- function(start, end){
  starts <- seq(from = start,
                to = Sys.Date() - 1,
                by = "1 month")
  ends <- c((seq(from = add_months(start, 1),
                 to = end,
                 by = "1 month")) - 1,
            (Sys.Date() - 1))
  data.frame(starts, ends)
}
## usage
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
1) First, define start-of-month (som) and end-of-month (eom) functions which take a Date class object, a date string in standard Date format, or a yearmon object, and produce a Date class object giving the start or end of its month.
Using those, create a monthly Date series s using the start of each month from the month/year of from to that of to. Use pmax to ensure that the series does not extend before from and pmin so that it does not extend past to.
The input arguments can be strings in standard Date format, Date class objects or yearmon class objects. In the yearmon case it assumes the user wanted the full month for every month. (The if statement can be omitted if you don't need to support yearmon inputs.)
library(zoo)
som <- function(x) as.Date(as.yearmon(x))
eom <- function(x) as.Date(as.yearmon(x), frac = 1)
date_ranges2 <- function(from, to) {
  if (inherits(to, "yearmon")) to <- eom(to)
  s <- seq(som(from), eom(to), "month")
  data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges2("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges2(as.yearmon("2000-01"), as.yearmon("2000-06"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
2) This alternative takes the same approach but defines start of month (som) and end of month (eom) functions without using yearmon so that only base R is needed. It takes character strings in standard Date format or Date class inputs and gives the same output as (1).
som <- function(x) as.Date(cut(as.Date(x), "month")) # start of month
eom <- function(x) som(som(x) + 32) - 1 # end of month: +32 days always lands in the next month
date_ranges3 <- function(from, to) {
  s <- seq(som(from), as.Date(to), "month")
  data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges3("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges3(som("2000-01-10"), eom("2000-06-20"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
You don't need to use seq twice. (The error happens because starts runs to Sys.Date() - 1 while ends runs to end, so the two vectors have different lengths: 38 vs 36 here.) You can subtract 1 day from the firsts of each month to get the ends, and generate one too many starts, then shift & subset:
make_date_ranges = function(start, end) {
  # format(end, "%Y-%m-01") essentially truncates end to
  # the first day of end's month; 32 days later is guaranteed to be
  # in the subsequent month
  starts = seq(from = start, to = as.Date(format(end, '%Y-%m-01')) + 32, by = 'month')
  data.frame(starts = head(starts, -1L), ends = tail(starts - 1, -1L))
}
x = make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
rbind(head(x), tail(x))
# starts ends
# 1 2017-01-01 2017-01-31
# 2 2017-02-01 2017-02-28
# 3 2017-03-01 2017-03-31
# 4 2017-04-01 2017-04-30
# 5 2017-05-01 2017-05-31
# 6 2017-06-01 2017-06-30
# 31 2019-07-01 2019-07-31
# 32 2019-08-01 2019-08-31
# 33 2019-09-01 2019-09-30
# 34 2019-10-01 2019-10-31
# 35 2019-11-01 2019-11-30
# 36 2019-12-01 2019-12-31
I have some transactional records, like the following:
library(data.table)
customers <- 1:75
purchase_dates <- seq( as.Date('2016-01-01'),
as.Date('2018-12-31'),
by=1 )
n <- 500L
set.seed(1)
# Assume the data are already ordered and 1 row per cust_id/purch_dt
df <- data.table( cust_id = sample(customers, n, replace=TRUE),
purch_dt = sample(purchase_dates, n, replace=TRUE),
purch_amt = sample(500:50000, n, replace=TRUE)/100
)[, .(purch_amt = sum(purch_amt)),
keyby=.(cust_id, purch_dt) ]
df
# cust_id purch_dt purch_amt
# 1 2016-03-20 69.65
# 1 2016-05-17 413.60
# 1 2016-12-25 357.18
# 1 2017-03-20 256.21
# 2 2016-05-26 49.14
# 2 2018-05-31 261.87
# 2 2018-12-27 293.28
# 3 2016-12-10 204.12
# 3 2018-09-21 8.70
I would like to know the prior transaction count and total amount, within a 365-day prior window (i.e., at d-365 through d-1 for a transaction on date d).
I thought of using a rolling join, but that would match at most one prior purchase, and there could be multiple purchases in the window.
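For reference, the rolling-join idea would look roughly like this (a sketch; prev is a throwaway name):
# roll = 365 rolls each candidate date forward up to 365 days, but matches
# only the single most recent prior row per transaction -- counts and sums
# over *all* prior purchases are out of reach this way
prev <- df[df[, .(cust_id, d = purch_dt - 1L)],
           on = .(cust_id, purch_dt = d), roll = 365]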
I was able to get the desired result using a Cartesian self-join with a date filter (see answer below), but that's not a very memory-efficient approach.
Desired output:
cust_id purch_dt prior_purch_cnt prior_purch_amt purch_amt
1 2016-03-20 0 0.00 69.65
1 2016-05-17 1 69.65 413.60
1 2016-12-25 2 483.25 357.18
1 2017-03-20 3 840.43 256.21
2 2016-05-26 0 0.00 49.14
2 2018-05-31 0 0.00 261.87
2 2018-12-27 1 261.87 293.28
3 2016-12-10 0 0.00 204.12
3 2018-09-21 0 0.00 8.70
I think the idiomatic way is:
df[, c("ppn", "ppa") :=
df[.(cust_id = cust_id, d_dn = purch_dt-365, d_up = purch_dt),
on=.(cust_id, purch_dt >= d_dn, purch_dt < d_up),
.(.N, sum(purch_amt, na.rm=TRUE))
, by=.EACHI][, .(N, V2)]
]
cust_id purch_dt purch_amt ppn ppa
1: 1 2016-03-20 69.65 0 0.00
2: 1 2016-05-17 413.60 1 69.65
3: 1 2016-12-25 357.18 2 483.25
4: 1 2017-03-20 256.21 3 840.43
5: 2 2016-05-26 49.14 0 0.00
---
494: 75 2018-01-12 381.24 2 201.04
495: 75 2018-04-01 65.83 3 582.28
496: 75 2018-06-17 170.30 4 648.11
497: 75 2018-07-22 60.49 5 818.41
498: 75 2018-10-10 66.12 4 677.86
This is a "non-equi join".
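If the pattern is unfamiliar: a non-equi join matches rows on inequality conditions instead of exact key equality. A minimal sketch with throwaway tables:
library(data.table)
x <- data.table(id = 1L, lo = as.IDate("2016-01-01"), hi = as.IDate("2016-06-30"))
y <- data.table(id = 1L, d = as.IDate("2016-03-15"))
# y's d must fall inside x's [lo, hi] range; id still joins on equality
x[y, on = .(id, lo <= d, hi >= d)]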
Here's the Cartesian self-join with date-range filter:
df_prior <- df[df, on=.(cust_id), allow.cartesian=TRUE
               ][i.purch_dt < purch_dt &
                 i.purch_dt >= purch_dt - 365
               ][, .(prior_purch_cnt = .N,
                     prior_purch_amt = sum(i.purch_amt)),
                 keyby=.(cust_id, purch_dt)]
df2 <- df_prior[df, on=.(cust_id, purch_dt)]
df2[is.na(prior_purch_cnt), `:=`(prior_purch_cnt = 0,
                                 prior_purch_amt = 0)]
df2
# cust_id purch_dt prior_purch_cnt prior_purch_amt purch_amt
# 1 2016-03-20 0 0.00 69.65
# 1 2016-05-17 1 69.65 413.60
# 1 2016-12-25 2 483.25 357.18
# 1 2017-03-20 3 840.43 256.21
# 2 2016-05-26 0 0.00 49.14
I'm concerned about how this could blow up prior to filtering on datasets where customers have many prior transactions.
I have a dataframe ("observations") with time stamps in H:M format ("Time"). In a second dataframe ("intervals"), I have time ranges defined by "From" and "Till" variables, also in H:M format.
I want to count the number of observations that fall within each interval. I have been using between from data.table, which has worked without any problem when dates are included.
However, now I only have time stamps, without dates. This causes problems for times that occur in the interval spanning midnight (20:00 - 05:59): those times are not counted by the code I have tried.
Example below
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"), Till = c("19:59", "05:59", "13:59"), stringsAsFactors = F)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32", "02:00", "00:00", "05:57", "19:32", "01:32", "02:22", "06:00", "07:50"), stringsAsFactors = F)
interval.data
# From Till
# 1: 14:00:00 19:59:00
# 2: 20:00:00 05:59:00 # <- interval including midnight
# 3: 06:00:00 13:59:00
observations
# Time
# 1: 14:32:00
# 2: 15:59:00
# 3: 16:32:00
# 4: 21:34:00 # Rows 4-8 & 10-11 fall in the 'midnight interval' but are not counted
# 5: 03:32:00 #
# 6: 02:00:00 #
# 7: 00:00:00 #
# 8: 05:57:00 #
# 9: 19:32:00
# 10: 01:32:00 #
# 11: 02:22:00 #
# 12: 06:00:00
# 13: 07:50:00
library(data.table)
library(plyr)
adply(interval.data, 1, function(x, y) sum(y[, 1] %between% c(x[1], x[2])), y = observations)
# From Till V1
# 1 14:00 19:59 4
# 2 20:00 05:59 0 # <- zero counts - wrong!
# 3 06:00 13:59 2
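(I suspect the zero comes from %between% testing From <= Time & Time <= Till, which can never both hold when From > Till, e.g.:)
library(data.table)
"21:34" %between% c("20:00", "05:59") # FALSE: "21:34" <= "05:59" fails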
One approach is to use a non-equi join in data.table, and its helper function as.ITime for working with time strings.
You'll have an issue with the interval that spans midnight, but there should only ever be one of those. And as you're interested in the number of observations per 'group' of intervals, you can treat this group as the complement ('not') of the others.
For example, first convert your data.frame to data.table
library(data.table)
## set your data.frames as `data.table`
setDT(interval.data)
setDT(observations)
Then use as.ITime to convert to an integer representation of time
## convert time stamps
interval.data[, `:=`(FromMins = as.ITime(From),
TillMins = as.ITime(Till))]
observations[, TimeMins := as.ITime(Time)]
## you could combine this step with the non-equi join directly, but I'm separating it for clarity
You can now use a non-equi join to find the interval each time falls within, noting that the times that return NA are exactly those that fall inside the midnight-spanning interval:
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
]
# From Till FromMins TillMins Time
# 1: 14:00 19:59 872 872 14:32
# 2: 14:00 19:59 959 959 15:59
# 3: 14:00 19:59 992 992 16:32
# 4: NA NA 1294 1294 21:34
# 5: NA NA 212 212 03:32
# 6: NA NA 120 120 02:00
# 7: NA NA 0 0 00:00
# 8: NA NA 357 357 05:57
# 9: 14:00 19:59 1172 1172 19:32
# 10: NA NA 92 92 01:32
# 11: NA NA 142 142 02:22
# 12: 06:00 13:59 360 360 06:00
# 13: 06:00 13:59 470 470 07:50
Then to get the number of observations for the groups of intervals, you just take .N grouped by the interval columns, which can be chained onto the end of the above statement:
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
][
, .N
, by = .(From, Till)
]
# From Till N
# 1: 14:00 19:59 4
# 2: NA NA 7
# 3: 06:00 13:59 2
The NA group corresponds to the interval that spans midnight.
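If you'd rather see that interval labelled instead of NA, you could overwrite those rows afterwards (a sketch, assuming the data's single midnight-spanning interval; res is a throwaway name):
res <- interval.data[observations
                     , on = .(FromMins <= TimeMins, TillMins > TimeMins)
                     ][, .N, by = .(From, Till)]
res[is.na(From), `:=`(From = "20:00", Till = "05:59")]
res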
I just tweaked your code to get the desired result. Hope this helps!
adply(interval.data, 1, function(x, y) {
  if (x[1] > x[2]) {
    # interval spans midnight: count the two pieces on either side of it
    sum(y[, 1] %between% c(x[1], "23:59"), y[, 1] %between% c("00:00", x[2]))
  } else {
    sum(y[, 1] %between% c(x[1], x[2]))
  }
}, y = observations)
Output is:
From Till V1
1 14:00 19:59 4
2 20:00 05:59 7
3 06:00 13:59 2
New to R, learning a lot, and I love it.
My sales dataset has product, bpid, and date. I would like to calculate the business-day difference between consecutive dates within the same bpid (excluding Saturday & Sunday).
If the product or bpid changes (or a new bpid/product is introduced), the computation should restart.
df <- data.frame(product=c('milk','milk','milk','milk','eggs','eggs','eggs','eggs'),
bpid=c(400,400,500,500,400,400,500,500),
date=c("2016-08-03","2016-08-10","2016-08-04","2016-08-10","2016-08-10","2016-08-16","2016-08-11","2016-08-15"));
df$date <- as.Date(df$date, format = "%Y-%m-%d");
My desired result would be like below.
product bpid date compute-result
milk 400 2016-08-03 0
milk 400 2016-08-10 5
milk 500 2016-08-04 0
milk 500 2016-08-10 5
eggs 400 2016-08-10 0
eggs 400 2016-08-16 4
eggs 500 2016-08-11 0
eggs 500 2016-08-15 2
Real-data code (getting zeros in the result column):
df <- data.frame(product=c('Keyt','Keyt','Keyt','Keyt','Keyt','Keyt'),
bpid=c(30057,30057,30057,30058,30058,30058),
date=c("2014-11-21","2015-05-05","2015-05-11","2014-10-16","2014-11-03","2016-03-15"));
df$date <- as.Date(df$date, format = "%Y-%m-%d");
cal <- Calendar(weekdays=c("saturday", "sunday"))
df$`compute-result` <- 0
idx <- seq(1, nrow(df),2)
df$`compute-result`[idx+1] <- bizdays(df$date[idx], df$date[idx+1], cal)
df
For example:
# install.packages("bizdays")
library(bizdays)
cal <- create.calendar(name = "mycal", weekdays=c("saturday", "sunday"))
df$`compute-result` <- 0
idx <- seq(1, nrow(df),2)
df$`compute-result`[idx+1] <- bizdays(df$date[idx], df$date[idx+1], cal)
df
# product bpid date compute-result
# 1 milk 400 2016-08-03 0
# 2 milk 400 2016-08-10 5
# 3 milk 500 2016-08-04 0
# 4 milk 500 2016-08-10 4
# 5 eggs 400 2016-08-10 0
# 6 eggs 400 2016-08-16 4
# 7 eggs 500 2016-08-11 0
# 8 eggs 500 2016-08-15 2
If you want to group by product and bpid, you could try:
# install.packages("bizdays")
library(bizdays)
cal <- create.calendar(name="mycal", weekdays=c("saturday", "sunday"))
df$result <- with(df, ave(as.integer(date), product, bpid, FUN=function(x) {
  x <- as.Date(x, origin="1970-01-01")
  c(0, bizdays(head(x, -1), tail(x, -1), cal))
}))
df
# product bpid date result
# 1 Keyt 30057 2014-11-21 0
# 2 Keyt 30057 2015-05-05 117
# 3 Keyt 30057 2015-05-11 4
# 4 Keyt 30058 2014-10-16 0
# 5 Keyt 30058 2014-11-03 12
# 6 Keyt 30058 2016-03-15 356
Note that I converted date to integer and back to Date inside the function because otherwise ave throws an error:
Error in as.Date.numeric(value) : 'origin' must be supplied
and I don't know how to supply that origin argument here.
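One way to sidestep the origin issue entirely is a grouped := in data.table, which never coerces the Date column (a sketch, reusing cal from above):
library(data.table)
setDT(df)[, result := c(0, bizdays(head(date, -1), tail(date, -1), cal)),
          by = .(product, bpid)]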
Let's say I have a dataframe of timestamps with the corresponding number of tickets sold at that time.
Timestamp ticket_count
(time) (int)
1 2016-01-01 05:30:00 1
2 2016-01-01 05:32:00 1
3 2016-01-01 05:38:00 1
4 2016-01-01 05:46:00 1
5 2016-01-01 05:47:00 1
6 2016-01-01 06:07:00 1
7 2016-01-01 06:13:00 2
8 2016-01-01 06:21:00 1
9 2016-01-01 06:22:00 1
10 2016-01-01 06:25:00 1
I want to know how to calculate the number of tickets sold within a certain time frame of all tickets. For example, I want to calculate the number of tickets sold up to 15 minutes after all tickets. In this case, the first row would have three tickets, the second row would have four tickets, etc.
Ideally, I'm looking for a dplyr solution, as I want to do this for multiple stores with a group_by() function. However, I'm having a little trouble figuring out how to hold each Timestamp fixed for a given row while simultaneously searching through all Timestamps via dplyr syntax.
In the current development version of data.table, v1.9.7, non-equi joins are implemented. Assuming your data.frame is called df and the Timestamp column is POSIXct type:
require(data.table) # v1.9.7+
window = 15L # minutes
(counts = setDT(df)[.(t=Timestamp+window*60L), on=.(Timestamp<t),
.(counts=sum(ticket_count)), by=.EACHI]$counts)
# [1] 3 4 5 5 5 9 11 11 11 11
# add that as a column to original data.table by reference
df[, counts := counts]
For each row of t, all rows where df$Timestamp < that row's t are fetched. And by=.EACHI instructs the expression sum(ticket_count) to run for each row of t. That gives your desired result.
Hope this helps.
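If you instead want each count restricted to that ticket's own window, i.e. [d, d+15 min] rather than everything before d+15, a lower bound can be added in on= -- a sketch (the inclusive upper edge is an assumption, chosen to match the 4 expected for row 2):
df[.(d = Timestamp, t = Timestamp + window*60L),
   on = .(Timestamp >= d, Timestamp <= t),
   .(counts = sum(ticket_count)), by = .EACHI]$counts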
This is a simpler version of the ugly one I wrote earlier. Note that cut() bins tickets into fixed 15-minute windows rather than a window anchored at each ticket, so the numbers differ from the rolling counts above.
# install.packages('dplyr')
library(dplyr)
your_data %>%
mutate(timestamp = as.POSIXct(timestamp, format = '%m/%d/%Y %H:%M'),
ticket_count = as.numeric(ticket_count)) %>%
mutate(window = cut(timestamp, '15 min')) %>%
group_by(window) %>%
dplyr::summarise(tickets = sum(ticket_count))
window tickets
(fctr) (dbl)
1 2016-01-01 05:30:00 3
2 2016-01-01 05:45:00 2
3 2016-01-01 06:00:00 3
4 2016-01-01 06:15:00 3
Here is a solution using data.table. Also incorporating different stores.
Example data:
library(data.table)
dt <- data.table(Timestamp = as.POSIXct("2016-01-01 05:30:00")+seq(60,120000,by=60),
ticket_count = sample(1:9, 2000, T),
store = c(rep(c("A","B","C","D"), 500)))
Now apply the following:
ts <- dt$Timestamp
for (x in ts) {
  end <- x + 900  # 900 seconds = 15 minutes; for() iterates over the underlying numeric values
  dt[Timestamp <= end & Timestamp >= x, CS := sum(ticket_count), by = store]
}
This gives you
Timestamp ticket_count store CS
1: 2016-01-01 05:31:00 3 A 13
2: 2016-01-01 05:32:00 5 B 20
3: 2016-01-01 05:33:00 3 C 19
4: 2016-01-01 05:34:00 7 D 12
5: 2016-01-01 05:35:00 1 A 15
---
1996: 2016-01-02 14:46:00 4 D 10
1997: 2016-01-02 14:47:00 9 A 9
1998: 2016-01-02 14:48:00 2 B 2
1999: 2016-01-02 14:49:00 2 C 2
2000: 2016-01-02 14:50:00 6 D 6