Combine datasets by date range and categorical variable - r

Suppose I have two datasets. One contains a list of promotions with start/end dates, and the other contains monthly sales data for each program.
promotions = data.frame(
start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
program = c("a", "a", "a", "b", "b"))
sales = data.frame(
year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
program = c("a", "b", "a", "a", "b"),
monthly.sales = c(200, 200, 200, 400, 200))
Note that sales$year.month.day is used to indicate year/month. Day is included so R can more simply treat the column as a vector of date objects, but it isn't relevant to the actual sales.
I need to determine the number of promotions that occurred per month for each program. Here's an example of a loop that produces the output I want:
sales$count = rep(0, nrow(sales))
sub = list()
for (i in 1:nrow(sales)) {
sub[[i]] = promotions[which(promotions$program == sales$program[i]),]
if (nrow(sub[[i]]) > 1) {
for (j in 1:nrow(sub[[i]])) {
if (sales$year.month.day[i] %in% seq(from = as.Date(sub[[i]]$start.date[j]), to = as.Date(sub[[i]]$end.date[j]), by = "day")) {
sales$count[i] = sales$count[i] + 1
}
}
}
}
Example output:
sales = data.frame(
year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
program = c("a", "b", "a", "a", "b"),
monthly.sales = c(200, 200, 200, 400, 200),
count = c(3, 1, 3, 3, 2)
)
However since my actual datasets are very large, this loop crashes when I run it in R.
Is there a more efficient way to achieve the same result? Perhaps something with dplyr?

Using the newly implemented non-equi joins from the current development version of data.table:
require(data.table) # v1.9.7+
setDT(promotions) # convert to data.table by reference
setDT(sales)
ans = promotions[sales, .(monthly.sales, .N), by=.EACHI, allow.cartesian=TRUE,
on=.(program, start.date<=year.month.day, end.date>=year.month.day), nomatch=0L]
ans[, end.date := NULL]
setnames(ans, "start.date", "year.month.date")
# program year.month.date monthly.sales N
# 1: a 2013-02-01 200 3
# 2: b 2014-09-01 200 1
# 3: a 2013-08-01 200 3
# 4: a 2013-04-01 400 3
# 5: b 2012-11-01 200 2
See installation instructions for development version here.

You can do this with sql.
library(sqldf)
sqldf("select s.ymd,p.program,s.monthlysales, count(*) from promotions p outer left join sales s on p.program=s.program
where s.ymd between p.startdate and p.enddate and p.program=s.program group by s.ymd, s.program" )
This would first join the 2 data set where ymd in sales is between the start and end date of promotion and program in both the data are same. then it would group by ymd and count the instance. I have removed the periods from the names of variable.

I am a fan of Hadley's packages:
library(dplyr)
library(lubridate)
Floor dates so they are in the same format as the sales dataframe:
df <- promotions %>%
mutate(start.date = floor_date(start.date, unit = "month"),
end.date = floor_date(end.date, unit = "month"))
Expand the date intervals:
df$output <- mapply(function(x,y) seq(x, y, by = "month"),
df$start.date,
df$end.date)
Expand the data frame based on the date ranges, group and count, and merge to sales on date and program:
df %>% tidyr::unnest(output) %>%
group_by(output, program) %>%
summarise(prom_num = n()) %>%
merge(sales, .,
by.x = c("year.month.day", "program"),
by.y = c("output", "program"))
Output:
year.month.day program monthly.sales prom_num
1 2012-11-01 b 200 2
2 2013-02-01 a 200 3
3 2013-04-01 a 400 3
4 2013-08-01 a 200 3
5 2014-09-01 b 200 1

Could try ?data.table::foverlaps for that
library(data.table)
setDT(sales)[, c("start.date", "end.date") := year.month.day] # Add overlap cols
setkey(sales, program, start.date, end.date) # Key for join
res <- foverlaps(setDT(promotions), sales)[, .N, by = year.month.day] # Count joins
sales[res, count := i.N, on = "year.month.day"] # Update `sales` with results
sales
# year.month.day program monthly.sales start.date end.date count
# 1: 2013-02-01 a 200 2013-02-01 2013-02-01 3
# 2: 2013-04-01 a 400 2013-04-01 2013-04-01 3
# 3: 2013-08-01 a 200 2013-08-01 2013-08-01 3
# 4: 2012-11-01 b 200 2012-11-01 2012-11-01 2
# 5: 2014-09-01 b 200 2014-09-01 2014-09-01 1
This is basically creates interval column in sales, joins by them + by program, counts overlaps, and joins back to sales. You could remove the additional columns by doing sales[, c("start.date", "end.date") := NULL] if it really bothers you. Google foverlaps and data.table for more examples

Related

Counting data.table entries between dates

I have a data table with a bunch of entries that have start and end dates, like so:
id
start
end
1
1958-01-03
1962-10-11
2
1961-02-23
2012-04-28
etc.
I want to count how many of these items were in operation, by month. So I tried to do this:
data.table(
month = seq(as.Date('1950-01-01','%Y-%m-%d'), as.Date('2021-09-01','%Y-%m-%d'), 'months'),
month_end = seq(as.Date('1950-02-01','%Y-%m-%d'), as.Date('2021-10-01', '%Y-%m-%d'), 'months') -1
) %>%
.[,count := satcat[start >= month & month_end <= end,.N]] %>%
.[]
What I get, however, is an error:
Warning message in `>.default`(start, month):
“longer object length is not a multiple of shorter object length”
Warning message in `<=.default`(month_end, end):
“longer object length is not a multiple of shorter object length”
and count is the same number for all rows. Why is this happening, and what is the correct way of doing this? I feel like there should be some apply solution, but I can't work it out.
This is a job for foverlaps:
library(data.table)
DT <- data.table(id = 1:2,
start = as.Date(c("1958-01-03", "1961-02-23")),
end = as.Date(c("1961-10-11", "2012-04-28")))
periods <- data.table(start = seq(as.Date('1950-01-01','%Y-%m-%d'), as.Date('2021-09-01','%Y-%m-%d'), 'months'),
end = seq(as.Date('1950-02-01','%Y-%m-%d'), as.Date('2021-10-01', '%Y-%m-%d'), 'months') -1)
setkey(DT, start, end)
setkey(periods, start, end)
res <- foverlaps(periods, DT, nomatch = NA)[, .(N = sum(!is.na(id))), by = .(i.start, i.end)]
plot(N ~ i.start, data = res, type = "s")
res[N == 2]
# i.start i.end N
#1: 1961-02-01 1961-02-28 2
#2: 1961-03-01 1961-03-31 2
#3: 1961-04-01 1961-04-30 2
#4: 1961-05-01 1961-05-31 2
#5: 1961-06-01 1961-06-30 2
#6: 1961-07-01 1961-07-31 2
#7: 1961-08-01 1961-08-31 2
#8: 1961-09-01 1961-09-30 2
#9: 1961-10-01 1961-10-31 2

choose semi-last observations based on date in data.table in R

I have a data.table with dates in it (as factor variables). I am getting the lag values from these. How can I tell R to run the get the lag values only for the observations dated semi-last? In this case this would be start == "01.01.2015"?
example data:
ID <- rep("A5", 15)
product <- rep(c("prod1","prod2","prod3", "prod55", "prod4", "prod9", "prod83"),3)
start <- c(rep("01.01.2016", 3), rep("01.01.2015", 3), rep("01.01.2014",3),
rep("01.01.2013",3), rep("01.01.2012",3))
prodID <- c(3,1,2,3,1,2,3,1,2,3,2,1,3,1,2)
mydata <- cbind(ID, product[1:15], start, prodID)
mydata <- as.data.table(mydata)
mydata[, (nameCols) := shift(.SD, 3, fill = "NA", "lead"), .SDcols= c("start", "V2"), by = "prodID"]
For now I have used this to get to my results:
mydata[start == "01.01.2015"]
The problem is that the semi-last date is not always the same date. I will be repeating this procedure many times and i want to avoid having to specify this by hand. Any ideas?
Convert the data to date object and sort to select semi-last date.
library(data.table)
mydata[, start := as.IDate(start, '%d.%m.%Y')]
mydata[start == sort(unique(start), decreasing = TRUE)[2]]
# ID V2 start prodID
#1: A5 prod55 2015-01-01 3
#2: A5 prod4 2015-01-01 1
#3: A5 prod9 2015-01-01 2

The data.table way to perform operation on columns grouped by last n years

Below is an example of what I want to achieve with a reproducible example.
I have a data.table with months as the time id. I want to make some computations on the data for the last 5 yrs, last 10 yrs etc. to the last month. (i.e. the last 5*12 months, last 10*12 months, etc)
I have a way of doing it, but I suspect it goes through many unnecessary intermediate variables.
library(lubridate) #For easy creation of time-series
library(data.table)
set.seed(5)
DT <- data.table(
Month = as.Date(sapply(0:329, function(i)(as.Date('1990-01-01')%m+%months(i))), origin = '1970-01-01'),
Value = round(runif(330, min = 20, max = 40), digits = 2)
)
> DT
Month Value
1: 1990-01-01 24.00
2: 1990-02-01 33.70
3: 1990-03-01 38.34
4: 1990-04-01 25.69
5: 1990-05-01 22.09
---
326: 2017-02-01 20.91
327: 2017-03-01 38.96
328: 2017-04-01 28.91
329: 2017-05-01 26.09
330: 2017-06-01 35.16
## Create a vector of the first months marking the start of the 60 or 120 month period
last.month <- max(DT[['Month']])
first.months <- as.Date(sapply(seq(5, 25, by = 5), function(i)(last.month
%m-% months(i*12 - 1))), origin = '1970-01-01')
## Construction of table of interest
yrs <- paste0(seq(5, 25, by = 5), 'Yrs')
features <- data.table(
Period = factor(yrs, levels = yrs), Feature.1 = as.numeric(NA),
Feature.2 = as.numeric(NA)
)
for(i in 1:nrow(features)){
DT_n <- DT[Month>=first.months[i], ]
set(features, i, 'Feature.1', DT_n[, mean(Value)]) #mean used as an example operation
set(features, i, 'Feature.2', DT_n[, var(Value)]) #var used as an example operation
}
Finally, this is the table I am interested in -
> features
Period Feature.1 Feature.2
1: 5Yrs 29.68817 35.80375
2: 10Yrs 29.25542 39.50981
3: 15Yrs 29.64950 37.41900
4: 20Yrs 29.63454 34.51793
5: 25Yrs 29.84373 35.90916
What might be the best way in the data.table parlance to achieve this goal? Any improvement in terms of unnecessary variable reduction or efficiency is appreciated.
Thank you!
Another approach:
rbindlist(lapply(first.months,
function(m) data.table(val_mean = mean(DT[Month >= m]$Value),
val_var = var(DT[Month >= m]$Value)))
)[, Period := yrs][]
which gives:
val_mean val_var Period
1: 29.68817 35.80375 5Yrs
2: 29.25542 39.50981 10Yrs
3: 29.64950 37.41900 15Yrs
4: 29.63454 34.51793 20Yrs
5: 29.84373 35.90916 25Yrs
Or a variation on the approach above with setNames and the idcol-parameter of rbindlist:
rbindlist(setNames(lapply(first.months,
function(m) data.table(val_mean = mean(DT$Value[DT$Month >= m]),
val_var = var(DT$Value[DT$Month >= m]))),
yrs),
idcol = 'Period')
which gives:
Period val_mean val_var
1: 5Yrs 29.68817 35.80375
2: 10Yrs 29.25542 39.50981
3: 15Yrs 29.64950 37.41900
4: 20Yrs 29.63454 34.51793
5: 25Yrs 29.84373 35.90916
Here's another data.table approach you can try out. After constructing the first.months and yrs vectors, you can put them into a separate data.table:
m <- data.table(firstmonths = first.months, yrs = yrs, key = "yrs")
And then use non-equi joins to compute the results:
rbindlist(lapply(yrs, function(y) {
DT[m[y], on = .(Month >= firstmonths), .(mean = mean(Value),
var = var(Value),
Period = y)]
}))
# mean var Period
#1: 29.68817 35.80375 5Yrs
#2: 29.25542 39.50981 10Yrs
#3: 29.64950 37.41900 15Yrs
#4: 29.63454 34.51793 20Yrs
#5: 29.84373 35.90916 25Yrs

Setting a value in one dataframe by looking its value in another dataframe based on dates

I have 2 data frames in R.Data1 has 2 columns id, date and Data2 has 3 columns id, date, level.I want to set level column in Data1 based on level and date columns in Data2.
Data1 = data.frame(id = c(1,1,1), dates = c("2014-06","2016-02","2016-05"))
id date
1 2014-06
1 2016-02
1 2016-05
Data2 = data.frame(id = c(1,1,1), dates = c("2015-07","2016-04","2016-07"), level=c(3,4,5))
id date level
1 2015-07 3
1 2016-04 4
1 2016-07 5
So resulting data frame should be:
id date level
1 2014-06 NULL
1 2016-02 3
1 2016-05 4
You can accomplish this with the rolling joins from the data.table-package and converting the dates-columns to a date-class (see the note at the end of this post):
library(data.table)
setDT(Data1, key = c('id','dates'))
setDT(Data2, key = c('id','dates'))
Data1[Data2, lev := level, roll = -Inf, rollends = c(TRUE,FALSE)][]
which gives:
> Data1
id dates lev
1: 1 2014-06-01 NA
2: 1 2016-02-01 3
3: 1 2016-05-01 4
Explanation:
Convert the dataframes to datatables with setDT and set the key to the columns which are needed for the join
Join and create a new variable in Data1 with lev := level. With roll = -Inf you roll backwards and with rollends = c(TRUE,FALSE) you only roll the first value backwards.
Setting the keys beforehand isn't necessary. You could also do:
setDT(Data1)
setDT(Data2)
Data1[Data2, on = c('id','dates'), lev := level, roll = -Inf, rollends = c(TRUE,FALSE)][]
Used data:
Data1 = data.frame(id = c(1,1,1), dates = c("2014-06","2016-02","2016-05"))
Data2 = data.frame(id = c(1,1,1), dates = c("2015-07","2016-04","2016-07"), level=c(3,4,5))
Data1$dates <- as.Date(paste0(Data1$dates,'-01'))
Data2$dates <- as.Date(paste0(Data2$dates,'-01'))
NOTE: I converted the dates-columns to a date-format by adding the first day to each month. This necessary in order to properly do a rolling join as specified.

Count number of occurences in date range in R

I have a dataframe with a number of accounts, their status and the start and endtime for that status. I would like to report on the number of accounts in each of these statuses over a date range. The data looks like the df below, with the resulting report. (Actual data contains more state values. N/A values are shown with a dummy date far in the future.)
df <- data.frame(account = c(1,1,2,3),
state = c("Open","Closed","Open","Open"),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05")
)
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 6),
number.open = c(2,2,2,1,1,1)
)
I have looked at options involving rowwise() and mutate from dplyr and foverlaps from data.table, but haven't been able to code it up so it works.
(See Checking if Date is Between two Dates in R)
We can use sapply to do this for us:
report$NumberOpen <-
sapply(report$date, function(x)
sum(as.Date(df1$startdate) < as.Date(x) &
as.Date(df1$enddate) > as.Date(x) &
df1$state == 'Open'))
# report
# date NumberOpen
# 1 2016-04-01 2
# 2 2016-04-02 2
# 3 2016-04-03 2
# 4 2016-04-04 1
# 5 2016-04-05 1
# 6 2016-04-06 1
data
df1 <- data.frame(account = c(1,1,2,3),
state = c("Open","Closed","Open","Open"),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05")
)
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 6)
)

Resources