I have a data table with a bunch of entries that have start and end dates, like so:
id
start
end
1
1958-01-03
1962-10-11
2
1961-02-23
2012-04-28
etc.
I want to count how many of these items were in operation, by month. So I tried to do this:
data.table(
month = seq(as.Date('1950-01-01','%Y-%m-%d'), as.Date('2021-09-01','%Y-%m-%d'), 'months'),
month_end = seq(as.Date('1950-02-01','%Y-%m-%d'), as.Date('2021-10-01', '%Y-%m-%d'), 'months') -1
) %>%
.[,count := satcat[start >= month & month_end <= end,.N]] %>%
.[]
What I get, however, is an error:
Warning message in `>.default`(start, month):
“longer object length is not a multiple of shorter object length”
Warning message in `<=.default`(month_end, end):
“longer object length is not a multiple of shorter object length”
and count is the same number for all rows. Why is this happening, and what is the correct way of doing this? I feel like there should be some apply solution, but I can't work it out.
This is a job for foverlaps:
library(data.table)
DT <- data.table(id = 1:2,
start = as.Date(c("1958-01-03", "1961-02-23")),
end = as.Date(c("1961-10-11", "2012-04-28")))
periods <- data.table(start = seq(as.Date('1950-01-01','%Y-%m-%d'), as.Date('2021-09-01','%Y-%m-%d'), 'months'),
end = seq(as.Date('1950-02-01','%Y-%m-%d'), as.Date('2021-10-01', '%Y-%m-%d'), 'months') -1)
setkey(DT, start, end)
setkey(periods, start, end)
res <- foverlaps(periods, DT, nomatch = NA)[, .(N = sum(!is.na(id))), by = .(i.start, i.end)]
plot(N ~ i.start, data = res, type = "s")
res[N == 2]
# i.start i.end N
#1: 1961-02-01 1961-02-28 2
#2: 1961-03-01 1961-03-31 2
#3: 1961-04-01 1961-04-30 2
#4: 1961-05-01 1961-05-31 2
#5: 1961-06-01 1961-06-30 2
#6: 1961-07-01 1961-07-31 2
#7: 1961-08-01 1961-08-31 2
#8: 1961-09-01 1961-09-30 2
#9: 1961-10-01 1961-10-31 2
Related
I have a data.table with dates in it (as factor variables). I am getting the lag values from these. How can I tell R to run the get the lag values only for the observations dated semi-last? In this case this would be start == "01.01.2015"?
example data:
ID <- rep("A5", 15)
product <- rep(c("prod1","prod2","prod3", "prod55", "prod4", "prod9", "prod83"),3)
start <- c(rep("01.01.2016", 3), rep("01.01.2015", 3), rep("01.01.2014",3),
rep("01.01.2013",3), rep("01.01.2012",3))
prodID <- c(3,1,2,3,1,2,3,1,2,3,2,1,3,1,2)
mydata <- cbind(ID, product[1:15], start, prodID)
mydata <- as.data.table(mydata)
mydata[, (nameCols) := shift(.SD, 3, fill = "NA", "lead"), .SDcols= c("start", "V2"), by = "prodID"]
For now I have used this to get to my results:
mydata[start == "01.01.2015"]
The problem is that the semi-last date is not always the same date. I will be repeating this procedure many times and i want to avoid having to specify this by hand. Any ideas?
Convert the data to date object and sort to select semi-last date.
library(data.table)
mydata[, start := as.IDate(start, '%d.%m.%Y')]
mydata[start == sort(unique(start), decreasing = TRUE)[2]]
# ID V2 start prodID
#1: A5 prod55 2015-01-01 3
#2: A5 prod4 2015-01-01 1
#3: A5 prod9 2015-01-01 2
I have a data set with time periods, that may overlap, showing me if somebody was present (example_df). I want to get a data set that splits a large time period (from 2014-01-01 to 2014-10-31) into smaller time periods where somebody was present (present = 1) and time periods where nobody was present (present = 0).
The result should look like result_df
Example data frame
example_df <- data.frame(ID = 1,
start = c(as.Date("2014-01-01"), as.Date("2014-03-05"), as.Date("2014-06-13"), as.Date("2014-08-15")),
end = c(as.Date("2014-04-07"), as.Date("2014-04-12"), as.Date("2014-08-05"), as.Date("2014-10-02")),
present = 1)
Result should look like this
result_df <- data.frame(ID = 1,
start = c(as.Date("2014-01-01"), as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02")),
end = c(as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02"), as.Date("2014-10-31")),
present = c(1, 0, 1, 0, 1, 0))
I have no idea how to tackle this problem as it requires to split time periods or add rows (or something else?). Any help is much appreciated!
I hope I can be helpful, as I have struggled with this as well.
As in IceCreamToucan's example, this assumes independence by person ID. This approach uses dplyr to look at overlap in date ranges and then flattens them. Other examples of this approach have been described in stackoverflow and use dplyr. The end result includes time ranges where the person is present.
library(tidyr)
library(dplyr)
pres <- example_df %>%
group_by(ID) %>%
arrange(start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = min(start), end = max(end), present = 1) %>%
select(-indx)
Then, additional rows can be added to indicate time period when not present. In these cases, for a given ID, it will determine gaps between an older end date and a newer (more recent) start date. Then finally the result is ordered by ID and the start date.
result <- pres
for (i in unique(pres$ID)) {
pres_i <- subset(pres, ID == i)
if (nrow(pres_i) > 1) {
adding <- data.frame(ID = i, start = pres_i$end[-nrow(pres_i)]+1, end = pres_i$start[-1]-1, present = 0)
adding <- adding[adding$start <= adding$end, ]
result <- bind_rows(result, adding)
}
}
result[order(result$ID, result$start), ]
# A tibble: 5 x 4
# Groups: ID [1]
ID start end present
<dbl> <date> <date> <dbl>
1 1 2014-01-01 2014-04-12 1
2 1 2014-04-13 2014-06-12 0
3 1 2014-06-13 2014-08-05 1
4 1 2014-08-06 2014-08-14 0
5 1 2014-08-15 2014-10-02 1
Assuming you want to do it separately for each ID, you can create a data table with all dates for which someone was present, and join that with a table of all dates over that time period. The result is not exactly the same, because the present and not-present periods don't overlap.
library(data.table)
setDT(example_df)
example_df[, {
pres <- unique(unlist(Map(`:`, start, end)))
class(pres) <- 'Date'
all <- min(pres):max(pres)
class(all) <- 'Date'
pres <- data.table(day = pres)
all <- data.table(day = all)
out.full <- pres[all, on = .(day), .(day = i.day, present = +!is.na(x.day))]
out.full[, .(start = min(day), end = max(day)),
by = .(present, rid = rleid(present))][, -'rid']
}, by = ID]
# ID present start end
# 1: 1 1 2014-01-01 2014-04-12
# 2: 1 0 2014-04-13 2014-06-12
# 3: 1 1 2014-06-13 2014-08-05
# 4: 1 0 2014-08-06 2014-08-14
# 5: 1 1 2014-08-15 2014-10-02
Below is an example of what I want to achieve with a reproducible example.
I have a data.table with months as the time id. I want to make some computations on the data for the last 5 yrs, last 10 yrs etc. to the last month. (i.e. the last 5*12 months, last 10*12 months, etc)
I have a way of doing it, but I suspect it goes through many unnecessary intermediate variables.
library(lubridate) #For easy creation of time-series
library(data.table)
set.seed(5)
DT <- data.table(
Month = as.Date(sapply(0:329, function(i)(as.Date('1990-01-01')%m+%months(i))), origin = '1970-01-01'),
Value = round(runif(330, min = 20, max = 40), digits = 2)
)
> DT
Month Value
1: 1990-01-01 24.00
2: 1990-02-01 33.70
3: 1990-03-01 38.34
4: 1990-04-01 25.69
5: 1990-05-01 22.09
---
326: 2017-02-01 20.91
327: 2017-03-01 38.96
328: 2017-04-01 28.91
329: 2017-05-01 26.09
330: 2017-06-01 35.16
## Create a vector of the first months marking the start of the 60 or 120 month period
last.month <- max(DT[['Month']])
first.months <- as.Date(sapply(seq(5, 25, by = 5), function(i)(last.month
%m-% months(i*12 - 1))), origin = '1970-01-01')
## Construction of table of interest
yrs <- paste0(seq(5, 25, by = 5), 'Yrs')
features <- data.table(
Period = factor(yrs, levels = yrs), Feature.1 = as.numeric(NA),
Feature.2 = as.numeric(NA)
)
for(i in 1:nrow(features)){
DT_n <- DT[Month>=first.months[i], ]
set(features, i, 'Feature.1', DT_n[, mean(Value)]) #mean used as an example operation
set(features, i, 'Feature.2', DT_n[, var(Value)]) #var used as an example operation
}
Finally, this is the table I am interested in -
> features
Period Feature.1 Feature.2
1: 5Yrs 29.68817 35.80375
2: 10Yrs 29.25542 39.50981
3: 15Yrs 29.64950 37.41900
4: 20Yrs 29.63454 34.51793
5: 25Yrs 29.84373 35.90916
What might be the best way in the data.table parlance to achieve this goal? Any improvement in terms of unnecessary variable reduction or efficiency is appreciated.
Thank you!
Another approach:
rbindlist(lapply(first.months,
function(m) data.table(val_mean = mean(DT[Month >= m]$Value),
val_var = var(DT[Month >= m]$Value)))
)[, Period := yrs][]
which gives:
val_mean val_var Period
1: 29.68817 35.80375 5Yrs
2: 29.25542 39.50981 10Yrs
3: 29.64950 37.41900 15Yrs
4: 29.63454 34.51793 20Yrs
5: 29.84373 35.90916 25Yrs
Or a variation on the approach above with setNames and the idcol-parameter of rbindlist:
rbindlist(setNames(lapply(first.months,
function(m) data.table(val_mean = mean(DT$Value[DT$Month >= m]),
val_var = var(DT$Value[DT$Month >= m]))),
yrs),
idcol = 'Period')
which gives:
Period val_mean val_var
1: 5Yrs 29.68817 35.80375
2: 10Yrs 29.25542 39.50981
3: 15Yrs 29.64950 37.41900
4: 20Yrs 29.63454 34.51793
5: 25Yrs 29.84373 35.90916
Here's another data.table approach you can try out. After constructing the first.months and yrs vectors, you can put them into a separate data.table:
m <- data.table(firstmonths = first.months, yrs = yrs, key = "yrs")
And then use non-equi joins to compute the results:
rbindlist(lapply(yrs, function(y) {
DT[m[y], on = .(Month >= firstmonths), .(mean = mean(Value),
var = var(Value),
Period = y)]
}))
# mean var Period
#1: 29.68817 35.80375 5Yrs
#2: 29.25542 39.50981 10Yrs
#3: 29.64950 37.41900 15Yrs
#4: 29.63454 34.51793 20Yrs
#5: 29.84373 35.90916 25Yrs
I have a dataframe with a number of accounts, their status and the start and endtime for that status. I would like to report on the number of accounts in each of these statuses over a date range. The data looks like the df below, with the resulting report. (Actual data contains more state values. N/A values are shown with a dummy date far in the future.)
df <- data.frame(account = c(1,1,2,3),
state = c("Open","Closed","Open","Open"),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05")
)
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 6),
number.open = c(2,2,2,1,1,1)
)
I have looked at options involving rowwise() and mutate from dplyr and foverlaps from data.table, but haven't been able to code it up so it works.
(See Checking if Date is Between two Dates in R)
We can use sapply to do this for us:
report$NumberOpen <-
sapply(report$date, function(x)
sum(as.Date(df1$startdate) < as.Date(x) &
as.Date(df1$enddate) > as.Date(x) &
df1$state == 'Open'))
# report
# date NumberOpen
# 1 2016-04-01 2
# 2 2016-04-02 2
# 3 2016-04-03 2
# 4 2016-04-04 1
# 5 2016-04-05 1
# 6 2016-04-06 1
data
df1 <- data.frame(account = c(1,1,2,3),
state = c("Open","Closed","Open","Open"),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05")
)
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 6)
)
Suppose I have two datasets. One contains a list of promotions with start/end dates, and the other contains monthly sales data for each program.
promotions = data.frame(
start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
program = c("a", "a", "a", "b", "b"))
sales = data.frame(
year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
program = c("a", "b", "a", "a", "b"),
monthly.sales = c(200, 200, 200, 400, 200))
Note that sales$year.month.day is used to indicate year/month. Day is included so R can more simply treat the column as a vector of date objects, but it isn't relevant to the actual sales.
I need to determine the number of promotions that occurred per month for each program. Here's an example of a loop that produces the output I want:
sales$count = rep(0, nrow(sales))
sub = list()
for (i in 1:nrow(sales)) {
sub[[i]] = promotions[which(promotions$program == sales$program[i]),]
if (nrow(sub[[i]]) > 1) {
for (j in 1:nrow(sub[[i]])) {
if (sales$year.month.day[i] %in% seq(from = as.Date(sub[[i]]$start.date[j]), to = as.Date(sub[[i]]$end.date[j]), by = "day")) {
sales$count[i] = sales$count[i] + 1
}
}
}
}
Example output:
sales = data.frame(
year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
program = c("a", "b", "a", "a", "b"),
monthly.sales = c(200, 200, 200, 400, 200),
count = c(3, 1, 3, 3, 2)
)
However since my actual datasets are very large, this loop crashes when I run it in R.
Is there a more efficient way to achieve the same result? Perhaps something with dplyr?
Using the newly implemented non-equi joins from the current development version of data.table:
require(data.table) # v1.9.7+
setDT(promotions) # convert to data.table by reference
setDT(sales)
ans = promotions[sales, .(monthly.sales, .N), by=.EACHI, allow.cartesian=TRUE,
on=.(program, start.date<=year.month.day, end.date>=year.month.day), nomatch=0L]
ans[, end.date := NULL]
setnames(ans, "start.date", "year.month.date")
# program year.month.date monthly.sales N
# 1: a 2013-02-01 200 3
# 2: b 2014-09-01 200 1
# 3: a 2013-08-01 200 3
# 4: a 2013-04-01 400 3
# 5: b 2012-11-01 200 2
See installation instructions for development version here.
You can do this with sql.
library(sqldf)
sqldf("select s.ymd,p.program,s.monthlysales, count(*) from promotions p outer left join sales s on p.program=s.program
where s.ymd between p.startdate and p.enddate and p.program=s.program group by s.ymd, s.program" )
This would first join the 2 data set where ymd in sales is between the start and end date of promotion and program in both the data are same. then it would group by ymd and count the instance. I have removed the periods from the names of variable.
I am a fan of Hadley's packages:
library(dplyr)
library(lubridate)
Floor dates so they are in the same format as the sales dataframe:
df <- promotions %>%
mutate(start.date = floor_date(start.date, unit = "month"),
end.date = floor_date(end.date, unit = "month"))
Expand the date intervals:
df$output <- mapply(function(x,y) seq(x, y, by = "month"),
df$start.date,
df$end.date)
Expand the data frame based on the date ranges, group and count, and merge to sales on date and program:
df %>% tidyr::unnest(output) %>%
group_by(output, program) %>%
summarise(prom_num = n()) %>%
merge(sales, .,
by.x = c("year.month.day", "program"),
by.y = c("output", "program"))
Output:
year.month.day program monthly.sales prom_num
1 2012-11-01 b 200 2
2 2013-02-01 a 200 3
3 2013-04-01 a 400 3
4 2013-08-01 a 200 3
5 2014-09-01 b 200 1
Could try ?data.table::foverlaps for that
library(data.table)
setDT(sales)[, c("start.date", "end.date") := year.month.day] # Add overlap cols
setkey(sales, program, start.date, end.date) # Key for join
res <- foverlaps(setDT(promotions), sales)[, .N, by = year.month.day] # Count joins
sales[res, count := i.N, on = "year.month.day"] # Update `sales` with results
sales
# year.month.day program monthly.sales start.date end.date count
# 1: 2013-02-01 a 200 2013-02-01 2013-02-01 3
# 2: 2013-04-01 a 400 2013-04-01 2013-04-01 3
# 3: 2013-08-01 a 200 2013-08-01 2013-08-01 3
# 4: 2012-11-01 b 200 2012-11-01 2012-11-01 2
# 5: 2014-09-01 b 200 2014-09-01 2014-09-01 1
This is basically creates interval column in sales, joins by them + by program, counts overlaps, and joins back to sales. You could remove the additional columns by doing sales[, c("start.date", "end.date") := NULL] if it really bothers you. Google foverlaps and data.table for more examples