Aggregating timely data efficiently in R with dplyr

I have data like this:
set.seed(2020)
df_time = data.frame(Time = as.Date(1:100), value = round(runif(100, min = 0, 100)))
head(df_time)
Time value
1 1970-01-02 65
2 1970-01-03 39
3 1970-01-04 62
4 1970-01-05 48
5 1970-01-06 14
6 1970-01-07 7
7 1970-01-08 13
8 1970-01-09 39
9 1970-01-10 0
10 1970-01-11 62
And this:
df = data.frame(from = as.Date(c(3, 6, 20)), to = as.Date(c(8, 7, 24)))
head(df)
My goal is to mutate the dataframe df such that it adds the sum of the values between the two given dates (excluding the from date and including the to date), i.e.
from to sum_value
1 1970-01-04 1970-01-09 121
2 1970-01-07 1970-01-08 13
3 1970-01-21 1970-01-25 204
At the moment I am doing it with a loop, but that takes way too long for the amount of data I have.
Does anyone know a better solution (e.g. with dplyr)?

You can use rowwise() in this case to tell dplyr to evaluate df row by row:
df %>%
rowwise() %>%
mutate(sum_value = df_time %>%
filter(Time > from, Time <= to) %>%
pull(value) %>%
sum()) %>%
ungroup() # ungroup is used to restore default behaviour of dplyr
#> # A tibble: 3 x 3
#> from to sum_value
#> <date> <date> <dbl>
#> 1 1970-01-04 1970-01-09 121
#> 2 1970-01-07 1970-01-08 13
#> 3 1970-01-21 1970-01-25 204
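This is how I would do it, since I find the code easy to understand. As far as I understand, though, it is basically still a loop under the hood.
A different and potentially faster approach could be to use data.table. I'm not as familiar with the syntax so there might be a better way, but this works. Note that the first row of the result (with NA dates) collects the values of all df_time rows that fall in none of the intervals: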
library(data.table)
# convert to data.table
df_time <- setDT(df_time)
df <- setDT(df)
# duplicate Time column since foverlaps needs two of them
df_time <- df_time[, Time2 := Time]
# Since from day should not be included
df <- df[, from := from + 1]
setkey(df, from, to)
res <- foverlaps(df_time, df, by.x = c("Time", "Time2"), by.y = c("from", "to"), type = "within")
res <- res[, .(value = sum(value)), keyby = .(from, to)]
res[, from := from - 1]
#> from to value
#> 1: <NA> <NA> 4622
#> 2: 1970-01-04 1970-01-09 121
#> 3: 1970-01-07 1970-01-08 13
#> 4: 1970-01-21 1970-01-25 204
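If both versions above are still too slow, another option is to compute running totals once and look up each interval with findInterval(). This is a base R sketch, not part of the answers above. It assumes df_time can be sorted by Time, and the helper names cs, n_from and n_to are made up for illustration.

```r
# Rebuild the question's data (origin given explicitly so this also runs on R < 4.3)
set.seed(2020)
df_time <- data.frame(Time = as.Date(1:100, origin = "1970-01-01"),
                      value = round(runif(100, min = 0, max = 100)))
df <- data.frame(from = as.Date(c(3, 6, 20), origin = "1970-01-01"),
                 to   = as.Date(c(8, 7, 24), origin = "1970-01-01"))

# Sort once and build running totals; cs[k + 1] is the sum of the first k values
df_time <- df_time[order(df_time$Time), ]
cs <- c(0, cumsum(df_time$value))

# findInterval(x, vec) returns how many elements of the sorted vec are <= x
n_to   <- findInterval(as.numeric(df$to),   as.numeric(df_time$Time))
n_from <- findInterval(as.numeric(df$from), as.numeric(df_time$Time))

# Sum of value over from < Time <= to, as a difference of two running totals
df$sum_value <- cs[n_to + 1] - cs[n_from + 1]
df
```

Each interval then costs two binary searches instead of a scan over df_time, which should matter more as the number of intervals grows.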

Related

remove rows with overlaped dates and keep longest time interval in R using dplyr or sqldf

I need to remove rows with overlapped dates and keep the x value which is maximum among the overlapped dates. Here is a data frame:
data.frame(time_left = c("2011-08-05",
"2011-07-25",
"2017-08-20",
"2017-08-20",
"2017-10-09",
"2019-06-01"),
time_right= c("2011-09-14",
"2011-09-01",
"2017-09-12",
"2017-09-26",
"2017-10-15",
"2019-11-05"),
x = c(114,20,10,1,5,100) ) -> df
so my input is:
time_left time_right x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01 20
3 2017-08-20 2017-09-12 10
4 2017-08-20 2017-09-26 1
5 2017-10-09 2017-10-15 5
6 2019-06-01 2019-11-05 100
and my desired output is:
time_left time_right x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01 20
4 2017-08-20 2017-09-26 10
5 2017-10-09 2017-10-15 5
6 2019-06-01 2019-11-05 100
I appreciate any help.
@Maël brought this issue to my attention over on the ivs issue page https://github.com/DavisVaughan/ivs/issues/20.
I think this can be very elegantly and efficiently solved with ivs, but it is a bit hard to come up with the solution, so I'll probably add a helper to do this more easily.
This solution works with "recursive" containers too, i.e. where range A contains range B, but then range C also contains range A, so you really only want to list range C. I've described this in more detail with examples here https://github.com/DavisVaughan/ivs/issues/20#issuecomment-1234479783.
library(ivs)
library(dplyr)
library(vctrs)
df <- tibble(
time_left = as.Date(c(
"2011-08-05", "2011-07-25", "2017-08-20",
"2017-08-20", "2017-10-09", "2019-06-01"
)),
time_right = as.Date(c(
"2011-09-14", "2011-09-01", "2017-09-12",
"2017-09-26", "2017-10-15", "2019-11-05"
)),
x = c(114, 20, 10, 1, 5, 100)
)
df <- df %>%
mutate(range = iv(time_left, time_right), .keep = "unused")
df
#> # A tibble: 6 × 2
#> x range
#> <dbl> <iv<date>>
#> 1 114 [2011-08-05, 2011-09-14)
#> 2 20 [2011-07-25, 2011-09-01)
#> 3 10 [2017-08-20, 2017-09-12)
#> 4 1 [2017-08-20, 2017-09-26)
#> 5 5 [2017-10-09, 2017-10-15)
#> 6 100 [2019-06-01, 2019-11-05)
iv_locate_max_containment <- function(x) {
# Find all locations where the range "contains" any other range
# (including itself)
locs <- iv_locate_overlaps(x, x, type = "contains")
# Find the "top" ranges, i.e. the containers that aren't contained
# by any other containers
top <- !vec_duplicate_detect(locs$haystack)
top <- vec_slice(locs$haystack, top)
top <- vec_in(locs$needles, top)
locs <- vec_slice(locs, top)
locs
}
# i.e. row 4 "contains" rows 3 and 4
locs <- iv_locate_max_containment(df$range)
locs
#> needles haystack
#> 1 1 1
#> 2 2 2
#> 3 4 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
iv_align(df$range, df$x, locations = locs) %>%
rename(range = needles) %>%
group_by(range) %>%
summarise(x = max(haystack))
#> # A tibble: 5 × 2
#> range x
#> <iv<date>> <dbl>
#> 1 [2011-07-25, 2011-09-01) 20
#> 2 [2011-08-05, 2011-09-14) 114
#> 3 [2017-08-20, 2017-09-26) 10
#> 4 [2017-10-09, 2017-10-15) 5
#> 5 [2019-06-01, 2019-11-05) 100
Created on 2022-09-01 with reprex v2.0.2
This may be a little verbose, but it could also be a solution:
First we identify those observations that are potentially overlapped.
Then we group the similar ones.
In each group we choose the minimum time_left and maximum time_right and x.
library(tidyverse)
df %>%
mutate(across(starts_with('time'), ymd),
intv = interval(time_left, time_right),
id = row_number()) %>%
mutate(id2 = map2(intv, id, ~ if (any(.x %within% intv[intv != .x])) {
id[which(.x %within% intv[intv != .x]) + 1]
} else {
.y
})) %>%
group_by(id2) %>%
summarise(time_left = min(time_left),
across(c(time_right, x), max)) %>%
select(!(id2))
# A tibble: 4 × 3
time_left time_right x
<date> <date> <dbl>
1 2011-08-05 2011-09-14 114
2 2017-08-20 2017-09-26 10
3 2017-10-09 2017-10-15 5
4 2019-06-01 2019-11-05 100
I combined Anoushiravan's solution with this
How do I determine in R if a date interval overlaps another date interval for the same individual in a data frame?
and I think it is working now.
df %>%
mutate(id = row_number(), days = as.numeric(as.Date(time_right) - as.Date(time_left)) ) %>%
mutate(Int = interval(time_left, time_right),
within = map(seq_along(Int), function(x){
y = setdiff(seq_along(Int), x)
if(any(id[which((Int[x] %within% Int[y]))+1])){
return(id[days == max(days[which((Int[x] %within% Int[y]))+1])])
}else{ return(0)}
})
) %>%
mutate(within = ifelse(within > 0 , within, id)) %>%
group_by(within) %>%
summarise(time_left = min(time_left), time_right = max(time_right), x = max(x)) %>%
select(!within)
But it still has some bugs. For the following df, this code does not work unless I change the order of the records.
df = data.frame(time_left = c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26"),
time_right = c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31"),
x = c(10,100,200,20))
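For a case like this last data frame, a base R sketch that does not depend on row order is to compare every interval against every other one. It assumes the goal is to keep only intervals that are not fully covered by a strictly larger interval, and to report the maximum x among everything each kept interval covers. The names contains, strict and top are made up for illustration, and the n x n matrices make this suitable for moderately sized data only.

```r
df <- data.frame(
  time_left  = as.Date(c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26")),
  time_right = as.Date(c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31")),
  x = c(10, 100, 200, 20)
)

l <- as.numeric(df$time_left)
r <- as.numeric(df$time_right)

# contains[i, j] is TRUE when interval i fully covers interval j (i == j included)
contains <- outer(l, l, `<=`) & outer(r, r, `>=`)

# strict[i, j]: i covers j, but j does not cover i (so identical intervals do not count)
strict <- contains & !t(contains)

# "top" intervals have no strictly larger container
top <- colSums(strict) == 0

res <- df[top, ]
# For each top interval, take the largest x among all intervals it covers
res$x <- apply(contains[top, , drop = FALSE], 1, function(covered) max(df$x[covered]))
# Identical top intervals would produce identical rows, so keep one of each
res <- res[!duplicated(res[c("time_left", "time_right")]), ]
res
```

On the six-row data from the question, the same steps keep rows 1, 2, 4, 5 and 6 and give row 4 an x of 10, which matches the desired output.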

Applying a function to rows but referencing different table

I have 2 tables
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"), as.Date("2020-1-10"), by = "days")))
df2 = data.frame("observations" = c("a", "b", "c", "d"), "start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")), "end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
I would like to know the number of observation periods that occur on each day of df1, based on the start/stop dates in df2. E.g. on 1/1/2020, observations a and b were in progress, hence "2".
The expected output would be as follows:
I've tried using sums
df1$number = sum(as.Date(df2$start) <= df1$dates & as.Date(df2$end)>=df1$dates)
But that only sums up the entire column values
I've then tried to create a custom function for this:
df1$number = apply(df1, 1, function(x) sum(df2$start <= x & df2$end>=x))
But it returns an NA value.
I then tried to embed an "ifelse" within it, but I get the same issue with NAs
apply(df1, 1, function(x) sum(ifelse(df2$start <= x & df2$end>=x, 1, 0)))
Can anyone suggest what the issue is? Thanks!
edit: an interval join was suggested which is not what I'm trying to get - I think naming the observations with a numeric label was what caused confusion. I am trying to find out the TOTAL number of observations with periods that fall within the day, as compared to doing a 1:1 match.
Regards
Sing
Define the comparison in a function f, pass it through outer(), and take rowSums() of the resulting matrix:
f <- \(x, y) df1[x, 1] >= df2[y, 2] & df1[x, 1] <= df2[y, 3]
cbind(df1, number=rowSums(outer(1:nrow(df1), 1:nrow(df2), f)))
# dates number
# 1 2020-01-01 2
# 2 2020-01-02 2
# 3 2020-01-03 1
# 4 2020-01-04 0
# 5 2020-01-05 1
# 6 2020-01-06 1
# 7 2020-01-07 1
# 8 2020-01-08 1
# 9 2020-01-09 1
# 10 2020-01-10 2
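On the apply() attempts in the question: apply() first converts the data frame to a matrix, so the dates no longer reach the function as Date objects, which may be behind the NA values. A minimal sketch that avoids the conversion is to loop over row indices instead, assuming start and end are already Date columns:

```r
df1 <- data.frame(dates = seq(as.Date("2020-01-01"), as.Date("2020-01-10"), by = "days"))
df2 <- data.frame(
  observations = c("a", "b", "c", "d"),
  start = as.Date(c("2019-12-30", "2020-01-01", "2020-01-05", "2020-01-10")),
  end   = as.Date(c("2020-01-03", "2020-01-02", "2020-01-12", "2020-01-14"))
)

# For each day, count the observation periods that include it
df1$number <- vapply(
  seq_len(nrow(df1)),
  function(i) sum(df2$start <= df1$dates[i] & df2$end >= df1$dates[i]),
  integer(1)
)
df1
```

This gives the same counts as the outer() version, without building the full matrix of comparisons at once.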
Here is a potential solution using dplyr/tidyverse functions and the %within% function from the lubridate package. This approach is similar to "Left Join Subset of Column Based on Date Interval", but with two important differences: it uses summarise() instead of filter() to avoid 'losing' dates where "number" == 0, and it joins by character() because the two datasets have no common columns:
library(dplyr)
library(lubridate)
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"),
as.Date("2020-1-10"),
by = "days")))
df2 = data.frame("observations" = c("1", "2", "3", "4"),
"start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")),
"end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
df1 %>%
full_join(df2, by = character()) %>% # cross join; in dplyr >= 1.1.0, cross_join(df2) is the dedicated function
mutate(number = dates %within% interval(start, end)) %>%
group_by(dates) %>%
summarise(number = sum(number))
#> # A tibble: 10 × 2
#> dates number
#> <date> <dbl>
#> 1 2020-01-01 2
#> 2 2020-01-02 2
#> 3 2020-01-03 1
#> 4 2020-01-04 0
#> 5 2020-01-05 1
#> 6 2020-01-06 1
#> 7 2020-01-07 1
#> 8 2020-01-08 1
#> 9 2020-01-09 1
#> 10 2020-01-10 2
Created on 2022-06-27 by the reprex package (v2.0.1)
Does this approach work with your actual data?

Subsetting data by time frame window for each row in a data frame after using group_by()

I have time-series data with three columns: a value column, a group_var column (used for grouping), and a date column. For each row in the data frame, I'd like to get the mean of that row's group after further subsetting by a specific timeframe. Here's an example of the code for subsetting:
df$value[df$date >= (current_row$date - 545) & df$date <= (current_row$date - 365)]
After I get this subset I can easily apply mean(), but where I'm stuck is how to get this code to work with something like this:
df %>%
group_by(group_var) %>%
mutate(subset_mean = mean(df$value[df$date >= (current_row$date - 545) & df$date <= (current_row$date - 365)])
)
The issue I see is that I don't think I can use 'df' inside the mutate() line after I group the original 'df'. Also, I'm not sure how to create a 'current_row' variable that references the current row when calculating the data subset.
Edit:
Added example data and reproducible code
library(dplyr)
date <- c("2016-02-03", "2016-06-14", "2016-03-15", "2017-04-16","2016-01-27", "2016-01-13", "2017-04-24", "2017-06-15")
date <- date %>% as.Date(format = "%Y-%m-%d")
val <- c(10, 20, 50, 70, 30, 44, 67, 42)
group_var <- c("A", "B", "B", "A", "B", "A", "A", "B")
df <- data.frame(date, val, group_var)
df %>%
group_by(group_var)
I would suggest using slider::slide_index_dbl for this:
library(dplyr)
df %>%
group_by(group_var) %>%
arrange(group_var, date) %>% # slider 0.1.5 requires the window variable to be ascending
mutate(subset_mean = slider::slide_index_dbl(
val, date, mean, .before = 545, .after = -365
# negative ".after" means the window ends before the current date
)) %>%
ungroup()
With the updated data, I get
#date <- c("2016-02-03", "2016-06-14", "2016-03-15", "2017-04-16","2016-01-27", "2016-01-13", "2017-04-24", "2017-06-15")
# A tibble: 8 x 4
date val group_var subset_mean
<date> <dbl> <chr> <dbl>
1 2016-01-13 44 A NaN
2 2016-02-03 10 A NaN
3 2017-04-16 70 A 27
4 2017-04-24 67 A 27
5 2016-01-27 30 B NaN
6 2016-03-15 50 B NaN
7 2016-06-14 20 B NaN
8 2017-06-15 42 B 33.3
1) This can be done with a self join using sql:
library(sqldf)
sqldf("select a.date, a.val, a.group_var, avg(b.val) as mean
from df a
left join df b on a.group_var = b.group_var and
b.date between a.date - 545 and a.date - 365
group by a.rowid")
giving:
date val group_var mean
1 2016-02-03 10 A NA
2 2016-06-14 20 B NA
3 2016-03-15 50 B NA
4 2017-04-16 70 A 27.00000
5 2016-01-27 30 B NA
6 2016-01-13 44 A NA
7 2017-04-24 67 A 27.00000
8 2017-06-15 42 B 33.33333
2) or we can use SQL window functions:
sqldf("select date, val, group_var,
avg(val) over (partition by group_var
order by date
range between 545 preceding and 365 preceding) as mean
from df"
)
giving:
date val group_var mean
1 2016-01-13 44 A NA
2 2016-02-03 10 A NA
3 2017-04-16 70 A 27.00000
4 2017-04-24 67 A 27.00000
5 2016-01-27 30 B NA
6 2016-03-15 50 B NA
7 2016-06-14 20 B NA
8 2017-06-15 42 B 33.33333
Lubridate provides a very elegant solution...
library(tidyverse)
library(lubridate)
df = tibble(
value = runif(100,1,100),
group = rep(1:4,25),
dt = as.Date(round(runif(100,1000,2000)), origin = "1970-01-01")
)
first_year <- interval(ymd("1972-01-01"), ymd("1972-12-31"))
sec_year <- interval(ymd("1973-01-01"), ymd("1973-12-31"))
further <- interval(ymd("1974-01-01"), ymd("1975-12-31"))
df <- df %>%
mutate(
range = case_when(
dt %within% first_year ~"1972",
dt %within% sec_year ~"1973",
TRUE ~"1974-1975"
)
)
mean_by_group_interval <- df %>%
group_by(
group,
range
) %>%
summarise(
mean = mean(value)
)
Here is a solution that utilizes the dplyr package.
library(dplyr)
date <- c("2016-02-03", "2016-06-14", "2016-03-15", "2017-04-16","2016-01-27", "2016-01-13", "2017-04-24", "2017-06-15")
date <- date %>% as.Date(format = "%Y-%m-%d")
val <- c(10, 20, 50, 70, 30, 44, 67, 42)
group_var <- c("A", "B", "B", "A", "B", "A", "A", "B")
df <- data.frame(date, val, group_var)
df %>%
group_by(group_var) %>%
arrange(group_var, date) %>%
mutate(
# Determine if the current date - the first date of each group is between 365 and 545 days.
match = between(date - first(date), 365, 545),
# Count the number of dates that are not within the range described above to be used in calculating the mean.
count_false = sum(match == FALSE),
# Calculate the cumulative sum for rows in each group that are not within the range described above.
sum_match_false = ifelse(match == FALSE, cumsum(val), NA),
# Calculate the mean.
mean_match_true = ifelse(match == TRUE, max(sum_match_false, na.rm = TRUE) / count_false, NA)
) %>%
# Return only these variables.
select(date, val, group_var, mean_match_true)
#> date val group_var mean_match_true
#> <date> <dbl> <chr> <dbl>
#> 1 2016-01-13 44 A NA
#> 2 2016-02-03 10 A NA
#> 3 2017-04-16 70 A 27
#> 4 2017-04-24 67 A 27
#> 5 2016-01-27 30 B NA
#> 6 2016-03-15 50 B NA
#> 7 2016-06-14 20 B NA
#> 8 2017-06-15 42 B 33.3
Created on 2021-03-12 by the reprex package (v0.3.0)
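For comparison, the "current row" idea from the question can be written directly in base R by looping over row indices. This is only a sketch: it uses the 545 and 365 day bounds from the question, scans the whole data frame for every row (so it is slow on large data), and the name in_window is made up for illustration.

```r
date <- as.Date(c("2016-02-03", "2016-06-14", "2016-03-15", "2017-04-16",
                  "2016-01-27", "2016-01-13", "2017-04-24", "2017-06-15"))
val <- c(10, 20, 50, 70, 30, 44, 67, 42)
group_var <- c("A", "B", "B", "A", "B", "A", "A", "B")
df <- data.frame(date, val, group_var)

df$subset_mean <- vapply(seq_len(nrow(df)), function(i) {
  # Rows of the same group that fall 365 to 545 days before the current row's date
  in_window <- df$group_var == df$group_var[i] &
    df$date >= df$date[i] - 545 &
    df$date <= df$date[i] - 365
  # Return NA (rather than NaN) when the window is empty
  if (any(in_window)) mean(df$val[in_window]) else NA_real_
}, numeric(1))
df
```

The rows stay in their original order, so the result lines up with the output of the sqldf self join above.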

Count calendar days within a date interval using lubridate

I have data set of hospital admission and discharge days from which I want to generate an occupied beds count for each calendar day of a period of three years. I am using the tidyverse and lubridate packages.
My approach so far has been to convert the admit/discharge columns into an interval (the data are sensitive so I can't share actual dates):
d <- d %>%
mutate(duration = admit %--% discharge)
and then to create a tibble where each row corresponds to the time range, plus a column of zeroes that can be added to in a for loop:
t <-
tibble(
days = as.Date(date("2017-01-01"):date("2019-12-31")),
count = 0
)
Unfortunately, I can't figure out how to create a for loop that would sum days that fall within each interval. Here is my attempt thus far, which gives me uniform values of 24 throughout:
for(i in t$days) {
if (i %within% d$duration)
t$count = t$count + 1
}
Sample data.
library(dplyr)
set.seed(42)
d <- tibble(admit = Sys.Date() - sample(300, size = 1000, replace = TRUE)) %>%
mutate(discharge = admit + sample(0:30, size = 1000, replace = TRUE))
d
# # A tibble: 1,000 x 2
# admit discharge
# <date> <date>
# 1 2019-06-18 2019-07-14
# 2 2019-06-11 2019-06-12
# 3 2019-12-24 2020-01-18
# 4 2019-07-13 2019-07-29
# 5 2019-09-08 2019-09-23
# 6 2019-10-15 2019-10-15
# 7 2019-08-11 2019-08-28
# 8 2020-02-07 2020-02-29
# 9 2019-09-03 2019-09-10
# 10 2019-08-20 2019-09-14
# # ... with 990 more rows
We can produce a list of date ranges/sequences with Map (or purrr::pmap):
Map(seq.Date, d$admit, d$discharge, list(by = "days"))[1:2]
# [[1]]
# [1] "2019-06-18" "2019-06-19" "2019-06-20" "2019-06-21" "2019-06-22" "2019-06-23" "2019-06-24"
# [8] "2019-06-25" "2019-06-26" "2019-06-27" "2019-06-28" "2019-06-29" "2019-06-30" "2019-07-01"
# [15] "2019-07-02" "2019-07-03" "2019-07-04" "2019-07-05" "2019-07-06" "2019-07-07" "2019-07-08"
# [22] "2019-07-09" "2019-07-10" "2019-07-11" "2019-07-12" "2019-07-13" "2019-07-14"
# [[2]]
# [1] "2019-06-11" "2019-06-12"
and then combine these, tabulate them (with table), and enframe them:
Map(seq.Date, d$admit, d$discharge, list(by = "days")) %>%
do.call(c, .) %>%
table() %>%
tibble::enframe(name = "date", value = "count") %>%
# because `table` preserves a *character* representation of the Date
mutate(date = as.Date(date)) %>%
arrange(date)
# # A tibble: 328 x 2
# date count
# <date> <table>
# 1 2019-05-24 1
# 2 2019-05-25 3
# 3 2019-05-26 7
# 4 2019-05-27 8
# 5 2019-05-28 9
# 6 2019-05-29 14
# 7 2019-05-30 20
# 8 2019-05-31 20
# 9 2019-06-01 20
# 10 2019-06-02 21
# # ... with 318 more rows
Here is another method using tidyverse functions.
library(tidyverse)
d %>%
mutate(days = map2(admit, discharge, seq, by = "day")) %>%
unnest(days) %>%
count(days) %>%
right_join(t, by = "days") %>%
mutate(n = coalesce(n, as.integer(count))) %>%
select(-count)
We create a sequence of dates between admit and discharge, count every unique date, and join the result with t so that all the dates in t remain intact.
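Both answers expand every stay into one row per day. If that expansion becomes too large, a base R sketch that avoids it is to record +1 on each admission day and -1 on the day after each discharge, then take a cumulative sum. The sample data here uses a fixed start date instead of Sys.Date() so that it is reproducible, and the names offset_admit, offset_after and change are made up for illustration.

```r
set.seed(42)
admit <- as.Date("2019-05-24") + sample(0:299, size = 1000, replace = TRUE)
discharge <- admit + sample(0:30, size = 1000, replace = TRUE)

# Every calendar day covered by at least one stay
days <- seq(min(admit), max(discharge), by = "day")

# Day offsets from the first calendar day (0-based)
offset_admit <- as.integer(as.numeric(admit) - as.numeric(days[1]))
offset_after <- as.integer(as.numeric(discharge) - as.numeric(days[1])) + 1L

# +1 where a stay begins, -1 on the day after it ends
change <- tabulate(offset_admit + 1L, nbins = length(days) + 1L) -
  tabulate(offset_after + 1L, nbins = length(days) + 1L)

# The running total of changes is the number of occupied beds on each day
occupancy <- data.frame(days = days, count = cumsum(change)[seq_along(days)])
head(occupancy)
```

As in the answers above, the discharge day itself counts as occupied.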

sum all visits by customer in the next week

I need a count of future visits by a specific customer in the next 7 days. I solved this with purrr::map2 but I'm experiencing very slow performance. I think I must be missing something basic about how to use purrr. How do I speed this up? Thanks
This toy example takes 2.3 secs with 100 rows, but 3.3 minutes with 1000 rows on my machine. My actual data has 400K rows!
library(tidyverse)
set.seed(123)
rows <- 1000
df= data.frame(cust_num = sample(c("123","124","128"),rows,replace=T),
date = sample(seq(as.Date('2017/01/01'), as.Date('2017/01/31'), by="day"), rows, replace=T))
df <- df %>%
rowwise() %>%
mutate( visits.next.7.days = map2_lgl(df$cust_num,df$date,~.x==cust_num&.y>date&.y<(date+7)) %>% sum() )
Here's an option that uses purrr::reduce to sum the list of vectors returned by data.table::shift (a vectorized version of lead/lag). pmap_int with sum would do the same as reduce with + if you like, but it's a little slower. You could similarly do map(1:7, ~lead(n, .x, default = 0L)) instead of data.table::shift, but it's more code and slower.
library(tidyverse)
set.seed(123)
rows <- 1000
df = data.frame(cust_num = sample(c("123","124","128"), rows, replace = TRUE),
date = sample(seq(as.Date('2017/01/01'),
as.Date('2017/01/31'),
by = "day"),
rows, replace = TRUE))
df2 <- df %>%
count(cust_num, date) %>%
group_by(cust_num) %>%
# add dates with no occurrences; none in sample data, but quite possible in real
complete(date = seq(min(date), max(date), by = 'day'), fill = list(n = 0L)) %>%
mutate(visits_next_7 = reduce(data.table::shift(n, 1:7, type = 'lead', fill = 0L), `+`)) %>%
right_join(df)
df2
#> # A tibble: 1,000 x 4
#> # Groups: cust_num [?]
#> cust_num date n visits_next_7
#> <fctr> <date> <int> <int>
#> 1 123 2017-01-09 10 78
#> 2 128 2017-01-19 12 70
#> 3 124 2017-01-05 15 73
#> 4 128 2017-01-27 14 37
#> 5 128 2017-01-27 14 37
#> 6 123 2017-01-15 19 74
#> 7 124 2017-01-24 12 59
#> 8 128 2017-01-10 10 78
#> 9 124 2017-01-03 19 77
#> 10 124 2017-01-14 8 84
#> # ... with 990 more rows
This may not be the most efficient algorithm, as depending on the spacing of your data, complete could potentially expand your data dramatically.
Further, with data this size, you may find data.table is more practical unless you want to put your data in a database and access it with dplyr.
A solution using the zoo package. The idea is to group the data by cust_num and date and count the rows first, then use the lead function to shift the counts by 1 and use rollapply to calculate the sum of the next six days (not including the starting date). Finally, use left_join to merge the results back to the original data frame. This should be much faster than your original approach. df3 is the final output.
library(dplyr)
library(zoo)
df2 <- df %>%
count(cust_num, date) %>%
ungroup() %>%
mutate(n2 = lead(n)) %>%
mutate(visits.next.7.days = rollapply(n2, width = 6, FUN = sum, na.rm = TRUE,
align = "left", partial = TRUE)) %>%
select(cust_num, date, visits.next.7.days)
df3 <- df %>% left_join(df2, by = c("cust_num", "date"))
head(df3)
# cust_num date visits.next.7.days
# 1 123 2017-01-09 70
# 2 128 2017-01-19 54
# 3 124 2017-01-05 58
# 4 128 2017-01-27 37
# 5 128 2017-01-27 37
# 6 123 2017-01-15 68
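The two answers above count slightly different windows, so their numbers differ. As a cross-check, here is a base R sketch that follows the condition in the question's own code exactly (same customer, visit date strictly after the current date and strictly before date + 7). It sorts each customer's dates once and uses findInterval(); the helper count_next is made up for illustration.

```r
set.seed(123)
rows <- 1000
df <- data.frame(
  cust_num = sample(c("123", "124", "128"), rows, replace = TRUE),
  date = sample(seq(as.Date("2017-01-01"), as.Date("2017-01-31"), by = "day"),
                rows, replace = TRUE)
)

# For one customer's dates (as day numbers): visits in (date, date + 7),
# which for whole days is the same as (date, date + 6]
count_next <- function(d) {
  s <- sort(d)
  findInterval(d + 6, s) - findInterval(d, s)
}

# ave() applies count_next within each customer and keeps the original row order
df$visits_next_7_days <- ave(as.numeric(df$date), df$cust_num, FUN = count_next)
head(df)
```

Because each customer's dates are sorted only once, this should scale far better than comparing every row against every other row.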
