I have a data frame of observations with a start and end date for each observation indicating the period it was active.
The active duration varies by observation and can span multiple weeks.
Some observations are still active and do not have an end date.
For a given date range, how can I count the number of observations that were active during a week within that date range, including those still active?
I have a crude method that works, but is pretty slow. It seems like there has to be a more efficient and simpler way to do this.
EDIT: My first approach was similar to Ronak's solution, which is definitely better than mine for smaller data sets, but my real data set has more observations and longer date ranges, so I run into memory constraints.
# I'm primarily using tidyverse/lubridate, but definitely open to other solutions.
library(tidyverse)
library(lubridate)
# sample data frame of observations with start and end dates:
df_obs <- tibble(
observation = c(1:10),
date_start = as_date(c("2020-03-17", "2020-01-20", "2020-02-06", "2020-01-04", "2020-01-06", "2020-01-24", "2020-01-09", "2020-02-11", "2020-03-13", "2020-02-07")),
date_end = as_date(c("2020-03-27", "2020-03-20", NA, "2020-03-04", "2020-01-16", "2020-02-24", NA, "2020-02-19", NA, "2020-02-27"))
)
# to account for observations that are still active, NAs are converted to today's date:
df_obs <- mutate(df_obs, date_end = if_else(is.na(date_end), Sys.Date(), date_end))
# create a data frame of weeks by start and end date to count the active observations in a given week
# for this example I'm just using date ranges from the sample data:
df_weeks <-
seq(min(df_obs$date_start), max(df_obs$date_start), by = 'day') %>%
enframe(NULL, 'week_start') %>%
mutate(week_start = as_date(cut(week_start, "week"))) %>%
mutate(week_end = week_start + 6) %>%
distinct()
# create a function that filters the observations data frame based on start and end dates:
check_active <- function(d, s, e){
d %>%
filter(date_start <= e) %>%
filter(date_end >= s) %>%
nrow()
}
# applying that function to each week in the date range data frame gives the expected results:
df_weeks %>%
rowwise() %>%
mutate(total_active = check_active(df_obs, week_start, week_end)) %>%
select(-week_end) %>%
ungroup()
# A tibble: 12 x 2
week_start total_active
<date> <int>
1 2019-12-30 1
2 2020-01-06 3
3 2020-01-13 3
4 2020-01-20 4
5 2020-01-27 4
6 2020-02-03 6
7 2020-02-10 7
8 2020-02-17 7
9 2020-02-24 6
10 2020-03-02 4
11 2020-03-09 4
12 2020-03-16 5
Here is one way:
library(tidyverse)
df_obs %>%
#Replace NA with today's date
#Create sequence between start and end date
mutate(date_end = replace(date_end, is.na(date_end), Sys.Date()),
date = map2(date_start, date_end, seq, "day")) %>%
#Get data in long format
unnest(date) %>%
#Drop start and end date columns
select(-date_start, -date_end) %>%
#Cut data by week
mutate(date = cut(date, "week")) %>%
#Get unique rows for observation and date
distinct(observation, date) %>%
#Count number of observation in each week
count(date)
which returns:
# A tibble: 14 x 2
#   date       n
# <fct> <int>
# 1 2019-12-30 1
# 2 2020-01-06 3
# 3 2020-01-13 3
# 4 2020-01-20 4
# 5 2020-01-27 4
# 6 2020-02-03 6
# 7 2020-02-10 7
# 8 2020-02-17 7
# 9 2020-02-24 6
#10 2020-03-02 4
#11 2020-03-09 4
#12 2020-03-16 5
#13 2020-03-23 4
#14 2020-03-30 3
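Since the expand-one-row-per-day approach runs into the memory constraints mentioned in the edit, a non-equi join is one way to get the same weekly counts without ever materializing the per-day rows. A rough sketch with data.table (not part of the original question), using df_obs (with NA end dates already replaced) and df_weeks as built above:
library(data.table)
dt_obs   <- as.data.table(df_obs)
dt_weeks <- as.data.table(df_weeks)
# for each week, count observations whose active period overlaps [week_start, week_end]
dt_weeks[, total_active := dt_obs[dt_weeks,
                                  on = .(date_start <= week_end, date_end >= week_start),
                                  .N,
                                  by = .EACHI]$N]
dt_weeks[, .(week_start, total_active)]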
I am looking to calculate a 3-month rolling sum of the values in one column of a data frame, based on the dates in another column, per product.
The newResults data frame has the columns: Product, Date, Value
In this example, I wish to calculate the rolling sum of Value per Product over 3 months. I have sorted the data frame by Product and Date.
Dataset example: (provided as an image in the original post)
My Code:
newResults = newResults %>%
group_by(Product) %>%
mutate(Roll_12Mth =
rollapplyr(Value, width = 1:n() - findInterval( Date %m-% months(3), date), sum)) %>%
ungroup
Error: Problem with mutate() input Roll_12Mth.
x could not find function "%m-%"
i Input Roll_12Mth is rollapplyr(...).
Output: (shown as an image in the original post)
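As an aside, the error in the posted code comes from %m-%, which is provided by lubridate and so requires library(lubridate); the lowercase date inside findInterval() should also be Date. A minimal sketch of that repaired width-vector approach, assuming newResults has the Product, Date, Value columns described above (the Roll_3Mth name is just illustrative):
library(dplyr)
library(lubridate)  # provides %m-%
library(zoo)        # provides rollapplyr

newResults %>%
  group_by(Product) %>%
  arrange(Date, .by_group = TRUE) %>%
  # for each row, the window width is the number of rows whose Date falls
  # within the preceding 3 months (up to and including the current row)
  mutate(Roll_3Mth = rollapplyr(Value,
                                width = 1:n() - findInterval(Date %m-% months(3), Date),
                                FUN = sum)) %>%
  ungroup()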
If the dates are always spaced 1 month apart, it is easy.
dat=data.frame(Date=seq(as.Date("2/1/2017", "%m/%d/%Y"), as.Date("1/1/2018", "%m/%d/%Y"), by="month"),
Product=rep(c("A", "B"), each=6),
Value=c(4182, 4822, 4805, 6235, 3665, 3326, 3486, 3379, 3596, 3954, 3745, 3956))
library(zoo)
library(dplyr)
dat %>%
group_by(Product) %>%
arrange(Date, .by_group=TRUE) %>%
mutate(Value=rollapplyr(Value, 3, sum, partial=TRUE))
Date Product Value
<date> <fct> <dbl>
1 2017-02-01 A 4182
2 2017-03-01 A 9004
3 2017-04-01 A 13809
4 2017-05-01 A 15862
5 2017-06-01 A 14705
6 2017-07-01 A 13226
7 2017-08-01 B 3486
8 2017-09-01 B 6865
9 2017-10-01 B 10461
10 2017-11-01 B 10929
11 2017-12-01 B 11295
12 2018-01-01 B 11655
I am new to R and I would like to ask how to transform the data set below into the two outcome tables. The first table should have one row per unique name, list each person's trip 1, 2, 3, 4, 5 and so on, and show the total trips in the last column with a grand total in the last row.
The second table should show the lag days between trips, with each person's average lag as the last column. Lag is the number of days between consecutive trips.
Dataset
name <- c('Mary', 'Sue', 'Peter', 'Mary', 'Mary', 'John', 'Sue', 'Peter',
'Peter', 'John', 'John', 'John', 'Mary', 'Mary')
date <- c('01/04/2018', '03/02/2017', '01/01/2019', '24/04/2017',
'02/03/2019', '31/05/2019', '08/09/2019', '17/12/2019',
'02/08/2017', '10/11/2017', '30/12/2017', '18/02/2018',
'18/02/2018', '18/10/2019')
data <- data.frame(name, date)
The desired results:
Result 1
Name         Trip 1       Trip 2       Total trips
Mary         dd/mm/yyyy   dd/mm/yyyy   2
John         dd/mm/yyyy   N/A          1
Total trips  2            1            3
Result 2
Name Lag1 Lag2 Avg.Lag
Mary 3 4 3.5
John 5 1 3
Result 1 can be achieved by arranging the data by date (first converting it to Date format) and doing a group_by() per person to calculate the rank and count of the trips. These can then be pivoted into columns using pivot_wider() from the tidyr package (the paste0() lines are there to ensure readable column names).
For result 2 the difference in days needs to be calculated between trips using difftime(), which will give an NA for the first trip. The rest of the procedure is similar to result 1, but some columns have to be removed before the pivot.
library(dplyr)
library(tidyr)
name <- c('Mary','Sue','Peter','Mary','Mary','John','Sue','Peter','Peter','John',
'John','John','Mary','Mary')
date <- c('01/04/2018','03/02/2017','01/01/2019','24/04/2017',
'02/03/2019','31/05/2019','08/09/2019','17/12/2019',
'02/08/2017','10/11/2017','30/12/2017','18/02/2018',
'18/02/2018','18/10/2019')
data <- data.frame(name,date, stringsAsFactors = F)
data <- data %>%
mutate(date = as.Date(date, format = '%d/%m/%Y')) %>%
arrange(name, date) %>%
group_by(name) %>%
mutate(trip_nr = rank(date),
total_trips = n()) %>%
ungroup()
result1 <- data %>%
mutate(trip_nr = paste0('Trip_', trip_nr)) %>%
pivot_wider(names_from = trip_nr, values_from = date)
result2 <- data %>%
group_by(name) %>%
mutate(lag = difftime(date, lag(date), units = 'days'),
lag_avg = mean(lag, na.rm = T)) %>%
ungroup() %>%
filter(!is.na(lag)) %>%
mutate(lag_nr = paste0('Lag_', trip_nr-1)) %>%
select(-date,-trip_nr,-total_trips) %>%
pivot_wider(names_from = lag_nr, values_from = lag)
This gives the output for result1:
# A tibble: 4 x 7
name total_trips Trip_1 Trip_2 Trip_3 Trip_4 Trip_5
<chr> <int> <date> <date> <date> <date> <date>
1 John 4 2017-11-10 2017-12-30 2018-02-18 2019-05-31 NA
2 Mary 5 2017-04-24 2018-02-18 2018-04-01 2019-03-02 2019-10-18
3 Peter 3 2017-08-02 2019-01-01 2019-12-17 NA NA
4 Sue 2 2017-02-03 2019-09-08 NA NA NA
and result2:
# A tibble: 4 x 6
# Groups: name [4]
name lag_avg Lag_1 Lag_2 Lag_3 Lag_4
<chr> <drtn> <drtn> <drtn> <drtn> <drtn>
1 John 189.00 days 50 days 50 days 467 days NA days
2 Mary 226.75 days 300 days 42 days 335 days 230 days
3 Peter 433.50 days 517 days 350 days NA days NA days
4 Sue 947.00 days 947 days NA days NA days NA days
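If plain numbers are preferred over the <drtn> (difftime) columns shown above, they can be converted to numeric days afterwards; a small follow-up sketch using dplyr's across():
result2 %>%
  mutate(across(where(function(x) inherits(x, "difftime")),
                function(x) as.numeric(x, units = "days")))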
# data.table alternative
library(data.table)
data$date <- as.character(data$date)
data <- data[order(as.Date(data$date,"%d/%m/%Y")),]
data <- data.table(data)
data[,date := as.Date(date,"%d/%m/%Y")]
#trips
data[,Trips:=seq(.N),by="name"]
#time diff in "days" between trips
data[,Lag:=shift(date,1),by="name"]
data[,diff:=difftime(Lag,date,units="days"),by="name"]
data[,diff:=abs(as.numeric(diff))]
#creating second summary table
data_summary_second_table <- data[,.(Avg_lag=mean(diff,na.rm = TRUE)),by="name"]
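To get from these columns to the wide layouts of Result 1 and Result 2, data.table's dcast() can be used; a sketch building on the columns created above:
# wide table of trip dates per name (Result 1 layout, without the totals row/column)
data[, trip_label := paste0("Trip_", Trips)]
result1_dt <- dcast(data, name ~ trip_label, value.var = "date")

# wide table of lags per name (Result 2 layout, without the average column)
data[, lag_label := paste0("Lag_", Trips - 1)]
result2_dt <- dcast(data[!is.na(diff)], name ~ lag_label, value.var = "diff")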
I have a data set of hospital admission and discharge dates from which I want to generate an occupied-beds count for each calendar day over a period of three years. I am using the tidyverse and lubridate packages.
My approach so far has been to convert the admit/discharge columns into an interval (the data are sensitive so I can't share actual dates):
d <- d %>%
mutate(duration = admit %--% discharge)
and then to create a tibble with one row per day in the time range, plus a column of zeroes that can be incremented in a for loop:
t <-
tibble(
days = as.Date(date("2017-01-01"):date("2019-12-31")),
count = 0
)
Unfortunately, I can't figure out how to create a for loop that would sum days that fall within each interval. Here is my attempt thus far, which gives me uniform values of 24 throughout:
for(i in timeline$days) {
if (i %within% d$duration)
timeline$count = timeline$count + 1
}
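For what it's worth, the loop idea can be expressed as counting, for each calendar day, how many intervals contain it. A sketch (assuming t, called timeline in the loop above, holds the days column and d$duration is the admit %--% discharge interval):
library(lubridate)
# for each day, count the stays whose interval contains that day
t$count <- vapply(seq_along(t$days),
                  function(i) sum(t$days[i] %within% d$duration),
                  integer(1))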
Sample data.
library(dplyr)
set.seed(42)
d <- tibble(admit = Sys.Date() - sample(300, size = 1000, replace = TRUE)) %>%
mutate(discharge = admit + sample(0:30, size = 1000, replace = TRUE))
d
# # A tibble: 1,000 x 2
# admit discharge
# <date> <date>
# 1 2019-06-18 2019-07-14
# 2 2019-06-11 2019-06-12
# 3 2019-12-24 2020-01-18
# 4 2019-07-13 2019-07-29
# 5 2019-09-08 2019-09-23
# 6 2019-10-15 2019-10-15
# 7 2019-08-11 2019-08-28
# 8 2020-02-07 2020-02-29
# 9 2019-09-03 2019-09-10
# 10 2019-08-20 2019-09-14
# # ... with 990 more rows
We can produce a list of date ranges/sequences with Map (or purrr::pmap):
Map(seq.Date, d$admit, d$discharge, list(by = "days"))[1:2]
# [[1]]
# [1] "2019-06-18" "2019-06-19" "2019-06-20" "2019-06-21" "2019-06-22" "2019-06-23" "2019-06-24"
# [8] "2019-06-25" "2019-06-26" "2019-06-27" "2019-06-28" "2019-06-29" "2019-06-30" "2019-07-01"
# [15] "2019-07-02" "2019-07-03" "2019-07-04" "2019-07-05" "2019-07-06" "2019-07-07" "2019-07-08"
# [22] "2019-07-09" "2019-07-10" "2019-07-11" "2019-07-12" "2019-07-13" "2019-07-14"
# [[2]]
# [1] "2019-06-11" "2019-06-12"
and then combine these, tabulate them (with table), and enframe them:
Map(seq.Date, d$admit, d$discharge, list(by = "days")) %>%
do.call(c, .) %>%
table() %>%
tibble::enframe(name = "date", value = "count") %>%
# because `table` preserves a *character* representation of the Date
mutate(date = as.Date(date)) %>%
arrange(date)
# # A tibble: 328 x 2
# date count
# <date> <table>
# 1 2019-05-24 1
# 2 2019-05-25 3
# 3 2019-05-26 7
# 4 2019-05-27 8
# 5 2019-05-28 9
# 6 2019-05-29 14
# 7 2019-05-30 20
# 8 2019-05-31 20
# 9 2019-06-01 20
# 10 2019-06-02 21
# # ... with 318 more rows
Here is another method using tidyverse functions.
library(tidyverse)
d %>%
mutate(days = map2(admit, discharge, seq, by = "day")) %>%
unnest(days) %>%
count(days) %>%
right_join(t, by = "days") %>%
mutate(n = coalesce(n, as.integer(count))) %>%
select(-count)
We create a sequence of dates between admit and discharge, count each unique date, and join the result with t so that all the dates in t remain intact.
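If the full three-year calendar is wanted without building t separately, the same pipeline can be padded with tidyr::complete(); a sketch using the 2017-01-01 to 2019-12-31 window from the question:
library(tidyverse)

d %>%
  mutate(days = map2(admit, discharge, seq, by = "day")) %>%
  unnest(days) %>%
  count(days) %>%
  # pad missing calendar days with a zero count (days outside the window are
  # kept and can be filtered out afterwards if not wanted)
  complete(days = seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "day"),
           fill = list(n = 0L))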
I'm building a dataset and would like to add a week count that runs from the first date to the last. I'm using it to summarize a much larger dataset, which I'd eventually like to summarize by week.
Using this sample:
library(dplyr)
df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
Week = nrow/7)
# A tibble: 93 x 2
Date Week
<date> <dbl>
1 1944-06-01 0.143
2 1944-06-02 0.286
3 1944-06-03 0.429
4 1944-06-04 0.571
5 1944-06-05 0.714
6 1944-06-06 0.857
7 1944-06-07 1
8 1944-06-08 1.14
9 1944-06-09 1.29
10 1944-06-10 1.43
# … with 83 more rows
Which definitely isn't right. Also, my real dataset isn't sequential: many days are missing between weeks, so a straight sequential count won't work.
An ideal end result is an additional "week" column based upon the actual dates (rather than hard-coded with a seq_along() type of result).
Similar solution to Ronak's but with lubridate:
library(lubridate)
(df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
week = interval(min(Date), Date) %>%
as.duration() %>%
as.numeric("weeks") %>%
floor() + 1))
You could subtract the first Date from all the Date values, calculate the difference with difftime using units = "weeks", floor the values, and add 1 so the counter starts from 1.
df$week <- floor(as.numeric(difftime(df$Date, df$Date[1], units = "weeks"))) + 1
df
# A tibble: 93 x 2
# Date week
# <date> <dbl>
# 1 1944-06-01 1
# 2 1944-06-02 1
# 3 1944-06-03 1
# 4 1944-06-04 1
# 5 1944-06-05 1
# 6 1944-06-06 1
# 7 1944-06-07 1
# 8 1944-06-08 2
# 9 1944-06-09 2
#10 1944-06-10 2
# … with 83 more rows
To use this in your dplyr pipe you could do
library(dplyr)
df %>%
mutate(week = floor(as.numeric(difftime(Date, first(Date), units = "weeks"))) + 1)
data
df <- tibble::tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"))
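For completeness, an equivalent base R one-liner (assuming Date is already of class Date):
df$week <- as.integer(df$Date - min(df$Date)) %/% 7 + 1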
Given a table
id start end
1 22/03/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 25/12/2017
I'm trying to split each row by calendar month, to get the following table:
id start end
1 22/03/2016 31/03/2016
1 01/04/2016 30/04/2016
1 01/05/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 30/09/2017
3 01/10/2017 31/10/2017
3 01/11/2017 30/11/2017
3 01/12/2017 25/12/2017
I'm trying to adapt the code from how to split rows of a dataframe in multiple rows based on start date and end date?, but I have not been able to modify it correctly. The problem generally shows up in months with 30 days; it is probably easy, but I am not yet familiar with regular expressions.
#sample data
df <- data.frame("starting_date" = as.Date(c("2016-03-22", "2016-08-17", "2017-09-12")),
"end_date" = as.Date(c("2016-06-05", "2016-08-29", "2017-12-25")),
col3=c('1','2', '3'))
df1 <- df[,1:2] %>%
rowwise() %>%
do(rbind(data.frame(matrix(as.character(c(
.$starting_date,
seq(.$starting_date, .$end_date, by=1)[grep("\\d{4}-\\d{2}-31|\\d{4}-\\d{2}-01", seq(.$starting_date, .$end_date, by=1))],
.$end_date)), ncol=2, byrow=T))
)
) %>%
data.frame() %>%
`colnames<-`(c("starting_date", "end_date")) %>%
mutate(starting_date= as.Date(starting_date, format= "%Y-%m-%d"),
end_date= as.Date(end_date, format= "%Y-%m-%d"))
#add temporary columns to the original and expanded date column dataframes
df$row_idx <- seq(1:nrow(df))
df$temp_col <- (year(df$end_date) - year(df$starting_date)) +1
df1 <- cbind(df1,row_idx = rep(df$row_idx,df$temp_col))
#join both dataframes to get the final result
final_df <- left_join(df1,df[,3:(ncol(df)-1)],by="row_idx") %>%
select(-row_idx)
final_df
If anyone knows how to modify the code or a better way to do it I will be very grateful.
We assume there is an error in the sample output in the question since the third row spans parts of two months and so should be split into two rows.
Define Seq which, given start and end Date values, produces a data.frame with start and end columns, and then run it for each id using group_by:
library(dplyr)
library(zoo)
Seq <- function(start, end) {
ym <- seq(as.yearmon(start), as.yearmon(end), 1/12)
starts <- pmax(start, as.Date(ym, frac = 0))
ends <- pmin(end, as.Date(ym, frac = 1))
unique(data.frame(start = starts, end = ends))
}
fmt <- "%d/%m/%Y"
DF %>%
mutate(start = as.Date(start, fmt), end = as.Date(end, fmt)) %>%
group_by(id) %>%
do(Seq(.$start, .$end)) %>%
ungroup
giving:
# A tibble: 9 x 3
id start end
<int> <date> <date>
1 1 2016-03-22 2016-03-31
2 1 2016-04-01 2016-04-30
3 1 2016-05-01 2016-05-31
4 1 2016-06-01 2016-06-05
5 2 2016-08-17 2016-08-29
6 3 2017-09-22 2017-09-30
7 3 2017-10-01 2017-10-31
8 3 2017-11-01 2017-11-30
9 3 2017-12-01 2017-12-25
Note
The input DF in reproducible form:
Lines <- "
id start end
1 22/03/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 25/12/2017"
DF <- read.table(text = Lines, header = TRUE)
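To see what Seq() does on its own, it can be called directly on id 1's dates; the result matches the first four rows of the output above:
Seq(as.Date("2016-03-22"), as.Date("2016-06-05"))
#        start        end
# 1 2016-03-22 2016-03-31
# 2 2016-04-01 2016-04-30
# 3 2016-05-01 2016-05-31
# 4 2016-06-01 2016-06-05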
So there's probably a more elegant way to accomplish this, and I feel like I've seen similar questions, but I could not find a duplicate quickly, so here goes...
SETUP
library(tidyverse)
library(lubridate)
df <- data.frame(
id = c('1', '2', '3'),
starting_date = as.Date(c("2016-03-22", "2016-08-17", "2017-09-12")),
end_date = as.Date(c("2016-06-05", "2016-08-29", "2017-12-25")),
stringsAsFactors = FALSE
)
df
#> id starting_date end_date
#> 1 1 2016-03-22 2016-06-05
#> 2 2 2016-08-17 2016-08-29
#> 3 3 2017-09-12 2017-12-25
SOLUTION
df %>%
group_by(id) %>%
mutate(
date_seq = list(seq.Date(starting_date, end_date, by = "month") %>% ceiling_date("month") - 1)
) %>%
unnest() %>%
mutate(row = row_number()) %>%
mutate(
new_end_date = if_else(row == max(row), end_date, date_seq),
new_start_date = if_else(row == min(row), starting_date, floor_date(new_end_date, "month"))
) %>%
select(
id, new_start_date, new_end_date
)
#> # A tibble: 8 x 3
#> # Groups: id [3]
#> id new_start_date new_end_date
#> <chr> <date> <date>
#> 1 1 2016-03-22 2016-03-31
#> 2 1 2016-04-01 2016-04-30
#> 3 1 2016-06-01 2016-06-05
#> 4 2 2016-08-17 2016-08-29
#> 5 3 2017-09-12 2017-09-30
#> 6 3 2017-10-01 2017-10-31
#> 7 3 2017-11-01 2017-11-30
#> 8 3 2017-12-01 2017-12-25
EXPLANATION
Much of what's going on here takes place in the first mutate() call, which creates date_seq. To understand it, consider the following:
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month")
# [1] "2016-03-22" "2016-04-22" "2016-05-22"
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month") %>%
ceiling_date("month")
# [1] "2016-04-01" "2016-05-01" "2016-06-01"
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month") %>%
ceiling_date("month") - 1
# [1] "2016-03-31" "2016-04-30" "2016-05-31"
So basically, we create a sequence of "end-of-month" dates between the original start and end dates. Putting this in a list-column allows us to organize by id so that we unnest appropriately. Check out the output after the unnest():
df %>%
group_by(id) %>%
mutate(
date_seq = list(seq.Date(starting_date, end_date, by = "month") %>% ceiling_date("month") - 1)
) %>%
unnest()
From there I hope things are relatively straightforward. The row_number() call probably could have been replaced with something fancier like first()/last(), but I thought this might be easier to follow.