Identify the number of active days in an isoweek given a date range - r

My data are as follows:
df <- read_table("begin.date end.date
2019-07-22 2019-07-29
2019-07-29 2019-08-03
2019-08-25 2019-08-30
2019-08-30 2019-09-24
2019-09-30 2019-10-05")
I would like to assign two new columns:
isoweek_id = every isoweek in the year (so there will be one row for every week in the year)
data_days = the number of days data collection occurred within that isoweek given the begin.date and end.date, which represent date ranges when data collection occurred.
We might, therefore, have weeks in which the number of days of data collection is 0, for example if a gap in data collection spans more than one isoweek. (Note: my real data contain several years' worth of data collection.)
My desired output would look something like this:
begin.date end.date isoweek_id data_days
NA NA 29 0
2019-07-22 2019-07-29 30 7
2019-07-29 2019-08-03 31 6
NA NA 32 0
NA NA 33 0
2019-08-25 2019-08-30 34 1
2019-08-25 2019-08-30 35 5
2019-08-30 2019-09-24 36 7
2019-08-30 2019-09-24 37 7
2019-08-30 2019-09-24 38 7
2019-08-30 2019-09-24 39 2
2019-09-30 2019-10-05 40 6
NA NA 41 0
NA NA 42 0
NA NA 43 0
You can look at which isoweeks span which dates as follows:
library(ISOweek)
w <- paste("2019-W35", 1:7, sep = "-")
data.frame(weekdate = w, date = ISOweek2date(w))
Thank you in advance!

I hope this does the job:
library(dplyr)
library(lubridate)
library(tidyr)
df %>%
dplyr::arrange(begin.date) %>%
# unnest day sequence from start to end into df https://stackoverflow.com/questions/50997084/create-dataframe-of-rows-of-sequence-of-years-from-rows-with-start-end-dates
dplyr::group_by(rn = dplyr::row_number()) %>%
dplyr::mutate(dates = list(seq.Date(from = begin.date, to = end.date, by = "days"))) %>%
tidyr::unnest(dates) %>%
dplyr::ungroup() %>%
# right join list of all dates with iso week and year
dplyr::right_join(dplyr::tibble(dates = seq.Date(from = min(df$begin.date), max(df$end.date), by = "days")) %>%
dplyr::mutate(year = lubridate::year(dates),
iso_week = lubridate::isoweek(dates)),
by = "dates") %>%
# fill in rn where it is NA with a number that is larger than all existing rns
dplyr::mutate(rn = ifelse(is.na(rn), nrow(df) + 1, rn)) %>%
# summarize data
dplyr::group_by(year, iso_week, rn) %>%
dplyr::summarize(bdate = min(begin.date, na.rm = TRUE),
edate = min(end.date, na.rm = TRUE),
days = sum(ifelse(is.na(begin.date), 0, 1))) %>%
dplyr::ungroup() %>%
# get lowest sequential numbering per week since we can have duplicates like the example shows
dplyr::group_by(year, iso_week) %>%
dplyr::slice_min(order_by = rn, n = 1) %>%
dplyr::ungroup() # you might want to remove and/or rename columns
# A tibble: 11 x 6
year iso_week rn bdate edate days
<dbl> <dbl> <int> <date> <date> <dbl>
1 2019 30 1 2019-07-22 2019-07-29 7
2 2019 31 1 2019-07-22 2019-07-29 1
3 2019 32 6 NA NA 0
4 2019 33 6 NA NA 0
5 2019 34 3 2019-08-25 2019-08-30 1
6 2019 35 3 2019-08-25 2019-08-30 5
7 2019 36 4 2019-08-30 2019-09-24 7
8 2019 37 4 2019-08-30 2019-09-24 7
9 2019 38 4 2019-08-30 2019-09-24 7
10 2019 39 4 2019-08-30 2019-09-24 2
11 2019 40 5 2019-09-30 2019-10-05 6
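As a cross-check, here is a minimal base R sketch of a simpler reading of the problem: count the distinct days covered by any range in each ISO week. This is not the method used in the answer above. The names all_days, is_active and week_id are made up for illustration, and it assumes format() supports %G and %V on your platform. Because it counts distinct days, a week that overlaps two ranges gets one total rather than one row per range, and it only covers weeks between the first begin.date and the last end.date.

```r
# Date ranges copied from the question
df <- data.frame(
  begin.date = as.Date(c("2019-07-22", "2019-07-29", "2019-08-25",
                         "2019-08-30", "2019-09-30")),
  end.date   = as.Date(c("2019-07-29", "2019-08-03", "2019-08-30",
                         "2019-09-24", "2019-10-05"))
)

# Every calendar day from the first begin.date to the last end.date
all_days <- seq(min(df$begin.date), max(df$end.date), by = "day")

# TRUE when a day falls inside at least one [begin.date, end.date] range
is_active <- vapply(seq_along(all_days), function(i) {
  any(df$begin.date <= all_days[i] & all_days[i] <= df$end.date)
}, logical(1))

# ISO year and ISO week label for each day, e.g. "2019-W30"
week_id <- format(all_days, "%G-W%V")

# Number of active days per ISO week; weeks without data collection get 0
data_days <- tapply(is_active, week_id, sum)
print(data_days)
```

To cover every isoweek of the year, the same idea should work if all_days is extended to run from the first to the last day of the year(s) of interest.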

Related

Assigning Values in R by Date Range

I am trying to create a "week" variable in my dataset of daily observations that begins with a new value (1, 2, 3, et cetera) whenever a new Monday happens. My dataset has observations beginning on April 6th, 2020, and the data are stored in a "YYYY-MM-DD" as.date() format. In this example, an observation between April 6th and April 12th would be a "1", an observation between April 13th and April 19 would be a "2", et cetera.
I am aware of the week() function in lubridate, but unfortunately that doesn't work for my purposes because there are not exactly 54 weeks in the year, and therefore "week 54" would only be a few days long. In other words, I would like the days from December 28th, 2020 to January 3rd, 2021 to be categorized as the same week.
Does anyone have a good solution to this problem? I appreciate any insight folks might have.
This will also do
df <- data.frame(date = as.Date("2020-04-06")+ 0:365)
library(dplyr)
library(lubridate)
df %>% group_by(d= year(date), week = (isoweek(date))) %>%
mutate(week = cur_group_id()) %>% ungroup() %>% select(-d)
# A tibble: 366 x 2
date week
<date> <int>
1 2020-04-06 1
2 2020-04-07 1
3 2020-04-08 1
4 2020-04-09 1
5 2020-04-10 1
6 2020-04-11 1
7 2020-04-12 1
8 2020-04-13 2
9 2020-04-14 2
10 2020-04-15 2
# ... with 356 more rows
Subtract the minimum date from each date, divide the difference by 7, and use floor to get one number for each block of 7 days.
x <- as.Date(c('2020-04-06','2020-04-07','2020-04-13','2020-12-28','2021-01-03'))
as.integer(floor((x - min(x))/7) + 1)
#[1] 1 1 2 39 39
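Note that the floor approach counts 7-day blocks from the earliest date, so the blocks only line up with Monday-to-Sunday weeks when that earliest date is a Monday. A small sketch of one way to anchor on the Monday on or before the first observation, in case the data do not start on a Monday (week_start and days_since_monday are illustrative names, not part of the answer):

```r
# Example dates; the first one is a Wednesday, not a Monday
x <- as.Date(c("2020-04-08", "2020-04-12", "2020-04-13",
               "2020-12-28", "2021-01-03"))

first <- min(x)

# POSIXlt wday is 0 for Sunday ... 6 for Saturday; shift so Monday is 0
days_since_monday <- (as.POSIXlt(first)$wday + 6) %% 7

# Monday on or before the first observation
week_start <- first - days_since_monday

# Whole weeks elapsed since that Monday, numbered from 1
week <- as.integer(x - week_start) %/% 7 + 1L
print(week)
```

With these dates, 2020-12-28 and 2021-01-03 fall in the same week number, which is the behaviour asked for across the year boundary.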
Maybe lubridate::isoweek() and lubridate::isoyear() are what you want?
Some data:
df1 <- data.frame(date = seq.Date(as.Date("2020-04-06"),
as.Date("2021-01-04"),
by = "1 day"))
Example code:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
mutate(week = isoweek(date),
year = isoyear(date)) %>%
group_by(year) %>%
mutate(week2 = 1 + (week - min(week))) %>%
ungroup()
head(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-04-06 15 2020 1
2 2020-04-07 15 2020 1
3 2020-04-08 15 2020 1
4 2020-04-09 15 2020 1
5 2020-04-10 15 2020 1
6 2020-04-11 15 2020 1
7 2020-04-12 15 2020 1
8 2020-04-13 16 2020 2
tail(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-12-28 53 2020 39
2 2020-12-29 53 2020 39
3 2020-12-30 53 2020 39
4 2020-12-31 53 2020 39
5 2021-01-01 53 2020 39
6 2021-01-02 53 2020 39
7 2021-01-03 53 2020 39
8 2021-01-04 1 2021 1

Determine the number of process running each day and average days of commencing those projects, in R

I have a large dataset of processes (their IDs), start-dates and corresponding end dates.
What I want has two parts. First, how many processes are running on each day. Second, for the processes running on a given day, the mean number of days they have been running since they commenced.
Sample data set is like
> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020",
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020",
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA,
-5L))
df
> df
Process Start End
1 P001 01-01-2020 10-01-2020
2 P002 02-01-2020 09-01-2020
3 P003 03-01-2020 04-01-2020
4 P004 08-01-2020 17-01-2020
5 P005 13-01-2020 19-01-2020
For the first part I have proceeded like this:
library(tidyverse)
df %>% pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
mutate(dates = if_else(event == 'End', dates+1, dates)) %>%
arrange(dates, event) %>%
mutate(processes = ifelse(event == 'Start', 1, -1),
processes = cumsum(processes)) %>%
select(-Process, -event) %>%
complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
fill(processes)
# A tibble: 20 x 2
dates processes
<date> <dbl>
1 2020-01-01 1
2 2020-01-02 2
3 2020-01-03 3
4 2020-01-04 3
5 2020-01-05 2
6 2020-01-06 2
7 2020-01-07 2
8 2020-01-08 3
9 2020-01-09 3
10 2020-01-10 2
11 2020-01-11 1
12 2020-01-12 1
13 2020-01-13 2
14 2020-01-14 2
15 2020-01-15 2
16 2020-01-16 2
17 2020-01-17 2
18 2020-01-18 1
19 2020-01-19 1
20 2020-01-20 0
For the second part, the desired output is a mean days column: for each date, the mean number of days that the processes running on that date have been running.
A tidyverse approach is preferred, please.
Here is one approach :
library(tidyverse)
df %>%
#Convert to date
mutate(across(c(Start, End), lubridate::dmy),
#Create a sequence of dates from start to end
Dates = map2(Start, End, seq, by = 'day')) %>%
#Get data in long format
unnest(Dates) %>%
#Remove columns
select(-Start, -End) %>%
#For each process
group_by(Process) %>%
#Count number of days spent on it
mutate(days_spent = row_number() - 1) %>%
#For each date
group_by(Dates) %>%
#Count number of process running and average days
summarise(process = n(),
mean_days = mean(days_spent))
This returns :
# Dates process mean_days
# <date> <int> <dbl>
# 1 2020-01-01 1 0
# 2 2020-01-02 2 0.5
# 3 2020-01-03 3 1
# 4 2020-01-04 3 2
# 5 2020-01-05 2 3.5
# 6 2020-01-06 2 4.5
# 7 2020-01-07 2 5.5
# 8 2020-01-08 3 4.33
# 9 2020-01-09 3 5.33
#10 2020-01-10 2 5.5
#11 2020-01-11 1 3
#12 2020-01-12 1 4
#13 2020-01-13 2 2.5
#14 2020-01-14 2 3.5
#15 2020-01-15 2 4.5
#16 2020-01-16 2 5.5
#17 2020-01-17 2 6.5
#18 2020-01-18 1 5
#19 2020-01-19 1 6
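For comparison, the same two quantities can be sketched in base R with a days-by-processes matrix. This is only an illustration under the same reading of "mean days" as the answer above (days elapsed since each running process started); the names running and elapsed are made up here.

```r
# Sample data from the question, with dates parsed as day-month-year
df <- data.frame(
  Process = c("P001", "P002", "P003", "P004", "P005"),
  Start = as.Date(c("01-01-2020", "02-01-2020", "03-01-2020",
                    "08-01-2020", "13-01-2020"), format = "%d-%m-%Y"),
  End   = as.Date(c("10-01-2020", "09-01-2020", "04-01-2020",
                    "17-01-2020", "19-01-2020"), format = "%d-%m-%Y")
)

days <- seq(min(df$Start), max(df$End), by = "day")

# Work on day numbers so outer() only sees plain numerics
d <- as.numeric(days)
s <- as.numeric(df$Start)
e <- as.numeric(df$End)

# running[i, j] is TRUE when process j is running on day i
running <- outer(d, s, ">=") & outer(d, e, "<=")

# elapsed[i, j] is the number of days since process j started
elapsed <- outer(d, s, "-")

processes <- rowSums(running)
mean_days <- rowSums(elapsed * running) / processes

result <- data.frame(Dates = days, process = processes,
                     mean_days = round(mean_days, 2))
print(result)
```

The matrix has one cell per day and process, so for a large dataset the unnest approach above is likely to scale better; the sketch is mainly useful for checking results on a small sample.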

How to calculate Quarter Over Quarter %change when the dataset is monthly

I have this df which observations are monthly represented:
library(dplyr)
library(lubridate)
Date <- seq(from = as_date("2019-11-01"), to = as_date("2020-10-01"), by = "month")
A <- (10:21)
df <- data.frame(Date, A)
View(df)
Date A
<date> <int>
1 2019-11-01 10
2 2019-12-01 11
3 2020-01-01 12
4 2020-02-01 13
5 2020-03-01 14
6 2020-04-01 15
7 2020-05-01 16
8 2020-06-01 17
9 2020-07-01 18
10 2020-08-01 19
11 2020-09-01 20
12 2020-10-01 21
Using lag() I know how to calculate %change from Month over Month (MoM), but haven't been able to compare a quarter with the previous quarter: i.e, the sum of 3 months compared with the previous 3 months summed. I tried a loop approach but it didn't work and there should be a more efficient approach.
I appreciate it if someone can help.
We can use as.yearqtr from zoo to convert the 'Date' column to quarters, sum by quarter, and then take the difference between the next and the current quarter (lead) or between the current and the previous quarter (lag).
library(dplyr)
library(zoo)
df %>%
group_by(Quarter = as.yearqtr(Date)) %>%
summarise(A = sum(A), .groups = 'drop') %>%
mutate(Diff = lead(A) - A)
-output
# A tibble: 5 x 3
# Quarter A Diff
# <yearqtr> <int> <int>
#1 2019 Q4 21 18
#2 2020 Q1 39 9
#3 2020 Q2 48 9
#4 2020 Q3 57 -36
#5 2020 Q4 21 NA
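The question asks for the percentage change against the previous quarter, while the output above shows the absolute difference to the next quarter. A minimal base R sketch of the percentage version, assuming calendar quarters and using illustrative names (qtr, q_sum, prev, pct_change):

```r
Date <- seq(as.Date("2019-11-01"), as.Date("2020-10-01"), by = "month")
A <- 10:21

# Label each month with its calendar quarter, e.g. "2019 Q4"
qtr <- paste(format(Date, "%Y"), quarters(Date))

# Quarterly totals as a named vector; the labels sort chronologically
q_sum <- vapply(split(A, qtr), sum, numeric(1))

# Previous quarter's total (NA for the first quarter)
prev <- c(NA, head(q_sum, -1))

# Quarter-over-quarter percentage change
pct_change <- (q_sum - prev) / prev * 100
print(round(pct_change, 2))
```

In this sample the first and last quarters are incomplete (2019 Q4 has two months, 2020 Q4 has one), so their changes are not like-for-like comparisons.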

How to show missing dates in case of application of rolling function

Suppose I have a data df of some insurance policies.
library(tidyverse)
library(lubridate)
#Example data
d <- as.Date("2020-01-01", format = "%Y-%m-%d")
set.seed(50)
df <- data.frame(id = 1:10,
activation_dt = round(runif(10)*100,0) +d,
expiry_dt = d+round(runif(10)*100,0)+c(rep(180,5), rep(240,5)))
> df
id activation_dt expiry_dt
1 1 2020-03-12 2020-08-07
2 2 2020-02-14 2020-07-26
3 3 2020-01-21 2020-09-01
4 4 2020-03-18 2020-07-07
5 5 2020-02-21 2020-07-27
6 6 2020-01-05 2020-11-04
7 7 2020-03-11 2020-11-20
8 8 2020-03-06 2020-10-03
9 9 2020-01-05 2020-09-04
10 10 2020-01-12 2020-09-14
I want to see how many policies were active during each month. That I have done by the following method.
# Getting required result
df %>% arrange(activation_dt) %>%
pivot_longer(cols = c(activation_dt, expiry_dt),
names_to = "event",
values_to = "event_date") %>%
mutate(dummy = ifelse(event == "activation_dt", 1, -1)) %>%
mutate(dummy2 = floor_date(event_date, "month")) %>%
arrange(dummy2) %>% group_by(dummy2) %>%
summarise(dummy=sum(dummy)) %>%
mutate(dummy = cumsum(dummy)) %>%
select(dummy2, dummy)
# A tibble: 8 x 2
dummy2 dummy
<date> <dbl>
1 2020-01-01 4
2 2020-02-01 6
3 2020-03-01 10
4 2020-07-01 7
5 2020-08-01 6
6 2020-09-01 3
7 2020-10-01 2
8 2020-11-01 0
Now I have a problem: how do I deal with the missing months, e.g. April 2020 to June 2020?
A data.table solution: generate the sequence of months, then use a non-equi join to find the policies active in each month and count them.
library(lubridate)
library(data.table)
setDT(df)
months <- seq(lubridate::floor_date(min(df$activation_dt),'month'),lubridate::floor_date(max(df$expiry_dt),'month'),by='month')
months <- data.table(months)
df[,c("activation_dt_month","expiry_dt_month"):=.(lubridate::floor_date(activation_dt,'month'),
lubridate::floor_date(expiry_dt,'month'))]
df[months, .(months),on = .(activation_dt_month<=months,expiry_dt_month>=months)][,.(nb=.N),by=months]
months nb
1: 2020-01-01 4
2: 2020-02-01 6
3: 2020-03-01 10
4: 2020-04-01 10
5: 2020-05-01 10
6: 2020-06-01 10
7: 2020-07-01 10
8: 2020-08-01 7
9: 2020-09-01 6
10: 2020-10-01 3
11: 2020-11-01 2
Here is an alternative tidyverse/lubridate solution in case you are interested. The data.table version will be faster, but this should give you the correct results with gaps in months.
First use map2 to create a sequence of months between activation and expiration for each row of data. This will allow you to group by month/year to count number of active policies for each month.
library(tidyverse)
library(lubridate)
df %>%
mutate(month = map2(floor_date(activation_dt, "month"),
floor_date(expiry_dt, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2020-01 4
2 2020-02 6
3 2020-03 10
4 2020-04 10
5 2020-05 10
6 2020-06 10
7 2020-07 10
8 2020-08 7
9 2020-09 6
10 2020-10 3
11 2020-11 2
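If you want to check the counts without extra packages, the same overlap logic can be sketched in base R. The data below are typed in from the printed table in the question rather than regenerated with runif(), and month_floor is a hypothetical helper, not a lubridate function.

```r
df <- data.frame(
  id = 1:10,
  activation_dt = as.Date(c("2020-03-12", "2020-02-14", "2020-01-21",
                            "2020-03-18", "2020-02-21", "2020-01-05",
                            "2020-03-11", "2020-03-06", "2020-01-05",
                            "2020-01-12")),
  expiry_dt = as.Date(c("2020-08-07", "2020-07-26", "2020-09-01",
                        "2020-07-07", "2020-07-27", "2020-11-04",
                        "2020-11-20", "2020-10-03", "2020-09-04",
                        "2020-09-14"))
)

# Hypothetical helper: first day of the month for each date
month_floor <- function(x) as.Date(format(x, "%Y-%m-01"))

act_m <- month_floor(df$activation_dt)
exp_m <- month_floor(df$expiry_dt)

# Full month sequence, so months without any event still appear
months <- seq(min(act_m), max(exp_m), by = "month")

# A policy is active in a month if it started on or before that month
# and expires in or after it
active <- vapply(seq_along(months), function(i) {
  sum(act_m <= months[i] & exp_m >= months[i])
}, integer(1))

result <- data.frame(month = format(months, "%Y-%m"), active = active)
print(result)
```

Like the two answers above, this counts a policy as active in its expiry month, which differs from the cumulative-sum attempt in the question.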

R inserting rows between dates by group based on second column

I have a df that looks like this
ID FINAL_DT START_DT
23 NA 2020-03-20
25 NA 2020-04-10
29 2020-02-02 2020-01-23
30 NA 2020-01-02
What I would like to do is, for each ID, add a row for every month starting from START_DT and ending at whichever comes first: FINAL_DT or the current date. The expected output would be as follows:
ID FINAL_DT START_DT ACTIVE_MONTH
23 NA 2020-03-20 2020-03
23 NA NA 2020-04
23 NA NA 2020-05
25 NA 2020-04-10 2020-04
25 NA NA 2020-05
29 2020-02-02 2020-01-23 2020-01
29 2020-02-02 NA 2020-02
30 NA 2020-01-02 2020-01
30 NA NA 2020-02
30 NA NA 2020-03
30 NA NA 2020-04
30 NA NA 2020-05
I have the following code which works but does not account for FINAL_DT
current_date = as.Date(Sys.Date())
enroll <- enroll %>%
group_by(ID) %>%
complete(START_DATE = seq(START_DATE, current_date, by = "month"))
I have tried the following but get an error I believe due to the NA's:
current_date = as.Date(Sys.Date())
enroll <- enroll %>%
group_by(ID) %>%
complete(START_DATE = seq(START_DATE, min(FINAL_DT,current_date), by = "month"))
The day of the month does not matter either; I am not sure whether it would be easier to drop it before or after.
Here is another approach. You can use floor_date to get the first day of the month to use in your sequence of months. Then, you can include the full sequence to today's date, and filter based on FINAL_DT. You can use as.yearmon from zoo if you'd like a month/year object for month.
library(zoo)
library(tidyr)
library(dplyr)
library(lubridate)
current_date = as.Date(Sys.Date())
enroll %>%
mutate(ACTIVE_MONTH = floor_date(START_DT, unit = "month")) %>%
group_by(ID) %>%
complete(ACTIVE_MONTH = seq.Date(floor_date(START_DT, unit = "month"), current_date, by = "month")) %>%
filter(ACTIVE_MONTH <= first(FINAL_DT) | is.na(first(FINAL_DT))) %>%
ungroup() %>%
mutate(ACTIVE_MONTH = as.yearmon(ACTIVE_MONTH))
Output
# A tibble: 12 x 4
ID ACTIVE_MONTH FINAL_DT START_DT
<dbl> <yearmon> <date> <date>
1 23 Mar 2020 NA 2020-03-20
2 23 Apr 2020 NA NA
3 23 May 2020 NA NA
4 25 Apr 2020 NA 2020-04-10
5 25 May 2020 NA NA
6 29 Jan 2020 2020-02-02 2020-01-23
7 29 Feb 2020 NA NA
8 30 Jan 2020 NA 2020-01-02
9 30 Feb 2020 NA NA
10 30 Mar 2020 NA NA
11 30 Apr 2020 NA NA
12 30 May 2020 NA NA
Here is an approach that returns rows for each MONTH with the help of lubridate.
library(dplyr)
library(tidyr)
library(lubridate)
current_date = as.Date(Sys.Date())
enroll %>%
mutate(MONTH = month(START_DT)) %>%
group_by(ID) %>%
complete(MONTH = seq(MONTH, min(month(FINAL_DT)[!is.na(FINAL_DT)],month(current_date))))
# A tibble: 12 x 4
# Groups: ID [4]
# ID MONTH FINAL_DT START_DT
# <int> <dbl> <fct> <fct>
# 1 23 3 NA 2020-03-20
# 2 23 4 NA NA
# 3 23 5 NA NA
# 4 25 4 NA 2020-04-10
# 5 25 5 NA NA
# 6 29 1 2020-02-02 2020-01-23
# 7 29 2 NA NA
# 8 30 1 NA 2020-01-02
# 9 30 2 NA NA
#10 30 3 NA NA
#11 30 4 NA NA
#12 30 5 NA NA
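The attempt in the question fails because min(FINAL_DT, current_date) returns NA whenever FINAL_DT is NA. One possible way around that is pmin() with na.rm = TRUE, which picks the earlier date row by row and falls back to the current date when FINAL_DT is missing. A base R sketch follows; current_date is fixed to a made-up value so the result matches the expected output in the question, and month_floor and end_dt are illustrative names.

```r
enroll <- data.frame(
  ID = c(23, 25, 29, 30),
  FINAL_DT = as.Date(c(NA, NA, "2020-02-02", NA)),
  START_DT = as.Date(c("2020-03-20", "2020-04-10", "2020-01-23", "2020-01-02"))
)

# Assumption: a fixed stand-in for Sys.Date() so the example is reproducible
current_date <- as.Date("2020-05-15")

# Earlier of FINAL_DT and current_date per row; NA in FINAL_DT is ignored
end_dt <- pmin(enroll$FINAL_DT, current_date, na.rm = TRUE)

# Hypothetical helper: first day of the month for each date
month_floor <- function(x) as.Date(format(x, "%Y-%m-01"))

# One row per ID and month from the start month to the end month
rows <- lapply(seq_len(nrow(enroll)), function(i) {
  months <- seq(month_floor(enroll$START_DT[i]),
                month_floor(end_dt[i]), by = "month")
  data.frame(ID = enroll$ID[i], ACTIVE_MONTH = format(months, "%Y-%m"))
})

result <- do.call(rbind, rows)
print(result)
```

Flooring both ends to the first of the month drops the day of the month up front, and unlike working with month numbers alone it keeps the year, so ranges that cross a year boundary still work.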
