Determine the number of processes running each day and the average days since commencement, in R

I have a large dataset of processes (their IDs), start-dates and corresponding end dates.
What I want is in two parts. First, how many processes are running each day; second, the running processes' mean number of days since commencement.
The sample data set looks like this:
> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020",
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020",
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA,
-5L))
> df
Process Start End
1 P001 01-01-2020 10-01-2020
2 P002 02-01-2020 09-01-2020
3 P003 03-01-2020 04-01-2020
4 P004 08-01-2020 17-01-2020
5 P005 13-01-2020 19-01-2020
For the first part I have proceeded like this:
library(tidyverse)
df %>%
  pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
  mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
  mutate(dates = if_else(event == 'End', dates + 1, dates)) %>%
  arrange(dates, event) %>%
  mutate(processes = ifelse(event == 'Start', 1, -1),
         processes = cumsum(processes)) %>%
  select(-Process, -event) %>%
  complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
  fill(processes)
# A tibble: 20 x 2
dates processes
<date> <dbl>
1 2020-01-01 1
2 2020-01-02 2
3 2020-01-03 3
4 2020-01-04 3
5 2020-01-05 2
6 2020-01-06 2
7 2020-01-07 2
8 2020-01-08 3
9 2020-01-09 3
10 2020-01-10 2
11 2020-01-11 1
12 2020-01-12 1
13 2020-01-13 2
14 2020-01-14 2
15 2020-01-15 2
16 2020-01-16 2
17 2020-01-17 2
18 2020-01-18 1
19 2020-01-19 1
20 2020-01-20 0
For the second part, the desired output adds a mean_days column: for each date, the mean number of days the currently running processes have been going, counted from each process's start date.
tidyverse approach will be preferred, please.

Here is one approach:
library(tidyverse)
df %>%
  #Convert to date
  mutate(across(c(Start, End), lubridate::dmy),
         #Create a sequence of dates from start to end
         Dates = map2(Start, End, seq, by = 'day')) %>%
  #Get data in long format
  unnest(Dates) %>%
  #Remove columns
  select(-Start, -End) %>%
  #For each process
  group_by(Process) %>%
  #Count number of days spent on it
  mutate(days_spent = row_number() - 1) %>%
  #For each date
  group_by(Dates) %>%
  #Count number of processes running and average days
  summarise(process = n(),
            mean_days = mean(days_spent))
This returns:
# Dates process mean_days
# <date> <int> <dbl>
# 1 2020-01-01 1 0
# 2 2020-01-02 2 0.5
# 3 2020-01-03 3 1
# 4 2020-01-04 3 2
# 5 2020-01-05 2 3.5
# 6 2020-01-06 2 4.5
# 7 2020-01-07 2 5.5
# 8 2020-01-08 3 4.33
# 9 2020-01-09 3 5.33
#10 2020-01-10 2 5.5
#11 2020-01-11 1 3
#12 2020-01-12 1 4
#13 2020-01-13 2 2.5
#14 2020-01-14 2 3.5
#15 2020-01-15 2 4.5
#16 2020-01-16 2 5.5
#17 2020-01-17 2 6.5
#18 2020-01-18 1 5
#19 2020-01-19 1 6
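Note that this result omits days with no running processes (the first part's table ends with 2020-01-20 and a count of 0). As a rough sketch, not part of the original answer, you could pad the result with complete(); here res is a hypothetical name for the summarised tibble above:
library(tidyverse)
#res is assumed to hold the summarised output of the pipeline above
res %>%
  complete(Dates = seq(min(Dates), max(Dates) + 1, by = 'day'),
           fill = list(process = 0L))
mean_days stays NA on the padded rows, which seems reasonable for days with nothing running.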

Related

How to divide into groups depending on idx and diff in R?

Here is my dataset. I want to assign group numbers depending on idx and diff. Specifically, rows with the same idx should keep the same group number while diff stays under 14 days; once diff reaches 14 days or more, a new group should start, even within the same idx.
library(dplyr)

idx = c("a","a","a","a","b","b","b","c","c","c","c")
date = c(20201115, 20201116, 20201117, 20201105, 20201107, 20201110, 20210113, 20160930, 20160504, 20160913, 20160927)
group = c("1","1","1","1","2","2","3","4","5","6","6")
df = data.frame(idx, date, group)
df <- df %>% arrange(idx, date)
df$date <- as.Date(as.character(df$date), format = '%Y%m%d')
df <- df %>%
  group_by(idx) %>%
  mutate(diff = date - lag(date))
The group vector above shows the result I want.
Use cumsum() to create another grouping criterion, and then cur_group_id().
library(dplyr)
df %>%
  group_by(idx) %>%
  mutate(diff = difftime(date, lag(date, default = first(date)), units = "days"),
         cu = cumsum(diff >= 14)) %>%
  group_by(idx, cu) %>%
  mutate(group = cur_group_id()) %>%
  ungroup() %>%
  select(-cu)
# A tibble: 11 × 4
idx date group diff
<chr> <date> <int> <drtn>
1 a 2020-11-05 1 0 days
2 a 2020-11-15 1 10 days
3 a 2020-11-16 1 1 days
4 a 2020-11-17 1 1 days
5 b 2020-11-07 2 0 days
6 b 2020-11-10 2 3 days
7 b 2021-01-13 3 64 days
8 c 2016-05-04 4 0 days
9 c 2016-09-13 5 132 days
10 c 2016-09-27 6 14 days
11 c 2016-09-30 6 3 days
Given that the first value of diff must be NA because of the use of lag(), you could use cumsum(diff >= 14 | is.na(diff)) without grouping to create the new group:
library(dplyr)
df %>%
  group_by(idx) %>%
  mutate(diff = date - lag(date)) %>%
  ungroup() %>%
  mutate(group = cumsum(diff >= 14 | is.na(diff)))
# # A tibble: 11 × 4
# idx date diff group
# <chr> <date> <drtn> <int>
# 1 a 2020-11-05 NA days 1
# 2 a 2020-11-15 10 days 1
# 3 a 2020-11-16 1 days 1
# 4 a 2020-11-17 1 days 1
# 5 b 2020-11-07 NA days 2
# 6 b 2020-11-10 3 days 2
# 7 b 2021-01-13 64 days 3
# 8 c 2016-05-04 NA days 4
# 9 c 2016-09-13 132 days 5
# 10 c 2016-09-27 14 days 6
# 11 c 2016-09-30 3 days 6

How to show missing dates when applying a rolling function

Suppose I have a data frame df of some insurance policies.
library(tidyverse)
library(lubridate)
#Example data
d <- as.Date("2020-01-01", format = "%Y-%m-%d")
set.seed(50)
df <- data.frame(id = 1:10,
                 activation_dt = round(runif(10) * 100, 0) + d,
                 expiry_dt = d + round(runif(10) * 100, 0) + c(rep(180, 5), rep(240, 5)))
> df
id activation_dt expiry_dt
1 1 2020-03-12 2020-08-07
2 2 2020-02-14 2020-07-26
3 3 2020-01-21 2020-09-01
4 4 2020-03-18 2020-07-07
5 5 2020-02-21 2020-07-27
6 6 2020-01-05 2020-11-04
7 7 2020-03-11 2020-11-20
8 8 2020-03-06 2020-10-03
9 9 2020-01-05 2020-09-04
10 10 2020-01-12 2020-09-14
I want to see how many policies were active during each month. I have done that with the following method.
# Getting required result
df %>%
  arrange(activation_dt) %>%
  pivot_longer(cols = c(activation_dt, expiry_dt),
               names_to = "event",
               values_to = "event_date") %>%
  mutate(dummy = ifelse(event == "activation_dt", 1, -1)) %>%
  mutate(dummy2 = floor_date(event_date, "month")) %>%
  arrange(dummy2) %>%
  group_by(dummy2) %>%
  summarise(dummy = sum(dummy)) %>%
  mutate(dummy = cumsum(dummy)) %>%
  select(dummy2, dummy)
# A tibble: 8 x 2
dummy2 dummy
<date> <dbl>
1 2020-01-01 4
2 2020-02-01 6
3 2020-03-01 10
4 2020-07-01 7
5 2020-08-01 6
6 2020-09-01 3
7 2020-10-01 2
8 2020-11-01 0
Now the problem is how to deal with missing months, e.g. April 2020 to June 2020: months that had active policies but no activation or expiry events simply don't appear in the result.
A data.table solution: generate the sequence of months, then use a non-equi join to find the policies active in each month and count them.
library(lubridate)
library(data.table)

setDT(df)
months <- seq(lubridate::floor_date(min(df$activation_dt), 'month'),
              lubridate::floor_date(max(df$expiry_dt), 'month'),
              by = 'month')
months <- data.table(months)
df[, c("activation_dt_month", "expiry_dt_month") := .(lubridate::floor_date(activation_dt, 'month'),
                                                      lubridate::floor_date(expiry_dt, 'month'))]
df[months, .(months), on = .(activation_dt_month <= months, expiry_dt_month >= months)][, .(nb = .N), by = months]
months nb
1: 2020-01-01 4
2: 2020-02-01 6
3: 2020-03-01 10
4: 2020-04-01 10
5: 2020-05-01 10
6: 2020-06-01 10
7: 2020-07-01 10
8: 2020-08-01 7
9: 2020-09-01 6
10: 2020-10-01 3
11: 2020-11-01 2
Here is an alternative tidyverse/lubridate solution in case you are interested. The data.table version will be faster, but this should give you the correct results even when there are gaps in the months.
First, use map2 to create a sequence of months between activation and expiry for each row of data. This lets you group by month/year and count the number of active policies in each month.
library(tidyverse)
library(lubridate)
df %>%
  mutate(month = map2(floor_date(activation_dt, "month"),
                      floor_date(expiry_dt, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  transmute(month_year = substr(month, 1, 7)) %>%
  group_by(month_year) %>%
  summarise(count = n())
Output
month_year count
<chr> <int>
1 2020-01 4
2 2020-02 6
3 2020-03 10
4 2020-04 10
5 2020-05 10
6 2020-06 10
7 2020-07 10
8 2020-08 7
9 2020-09 6
10 2020-10 3
11 2020-11 2
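If you would rather keep the month as a proper Date column than a "YYYY-MM" string, a small variant (my adjustment, not from the original answer) counts the floored dates directly:
library(tidyverse)
library(lubridate)
df %>%
  mutate(month = map2(floor_date(activation_dt, "month"),
                      floor_date(expiry_dt, "month"),
                      seq.Date, by = "month")) %>%
  unnest(month) %>%
  #count() collapses to one row per month with a count column
  count(month, name = "count")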

Join with fuzzy matching by date in R

I have two data frames that I'd like to join by date.
library(tidyverse)
library(lubridate)

df1 <- data.frame(
  day = seq(ymd("2020-01-01"), ymd("2020-01-14"), by = "1 day"),
  key = rep(c("green", "blue"), 7),
  value_x = sample(1:100, 14)
) %>%
  as_tibble()

df2 <- data.frame(
  day = seq(ymd("2020-01-01"), ymd("2020-01-12"), by = "3 days"),
  key = rep(c("green", "blue"), 2),
  value_y = c(2, 4, 6, 8)
) %>%
  as_tibble()
I want the output to be like this
# A tibble: 14 x 4
day key value_x value_y
<date> <fct> <int> <int>
1 2020-01-01 green 91 2
2 2020-01-02 blue 28 NA
3 2020-01-03 green 75 2
4 2020-01-04 blue 14 4
5 2020-01-05 green 3 2
6 2020-01-06 blue 27 4
7 2020-01-07 green 15 6
8 2020-01-08 blue 7 4
9 2020-01-09 green 1 6
10 2020-01-10 blue 10 8
11 2020-01-11 green 9 6
12 2020-01-12 blue 76 8
13 2020-01-13 green 31 6
14 2020-01-14 blue 62 8
I tried this code:
merge(df1, df2, by = c("day", "key"), all.x = TRUE)
I'd like each day in the left table to join to the most recent day in df2 that has a value. If there is no such day, the value should be NA.
Edit --
Not all the dates in df2 appear in df1, though they share a common id. Here is an example:
df1
day id key
1 2020-01-08 A green
2 2020-01-10 A green
3 2020-02-24 A blue
4 2020-03-24 A green
df2
day id value
1 2020-01-03 A 2
2 2020-01-07 A 4
3 2020-01-22 A 4
4 2020-03-24 A 6
desired output
day id key value
1 2020-01-08 A green 4
2 2020-01-10 A green 4
3 2020-02-24 A blue 4
4 2020-03-24 A green 6
After merging, you can arrange the data based on key and day and fill with the most recent non-NA value.
library(dplyr)
merge(df1, df2, by = c('day', 'key'), all.x = TRUE) %>%
  arrange(key, day) %>%
  group_by(key) %>%
  tidyr::fill(value_y) %>%
  arrange(day)
# day key value_x value_y
#1 2020-01-01 green 40 2
#2 2020-01-02 blue 45 NA
#3 2020-01-03 green 54 2
#4 2020-01-04 blue 11 4
#5 2020-01-05 green 12 2
#6 2020-01-06 blue 7 4
#7 2020-01-07 green 72 6
#8 2020-01-08 blue 76 4
#9 2020-01-09 green 52 6
#10 2020-01-10 blue 32 8
#11 2020-01-11 green 69 6
#12 2020-01-12 blue 10 8
#13 2020-01-13 green 63 6
#14 2020-01-14 blue 84 8
For the updated data you can use the following:
df1 %>%
  left_join(df2, by = 'id') %>%
  mutate(diff = day.x - day.y) %>%
  group_by(id, key, day.x) %>%
  filter(diff == min(diff[diff >= 0])) %>%
  arrange(day.x) %>%
  select(day = day.x, id, key, value)
# day id key value
# <date> <chr> <chr> <int>
#1 2020-01-08 A green 4
#2 2020-01-10 A green 4
#3 2020-02-24 A blue 4
#4 2020-03-24 A green 6
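If your dplyr is 1.1.0 or newer, the same "most recent earlier day" logic can be expressed directly as a rolling join with join_by(closest()); this is a sketch under that version assumption, not the method used in the answer above:
library(dplyr)
#Requires dplyr >= 1.1.0 for join_by() and closest()
df1 %>%
  left_join(df2,
            #match each df1 day to the closest df2 day at or before it
            by = join_by(id, closest(day >= day)),
            suffix = c("", "_y")) %>%
  select(day, id, key, value)
Rows in df1 with no earlier df2 day are kept, with NA for value.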

cumsum with NAs and another condition in R

I've seen lots of questions like this but can't figure this simple problem out. I don't want to collapse the dataset. Say I have this dataset:
library(tidyverse)
library(lubridate)
df <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b"),
                 starts = c("2011-09-18", NA, "2014-08-08", "2016-09-18", NA, "2013-08-08", "2015-08-08", NA),
                 ends = c(NA, "2013-03-06", "2015-08-08", NA, "2017-03-06", "2014-08-08", NA, "2016-08-08"))
df$starts <- parse_date_time(df$starts, "ymd")
df$ends <- parse_date_time(df$ends, "ymd")
df
group starts ends
1 a 2011-09-18 <NA>
2 a <NA> 2013-03-06
3 a 2014-08-08 2015-08-08
4 a 2016-09-18 <NA>
5 a <NA> 2017-03-06
6 b 2013-08-08 2014-08-08
7 b 2015-08-08 <NA>
8 b <NA> 2016-08-08
Desired output is:
group starts ends epi
1 a 2011-09-18 <NA> 1
2 a <NA> 2013-03-06 1
3 a 2014-08-08 2015-08-08 2
4 a 2016-09-18 <NA> 3
5 a <NA> 2017-03-06 3
6 b 2013-08-08 2014-08-08 1
7 b 2015-08-08 <NA> 2
8 b <NA> 2016-08-08 2
I was thinking something like this, but it obviously doesn't account for episodes where there is no NA:
df <- df %>%
  group_by(group) %>%
  mutate(epi = cumsum(is.na(ends)))
df
I'm not sure how to combine cumsum(is.na(...)) with an if_else condition. Maybe I'm going at it the wrong way?
Any suggestions would be great.
A solution using dplyr, assuming your data frame is well structured so that each start always has an associated end record.
df2 <- df %>%
  group_by(group) %>%
  mutate(epi = cumsum(!is.na(starts))) %>%
  ungroup()
df2
# # A tibble: 8 x 4
# group starts ends epi
# <fct> <dttm> <dttm> <int>
# 1 a 2011-09-18 00:00:00 NA 1
# 2 a NA 2013-03-06 00:00:00 1
# 3 a 2014-08-08 00:00:00 2015-08-08 00:00:00 2
# 4 a 2016-09-18 00:00:00 NA 3
# 5 a NA 2017-03-06 00:00:00 3
# 6 b 2013-08-08 00:00:00 2014-08-08 00:00:00 1
# 7 b 2015-08-08 00:00:00 NA 2
# 8 b NA 2016-08-08 00:00:00 2
An option is to get the rowSums() of the NA elements in the 'starts' and 'ends' columns, then, grouped by 'group', take the rleid() of that count to create 'epi'.
library(dplyr)
library(data.table)
df %>%
  mutate(epi = rowSums(is.na(.[c("starts", "ends")]))) %>%
  group_by(group) %>%
  mutate(epi = rleid(epi))
# A tibble: 8 x 4
# Groups: group [2]
# group starts ends epi
# <fct> <dttm> <dttm> <int>
#1 a 2011-09-18 00:00:00 NA 1
#2 a NA 2013-03-06 00:00:00 1
#3 a 2014-08-08 00:00:00 2015-08-08 00:00:00 2
#4 a 2016-09-18 00:00:00 NA 3
#5 a NA 2017-03-06 00:00:00 3
#6 b 2013-08-08 00:00:00 2014-08-08 00:00:00 1
#7 b 2015-08-08 00:00:00 NA 2
#8 b NA 2016-08-08 00:00:00 2
If there are only two columns to consider:
df %>%
  group_by(group) %>%
  mutate(epi = rleid(is.na(starts) + is.na(ends)))
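If you'd rather not load data.table just for rleid(), dplyr 1.1.0's consecutive_id() can play the same role; a sketch assuming that version:
library(dplyr)
#consecutive_id() needs dplyr >= 1.1.0; it restarts from 1 within each group
df %>%
  group_by(group) %>%
  mutate(epi = consecutive_id(is.na(starts) + is.na(ends))) %>%
  ungroup()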

How to calculate unique ids over the most recent n days

Say I want to count, for every day, the unique ids seen over the most recent 15 days. Here is the code:
library(tidyverse)
library(lubridate)
set.seed(1)
eg <- tibble(day = sample(seq(ymd('2018-01-01'), length.out = 100, by = 'day'), 300, replace = T),
             id = sample(letters[1:26], 300, replace = T),
             value = rnorm(300))
eg %>%
  group_by(day) %>%
  summarise(uniqu_id = n_distinct(id),
            recent_15_days_unique_id = 'howto',
            day_total = sum(value))
The result is
# A tibble: 95 x 4
day uniqu_id recent_15_days_unique_id day_total
<date> <int> <chr> <dbl>
1 2018-01-01 3 howto -1.38
2 2018-01-02 3 howto 2.01
3 2018-01-03 3 howto 1.57
4 2018-01-04 6 howto -1.64
5 2018-01-05 2 howto -0.293
6 2018-01-06 4 howto -2.08
For the 'recent_15_days_unique_id' column, the first row should count unique ids between day - 15 and day, i.e. between '2017-12-17' and '2018-01-01'; the second row between '2017-12-18' and '2018-01-02'. It is kind of like a 'rollsum' function, but for counting.
We can ungroup and, for every day, create a sequence of the previous 15 days and count all the unique ids within it.
library(dplyr)
eg %>%
  group_by(day) %>%
  summarise(uniqu_id = n_distinct(id),
            day_total = sum(value)) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(recent_15_days_unique_id =
           n_distinct(eg$id[eg$day %in% seq(day - 15, day, by = "1 day")]))
# day uniqu_id day_total recent_15_days_unique_id
# <date> <int> <dbl> <int>
#1 2018-01-02 2 0.170 2
#2 2018-01-03 2 -0.460 3
#3 2018-01-04 1 -1.53 3
#4 2018-01-05 2 1.67 5
#5 2018-01-06 2 1.52 6
#6 2018-01-07 4 -1.62 10
#7 2018-01-08 2 -0.0190 12
#8 2018-01-09 1 -0.573 12
#9 2018-01-10 2 -0.220 13
#10 2018-01-11 7 -1.73 14
Using the same logic, we can also calculate it separately using sapply:
new_eg <- eg %>%
  group_by(day) %>%
  summarise(uniqu_id = n_distinct(id),
            day_total = sum(value)) %>%
  ungroup()

sapply(new_eg$day, function(x)
  n_distinct(eg$id[as.numeric(eg$day) %in% seq(x - 15, x, by = "1 day")]))
#[1] 2 3 3 5 6 10 12 12 13 14 15 16 17 17 18 20 21 22 22 20 20 21 21 .....
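For larger data, the rowwise scan over eg can get slow. The slider package (an assumption on my part; it is not used in the answer above) offers indexed sliding windows that handle the irregular dates directly:
library(dplyr)
library(slider)
eg %>%
  #slide_index_*() requires the index column to be sorted
  arrange(day) %>%
  #window = all rows whose day lies in [day - 15, day], as in seq(day - 15, day)
  mutate(recent = slide_index_int(id, day, n_distinct, .before = 15)) %>%
  group_by(day) %>%
  summarise(uniqu_id = n_distinct(id),
            recent_15_days_unique_id = last(recent),
            day_total = sum(value))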
