How to divide into groups depending on idx and diff in R?

Here is my dataset. I want to assign group numbers based on idx and diff: rows with the same idx that are within 14 days of each other should share a group number, and rows with the same idx that are 14 or more days apart should start a new group.
library(dplyr)
idx = c("a","a","a","a","b","b","b","c","c","c","c")
date = c(20201115, 20201116, 20201117, 20201105, 20201107, 20201110, 20210113, 20160930, 20160504, 20160913, 20160927)
group = c("1","1","1","1","2","2","3","4","5","6","6")
df = data.frame(idx, date, group)
df <- df %>% arrange(idx, date)
df$date <- as.Date(as.character(df$date), format = '%Y%m%d')
df <- df %>%
  group_by(idx) %>%
  mutate(diff = date - lag(date))
This is the result I want:
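idx   date       group
a     2020-11-05 1
a     2020-11-15 1
a     2020-11-16 1
a     2020-11-17 1
b     2020-11-07 2
b     2020-11-10 2
b     2021-01-13 3
c     2016-05-04 4
c     2016-09-13 5
c     2016-09-27 6
c     2016-09-30 6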

Use cumsum to create another grouping criterion, and then cur_group_id():
library(dplyr)
df %>%
  group_by(idx) %>%
  mutate(diff = difftime(date, lag(date, default = first(date)), units = "days"),
         cu = cumsum(diff >= 14)) %>%
  group_by(idx, cu) %>%
  mutate(group = cur_group_id()) %>%
  ungroup() %>%
  select(-cu)
# A tibble: 11 × 4
   idx   date       group diff
   <chr> <date>     <int> <drtn>
 1 a     2020-11-05     1   0 days
 2 a     2020-11-15     1  10 days
 3 a     2020-11-16     1   1 days
 4 a     2020-11-17     1   1 days
 5 b     2020-11-07     2   0 days
 6 b     2020-11-10     2   3 days
 7 b     2021-01-13     3  64 days
 8 c     2016-05-04     4   0 days
 9 c     2016-09-13     5 132 days
10 c     2016-09-27     6  14 days
11 c     2016-09-30     6   3 days

Given that the first value of diff within each idx must be NA because of the use of lag(), you could use cumsum(diff >= 14 | is.na(diff)) without grouping to create the new group:
library(dplyr)
df %>%
  group_by(idx) %>%
  mutate(diff = date - lag(date)) %>%
  ungroup() %>%
  mutate(group = cumsum(diff >= 14 | is.na(diff)))
# A tibble: 11 × 4
#    idx   date       diff     group
#    <chr> <date>     <drtn>   <int>
#  1 a     2020-11-05  NA days     1
#  2 a     2020-11-15  10 days     1
#  3 a     2020-11-16   1 days     1
#  4 a     2020-11-17   1 days     1
#  5 b     2020-11-07  NA days     2
#  6 b     2020-11-10   3 days     2
#  7 b     2021-01-13  64 days     3
#  8 c     2016-05-04  NA days     4
#  9 c     2016-09-13 132 days     5
# 10 c     2016-09-27  14 days     6
# 11 c     2016-09-30   3 days     6

In R, how can I group by one column and conditionally sum another?

This is an add-on to my previous question:
How can I count a number of conditional rows within r dplyr mutate?
Say I have the data frame below. In my previous question I asked how to calculate, at each row, how many subsequent times that row's customer ordered Product X (literally X, not the Product associated with the row); that count is now given in nSubsqX. Now I want to know the sum of the cost associated with those subsequent orders of X. I have manually entered the answer into nCostSubsqX below, but I don't understand how to do it programmatically.
   Date       Customer Product cost nSubsqX nCostSubsqX
1  2020-05-18 A        X          9       0           0
2  2020-02-10 B        X          2       5          42
3  2020-02-12 B        Y          3       5          42
4  2020-03-04 B        Z          4       5          42
5  2020-03-29 B        X          5       4          37
6  2020-04-08 B        X          6       3          31
7  2020-04-30 B        X          7       2          24
8  2020-05-13 B        X          8       1          16
9  2020-05-23 B        Y         10       1          16
10 2020-07-02 B        Y         11       1          16
11 2020-08-26 B        Y         12       1          16
12 2020-12-06 B        X         16       0           0
13 2020-01-31 C        X          1       3          42
14 2020-09-19 C        X         13       2          29
15 2020-10-13 C        X         14       1          15
16 2020-11-11 C        X         15       0           0
17 2020-12-26 C        Y         17       0           0
For the purpose of providing a Reprex, below is the code to create the data frame.
df = data.frame("Date" = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12",
"2020-03-04", "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13", "2020-05-18",
"2020-05-23", "2020-07-02", "2020-08-26", "2020-09-19", "2020-10-13", "2020-11-11",
"2020-12-06", "2020-12-26")), "Customer" = c("C","B","B","B","B","B","B","B","A",
"B","B","B","C","C","C","B","C"), "Product" = c("X","X","Y","Z","X","X","X","X","X",
"Y","Y","Y","X","X","X","X","Y"))
df$cost = seq(nrow(df))
Below is the code that gets me nSubsqX:
library(dplyr)
df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(nSubsqX = sum(Product == "X") - cumsum(Product == "X"))
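The sum minus cumsum pattern works because sum() gives the group total while cumsum() gives the running total through the current row; the difference is the count of X's strictly after each row. A quick illustration with a toy vector:
p <- c("X", "Y", "X", "X")
sum(p == "X") - cumsum(p == "X")  # 2 2 1 0 -> X's after rows 1..4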
Now I need to do the same running tally over the rows where Product is X, but using the cost column rather than the Product column itself. Any thoughts?
Attempt 1, which gives an error:
df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product == "X") - cumsum(Product == "X"),
    nCostSubsqX = sum(cost[which(Product == "X")]) - cumsum(cost[which(Product == "X")]))
...
Error in `mutate_cols()`:
Problem with `mutate()` column `nCostSubsqX`.
`nCostSubsqX = sum(cost[which(Product == "X")]) - ...`.
`nCostSubsqX` must be size 11 or 1, not 6.
The error occurred in group 2: Customer = "B".
Attempt 2, where the math isn't right; the nCostSubsqX column still needs the cumulative cost up to the current row removed.
df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product == "X") - cumsum(Product == "X"),
    nCostSubsqX = zoo::na.locf0(replace(rep(NA_real_, n()),
      Product == "X", rev(seq_len(sum(cost[which(Product == "X")]))))))
...
   Date       Customer Product cost nSubsqX nCostSubsqX
1  2020-05-18 A        X          9       0           9
2  2020-02-10 B        X          2       5          44
3  2020-02-12 B        Y          3       5          44
4  2020-03-04 B        Z          4       5          44
5  2020-03-29 B        X          5       4          43
6  2020-04-08 B        X          6       3          42
7  2020-04-30 B        X          7       2          41
8  2020-05-13 B        X          8       1          40
9  2020-05-23 B        Y         10       1          40
10 2020-07-02 B        Y         11       1          40
11 2020-08-26 B        Y         12       1          40
12 2020-12-06 B        X         16       0          39
13 2020-01-31 C        X          1       3          43
14 2020-09-19 C        X         13       2          42
15 2020-10-13 C        X         14       1          41
16 2020-11-11 C        X         15       0          40
17 2020-12-26 C        Y         17       0          40
Attempt 3; I don't know what the math here is doing, but it ain't right!
df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product == "X") - cumsum(Product == "X"),
    nCostSubsqX = zoo::na.locf0(replace(rep(NA_real_, n()),
      Product == "X", rev(seq_len(sum(cost[which(Product == "X")]))))) -
      zoo::na.locf0(ifelse(Product == "X", cumsum(cost[which(Product == "X")]), NA)))
Attempt 1 was almost there; the key is that the number of rows must be maintained inside mutate(). In the cumsum() term, replace cost[which(Product == "X")] with cost*(Product == "X"), a dirty trick that zeroes out the non-X costs instead of dropping rows. By the way, the which() is unnecessary.
The snippet would be:
df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product == "X") - cumsum(Product == "X"),
    nCostSubsqX = sum(cost[Product == "X"]) - cumsum(cost*(Product == "X")))
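To see why the trick matters, here is a minimal sketch with toy vectors: subsetting shortens the vector, while multiplying by the logical keeps one value per row, which is what cumsum() needs inside a grouped mutate().
cost    <- c(2, 3, 4, 5)
Product <- c("X", "Y", "Z", "X")
cost[Product == "X"]   # 2 5      -> length 2, no longer row-aligned
cost*(Product == "X")  # 2 0 0 5  -> length 4, non-X costs zeroed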
Here is a slightly different approach, if you are interested.
library(data.table)
f <- function(p, co = rep(1, length(p))) {
  sapply(seq_along(p), \(i) sum(co[-i:0][p[-i:0] == "X"]))
}
setDT(df)[
  order(Date, Customer),
  `:=`(nSubsqX = f(Product), nCostSubsqx = f(Product, cost)),
  by = Customer
]
In this approach, the same function f() is used for both nSubsqX and nCostSubsqx; the only difference is whether cost is additionally passed to f() as the co parameter, or the default co (a vector of ones, which turns the sum into a count) is used.
Output:
          Date Customer Product  cost nSubsqX nCostSubsqx
        <Date>   <char>  <char> <int>   <num>       <int>
 1: 2020-01-31        C       X     1       3          42
 2: 2020-02-10        B       X     2       5          42
 3: 2020-02-12        B       Y     3       5          42
 4: 2020-03-04        B       Z     4       5          42
 5: 2020-03-29        B       X     5       4          37
 6: 2020-04-08        B       X     6       3          31
 7: 2020-04-30        B       X     7       2          24
 8: 2020-05-13        B       X     8       1          16
 9: 2020-05-18        A       X     9       0           0
10: 2020-05-23        B       Y    10       1          16
11: 2020-07-02        B       Y    11       1          16
12: 2020-08-26        B       Y    12       1          16
13: 2020-09-19        C       X    13       2          29
14: 2020-10-13        C       X    14       1          15
15: 2020-11-11        C       X    15       0           0
16: 2020-12-06        B       X    16       0           0
17: 2020-12-26        C       Y    17       0           0
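The only subtle part of f() is the indexing: -i:0 expands to the integer sequence -i, ..., -1, 0, so the negative values drop rows 1 through i (the 0 is ignored), leaving only the subsequent rows. A quick illustration with toy vectors:
p  <- c("X", "Y", "X", "X")
co <- c( 2,   3,   5,   6)
p[-2:0]                        # "X" "X" -> products after row 2
sum(co[-2:0][p[-2:0] == "X"])  # 5 + 6 = 11 -> cost of subsequent X orders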

Determine the number of processes running each day, and the average days since commencement, in R

I have a large dataset of processes (their IDs) with start dates and corresponding end dates.
What I want is divided into two parts: firstly, how many processes are running each day; secondly, the running processes' mean number of days since commencement.
The sample data set looks like this:
> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020",
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020",
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA,
-5L))
> df
  Process      Start        End
1    P001 01-01-2020 10-01-2020
2    P002 02-01-2020 09-01-2020
3    P003 03-01-2020 04-01-2020
4    P004 08-01-2020 17-01-2020
5    P005 13-01-2020 19-01-2020
For the first part I have proceeded like this:
library(tidyverse)
df %>%
  pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
  mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
  mutate(dates = if_else(event == 'End', dates + 1, dates)) %>%
  arrange(dates, event) %>%
  mutate(processes = ifelse(event == 'Start', 1, -1),
         processes = cumsum(processes)) %>%
  select(-Process, -event) %>%
  complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
  fill(processes)
# A tibble: 20 x 2
   dates      processes
   <date>         <dbl>
 1 2020-01-01         1
 2 2020-01-02         2
 3 2020-01-03         3
 4 2020-01-04         3
 5 2020-01-05         2
 6 2020-01-06         2
 7 2020-01-07         2
 8 2020-01-08         3
 9 2020-01-09         3
10 2020-01-10         2
11 2020-01-11         1
12 2020-01-12         1
13 2020-01-13         2
14 2020-01-14         2
15 2020-01-15         2
16 2020-01-16         2
17 2020-01-17         2
18 2020-01-18         1
19 2020-01-19         1
20 2020-01-20         0
For the second part, the desired output is a mean days column: for each date, the average number of days that the processes running on that date have been running.
A tidyverse approach will be preferred, please.
Here is one approach:
library(tidyverse)
df %>%
  #Convert to date
  mutate(across(c(Start, End), lubridate::dmy),
         #Create a sequence of dates from start to end
         Dates = map2(Start, End, seq, by = 'day')) %>%
  #Get data in long format
  unnest(Dates) %>%
  #Remove columns
  select(-Start, -End) %>%
  #For each process
  group_by(Process) %>%
  #Count number of days spent on it
  mutate(days_spent = row_number() - 1) %>%
  #For each date
  group_by(Dates) %>%
  #Count number of processes running and average days
  summarise(process = n(),
            mean_days = mean(days_spent))
This returns:
#   Dates      process mean_days
#   <date>       <int>     <dbl>
# 1 2020-01-01       1      0
# 2 2020-01-02       2      0.5
# 3 2020-01-03       3      1
# 4 2020-01-04       3      2
# 5 2020-01-05       2      3.5
# 6 2020-01-06       2      4.5
# 7 2020-01-07       2      5.5
# 8 2020-01-08       3      4.33
# 9 2020-01-09       3      5.33
#10 2020-01-10       2      5.5
#11 2020-01-11       1      3
#12 2020-01-12       1      4
#13 2020-01-13       2      2.5
#14 2020-01-14       2      3.5
#15 2020-01-15       2      4.5
#16 2020-01-16       2      5.5
#17 2020-01-17       2      6.5
#18 2020-01-18       1      5
#19 2020-01-19       1      6
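Note that, unlike the event-based count in the question, this output only contains dates on which at least one process is running (there is no 2020-01-20 row with 0). If a row for every calendar day is needed, the result can be padded afterwards; a minimal sketch, assuming the summarised tibble above has been stored in res:
res %>%
  complete(Dates = seq(min(Dates), max(Dates) + 1, by = '1 day'),
           fill = list(process = 0L))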

Interpolating Mid-Year Averages

I have yearly observations of income for a series of geographies, like this:
library(dplyr)
library(lubridate)
date <- c("2004-01-01", "2005-01-01", "2006-01-01",
"2004-01-01", "2005-01-01", "2006-01-01")
geo <- c(1, 1, 1, 2, 2, 2)
inc <- c(10, 12, 14, 32, 34, 50)
data <- tibble(date = ymd(date), geo, inc)
  date         geo   inc
  <date>     <dbl> <dbl>
1 2004-01-01     1    10
2 2005-01-01     1    12
3 2006-01-01     1    14
4 2004-01-01     2    32
5 2005-01-01     2    34
6 2006-01-01     2    50
I need to insert mid-year values, as averages of the start-of-year and end-of-year observations, so that the data is every 6 months. The outcome would look like this:
2004-01-01 1 10
2004-06-01 1 11
2005-01-01 1 12
2005-06-01 1 13
2006-01-01 1 14
2004-01-01 2 32
2004-06-01 2 33
2005-01-01 2 34
2005-06-01 2 42
2006-01-01 2 50
Would appreciate any ideas.
Grouped by 'geo', add (+) the 'inc' to the next value (lead) and take the average (/2), and add 5 months to the 'date' (January 1 plus 5 months gives the June 1 midpoint shown in the expected output); then filter out the NA elements in 'inc' and bind the rows with the original data:
library(dplyr)
library(lubridate)
data %>%
  group_by(geo) %>%
  summarise(date = date %m+% months(5),
            inc = (inc + lead(inc))/2, .groups = 'drop') %>%
  filter(!is.na(inc)) %>%
  bind_rows(data, .) %>%
  arrange(geo, date)
Output:
# A tibble: 10 x 3
#    date         geo   inc
#    <date>     <dbl> <dbl>
#  1 2004-01-01     1    10
#  2 2004-06-01     1    11
#  3 2005-01-01     1    12
#  4 2005-06-01     1    13
#  5 2006-01-01     1    14
#  6 2004-01-01     2    32
#  7 2004-06-01     2    33
#  8 2005-01-01     2    34
#  9 2005-06-01     2    42
# 10 2006-01-01     2    50
You can use complete to create a sequence of dates every 6 months and then use na.approx to fill the NA values with interpolated values.
library(dplyr)
library(lubridate)
data %>%
  group_by(geo) %>%
  tidyr::complete(date = seq(min(date), max(date), by = '6 months')) %>%
  mutate(date = if_else(is.na(inc), date %m-% months(1), date),
         inc = zoo::na.approx(inc))
#      geo date         inc
#    <dbl> <date>     <dbl>
#  1     1 2004-01-01    10
#  2     1 2004-06-01    11
#  3     1 2005-01-01    12
#  4     1 2005-06-01    13
#  5     1 2006-01-01    14
#  6     2 2004-01-01    32
#  7     2 2004-06-01    33
#  8     2 2005-01-01    34
#  9     2 2005-06-01    42
# 10     2 2006-01-01    50

How to show missing dates when applying a rolling function

Suppose I have a data frame df of some insurance policies.
library(tidyverse)
library(lubridate)
# Example data
d <- as.Date("2020-01-01", format = "%Y-%m-%d")
set.seed(50)
df <- data.frame(id = 1:10,
                 activation_dt = round(runif(10)*100, 0) + d,
                 expiry_dt = d + round(runif(10)*100, 0) + c(rep(180, 5), rep(240, 5)))
> df
   id activation_dt  expiry_dt
1   1    2020-03-12 2020-08-07
2   2    2020-02-14 2020-07-26
3   3    2020-01-21 2020-09-01
4   4    2020-03-18 2020-07-07
5   5    2020-02-21 2020-07-27
6   6    2020-01-05 2020-11-04
7   7    2020-03-11 2020-11-20
8   8    2020-03-06 2020-10-03
9   9    2020-01-05 2020-09-04
10 10    2020-01-12 2020-09-14
I want to see how many policies were active during each month. I have done that with the following method:
# Getting required result
df %>%
  arrange(activation_dt) %>%
  pivot_longer(cols = c(activation_dt, expiry_dt),
               names_to = "event",
               values_to = "event_date") %>%
  mutate(dummy = ifelse(event == "activation_dt", 1, -1)) %>%
  mutate(dummy2 = floor_date(event_date, "month")) %>%
  arrange(dummy2) %>%
  group_by(dummy2) %>%
  summarise(dummy = sum(dummy)) %>%
  mutate(dummy = cumsum(dummy)) %>%
  select(dummy2, dummy)
# A tibble: 8 x 2
  dummy2     dummy
  <date>     <dbl>
1 2020-01-01     4
2 2020-02-01     6
3 2020-03-01    10
4 2020-07-01     7
5 2020-08-01     6
6 2020-09-01     3
7 2020-10-01     2
8 2020-11-01     0
Now I am having a problem: how do I deal with the missing months, e.g. April 2020 to June 2020?
A data.table solution: generate the months sequence, then use non-equi joins to find the policies active in each month and count them.
library(lubridate)
library(data.table)
setDT(df)
months <- seq(lubridate::floor_date(min(df$activation_dt), 'month'),
              lubridate::floor_date(max(df$expiry_dt), 'month'),
              by = 'month')
months <- data.table(months)
df[, c("activation_dt_month", "expiry_dt_month") := .(lubridate::floor_date(activation_dt, 'month'),
                                                      lubridate::floor_date(expiry_dt, 'month'))]
df[months, .(months), on = .(activation_dt_month <= months, expiry_dt_month >= months)][, .(nb = .N), by = months]
        months nb
 1: 2020-01-01  4
 2: 2020-02-01  6
 3: 2020-03-01 10
 4: 2020-04-01 10
 5: 2020-05-01 10
 6: 2020-06-01 10
 7: 2020-07-01 10
 8: 2020-08-01  7
 9: 2020-09-01  6
10: 2020-10-01  3
11: 2020-11-01  2
Here is an alternative tidyverse/lubridate solution in case you are interested. The data.table version will be faster, but this should give you the correct results even with gaps in the months.
First use map2 to create a sequence of months between activation and expiry for each row of data. This allows you to group by month/year and count the number of active policies for each month.
library(tidyverse)
library(lubridate)
df %>%
  mutate(month = map2(floor_date(activation_dt, "month"),
                      floor_date(expiry_dt, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  transmute(month_year = substr(month, 1, 7)) %>%
  group_by(month_year) %>%
  summarise(count = n())
Output:
   month_year count
   <chr>      <int>
 1 2020-01        4
 2 2020-02        6
 3 2020-03       10
 4 2020-04       10
 5 2020-05       10
 6 2020-06       10
 7 2020-07       10
 8 2020-08        7
 9 2020-09        6
10 2020-10        3
11 2020-11        2

How to calculate the number of unique ids in the most recent n days

Say I want to count, for every day, the unique ids seen in the most recent 15 days. Here is the code:
library(tidyverse)
library(lubridate)
set.seed(1)
eg <- tibble(day = sample(seq(ymd('2018-01-01'), length.out = 100, by = 'day'), 300, replace = T),
             id = sample(letters[1:26], 300, replace = T),
             value = rnorm(300))
eg %>%
  group_by(day) %>%
  summarise(uniqu_id = n_distinct(id),
            recent_15_days_unique_id = 'howto',
            day_total = sum(value))
The result is:
# A tibble: 95 x 4
  day        uniqu_id recent_15_days_unique_id day_total
  <date>        <int> <chr>                        <dbl>
1 2018-01-01        3 how                         -1.38
2 2018-01-02        3 how                          2.01
3 2018-01-03        3 how                          1.57
4 2018-01-04        6 how                         -1.64
5 2018-01-05        2 how                         -0.293
6 2018-01-06        4 how                         -2.08
For the recent_15_days_unique_id column, the first row should count the unique ids between day - 15 and day, i.e. between '2017-12-17' and '2018-01-01'; the second row, between '2017-12-18' and '2018-01-02'. It is kind of like the 'rollsum' function, but for counting distinct values.
We can ungroup and then, for every day, create a sequence of the last 15 days and count all the unique ids seen in that window.
library(dplyr)
eg %>%
  group_by(day) %>%
  summarise(uniqu_id = n_distinct(id),
            day_total = sum(value)) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(recent_15_days_unique_id =
           n_distinct(eg$id[eg$day %in% seq(day - 15, day, by = "1 day")]))
#   day        uniqu_id day_total recent_15_days_unique_id
#   <date>        <int>     <dbl>                    <int>
# 1 2018-01-02        2    0.170                         2
# 2 2018-01-03        2   -0.460                         3
# 3 2018-01-04        1   -1.53                          3
# 4 2018-01-05        2    1.67                          5
# 5 2018-01-06        2    1.52                          6
# 6 2018-01-07        4   -1.62                         10
# 7 2018-01-08        2   -0.0190                       12
# 8 2018-01-09        1   -0.573                        12
# 9 2018-01-10        2   -0.220                        13
#10 2018-01-11        7   -1.73                         14
Using the same logic, we can also calculate it separately using sapply:
new_eg <- eg %>%
  group_by(day) %>%
  summarise(uniqu_id = n_distinct(id),
            day_total = sum(value)) %>%
  ungroup()
sapply(new_eg$day, function(x)
  n_distinct(eg$id[as.numeric(eg$day) %in% seq(x - 15, x, by = "1 day")]))
#[1] 2 3 3 5 6 10 12 12 13 14 15 16 17 17 18 20 21 22 22 20 20 21 21 .....
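The same sliding window can also be written without rowwise(), using purrr::map_int over the days; a minimal sketch, assuming the summarised new_eg from above (the comparison day >= x - 15 & day <= x is equivalent to the %in% seq(...) test for Date columns):
library(purrr)
new_eg %>%
  mutate(recent_15_days_unique_id =
           map_int(day, \(x) n_distinct(eg$id[eg$day >= x - 15 & eg$day <= x])))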
