Suppose I have a data frame df of some insurance policies.
library(tidyverse)
library(lubridate)
#Example data
d <- as.Date("2020-01-01", format = "%Y-%m-%d")
set.seed(50)
df <- data.frame(id = 1:10,
                 activation_dt = round(runif(10) * 100, 0) + d,
                 expiry_dt = d + round(runif(10) * 100, 0) + c(rep(180, 5), rep(240, 5)))
> df
id activation_dt expiry_dt
1 1 2020-03-12 2020-08-07
2 2 2020-02-14 2020-07-26
3 3 2020-01-21 2020-09-01
4 4 2020-03-18 2020-07-07
5 5 2020-02-21 2020-07-27
6 6 2020-01-05 2020-11-04
7 7 2020-03-11 2020-11-20
8 8 2020-03-06 2020-10-03
9 9 2020-01-05 2020-09-04
10 10 2020-01-12 2020-09-14
I want to see how many policies were active during each month, which I have done with the following method.
# Getting required result
df %>% arrange(activation_dt) %>%
pivot_longer(cols = c(activation_dt, expiry_dt),
names_to = "event",
values_to = "event_date") %>%
mutate(dummy = ifelse(event == "activation_dt", 1, -1)) %>%
mutate(dummy2 = floor_date(event_date, "month")) %>%
arrange(dummy2) %>% group_by(dummy2) %>%
summarise(dummy=sum(dummy)) %>%
mutate(dummy = cumsum(dummy)) %>%
select(dummy2, dummy)
# A tibble: 8 x 2
dummy2 dummy
<date> <dbl>
1 2020-01-01 4
2 2020-02-01 6
3 2020-03-01 10
4 2020-07-01 7
5 2020-08-01 6
6 2020-09-01 3
7 2020-10-01 2
8 2020-11-01 0
The problem I am now having is how to deal with the missing months, e.g. April 2020 to June 2020, which do not appear in the result at all.
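One minimal sketch of a fix, assuming the tidyverse/lubridate setup above (this is my own sketch, not taken from the answers that follow): keep the cumulative-count approach, then complete() the month sequence and fill() the running total forward so that empty months are carried over.
df %>%
  pivot_longer(cols = c(activation_dt, expiry_dt),
               names_to = "event",
               values_to = "event_date") %>%
  mutate(change = ifelse(event == "activation_dt", 1, -1),
         month = floor_date(event_date, "month")) %>%
  # net change in active policies per month
  count(month, wt = change, name = "change") %>%
  mutate(active = cumsum(change)) %>%
  # add the months with no activations/expiries, then carry the count forward
  complete(month = seq.Date(min(month), max(month), by = "month")) %>%
  fill(active) %>%
  select(month, active)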
A data.table solution:
generate the sequence of months
use non-equi joins to find the policies active in each month and count them
library(lubridate)
library(data.table)
setDT(df)
months <- seq(lubridate::floor_date(min(df$activation_dt), 'month'),
              lubridate::floor_date(max(df$expiry_dt), 'month'),
              by = 'month')
months <- data.table(months)
df[,c("activation_dt_month","expiry_dt_month"):=.(lubridate::floor_date(activation_dt,'month'),
lubridate::floor_date(expiry_dt,'month'))]
df[months, .(months),on = .(activation_dt_month<=months,expiry_dt_month>=months)][,.(nb=.N),by=months]
months nb
1: 2020-01-01 4
2: 2020-02-01 6
3: 2020-03-01 10
4: 2020-04-01 10
5: 2020-05-01 10
6: 2020-06-01 10
7: 2020-07-01 10
8: 2020-08-01 7
9: 2020-09-01 6
10: 2020-10-01 3
11: 2020-11-01 2
Here is an alternative tidyverse/lubridate solution in case you are interested. The data.table version will be faster, but this should give you the correct results with gaps in months.
First use map2 to create a sequence of months between activation and expiration for each row of data. This will allow you to group by month/year and count the number of active policies in each month.
library(tidyverse)
library(lubridate)
df %>%
mutate(month = map2(floor_date(activation_dt, "month"),
floor_date(expiry_dt, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2020-01 4
2 2020-02 6
3 2020-03 10
4 2020-04 10
5 2020-05 10
6 2020-06 10
7 2020-07 10
8 2020-08 7
9 2020-09 6
10 2020-10 3
11 2020-11 2
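A small variant to consider, assuming the same data and libraries as above (my tweak, not part of the answer): keep month as a Date column instead of a "YYYY-MM" string, which is easier to plot and join on.
df %>%
  mutate(month = map2(floor_date(activation_dt, "month"),
                      floor_date(expiry_dt, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  # count rows per month, keeping month as a Date
  count(month, name = "count")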
Related
Here is my dataset. I want to assign group numbers based on idx and diff. Specifically, rows with the same idx should keep the same group number as long as diff stays under 14 days; once diff reaches 14 days or more, even within the same idx, a new group should start.
idx = c("a","a","a","a","b","b","b","c","c","c","c")
date = c(20201115, 20201116, 20201117, 20201105, 20201107, 20201110, 20210113, 20160930, 20160504, 20160913, 20160927)
group = c("1","1","1","1","2","2","3","4","5","6","6")
df = data.frame(idx,date,group)
df <- df %>% arrange(idx,date)
df$date <- as.Date(as.character(df$date), format='%Y%m%d')
df <- df %>% group_by(idx) %>%
mutate(diff = date - lag(date))
The group column defined above shows the result I want.
Use cumsum to create an additional grouping criterion, and then cur_group_id().
library(dplyr)
df %>%
group_by(idx) %>%
mutate(diff = difftime(date, lag(date, default = first(date)), units = "days"),
cu = cumsum(diff >= 14)) %>%
group_by(idx, cu) %>%
mutate(group = cur_group_id()) %>%
ungroup() %>%
select(-cu)
# A tibble: 11 × 4
idx date group diff
<chr> <date> <int> <drtn>
1 a 2020-11-05 1 0 days
2 a 2020-11-15 1 10 days
3 a 2020-11-16 1 1 days
4 a 2020-11-17 1 1 days
5 b 2020-11-07 2 0 days
6 b 2020-11-10 2 3 days
7 b 2021-01-13 3 64 days
8 c 2016-05-04 4 0 days
9 c 2016-09-13 5 132 days
10 c 2016-09-27 6 14 days
11 c 2016-09-30 6 3 days
Given that the first value of diff must be NA because of the use of lag(), you could use cumsum(diff >= 14 | is.na(diff)) without grouping to create the new group:
library(dplyr)
df %>%
group_by(idx) %>%
mutate(diff = date - lag(date)) %>%
ungroup() %>%
mutate(group = cumsum(diff >= 14 | is.na(diff)))
# # A tibble: 11 × 4
# idx date diff group
# <chr> <date> <drtn> <int>
# 1 a 2020-11-05 NA days 1
# 2 a 2020-11-15 10 days 1
# 3 a 2020-11-16 1 days 1
# 4 a 2020-11-17 1 days 1
# 5 b 2020-11-07 NA days 2
# 6 b 2020-11-10 3 days 2
# 7 b 2021-01-13 64 days 3
# 8 c 2016-05-04 NA days 4
# 9 c 2016-09-13 132 days 5
# 10 c 2016-09-27 14 days 6
# 11 c 2016-09-30 3 days 6
My data are as follows:
df <- read_table("begin.date end.date
2019-07-22 2019-07-29
2019-07-29 2019-08-03
2019-08-25 2019-08-30
2019-08-30 2019-09-24
2019-09-30 2019-10-05")
I would like to assign two new columns:
isoweek_id = every isoweek in the year (so there will be one row for every week in the year)
data_days = the number of days data collection occurred within that isoweek given the begin.date and end.date, which represent date ranges when data collection occurred.
We might, therefore, have weeks when the number of days data collection occurred is 0 if, for example, a temporal gap in data collection spanned more than one isoweek. (Note: my real data have several years' worth of data collection.)
My desired output would look something like this:
begin.date end.date isoweek_id data_days
NA NA 29 0
2019-07-22 2019-07-29 30 7
2019-07-29 2019-08-03 31 6
NA NA 32 0
NA NA 33 0
2019-08-25 2019-08-30 34 1
2019-08-25 2019-08-30 35 5
2019-08-30 2019-09-24 36 7
2019-08-30 2019-09-24 37 7
2019-08-30 2019-09-24 38 7
2019-08-30 2019-09-24 39 2
2019-09-30 2019-10-05 40 6
NA NA 41 0
NA NA 42 0
NA NA 43 0
You can look at which isoweeks span which dates as follows:
library(ISOweek)
w <- paste("2019-W35", 1:7, sep = "-")
data.frame(weekdate = w, date = ISOweek2date(w))
Thank you in advance!
I hope this does the job:
library(dplyr)
library(lubridate)
library(tidyr)
df %>%
dplyr::arrange(begin.date) %>%
# unnest day sequence from start to end into df https://stackoverflow.com/questions/50997084/create-dataframe-of-rows-of-sequence-of-years-from-rows-with-start-end-dates
dplyr::group_by(rn = dplyr::row_number()) %>%
dplyr::mutate(dates = list(seq.Date(from = begin.date, to = end.date, by = "days"))) %>%
tidyr::unnest(dates) %>%
dplyr::ungroup() %>%
# right join list of all dates with iso week and year
dplyr::right_join(dplyr::tibble(dates = seq.Date(from = min(df$begin.date), max(df$end.date), by = "days")) %>%
dplyr::mutate(year = lubridate::year(dates),
iso_week = lubridate::isoweek(dates)),
by = "dates") %>%
# fill in rn where it is NA with a number larger than all existing rns
dplyr::mutate(rn = ifelse(is.na(rn), nrow(df) + 1, rn)) %>%
# summarize data
dplyr::group_by(year, iso_week, rn) %>%
dplyr::summarize(bdate = min(begin.date, na.rm = TRUE),
edate = min(end.date, na.rm = TRUE),
days = sum(ifelse(is.na(begin.date), 0, 1))) %>%
dplyr::ungroup() %>%
# get lowest sequential numbering per week since we can have duplicates like the example shows
dplyr::group_by(year, iso_week) %>%
dplyr::slice_min(order_by = rn, n = 1) %>%
dplyr::ungroup() # you might want to remove and/or rename columns
# A tibble: 11 x 6
year iso_week rn bdate edate days
<dbl> <dbl> <int> <date> <date> <dbl>
1 2019 30 1 2019-07-22 2019-07-29 7
2 2019 31 1 2019-07-22 2019-07-29 1
3 2019 32 6 NA NA 0
4 2019 33 6 NA NA 0
5 2019 34 3 2019-08-25 2019-08-30 1
6 2019 35 3 2019-08-25 2019-08-30 5
7 2019 36 4 2019-08-30 2019-09-24 7
8 2019 37 4 2019-08-30 2019-09-24 7
9 2019 38 4 2019-08-30 2019-09-24 7
10 2019 39 4 2019-08-30 2019-09-24 2
11 2019 40 5 2019-09-30 2019-10-05 6
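As the closing comment hints, a possible follow-up (my sketch; result is just a placeholder name for the tibble produced by the pipeline above) is to rename and reorder the columns to match the desired output:
result %>%
  dplyr::rename(begin.date = bdate,
                end.date = edate,
                isoweek_id = iso_week,
                data_days = days) %>%
  dplyr::select(begin.date, end.date, isoweek_id, data_days)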
I want to complete a df in R when it is missing a month. For example, I have one year of information by month and day like this one.
df = data.frame(Date = c("2020-01-01","2020-02-01",
"2020-03-02","2020-04-01","2020-09-01","2020-10-01",
"2020-11-01","2020-12-01"))
When I use the complete function, I use it like this:
df = df %>%
  mutate(Date = as.Date(Date)) %>%
  complete(Date = seq.Date(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "month"))
The problem is that my final df completes all the missing dates like May, June, and July, which is fine, but it also completes March, because March doesn't have the first day and begins on 2020-03-02.
df = data.frame(Date = c("2020-01-01","2020-02-01",
"2020-03-01","2020-03-02","2020-04-01","2020-05-01",
"2020-06-01","2020-07-01","2020-08-01","2020-09-01",
"2020-10-01","2020-11-01","2020-12-01"))
Do you know how to complete df only for months that don't have any date at all?
In my case I don't want to complete March, because March already has a date.
Thanks a lot.
You can extract the year and month values from the Date and use complete on those.
library(dplyr)
library(lubridate)
library(tidyr)
df %>%
mutate(Date = as.Date(Date),
year = year(Date),
month = month(Date)) %>%
complete(year, month = 1:12) %>%
mutate(Date = if_else(is.na(Date),
as.Date(paste(year, month, 1, sep = '-')), Date)) %>%
select(Date)
# Date
# <date>
# 1 2020-01-01
# 2 2020-02-01
# 3 2020-03-02
# 4 2020-04-01
# 5 2020-05-01
# 6 2020-06-01
# 7 2020-07-01
# 8 2020-08-01
# 9 2020-09-01
#10 2020-10-01
#11 2020-11-01
#12 2020-12-01
A possible solution would be completing only by yearmon from the zoo package, so that the actual day of the month is irrelevant.
library(dplyr)
library(zoo) # for as.yearmon
library(tidyr) # for complete
df <- data.frame(Date = c("2020-01-01","2020-02-01",
"2020-03-02","2020-04-01",
"2020-09-01","2020-10-01",
"2020-11-01","2020-12-01"),
id = 1:8)
df
#> Date id
#> 1 2020-01-01 1
#> 2 2020-02-01 2
#> 3 2020-03-02 3
#> 4 2020-04-01 4
#> 5 2020-09-01 5
#> 6 2020-10-01 6
#> 7 2020-11-01 7
#> 8 2020-12-01 8
df %>%
mutate(Date = as.Date(Date),
year_mon = as.yearmon(Date)) %>%
complete(
year_mon = seq.Date(as.Date("2020-01-01"),
as.Date("2020-12-31"),
by = "month") %>% as.yearmon()
)
#> # A tibble: 12 x 3
#> year_mon Date id
#> <yearmon> <date> <int>
#> 1 Jan 2020 2020-01-01 1
#> 2 Feb 2020 2020-02-01 2
#> 3 Mar 2020 2020-03-02 3
#> 4 Apr 2020 2020-04-01 4
#> 5 May 2020 NA NA
#> 6 Jun 2020 NA NA
#> 7 Jul 2020 NA NA
#> 8 Aug 2020 NA NA
#> 9 Sep 2020 2020-09-01 5
#> 10 Oct 2020 2020-10-01 6
#> 11 Nov 2020 2020-11-01 7
#> 12 Dec 2020 2020-12-01 8
Created on 2021-06-25 by the reprex package (v2.0.0)
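If you also want the Date column populated for the newly added months, a possible follow-up (my addition, not part of the answer above) is to coalesce it from year_mon, since as.Date() on a yearmon gives the first day of that month:
df %>%
  mutate(Date = as.Date(Date),
         year_mon = as.yearmon(Date)) %>%
  complete(
    year_mon = seq.Date(as.Date("2020-01-01"),
                        as.Date("2020-12-31"),
                        by = "month") %>% as.yearmon()
  ) %>%
  # fill the missing Dates with the first day of the completed month
  mutate(Date = coalesce(Date, as.Date(year_mon)))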
I have a large dataset of processes (their IDs), start dates and corresponding end dates.
What I want is divided into two parts: first, how many processes are running each day; second, the running processes' mean number of days since commencement.
A sample dataset looks like this:
> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020",
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020",
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA,
-5L))
> df
Process Start End
1 P001 01-01-2020 10-01-2020
2 P002 02-01-2020 09-01-2020
3 P003 03-01-2020 04-01-2020
4 P004 08-01-2020 17-01-2020
5 P005 13-01-2020 19-01-2020
For the first part I have proceeded like this:
library(tidyverse)
df %>% pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
mutate(dates = if_else(event == 'End', dates+1, dates)) %>%
arrange(dates, event) %>%
mutate(processes = ifelse(event == 'Start', 1, -1),
processes = cumsum(processes)) %>%
select(-Process, -event) %>%
complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
fill(processes)
# A tibble: 20 x 2
dates processes
<date> <dbl>
1 2020-01-01 1
2 2020-01-02 2
3 2020-01-03 3
4 2020-01-04 3
5 2020-01-05 2
6 2020-01-06 2
7 2020-01-07 2
8 2020-01-08 3
9 2020-01-09 3
10 2020-01-10 2
11 2020-01-11 1
12 2020-01-12 1
13 2020-01-13 2
14 2020-01-14 2
15 2020-01-15 2
16 2020-01-16 2
17 2020-01-17 2
18 2020-01-18 1
19 2020-01-19 1
20 2020-01-20 0
For the second part, the desired output is like the mean days column in the screenshot I attached (not reproduced here), with an explanation.
A tidyverse approach would be preferred, please.
Here is one approach:
library(tidyverse)
df %>%
#Convert to date
mutate(across(c(Start, End), lubridate::dmy),
#Create a sequence of dates from start to end
Dates = map2(Start, End, seq, by = 'day')) %>%
#Get data in long format
unnest(Dates) %>%
#Remove columns
select(-Start, -End) %>%
#For each process
group_by(Process) %>%
#Count number of days spent on it
mutate(days_spent = row_number() - 1) %>%
#For each date
group_by(Dates) %>%
#Count number of process running and average days
summarise(process = n(),
mean_days = mean(days_spent))
This returns the following; for example, on 2020-01-08 three processes (P001, P002 and P004) are running and have been running for 7, 6 and 0 days respectively, so mean_days is 4.33:
# Dates process mean_days
# <date> <int> <dbl>
# 1 2020-01-01 1 0
# 2 2020-01-02 2 0.5
# 3 2020-01-03 3 1
# 4 2020-01-04 3 2
# 5 2020-01-05 2 3.5
# 6 2020-01-06 2 4.5
# 7 2020-01-07 2 5.5
# 8 2020-01-08 3 4.33
# 9 2020-01-09 3 5.33
#10 2020-01-10 2 5.5
#11 2020-01-11 1 3
#12 2020-01-12 1 4
#13 2020-01-13 2 2.5
#14 2020-01-14 2 3.5
#15 2020-01-15 2 4.5
#16 2020-01-16 2 5.5
#17 2020-01-17 2 6.5
#18 2020-01-18 1 5
#19 2020-01-19 1 6
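An equivalent way to compute days_spent, as a variant of the answer above (my sketch, not the answerer's code), is to subtract the start date directly, which avoids relying on row order within each process:
df %>%
  mutate(across(c(Start, End), lubridate::dmy),
         Dates = map2(Start, End, seq, by = 'day')) %>%
  unnest(Dates) %>%
  # days elapsed since the process started, without needing row_number()
  mutate(days_spent = as.integer(Dates - Start)) %>%
  group_by(Dates) %>%
  summarise(process = n(),
            mean_days = mean(days_spent))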
I have this df, whose observations are represented monthly:
library(dplyr)
library(lubridate)
Date <- seq(from = as_date("2019-11-01"), to = as_date("2020-10-01"), by = "month")
A <- (10:21)
df <- data.frame(Date, A)
view(df)
Date A
<date> <int>
1 2019-11-01 10
2 2019-12-01 11
3 2020-01-01 12
4 2020-02-01 13
5 2020-03-01 14
6 2020-04-01 15
7 2020-05-01 16
8 2020-06-01 17
9 2020-07-01 18
10 2020-08-01 19
11 2020-09-01 20
12 2020-10-01 21
Using lag() I know how to calculate the % change month over month (MoM), but I haven't been able to compare a quarter with the previous quarter, i.e. the sum of 3 months compared with the previous 3 months' sum. I tried a loop approach, but it didn't work, and there should be a more efficient approach.
I would appreciate it if someone can help.
We can use as.yearqtr from zoo to convert the 'Date' column to quarters, do a grouped sum, and then get the difference between the current and the next quarter (lead) or the current and the previous quarter (lag).
library(dplyr)
library(zoo)
df %>%
group_by(Quarter = as.yearqtr(Date)) %>%
summarise(A = sum(A), .groups = 'drop') %>%
mutate(Diff = lead(A) - A)
Output
# A tibble: 5 x 3
# Quarter A Diff
# <yearqtr> <int> <int>
#1 2019 Q4 21 18
#2 2020 Q1 39 9
#3 2020 Q2 48 9
#4 2020 Q3 57 -36
#5 2020 Q4 21 NA
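If you want the quarter-over-quarter change as a percentage, mirroring the MoM calculation with lag(), a possible extension of the same pipeline (my sketch, not part of the answer above):
df %>%
  group_by(Quarter = as.yearqtr(Date)) %>%
  summarise(A = sum(A), .groups = 'drop') %>%
  # % change of each quarter relative to the previous one (NA for the first)
  mutate(pct_change = (A - lag(A)) / lag(A) * 100)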