Use dplyr to aggregate counts by month from start-stop ranged variables? - r

Let's say I have school enrollment data stored in this format, with start date and end date fields:
unique_name  enrollment_start  enrollment_end
Amy          1, Jan, 2017      30, Sep, 2018
Franklin     1, Jan, 2017      19, Feb, 2017
Franklin     5, Jun, 2017      4, Feb, 2018
Franklin     21, Oct, 2018     9, Mar, 2019
Samir        1, Jun, 2017      4, Feb, 2017
Samir        5, Apr, 2017      12, Sep, 2018
...          ...               ...
And I want to produce aggregated counts of enrollment by month like this:
month      enrollment_count
Jan, 2017  25
Feb, 2017  31
Mar, 2017  19
Apr, 2017  34
May, 2017  29
Jun, 2017  32
...        ...
Is there an easy way to accomplish this with dplyr?
The only way I can think to do this is by looping over a list of all months from range month_min to month_max to count the number of rows with start or stop dates that fall inside each month. Hoping for easier code.

I think this can be done pretty elegantly with the clock and ivs packages. You seem to want monthly counts, so you can use the year-month type from clock. And ivs is a package dedicated to working with intervals of data, which is exactly what you have here. Here we assume that if your enrollment start/end fell in a month, then you should be considered active in that month.
library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)
enrollments <- tribble(
~unique_name, ~enrollment_start, ~enrollment_end,
"Amy", "1, Jan, 2017", "30, Sep, 2018",
"Franklin", "1, Jan, 2017", "19, Feb, 2017",
"Franklin", "5, Jun, 2017", "4, Feb, 2018",
"Franklin", "21, Oct, 2018", "9, Mar, 2019",
"Samir", "1, Jan, 2017", "4, Feb, 2017",
"Samir", "5, Apr, 2017", "12, Sep, 2018"
)
# Parse these into "day" precision year-month-day objects, then restrict
# them to just "month" precision because that is all we need
enrollments <- enrollments %>%
  mutate(
    start = enrollment_start %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month"),
    end = enrollment_end %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month") %>%
      add_months(1),
    .keep = "unused"
  )
enrollments
#> # A tibble: 6 × 3
#> unique_name start end
#> <chr> <ymd<month>> <ymd<month>>
#> 1 Amy 2017-01 2018-10
#> 2 Franklin 2017-01 2017-03
#> 3 Franklin 2017-06 2018-03
#> 4 Franklin 2018-10 2019-04
#> 5 Samir 2017-01 2017-03
#> 6 Samir 2017-04 2018-10
# Create an interval vector, note that these are half-open intervals.
# The month on the RHS is not included, which is why we added 1 to `end` above.
enrollments <- enrollments %>%
mutate(active = iv(start, end), .keep = "unused")
enrollments
#> # A tibble: 6 × 2
#> unique_name active
#> <chr> <iv<ymd<month>>>
#> 1 Amy [2017-01, 2018-10)
#> 2 Franklin [2017-01, 2017-03)
#> 3 Franklin [2017-06, 2018-03)
#> 4 Franklin [2018-10, 2019-04)
#> 5 Samir [2017-01, 2017-03)
#> 6 Samir [2017-04, 2018-10)
# We'll generate a sequence of months that will be part of the final result
bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L
months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#> month
#> <ymd<month>>
#> 1 2017-01
#> 2 2017-02
#> 3 2017-03
#> 4 2017-04
#> 5 2017-05
#> 6 2017-06
#> 7 2017-07
#> 8 2017-08
#> 9 2017-09
#> 10 2017-10
#> # … with 17 more rows
# To actually compute the counts, use `iv_count_between()`, which counts up all
# instances where `month[i]` is between any interval in `enrollments$active`
months %>%
mutate(count = iv_count_between(month, enrollments$active))
#> # A tibble: 27 × 2
#> month count
#> <ymd<month>> <int>
#> 1 2017-01 3
#> 2 2017-02 3
#> 3 2017-03 1
#> 4 2017-04 2
#> 5 2017-05 2
#> 6 2017-06 3
#> 7 2017-07 3
#> 8 2017-08 3
#> 9 2017-09 3
#> 10 2017-10 3
#> # … with 17 more rows
Created on 2022-04-05 by the reprex package (v2.0.1)
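The counting rule used in this answer (an enrollment counts as active in every month it touches) can be cross-checked without clock or ivs. The following is only a base R sketch under that same assumption; the helper `first_of_month` and the ISO-formatted dates are illustrative choices, not part of the answer above, and Samir's first start date uses the corrected "1, Jan, 2017" from the tribble.

```r
# Same six enrollments as in the tribble above, written as ISO dates
enrollments <- data.frame(
  unique_name = c("Amy", "Franklin", "Franklin", "Franklin", "Samir", "Samir"),
  start = as.Date(c("2017-01-01", "2017-01-01", "2017-06-05",
                    "2018-10-21", "2017-01-01", "2017-04-05")),
  end = as.Date(c("2018-09-30", "2017-02-19", "2018-02-04",
                  "2019-03-09", "2017-02-04", "2018-09-12"))
)

# Hypothetical helper: snap a date to the first day of its month
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

start_m <- first_of_month(enrollments$start)
end_m <- first_of_month(enrollments$end)

# Every month from the earliest start to the latest end
months <- seq(min(start_m), max(end_m), by = "month")

# An enrollment is active in month m when start_m <= m <= end_m
counts <- data.frame(
  month = months,
  enrollment_count = vapply(
    seq_along(months),
    function(i) sum(start_m <= months[i] & end_m >= months[i]),
    integer(1)
  )
)
head(counts)
```

This still visits each month once, but the comparison inside is vectorised over all enrollments, so it stays short for data of this size.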

Create a list column containing the sequence of months between each set of dates, then unnest and count.
Notes:
I use lubridate::floor_date() to round enrollment_start to the first day of the month. Otherwise, seq() may skip months if enrollment_start is on the 29th of the month or later.
The fifth row of your example data has enrollment_start later than enrollment_end -- I assumed this was an error and removed it.
library(tidyverse)
library(lubridate)
enrollments %>%
  mutate(
    across(c(enrollment_start, enrollment_end), dmy), # convert to date
    month = map2(
      floor_date(enrollment_start, unit = "month"), # round to 1st day
      enrollment_end,
      ~ seq(.x, .y, by = "month")
    )
  ) %>%
  unnest_longer(month) %>%
  count(month, name = "enrollment_count")
#> # A tibble: 27 x 2
#> month enrollment_count
#> <date> <int>
#> 1 2017-01-01 2
#> 2 2017-02-01 2
#> 3 2017-03-01 1
#> 4 2017-04-01 2
#> 5 2017-05-01 2
#> 6 2017-06-01 3
#> 7 2017-07-01 3
#> 8 2017-08-01 3
#> 9 2017-09-01 3
#> 10 2017-10-01 3
#> # ... with 17 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)

Here's my take on this with dplyr and tidyr.
Pivot the data to create multiple rows per student, and format your dates.
Group on student and generate the missing months using complete.
Group on the generated periods and count.
library(dplyr)
library(tidyr)
# Keep the completed long-format result so the final count can use it
data_long <- data %>%
  pivot_longer(cols = c('enrollment_start', 'enrollment_end')) %>%
  mutate(value = as.Date(value, format = "%d, %B, %Y")) %>%
  mutate(value = lubridate::floor_date(value, 'month')) %>%
  # unique_name name value
  # <chr> <chr> <date>
  # 1 Amy enrollment_start 2017-01-01
  # 2 Amy enrollment_end 2018-09-01
  # 3 Franklin enrollment_start 2017-01-01
  # 4 Franklin enrollment_end 2017-02-01
  # ..etc.
  group_by(unique_name) %>%
  complete(value = seq.Date(min(value), max(value), by = "month")) %>%
  arrange(unique_name, value)
# Count the rows that fall in each month
enrollment_count <- data_long %>%
  group_by(value) %>%
  count()
Edit: I forgot to floor the dates in order to properly aggregate per period at the end. Added floor_date from lubridate to do this.

Related

Show more digits, tsibble in R

I have read about a number of ways to show more digits in R output (in the RStudio display), but the methods I have reviewed do not work for a tsibble with a mixture of character and numeric columns. I am able to use the num() function to set the number of digits for an individual numeric column, but I would like to show the full tsibble object with character columns intact.
In the example below, I would like the output to be the same as that shown except the x1 and x2 columns should have three digits after the decimal for all rows.
Thank you in advance for your help.
library(tidyverse)
library(tsibble)
data <- tibble(date = seq(as.Date("2022/1/1"), by = "month", length.out = 6),
region = c("A","C","A","B","C","C"),
x1 = c(7.3456, 123.4532, 106.1059, 17.1594, 175.3951, 62.9431),
x2 = c(12.12, 15.29, 27.92, 9.23, 16.29, 13.11))
data <- data %>%
mutate(month = yearmonth(date)) %>%
as_tsibble(index = month)
data
# A tsibble: 6 x 5 [1M]
date region x1 x2 month
<date> <chr> <dbl> <dbl> <mth>
1 2022-01-01 A 7.35 12.1 2022 Jan
2 2022-02-01 C 123. 15.3 2022 Feb
3 2022-03-01 A 106. 27.9 2022 Mar
4 2022-04-01 B 17.2 9.23 2022 Apr
5 2022-05-01 C 175. 16.3 2022 May
6 2022-06-01 C 62.9 13.1 2022 Jun
As Jon Spring mentioned in the comments, mutate(across(where(is.numeric), ~num(., digits = 3))) does work, the same as it does for tibbles (see the section "Fixed number of digits").
Do note the type printed under x1 and x2: it will show <num:.3!> instead of <dbl>. This only affects how tibbles are printed. The data in x1 and x2 is still a double.
In your code:
data %>%
mutate(month = yearmonth(date),
across(where(is.numeric), ~num(.,digits = 3))) %>%
as_tsibble(index = month)
# A tsibble: 6 x 5 [1M]
date region x1 x2 month
<date> <chr> <num:.3!> <num:.3!> <mth>
1 2022-01-01 A 7.346 12.120 2022 Jan
2 2022-02-01 C 123.453 15.290 2022 Feb
3 2022-03-01 A 106.106 27.920 2022 Mar
4 2022-04-01 B 17.159 9.230 2022 Apr
5 2022-05-01 C 175.395 16.290 2022 May
6 2022-06-01 C 62.943 13.110 2022 Jun
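The claim that only the printing changes can be checked directly. This is a small sketch, assuming the tibble package is installed; the vector `x` reuses the first three x1 values from the question, and `x_fmt` is just an illustrative name.

```r
library(tibble)

x <- c(7.3456, 123.4532, 106.1059)

# num() attaches display options; it does not round the stored values
x_fmt <- num(x, digits = 3)

typeof(x_fmt)     # storage type of the formatted vector
is.double(x_fmt)  # TRUE when the underlying data is still double
```

Because the underlying values are untouched, later arithmetic works on the full-precision numbers rather than on the three-digit display.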

Time spent in each calendar year

I followed some individuals A and B from start to end
df<-data.frame(id=c("A", "B"), start=as.Date(c("2015-01-01", "2013-01-01")), end=as.Date(c("2021-06-12", "2017-10-10")))
df
id start end
1 A 2015-01-01 2021-06-12
2 B 2013-01-01 2017-10-10
I would like to calculate the follow-up time for each calendar year. For example, I have 1 year for 2013 (from B), 1 year for 2014 (from B), 2 years for 2015 (from A and B), and so on.
I tried to treat year as an integer and count how many years each individual contributes but due to rounding errors the result is not plausible.
I tried
years<-NULL
for (i in 1:length(df$id)){
years<-c(years, as.character(as.Date(seq.Date(from = df$start[i], to = df$end[i], by = "day"))))
}
library(lubridate)
table(year(years))/365
2013 2014 2015 2016 2017 2018 2019 2020 2021
1.0000000 1.0000000 2.0000000 2.0054795 1.7753425 1.0000000 1.0000000 1.0027397 0.4465753
which is the answer I am trying to get but is computationally inefficient and very slow in large data. I am wondering is there any way to do this without the loop? Or do it more efficiently?
I'm now guessing that you actually don't want to round or truncate anything, so here's a solution that works and gives output similar to your method (correcting the 2016 value):
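func <- function(st, ed) {
  stopifnot(length(st) == 1, length(ed) == 1)
  stL <- as.POSIXlt(st)
  edL <- as.POSIXlt(ed)
  start_year <- 1900 + stL$year
  end_year <- 1900 + edL$year
  start_eoy <- as.POSIXlt(paste0(start_year, "-12-31"))
  end_eoy <- as.POSIXlt(paste0(end_year, "-12-31"))
  # An interval inside a single calendar year contributes one partial year
  if (start_year == end_year) {
    return(data.frame(year = start_year, n = (edL$yday - stL$yday) / start_eoy$yday))
  }
  # Fraction of the first and last calendar years that is covered
  firstyear <- (start_eoy$yday - stL$yday) / start_eoy$yday
  lastyear <- edL$yday / end_eoy$yday
  data.frame(
    year = seq(start_year, end_year),
    n = c(firstyear, rep(1, max(0, end_year - start_year - 1)), lastyear)
  )
}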
base R
aggregate(n ~ year, data = do.call(rbind, Map(func, df$start, df$end)), FUN = sum)
# year n
# 1 2013 1.0000000
# 2 2014 1.0000000
# 3 2015 2.0000000
# 4 2016 2.0000000
# 5 2017 1.7747253
# 6 2018 1.0000000
# 7 2019 1.0000000
# 8 2020 1.0000000
# 9 2021 0.4450549
dplyr
library(dplyr)
df %>%
with(Map(func, start, end)) %>%
bind_rows() %>%
group_by(year) %>%
summarize(n = sum(n))
# # A tibble: 9 x 2
# year n
# <int> <dbl>
# 1 2013 1
# 2 2014 1
# 3 2015 2
# 4 2016 2
# 5 2017 1.77
# 6 2018 1
# 7 2019 1
# 8 2020 1
# 9 2021 0.445
Sounds like a job for a great package called lubridate. See example:
By the way, I assumed dates are year-month-day, therefore ymd. If not, lubridate has matching parsers such as ydm (year-day-month) or mdy (month-day-year, the usual American date format).
df<-data.frame(id=c("A", "B"), start=as.Date(c("2015-01-01", "2013-01-01")), end=as.Date(c("2021-06-12", "2017-10-10")))
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(tidyverse)
df %>%
mutate(across(start:end, ymd),
follow_up_years = interval(start, end)/years(1),
follow_up_months = interval(start, end)/months(1),
follow_up_days = interval(start, end)/days(1),
)
#> id start end follow_up_years follow_up_months follow_up_days
#> 1 A 2015-01-01 2021-06-12 6.443836 77.36667 2354
#> 2 B 2013-01-01 2017-10-10 4.772603 57.29032 1743
Created on 2021-10-28 by the reprex package (v2.0.1)
Edit
I think I understand. I guess we can also just use lubridate intervals:
df %>%
mutate(follow_up_2015 = interval(start, as_date("2015-01-01"))/years(1)) %>%
pull(follow_up_2015) %>%
sum()
#> [1] 2
Created on 2021-10-28 by the reprex package (v2.0.1)
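For per-year totals without expanding every day, each interval can be clipped to the calendar year and the overlapping days summed. This is only a base R sketch: `followup_in_year` is a hypothetical helper, and it assumes the convention of the day-by-day loop in the question (both the start and end day count, and each year is divided by its own length rather than by 365).

```r
df <- data.frame(
  id = c("A", "B"),
  start = as.Date(c("2015-01-01", "2013-01-01")),
  end = as.Date(c("2021-06-12", "2017-10-10"))
)

# Hypothetical helper: total follow-up (in years) that falls inside year `yr`
followup_in_year <- function(start, end, yr) {
  yr_start <- as.Date(paste0(yr, "-01-01"))
  yr_end <- as.Date(paste0(yr, "-12-31"))
  # Clip each interval to the year; intervals that miss the year give 0 days
  days <- pmax(0, as.numeric(pmin(end, yr_end) - pmax(start, yr_start)) + 1)
  sum(days) / as.numeric(yr_end - yr_start + 1)
}

yrs <- 2013:2021
result <- data.frame(
  year = yrs,
  n = sapply(yrs, function(y) followup_in_year(df$start, df$end, y))
)
result
```

The work grows with the number of calendar years times the number of rows, not with the number of days, which is the part that made the loop slow.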

Parsing a string in R and comparing the values with another column

This may seem trivial but I am really stuck at this problem of comparing a value with this complex string
My data frame looks like this:
Id    History                                                              Report Month
1001  Jun:2020,030/XXX-May:2020,035/XXX-Apr:2020,040/XXX-Mar:2020,060/XXX  July 2021
1003  Jun:2017,823/XXX-May:2017,000/XXX-Apr:2017,000/XXX-Mar:2017,000/XXX  July 2021
1005  Apr:2019,000/XXX-Mar:2019,800/XXX-Feb:2019,000/XXX-Jan:2019,000/XXX  July 2021
1006  Jun:2020,000/XXX-May:2020,030/XXX-Apr:2020,060/XXX-Mar:2020,090/XXX  July 2021
The key, value pairs from the History column that will be used in the comparison are as follows:
Id : 1001 - Jun 2020,030 May 2020, 035 Apr 2020, 040......
Id : 1003 - Jun 2017,823 May 2017, 000 Apr 2017, 000......
Problem statement is: I want to compare these key, value pair with the report month (i.e. always current month) and make a conditional column based on it. Logic is: 24 months (could be 12 or 36) preceding July 2021 i.e July 2021-Jun 2019, how many key,value pairs have value >= 30 or >= 60 etc for months that lie within this time period. So if a string starts from <Jun 2019, like for 1003, the answer should be 0.
Output
Id    Report Month  +30_last_24  +30_last_36
1001  July 2021     4            4
1003  July 2021     0            0
1005  July 2021     0            1
1006  July 2021     3            3
I started with R very recently and have no solution to even begin with, so any help would be deeply appreciated.
MODIFIED ORIGINAL DATASET
df <- read.table(header = T, text = "Id History ReportMonth
1001 Jun:2020,030/XXX|May:2020,035/XXX|Apr:2020,040/XXX|Mar:2020,060/XXX 'July 2021'
1003 Jun:2017,DDD/XXX|May:2017,030/XXX|Apr:2017,DDD/STD|Mar:2017,000/XXX 'July 2021'
1005 Apr:2019,000/XXX|Mar:2019,800/DDD|Feb:2019,000/XXX|Jan:2019,000/XXX 'July 2021'
1006 Jun:2020,000/XXX|May:2020,030/XXX|Apr:2020,060/XXX|Mar:2020,090/XXX 'July 2021'")
Revised strategy in view of the modifications:
separate rows on |, which must be escaped as \\| because it is a regex metacharacter
separate into columns on ,
extract the digits from the values using gsub
the rest is pretty obvious.
Feel free to ask clarifications, if any.
df <- read.table(header = T, text = "Id History ReportMonth
1001 Jun:2020,030/XXX|May:2020,035/XXX|Apr:2020,040/XXX|Mar:2020,060/XXX 'July 2021'
1003 Jun:2017,DDD/XXX|May:2017,030/XXX|Apr:2017,DDD/STD|Mar:2017,000/XXX 'July 2021'
1005 Apr:2019,000/XXX|Mar:2019,800/DDD|Feb:2019,000/XXX|Jan:2019,000/XXX 'July 2021'
1006 Jun:2020,000/XXX|May:2020,030/XXX|Apr:2020,060/XXX|Mar:2020,090/XXX 'July 2021'")
library(tidyverse)
library(lubridate, warn.conflicts = F)
df %>%
separate_rows(History, sep = '\\|') %>%
separate(History, into = c('Hist_mon', 'Hist_val'), sep = ',') %>%
mutate(Hist_mon = dmy(paste0('1:', Hist_mon)),
Hist_val = as.numeric(gsub('(\\D*)', '', Hist_val)),
ReportMonth = dmy(paste0('1 ', ReportMonth))) %>%
group_by(Id, ReportMonth) %>%
summarise(last_24_30 = sum(Hist_val >= 30 & Hist_mon >= ReportMonth %m-% months(24)),
last_36_30 = sum(Hist_val >= 30 & Hist_mon >= ReportMonth %m-% months(36)), .groups = 'drop')
#> # A tibble: 4 x 4
#> Id ReportMonth last_24_30 last_36_30
#> <int> <date> <int> <int>
#> 1 1001 2021-07-01 4 4
#> 2 1003 2021-07-01 0 0
#> 3 1005 2021-07-01 0 1
#> 4 1006 2021-07-01 3 3
Created on 2021-07-16 by the reprex package (v2.0.0)
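The same parsing steps can be traced on a single History string in base R. This is only an illustrative sketch: it uses the first row of the modified dataset, assumes an English locale so that "%b" recognises "Jun", "May" and so on, and the variable names are made up for the example.

```r
history <- "Jun:2020,030/XXX|May:2020,035/XXX|Apr:2020,040/XXX|Mar:2020,060/XXX"
report_month <- as.Date("2021-07-01")  # "July 2021" as the first of the month

# 1. Split the string into one entry per month (fixed = TRUE avoids regex escaping)
entries <- strsplit(history, "|", fixed = TRUE)[[1]]

# 2. Split each entry into its "Mon:YYYY" key and "NNN/XXX" value
parts <- strsplit(entries, ",", fixed = TRUE)
keys <- sapply(parts, `[`, 1)
vals <- sapply(parts, `[`, 2)

# 3. Turn keys into dates and values into numbers (text before the slash)
hist_mon <- as.Date(paste0("01:", keys), format = "%d:%b:%Y")
hist_val <- as.numeric(sub("/.*$", "", vals))

# 4. Count values >= 30 that fall within the 24 months before the report month
cutoff <- seq(report_month, by = "-24 months", length.out = 2)[2]
n_30_last_24 <- sum(hist_val >= 30 & hist_mon >= cutoff, na.rm = TRUE)
n_30_last_24
```

Non-numeric values such as DDD become NA (with a coercion warning) in step 3, and na.rm = TRUE keeps them from turning the count into NA.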
library(tidyverse)
library(lubridate)
df %>%
separate_rows(History, sep = '[|]')%>%
filter(str_detect(History, "\\w"), str_detect(History, "\\d+/"))%>%
separate(History, c("Date", "Value", "d"), sep = '[,/]', convert = TRUE) %>%
mutate(across(c(Date,ReportMonth), ~myd(paste(.x, "01")))) %>%
group_by(Id) %>%
summarise(r = list(map(c(m24 = 24, m36 = 36), ~sum(
Date + months(.x) > ReportMonth & Value >= 30)))) %>%
unnest_wider(r) %>%
right_join(df, 'Id')
# A tibble: 4 x 5
Id m24 m36 History_Report Month
<int> <int> <int> <chr> <chr>
1 1001 4 4 Jun:2020,030/XXX-May:2020,035/XXX-Apr:2020,040/XXX-Mar:2020,060/XXX July 2021
2 1003 0 0 Jun:2017,823/XXX-May:2017,000/XXX-Apr:2017,000/XXX-Mar:2017,000/XXX July 2021
3 1005 0 1 Apr:2019,000/XXX-Mar:2019,800/XXX-Feb:2019,000/XXX-Jan:2019,000/XXX July 2021
4 1006 3 3 Jun:2020,000/XXX-May:2020,030/XXX-Apr:2020,060/XXX-Mar:2020,090/XXX July 2021

How to sum all unique factors in column based on group_by function in R and output as new column?

I have a dataframe that consists of 4 columns where year goes from 2016-2018 and the Lost_Reason values have a total of 15 unique "reasons" that are tallied each year:
Year1 LOST_REASON TotalLost
<chr> <fct> <int>
1 2016 "" 0
2 2016 "Change in Business Strategy" 31
3 2016 "Data Issue" 12
4 2016 "Lack of Adoption" 21
5 2016 "Lack of Value" 14
6 2016 "Lost to Competition" 20
How can I reformat this dataframe which was generated by this simple code:
df_test1 <- complete_df %>%
mutate(full_year = format(as.Date(CLOSEDATE, format = "%m/%d/%Y"), "%Y-%m-%d")) %>%
group_by(Year1, LOST_REASON) %>%
summarise(TotalWon = sum(STAGENAME == 'Closed Won'), TotalLost = sum(STAGENAME == 'CS: Non-Renewal'))
to match an output like so where the "Lost_Reason" factors are summed per year with a "total" column generated:
Reason 2016 2017 2018 Total
1 Change in Business Strategy 31 39 45 151
2 Data Issue 12 20 11 51
3 Lack of Adoption 21 25 26 89
4 Lack of Value 14 23 20 90
5 Lost to Competition 20 13 13 66
6 No Budget 14 27 41 103
It would be a pivot_wider option after creating a row index based on the 'Year' column
library(dplyr)
library(tidyr)
library(data.table)
df_test1 %>%
mutate(rn = rowid(Year1)) %>%
pivot_wider(names_from = Year1, values_from = TotalLost) %>%
mutate(Total = `2016` + `2017` + `2018`)
The workflow I would use is group_by and summarize to create the sum column, then pivot_wider to spread across the years and, finally, left_join to put the two together.
Note I create the sums while the data is still in "tidy" format. You could sum across the rows after you pivot the data but that would be more complicated (for me, anyway).
library(dplyr)
library(tidyr)
df_1 <- tribble(
~Year1,~LOST_REASON,~TotalLost,
2016, "" ,0,
2016, "Change in Business Strategy" ,31,
2016, "Data Issue" ,12,
2016, "Lack of Adoption" ,21,
2016, "Lack of Value" ,14,
2016, "Lost to Competition" ,20,
2017, "" ,0,
2017, "Change in Business Strategy" ,31,
2018, "Data Issue" ,12,
2019, "Lack of Adoption" ,21,
2020, "Lack of Value" ,14,
2020, "Lost to Competition" ,20
)
Sums <- df_1 %>% group_by(LOST_REASON) %>%
summarise(Sum=sum(TotalLost,na.rm = TRUE))
Sums
#> # A tibble: 6 x 2
#> LOST_REASON Sum
#> <chr> <dbl>
#> 1 "" 0
#> 2 "Change in Business Strategy" 62
#> 3 "Data Issue" 24
#> 4 "Lack of Adoption" 42
#> 5 "Lack of Value" 28
#> 6 "Lost to Competition" 40
df_2 <- df_1 %>% pivot_wider(id_cols="LOST_REASON",
names_from = "Year1",
values_from = "TotalLost") %>%
left_join(Sums)
#> Joining, by = "LOST_REASON"
df_2
#> # A tibble: 6 x 7
#> LOST_REASON `2016` `2017` `2018` `2019` `2020` Sum
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "" 0 0 NA NA NA 0
#> 2 "Change in Business Strategy" 31 31 NA NA NA 62
#> 3 "Data Issue" 12 NA 12 NA NA 24
#> 4 "Lack of Adoption" 21 NA NA 21 NA 42
#> 5 "Lack of Value" 14 NA NA NA 14 28
#> 6 "Lost to Competition" 20 NA NA NA 20 40
Created on 2020-04-23 by the reprex package (v0.3.0)
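Summing across the rows after reshaping is also possible. One way to sketch it in base R is with xtabs() and addmargins(); the small data frame below is illustrative, using two of the reasons and their yearly values from the desired output in the question.

```r
df_small <- data.frame(
  Year1 = c(2016, 2017, 2018, 2016, 2017, 2018),
  LOST_REASON = c("Data Issue", "Data Issue", "Data Issue",
                  "Lack of Value", "Lack of Value", "Lack of Value"),
  TotalLost = c(12, 20, 11, 14, 23, 20)
)

# Cross-tabulate reason by year, summing TotalLost in each cell
wide <- xtabs(TotalLost ~ LOST_REASON + Year1, data = df_small)

# Append a "Sum" column holding the total per reason across the years
with_total <- addmargins(wide, margin = 2)
with_total
```

Unlike pivot_wider, xtabs fills reason/year combinations that never occur with 0 rather than NA, so the row totals need no na.rm handling.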

Dropping the rows by checking whether it has multiple values in R

I have a data frame in this form;
Year Department Jan Feb ................... Dec
2017 TF 15.15 225.51 .............. 5562.1
2015 CIF ...................................
2013 TTR ....................................
2011 COR ....................
. .............................
. ......................
As a summary, I want to create an algorithm but first I have to make this filtering:
If a department does not have values for all of the years 2013, 2014, 2015 and 2016, then I want to exclude that department from my data set.
In other words, I want to keep only the departments that have values in the month columns for all four years.
I tried exists and is.na, but the multiple filtering always fails. Another obstacle is that filter seems to work for only a single condition, whereas here I need four: values for all four years must exist before I can use them in the next step.
Thank you.
I can't find a clear duplicate to this question. Seems like a quick fix with group_by:
library(dplyr)
df <- tibble(Year = c(2013:2016, 2015, 2016),
Department = c(rep('TF', 4), 'CIF', 'TTR'))
df
#> # A tibble: 6 x 2
#> Year Department
#> <dbl> <chr>
#> 1 2013 TF
#> 2 2014 TF
#> 3 2015 TF
#> 4 2016 TF
#> 5 2015 CIF
#> 6 2016 TTR
df %>%
group_by(Department) %>%
mutate(x = Year %in% c(2013:2016),
y = sum(x)) %>%
ungroup() %>%
filter(y == 4)
#> # A tibble: 4 x 4
#> Year Department x y
#> <dbl> <chr> <lgl> <int>
#> 1 2013 TF TRUE 4
#> 2 2014 TF TRUE 4
#> 3 2015 TF TRUE 4
#> 4 2016 TF TRUE 4
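A variant of the same idea tests for the required years directly, which also holds up if a department has duplicate rows for a year. This is a sketch on the same toy data; `required` and `kept` are illustrative names.

```r
library(dplyr)

df <- tibble(Year = c(2013:2016, 2015, 2016),
             Department = c(rep("TF", 4), "CIF", "TTR"))

required <- 2013:2016

# Keep a department only if every required year appears among its rows
kept <- df %>%
  group_by(Department) %>%
  filter(all(required %in% Year)) %>%
  ungroup()
kept
```

Here filter() evaluates one condition per group, so the four yearly checks collapse into a single all() call.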
A solution using R base:
df = read.table(text = "Year, Department
2016,TF
2017,TF
2013,CIF
2014,CIF
2015,CIF
2016,CIF
2013,TTR", header = TRUE, sep = ",", stringsAsFactors = FALSE)
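# Count, per department, the rows whose Year is one of the four required years
in_range <- subset(df, Year %in% c(2013, 2014, 2015, 2016))
counts <- aggregate(Year ~ Department, data = in_range, FUN = length)
# Keep only the departments that have all four years
df[df$Department %in% counts$Department[counts$Year == 4], ]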
Output:
Year Department
3 2013 CIF
4 2014 CIF
5 2015 CIF
6 2016 CIF
