Parsing a string in R and comparing the values with another column - r

This may seem trivial, but I am really stuck on this problem of comparing a value with a complex string.
My data frame looks like this:
Id    History                                                              Report Month
1001  Jun:2020,030/XXX-May:2020,035/XXX-Apr:2020,040/XXX-Mar:2020,060/XXX  July 2021
1003  Jun:2017,823/XXX-May:2017,000/XXX-Apr:2017,000/XXX-Mar:2017,000/XXX  July 2021
1005  Apr:2019,000/XXX-Mar:2019,800/XXX-Feb:2019,000/XXX-Jan:2019,000/XXX  July 2021
1006  Jun:2020,000/XXX-May:2020,030/XXX-Apr:2020,060/XXX-Mar:2020,090/XXX  July 2021
The key-value pairs from the History column that will be used in the comparison are as follows:
Id : 1001 - Jun 2020,030 May 2020, 035 Apr 2020, 040......
Id : 1003 - Jun 2017,823 May 2017, 000 Apr 2017, 000......
Problem statement: I want to compare these key-value pairs with the report month (which is always the current month) and create a conditional column based on the result. The logic is: for the 24 months (could be 12 or 36) preceding July 2021, i.e. July 2021-Jun 2019, count how many key-value pairs have a value >= 30 (or >= 60, etc.) for months that fall within this period. So if all the months in a string are earlier than Jun 2019, as for 1003, the answer should be 0.
Output
Id    Report Month  +30_last_24  +30_last_36
1001  July 2021     4            4
1003  July 2021     0            0
1005  July 2021     0            1
1006  July 2021     3            3
I started with R very recently and have no solution to even begin with, so any help would be deeply appreciated.
MODIFIED ORIGINAL DATASET
df <- read.table(header = T, text = "Id History ReportMonth
1001 Jun:2020,030/XXX|May:2020,035/XXX|Apr:2020,040/XXX|Mar:2020,060/XXX 'July 2021'
1003 Jun:2017,DDD/XXX|May:2017,030/XXX|Apr:2017,DDD/STD|Mar:2017,000/XXX 'July 2021'
1005 Apr:2019,000/XXX|Mar:2019,800/DDD|Feb:2019,000/XXX|Jan:2019,000/XXX 'July 2021'
1006 Jun:2020,000/XXX|May:2020,030/XXX|Apr:2020,060/XXX|Mar:2020,090/XXX 'July 2021'")

Revised strategy in view of the modifications:
separate rows using |, but only after escaping it as \\| (a bare | is a regex metacharacter)
separate into columns using ,
extract the digits from the values using gsub
the rest is pretty obvious.
Feel free to ask for clarification, if needed.
df <- read.table(header = T, text = "Id History ReportMonth
1001 Jun:2020,030/XXX|May:2020,035/XXX|Apr:2020,040/XXX|Mar:2020,060/XXX 'July 2021'
1003 Jun:2017,DDD/XXX|May:2017,030/XXX|Apr:2017,DDD/STD|Mar:2017,000/XXX 'July 2021'
1005 Apr:2019,000/XXX|Mar:2019,800/DDD|Feb:2019,000/XXX|Jan:2019,000/XXX 'July 2021'
1006 Jun:2020,000/XXX|May:2020,030/XXX|Apr:2020,060/XXX|Mar:2020,090/XXX 'July 2021'")
library(tidyverse)
library(lubridate, warn.conflicts = F)
df %>%
  separate_rows(History, sep = '\\|') %>%
  separate(History, into = c('Hist_mon', 'Hist_val'), sep = ',') %>%
  mutate(Hist_mon = dmy(paste0('1:', Hist_mon)),
         Hist_val = as.numeric(gsub('(\\D*)', '', Hist_val)),
         ReportMonth = dmy(paste0('1 ', ReportMonth))) %>%
  group_by(Id, ReportMonth) %>%
  summarise(last_24_30 = sum(Hist_val >= 30 & Hist_mon >= ReportMonth %m-% months(24)),
            last_36_30 = sum(Hist_val >= 30 & Hist_mon >= ReportMonth %m-% months(36)), .groups = 'drop')
#> # A tibble: 4 x 4
#> Id ReportMonth last_24_30 last_36_30
#> <int> <date> <int> <int>
#> 1 1001 2021-07-01 4 4
#> 2 1003 2021-07-01 0 0
#> 3 1005 2021-07-01 0 1
#> 4 1006 2021-07-01 3 3
Created on 2021-07-16 by the reprex package (v2.0.0)
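To make the window logic above easier to check by hand, here is a minimal base-R sketch for a single History string. The helper name count_hits and its arguments are hypothetical (they are not part of the answer above), it takes the value as the text before the slash rather than stripping non-digits with gsub, and it assumes an English locale so that month names like "Jun" and "July" parse. Non-numeric values such as DDD become NA and are dropped from the count via na.rm = TRUE.

```r
# Hypothetical helper: count the entries of one History string whose value is
# >= threshold and whose month is no earlier than `months_back` months before
# the report month. Assumes an English locale for "%b" / "%B".
count_hits <- function(history, report_month, months_back = 24, threshold = 30) {
  parts <- strsplit(history, "|", fixed = TRUE)[[1]]
  # "Jun:2020,030/XXX" -> month part "Jun:2020" -> Date 2020-06-01
  mon <- as.Date(paste0("1:", sub(",.*$", "", parts)), format = "%d:%b:%Y")
  # value part "030/XXX" -> "030" -> 30; "DDD" -> NA
  val_txt <- sub("/.*$", "", sub("^[^,]*,", "", parts))
  val <- suppressWarnings(as.numeric(val_txt))
  report <- as.Date(paste("1", report_month), format = "%d %B %Y")
  # first day of the month that lies `months_back` months before the report month
  cutoff <- seq(report, by = paste0("-", months_back, " months"), length.out = 2)[2]
  sum(val >= threshold & mon >= cutoff, na.rm = TRUE)
}

h1001 <- "Jun:2020,030/XXX|May:2020,035/XXX|Apr:2020,040/XXX|Mar:2020,060/XXX"
h1005 <- "Apr:2019,000/XXX|Mar:2019,800/DDD|Feb:2019,000/XXX|Jan:2019,000/XXX"
count_hits(h1001, "July 2021", months_back = 24)  # 4
count_hits(h1005, "July 2021", months_back = 24)  # 0
count_hits(h1005, "July 2021", months_back = 36)  # 1
```

Because the window length and threshold are arguments, the same function covers the 12-month or >= 60 variants mentioned in the question.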

library(tidyverse)
library(lubridate)
df %>%
separate_rows(History, sep = '[|]')%>%
filter(str_detect(History, "\\w"), str_detect(History, "\\d+/"))%>%
separate(History, c("Date", "Value", "d"), sep = '[,/]', convert = TRUE) %>%
mutate(across(c(Date,ReportMonth), ~myd(paste(.x, "01")))) %>%
group_by(Id) %>%
summarise(r = list(map(c(m24 = 24, m36 = 36), ~sum(
Date + months(.x) > ReportMonth & Value >= 30)))) %>%
unnest_wider(r) %>%
right_join(df, 'Id')
# A tibble: 4 x 5
Id m24 m36 History_Report Month
<int> <int> <int> <chr> <chr>
1 1001 4 4 Jun:2020,030/XXX-May:2020,035/XXX-Apr:2020,040/XXX-Mar:2020,060/XXX July 2021
2 1003 0 0 Jun:2017,823/XXX-May:2017,000/XXX-Apr:2017,000/XXX-Mar:2017,000/XXX July 2021
3 1005 0 1 Apr:2019,000/XXX-Mar:2019,800/XXX-Feb:2019,000/XXX-Jan:2019,000/XXX July 2021
4 1006 3 3 Jun:2020,000/XXX-May:2020,030/XXX-Apr:2020,060/XXX-Mar:2020,090/XXX July 2021

Related

Removing dates (in any format) from a text column

Hope everyone is well.
In my dataset there is a column containing free text. My goal is to remove all dates, in any format, from the text.
This is a snapshot of the data:
df <- data.frame(
text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
'why not going there on 2023/01/14','2023 08 01 will be great'))
df %>% select(text)
text
1 tommorow is 2022 11 03
2 I married on 2020-01-01
3 why not going there on 2023/01/14
4 2023 08 01 will be great
The outcome should look like
text
1 tommorow is
2 I married on
3 why not going there on
4 will be great
Thank you!
Best approach would perhaps be to have a sensitive regex pattern:
df <- data.frame(
text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
'why not going there on 2023/01/14','2023 08 01 will be great'))
library(tidyverse)
df |>
mutate(left_text = str_trim(str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}")))
#> text left_text
#> 1 tommorow is 2022 11 03 tommorow is
#> 2 I married on 2020-01-01 I married on
#> 3 why not going there on 2023/01/14 why not going there on
#> 4 2023 08 01 will be great will be great
This will match dates by:
\\d{1,4} = starting with either month (1-2 numeric characters), day (1-2 characters) or year (2-4 characters); followed by
\\D = anything that's not a number, i.e. the separator; followed by
\\d{1,2} = day or month (1-2 chars); followed by
\\D again; ending with
\\d{1,4} = day or year (1-2 or 2-4 chars)
The challenge is balancing sensitivity with specificity. This should not take out numbers which are clearly not dates, but might miss out:
dates with no year
dates with no separators
dates with double spaces between parts
But hopefully should catch every sensible date in your text column!
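As a quick check of the trade-offs listed above, here is a small base-R sketch that tests the same pattern against a few strings. The sample strings are made up for illustration; only the pattern comes from the answer.

```r
# Pattern from the answer: digits, one non-digit, digits, one non-digit, digits
date_pattern <- "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}"

# Made-up sample strings covering the cases listed above
samples <- c(
  "meet on 2022-11-03",    # ordinary date
  "meet on 03/11",         # no year
  "meet on 20221103",      # no separators
  "meet on 2022  11  03",  # double spaces between parts
  "room 12 has 4 chairs"   # numbers that are not a date
)

matched <- grepl(date_pattern, samples, perl = TRUE)
data.frame(samples, matched)
```

Only the first string matches: the three awkward date formats are missed, and the plain numbers in the last string are left alone.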
Further date detection examples:
library(tidyverse)
df <- data.frame(
text = c(
'tommorow is 2022 11 03',
"I married on 2020-01-01",
'why not going there on 2023/01/14',
'2023 08 01 will be great',
'A trickier example: January 05,2020',
'or try Oct 2010',
'dec 21/22 is another date'
)
)
df |>
mutate(left_text = str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}") |>
str_remove(regex(paste0("(", paste(month.name, collapse = "|"),
")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
ignore_case = TRUE)) |>
str_remove(regex(paste0("(", paste(month.abb, collapse = "|"),
")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
ignore_case = TRUE)) |>
str_trim())
#> text left_text
#> 1 tommorow is 2022 11 03 tommorow is
#> 2 I married on 2020-01-01 I married on
#> 3 why not going there on 2023/01/14 why not going there on
#> 4 2023 08 01 will be great will be great
#> 5 A trickier example: January 05,2020 A trickier example:
#> 6 or try Oct 2010 or try
#> 7 dec 21/22 is another date is another date
Final Edit - doing replace with temporary placeholders
The following code should work on a wide range of date formats. It replaces matches in a specific order so as not to accidentally chop out bits of some dates. The regex is glued together from pre-made pieces, which hopefully gives a clearer idea of what each bit is doing:
library(tidyverse)
df <- data.frame(
text = c(
'tommorow is 2022 11 03',
"I married on 2020-01-01",
'why not going there on 2023/01/14',
'2023 08 01 will be great',
'A trickier example: January 05,2020',
'or try Oct 26th 2010',
'dec 21/22 is another date',
'today is 2023-01-29 & tomorrow is 2022 11 03 & 2022-12-01',
'A trickier example: January 05,2020',
'2020-01-01 I married on 2020-12-01',
'Adding in 1st December 2018',
'And perhaps Jul 4th 2023'
)
)
r_year <- "\\d{2,4}"
r_day <- "\\d{1,2}(\\w{1,2})?" # With or without "st" etc.
r_month_num <- "\\d{1,2}"
r_month_ab <- paste0("(", paste(month.abb, collapse = "|"), ")")
r_month_full <- paste0("(", paste(month.name, collapse = "|"), ")")
r_sep <- "[^\\w]+" # The separators can be anything but word characters (letters, digits, underscore)
library(glue)
df |>
mutate(
text =
# Any numeric day/month/year
str_replace_all(text,
glue("{r_day}{r_sep}{r_month_num}{r_sep}{r_year}"),
"REP_DATE") |>
# Any numeric month/day/year
str_replace_all(glue("{r_month_num}{r_sep}{r_day}{r_sep}{r_year}"),
"REP_DATE") |>
# Any numeric year/month/day
str_replace_all(glue("{r_year}{r_sep}{r_month_num}{r_sep}{r_day}"),
"REP_DATE") |>
# Any day[th]/monthname/year or monthname/day[th]/year
str_replace_all(regex(paste0(
glue("({r_day}{r_sep})?({r_month_full}|{r_month_ab})",
"{r_sep}({r_day}{r_sep})?{r_year}")
), ignore_case = TRUE),
"REP_DATE") |>
# And transform all placeholders to required date
str_replace_all("REP_DATE", "25th October 2022")
)
#> text
#> 1 tommorow is 25th October 2022
#> 2 I married on 25th October 2022
#> 3 why not going there on 25th October 2022
#> 4 25th October 2022 will be great
#> 5 A trickier example: 25th October 2022
#> 6 or try 25th October 2022
#> 7 25th October 2022 is another date
#> 8 today is 25th October 2022 & tomorrow is 25th October 2022 & 25th October 2022
#> 9 A trickier example: 25th October 2022
#> 10 25th October 2022 I married on 25th October 2022
#> 11 Adding in 25th October 2022
#> 12 And perhaps 25th October 2022
This should catch all the most common ways of writing dates, even with added "st"s "nd"s and "th"s after day number and irrespective of ordering of parts (apart from any format which puts "year" in the middle between "day" and "month", but that seems unlikely).

Show more digits, tsibble in R

I have read about a number of ways to show more digits in R output (in the RStudio display), but the methods I have reviewed do not work for a tsibble with a mixture of character and numeric columns. I am able to use the num() function to set the number of digits for an individual numeric column, but I would like to show the full tsibble object with the character columns intact.
In the example below, I would like the output to be the same as that shown except the x1 and x2 columns should have three digits after the decimal for all rows.
Thank you in advance for your help.
library(tidyverse)
library(tsibble)
data <- tibble(date = seq(as.Date("2022/1/1"), by = "month", length.out = 6),
region = c("A","C","A","B","C","C"),
x1 = c(7.3456, 123.4532, 106.1059, 17.1594, 175.3951, 62.9431),
x2 = c(12.12, 15.29, 27.92, 9.23, 16.29, 13.11))
data <- data %>%
mutate(month = yearmonth(date)) %>%
as_tsibble(index = month)
data
# A tsibble: 6 x 5 [1M]
date region x1 x2 month
<date> <chr> <dbl> <dbl> <mth>
1 2022-01-01 A 7.35 12.1 2022 Jan
2 2022-02-01 C 123. 15.3 2022 Feb
3 2022-03-01 A 106. 27.9 2022 Mar
4 2022-04-01 B 17.2 9.23 2022 Apr
5 2022-05-01 C 175. 16.3 2022 May
6 2022-06-01 C 62.9 13.1 2022 Jun
As Jon Spring mentioned in the comments, mutate(across(where(is.numeric), ~num(., digits = 3))) does work, the same as it does for tibbles; see the section "Fixed number of digits" in the tibble documentation.
Do note the type label printed under x1 and x2. It will show <num:.3!> instead of <dbl>. But this only affects how tibbles are printed. The data in x1 and x2 is still a double.
In your code:
data %>%
mutate(month = yearmonth(date),
across(where(is.numeric), ~num(.,digits = 3))) %>%
as_tsibble(index = month)
# A tsibble: 6 x 5 [1M]
date region x1 x2 month
<date> <chr> <num:.3!> <num:.3!> <mth>
1 2022-01-01 A 7.346 12.120 2022 Jan
2 2022-02-01 C 123.453 15.290 2022 Feb
3 2022-03-01 A 106.106 27.920 2022 Mar
4 2022-04-01 B 17.159 9.230 2022 Apr
5 2022-05-01 C 175.395 16.290 2022 May
6 2022-06-01 C 62.943 13.110 2022 Jun
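To confirm that num() only changes how the column is displayed, a minimal sketch (assuming the tibble package is installed; the vector x is made up for illustration) is to check the storage type of the formatted vector:

```r
library(tibble)

# Made-up values; num() attaches display formatting, not a new storage type
x <- c(7.3456, 123.4532)
x_fmt <- num(x, digits = 3)

typeof(x_fmt)  # storage type of the formatted vector
```

The storage type stays "double", so later arithmetic still uses the full-precision values rather than the rounded display.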

Use dplyr to aggregate counts by month from start-stop ranged variables?

Let's say I have school enrollment data stored in this format, with start date and end date fields:
unique_name  enrollment_start  enrollment_end
Amy          1, Jan, 2017      30, Sep 2018
Franklin     1, Jan, 2017      19, Feb, 2017
Franklin     5, Jun, 2017      4, Feb, 2018
Franklin     21, Oct, 2018     9, Mar, 2019
Samir        1, Jun, 2017      4, Feb, 2017
Samir        5, Apr, 2017      12, Sep, 2018
...          ...               ...
And I want to produce aggregated counts of enrollment by month like this:
month      enrollment_count
Jan, 2017  25
Feb, 2017  31
Mar, 2017  19
Apr, 2017  34
May, 2017  29
Jun, 2017  32
...        ...
Is there an easy way to accomplish this with dplyr?
The only way I can think to do this is by looping over a list of all months from range month_min to month_max to count the number of rows with start or stop dates that fall inside each month. Hoping for easier code.
I think this can be done pretty elegantly with the clock and ivs packages. You seem to want monthly counts, so you can use the year-month type from clock. And ivs is a package dedicated to working with intervals of data, which is exactly what you have here. Here we assume that if your enrollment start/end fell in a month, then you should be considered active in that month.
library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)
enrollments <- tribble(
~unique_name, ~enrollment_start, ~enrollment_end,
"Amy", "1, Jan, 2017", "30, Sep, 2018",
"Franklin", "1, Jan, 2017", "19, Feb, 2017",
"Franklin", "5, Jun, 2017", "4, Feb, 2018",
"Franklin", "21, Oct, 2018", "9, Mar, 2019",
"Samir", "1, Jan, 2017", "4, Feb, 2017",
"Samir", "5, Apr, 2017", "12, Sep, 2018"
)
# Parse these into "day" precision year-month-day objects, then restrict
# them to just "month" precision because that is all we need
enrollments <- enrollments %>%
mutate(
start = enrollment_start %>%
year_month_day_parse(format = "%d, %b, %Y") %>%
calendar_narrow("month"),
end = enrollment_end %>%
year_month_day_parse(format = "%d, %b, %Y") %>%
calendar_narrow("month") %>%
add_months(1),
.keep = "unused"
)
enrollments
#> # A tibble: 6 × 3
#> unique_name start end
#> <chr> <ymd<month>> <ymd<month>>
#> 1 Amy 2017-01 2018-10
#> 2 Franklin 2017-01 2017-03
#> 3 Franklin 2017-06 2018-03
#> 4 Franklin 2018-10 2019-04
#> 5 Samir 2017-01 2017-03
#> 6 Samir 2017-04 2018-10
# Create an interval vector, note that these are half-open intervals.
# The month on the RHS is not included, which is why we added 1 to `end` above.
enrollments <- enrollments %>%
mutate(active = iv(start, end), .keep = "unused")
enrollments
#> # A tibble: 6 × 2
#> unique_name active
#> <chr> <iv<ymd<month>>>
#> 1 Amy [2017-01, 2018-10)
#> 2 Franklin [2017-01, 2017-03)
#> 3 Franklin [2017-06, 2018-03)
#> 4 Franklin [2018-10, 2019-04)
#> 5 Samir [2017-01, 2017-03)
#> 6 Samir [2017-04, 2018-10)
# We'll generate a sequence of months that will be part of the final result
bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L
months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#> month
#> <ymd<month>>
#> 1 2017-01
#> 2 2017-02
#> 3 2017-03
#> 4 2017-04
#> 5 2017-05
#> 6 2017-06
#> 7 2017-07
#> 8 2017-08
#> 9 2017-09
#> 10 2017-10
#> # … with 17 more rows
# To actually compute the counts, use `iv_count_between()`, which counts up all
# instances where `month[i]` is between any interval in `enrollments$active`
months %>%
mutate(count = iv_count_between(month, enrollments$active))
#> # A tibble: 27 × 2
#> month count
#> <ymd<month>> <int>
#> 1 2017-01 3
#> 2 2017-02 3
#> 3 2017-03 1
#> 4 2017-04 2
#> 5 2017-05 2
#> 6 2017-06 3
#> 7 2017-07 3
#> 8 2017-08 3
#> 9 2017-09 3
#> 10 2017-10 3
#> # … with 17 more rows
Created on 2022-04-05 by the reprex package (v2.0.1)
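For comparison, the same "active in that month" rule can be sketched in base R without clock or ivs. This is only an illustration of the counting logic, not a replacement for the answer above: the dates are typed in ISO format to avoid locale-dependent parsing, and the helper first_of_month is a hypothetical name.

```r
# Same six enrollments as in the tribble above, as ISO dates
starts <- as.Date(c("2017-01-01", "2017-01-01", "2017-06-05",
                    "2018-10-21", "2017-01-01", "2017-04-05"))
ends   <- as.Date(c("2018-09-30", "2017-02-19", "2018-02-04",
                    "2019-03-09", "2017-02-04", "2018-09-12"))

# Hypothetical helper: truncate a date to the first day of its month
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

start_m <- first_of_month(starts)
end_m   <- first_of_month(ends)

# Every month from the earliest start to the latest end
all_months <- seq(min(start_m), max(end_m), by = "month")

# A month counts an enrollment if it lies between its start and end months
counts <- vapply(
  seq_along(all_months),
  function(i) sum(start_m <= all_months[i] & all_months[i] <= end_m),
  integer(1)
)

head(data.frame(month = format(all_months, "%Y-%m"), count = counts))
```

This is the loop the question hoped to avoid, but it makes the interval test explicit and gives the same first rows as the ivs result (3, 3, 1, 2, 2, 3).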
Create a list column containing the sequence of months between each set of dates, then unnest and count.
Notes:
I use lubridate::floor_date() to round enrollment_start to the first day of the month. Otherwise, seq() may skip months if enrollment_start is on the 29th of the month or later.
The fifth row of your example data has enrollment_start later than enrollment_end. I assumed this was an error and removed it.
library(tidyverse)
library(lubridate)
enrollments %>%
mutate(
across(c(enrollment_start, enrollment_end), dmy), # convert to date
month = map2(
floor_date(enrollment_start, unit = "month"), # round to 1st day
enrollment_end,
~ seq(.x, .y, by = "month")
)
) %>%
unnest_longer(month) %>%
count(month, name = "enrollment_count")
#> # A tibble: 27 x 2
#> month enrollment_count
#> <date> <int>
#> 1 2017-01-01 2
#> 2 2017-02-01 2
#> 3 2017-03-01 1
#> 4 2017-04-01 2
#> 5 2017-05-01 2
#> 6 2017-06-01 3
#> 7 2017-07-01 3
#> 8 2017-08-01 3
#> 9 2017-09-01 3
#> 10 2017-10-01 3
#> # ... with 17 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
Here's my take on this with dplyr and tidyr.
Pivot the data creating multiple rows per student and format your dates.
group on student and generate missing months using complete.
group on the generated periods and count.
library(dplyr)
library(tidyr)
monthly <- data %>%
  pivot_longer(cols = c('enrollment_start', 'enrollment_end')) %>%
  mutate(value = as.Date(value, format = "%d, %B, %Y")) %>%
  # after parsing, before flooring to the first of the month:
  # unique_name name value
  # <chr> <chr> <date>
  # 1 Amy enrollment_start 2017-01-01
  # 2 Amy enrollment_end 2018-09-30
  # 3 Franklin enrollment_start 2017-01-01
  # 4 Franklin enrollment_end 2017-02-19
  # ..etc.
  mutate(value = lubridate::floor_date(value, 'month')) %>%
  group_by(unique_name) %>%
  complete(value = seq.Date(min(value), max(value), by = "month")) %>%
  arrange(unique_name, value)
# count on the completed data, not on the original `data`
enrollment_count <- monthly %>%
  group_by(value) %>%
  count()
Edit: I forgot to floor the dates in order to properly aggregate per period at the end. Added floor_date from lubridate to do this.

how to sum conditional functions to grouped rows in R

I have the following data frame:
customerid  payment_month  payment_date  bill_month  charges
1           January        22            January     30
1           February       15            February    21
1           March          2             March       33
1           May            4             April       43
1           May            4             May         23
1           June           13            June        32
2           January        12            January     45
2           February       15            February    56
2           March          2             March       67
2           April          4             April       65
2           May            4             May         54
2           June           13            June        68
3           January        25            January     45
3           February       26            February    56
3           March          30            March       67
3           April          1             April       65
3           June           1             May         54
3           June           1             June        68
(The actual data has many more customer ids.) I want to calculate payment efficiency using the following formula:
efficiency = (amount paid not late / total bill amount)*100
A payment is not late if it is made no later than the 21st day of the bill's month (paying January's bill on the 22nd of January counts as late).
I want to calculate the efficiency of each customer, with the following expected output:
customerid  effectivity
1           59.90
2           100
3           37.46
I have tried the following code to calculate it for one id, and it works. But I want to apply it to every id and summarize the result into one column (effectivity) with one row per id. I have tried group by, aggregate and ifelse, but nothing works. What should I do?
# late payments for one customer: paid in a different month, or after the 21st
df1 <- filter(df, (payment_month != bill_month & customerid == 1) | (payment_month == bill_month & payment_date > 21 & customerid == 1))
# all bills for the same customer
df2 <- filter(df, customerid == 1)
x <- sum(df1$charges)
y <- sum(df2$charges)
100 - (x / y) * 100
An option using dplyr
library(dplyr)
df %>%
group_by(customerid) %>%
summarise(
effectivity = sum(
charges[payment_date <= 21 & payment_month == bill_month]) / sum(charges) * 100,
.groups = "drop")
## A tibble: 3 x 2
#customerid effectivity
# <int> <dbl>
#1 1 59.9
#2 2 100
#3 3 37.5
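The arithmetic behind the first row can be checked with a small base-R sketch. It re-types only customer 1's rows from the question and applies the same "same month and on or before the 21st" rule as the dplyr answer above; the variable names on_time, paid_on_time and total are made up for illustration.

```r
# Customer 1's rows from the question
df1 <- data.frame(
  customerid    = rep(1, 6),
  payment_month = c("January", "February", "March", "May", "May", "June"),
  payment_date  = c(22, 15, 2, 4, 4, 13),
  bill_month    = c("January", "February", "March", "April", "May", "June"),
  charges       = c(30, 21, 33, 43, 23, 32)
)

# On time: paid in the bill's own month, on or before the 21st
on_time <- df1$payment_month == df1$bill_month & df1$payment_date <= 21

# Sum per customer, then take the ratio
paid_on_time <- tapply(df1$charges * on_time, df1$customerid, sum)
total        <- tapply(df1$charges, df1$customerid, sum)
effectivity  <- 100 * paid_on_time / total
effectivity
```

For customer 1 this is 109 / 182 * 100: the January bill (paid on the 22nd) and the April bill (paid in May) are the only two excluded from the numerator.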
df %>%
  group_by(customerid) %>%
  mutate(pay_month_number = match(payment_month, month.name),
         bill_month_number = match(bill_month, month.name)) %>%
  # not late: paid before the bill's month, or in it on or before the 21st
  mutate(nolate = pay_month_number < bill_month_number |
           (pay_month_number == bill_month_number & payment_date <= 21)) %>%
  summarise(efficiency = sum(charges[nolate]) / sum(charges) * 100)

Use dplyr/tidyr to turn rows into columns in R data frame

I have a data frame like this:
year <-c(floor(runif(100,min=2015, max=2017)))
month <- c(floor(runif(100, min=1, max=13)))
inch <- c(floor(runif(100, min=0, max=10)))
mm <- c(floor(runif(100, min=0, max=100)))
df = data.frame(year, month, inch, mm);
year month inch mm
2016 11 0 10
2015 9 3 34
2016 6 3 33
2015 8 0 77
I only care about the columns year, month, and mm.
I need to re-arrange the data frame so that the first column is the name of the month and the rest of the columns is the value of mm.
Months 2015 2016
Jan # #
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
So two things need to happen.
(1) The month needs to become a string of the first three letters of the month.
(2) I need to group by year, and then put the mm values in a column under that year.
So far I have this code, but I can't figure it out:
df %>%
select(-inch) %>%
group_by(month) %>%
summarize(mm = mm) %>%
ungroup()
To convert month to names, you can refer to month.abb; And then you can summarize by year and month, spread to wide format:
library(dplyr)
library(tidyr)
df %>%
group_by(year, month = month.abb[month]) %>%
summarise(mm = mean(mm)) %>% # use mean as an example, could also be sum or other
# intended aggregation methods
spread(year, mm) %>%
arrange(match(month, month.abb)) # rearrange month in chronological order
# A tibble: 12 x 3
# month `2015` `2016`
# <chr> <dbl> <dbl>
# 1 Jan 65.50000 28.14286
# 2 Feb 54.40000 30.00000
# 3 Mar 23.50000 95.00000
# 4 Apr 7.00000 43.60000
# 5 May 45.33333 44.50000
# 6 Jun 70.33333 63.16667
# 7 Jul 72.83333 52.00000
# 8 Aug 53.66667 66.50000
# 9 Sep 51.00000 64.40000
#10 Oct 74.00000 39.66667
#11 Nov 66.20000 58.71429
#12 Dec 38.25000 51.50000
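If you would rather avoid the dplyr/tidyr dependency, the same month-by-year layout can be sketched in base R with tapply(). The small data frame here is made up so the result is easy to verify by hand, and mean is used as the aggregation only because the answer above uses it.

```r
# Made-up example data in the same shape as the question's df
df_small <- data.frame(
  year  = c(2015, 2015, 2016, 2016, 2016),
  month = c(1, 1, 1, 2, 2),
  mm    = c(10, 20, 30, 40, 60)
)

# A factor with all twelve levels keeps the months in calendar order
df_small$month_name <- factor(month.abb[df_small$month], levels = month.abb)

# Rows = months, columns = years, cells = mean of mm (NA where no data)
wide <- tapply(df_small$mm, list(df_small$month_name, df_small$year), mean)
wide[1:3, ]
```

The result is a matrix rather than a tibble; months with no observations show up as NA rows instead of being dropped.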
