Hope everyone is well.
In my dataset there is a column containing free text. My goal is to remove all dates, in any format, from the text.
This is a snapshot of the data:
df <- data.frame(
text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
'why not going there on 2023/01/14','2023 08 01 will be great'))
df %>% select(text)
text
1 tommorow is 2022 11 03
2 I married on 2020-01-01
3 why not going there on 2023/01/14
4 2023 08 01 will be great
The outcome should look like
text
1 tommorow is
2 I married on
3 why not going there on
4 will be great
Thank you!
The best approach would perhaps be a suitably sensitive regex pattern:
df <- data.frame(
text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
'why not going there on 2023/01/14','2023 08 01 will be great'))
library(tidyverse)
df |>
mutate(left_text = str_trim(str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}")))
#> text left_text
#> 1 tommorow is 2022 11 03 tommorow is
#> 2 I married on 2020-01-01 I married on
#> 3 why not going there on 2023/01/14 why not going there on
#> 4 2023 08 01 will be great will be great
This will match dates by:
\\d{1,4} = starting with either month (1-2 numeric characters), day (1-2 characters) or year (2-4 characters); followed by
\\D = anything that's not a number, i.e. the separator; followed by
\\d{1,2} = day or month (1-2 chars); followed by
\\D again; ending with
\\d{1,4} = day or year (1-2 or 2-4 chars)
The challenge is balancing sensitivity with specificity. This should not take out numbers which are clearly not dates, but it might miss (see the sketch just after this list):
dates with no year
dates with no separators
dates with double spaces between parts
But it should hopefully catch every sensible date in your text column!
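For instance, a quick check on a few hypothetical strings (not from the original data, just to illustrate the formats listed above) confirms that the pattern does not fire on them:
library(stringr)
pattern <- "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}"
# hypothetical examples of the formats listed above; none of them match
str_detect(
  c("see you on 11/03",     # no year
    "stamped 20221103",     # no separators
    "due 2022  11  03"),    # double spaces between parts
  pattern
)
#> [1] FALSE FALSE FALSE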
Further date detection examples:
library(tidyverse)
df <- data.frame(
text = c(
'tommorow is 2022 11 03',
"I married on 2020-01-01",
'why not going there on 2023/01/14',
'2023 08 01 will be great',
'A trickier example: January 05,2020',
'or try Oct 2010',
'dec 21/22 is another date'
)
)
df |>
mutate(left_text = str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}") |>
str_remove(regex(paste0("(", paste(month.name, collapse = "|"),
")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
ignore_case = TRUE)) |>
str_remove(regex(paste0("(", paste(month.abb, collapse = "|"),
")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
ignore_case = TRUE)) |>
str_trim())
#> text left_text
#> 1 tommorow is 2022 11 03 tommorow is
#> 2 I married on 2020-01-01 I married on
#> 3 why not going there on 2023/01/14 why not going there on
#> 4 2023 08 01 will be great will be great
#> 5 A trickier example: January 05,2020 A trickier example:
#> 6 or try Oct 2010 or try
#> 7 dec 21/22 is another date is another date
Final edit: replacing with temporary placeholders
The following code should work on a wide range of date formats. It works by replacing in a specific order so as not to accidentally chop out bits of some dates. Pre-made regex patterns are glued together, to hopefully give a clearer idea of what each bit is doing:
library(tidyverse)
df <- data.frame(
text = c(
'tommorow is 2022 11 03',
"I married on 2020-01-01",
'why not going there on 2023/01/14',
'2023 08 01 will be great',
'A trickier example: January 05,2020',
'or try Oct 26th 2010',
'dec 21/22 is another date',
'today is 2023-01-29 & tomorrow is 2022 11 03 & 2022-12-01',
'A trickier example: January 05,2020',
'2020-01-01 I married on 2020-12-01',
'Adding in 1st December 2018',
'And perhaps Jul 4th 2023'
)
)
r_year <- "\\d{2,4}"
r_day <- "\\d{1,2}(\\w{1,2})?" # With or without "st" etc.
r_month_num <- "\\d{1,2}"
r_month_ab <- paste0("(", paste(month.abb, collapse = "|"), ")")
r_month_full <- paste0("(", paste(month.name, collapse = "|"), ")")
r_sep <- "[^\\w]+" # The separators can be anything but letters
library(glue)
df |>
mutate(
text =
# Any numeric day/month/year
str_replace_all(text,
glue("{r_day}{r_sep}{r_month_num}{r_sep}{r_year}"),
"REP_DATE") |>
# Any numeric month/day/year
str_replace_all(glue("{r_month_num}{r_sep}{r_day}{r_sep}{r_year}"),
"REP_DATE") |>
# Any numeric year/month/day
str_replace_all(glue("{r_year}{r_sep}{r_month_num}{r_sep}{r_day}"),
"REP_DATE") |>
# Any day[th]/monthname/year or monthname/day[th]/year
str_replace_all(regex(paste0(
glue("({r_day}{r_sep})?({r_month_full}|{r_month_ab})",
"{r_sep}({r_day}{r_sep})?{r_year}")
), ignore_case = TRUE),
"REP_DATE") |>
# And transform all placeholders to required date
str_replace_all("REP_DATE", "25th October 2022")
)
#> text
#> 1 tommorow is 25th October 2022
#> 2 I married on 25th October 2022
#> 3 why not going there on 25th October 2022
#> 4 25th October 2022 will be great
#> 5 A trickier example: 25th October 2022
#> 6 or try 25th October 2022
#> 7 25th October 2022 is another date
#> 8 today is 25th October 2022 & tomorrow is 25th October 2022 & 25th October 2022
#> 9 A trickier example: 25th October 2022
#> 10 25th October 2022 I married on 25th October 2022
#> 11 Adding in 25th October 2022
#> 12 And perhaps 25th October 2022
This should catch all the most common ways of writing dates, even with added "st"s, "nd"s and "th"s after the day number, and irrespective of the ordering of parts (apart from any format which puts the year in the middle between day and month, but that seems unlikely).
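Since the original question asks for removal rather than standardisation, here is a minimal variation of the same pipeline with the placeholder blanked out at the end (this sketch assumes df and the r_* patterns defined above are still in scope):
df |>
  mutate(
    text = str_replace_all(text,
                           glue("{r_day}{r_sep}{r_month_num}{r_sep}{r_year}"),
                           "REP_DATE") |>
      str_replace_all(glue("{r_month_num}{r_sep}{r_day}{r_sep}{r_year}"), "REP_DATE") |>
      str_replace_all(glue("{r_year}{r_sep}{r_month_num}{r_sep}{r_day}"), "REP_DATE") |>
      str_replace_all(regex(glue("({r_day}{r_sep})?({r_month_full}|{r_month_ab})",
                                 "{r_sep}({r_day}{r_sep})?{r_year}"),
                            ignore_case = TRUE),
                      "REP_DATE") |>
      # drop the placeholders instead of renaming them, then tidy the leftover spaces
      str_remove_all("REP_DATE") |>
      str_squish()
  )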
This may seem trivial, but I am really stuck on this problem of comparing a value against a complex string.
My data frame looks like this:
Id    History                                                               Report Month
1001  Jun:2020,030/XXX-May:2020,035/XXX-Apr:2020,040/XXX-Mar:2020,060/XXX   July 2021
1003  Jun:2017,823/XXX-May:2017,000/XXX-Apr:2017,000/XXX-Mar:2017,000/XXX   July 2021
1005  Apr:2019,000/XXX-Mar:2019,800/XXX-Feb:2019,000/XXX-Jan:2019,000/XXX   July 2021
1006  Jun:2020,000/XXX-May:2020,030/XXX-Apr:2020,060/XXX-Mar:2020,090/XXX   July 2021
Key, value pairs from the History column that will be used in the comparison are as follows:
Id : 1001 - Jun 2020, 030   May 2020, 035   Apr 2020, 040 ......
Id : 1003 - Jun 2017, 823   May 2017, 000   Apr 2017, 000 ......
The problem statement: I want to compare these key, value pairs with the report month (i.e. always the current month) and create a conditional column based on them. The logic is: looking at the 24 months (could be 12 or 36) preceding July 2021, i.e. July 2021 back to Jun 2019, count how many key, value pairs have a value >= 30 (or >= 60, etc.) for months that lie within this time period. So if a string starts from a month earlier than Jun 2019, as for 1003, the answer should be 0.
Output
Id    Report Month   +30_last_24   +30_last_36
1001  July 2021      4             4
1003  July 2021      0             0
1005  July 2021      0             1
1006  July 2021      3             3
I started with R very recently and have no idea how to even begin, so any help would be deeply appreciated.
MODIFIED ORIGINAL DATASET
df <- read.table(header = T, text = "Id History ReportMonth
1001 Jun:2020,030/XXX|May:2020,035/XXX|Apr:2020,040/XXX|Mar:2020,060/XXX 'July 2021'
1003 Jun:2017,DDD/XXX|May:2017,030/XXX|Apr:2017,DDD/STD|Mar:2017,000/XXX 'July 2021'
1005 Apr:2019,000/XXX|Mar:2019,800/DDD|Feb:2019,000/XXX|Jan:2019,000/XXX 'July 2021'
1006 Jun:2020,000/XXX|May:2020,030/XXX|Apr:2020,060/XXX|Mar:2020,090/XXX 'July 2021'")
Revised strategy in view of the modifications:
separate rows using |, but only after escaping it with \\
separate into columns using ,
extract the digits from the values using gsub
the rest is straightforward.
Feel free to ask for clarification, if needed.
df <- read.table(header = T, text = "Id History ReportMonth
1001 Jun:2020,030/XXX|May:2020,035/XXX|Apr:2020,040/XXX|Mar:2020,060/XXX 'July 2021'
1003 Jun:2017,DDD/XXX|May:2017,030/XXX|Apr:2017,DDD/STD|Mar:2017,000/XXX 'July 2021'
1005 Apr:2019,000/XXX|Mar:2019,800/DDD|Feb:2019,000/XXX|Jan:2019,000/XXX 'July 2021'
1006 Jun:2020,000/XXX|May:2020,030/XXX|Apr:2020,060/XXX|Mar:2020,090/XXX 'July 2021'")
library(tidyverse)
library(lubridate, warn.conflicts = F)
df %>%
separate_rows(History, sep = '\\|') %>%
separate(History, into = c('Hist_mon', 'Hist_val'), sep = ',') %>%
mutate(Hist_mon = dmy(paste0('1:', Hist_mon)),
Hist_val = as.numeric(gsub('(\\D*)', '', Hist_val)),
ReportMonth = dmy(paste0('1 ', ReportMonth))) %>%
group_by(Id, ReportMonth) %>%
summarise(last_24_30 = sum(Hist_val >= 30 & Hist_mon >= ReportMonth %m-% months(24)),
last_36_30 = sum(Hist_val >= 30 & Hist_mon >= ReportMonth %m-% months(36)), .groups = 'drop')
#> # A tibble: 4 x 4
#> Id ReportMonth last_24_30 last_36_30
#> <int> <date> <int> <int>
#> 1 1001 2021-07-01 4 4
#> 2 1003 2021-07-01 0 0
#> 3 1005 2021-07-01 0 1
#> 4 1006 2021-07-01 3 3
Created on 2021-07-16 by the reprex package (v2.0.0)
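A note on the %m-% operator used in the summarise() step above: it is lubridate's month-arithmetic operator, which rolls back to the last valid day of the month instead of producing NA. A quick illustration on hypothetical dates:
library(lubridate)
# 24 months before the report month
dmy("1 July 2021") %m-% months(24)
#> [1] "2019-07-01"
# unlike plain subtraction, %m-% never lands on an invalid date
ymd("2021-03-31") %m-% months(1)
#> [1] "2021-02-28"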
library(tidyverse)
library(lubridate)
df %>%
  # one row per History entry
  separate_rows(History, sep = '[|]') %>%
  # keep only entries that carry a numeric value before the "/"
  filter(str_detect(History, "\\w"), str_detect(History, "\\d+/")) %>%
  # split each entry into month-year, value and trailing code
  separate(History, c("Date", "Value", "d"), sep = '[,/]', convert = TRUE) %>%
  # parse both the entry month and the report month as first-of-month dates
  mutate(across(c(Date, ReportMonth), ~myd(paste(.x, "01")))) %>%
  group_by(Id) %>%
  # count values >= 30 whose month falls within the 24- and 36-month windows
  summarise(r = list(map(c(m24 = 24, m36 = 36), ~sum(
    Date + months(.x) > ReportMonth & Value >= 30)))) %>%
  unnest_wider(r) %>%
  right_join(df, 'Id')
# A tibble: 4 x 5
  Id   m24   m36 History                                                              `Report Month`
  <int> <int> <int> <chr>                                                             <chr>
1 1001 4 4 Jun:2020,030/XXX-May:2020,035/XXX-Apr:2020,040/XXX-Mar:2020,060/XXX July 2021
2 1003 0 0 Jun:2017,823/XXX-May:2017,000/XXX-Apr:2017,000/XXX-Mar:2017,000/XXX July 2021
3 1005 0 1 Apr:2019,000/XXX-Mar:2019,800/XXX-Feb:2019,000/XXX-Jan:2019,000/XXX July 2021
4 1006 3 3 Jun:2020,000/XXX-May:2020,030/XXX-Apr:2020,060/XXX-Mar:2020,090/XXX July 2021
I am trying to merge rows by pattern.
The data frame has only one column (string), and normally it should follow a pattern of date, company_name and salary. However, some cases just don't have the salary.
Is there a way I can merge the rows by the pattern of the date? By doing so, I can later split them into columns. The reason I didn't want to do pivot_wider earlier is that it is likely to produce mismatches between company name and salary (unbalanced rows). So I think it's better to merge the rows by the date pattern, as the date is never missing and follows a consistent pattern.
dataset:
# A tibble: 10 x 1
detail
<chr>
1 26 January 2021
2 NatWest Group - Bristol, BS2 0PT
3 26 January 2021
4 NatWest Group - Manchester, M3 3AQ
5 15 February 2021
6 Brook Street - Liverpool, Merseyside, L21AB
7 £13.84 per hour
8 16 February 2021
9 Anglo Technical Recruitment - London, WC2N 5DU
10 £400.00 per day
dput for the dataset:
structure(list(detail = c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
"26 January 2021", "NatWest Group - Manchester, M3 3AQ", "15 February 2021",
"Brook Street - Liverpool, Merseyside, L21AB", "£13.84 per hour",
"16 February 2021", "Anglo Technical Recruitment - London, WC2N 5DU",
"£400.00 per day")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Expected outcome:
detail
<chr>
1 26 January 2021 NatWest Group - Bristol, BS2 0PT
2 26 January 2021 NatWest Group - Manchester, M3 3AQ
3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
dput for expected outcome:
df <- structure(list(detail = c("26 January 2021 NatWest Group - Bristol, BS2 0PT",
"26 January 2021 NatWest Group - Manchester, M3 3AQ", "15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour",
"16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Preface each line with a tag and then use read.dcf to create a 3-column character matrix, mat. At the end we convert that to a character vector with one element per logical record, but you may just want to use mat since that seems like a more useful format.
We assume that the dates have the %d %B %Y format (see ?strptime for the percent codes), that salary lines start with £, and that other lines are address lines.
library(dplyr)
mat <- dat %>%
mutate(detail = case_when(
!is.na(as.Date(detail, "%d %B %Y")) ~ paste("\nDate:", detail),
grepl("^£", detail) ~ paste("Salary:", detail),
TRUE ~ paste("Address:", detail))) %>%
{ read.dcf(textConnection(.$detail)) }
mat %>%
apply(1, toString) %>%
sub(", NA$", "", .)
Update
Simplified assumptions and code.
One more solution, assuming only that the first row contains a date. It will work irrespective of the number of rows between two dates.
library(tidyverse)
df %>% group_by(d = cumsum(str_detect(detail, "^(^\\d\\d? \\w+ \\d{4})$"))) %>%
mutate(c = paste0("Col", as.character(row_number()))) %>%
pivot_wider(id_cols = d, values_from = detail, names_from = c)
# A tibble: 4 x 4
# Groups: d [4]
d Col1 Col2 Col3
<int> <chr> <chr> <chr>
1 1 26 January 2021 NatWest Group - Bristol, BS2 0PT NA
2 2 26 January 2021 NatWest Group - Manchester, M3 3AQ NA
3 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
Here is a pure data.table approach
library( data.table )
#make it a data.table
setDT( df )
#first, summarise by block (each block starts with a date line), collapsing the text using ## as separator
ans <- df[, .( paste0( detail, collapse = "##") ),
by = .(d = cumsum( ( grepl( "[0-9]{2} [a-zA-Z]+ [0-9]{4}", detail) ) ) ) ]
#split the text again into cols, based on the ## introduced in the collapse. The number of cols is dynamic!
ans[, paste0( "Col", 1:length( tstrsplit(ans$V1, "##" ))) := tstrsplit( V1, "##" )][, V1 := NULL ][]
# d Col1 Col2 Col3
# 1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
# 2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
# 3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
# 4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
Here is a data.table approach which uses dcast() and rowid() to reshape to wide format. It returns a data.table with four columns: a record number, date,
company_name, and salary.
library(data.table)
setDT(df1)[, rn := cumsum(!is.na(lubridate::dmy(detail)))]
dcast(df1, rn ~ rowid(rn, prefix = "Col"), value.var = "detail")
rn Col1 Col2 Col3
1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
For detecting rows which start a new record, i.e., rows with a date, this approach borrows from Anil's answer as well as from G.Grothendieck's.
dcast() allows packing everything into a "one-liner" (if the library() calls are not counted):
library(data.table)
library(lubridate)
dcast(setDT(df1), cumsum(!is.na(dmy(detail))) ~ rowid(cumsum(!is.na(dmy(detail))), prefix = "Col"),
value.var = "detail")
After executing the R code, the values I got in the data frame column are:
25 July 2012 bet
22 June 2015 bet
09 April 2015 be
14 November 2016
I want only the dates. How can I remove the "bet" and "be" from the values?
I am using the code below to extract the above values from the text document:
coalesce(
  substr(stringr::str_match(text, "ISDA Master Agreement dated as of (.*) ")[, 2], 1, 16),
  substr(stringr::str_match(text, "ISDA Master Agreement dated as of (.*) ")[, 2], 1, 13)
)
If I swap the coalesce arguments, then the 4th value gets truncated.
I am OK with the code, but while cleaning, how should I remove the "bet" and "be"?
I am far from being a regex expert, but here is a tidyverse way of doing what you want:
library(tidyverse, verbose = F)
df <- tibble::tribble(
~V1, ~V2,
1L, "25 July 2012 bet",
2L, "22 June 2015 bet",
3L, "09 April 2015 be",
4L, "14 November 2016"
)
df %>%
mutate(V2 = str_replace(V2, pattern = "[:space:]be.*", replacement = ""))
#> # A tibble: 4 x 2
#> V1 V2
#> <int> <chr>
#> 1 1 25 July 2012
#> 2 2 22 June 2015
#> 3 3 09 April 2015
#> 4 4 14 November 2016
Created on 2020-02-21 by the reprex package (v0.3.0)
We can use sub to remove the whitespace and everything from "be" onwards:
sub("\\s+be.*", "", c("25 July 2012 bet", "09 April 2015 be"))
#[1] "25 July 2012" "09 April 2015"
If you use lubridate you can strip away the excess text after the date:
library(lubridate)
test_strings <- c("25 July 2012 bet", "09 April 2015 be")
dmy(test_strings)
[1] "2012-07-25" "2015-04-09"
I have a data frame like this:
year <-c(floor(runif(100,min=2015, max=2017)))
month <- c(floor(runif(100, min=1, max=13)))
inch <- c(floor(runif(100, min=0, max=10)))
mm <- c(floor(runif(100, min=0, max=100)))
df <- data.frame(year, month, inch, mm)
year month inch mm
2016 11 0 10
2015 9 3 34
2016 6 3 33
2015 8 0 77
I only care about the columns year, month, and mm.
I need to re-arrange the data frame so that the first column is the month name and the remaining columns hold the mm values, one column per year.
Months 2015 2016
Jan # #
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
So two things need to happen.
(1) The month needs to become a string of the first three letters of the month.
(2) I need to group by year, and then put the mm values in a column under that year.
So far I have this code, but I can't figure it out:
df %>%
select(-inch) %>%
group_by(month) %>%
summarize(mm = mm) %>%
ungroup()
To convert month numbers to names, you can index into month.abb; you can then summarise by year and month, and spread to wide format:
library(dplyr)
library(tidyr)
df %>%
group_by(year, month = month.abb[month]) %>%
summarise(mm = mean(mm)) %>% # use mean as an example, could also be sum or other
# intended aggregation methods
spread(year, mm) %>%
arrange(match(month, month.abb)) # rearrange month in chronological order
# A tibble: 12 x 3
# month `2015` `2016`
# <chr> <dbl> <dbl>
# 1 Jan 65.50000 28.14286
# 2 Feb 54.40000 30.00000
# 3 Mar 23.50000 95.00000
# 4 Apr 7.00000 43.60000
# 5 May 45.33333 44.50000
# 6 Jun 70.33333 63.16667
# 7 Jul 72.83333 52.00000
# 8 Aug 53.66667 66.50000
# 9 Sep 51.00000 64.40000
#10 Oct 74.00000 39.66667
#11 Nov 66.20000 58.71429
#12 Dec 38.25000 51.50000
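A side note: spread() has since been superseded in tidyr by pivot_wider(), so the same reshape can also be written as below (same grouping and mean() aggregation as above, with dplyr and tidyr loaded; assumes a tidyr version that provides pivot_wider()):
df %>%
  group_by(year, month = month.abb[month]) %>%
  summarise(mm = mean(mm), .groups = "drop") %>%
  # one column per year, filled with the aggregated mm values
  pivot_wider(names_from = year, values_from = mm) %>%
  arrange(match(month, month.abb))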
I have a table that uses unique IDs but inconsistent readable names for those IDs. It is more complex than month names, but for the sake of a simpler example, let's say it looks something like this:
demo_frame <- read.table(text=" Month_id Month_name Number
1 Jan 37
2 Feb 63
3 March 9
3 Mar 150
2 February 49", header=TRUE)
Except that they might have spelled "Feb" or "March" eight different ways. I also have a clean data frame that contains consistent names for the names that have variations:
month_lookup <- read.table(text=" Month_id Month_name
2 Feb
3 Mar", header=TRUE)
I want to get to this:
1 Jan 37
2 Feb 63
3 Mar 9
3 Mar 150
2 Feb 49
I tried merge(month_lookup, demo_frame, by = "Month_id") but that dropped all the January values because "Jan" doesn't exist in the lookup table:
Month_id Month_name.x Month_name.y Number
1 2 Feb Feb 63
2 2 Feb February 49
3 3 Mar March 9
4 3 Mar Mar 150
My read of How to replace data.frame column names with string in corresponding lookup table in R is that I ought to be able to use plyr::mapvalues but I'm unclear from examples and documentation on how I'd map the id to the name. I don't just want to say "Replace 'March' with 'Mar'" -- I need to say SET month_name = 'Mar' WHERE month_id = 3 for each value in lookup.
I think you want this.
library(dplyr)
demo_frame <- read.table(text=" Month_id Month_name Number
1 Jan 37
2 Feb 63
3 March 9
3 Mar 150
2 February 49", header=TRUE, stringsAsFactors = FALSE)
month_lookup <- read.table(text=" Month_id Month_name
2 Feb
3 Mar", header=TRUE, stringsAsFactors = FALSE)
result =
  demo_frame %>%
  # set the original (possibly inconsistent) names aside
  rename(bad_month = Month_name) %>%
  # joins on Month_id and adds the clean Month_name (NA where the id has no lookup entry)
  left_join(month_lookup) %>%
  # use the clean name where it exists, otherwise fall back to the original
  mutate(month_fix =
           Month_name %>%
           is.na %>%
           ifelse(bad_month, Month_name))
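As a usage note (not part of the answer above), dplyr::coalesce() expresses the same "use the lookup name where it exists, otherwise keep the original" logic a little more directly and returns just the three columns the question asks for. The _clean suffix is only an illustrative choice:
demo_frame %>%
  left_join(month_lookup, by = "Month_id", suffix = c("", "_clean")) %>%
  # prefer the clean lookup name; keep the original where the lookup has no entry
  mutate(Month_name = coalesce(Month_name_clean, Month_name)) %>%
  select(Month_id, Month_name, Number)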