Regex - matching text after the nth '\n' - r

I have a sample text like this:
"\n Apr 15, 2019\n 12:00 PM – 3:00 PMWMC 2502, Burnaby\n "
I want to extract the date, time and location separately.
What I am thinking is to extract whatever comes before the second "\n", which should give me "\n Apr 15, 2019". Then I can remove the "\n" and the white space.
Then for the time, I want to remove whatever comes before the second "\n" and whatever comes after the last "PM".
For the location, just keep whatever comes after that "PM", then remove the "\n" and the white space.
Here is the result I want:
[1] Apr 15, 2019
[2] 12:00 PM – 3:00 PM
[3] WMC 2502, Burnaby
Could anyone tell me how to do this? Doing it in some other ways is fine too.
Thanks.

Here is a base R one-liner using strsplit:
sapply(strsplit(ss, "(\\s{2,}|(?<=[AP]M)(?=\\w))", perl = TRUE), function(x) x[x != ""])
#     [,1]
#[1,] "Apr 15, 2019"
#[2,] "12:00 PM – 3:00 PM"
#[3,] "WMC 2502, Burnaby"
It's difficult to say how well this generalises on account of the very small sample string.
Explanation: We split ss either on a run of at least two whitespace characters, "\\s{2,}" (this avoids splitting on a single space), or at the zero-width position that is preceded by "[AP]M" (positive look-behind) and followed by a word character, i.e. not whitespace (positive look-ahead): "(?<=[AP]M)(?=\\w)".
Sample data
ss <- "\n Apr 15, 2019\n 12:00 PM – 3:00 PMWMC 2502, Burnaby\n "
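The two-step plan from the question (split on the newlines, then split after the final AM/PM) can also be spelled out directly. This is only a minimal base R sketch. It assumes every string has the same date / time / location layout as the sample, and that the location starts with a word character right after the last AM or PM. The names parts, time_loc_re, date_part, time_part and location_part are illustrative.

```r
ss <- "\n Apr 15, 2019\n 12:00 PM – 3:00 PMWMC 2502, Burnaby\n "

# Step 1: split on newlines, trim each piece, drop the empty ones
parts <- trimws(strsplit(ss, "\n", fixed = TRUE)[[1]])
parts <- parts[nzchar(parts)]
date_part <- parts[1]

# Step 2: the greedy ".*" makes "[AP]M" bind to the last AM/PM that is
# directly followed by a word character (assumed start of the location)
time_loc_re <- "^(.*[AP]M)(\\w.*)$"
time_part     <- sub(time_loc_re, "\\1", parts[2])
location_part <- sub(time_loc_re, "\\2", parts[2])

c(date_part, time_part, location_part)
```

If a string does not match time_loc_re, sub() returns it unchanged, so it may be worth checking with grepl() first on real data.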

This should work if your strings share the same structure as the sample text (x is the sample string).
library(dplyr)
library(stringr)
str_split(x, "\\n", simplify = TRUE) %>%  # one matrix column per "\n"-separated piece
  trimws() %>%
  as.data.frame() %>%
  mutate(
    time = str_extract(V3, "^.+PM"),      # everything up to the last "PM"
    location = str_remove(V3, "^.+PM")    # whatever follows it
  ) %>%
  select(
    date = 2,
    time,
    location
  )
#           date               time          location
# 1 Apr 15, 2019 12:00 PM – 3:00 PM WMC 2502, Burnaby

Related

How to extract patterns along with dates in string using R?

I want to extract the dates along with a regex pattern (dates come after the pattern) from sentences.
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
The pattern is "number of subscribers", followed by a date in "Month Day, Year" format. Sometimes "as of" or "in" sits between the pattern and the date, and sometimes there is nothing in between.
I have tried the following script.
find_dates <- function(text){
  pattern <- "\\bnumber\\s+of\\s+subscribers\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words
  str_extract(text, pattern)
}
However, this extracts the in-between words too, which I would like to ignore.
Desired output:
find_dates(text1)
'number of subscribers December 31, 2022'
find_dates(text2)
'number of subscribers January 10, 2023'
An approach using stringr:
library(stringr)
find_Dates <- function(x) {
  paste0(str_extract_all(x,
    "\\bnumber\\b (\\b\\S+\\b ){2}|\\b\\S+\\b \\d{2}, \\d{4}")[[1]], collapse = "")
}
find_Dates(text1)
# [1] "number of subscribers December 31, 2022"
# all texts
lapply(c(text1, text2), find_Dates)
# [[1]]
# [1] "number of subscribers December 31, 2022"
# [[2]]
# [1] "number of subscribers January 10, 2023"
Alternatively, make the in-between words an optional non-capturing group and return the two capture groups separately:
library(stringr) # the group argument of str_extract() needs stringr >= 1.5.0
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
find_dates <- function(text){
  # pattern <- "(\\bnumber\\s+of\\s+subscribers)\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words
  # "number of subscribers", an optional "as of" / "in", then the next 3 words
  pattern <- "(\\bnumber\\s+of\\s+subscribers)(?:\\s+as\\s+of|\\s+in)?\\s+(\\S+(\\s+\\S+){2})"
  str_extract(text, pattern, group = 1:2)
}
find_dates(text1)
# [1] "number of subscribers" "December 31, 2022"
find_dates(text2)
# [1] "number of subscribers" "January 10, 2023"
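To get the single string asked for in the question, the two pieces can be captured and pasted together. The following is a hedged base R sketch rather than either answerer's method. It assumes the dates always use full English month names (taken from the built-in month.name), and the function name find_dates_joined is made up for illustration.

```r
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."

# Date pattern built from the built-in English month names (assumption:
# no abbreviated months such as "Dec." occur in the input)
date_pat <- paste0("(?:", paste(month.name, collapse = "|"),
                   ")\\s+\\d{1,2},\\s+\\d{4}")

find_dates_joined <- function(text) {
  # group 1: the fixed phrase; ".*?" lazily skips any filler; group 2: the date
  pattern <- paste0("\\b(number\\s+of\\s+subscribers)\\b.*?(", date_pat, ")")
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))
  # each element of m is c(full match, group 1, group 2), or empty if no match
  vapply(m, function(g) if (length(g) == 3) paste(g[2], g[3]) else NA_character_,
         character(1))
}

find_dates_joined(c(text1, text2))
```

Because the filler is skipped lazily rather than listed explicitly, this also covers inputs with no words between the phrase and the date, at the cost of accepting arbitrary text in between.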

Writing a function to clean string data and rename columns

I am writing a function to be applied to many individual matrices. Each matrix has 5 columns of string text. I want to remove the piece of one string that exactly matches the string in another element, apply a couple more stringr functions, convert the result to a data frame, and rename the columns. In the last step I want to append a number to each column name, since I will apply this to many matrices and need to identify the columns later.
This is very similar to another function I wrote so I can't figure out why it won't work. I tried running each line individually by filling in the inputs like this and it works perfectly:
Review1[,4] <- str_remove(Review1[,4], Review1[,3])
Review1[,4] <- str_sub(Review1[,4], 4, -4)
Review1[,4] <- str_trim(Review1[,4], "both")
Review1 <- as.data.frame(Review1)
colnames(Review1) <- c("Title", "Rating", "Date", "User", "Text")
Review1 <- Review1 %>% rename_all(paste0, 1)
But when I run the function nothing seems to happen at all.
Transform_Reviews <- function(x, y, z, a) {
  x[,y] <- str_remove(x[,y], x[,z])
  x[,y] <- str_sub(x[,y], 4, -4)
  x[,y] <- str_trim(x[,y], "both")
  x <- as.data.frame(x)
  colnames(x) <- c("Title", "Rating", "Date", "User", "Text")
  x <- x %>% rename_all(paste0, a)
}
Transform_Reviews(Review1, 4, 3, 1)
This is the only warning message I get. I also receive this when I run the str_remove function individually, but it still changes the elements. But it changes nothing when I run the UDF.
Warning messages:
1: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), ... :
empty search patterns are not supported
This is an example of the part of Review1 that I'm working with.
     [,3]           [,4]
[1,] "6 April 2014" "By Copnovelist on 6 April 2014"
[2,] "18 Dec. 2015" "By kenneth bell on 18 Dec. 2015"
[3,] "26 May 2015"  "By Simon.B :-) on 26 May 2015"
[4,] "22 July 2013" "By Lilla Lukacs on 22 July 2013"
This is what I want the output to look like:
         Date1        User1
1 6 April 2014  Copnovelist
2 18 Dec. 2015 kenneth bell
3  26 May 2015  Simon.B :-)
4 22 July 2013 Lilla Lukacs
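I realized I just needed to assign the result back to see my function work. R functions operate on a copy of their arguments, and because the last expression in the function body is itself an assignment, the value is returned invisibly, so nothing is printed either.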
Review1 <- Transform_Reviews(Review1, 4, 3, 1)
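The same behaviour can be seen on a much smaller case. This is a minimal base R sketch with a made-up helper (rename_with_suffix) and a one-row stand-in for the review data. It only illustrates that the caller's object changes when the result is assigned back.

```r
# Hypothetical helper: append a suffix to every column name
rename_with_suffix <- function(df, suffix) {
  names(df) <- paste0(names(df), suffix)
  df  # return the modified copy visibly
}

reviews <- data.frame(Date = "6 April 2014", User = "Copnovelist")

rename_with_suffix(reviews, 1)  # prints a renamed copy
names(reviews)                  # still "Date" "User": the original is untouched

reviews <- rename_with_suffix(reviews, 1)
names(reviews)                  # now "Date1" "User1"
```

Ending the function with the bare object name (or return(x)) also makes the result print when the function is called without assignment, which makes this kind of mistake easier to spot.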

Extract part of string: date and times

I have a variable that usually has some gibberish like:
\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n
I am trying to extract the date (30.07.2019) and time (12:00 - 14:30). I am not very good with parsing so some help with implementing this in R would be appreciated.
If you can rely on the fact that the date and time part only occurs once in your data, you could use regular expressions to extract them (here using a data frame):
library(tidyverse)
data <-
  tibble(gibberish_string = "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n")
data %>%
  mutate(
    date = str_extract(gibberish_string, "\\d{1,2}\\.\\d{1,2}\\.\\d{4}"),
    # match the whole "start - end" range, not just the first time
    time = str_extract(gibberish_string, "\\d{1,2}:\\d{2}\\s*-\\s*\\d{1,2}:\\d{2}")
  )
String split, then extract date and times:
x <- "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"
lapply(strsplit(x, "[\n\t ]"), function(i){
  # escape the dots so they match a literal "." only
  dd <- i[ grepl("[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}", i) ]
  tt <- i[ grepl("[0-9]{2}:[0-9]{2}", i) ]
  c(dd, paste(tt, collapse = "-"))
})
# [[1]]
# [1] "30.07.2019" "12:00-14:30"
A regex for the date:
(\d{1,2}[\.\/]){2}((\d{4})|(\d{2}))
And one for the time range:
\d{1,2}:\d{2}\s?-\s?\d{1,2}:\d{2}
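The two patterns above are written in plain regex notation. Inside an R string literal every backslash has to be doubled. A minimal base R sketch of how they might be applied is below; the variable names date_re, time_re, date_found and time_found are illustrative, and the date pattern is simplified slightly by dropping escapes and groups that are not needed.

```r
x <- "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"

# Same patterns as above, with backslashes doubled for R
date_re <- "(\\d{1,2}[./]){2}(\\d{4}|\\d{2})"
time_re <- "\\d{1,2}:\\d{2}\\s?-\\s?\\d{1,2}:\\d{2}"

# regexpr() locates the first match; regmatches() pulls out the matched text
date_found <- regmatches(x, regexpr(date_re, x, perl = TRUE))
time_found <- regmatches(x, regexpr(time_re, x, perl = TRUE))

date_found
time_found
```

Note that regmatches() silently drops elements with no match, so on a longer vector the results may not line up with the input unless non-matches are handled separately.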
A kind of lengthy step by step base/stringr approach:
tst <- "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"
cleaner <- gsub("\\n|\\t", "", tst)
split_txt <- strsplit(cleaner, "\\s(?=[a-z])", perl = TRUE)
dates <- stringr::str_extract_all(unlist(split_txt),
                                  "\\d{1,}\\.\\d{2,}\\.\\d{4}")
times <- stringr::str_extract_all(stringr::str_remove_all(unlist(split_txt),
                                                          "[A-Za-z]"),
                                  ".*\\-.*")
dates[lengths(dates) > 0]
# [[1]]
# [1] "30.07.2019"
trimws(times[lengths(times) > 0])
# [1] "12:00 - 14:30"

Extract dates from a vector of character strings

I have a vector with two elements. Each element contains a string of characters with two sets of dates. I need to extract the latter of these two dates and make a new vector or list with them.
#webextract vector
webextract <- list("The Employment Situation, December 2006 January 5 \t 8:30 am\r","The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r")
# This is what webextract looks like when printed:
[[1]]
[1] The Employment Situation, December 2006 January 5 \t 8:30 am\r
[[2]]
[1] The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r
webextract is the result of web scraping a URL with plain text, which is why it looks like that. What I need to extract is "January 5" and "Feb. 2". I have been experimenting with grep and strsplit and failed to get anywhere, and I have gone through all the related SO questions without success. Thank you for your help.
We can try gsub after unlisting webextract:
gsub("^\\D+\\d+\\s+|(,\\s+\\d+)*\\D+\\d+:.*$", "", unlist(webextract))
#[1] "January 5" "Feb. 2"
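To see what each half of that alternation removes, the pattern can be applied in two separate sub() calls. This is just an illustrative breakdown of the regex above, with step1 and step2 as made-up intermediate names.

```r
webextract <- list(
  "The Employment Situation, December 2006 January 5 \t 8:30 am\r",
  "The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r"
)
s <- unlist(webextract)

# First alternative: from the start, strip the title (non-digits), the first
# number (the year of the report) and the whitespace after it
step1 <- sub("^\\D+\\d+\\s+", "", s)

# Second alternative: strip an optional ", <year>", then the non-digits and
# digits leading up to the ":" of the clock time, and everything after it
step2 <- sub("(,\\s+\\d+)*\\D+\\d+:.*$", "", step1)

step2
```

The approach leans on the release date being the first thing after the report's year and on a clock time with a colon following it, so lines that break either assumption would need a different pattern.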

Creating a unified time-series, with dates coming from different (natural) languages

I am using the as.Date function as follows:
x$time_date <- as.Date(x$time_date, format = "%H:%M - %d %b %Y")
This worked fine until I saw a lot of NA values in the output, which I traced back to some of the dates stemming from a different language: German.
My English dates look like this: 18:00 - 10 Dec 2014
Where the German equivalent is: 18:00 - 10 Dez 2014
The month December is abbreviated the German way. This is not recognised by the as.Date function. I have the same problem for five other months:
Mar - März
May - Mai
Jun - Juni
Jul - Juli
Oct - Okt
This looks like it would be of use, but I am unsure of how to implement it for 'unrecognised' formats:
How to change multiple Date formats in same column
I attempted to just go through and use gsub to replace all the occurrences of German months, but without luck. x below is the data.table and I work on just the time_date column:
x$time_date <- gsub("(März)?", "Mar", x$time_date) %>%
gsub("(Mai)?", "May", .) %>%
gsub("(Juni)?", "Jun", .) %>%
gsub("(Juli)?", "Jul", .) %>%
gsub("(Okt)?", "Oct", .) %>%
gsub("(Dez)?", "Dec", .)
Not only did this not work, but it is also a very slow process and I have nearly 20 GB of pure .csv files to work through.
In the as.Date documentation there is mention of different locales / languages, but not of how to work with several simultaneously. I also found instructions on how to use different languages. However, my data is all mixed, so I can only think of a conditional loop using the correct language for each file, and that would also be slow.
Is there a known workaround for this, which I can't find?
Create a lookup table tab that contains all the translations, then use subscripting to do the actual translation. The code below seems to work for me on Windows, provided your input abbreviations are the same as the standard ones generated. The precise language names ("German", etc.) may vary depending on your system; see ?Sys.setlocale for more information. If the abbreviations in your input differ from the ones generated here, you will have to add those to tab yourself, e.g. tab <- c(tab, Juli = "Jul")
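langs <- c("French", "German", "English")
old_locale <- Sys.getlocale("LC_TIME")   # remember the current setting
tab <- unlist(lapply(langs, function(lang) {
  Sys.setlocale("LC_TIME", lang)
  nms <- format(ISOdate(2000, 1:12, 1), "%b")   # month abbreviations in that locale
  setNames(month.abb, nms)                       # names: foreign, values: English
}))
Sys.setlocale("LC_TIME", old_locale)     # restore it afterwards
x <- c("18:00 - 10 Juli 2014", "18:00 - 10 Mai 2014") # test input
source_month <- gsub("[^[:alpha:]]", "", x)
mapply(sub, source_month, tab[source_month], x, USE.NAMES = FALSE)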
giving:
[1] "18:00 - 10 Jul 2014" "18:00 - 10 May 2014"
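A locale-independent variant of the same lookup idea is to map each month token straight to its number and parse with %m instead of %b. The sketch below is an assumption-laden illustration, not part of the answer above: month_map is a hand-written table covering only the English and German spellings mentioned in the question, and the input is assumed to always end in "<month> <4-digit year>". As an aside, the gsub attempt in the question fails because a pattern such as "(Dez)?" also matches the empty string, so the replacement gets inserted at every position.

```r
# Hypothetical lookup table: English and German spellings -> month number
month_map <- c(
  Jan = "01", Feb = "02", Mar = "03", "März" = "03", Apr = "04",
  May = "05", Mai = "05", Jun = "06", Juni = "06", Jul = "07", Juli = "07",
  Aug = "08", Sep = "09", Oct = "10", Okt = "10", Nov = "11",
  Dec = "12", Dez = "12"
)

x <- c("18:00 - 10 Dec 2014", "18:00 - 10 Dez 2014", "18:00 - 10 Juli 2014")

# Pull out the month token (the word just before the 4-digit year)
month_txt <- sub("^.* (\\S+) \\d{4}$", "\\1", x)

# Swap each month token for its number
normalised <- mapply(
  function(s, m) sub(m, month_map[m], s, fixed = TRUE),
  x, month_txt, USE.NAMES = FALSE
)

# %m is numeric, so parsing no longer depends on the LC_TIME locale
dates <- as.Date(normalised, format = "%H:%M - %d %m %Y")
dates
```

For very large files, the per-element mapply() call is likely to be the slow part. The sketch favours clarity over speed.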
