I want to extract dates that follow a fixed phrase in a sentence, together with the phrase itself.
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
The phrase is "number of subscribers", followed by a date in "Month Day, Year" format. Between the phrase and the date there may be "as of", "in", or nothing at all.
I have tried the following script.
find_dates <- function(text){
pattern <- "\\bnumber\\s+of\\s+subscribers\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words
str_extract(text, pattern)
}
However, this extracts the in-between words too, which I would like to ignore.
Desired output:
find_dates(text1)
'number of subscribers December 31, 2022'
find_dates(text2)
'number of subscribers January 10, 2023'
An approach using stringr
library(stringr)
find_Dates <- function(x) paste0(str_extract_all(x,
"\\bnumber\\b (\\b\\S+\\b ){2}|\\b\\S+\\b \\d{2}, \\d{4}")[[1]], collapse="")
find_Dates(text1)
[1] "number of subscribers December 31, 2022"
# all texts
lapply(c(text1, text2), find_Dates)
[[1]]
[1] "number of subscribers December 31, 2022"
[[2]]
[1] "number of subscribers January 10, 2023"
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
find_dates <- function(text){
# pattern <- "(\\bnumber\\s+of\\s+subscribers)\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words
# group 1: the phrase; then an optional "as of" or "in"; group 2: the next 3 words (the date)
pattern <- "(\\bnumber\\s+of\\s+subscribers)(?:\\s+as\\s+of|\\s+in)?\\s+(\\S+(?:\\s+\\S+){2})"
str_extract(text, pattern, 1:2) # third argument is `group`, which needs stringr >= 1.5.0
}
find_dates(text1)
# [1] "number of subscribers" "December 31, 2022"
find_dates(text2)
# [1] "number of subscribers" "January 10, 2023"
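Both answers above take "the next three words" as the date. A minimal base R sketch that instead requires an actual month name, and tolerates up to two filler words between the phrase and the date, could look like the following. The helper name find_dates_strict and the limit of two filler words are assumptions made for illustration, not part of the answers above.

```r
# find_dates_strict() is a hypothetical helper name used only for this sketch.
find_dates_strict <- function(text) {
  months <- paste(month.name, collapse = "|")
  pattern <- paste0(
    "\\b(number\\s+of\\s+subscribers)\\b",       # group 1: the phrase
    "(?:\\s+\\w+){0,2}?\\s+",                     # skip at most two filler words (assumed limit)
    "((?:", months, ")\\s+\\d{1,2},\\s+\\d{4})"   # group 2: Month Day, Year
  )
  vapply(text, function(s) {
    m <- regmatches(s, regexec(pattern, s, perl = TRUE))[[1]]
    # m is c(full match, group 1, group 2), or character(0) when nothing matched
    if (length(m) == 0) NA_character_ else paste(m[2], m[3])
  }, character(1), USE.NAMES = FALSE)
}

text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
find_dates_strict(c(text1, text2))
# [1] "number of subscribers December 31, 2022" "number of subscribers January 10, 2023"
```

Anchoring on month.name means a sentence without a date returns NA instead of three arbitrary words.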
Related
In the example below I am trying to extract the text between 'Supreme Court' or 'Supreme Court of the United States' and the next date (including the date). The result below is not what I intended since result 2 includes "of the United States".
I assume the error is due to the .*? part since . can also match 'of the United States'. Any ideas how to exclude it?
I guess more generally speaking, the question is how to include an optional 'element' into a lookbehind (which seems not to be possible since ? makes it a non-fixed length input).
Many thanks!
library(tidyverse)
txt <- c("The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also.")
str_extract_all(txt, regex("(?<=Supreme Court)(\\sof the United States)?.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
#> [[1]]
#> [1] " decided on 2 April 2020"
#> [2] " of the United States decided on 5 March 2011"
Created on 2021-12-09 by the reprex package (v2.0.1)
I also tried
str_extract_all(txt, regex("(?<=(Supreme Court)|(Supreme Court of the United States)).*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
however the result is the same.
In this case, I would prefer the PCRE engine that base R provides (perl = TRUE) over the ICU engine that stringr/stringi uses, because PCRE supports \K, which drops everything matched so far from the returned match.
pattern <- "Supreme Court (of the United States ?)?\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
[[1]]
[1] "decided on 2 April 2020" "decided on 5 March 2011"
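To make the role of \K more explicit, here is a small variation on the pattern above with the optional phrase written as a non-capturing group. It is only a sketch of the same idea; the variable name res is an assumption for illustration.

```r
txt <- "The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also."

# Everything to the left of \\K has to match, but is dropped from the returned text.
# Unlike a lookbehind, that left-hand part may contain an optional (variable-length) piece.
pattern <- "Supreme Court(?: of the United States)?\\s+\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"

res <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
res
# [1] "decided on 2 April 2020" "decided on 5 March 2011"
```

Because the optional group is greedy, "of the United States" is consumed before \K whenever it is present, so it never appears in the result.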
You can do this with str_match_all and group capture:
str_match_all(txt, regex("Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")) %>%
.[[1]] %>% .[, 2]
[1] " decided on 2 April 2020" " decided on 5 March 2011"
How can I extract the words next to a specific word? Example:
"On June 28, Jane went to the cinema and ate popcorn"
I would like to choose 'Jane' and get [-2,2], meaning:
"June 28, Jane went to"
We could make a function to help out. This might make it a little more dynamic.
library(tidyverse)
txt <- "On June 28, Jane went to the cinema and ate popcorn"
grab_text <- function(text, target, before, after){
  words <- str_split(text, "\\s")[[1]]    # split the sentence into words
  pos <- which(grepl(target, words))[1]   # position of the first word matching target
  from <- max(1, pos - before)            # do not run past the start of the sentence
  to <- min(length(words), pos + after)   # do not run past the end of the sentence
  paste(words[from:to], collapse = " ")
}
grab_text(text = txt, target = "Jane", before = 2, after = 2)
#> [1] "June 28, Jane went to"
First we split the sentence, then we figure out the position of the target, then we grab any word before or after (number specified in the function), last we collapse the sentence back together.
I have a shorter version using str_extract from stringr
library(stringr)
txt <- "On June 28, Jane went to the cinema and ate popcorn"
str_extract(txt,"([^\\s]+\\s+){2}Jane(\\s+[^\\s]+){2}")
[1] "June 28, Jane went to"
The function str_extract extracts the pattern from the string. The regex \\s matches whitespace, and [^\\s] is its negation, so anything but whitespace. The whole pattern is therefore Jane, preceded by two runs of non-whitespace characters each followed by whitespace, and followed by two runs of whitespace each followed by non-whitespace characters.
The advantage is that it is already vectorized, and if you have a vector of text you can use str_extract_all:
s <- c("On June 28, Jane went to the cinema and ate popcorn.
The next day, Jane hiked on a trail.",
"an indeed Jane loved it a lot")
str_extract_all(s,"([^\\s]+\\s+){2}Jane(\\s+[^\\s]+){2}")
[[1]]
[1] "June 28, Jane went to" "next day, Jane hiked on"
[[2]]
[1] "an indeed Jane loved it"
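The number of words on each side is hard-coded as {2} in the pattern above. A minimal base R sketch that builds the same kind of pattern from parameters is shown below. The helper name context_pattern is hypothetical, and the sketch assumes the target word contains no regex metacharacters.

```r
# context_pattern() is a hypothetical helper name used only for this sketch.
# Assumption: `target` is a plain word without regex metacharacters.
context_pattern <- function(target, before = 2, after = 2) {
  sprintf("(?:\\S+\\s+){%d}%s(?:\\s+\\S+){%d}", before, target, after)
}

txt <- "On June 28, Jane went to the cinema and ate popcorn"

two_each_side <- regmatches(txt, regexpr(context_pattern("Jane", 2, 2), txt, perl = TRUE))
one_each_side <- regmatches(txt, regexpr(context_pattern("Jane", 1, 1), txt, perl = TRUE))

two_each_side
# [1] "June 28, Jane went to"
one_each_side
# [1] "28, Jane went"
```

As with the fixed pattern, nothing is returned when fewer words than requested surround the target.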
Here's an example with an expansion for multiple occurrences. Basically, split on whitespace, find the word, expand the indices, then make a list of results.
s <- "On June 28, Jane went to the cinema and ate popcorn. The next day, Jane hiked on a trail."
words <- strsplit(s, '\\s+')[[1]]
inds <- grep('Jane', words)
lapply(inds, FUN = function(i) {
paste(words[max(1, i-2):min(length(words), i+2)], collapse = ' ')
})
#> [[1]]
#> [1] "June 28, Jane went to"
#>
#> [[2]]
#> [1] "next day, Jane hiked on"
Created on 2019-09-17 by the reprex package (v0.3.0)
This should work when at least five words surround Jane on each side; for the example above, change both {5} to {2}:
stringr::str_extract(text, "(?:[^\\s]+\\s){5}Jane(?:\\s[^\\s]+){5}")
I have a sample text like this:
"\n Apr 15, 2019\n 12:00 PM – 3:00 PMWMC 2502, Burnaby\n "
I want to extract the date, time and location separately.
What I am thinking is to extract whatever comes before the second "\n", which should give me "\n Apr 15, 2019". Then I can remove the "\n" and the white space.
Then for the time, I want to remove whatever before the second "\n" and whatever after "PM".
For the location, just keep whatever after PM, then remove the "\n" and white spaces.
Here is the result I want:
[1] Apr 15, 2019
[2] 12:00 PM – 3:00 PM
[3] WMC 2502, Burnaby
Could anyone tell me how to do this? Doing it in some other ways is fine too.
Thanks.
Here is a base R one-liner using strsplit
sapply(strsplit(ss, "(\\s{2,}|(?<=[AP]M)(?=\\w))", perl = TRUE), function(x) x[x != ""])
#     [,1]
#[1,] "Apr 15, 2019"
#[2,] "12:00 PM – 3:00 PM"
#[3,] "WMC 2502, Burnaby"
It's difficult to say how well this generalises on account of the very small sample string.
Explanation: We split ss on either a stretch of at least 2 whitespaces "\\s{2,}" (this avoids splitting on a single whitespace), or at a position that is preceded by "[AP]M" through a positive look-behind and followed by a word character (i.e. not a whitespace) through a positive look-ahead "(?<=[AP]M)(?=\\w)".
Sample data
ss <- "\n Apr 15, 2019\n 12:00 PM – 3:00 PMWMC 2502, Burnaby\n "
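As an alternative to splitting, the three fields can also be captured directly with one pattern and regexec. This is only a sketch under the assumption that every string has the same layout as the sample (date on the first non-empty line, then a time range ending in AM or PM immediately followed by the location). The names pattern and parts are illustrative.

```r
ss <- "\n Apr 15, 2019\n 12:00 PM – 3:00 PMWMC 2502, Burnaby\n "

pattern <- paste0(
  "^\\s*(.+?)\\s*\\n\\s*",  # group 1: the date line
  "(.+[AP]M)",              # group 2: everything up to the last AM/PM on the second line
  "(.+?)\\s*$"              # group 3: the rest of that line, the location
)

# regexec returns the full match followed by the groups; [-1] drops the full match
parts <- regmatches(ss, regexec(pattern, ss, perl = TRUE))[[1]][-1]
names(parts) <- c("date", "time", "location")
parts
```

Naming the pieces makes it easy to pull out parts[["date"]] or parts[["location"]] afterwards.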
This should work if your strings share the same structure with the sample text.
library(dplyr)
library(stringr)
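str_split(x, "\\n", simplify = TRUE) %>% # x is the sample string
trimws() %>%
as.data.frame() %>%
mutate(
time = str_extract(V3, "^.+PM"), # everything up to the last "PM"
location = str_remove(V3, fixed(time)) # what is left once the time is removed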
) %>%
select(
date = 2,
time,
location
)
# date time location
# 1 Apr 15, 2019 12:00 PM – 3:00 PM WMC 2502, Burnaby
I tried to extract a date from the following text. Unfortunately, it keeps giving me a warning and the result is NA.
I have a following text:
"IRA-401K Investment Assets Under Management (AUM) As of July 31, 2018 BMG Funds
$217,743,573 BMG BullionBars $45,176,561 TOTAL $262,920,134 Physical Holdings Download
Scotiabank BMG BullionBars List Download Brinks BMG BullionBars List Holdings by Ounces As
of July 31, 2018 Gold Bars 21,132.496 Silver Bars 453,531.574 Silver Coins
80,500 Platinum Bars"
The text contains the following date: July 31, 2018. This date appears twice in the text.
I used following code to extract the dates out of the text.
test_take <- lapply(cleanurl_text, parse_date_time, orders = "mdy",
locale = Sys.setlocale('LC_TIME', locale = "English_Canada.1252"))
I get the following warning:
Warning message:
All formats failed to parse. No formats found.
When I include exact = TRUE
test_take <- lapply(as.character(cleanurl_text), parse_date_time, orders = "mdy",
locale = Sys.setlocale('LC_TIME', locale = "English_Canada.1252"), exact = TRUE)
I get the following warning:
Warning message:
1 failed to parse.
The resulting object still contains NA.
The following regex can extract the date in the posted format.
pattern <- paste(month.name, collapse = "|")
pattern <- paste0("(", pattern, ")\\s\\d{1,2}.{1,2}\\d{4}")
m <- gregexpr(pattern, cleanurl_text)
regmatches(cleanurl_text, m)
#[[1]]
#[1] "July 31, 2018" "July 31, 2018"
Note that this can be done in just one code line, regmatches(gregexpr(.)), but I have opted for two lines in order to make it more readable.
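Since the original goal was a parsed date rather than a string, the matches can then be converted. The sketch below maps the month name through month.name instead of relying on the locale's month names. The helper name to_date and the shortened sample text are assumptions for illustration, and the pattern is a slightly stricter variant of the one above (it requires the comma).

```r
# Shortened, illustrative excerpt of the text from the question.
cleanurl_text <- "Assets Under Management (AUM) As of July 31, 2018 BMG Funds $217,743,573 Holdings by Ounces As of July 31, 2018 Gold Bars 21,132.496"

pattern <- paste0("(", paste(month.name, collapse = "|"), ")\\s\\d{1,2},\\s\\d{4}")
found <- unique(regmatches(cleanurl_text, gregexpr(pattern, cleanurl_text))[[1]])

# to_date() is a hypothetical helper name used only for this sketch.
to_date <- function(x) {
  parts <- regmatches(x, regexec("^([A-Za-z]+)\\s(\\d{1,2}),\\s(\\d{4})$", x))
  iso <- vapply(parts, function(p) {
    # p is c(full match, month name, day, year)
    sprintf("%s-%02d-%02d", p[4], match(p[2], month.name), as.integer(p[3]))
  }, character(1))
  as.Date(iso)
}

dates <- to_date(found)
dates
# [1] "2018-07-31"
```

Building an ISO string first keeps the conversion independent of the LC_TIME setting that the question had to change.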
I have a list with two elements. Each element is a string of characters
containing two dates. I need to extract the latter of these two dates
and make a new vector or list with them.
#webextract vector
webextract <- list("The Employment Situation, December 2006 January 5 \t 8:30 am\r","The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r")
#This is what the output of webextract looks like:
[[1]]
[1] The Employment Situation, December 2006 January 5 \t 8:30 am\r
[[2]]
[1] The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r
webextract is the result of web scraping a URL with plain text, which is why it looks like that. What I need to extract is "January 5" and "Feb. 2". I have been experimenting with grep and strsplit and failed to get anywhere. I have gone through all related SO questions without success. Thank you for your help.
We can try with gsub after unlisting the 'webextract'
gsub("^\\D+\\d+\\s+|(,\\s+\\d+)*\\D+\\d+:.*$", "", unlist(webextract))
#[1] "January 5" "Feb. 2"
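The gsub call works by deleting everything around the wanted text. A sketch of the opposite approach, matching "month name or abbreviation plus a one- or two-digit day" and keeping the last hit per string, is shown below. The names month_day, hits and release are illustrative, and the sketch assumes English month names as in the sample.

```r
webextract <- list(
  "The Employment Situation, December 2006 January 5 \t 8:30 am\r",
  "The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r"
)

# Month abbreviation, optional rest of the name, optional dot, then a 1-2 digit day.
# The trailing \\b stops "December 2006" from matching as "December 20".
month_day <- paste0("\\b(?:", paste(month.abb, collapse = "|"), ")[a-z]*\\.?\\s+\\d{1,2}\\b")

x <- unlist(webextract)
hits <- regmatches(x, gregexpr(month_day, x, perl = TRUE))

# Keep the last match in each string, or NA when there is none
release <- vapply(hits, function(h) if (length(h)) h[length(h)] else NA_character_, character(1))
release
# [1] "January 5" "Feb. 2"
```

Extracting rather than deleting tends to fail more visibly (with NA) when a line does not follow the expected layout.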