Optional pattern part in regex lookbehind - r

In the example below I am trying to extract the text between 'Supreme Court' or 'Supreme Court of the United States' and the next date (including the date). The result below is not what I intended since result 2 includes "of the United States".
I assume the error is due to the .*? part, since . can also match 'of the United States'. Any ideas on how to exclude it?
More generally, the question is how to include an optional element in a lookbehind (which seems not to be possible, since ? makes the lookbehind variable-length).
Many thanks!
library(tidyverse)
txt <- c("The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also.")
str_extract_all(txt, regex("(?<=Supreme Court)(\\sof the United States)?.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
#> [[1]]
#> [1] " decided on 2 April 2020"
#> [2] " of the United States decided on 5 March 2011"
Created on 2021-12-09 by the reprex package (v2.0.1)
I also tried
str_extract_all(txt, regex("(?<=(Supreme Court)|(Supreme Court of the United States)).*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
however the result is the same: the engine can still satisfy the lookbehind with the shorter "Supreme Court" branch at the position right after it, so .*? goes on to consume "of the United States".

In this case, I would prefer the PCRE engine available in base R over the ICU engine that stringr/stringi uses, since PCRE supports \K.
pattern <- "Supreme Court (of the United States ?)?\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
[[1]]
[1] "decided on 2 April 2020" "decided on 5 March 2011"

You can do this with str_match_all and group capture:
str_match_all(txt, regex("Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")) %>%
  .[[1]] %>% .[, 2]
[1] " decided on 2 April 2020" " decided on 5 March 2011"

Related

How to split a string at every period, question mark and exclamation mark in R

As the title says, I want to split a string at every ., ! and ?
That doesn't work:
strsplit(x, "/ (\\?|\\.|!) /")
$`352`
[1] "Saudi Arabian Oil Minister Hisham (...)
the\n... accord and it will never sell its oil at prices below the\npronounced prices under any circumstance.\"\n Saudi Arabia was a main architect of December pact under\nwhich OPEC agreed to cut its total oil output ceiling by 7.25\npct and return to fixed prices of around 18 dollars a barrel.\n Reuter"
$`353`
[1] "Kuwait's oil minister said (...)
daily (bpd).\n Crude oil prices fell sharply last week as international\noil traders and analysts estimated the 13-nation OPEC was\npumping up to one million bpd over its self-imposed limits.\n Reuter"
$`368`
[1] "The port of Philadelphia (...)
the ship on the high tide.\n After delivering oil to a refinery in Paulsboro, New\nJersey, the ship apparently lost its steering and hit the power\ntransmission line carrying power from the nuclear plant to the\nstate of Delaware.\n Reuter"
I shortened it with "(...)" here, so that's not part of the code, obviously.
There should be far more splits, because there are periods where it doesn't split.
Jonathan V. Solórzano is right: the slashes and the spaces around the group in your pattern are matched literally, because R patterns are plain strings with no / delimiters. Drop them:
x <- "Ceci.est!un?pipe. . ."
strsplit(x, "\\?|\\.|!")
[[1]]
[1] "Ceci" "est" "un" "pipe" " " " "

How to extract a fixed number of characters before a string in R

I have a text that contains somewhere in the document a citation to a court case, such as
x <- "2009 U.S. LEXIS"
I know it is always a four-digit year plus a space in front of the pattern "U.S. LEXIS". How can I extract this four-digit year?
Thanks
I think the example vector you gave was too minimal to show the general problem.
UPDATE: try this (run here against the longer example x defined under the old answer below)
str_extract_all(x, "\\d{4}(?=\\sU.S.\\sLEXIS)")
[[1]]
[1] "2009" "2015" "1990"
OR to extract these as numbers
lapply(str_extract_all(x, "\\d{4}(?=\\sU.S.\\sLEXIS)"), as.numeric)
[[1]]
[1] 2009 2015 1990
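One caveat: the unescaped dots in U.S. match any character, not just a literal period. A stricter variant escapes them:
str_extract_all(x, "\\d{4}(?=\\sU\\.S\\.\\sLEXIS)")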
OLD ANSWER: I am also new to regex, so my solution may not be a very clean method. Your case is essentially one of nested groups in regex patterns. Still, you can try this method
x <- "aabv 2009 U.S. LEXIS abcs aa 2015 U.S. LEXIS 45 fghhg ds fdavfd 1990 U.S. LEXIS bye bye!"
> x
[1] "aabv 2009 U.S. LEXIS abcs aa 2015 U.S. LEXIS 45 fghhg ds fdavfd 1990 U.S. LEXIS bye bye!"
Now follow these steps
library(stringr)
lapply(str_extract_all(x, "(\\d{4})\\sU.S.\\sLEXIS"), str_extract, pattern = "(\\d{4})")
[[1]]
[1] "2009" "2015" "1990"
Typically "((\\d{4})\\sU.S.\\sLEXIS)" would have worked as the regex pattern, but I am not sure about nested groups in R, so I used lapply here. Basically, str_extract_all(x, "(\\d{4})\\sU.S.\\sLEXIS") returns all the citations. Try this.
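For what it's worth, the capture group can also be pulled out in one pass with str_match_all (dots escaped for strictness):
str_match_all(x, "(\\d{4})\\sU\\.S\\.\\sLEXIS")[[1]][, 2]
[1] "2009" "2015" "1990"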
You can try:
x <- "2009 U.S. LEXIS"
as.numeric(sub('.*?(\\d{4}) U.S. LEXIS', '\\1', x))
#[1] 2009
Using stringr::str_extract:
as.numeric(stringr::str_extract(x, '\\d{4}(?= U.S. LEXIS)'))
We can use parse_number from readr
library(readr)
parse_number(x)
#[1] 2009
data
x <- "2009 U.S. LEXIS"
The substr function in base R solves it, since the year is always the first four characters:
substr(x, 1, 4)
If you need it as a number, wrap it in as.numeric:
as.numeric(substr(x, 1, 4))
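Note this only works because the year opens the string. For a year at an arbitrary position, a base-R sketch that locates it first (assuming the same "U.S. LEXIS" anchor):
m <- regexpr("\\d{4}(?= U\\.S\\. LEXIS)", x, perl = TRUE)
as.numeric(regmatches(x, m))
#[1] 2009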

How to extract unique string in between string pattern in full text in R?

I'm looking to extract names and professions of those who testified in front of Congress from the following text:
text <- c(("FULL COMMITTEE HEARINGS\\", \\" 2017\\",\n\\" April 6, 2017—‘‘The 2017 Tax Filing Season: Internal Revenue\\", \", \"\\"\nService Operations and the Taxpayer Experience.’’ This hearing\\", \\" examined\nissues related to the 2017 tax filing season, including\\", \\" IRS performance,\ncustomer service challenges, and information\\", \\" technology. Testimony was\nheard from the Honorable John\\", \\" Koskinen, Commissioner, Internal Revenue\nService, Washington,\\", \", \"\\" DC.\\", \\" May 25, 2017—‘‘Fiscal Year 2018 Budget\nProposals for the Depart-\\", \\" ment of Treasury and Tax Reform.’’ The hearing\ncovered the\\", \\" President’s 2018 Budget and touched on operations of the De-\n\\", \\" partment of Treasury and Tax Reform. Testimony was heard\\", \\" from the\nHonorable Steven Mnuchin, Secretary of the Treasury,\\", \", \"\\" United States\nDepartment of the Treasury, Washington, DC.\\", \\" July 18, 2017—‘‘Comprehensive\nTax Reform: Prospects and Chal-\\", \\" lenges.’’ The hearing covered issues\nsurrounding potential tax re-\\", \\" form plans including individual, business,\nand international pro-\\", \\" posals. Testimony was heard from the Honorable\nJonathan Talis-\\", \", \"\\" man, former Assistant Secretary for Tax Policy 2000–\n2001,\\", \\" United States Department of the Treasury, Washington, DC; the\\",\n\\" Honorable Pamela F. Olson, former Assistant Secretary for Tax\\", \\" Policy\n2002–2004, United States Department of the Treasury,\\", \\" Washington, DC; the\nHonorable Eric Solomon, former Assistant\\", \", \"\\" Secretary for Tax Policy\n2006–2009, United States Department of\\", \\" the Treasury, Washington, DC; and\nthe Honorable Mark J.\\", \\" Mazur, former Assistant Secretary for Tax Policy\n2012–2017,\\", \\" United States Department of the Treasury, Washington, DC.\\",\n\\" (5)\\", \\"VerDate Sep 11 2014 14:16 Mar 28, 2019 Jkt 000000 PO 00000 Frm 00013\nFmt 6601 Sfmt 6601 R:\\\\DOCS\\\\115ACT.000 TIM\\"\", \")\")"
)
The full text is available here: https://www.congress.gov/116/crpt/srpt19/CRPT-116srpt19.pdf
It seems that the names sit between "Testimony was heard from" and the next ".". So, how can I extract the names between these two patterns? The full text is much longer (a 50-page document), but I figured that if I can do it once, I can do it for the rest of the text.
I know I can't rely on NLP alone for name extraction because the text also contains names of persons who didn't testify, for example.
NLP is likely unavoidable because of the many abbreviations in the text. Try this workflow:
Tokenize by sentence
Remove sentences without "Testimony"
Extract persons + professions from remaining sentences
There are a couple of packages with sentence tokenizers, but openNLP has generally worked best for me when dealing with abbreviation-laden sentences. The following code should get you close to your goal:
library(tidyverse)
library(pdftools)
library(openNLP)
# Get the data
testimony_url <- "https://www.congress.gov/116/crpt/srpt19/CRPT-116srpt19.pdf"
download.file(testimony_url, "testimony.pdf")
text_raw <- pdf_text("testimony.pdf")
# Clean the character vector and smoosh into one long string.
text_string <- str_squish(text_raw) %>%
  str_replace_all("- ", "") %>%
  paste(collapse = " ") %>%
  NLP::as.String()
# Annotate and extract the sentences.
annotations <- NLP::annotate(text_string, Maxent_Sent_Token_Annotator())
sentences <- text_string[annotations]
# Some sentences starting with "Testimony" list multiple persons. We need to
# split these and clean up a little.
name_title_vec <- str_subset(sentences, "Testimony was") %>%
  str_split(";") %>%
  unlist() %>%
  str_trim() %>%
  str_remove("^(Testimony .*? from|and) ") %>%
  str_subset("^\\(\\d\\)", negate = TRUE)
# Put in data frame and separate name from profession/title.
testimony_tibb <- tibble(name_title_vec) %>%
  separate(name_title_vec, c("name", "title"), sep = ", ", extra = "merge")
You should end up with the below data frame. Some additional cleaning may be necessary:
# A tibble: 95 x 2
name title
<chr> <chr>
1 the Honorable John Koskin… Commissioner, Internal Revenue Service, Washington, DC.
2 the Honorable Steven Mnuc… Secretary of the Treasury, United States Department of the Treasury…
3 the Honorable Jonathan Ta… former Assistant Secretary for Tax Policy 2000–2001, United States …
4 the Honorable Pamela F. O… former Assistant Secretary for Tax Policy 2002–2004, United States …
5 the Honorable Eric Solomon former Assistant Secretary for Tax Policy 2006–2009, United States …
6 the Honorable Mark J. Maz… "former Assistant Secretary for Tax Policy 2012–2017, United States…
7 Mr. Daniel Garcia-Diaz Director, Financial Markets and Community Investment, United States…
8 Mr. Grant S. Whitaker president, National Council of State Housing Agencies, Washington, …
9 the Honorable Katherine M… Ph.D., professor of public policy and planning, and faculty directo…
10 Mr. Kirk McClure Ph.D., professor, Urban Planning Program, School of Public Policy a…
# … with 85 more rows
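As one example of that extra cleaning, honorifics could be stripped from the name column; a minimal sketch (the listed prefixes are assumptions, not exhaustive):
testimony_tibb %>%
  mutate(name = str_remove(name, "^the Honorable |^Mr\\. |^Ms\\. "))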

Extracting 5 words before and after a specific word

How can I extract the words surrounding a specific word? Example:
"On June 28, Jane went to the cinema and ate popcorn"
I would like to choose 'Jane' and get [-2,2], meaning:
"June 28, Jane went to"
We could make a function to help out. This might make it a little more dynamic.
library(tidyverse)
txt <- "On June 28, Jane went to the cinema and ate popcorn"
grab_text <- function(text, target, before, after){
  words <- str_split(text, "\\s")[[1]]
  pos <- which(grepl(target, words))
  # clamp the window so a target near the start or end stays in bounds
  from <- max(1, pos - before)
  to <- min(length(words), pos + after)
  paste(words[from:to], collapse = " ")
}
grab_text(text = txt, target = "Jane", before = 2, after = 2)
#> [1] "June 28, Jane went to"
First we split the sentence, then we figure out the position of the target, then we grab the requested number of words before and after (clamped so we never step outside the string), and lastly we collapse the sentence back together.
I have a shorter version using str_extract from stringr
library(stringr)
txt <- "On June 28, Jane went to the cinema and ate popcorn"
str_extract(txt,"([^\\s]+\\s+){2}Jane(\\s+[^\\s]+){2}")
[1] "June 28, Jane went to"
The function str_extract extracts the pattern from the string. The regex \\s matches whitespace, and [^\\s] is its negation, so anything but whitespace. The whole pattern is therefore Jane preceded and followed by exactly two whitespace-separated words.
The advantage is that it is already vectorized, and if you have a vector of text you can use str_extract_all:
s <- c("On June 28, Jane went to the cinema and ate popcorn.
The next day, Jane hiked on a trail.",
"an indeed Jane loved it a lot")
str_extract_all(s,"([^\\s]+\\s+){2}Jane(\\s+[^\\s]+){2}")
[[1]]
[1] "June 28, Jane went to" "next day, Jane hiked on"
[[2]]
[1] "an indeed Jane loved it"
Here's an example with an expansion for multiple occurrences. Basically, split on whitespace, find the word, expand the indices, then make a list of results.
s <- "On June 28, Jane went to the cinema and ate popcorn. The next day, Jane hiked on a trail."
words <- strsplit(s, '\\s+')[[1]]
inds <- grep('Jane', words)
lapply(inds, FUN = function(i) {
  paste(words[max(1, i-2):min(length(words), i+2)], collapse = ' ')
})
#> [[1]]
#> [1] "June 28, Jane went to"
#>
#> [[2]]
#> [1] "next day, Jane hiked on"
Created on 2019-09-17 by the reprex package (v0.3.0)
This should work for the five words named in the title (it needs at least five words on each side of Jane):
stringr::str_extract(text, "(?:[^\\s]+\\s){5}Jane(?:\\s[^\\s]+){5}")
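With 2s in place of the 5s it reproduces the example above (txt as defined in the earlier answers):
stringr::str_extract(txt, "(?:[^\\s]+\\s){2}Jane(?:\\s[^\\s]+){2}")
[1] "June 28, Jane went to"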

Extract dates from a vector of character strings

I have a vector with two elements. Each element contains a string of characters with two sets of dates. I need to extract the latter of these two dates and make a new vector or list with them.
#webextract vector
webextract <- list("The Employment Situation, December 2006 January 5 \t 8:30 am\r","The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r")
# This is what the output of webextract looks like:
[[1]]
[1] The Employment Situation, December 2006 January 5 \t 8:30 am\r
[[2]]
[1] The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r
webextract is the result of web scraping a URL with plain text; that's why it looks like that. What I need to extract is "January 5" and "Feb. 2". I have been experimenting with grep and strsplit and failed to get anywhere. I have gone through all related SO questions without success. Thank you for your help.
We can try gsub after unlisting 'webextract'. The first alternative strips everything up to and including the first run of digits (the year in the title) plus the following whitespace; the second strips an optional ", 2007"-style year and everything from the time onward:
gsub("^\\D+\\d+\\s+|(,\\s+\\d+)*\\D+\\d+:.*$", "", unlist(webextract))
#[1] "January 5" "Feb. 2"
