Extract first sentence in string - r

I want to extract the first sentence from the following string with regex. The rule I want to implement (which I know won't be a universal solution) is to extract from the string start ^ up to and including the first period/exclamation/question mark that is preceded by a lowercase letter or digit.
require(stringr)
x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."
My best guess so far has been to try and implement a non-greedy string-before-match approach, but it fails in this case:
str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA
Any tips much appreciated.

You put the [a-z0-9][.?!] into a non-consuming lookahead pattern; you need to make it consuming if you plan to use str_extract:
> str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."
Details
.*? - any 0+ chars other than line break chars, as few as possible
[a-z0-9] - an ASCII lowercase letter or a digit
[.?!] - a ., ? or !
(?= ) - that is followed with a literal space.
Alternatively, you may use sub:
sub("([a-z0-9][?!.])\\s.*", "\\1", x)
Details
([a-z0-9][?!.]) - Group 1 (referred to with \1 from the replacement pattern): an ASCII lowercase letter or digit and then a ?, ! or .
\s - a whitespace
.* - any 0+ chars, as many as possible (up to the end of string).
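As a quick sanity check, the sub() call reproduces the same first sentence on the example string from the question:

```r
x <- "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."
# The leftmost [a-z0-9][?!.] followed by whitespace is the "1." in "October 11. "
# ("U.S." and "W." don't qualify because the letter before the dot is uppercase),
# so everything from the following space onwards is replaced with Group 1:
sub("([a-z0-9][?!.])\\s.*", "\\1", x)
# [1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."
```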

corpus has special handling for abbreviations when determining sentence boundaries:
library(corpus)
text_split(x, "sentences")
#>   parent index text
#> 1      1     1 Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of Oct…
#> 2      1     2 The death toll has now risen to at least 187.
There's also a useful dataset of common abbreviations for many languages, including English. See corpus::abbreviations_en, which can be used to disambiguate sentence boundaries.
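For a quick look at what that dataset contains (a sketch assuming the corpus package is installed; the exact contents may vary by package version):

```r
library(corpus)

# abbreviations_en is a plain character vector of common English abbreviations;
# the default text_filter() uses it as sentence-break suppressions, which is
# why "U.S." above did not end a sentence
head(abbreviations_en)
"U.S." %in% abbreviations_en  # should be TRUE
```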


Extract a string of words between two specific words, but allow for a mismatches in R

I have the following string.
string = c("today is Oscar")
I want to extract everything between today and Oscar but allow for a maximum of two mismatches/typos in words today and Oscar.
The expected outcome, in this case, will be is, but there are strings that have another word between today and Oscar. Typos can occur in any letter in words today and Oscar.
I am currently having a look at the agrep package. Any help or guidance is appreciated.
If I understood you correctly, you want to extract from your vector the verbs (i.e., the middle substring) iff the words on the left and on the right of it are at most 2 insertions/deletions etc. away from the "today \\w+ Oscar" pattern.
If that premise is correct, you can first subset your vector to those strings that meet the condition using agrep (or agrepl), and second capture the substring in the middle in a capturing group (...) and refer to it with the backreference \\1 in sub's replacement argument:
sub("\\w+ (\\w+) \\w+", "\\1", string[agrepl("today \\w+ Oscar", string, max.distance = list(all = 2), ignore.case = T, fixed = F)])
[1] "IS" "drive" "goes"
Note: the argument all specifies the "maximal number/fraction of all transformations (insertions, deletions and substitutions)"; alternatively, set insertions, deletions, and substitutions individually.
Mock data:
string = c("today IS Oscar", "today drive car", "tody goes Oscar", "tomorrow was Oscar")
"today IS Oscar" fully matches as ignore.case = T makes sure case doesn't matter
"today drive car" is a fuzzy match as caris 2 steps away from Oscar
"tody goes Oscar"is a fuzzy match as tody is 1 step away from todayand
"tomorrow was Oscar" is not a match at all as tomorrowis in excess of 2 steps distant from today

getting title of citation with regex

I am not so extremely familiar with regex but I would like to extract the title of a paper from a citation:
The title is in between the year (for example 1991 in the 1st citation) and the following dot in the sentence. I make it here in italics.
"1Moulds J.M., Nickells M.W., Moulds J.J., et al. (1991) The C3b/C4b
receptor is recognized by the Knops, McCoy, Swain-langley, and York
blood group antisera. J. Exp. Med.5:1159-63."
"2Rochowiak A., Niemir Z.I. (2010) The structure and role of CR1
complement receptor in pathology. Pol. Merkur Lekarski. 28:84–88."
"3WHO. Geneva: WHO; 2018. World Malaria Report 2018".
The citations are stored in a data frame (df) in the column "citation".
Output:
The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera
The structure and role of CR1 complement receptor in pathology
I wrote a regex which looks like this:
df$citation = sub('[^"]*?)', "", df$citation)
df$citation = sub("\\..*", "", df$citation)
Any advice on how to make it one line only?
In addition, it would be good to have a regex which, if it does not find a year in parentheses (as in the third citation), deletes the citation. Is it possible to do this?
Given your set of requirements, you can use
sub("^.*?\\b(?:19|20)\\d{2}\\)\\s*([^.]+).*", "\\1", df$citation, perl=TRUE)
Details
^ - start of string
.*? - any 0+ chars, other than line break chars, as few as possible
\b(?:19|20)\d{2} - word boundary, 19 or 20 and any two digits
\) - a ) char
\s* - 0+ whitespaces
([^.]+) - Group 1: one or more chars other than .
.* - any 0+ chars, other than line break chars, as many as possible.
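Put together on a mock vector of the three citations from the question (the grepl() filter is an addition to satisfy the last requirement: citations without a parenthesized year are dropped rather than returned unchanged):

```r
citations <- c(
  "1Moulds J.M., Nickells M.W., Moulds J.J., et al. (1991) The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera. J. Exp. Med.5:1159-63.",
  "2Rochowiak A., Niemir Z.I. (2010) The structure and role of CR1 complement receptor in pathology. Pol. Merkur Lekarski. 28:84-88.",
  "3WHO. Geneva: WHO; 2018. World Malaria Report 2018"
)
# keep only citations that contain a "(19xx)" or "(20xx)" year
has_year <- grepl("\\((?:19|20)\\d{2}\\)", citations, perl = TRUE)
titles <- sub("^.*?\\b(?:19|20)\\d{2}\\)\\s*([^.]+).*", "\\1", citations[has_year], perl = TRUE)
titles
# [1] "The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera"
# [2] "The structure and role of CR1 complement receptor in pathology"
```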

Remove the string before a certain word with R

I have a character vector that I need to clean. Specifically, I want to remove the number that comes before the word "Votes." Note that the number has a comma to separate thousands, so it's easier to treat it as a string.
I know that gsub("*. Votes","", text) will remove everything, but how do I just remove the number? Also, how do I collapse the repeated spaces into just one space?
Thanks for any help you might have!
Example data:
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
You may use
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
# => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"
Details
(\\s){2,} - matches 2 or more whitespace chars while capturing the last occurrence that will be reinserted using the \1 placeholder in the replacement pattern
| - or
\\d - a digit
[0-9,]* - 0 or more digits or commas
\\s* - 0+ whitespace chars
(Votes) - Group 2 (will be restored in the output using the \2 placeholder): a Votes substring.
Note that trimws will remove any leading/trailing whitespace.
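The same result can be obtained in two simpler steps, at the cost of a second pass over the string; a sketch using the same text value:

```r
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"

# step 1: drop the vote count in front of "Votes"
res <- sub("\\d[0-9,]*\\s*Votes", "Votes", text)
# step 2: collapse runs of whitespace and trim the ends
res <- trimws(gsub("\\s+", " ", res))
res
```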
Easiest way is with stringr:
> library(stringr)
> regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
> str_extract(text,regexp)
[1] "558,586 Votes"
To do the same thing but extract only the number, wrap it in gsub:
> gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
[1] "558,586"
Here's a version that will strip out all numbers before the word "Votes", even if they have commas or periods in them:
> gsub('\\s+[[:alpha:]]+', '', unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+", text))))
[1] "558,586"
If you want the label too, then just throw out the gsub part:
> unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+", text)))
[1] "558,586 Votes"
And if you want to pull out all the numbers:
> unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*", text)))
[1] "1" "15" "202" "558,586"

Finding Abbreviations in Data with R

In my data (which is text), there are abbreviations.
Is there any functions or code that search for abbreviations in text? For example, detecting 3-4-5 capital letter abbreviations and letting me count how often they happen.
Much appreciated!
detecting 3-4-5 capital letter abbreviations
You may use
\b[A-Z]{3,5}\b
Details:
\b - a word boundary
[A-Z]{3,5} - 3, 4 or 5 capital letters (use [[:upper:]] to match letters other than ASCII, too)
\b - a word boundary.
An R demo (leveraging the regex occurrence count code from @TheComeOnMan):
abbrev_regex <- "\\b[A-Z]{3,5}\\b"
x <- "XYZ was seen at WXYZ with VWXYZ and did ABCDEFGH."
sum(gregexpr(abbrev_regex,x)[[1]] > 0)
## => [1] 3
regmatches(x, gregexpr(abbrev_regex, x))[[1]]
## => [1] "XYZ" "WXYZ" "VWXYZ"
You can use the regular expression [A-Z] to match any occurrence of a capital letter. If you want this pattern repeated exactly 3 times, add a {3} quantifier ([A-Z]{3}); a range quantifier such as {3,5} covers 3 to 5 repetitions directly, so there is no need for a loop.

text manipulation in R

I am trying to add parentheses around certain substrings in book-title character strings, and I want to be able to paste with the paste0 function. I want to take this vector:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)", "My Life 1993 07e pdfDrama (amazon.com)")
wrap certain strings in parentheses:
a
[1] "I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)"
[2] "My Life (1993) (07e) (pdfDrama) (amazon.com)"
I have tried but can't figure out a way to replace them within the string:
paste0("(",str_extract(a, "\\d{4}"),")")
paste0("(",str_extract(a, ”[0-9]+.e”),”)”)
Help?
I can suggest a regex for a fixed number of words of specific type:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)","My Life 1993 07e pdfDrama (amazon.com)")
sub("\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))", "(\\1)\\2(\\3)\\4(\\5)\\6", a)
In short,
\\b(\\d{4}) - captures 4 digits as a whole word into Group 1
(\\s+) - Group 2: one or more whitespaces
(\\d+e) - Group 3: one or more digits and e
(\\s+) - Group 4: ibid
([a-zA-Z]+) - Group 5: one or more letters
(\\s+\\([^()]*\\)) - Group 6: one or more whitespaces, (, zero or more chars other than ( and ), ).
The contents of the groups are inserted back into the result with the help of backreferences.
If there are more words, and you need to wrap words starting with a letter/digit/underscore after a 4-digit word in the string, use
gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)", "(\\1)", a, perl=TRUE)
Details:
(?:(?=\\b\\d{4}\\b)|\\G(?!\\A)) - either the location before a 4-digit whole word (see the positive lookahead (?=\\b\\d{4}\\b)) or the end of the previous successful match
\\s* - 0+ whitespaces
\\K - omitting the text matched so far
\\b(\\S+) - Group 1 capturing 1 or more non-whitespace symbols that are preceded with a word boundary.
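A quick check of the \G-based gsub() against the expected output from the question:

```r
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)",
       "My Life 1993 07e pdfDrama (amazon.com)")
# the first match starts right before the 4-digit year; each subsequent match
# is glued to the end of the previous one via \G, so only words after the
# year get wrapped ("(amazon.com)" is skipped: no word boundary before "(")
wrapped <- gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)", "(\\1)", a, perl = TRUE)
wrapped
# [1] "I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)"
# [2] "My Life (1993) (07e) (pdfDrama) (amazon.com)"
```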
