getting title of citation with regex - r

I am not so extremely familiar with regex but I would like to extract the title of a paper from a citation:
The title is in between the year (for example 1991 in the 1st citation) and the following dot in the sentence. I make it here in italics.
"1Moulds J.M., Nickells M.W., Moulds J.J., et al. (1991) The C3b/C4b
receptor is recognized by the Knops, McCoy, Swain-langley, and York
blood group antisera. J. Exp. Med.5:1159-63."
"2Rochowiak A., Niemir Z.I. (2010) The structure and role of CR1
complement receptor in pathology. Pol. Merkur Lekarski. 28:84–88."
"3WHO. Geneva: WHO; 2018. World Malaria Report 2018".
The citation are stored in a data frame (df) in the column "citation"
Output:
The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera
The structure and role of CR1 complement receptor in pathology
I wrote a regex which looks like this:
df$citation = sub('[^"]*?)', "", df$citation)
df$citation = sub("\\..*", "", df$citation)
Any advice on how to make it one line only?
In addition, it would be good to have a regex which if it does not find the year in parenthesis such as for the third citation it will delete the citation. Possible to do this?

Given your set of requirements, you can use
sub("^.*?\\b(?:19|20)\\d{2}\\)\\s*([^.]+).*", "\\1", df$citation, perl=TRUE)
See the regex demo
Details
^ - start of string
.*? - any 0+ chars, other than line break chars, as few as possible
\b(?:19|20)\d{2} - word boundary, 19 or 20 and any two digits
\) - a ) char
\s* - 0+ whitespaces
([^.]+) - Group 1: one or more chars other than .
.* - any 0+ chars, other than line break chars, as many as possible.

Related

Replace matched string with its sub-groups

I have some DNA sequences to process, they look like:
>KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast
TAAATTATGTGTCAGAGCTATTAATACCTTACCCCATCCATCTAGAAAAATGGGTTCAAATTCTTCGATA
TTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTCATGAATATTGGAATTGGAACTGT
TTTCTTATTCCAAAGAAATCGATTGCTATTTTTACAAAAAGTAATCCAAGATTTTTCTTGTTTCTATATA
>KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast
GATATATTAATACCTTACCCCGCTCATCTAGAAATCTTGGTTCAAACTCTCCGATACTGGTTGAAAGATG
CTTCTTCTTTGCATTTATTACGATTCTTTCTTTATGAGTGTCGTAATTGGATTAGTCTTATTACTCCAAA
AAAATCCATTTCCTTTTTGAAAAAAAGGAATCGAAGATTATTCTTGTTCCTATATAATTTCTATGTATGT
I coded a regex in order to detect the title line of each sequence:
(\>)([A-Z]{2}\d{6}\.?\d)\s([a-zA-Z]+\-?[a-zA-Z]+)\s([a-zA-Z]+\-?[a-zA-Z]+)\s(.*)\n
What function should I use to replace this whole match with its group3 + group4? In addition, I've got 72 matches in one txt file, how can I replace them in one run?
Your current regex won't work for lines where Group 3 or 4 contains a single letter word because [a-zA-Z]+\\-?[a-zA-Z]+ matches 1+ letters, then an optional hyphen, and then again 1+ letters (that means, there must be at least 2 letters). With [a-zA-Z]+(?:-[a-zA-Z]+)?, you can match 1+ letters followed with an optional sequence of - and then 1+ letters.
Also, \s also matches line breaks, and if your title lines are shorter than you assume then .* may grab a sequence line by mistake. You may use \h or [ \t] instead.
Note that \n is not necessary at the end of the pattern because .* matches any 0+ chars other than line break chars with an ICU regex library (it is used in your current code, str_replace_all).
In general, you should only capture with (...) what you need to keep, everything else can be just matched. Remove extra capturing parentheses, and it will save some performance.
If you add (?m)^ at the start, you will make sure you only match > that is at the start of a line.
You may use
"(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"
See the regex demo.
Code:
Sequence <- str_replace_all(SequenceRaw,
"(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*",
"\\1 \\2")
I figured it out myself, with tidyverse packages:
library(tidyverse)
SequenceRaw <- read_file("PATH OF SEQUENCE FILE\\sequenceraw.fasta") ## e.g. sequenceraw.fasta
Sequence <- str_replace_all(SequenceRaw,
"(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n",
">\\3 \\4\n") ## Keep '>' and add a new line with '\n'
write_file(Sequence, "YOUR PATH\\sequence.fasta")
Here is the result:

Extract first sentence in string

I want to extract the first sentence from following with regex. The rule I want to implement (which I know won't be universal solution) is to extract from string start ^ up to (including) the first period/exclamation/question mark that is preceded by a lowercase letter or number.
require(stringr)
x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."
My best guess so far has been to try and implement a non-greedy string-before-match approach fails in this case:
str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA
Any tips much appreciated.
You put the [a-z0-9][.?!] into a non-consuming lookahead pattern, you need to make it consuming if you plan to use str_extract:
> str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."
See this regex demo.
Details
.*? - any 0+ chars other than line break chars
[a-z0-9] - an ASCII lowercase letter or a digit
[.?!] - a ., ? or !
(?= ) - that is followed with a literal space.
Alternatively, you may use sub:
sub("([a-z0-9][?!.])\\s.*", "\\1", x)
See this regex demo.
Details
([a-z0-9][?!.]) - Group 1 (referred to with \1 from the replacement pattern): an ASCII lowercase letter or digit and then a ?, ! or .
\s - a whitespace
.* - any 0+ chars, as many as possible (up to the end of string).
corpus has special handling for abbreviations when determining sentence boundaries:
library(corpus)
text_split(x, "sentences")
#> parent index text
#> 1 1 1 Bali bombings: U.S. President George W. Bush amongst many others #> has condemned the perpetrators of the Bali car bombing of Oct…
#> 2 1 2 The death toll has now risen to at least 187.
There's also useful dataset with common abbreviations for many languages including English. See corpus::abbreviations_en, which can be used for disambiguating the sentence boundaries.

How to extract a substring by inverse pattern with R?

I trying to extract a substring by pattern using gsub() R function.
# Example: extracting "7 years" substring.
string <- "Psychologist - 7 years on the website, online"
gsub(pattern="[0-9]+\\s+\\w+", replacement="", string)`
`[1] "Psychologist - on the website, online"
As you can see, it's easy to exlude needed substring using gsub(), but I need to inverse the result and getting "7 years" only.
I think about using "^", something like that:
gsub(pattern="[^[0-9]+\\s+\\w+]", replacement="", string)
Please, could anyone help me with correct regexp pattern?
You may use
sub(pattern=".*?([0-9]+\\s+\\w+).*", replacement="\\1", string)
See this R demo.
Details
.*? - any 0+ chars, as few as possible
([0-9]+\\s+\\w+) - Capturing group 1:
[0-9]+ - one or more digits
\\s+ - 1 or more whitespaces
\\w+ - 1 or more word chars
.* - the rest of the string (any 0+ chars, as many as possible)
The \1 in the replacement replaces with the contents of Group 1.
You could use the opposite of \d, which is \D in R:
string <- "Psychologist - 7 years on the website, online"
sub(pattern = "\\D*(\\d+\\s+\\w+).*", replacement = "\\1", string)
# [1] "7 years"
\D* means: no digits as long as possible, the rest is captured in a group and then replaces the complete string.
See a demo on regex101.com.

Finding Abbreviations in Data with R

In my data (which is text), there are abbreviations.
Is there any functions or code that search for abbreviations in text? For example, detecting 3-4-5 capital letter abbreviations and letting me count how often they happen.
Much appreciated!
detecting 3-4-5 capital letter abbreviations
You may use
\b[A-Z]{3,5}\b
See the regex demo
Details:
\b - a word boundary
[A-Z]{3,5} - 3, 4 or 5 capital letters (use [[:upper:]] to match letters other than ASCII, too)
\b - a word boundary.
R demo online (leveraging the regex occurrence count code from #TheComeOnMan)
abbrev_regex <- "\\b[A-Z]{3,5}\\b";
x <- "XYZ was seen at WXYZ with VWXYZ and did ABCDEFGH."
sum(gregexpr(abbrev_regex,x)[[1]] > 0)
## => [1] 3
regmatches(x, gregexpr(abbrev_regex, x))[[1]]
## => [1] "XYZ" "WXYZ" "VWXYZ"
You can use the regular expression [A-Z] to match any ocurrence of acapital letter. If you want this pattern to be repeated 3 times you can add \1{3} to your regex. Consider using variables and a loop to get the job done for 3 to 5 repetition times.

text manipulation in R

I am trying to add parentheses around certain book titles character strings and I want to be able to paste with the paste0 function. I want to take this string:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)", "My Liffe 1993 07e pdfDrama (amazon.com)")
wrap certain strings in parentheses:
a
[1] “I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)”
[2] ”My Life (1993) (07e) (pdfDrama) (amazon.com)”
I have tried but can't figure out a way to replace them within the string:
paste0("(",str_extract(a, "\\d{4}"),")")
paste0("(",str_extract(a, ”[0-9]+.e”),”)”)
Help?
I can suggest a regex for a fixed number of words of specific type:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)","My Life 1993 07e pdfDrama (amazon.com)")
sub("\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))", "(\\1)\\2(\\3)\\4(\\5)\\6", a)
See the R demo
And here is the regex demo. In short,
\\b(\\d{4}) - captures 4 digits as a whole word into Group 1
(\\s+) - Group 2: one or more whitespaces
(\\d+e) - Group 3: one or more digits and e
(\\s+) - Group 4: ibid
([a-zA-Z]+) - Group 5: one or more letters
(\\s+\\([^()]*\\)) - Group 6: one or more whitespaces, (, zero or more chars other than ( and ), ).
The contents of the groups are inserted back into the result with the help of backreferences.
If there are more words, and you need to wrap words starting with a letter/digit/underscore after a 4-digit word in the string, use
gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)", "(\\1)", a, perl=TRUE)
See the R demo and a regex demo
Details:
(?:(?=\\b\\d{4}\\b)|\\G(?!\\A)) - either the location before a 4-digit whole word (see the positive lookahead (?=\\b\\d{4}\\b)) or the end of the previous successful match
\\s* - 0+ whitespaces
\\K - omitting the text matched so far
\\b(\\S+) - Group 1 capturing 1 or more non-whitespace symbols that are preceded with a word boundary.

Resources