Regular Expression Question - Two Negative Look behinds in the same expression - r

I have been working on the following problem for a few hours.
I am trying to build the following RegEx :
I want to be able to extract the word reduced from sentences but not if the word is preceded by a negative expression.
For example
Sentences | Output
1. lv function is reduced | reduced
2. lv function is not reduced | -
3. reduced lv function | reduced
4. no evidence of reduced lv function | -
Right now, I have been able to build a functioning RegEx for cases 3 and 4, where the adjective precedes the noun of interest, using a negative lookbehind.
However, for the cases 1 and 2, the negative look behind does not work.
Here are sentences and the current RegEx to test :
((?<!((no|not|none)(?:\D*?)))(reduced|depressed|normal)(?:\D*?))?(?:lv function|lv|systolic function|left ventricular ejection fraction)(((?:.*\bnot\b)(\D*))(reduced|depressed|normal))?
Sentences :
lv function is reduced
lv function is not reduced
reduced lv function
no evidence of reduced lv function
Alternatively here is a link : regexr.com/4tc61
Also, I am ultimately going to be working in R.
Thank you all.

The regex solution will be very complex, and you may use it only if you understand it well. I will try to explain it as well as I can.
Q: How do I match something that is not preceded with a string of unknown length if my lookbehinds do not support such patterns?
A: Match what you do not need, skip the matched texts, and go on matching from the position where the match failed.
You may do it with PCRE regex that supports (*SKIP)(*FAIL) (or shorter (*SKIP)(*F)) construct.
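Before the full pattern, here is a much smaller sketch of the same idea. The toy sentences and the simplified pattern are illustrative assumptions, not the final solution: the left alternative matches the unwanted "not ... reduced" context and is thrown away by (*SKIP)(*F), and the right alternative matches what should be kept.

```r
# Toy sentences (hypothetical, modelled on the question's examples)
x <- c("lv function is reduced",
       "lv function is not reduced",
       "reduced lv function")

# Left alternative: the unwanted context, discarded by (*SKIP)(*F).
# Right alternative: the word we actually want to find.
pattern <- "\\bnot\\b\\D*?\\breduced\\b(*SKIP)(*F)|\\breduced\\b"

# PCRE verbs are only available with perl = TRUE
hits <- grepl(pattern, x, perl = TRUE)
print(hits)
```

Under these assumptions only the second sentence is rejected, because its only occurrence of reduced sits inside the skipped "not reduced" stretch.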
Now, look at the pattern:
(?:\b(?:no|not|none)\b\D*?\b(?:reduced|depressed|normal)\b\D*?\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b|\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b\D*?\bnot\b\D*?\b(?:reduced|depressed|normal)\b)(*SKIP)(*F)|(?:\b(reduced|depressed|normal)\b\D*?)?\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b(?:\D*?\b(reduced|depressed|normal)\b)?
Looks unwieldy, but let's go through the constituents:
(?: - start of a non-capturing group that serves as a container; the (*SKIP)(*F) will be applied to all alternatives inside it:
\b(?:no|not|none)\b\D*?\b(?:reduced|depressed|normal)\b\D*?\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b:
\b(?:no|not|none)\b - any of the words inside the non-capturing group as whole words
\D*? - 0+ non-digit chars
\b(?:reduced|depressed|normal)\b - any of the words inside the non-capturing group as whole words
\D*? - 0+ non-digit chars
\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b - any of the texts inside the non-capturing group as whole words
| - or
\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b\D*?\bnot\b\D*?\b(?:reduced|depressed|normal)\b:
\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b - any of the texts inside the non-capturing group as whole words
\D*?\bnot\b\D*? - 0+ non-digits as few as possible, whole word not, 0+ non-digits as few as possible
\b(?:reduced|depressed|normal)\b - any of the texts inside the non-capturing group as whole words
)(*SKIP)(*F) - end of the container group, and the PCRE verbs that fail the match, making the regex engine go on to search for matches starting at the position where the match failed
| - or (that is, now, really match what we need with the next alternative)
(?:\b(reduced|depressed|normal)\b\D*?)?\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b(?:\D*?\b(reduced|depressed|normal)\b)?:
(?:\b(reduced|depressed|normal)\b\D*?)? - an optional non-capturing group matching a reduced, depressed or normal captured into Group 1 (we need to extract the word matched with this group!) as whole words and then any 0+ non-digit chars as few as possible
\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b - any of the texts inside the non-capturing group as whole words
(?:\D*?\b(reduced|depressed|normal)\b)? - an optional non-capturing group matching any 0+ non-digit chars as few as possible and then capturing reduced, depressed or normal into Group 2 as a whole word.
There are so many repetitive parts, so it makes sense to use variables in the pattern:
x <- c("lv function is reduced", "lv function is not reduced", "reduced lv function", "no evidence of reduced lv function")
cap <- "reduced|depressed|normal"
negate_prefix <- paste0("(?:\\b(?:no|not|none)\\b\\D*?\\b(?:",cap,")\\b\\D*?")
match <- "\\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\\b"
regex <- paste0(negate_prefix,
match, "|", match, "\\D*?\\bnot\\b\\D*?\\b(?:",cap,")\\b)(*SKIP)(*F)|(?:\\b(",cap,")\\b\\D*?)?",match,"(?:\\D*?\\b(",cap,")\\b)?")
So, all we need is the captured substrings. See the R demo online:
results <- regmatches(x, regexec(regex, x, perl=TRUE))
unlist(lapply(results, function(x) paste(x[-1], collapse="")))
## => [1] "reduced" "" "reduced" ""

If the example is as complicated as it gets, then something like this might work:
library(tidyverse)
library(stringr)
df <- data.frame(sent = c("lv function is reduced",
"lv function is not reduced",
"reduced lv function",
"no evidence of reduced lv function"))
df <- df %>%
mutate(out_test = +str_detect(sent, "\\b(?:no|not|none)\\b"), # whole words only, so "normal" is not treated as a negation
output = ifelse(out_test == 0, "reduced", NA)) %>%
select(-out_test)
df
                                sent  output
1             lv function is reduced reduced
2         lv function is not reduced    <NA>
3                reduced lv function reduced
4 no evidence of reduced lv function    <NA>

Related

How to match binomial expressions in R?

I want to match binomials, that is, bisyllabic words, sometimes hyphenated, with slightly varied syllable reduplication; the variation always concerns the first (and, possibly, second) letter in the reduplicated syllable:
x <- c("pow-wow", "pickwick", "easy-peasy", "nitty-gritty", "bzzzzzzz", "mmmmmm", "shish", "wedged", "yaaaaaa")
Here, we have said syllable reduplication in pow-wow, pickwick, easy-peasy, and nitty-gritty (which are then the expected output) but not in bzzzzzzz, mmmmmm, shish, wedged and yaaaaa.
This regex does at least manage to get rid of wedged (which is pronounced as one syllable) as well as monosyllabic words by requiring the presence of a vowel in the capturing group:
grep("\\b\\w?((?!ed)(?=[aeiou])\\w{2,})-?\\w\\w?\\1\\b$", x, value = T, perl = T)
[1] "pow-wow" "pickwick" "easy-peasy" "nitty-gritty" "yaaaaa"
However, yaaaaa is getting matched too. To not match it my feeling is that the capturing group should be disallowed to contain two identical vowels in immediate succession but I don't know how to implement that restriction.
Any ideas?
It looks as though you want to match words that cannot contain ed after the initial chars and 2 or more repeated chars if the same chunk is not found farther in the string. Also, the allowed "difference" window at the start and middle is 0 to 2 characters.
You may use
\b\w{0,2}(?!((.)\2+)(?!.*\1)|ed)([aeiou]\w+)-?\w\w?\3\b
See the regex demo
Details
\b - a word boundary (you may use ^ if your "words" are equal to whole string)
\w{0,2} - zero to two word chars (replace \w with \p{L} to only match letters)
(?!((.)\2+)(?!.*\1)|ed) - no ed or two or more identical chars that do not repeat later in the string are allowed immediately to the right of the current location
([aeiou]\w+) - a vowel (captured in Group 3) and 1+ word chars (replace with \p{L} to only match letters)
-? - an optional hyphen
\w\w? - 1 or 2 word chars
\3 - same value as captured in Group 3
\b - a word boundary (you may use $ if your "words" are equal to whole string)
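As a quick check, the pattern can be run against the vector from the question. This is only a sketch: the backslashes are doubled for an R string literal, and perl = TRUE is needed because lookaheads are not supported by the default TRE engine.

```r
# Vector from the question
x <- c("pow-wow", "pickwick", "easy-peasy", "nitty-gritty",
       "bzzzzzzz", "mmmmmm", "shish", "wedged", "yaaaaaa")

# Same pattern as above, escaped for R
pattern <- "\\b\\w{0,2}(?!((.)\\2+)(?!.*\\1)|ed)([aeiou]\\w+)-?\\w\\w?\\3\\b"

# value = TRUE returns the matching elements rather than their indices
matched <- grep(pattern, x, value = TRUE, perl = TRUE)
print(matched)
```

With this input only the four reduplicated words should come back; the run of identical vowels in the last element is rejected by the first branch of the negative lookahead.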

Replace matched string with its sub-groups

I have some DNA sequences to process, they look like:
>KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast
TAAATTATGTGTCAGAGCTATTAATACCTTACCCCATCCATCTAGAAAAATGGGTTCAAATTCTTCGATA
TTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTCATGAATATTGGAATTGGAACTGT
TTTCTTATTCCAAAGAAATCGATTGCTATTTTTACAAAAAGTAATCCAAGATTTTTCTTGTTTCTATATA
>KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast
GATATATTAATACCTTACCCCGCTCATCTAGAAATCTTGGTTCAAACTCTCCGATACTGGTTGAAAGATG
CTTCTTCTTTGCATTTATTACGATTCTTTCTTTATGAGTGTCGTAATTGGATTAGTCTTATTACTCCAAA
AAAATCCATTTCCTTTTTGAAAAAAAGGAATCGAAGATTATTCTTGTTCCTATATAATTTCTATGTATGT
I coded a regex in order to detect the title line of each sequence:
(\>)([A-Z]{2}\d{6}\.?\d)\s([a-zA-Z]+\-?[a-zA-Z]+)\s([a-zA-Z]+\-?[a-zA-Z]+)\s(.*)\n
What function should I use to replace this whole match with its group3 + group4? In addition, I've got 72 matches in one txt file, how can I replace them in one run?
Your current regex won't work for lines where Group 3 or 4 contains a single letter word because [a-zA-Z]+\\-?[a-zA-Z]+ matches 1+ letters, then an optional hyphen, and then again 1+ letters (that means, there must be at least 2 letters). With [a-zA-Z]+(?:-[a-zA-Z]+)?, you can match 1+ letters followed with an optional sequence of - and then 1+ letters.
Also, \s also matches line breaks, and if your title lines are shorter than you assume then .* may grab a sequence line by mistake. You may use \h or [ \t] instead.
Note that \n is not necessary at the end of the pattern because .* matches any 0+ chars other than line break chars with an ICU regex library (it is used in your current code, str_replace_all).
In general, you should only capture with (...) what you need to keep, everything else can be just matched. Remove extra capturing parentheses, and it will save some performance.
If you add (?m)^ at the start, you will make sure you only match > that is at the start of a line.
You may use
"(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"
See the regex demo.
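The pattern can also be tried without reading a file. The sketch below makes a few assumptions: it uses base R gsub with perl = TRUE instead of stringr (PCRE also understands \h and (?m)), the sequence lines are shortened for illustration, and the replacement keeps the leading > so that the output stays a valid FASTA header.

```r
# Two shortened records (sequence lines truncated for illustration)
fasta <- paste0(
  ">KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast\n",
  "TAAATTATGTGTCAGAGC\n",
  ">KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast\n",
  "GATATATTAATACCTTAC\n")

# Same pattern as above; only genus and species are captured
header <- "(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"

# gsub replaces every title line in one pass; sequence lines are untouched
cleaned <- gsub(header, ">\\1 \\2", fasta, perl = TRUE)
cat(cleaned)
```

Because .* stops at the line break, the newline after each title survives and does not need to be re-added in the replacement.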
Code:
Sequence <- str_replace_all(SequenceRaw,
"(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*",
"\\1 \\2")
I figured it out myself, with tidyverse packages:
library(tidyverse)
SequenceRaw <- read_file("PATH OF SEQUENCE FILE\\sequenceraw.fasta") ## e.g. sequenceraw.fasta
Sequence <- str_replace_all(SequenceRaw,
"(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n",
">\\3 \\4\n") ## Keep '>' and add a new line with '\n'
write_file(Sequence, "YOUR PATH\\sequence.fasta")

Match word containing both lower and upper case letters

I want to match, in R, acronyms that contain both lowercase and uppercase letters (at least one of each, like reKHS) as well as all-uppercase acronyms of length 3 or more (CASE, CAT). The regex should match both reKHS and CASE. This regex is my attempt at the latter case (acronyms of length 3 or more): regex <- "\\b^[a-zA-Z]*${3,10}\\b";. I need a way to combine it with a regex for the mixed-case acronyms.
A positive look-ahead or two should solve this
(.*(?=.*[a-z])(?=.*[A-Z]).*)|([A-Z]{3,})
To explain:
(.*(?=.*[a-z])(?=.*[A-Z]).*) - either contain a lowercase and an uppercase character somewhere
| - or
([A-Z]{3,}) - have at least 3 uppercase characters
You may use a TRE compliant pattern like
regex <- "\\b(?:[[:upper:]]{3,10}|(?:[[:lower:]]+[[:upper:]]|[[:upper:]][[:lower:]]*[[:upper:]])[[:alpha:]]*)\\b"
Or a PCRE regex (use with perl=TRUE in base R functions):
regex <- "\\b(?:\\p{Lu}{3,10}|(?:\\p{Ll}+\\p{Lu}|\\p{Lu}\\p{Ll}*\\p{Lu})\\p{L}*)\\b"
See the regex demo (and the PCRE regex demo).
Details
\\b - a word boundary
(?: - either
[[:upper:]]{3,10} - 3 to 10 uppercase letters
| - or
(?: - either
[[:lower:]]+[[:upper:]] - 1 or more lowercase letters and 1 uppercase letter
| - or
[[:upper:]][[:lower:]]*[[:upper:]] - an uppercase, then 0+ lowercase and then an uppercase letter
) - end of the grouping
[[:alpha:]]* - 0+ letters
) - end of the alternation group
\\b - a word boundary.
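A short way to see what the PCRE variant extracts is sketched below. The sample sentence is made up for illustration; regmatches with gregexpr returns every match in the string.

```r
# Hypothetical sample text
txt <- "The reKHS and CASE labels differ from cat or Cat."

# PCRE variant of the pattern from above
regex <- "\\b(?:\\p{Lu}{3,10}|(?:\\p{Ll}+\\p{Lu}|\\p{Lu}\\p{Ll}*\\p{Lu})\\p{L}*)\\b"

# Extract all acronym-like words
found <- regmatches(txt, gregexpr(regex, txt, perl = TRUE))[[1]]
print(found)

# A two-letter all-caps word also passes, via the second branch
two_caps <- grepl(regex, "AB", perl = TRUE)
print(two_caps)
```

Words with a single capital such as The or Cat are not matched, since the mixed-case branch needs either a lowercase run followed by an uppercase letter or two uppercase letters. Note that this also lets a two-letter word like AB through; if that is not wanted, the second branch would need tightening.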

Identifying all numbers above n with grep

I have a character vector that contains X.1 - X.13 (and in reality, also a lot of other stuff, including both other numbered variables and variables featuring X). I want to locate X.3 - X.13 and to that effect have used grep with the following expression:
x <- paste0("X.", 1:13)
grep("^X\\.[3-9]{1}|^X\\.[0-9]{2}", x)
My question: is there a better, shorter expression that could be used here? I get that this is probably fairly trivial, but I just want to understand regex better.
Your pattern contains two alternatives, ^X\\.[3-9]{1} matches X.3 to X.9 and ^X\\.[0-9]{2} matches X.00 to X.99. That is not what you are looking for.
To locate just X.3 to X.13, use
grep("^X\\.(?:[3-9]|1[0-3])\\b", x)
Or, to match in any right-hand side context (no word boundary on the right):
grep("^X\\.(?:1[0-3]|[3-9])", x)
See the regex demo.
Or, if there can be a letter or _ after the number, replace \\b with (?!\\d) and be sure to pass perl=TRUE to the grep function as the lookahead constructs are not supported by the default TRE regex engine:
grep("^X\\.(?:[3-9]|1[0-3])(?!\\d)", x, perl=TRUE)
See this regex demo.
Another point: the ^ caret is used to denote the start of string. If you mean to match the substring anywhere inside the string, remove it or replace with \\b to match X that is not preceded with _, letters or digits (see yet another regex demo).
Details
^ - start of string
X\\. - a X. substring
(?: - start of a non-capturing group:
1[0-3] - 1 followed with a digit from 0 to 3
| - or
[3-9] - 3 to 9
) - end of the non-capturing group
\\b - a word boundary
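The difference between the variants shows up once the vector contains names beyond X.1 to X.13. The extra names in y below are hypothetical, added only to illustrate the effect of the (?!\\d) guard.

```r
x <- paste0("X.", 1:13)

# Guarded pattern: the number must not be followed by another digit
idx <- grep("^X\\.(?:[3-9]|1[0-3])(?!\\d)", x, perl = TRUE)
print(x[idx])

# Hypothetical extra names
y <- c("X.14", "X.30", "X.3a", "X.13_b")

# With the lookahead, X.30 is rejected because a digit follows the 3
guarded <- grepl("^X\\.(?:[3-9]|1[0-3])(?!\\d)", y, perl = TRUE)

# Without any right-hand restriction, X.30 is accepted
unguarded <- grepl("^X\\.(?:1[0-3]|[3-9])", y, perl = TRUE)

print(guarded)
print(unguarded)
```

On the original vector all three patterns pick the same elements, so the choice only matters when longer numbers or suffixes can occur.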

text manipulation in R

I am trying to add parentheses around certain parts of book-title character strings, and I want to be able to do this with the paste0 function. I want to take this string:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)", "My Life 1993 07e pdfDrama (amazon.com)")
wrap certain strings in parentheses:
a
[1] "I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)"
[2] "My Life (1993) (07e) (pdfDrama) (amazon.com)"
I have tried but can't figure out a way to replace them within the string:
paste0("(",str_extract(a, "\\d{4}"),")")
paste0("(",str_extract(a, "[0-9]+.e"),")")
Help?
I can suggest a regex for a fixed number of words of specific type:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)","My Life 1993 07e pdfDrama (amazon.com)")
sub("\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))", "(\\1)\\2(\\3)\\4(\\5)\\6", a)
See the R demo
And here is the regex demo. In short,
\\b(\\d{4}) - captures 4 digits as a whole word into Group 1
(\\s+) - Group 2: one or more whitespaces
(\\d+e) - Group 3: one or more digits and e
(\\s+) - Group 4: one or more whitespaces (same as Group 2)
([a-zA-Z]+) - Group 5: one or more letters
(\\s+\\([^()]*\\)) - Group 6: one or more whitespaces, (, zero or more chars other than ( and ), ).
The contents of the groups are inserted back into the result with the help of backreferences.
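To make the role of the whitespace groups visible, the sketch below runs the same sub call on the two titles plus one hypothetical variant with a double space, which is an assumption added purely for illustration.

```r
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)",
       "My Life 1993 07e pdfDrama (amazon.com)",
       "My Life 1993  07e pdfDrama (amazon.com)")  # hypothetical: two spaces after the year

# Groups 1, 3 and 5 are wrapped; groups 2, 4 and 6 are reinserted unchanged
wrapped <- sub("\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))",
               "(\\1)\\2(\\3)\\4(\\5)\\6", a)
print(wrapped)
```

Capturing the whitespace instead of hard-coding a single space in the replacement means the original spacing is carried over as-is, as the third element shows.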
If there are more words, and you need to wrap words starting with a letter/digit/underscore after a 4-digit word in the string, use
gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)", "(\\1)", a, perl=TRUE)
See the R demo and a regex demo
Details:
(?:(?=\\b\\d{4}\\b)|\\G(?!\\A)) - either the location before a 4-digit whole word (see the positive lookahead (?=\\b\\d{4}\\b)) or the end of the previous successful match
\\s* - 0+ whitespaces
\\K - omitting the text matched so far
\\b(\\S+) - Group 1 capturing 1 or more non-whitespace symbols that are preceded with a word boundary.

Resources