Regex includes Lookahead strings in selection - r

I'm trying to extract the degree (Mild/Moderate/Severe) of a specific type of heart dysfunction (diastolic dysfunction) from a large number of echo reports.
Here is the link to the sample excel file with 2 of those echo reports.
The lines are usually expressed like this: "Mild LV diastolic dysfunction" or "Mild diastolic dysfunction". Here, "Mild" is what I want to extract.
I wrote the following pattern:
pattern <- regex("(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
ignore_case = FALSE)
Now, let's look at the results (remember I want the "Mild" part not the "LV" part):
str_view_all(df$echo, pattern)
As you can see, in strings like "Mild diastolic dysfunction" the pattern correctly selects "Mild", but in "Mild LV diastolic dysfunction" it selects "LV", even though I have put the lv inside the positive lookahead, as in (?= (lv )?...).
Does anyone know what I am doing wrong?

The problem is that \w+ matches any run of one or more word chars, and a lookahead does not consume the chars it matches (the regex index stays where it was).
So LV gets matched by \w+, because diastolic dysfunction comes right after it and ( lv)? is an optional group: the lookahead still succeeds when there is no space+lv right before diastolic dysfunction.
If you do not want to match LV, add a negative lookahead to restrict what \w+ matches:
\b(?!lv\b)\w+\b(?=(?:\s+lv)?\s+d(?:[iy]a|i)stolic d[yi]sfunction)
Also, note that [iy] is a better way to write (i|y).
In R, you may define it as
pattern <- regex(
  "\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)",
  ignore_case = TRUE  # needed so that lv also covers the upper-case "LV" in the reports
)
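To check the pattern without stringr, here is a minimal base R sketch. The sample strings and the names reports and degree are made up for illustration, and case-insensitive matching is switched on with the inline (?i) flag, which is an assumption about how the reports are capitalised.

```r
# Made-up sample lines; the real echo reports will look different.
reports <- c("Mild LV diastolic dysfunction",
             "Mild diastolic dysfunction",
             "Moderate lv distolic dysfunction",
             "Normal LV function")

# Same pattern as above, with (?i) for case-insensitive matching.
pattern <- "(?i)\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)"

# regexpr() returns -1 where nothing matched; keep NA for those reports.
m <- regexpr(pattern, reports, perl = TRUE)
degree <- rep(NA_character_, length(reports))
degree[m > 0] <- regmatches(reports, m)
degree
```

The last report has no diastolic dysfunction phrase, so it stays NA instead of being silently dropped.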

\w+ can also match LV, because the lv part of the lookahead is optional.
Instead of a lookahead, you can also use a capture group and read the degree from group 1:
\b(?!lv)(\w+)\b (?:lv )?(?:d[iy]astolic|distolic) d[iy]sfunction
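A sketch of how group 1 could be pulled out in base R with regexec(). The sample strings and the names reports, groups and degree are assumptions for illustration, and (?i) is added on the assumption that the reports mix upper and lower case.

```r
# Made-up sample lines for illustration.
reports <- c("Mild LV diastolic dysfunction",
             "Severe diastolic dysfunction",
             "Normal LV function")

pattern <- "(?i)\\b(?!lv)(\\w+)\\b (?:lv )?(?:d[iy]astolic|distolic) d[iy]sfunction"

# regexec() keeps capture groups: element 1 is the whole match, element 2 is group 1.
groups <- regmatches(reports, regexec(pattern, reports, perl = TRUE))

# Reports without a match yield character(0), which is mapped to NA here.
degree <- vapply(groups,
                 function(g) if (length(g) >= 2) g[2] else NA_character_,
                 character(1))
degree
```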


Can quantifiers be used in regex replacement in R?

My objective is to replace a string with a symbol repeated as many times as the string has characters, in the same way that one can convert letters to capitals with \\U\\1. If my pattern were "...(*)...", the replacement for what is captured by (*) would be something like x\\q1 or {\\q1}x, so I would get as many x as there are characters captured by *.
Is this possible?
I am thinking mainly of sub and gsub, but you can answer with other libraries such as stringi, stringr, etc.
You can use perl = TRUE or perl = FALSE and any other options as convenient.
I assume the answer may be negative, since the options seem to be quite limited (?gsub):
a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
Main quantifiers are (?base::regex):
? - The preceding item is optional and will be matched at most once.
* - The preceding item will be matched zero or more times.
+ - The preceding item will be matched one or more times.
{n} - The preceding item is matched exactly n times.
{n,} - The preceding item is matched n or more times.
{n,m} - The preceding item is matched at least n times, but not more than m times.
OK, but there seems to be an option (not in PCRE; I am not sure whether it is in Perl or elsewhere), written (*), which captures the number of characters the star quantifier was able to match (I found it at https://www.rexegg.com/regex-quantifier-capture.html), so that \q1 (same reference) could be used to refer to the first captured quantifier (and \q2, etc.). I also read that (*) is equivalent to {0,}, but I'm not sure whether that holds for what I'm interested in.
EDIT UPDATE:
As requested by commenters, I am updating my question with a specific example, taken from this interesting question and modified a bit. Let's say we have a <- "I hate extra spaces elephant". We want to keep a single space between words and the first 5 characters of each word (so far the same as the original question), but then put a dot for each remaining character (not sure if this is what the original question expected, but it doesn't matter), so the resulting string would be "I hate extra space. eleph..." (one . for the last s in spaces and 3 dots for the 3 letters ant at the end of elephant). So I started by keeping the first 5 characters with
gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"
How should I replace the exact number of characters in \\S* by dots or any other symbol?
Quantifiers cannot be used in the replacement pattern, and neither can the number of chars they matched.
What you need is a \G-based PCRE pattern to find consecutive matches after a specific place in the string:
a <- "I hate extra spaces elephant"
gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)
Details
(?:\G(?!^)|(?<!\S)\S{5}) - the end of the previous successful match or five non-whitespace chars not preceded with a non-whitespace char
\K - a match reset operator discarding text matched so far
\S - any non-whitespace char.
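If the number of kept characters or the filler symbol needs to vary, the pattern can be built with sprintf(). This is only a sketch: mask_tail, keep and symbol are hypothetical names, and symbol is assumed to contain no backslashes, since gsub() would interpret them in the replacement.

```r
# Hypothetical helper: keep the first `keep` chars of each word, mask the rest.
mask_tail <- function(x, keep = 5, symbol = ".") {
  # Same \G-based pattern as above, with the count filled in.
  pattern <- sprintf("(?:\\G(?!^)|(?<!\\S)\\S{%d})\\K\\S", keep)
  gsub(pattern, symbol, x, perl = TRUE)
}

a <- "I hate extra spaces elephant"
mask_tail(a)                     # "I hate extra space. eleph..."
mask_tail("elephant", keep = 3)  # "ele....."
```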
gsubfn is like gsub except that the replacement can be a function that takes the match as input and returns the replacement. The function can optionally be expressed as a formula, as we do here, replacing each run of word characters with the output of the function. No complex regular expressions are needed.
library(gsubfn)
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), strrep(".", max(0, nchar(x) - 5))), a)
## [1] "I hate extra space. eleph..."
or almost the same, with a slightly different function:
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), substring(gsub(".", ".", x), 6)), a)
## [1] "I hate extra space. eleph..."
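For anyone who prefers not to add a package, the same idea (compute the replacement per word in ordinary R code) can be sketched in base R with the replacement form of regmatches(). The names words, masked and result are made up for illustration.

```r
a <- "I hate extra spaces elephant"

# Locate every run of non-whitespace characters.
m <- gregexpr("\\S+", a)
words <- regmatches(a, m)[[1]]

# Keep the first 5 chars of each word and add one dot per remaining char.
masked <- paste0(substr(words, 1, 5),
                 strrep(".", pmax(0, nchar(words) - 5)))

# Write the masked words back into a copy of the original string.
result <- a
regmatches(result, m) <- list(masked)
result
```

Because only the matched words are replaced, the whitespace between them is left exactly as it was.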

Extracting word with co-occurring alphabets in R

I want to extract certain words from a bigger word list. One example of a desired extracted word list: all the words that contain /s/ followed by /r/. This should give me words such as sər'ka:rəh, e:k'sa:r, səmʋitərəɳ, and so:'ha:rd from the bigger word list.
Consider the data (IPA transcription) to be the one given below:
sər'ka:rəh
sə'lᴔ:nija:
hã:ki:
pu:'dʒa:ẽ:
e:k'sa:r
mritko:
dʒʱã:sa:
pə'hũtʃ'ne:'ʋa:le:
kərəpʈ
tʃinhirit
tʃʰəʈʈʰi:
dʱũdʱ'la:pən
səmʋitərəɳ
so:'ha:rd
məl'ʈi:spe:'ʃijliʈi:
la:'pər'ʋa:i:
upləbɡʱ
Thanks much!
Here's an answer to the issue described in the first paragraph of your post. (To my mind, the examples in the second paragraph are inconsistent with the issue described in the first para, so I'll take the liberty of ignoring them here).
You say you want to "extract all the words that contain p followed by t". The word 'extract' implies that there are other characters in the same string than those you want to match and extract. The verb 'contain' implies that the words you want to extract need not necessarily have p in word-initial position. Based on these premises, here's some mock data and a solution to the task:
Data:
x <- c("pastry is to the pastor's appetite what pot is to the pupil's")
Solution:
library(stringr)
unlist(str_extract_all(x, "\\b\\w*(?<=p)\\w*t\\w*\\b"))
This uses word boundaries \\b to extract the target words from the surrounding context; it further uses positive lookbehind (?<=...) to assert the condition that for there to be a matching t there needs to be a p character occurring prior to the match.
The regex in more detail:
\\b: the opening word boundary
\\w*: zero or more alphanumeric chars (or an underscore)
(?<=p): positive lookbehind: "if and only if you see a p char on the left..."
\\w*: zero or more alphanumeric chars (or an underscore)
t: the literal character t
\\w*: zero or more alphanumeric chars (or an underscore)
\\b: the closing word boundary
Result:
[1] "pastry" "pastor" "appetite" "pot"
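The lookbehind is not strictly needed for this task: putting a literal p before the t in the pattern should express the same condition. A base R sketch on the same mock sentence (the name hits is made up):

```r
x <- "pastry is to the pastor's appetite what pot is to the pupil's"

# A word containing a p, then (anywhere later in the same word) a t.
hits <- regmatches(x, gregexpr("\\b\\w*p\\w*t\\w*\\b", x, perl = TRUE))[[1]]
hits
```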
EDIT 1:
Now that the question has been updated, a more definitive answer is possible.
Data:
x <- c("sər'ka:rəh","sə'lᴔ:nija:","hã:ki:","pu:'dʒa:ẽ:","e:k'sa:r",
"mritko:","dʒʱã:sa:","pə'hũtʃ'ne:'ʋa:le:","kərəpʈ","tʃinhirit",
"tʃʰəʈʈʰi:","dʱũdʱ'la:pən","səmʋitərəɳ","so:'ha:rd",
"məl'ʈi:spe:'ʃijliʈi:", "la:'pər'ʋa:i:","upləbɡʱ")
If you want to match (rather than extract) words that "contain /s/ followed by /r/", you can use grep in various ways. Here are two:
grep("s.*r", x, value = TRUE)
or:
grep("(?<=s).*r", x, value = TRUE, perl = TRUE) # with lookbehind
The result is the same in either case:
[1] "sər'ka:rəh" "e:k'sa:r" "səmʋitərəɳ" "so:'ha:rd"
EDIT 2:
If the aim is to match words that "contain /s/ or /p/ followed by /r/ or /t/", you can use the metacharacter | to allow for alternatives:
grep("s.*r|s.*t|p.*r|p.*t", x, value = TRUE)
# or, more succinctly:
grep("(s|p).*(r|t)", x, value = TRUE)
[1] "sər'ka:rəh" "e:k'sa:r" "pə'hũtʃ'ne:'ʋa:le:" "səmʋitərəɳ" "so:'ha:rd"
[6] "la:'pər'ʋa:i:"
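Since each alternative here is a single character, bracket expressions are another way to write the same pattern. A short sketch comparing the two forms on the same data (the names by_alternation and by_class are made up):

```r
x <- c("sər'ka:rəh","sə'lᴔ:nija:","hã:ki:","pu:'dʒa:ẽ:","e:k'sa:r",
       "mritko:","dʒʱã:sa:","pə'hũtʃ'ne:'ʋa:le:","kərəpʈ","tʃinhirit",
       "tʃʰəʈʈʰi:","dʱũdʱ'la:pən","səmʋitərəɳ","so:'ha:rd",
       "məl'ʈi:spe:'ʃijliʈi:", "la:'pər'ʋa:i:","upləbɡʱ")

by_alternation <- grep("(s|p).*(r|t)", x, value = TRUE)
by_class       <- grep("[sp].*[rt]", x, value = TRUE)

# Both patterns should select the same words.
identical(by_alternation, by_class)
```

Note that retroflex ʈ is a different character from t, which is why kərəpʈ is not selected by either pattern.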
You can use the grep function. Assuming your list is called list:
grep("p[a-z]+t", list, value=TRUE)

Insert character string between period and digit in R

I have a vector of character strings like so:
test <- c("A1.7","A1.8")
and I want to use regular expressions to insert A1c<= between the period and the digit, like so:
A1.A1c<=7 A1.A1c<=8
I looked through existing questions and found a similar question by @zx8754; I tried to modify the answer posted there but had no luck:
insert <- 'A1c<='
n <- 4
old <- test
lhs <- paste0('([[:alpha:]][[:digit:]][[:punct:]]{', n-1, '})([[:digit:]]+)$')
rhs <- paste0('\\1', insert, '\\2')
gsub(lhs, rhs, test)
Can anyone direct me as to how to correctly execute this?
Another pattern:
gsub("\\.(\\d+)", "\\.A1c<=\\1", test)
## [1] "A1.A1c<=7" "A1.A1c<=8"
You may use
insert <- 'A1c<='
test <- c("A1.7","A1.8")
sub("(?<=\\.)(?=\\d)", insert, test, perl=TRUE)
## => A1.A1c<=7 A1.A1c<=8
Details
(?<=\\.) - a positive lookbehind that matches a location that is immediately preceded with a dot
(?=\\d) - a positive lookahead that matches a location that is immediately followed with a digit.
The sub function will replace the first occurrence only, and perl=TRUE makes it possible to use the lookaround constructs in the pattern (as it is now parsed with the PCRE regex engine).
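For comparison, a sketch that puts the lookaround version next to a capture-group version that does not need perl = TRUE. The extra element "B2.10" and the names with_lookarounds and with_groups are made up for illustration.

```r
test <- c("A1.7", "A1.8", "B2.10")
insert <- "A1c<="

# Zero-width match between the dot and the digit (PCRE lookarounds).
with_lookarounds <- sub("(?<=\\.)(?=\\d)", insert, test, perl = TRUE)

# Capture the dot and the digit, then put the insert between the two groups.
with_groups <- sub("(\\.)(\\d)", paste0("\\1", insert, "\\2"), test)

with_lookarounds
identical(with_lookarounds, with_groups)
```

In the capture-group version the inserted text becomes part of the replacement string, so it is assumed not to contain backslashes.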

How to extract a string between a symbol and a space?

I am trying to extract usernames tagged in a text-chat, such as "#Jack #Marie Hi there!"
I am trying to do it on the combination of # and whitespace but I cannot get the regex to match non-greedy (or at least this is what I think is wrong):
library(stringr)
str_extract(string = '#This is what I want to extract', pattern = "(?<=#)(.*)(?=\\s+)")
[1] "This is what I want to"
What I would like to extract instead is only This.
You could make your regex non greedy:
(?<=#)(.*?)(?=\s+)
Or if you want to capture only "This" after the # sign, you could try it like this using only a positive lookbehind:
(?<=#)\w+
Explanation
(?<= - opens a positive lookbehind
# - asserts that what is directly behind is a #
) - closes the positive lookbehind
\w+ - matches one or more word characters
The central part of your regex ((.*)) is a sequence of any chars.
Instead you should look for a sequence of chars other than whitespace
(\S+) or of word chars (\w+).
Note also that I changed * to +, as you are probably not interested
in an empty sequence of chars.
To also capture a name that comes last in the source
string, the last part of your regex should match not only a sequence
of whitespace chars, but also the end of the string, so change
(?=\\s+) to (?=\\s+|$).
And a last remark: you don't actually need the parentheses around
the "central" part.
So to sum up, the whole regex can be like this:
(?<=#)\w+(?=\s+|$)
(with the global option, i.e. str_extract_all in stringr).
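A quick base R sketch of that final regex, applied to the chat line from the question plus one made-up line where the name comes last. The names chat and users are assumptions for illustration; gregexpr() plays the role of the global option here.

```r
chat <- c("#Jack #Marie Hi there!", "Ping #Bob")

# All tagged names per line: word chars after "#", followed by whitespace or end of string.
users <- regmatches(chat, gregexpr("(?<=#)\\w+(?=\\s+|$)", chat, perl = TRUE))
users
```

The result is a list with one character vector per input line, so lines with several tags keep all of them.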
Here is a non-regex approach, or rather a minimal-regex approach, since grep still runs the detection of # through the regex engine:
x <- '#This is what I want to extract'
grep('#', strsplit(x, ' ')[[1]], value = TRUE)
#[1] "#This"
Or to avoid strsplit, we can use scan (taken from this answer), i.e.
grep('#', scan(textConnection(x), " "), value=TRUE)
#Read 7 items
#[1] "#This"
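If the leading # should be dropped and no regex engine used at all, something along these lines might do. It is a sketch that assumes tags always start a space-separated token; the names tokens, tags and users are made up.

```r
x <- "#Jack #Marie Hi there!"

# Split on single spaces without regex, keep tokens that start with "#".
tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
tags <- tokens[startsWith(tokens, "#")]

# Drop the leading "#" from each tag.
users <- substring(tags, 2)
users
```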

Grep in R to find words with custom "extended" boundaries

I'm looking for a regular expression to grep whole words, including words separated by digits or underscore. \\b considers digits and underscore as parts of words, not as boundaries.
For example, I'd like to catch MOUSE in "DOG MOUSE CAT", in "DOG MOUSE:CAT" but also in "DOG_MOUSE9CAT" and at the end or the beginning of an expression, as in "MOUSE9CAT" and "DOG_MOUSE". Basically, the boundary I'm looking for is any non-uppercase-alpha character plus beginning and end of line/expression (maybe missing some other cases caught by \\b here).
I've tried:
"[[0-9_]\\b]MOUSE[[0-9_]\\b]"
"[[0-9_]|\\b]MOUSE[[0-9_]|\\b]"
"[$|[^A-Z]]MOUSE[^|[^A-Z]]"
"[?<=^|[^A-Z]]MOUSE[?=$|[^A-Z]]"
None of them work.
I'm actually looking for several words (based on a long vector of values), so the final result should look something like
grep(paste("\\b", paste(searchwords, collapse = "\\b|\\b"), "\\b"), targettext)
(with a different delimiter because \\b is too restrictive for me).
(This is a similar question to the one asked by user Nick Sabbe in a comment here: Using grep in R to find strings as whole words (but not strings as part of words))
Use PCRE regex with lookarounds:
grep("(?<![A-Z])MOUSE(?![A-Z])", targettext, perl=TRUE)
The (?<![A-Z]) negative lookbehind will fail the match if the word is preceded with an uppercase ASCII letter and the negative lookahead (?![A-Z]) will fail the match if the word is followed with an uppercase ASCII letter.
To apply the lookarounds to all the alternatives you have, use an outer grouping (?:...|...).
See the R online demo:
> targettext <- c("DOG MOUSE CAT","DOG MOUSE:CAT","DOG_MOUSE9CAT","MOUSE9CAT","DOG_MOUSE")
> searchwords <- c("MOUSE","FROG")
> grep(paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])"), targettext, perl=TRUE)
[1] 1 2 3 4 5
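To see that the lookarounds also reject the unwanted cases, here is a small sketch with a few made-up negative examples added (DOGMOUSECAT, MOUSETRAP and 2FROGS are not in the original data). It assumes the search words contain no regex metacharacters.

```r
targettext <- c("DOG MOUSE CAT", "DOG_MOUSE9CAT", "DOGMOUSECAT",
                "MOUSETRAP", "2FROGS", "FROG7")
searchwords <- c("MOUSE", "FROG")

# One pair of lookarounds wrapped around all alternatives.
pattern <- paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])")

# value = TRUE returns the matching strings instead of their indices.
matched <- grep(pattern, targettext, perl = TRUE, value = TRUE)
matched
```

Strings where the word touches another uppercase letter on either side are left out, while digits and underscores act as boundaries.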
