grepl() in R using a complex pattern with multiple AND, OR

Is it possible to use a pattern like this (see below) in grepl()?
(poverty OR poor) AND (eradicat OR end OR reduc OR alleviat) AND extreme
The goal is to determine if a sentence meets the pattern using
ifelse(grepl(pattern, x, ignore.case = TRUE),"Yes","No")
For example, if x = "end extreme poverty in the country", it will return "Yes", while if x = "end poverty in the country", it will return "No".
An earlier post here works only for single words, like poor AND eradicat AND extreme, but does not work for my case. Is there any way to achieve my goal?
I tried pattern = "(?=.*poverty|poor)(?=.*eradicat|end|reduce|alleviate)(?=.*extreme)", but it does not work. The error is 'Invalid regexp'.

To combine all three assertions, group the alternative words in each one using a non-capturing group.
^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+
^ Start of string
(?=.*(?:poverty|poor)) Assert either poverty OR poor
(?=.*extreme) Assert extreme
(?=.*(?:eradicat|end|reduc|alleviat)) Assert either eradicat OR end OR reduc or alleviat
.+ Match the whole line for example
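For grepl, you have to set perl = TRUE to enable PCRE, which the lookarounds require. R's default regex engine does not support lookarounds, which is why the attempt in the question fails with 'Invalid regexp'.
grepl('^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+', v, perl = TRUE)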
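As a minimal sketch, here is the same idea applied to the two sentences from the question. The names x and result are illustrative, and the trailing .+ is left out on the assumption that grepl only needs to report whether a match exists, not what was matched.

```r
# Each lookahead is one AND term; the non-capturing group inside it
# holds the OR alternatives for that term.
pattern <- "^(?=.*(?:poverty|poor))(?=.*(?:eradicat|end|reduc|alleviat))(?=.*extreme)"

x <- c("end extreme poverty in the country",
       "end poverty in the country")

# perl = TRUE is needed because lookaheads are a PCRE feature.
result <- ifelse(grepl(pattern, x, ignore.case = TRUE, perl = TRUE), "Yes", "No")
print(result)
```

Note that the grouping matters independently of the engine: in (?=.*poverty|poor) the alternation splits the whole lookahead into ".*poverty" or "poor", so "poor" would only be accepted at the very start of the string.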

Related

Regex includes Lookahead strings in selection

I'm trying to extract the degree (Mild/Moderate/Severe) of a specific type of heart dysfunction (diastolic dysfunction) from a huge number of echo reports.
Here is the link to the sample excel file with 2 of those echo reports.
The lines are usually expressed like this: "Mild LV diastolic dysfunction" or "Mild diastolic dysfunction". Here, "Mild" is what I want to extract.
I wrote the following pattern:
pattern <- regex("(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
ignore_case = FALSE)
Now, let's look at the results (remember I want the "Mild" part not the "LV" part):
str_view_all(df$echo, pattern)
As you can see, in strings like "Mild diastolic dysfunction" the pattern correctly selects "Mild", but in "Mild LV diastolic dysfunction" it selects "LV", even though I have put the lv inside a positive lookahead (?= ( lv)?) construct.
Does anyone know what I am doing wrong?
The problem is that \w+ matches any one or more word chars, and the lookahead does not consume the chars it matches (the regex index remains where it was).
So, the LV gets matched with \w+ as there is diastolic dysfunction right after it, and ( lv)? is an optional group (there may be no space+lv right before diastolic dysfunction), which leaves LV free for the \w+ to match.
If you do not want to match LV, add a negative lookahead to restrict what \w+ matches:
\b(?!lv\b)\w+\b(?=(?:\s+lv)?\s+d(?:[iy]a|i)stolic d[yi]sfunction)
Also, note that [iy] is a better way to write (i|y).
In R, you may define it as
pattern <- regex(
"\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)",
ignore_case = FALSE
)
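A quick way to check the pattern without the spreadsheet is to run it over a few made-up report lines. This sketch uses base R (regexpr and regmatches with perl = TRUE) rather than stringr, and it prepends (?i) on the assumption that the reports write LV in upper case, since a case-sensitive (?!lv\b) would not exclude "LV". The sample strings are invented for illustration.

```r
# Hypothetical report lines, including the spelling variants the pattern allows.
echo <- c("Mild LV diastolic dysfunction",
          "Moderate diastolic dysfunction",
          "Severe lv dyastolic disfunction")

# (?i) makes the whole pattern case-insensitive (an assumption, see above).
pattern <- "(?i)\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)"

# regexpr finds the first match per string; regmatches extracts the matched text.
degree <- regmatches(echo, regexpr(pattern, echo, perl = TRUE))
print(degree)
```

regmatches silently drops strings that have no match, so on real data it may be safer to keep the regexpr result and check for -1 before extracting.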
Because \w+ can also match LV, and the lv part is optional, the pattern ends up selecting LV.
Instead of a lookahead, you can also use a capture group.
\b(?!lv)(\w+)\b (?:lv )?(?:d[iy]astolic|distolic) d[iy]sfunction

Problem with understanding Functional Programming concept

I'm taking a course in functional programming and coming from OOP my brain hurts trying to solve something that I think is quite trivial but I'm just not understanding the concept here. This is an exercise I need to do for school
Given a phrase, in my case it is
"Pattern Matching with Elixir. Remember that equals sign is a match operator, not an assignment"
I need to check starting letters of every word for a matching pattern and apply specific modification depending on pattern.
I'm not even entirely sure what my current code is producing. When I inspect x and y I sort of understand what the loops are doing: they produce a list for every word of the phrase, and that inner list consists of checks against every single letter, nil if the word doesn't start with that letter and the modified word if it does.
The OOP in me wants those loops not to return any "nil" and to return only a single edited word in every iteration. In functional programming I can't break loops and force returns, so I need to think about this in another way.
My question is how should I approach this problem from functional programming perspective?
In the beginning I get a list of words which form a phrase, then I wish to edit every word and in the end get a list again containing these edited words.
Maybe a pseudo-code-like structure on how to tackle this would help me understand underlying concepts.
Here is my current code:
#Task two
def taskTwo() do
  IO.puts "Task Two"
  IO.puts "Pattern Matching with Elixir. Remember that equals sign is a match operator, not an assignment"
  IO.puts "Task Two\n...Editing Words ..."
  phrase = String.downcase("Pattern Matching with Elixir. Remember that equals sign is a match operator, not an assignment") |> String.split()
  x = for word <- phrase do
    checkVowels(word)
  end
  y = for word <- phrase do
    checkConsonants(word)
  end
  IO.inspect x
  IO.inspect y
end
#check vowels
def checkVowels(word) do
  vowels = ["a","e","i","o","u"]
  for vowel <- vowels do
    if String.starts_with?(word, vowel) do
      word <> "ay "
    end
  end
end
#check consonants
def checkConsonants(word) do
  consonants = ["b","c","d","f","g","h","j","k","l","m","n","p","q","r","s","t","v","w","x","z","y"]
  for consonant <- consonants do
    if String.starts_with?(word, consonant) do
      edited = String.replace_prefix(word, consonant, "")
      edited <> consonant <> "ay "
    end
  end
end
Modifications I need to apply: first check the starting letter and apply the modification, then check again and see whether the word contains any of the multi-letter combinations.
Words beginning with consonants should have the consonant moved to the end of the word, followed by "ay".
Words beginning with vowels (aeiou) should have "ay" added to the end of the word.
Some groups of letters are treated like consonants, including "ch", "qu", "squ", "th", "thr", and "sch".
Some groups are treated like vowels, including "yt" and "xr".
One possible way (maybe not the most efficient, but using some of Elixir's strengths) is to use recursion and pattern matching.
Here is one way to solve the first part of your exercise.
defmodule Example do
  @vowels ~w[a e i o u y]
  @phrase "Pattern Matching with Elixir. Remember that equals sign is a match operator, not an assignment"
  def task_two() do
    phrase =
      @phrase
      |> String.downcase()
      |> String.split()
      |> Enum.reverse()
      # splitting the first letter of each word
      |> Enum.map(&String.split_at(&1, 1))
    # pass the phrase and an accumulator as arguments
    result = check_words(phrase, [])
    IO.inspect(result)
  end
  # using recursion to traverse the list
  # when the phrase has no more words, it will match the empty list
  def check_words([], phrase) do
    Enum.join(phrase, " ")
  end
  # pattern match on the first letter and use a guard to decide if it is a vowel
  def check_words([{first_letter, rest_of_word} | rest_of_list], accumulator)
      when first_letter in @vowels do
    # call the function again, with the rest of the list
    new_word = first_letter <> rest_of_word <> "ay"
    check_words(rest_of_list, [new_word | accumulator])
  end
  # when the pattern does not match for vowels, it should be a consonant
  def check_words([{first_letter, rest_of_word} | rest_of_list], accumulator) do
    new_word = rest_of_word <> first_letter <> "ay"
    check_words(rest_of_list, [new_word | accumulator])
  end
end
You will need to skip or handle the punctuation (commas, periods), and possibly make some more tweaks, for this to be fully functional. But I hope it helps as a general idea.
Notes:
Write tests. They will really help you understand what the code is doing.
Elixir is an amazing language. If you come from OOP, it may look strange in the beginning. The book Programming Elixir is really good and will help you a lot if you want to progress quickly.

Is my R Regular Expression matching correctly?

I've struggled with regular expressions in general and recently wrote one that I think is working correctly, but I'm not sure. My question to anyone who takes the time to review my code below - is it theoretically doing what I want it to do?
Purpose: I'm looking through every column in my data set to identify rows that include strings that begin with 'pharmacy - ' followed by any one of 13 drug types and ends with parentheses with a number inside. Here are some examples:
pharmacy - oxycodone/acetaminophen (3)
pharmacy - fentanyl (2.83)
pharmacy - hydromorphone (6.8)
The code I wrote is below. I believe it is working but would appreciate if any regex experts out there could take a look and confirm that it is doing what I think it's supposed to be doing:
viz$med_2 <- apply(viz, 1, function(x)as.integer(any(grep("^pharmacy+[ -]+(codeine|oxycodone|fentanyl|hydrocodone|hydromophone|mathadone|morphine sulfate|oxycodone|oxycontin|roxicodone|tramadol|hydrocodone/acetaminophen|oxycodone/acetaminophen)+[ -]+[(]+[0-9]+", x))))
No expert, but your expression looks great, I would maybe just slightly modify that to:
^pharmacy\s*-\s*(codeine|oxycodone|fentanyl|hydrocodone|hydromophone|mathadone|morphine sulfate|oxycodone|oxycontin|roxicodone|tramadol|hydrocodone\/acetaminophen|oxycodone\/acetaminophen)\s*\(\s*[0-9]+(\.[0-9]+)?\s*\)$
Make sure to apply the escaping R requires: inside an R string literal, each backslash in the expression above has to be doubled.
You need to escape special characters (with a double backslash \\ in R) or the regex will throw an error.
In regex, + means match the preceding character or group one or more times. So pharmacy+ matches pharmac followed by one or more y, which is probably unnecessary.
I'd recommend using \\s instead of a simple whitespace. \\s matches any whitespace character [ \t\r\n\f] and is therefore more versatile.
Here's how I would do it.
viz <- data.frame(
  med_2 = c(
    "pharmacy - oxycodone/acetaminophen (3)",
    "pharmacy - fentanyl (2.83)",
    "pharmacy - hydromorphone (6.8)"
  )
)
# list of the different drug names
drugs_ls <- c(
  "codeine",
  "oxycodone",
  "fentanyl",
  "hydrocodone",
  "hydromophone",
  "mathadone",
  "morphine sulfate",
  "oxycontin",
  "roxicodone",
  "tramadol",
  "acetaminophen"
)
# concatenate and separate drug names with a pipe
drugs_re <- paste0(drugs_ls, collapse = "|")
# generate the regex
med_re <- paste0("^(?i)pharmacy[\\s-]+(?:", drugs_re, ")(?:\\/acetaminophen)?[\\s-]+\\(\\d")
viz$med_2 <- apply(viz, 1, function(x) as.integer(any(grep(med_re, x, perl = TRUE))))
viz
# med_2
#1 1
#2 1
#3 0
The whole regex looks like this:
^(?i)pharmacy[\\s-]+(?:codeine|oxycodone|fentanyl|hydrocodone|hydromophone|mathadone|morphine sulfate|oxycontin|roxicodone|tramadol|acetaminophen)(?:\\/acetaminophen)?[\\s-]+\\(\\d
(?i) makes the regex case insensitive.
(?:) creates a non-capturing group.
? makes the preceding character or group optional (it matches once or not at all).
\\d is a shorthand for [0-9].
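The 0 in the third row of the output above comes from the drug list rather than from the regex: the list spells the name "hydromophone", while the data says "hydromorphone". The sketch below is one way to combine both answers, assuming "hydromorphone" and "methadone" are the intended spellings and that the string should end with a parenthesised integer or decimal. The names drugs, med_re, x and hits are illustrative, and the "aspirin" row is an invented negative example.

```r
# Drug names with the assumed spelling corrections (hydromorphone, methadone).
drugs <- c("codeine", "oxycodone", "fentanyl", "hydrocodone", "hydromorphone",
           "methadone", "morphine sulfate", "oxycontin", "roxicodone", "tramadol")
drugs_re <- paste(drugs, collapse = "|")

# "pharmacy - <drug>[/acetaminophen] (<number>)", case-insensitive, anchored at both ends.
med_re <- paste0("(?i)^pharmacy\\s*-\\s*(?:", drugs_re, ")(?:/acetaminophen)?",
                 "\\s*\\(\\s*\\d+(?:\\.\\d+)?\\s*\\)$")

x <- c("pharmacy - oxycodone/acetaminophen (3)",
       "pharmacy - fentanyl (2.83)",
       "pharmacy - hydromorphone (6.8)",
       "pharmacy - aspirin (1)")

# grepl returns one logical per string, so no any()/grep() wrapper is needed.
hits <- grepl(med_re, x, perl = TRUE)
print(as.integer(hits))
```

Like the answer above, this accepts "/acetaminophen" after any drug in the list, which is looser than the two combinations named in the question.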

R Regex to identify and replace characters between multiple dots

I have the following codes
"ABC.A.SVN.10.10.390.10.UDGGL"
"XYZ.Z.SVN.11.12.111.99.ASDDL"
and I need to replace the characters that exist between the 2nd and the 3rd dot. In this case it is SVN but it may well be any combination of between A and ZZZ, so really the only way to make this work is by using the dots.
The required outcome would be:
"ABC.A..10.10.390.10.UDGGL"
"XYZ.Z..11.12.111.99.ASDDL"
I tried variants of grep("^.+(\\.\\).$", "ABC.A.SVN.10.10.390.10.UDGGL") but I get an error.
EDIT
I tried @Onyambu's first method and ran into a variant that I had not accounted for: "ABC.A.AB11.1.12.112.1123.UDGGL". The part to be replaced can also contain numeric values. The desired outcome is "ABC.A..1.12.112.1123.UDGGL", and I get it using sub("\\.\\w+.\\B.",".",x), per the second part of his answer!
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("^(?:[^.]*\\.){2}\\K[^.]*", "", x, perl=T)
^ Assert position at the start of the line
(?:[^.]*\.){2} Match the following exactly twice
[^.]*\. Match any character except . any number of times, followed by .
\K Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
[^.]* Match any character except . any number of times
Results in [1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
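Because this pattern counts dots rather than looking at the characters in the field, it should also handle the alphanumeric variant mentioned in the question's edit. A small sketch follows; the helper replace_nth_field is a hypothetical name introduced here for illustration and is not part of the answer.

```r
x <- c("ABC.A.SVN.10.10.390.10.UDGGL",
       "XYZ.Z.SVN.11.12.111.99.ASDDL",
       "ABC.A.AB11.1.12.112.1123.UDGGL")

# Skip two "field." groups, discard them from the match with \K,
# then replace whatever sits before the next dot.
cleared <- sub("^(?:[^.]*\\.){2}\\K[^.]*", "", x, perl = TRUE)
print(cleared)

# Hypothetical generalisation: replace the n-th dot-separated field.
replace_nth_field <- function(x, n, replacement = "") {
  pattern <- sprintf("^(?:[^.]*\\.){%d}\\K[^.]*", n - 1)
  sub(pattern, replacement, x, perl = TRUE)
}
print(replace_nth_field(x, 3, "SUBSTITUTION"))
```

\K is a PCRE feature, so perl = TRUE is required here as well.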
x = c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("([A-Z]+)(\\.\\d+)","\\2",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
([A-Z]+) Capture any run of the characters A-Z.
(\\.\\d+) The captured word above must be followed by a dot, i.e. \\., and this dot is then followed by digits, i.e. \\d+. This completes the second capture.
So for the string "ABC.A.SVN.10.10.390.10.UDGGL" the matched part is SVN.10, since this is the first part that matches the regular expression. It is captured as two groups, SVN and .10. The backreference \\2 then replaces the whole of SVN.10 with the second group, .10.
Another logic that will work:
sub("\\.\\w+.\\B.",".",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
Not exactly regex but here is one more approach
#DATA
S = c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sapply(X = S,
       FUN = function(str){
           ind = unlist(gregexpr("\\.", str))[2:3]
           paste(c(substring(str, 1, ind[1]),
                   "SUBSTITUTION",
                   substring(str, ind[2], )), collapse = "")
       },
       USE.NAMES = FALSE)
#[1] "ABC.A.SUBSTITUTION.10.10.390.10.UDGGL" "XYZ.Z.SUBSTITUTION.11.12.111.99.ASDDL"

How to get the front part of some words in R?

I am trying to get the front part of some words in a sentence.
For example the sentence is -
x <- c("Ace Bayou Reannounces Recall of Bean Bag Chairs Due to Low Rate of Consumer Response; Two Child Deaths Previously Reported; Consumers Urged to Install Repair", "Panasonic Recalls Metal Cutter Saws Due to Laceration Hazard")
Now I want to get the part in front of Recall or Recalls. I have tried various functions from R like grep, grepl, pmatch and str_split. However, I could not get exactly what I want.
What I need is only
"Ace Bayou Reannounces"
"Panasonic"
Any help would be appreciated.
If you want to use regular expressions:
gsub(pattern = "(.*(?=Recall(s*)))(.*)", replacement = "\\1", x = x, perl = T)
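The call above appears to keep the space that precedes "Recall", so the results end in a trailing blank. As an alternative sketch (not the answerer's method), the pattern below deletes everything from the whitespace before the first whole word Recall or Recalls to the end of the string; the name front is illustrative.

```r
x <- c("Ace Bayou Reannounces Recall of Bean Bag Chairs Due to Low Rate of Consumer Response; Two Child Deaths Previously Reported; Consumers Urged to Install Repair",
       "Panasonic Recalls Metal Cutter Saws Due to Laceration Hazard")

# \\s* swallows the space before the keyword; \\b...\\b restricts the match to the
# whole word "Recall" or "Recalls", so "Reannounces" is left alone.
front <- sub("\\s*\\bRecalls?\\b.*$", "", x, perl = TRUE)
print(front)
```

Strings that do not contain Recall or Recalls are returned unchanged, which may or may not be what is wanted for the full data set.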
