How to search for strings with parentheses in R

Using R, I have a long list of keywords that I'm searching for in a dataset. One of the keywords needs to have parentheses around it in order to be included.
I've been attempting to replace the parenthesis in the keywords list with \\ then the parentheses, but have not been successful. If there is a way to modify the grepl() function to recognize them, that would also be helpful. Here is an example of what I'm trying to accomplish:
patterns<-c("dog","cat","(fish)")
data<-c("brown dog","black bear","salmon (fish)","red fish")
patterns2<- paste(patterns,collapse="|")
grepl(patterns2,data)
[1] TRUE FALSE TRUE TRUE
I would like salmon (fish) to give TRUE, and red fish to give FALSE.
Thank you!

As noted by @joran in the comments, the pattern should look like this:
patterns<-c("dog","cat","\\(fish\\)")
The double backslashes (\\) tell R to read the parentheses literally when searching for the pattern.
The easiest way to achieve this, if you don't want to make the change manually, is:
patterns <- gsub("([()])","\\\\\\1", patterns)
Which will result in:
[1] "dog" "cat" "\\(fish\\)"
If you're not very familiar with regular expressions, here is what happens: the pattern looks for any one character listed inside the square brackets, i.e. ( or ). The round brackets around that capture whatever was matched. In the second argument, the first four backslashes produce a single literal backslash in the result (R's string parser halves them to two, and the regex replacement halves them again to one), and the final \\1 inserts whatever was captured, i.e. either ( or ). R prints that single backslash as \\, which is why the output above shows "\\(fish\\)".
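As a minimal sketch of the escaping step described above, the following reuses the question's patterns and data. The variable names escaped and result are introduced here for illustration only. writeLines() is used because it shows the raw characters of each string, which makes the single backslash visible.

```r
patterns <- c("dog", "cat", "(fish)")
data <- c("brown dog", "black bear", "salmon (fish)", "red fish")

# Prefix every literal parenthesis with one backslash.
escaped <- gsub("([()])", "\\\\\\1", patterns)

# writeLines() prints the raw characters, so each parenthesis
# appears with a single backslash in front of it.
writeLines(escaped)

# Join the escaped patterns into one alternation and search.
result <- grepl(paste(escaped, collapse = "|"), data)
print(result)
```

With the parentheses escaped, "red fish" should no longer match, because the pattern now requires the literal characters ( and ) around fish.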

Another option is to forget regex and use grepl with fixed = TRUE:
rowSums(sapply(patterns, grepl, data, fixed = TRUE)) > 0
# [1] TRUE FALSE TRUE FALSE
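The fixed-string approach can also be wrapped in a small helper. This is only a sketch: match_any_fixed is a hypothetical name, not a base R function, and it assumes at least one pattern is supplied. It ORs together one literal grepl() result per pattern instead of building a matrix.

```r
# Hypothetical helper: TRUE where any of the literal patterns occurs in x.
# Assumes length(patterns) >= 1.
match_any_fixed <- function(patterns, x) {
  # One logical vector per pattern, each the same length as x.
  per_pattern <- lapply(patterns, grepl, x = x, fixed = TRUE)
  # Element-wise OR across all patterns.
  Reduce(`|`, per_pattern)
}

patterns <- c("dog", "cat", "(fish)")
data <- c("brown dog", "black bear", "salmon (fish)", "red fish")

found <- match_any_fixed(patterns, data)
print(found)
```

Because fixed = TRUE disables regular expressions entirely, no escaping of the parentheses is needed here.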

Related

I need help figuring out why my regex does not match with what I am looking for

I am working on a R script aiming to check if a data.frame is correctly made and contains the right information at the right place.
I need to make sure a row contains the right information, so I want to use a regular expression to compare with each case of said row.
I thought maybe it did not work because I compared the regex to the value by calling the value directly from the table, but it did not work.
I used regex101.com to make sure my regular expression was correct, and it matched when the test string was put between quotes.
Then I added as.character() to the value, but it came out FALSE.
To sum up, the regex works on regex101.com, but never did on my R script
test = c("b40", "b40")
".[ab][0-8]{2}." == test[1]
FALSE
I expect the output to be TRUE, but it is always FALSE
The == operator tests whether two strings are exactly equal; it does not interpret the left-hand side as a regular expression. For pattern matching, we can use grepl:
grepl("^[ab][0-8]{2}", test[1])
#[1] TRUE
Here, we match either 'a' or 'b' at the start (^) of the string followed by two digits ranging from 0 to 8 (if it should be at the end - use $)
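To make the difference concrete, here is a small sketch comparing ==, a start-anchored pattern, and a pattern anchored at both ends. The extra values "c99" and "b401" are made up for illustration and are not from the question.

```r
test <- c("b40", "b40", "c99", "b401")

# == compares the two strings character by character; no regex is involved.
literal_equal <- "[ab][0-8]{2}" == test[1]

# Anchoring only the start accepts trailing characters as well.
prefix_match <- grepl("^[ab][0-8]{2}", test)

# Anchoring both ends makes grepl() behave like a full-string match.
full_match <- grepl("^[ab][0-8]{2}$", test)

print(literal_equal)
print(prefix_match)
print(full_match)
```

If each cell must consist of exactly one letter and two digits, the version anchored with both ^ and $ is probably the one to use.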

Regex to find words from list, when specific words not appear 3 words before

I want to find all matches of specific words from a list, but only when certain other words do not appear within the 3 words before them.
For example:
Find every occurrence of the words "good|best|better" in the text, as long as none of the words "no|not|none" appears within the 3 words before it.
I tried something like that:
(?<!\sno|\snot(\s|\s\w\s|\s\w\s\w\s))(\bgood\b|\bbest\b|\bbetter\b)
But it's not working.
You may be able to use this PCRE regex in R with perl=TRUE option:
\b(?:not?|none)(?:\s+\S+){0,2}\s+(good|best|better)\b(*SKIP)(*F)|\b(?:good|best|better)\b
In your R code use:
gregexpr("\\b(?:not?|none)(?:\\s+\\S+){0,2}\\s+(good|best|better)\\b(*SKIP)(*F)|\\b(?:good|best|better)\\b", mystr, perl=TRUE)
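In PCRE, the verbs (*SKIP)(*F) discard the text matched so far and make the engine resume searching after it. The first alternative therefore consumes the cases we don't want, and only the second alternative produces matches.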
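A short sketch of how the matches could be extracted with regmatches(). The sample sentence in mystr is invented for illustration; the pattern is the one given above.

```r
# Invented sample text: the second "good" follows "not" within three words.
mystr <- "a good plan, not a very good one, simply the best"

pattern <- paste0(
  # Unwanted case: a negation, up to two words, then a target word.
  "\\b(?:not?|none)(?:\\s+\\S+){0,2}\\s+(good|best|better)\\b(*SKIP)(*F)",
  # Wanted case: any remaining target word.
  "|\\b(?:good|best|better)\\b"
)

# gregexpr() needs perl = TRUE for the PCRE verbs to be recognized.
hits <- regmatches(mystr, gregexpr(pattern, mystr, perl = TRUE))[[1]]
print(hits)
```

Here the first "good" and the final "best" should be returned, while the "good" that follows "not a very" is skipped.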
If we only wanted to reject strings that contain no and its variants, we could start with a simple expression such as:
^(?!.*no).*times.*$
Then we would add word boundaries where necessary and expand that to:
^(?!.*\bno\b|.*\bnot\b|.*\bnone\b).*times.*$
and finally we would add our desired words using:
^(?!.*\bno\b|.*\bnot\b|.*\bnone\b)(?=.*\bgood\b|.*\bbest\b|.*\bbetter\b).*times.*$

Using str_detect to select items between curly brackets

I have a column of items like this
{apple}
{orange}>s>
{pine--apple}
{kiwi}
{strawberry}>s>
I would like to filter it so that I only get items that are NOT just a word between brackets (but have other stuff before or after the bracket), so in this example I would like to select these two:
{orange}>s>
{strawberry}>s>
I have tried the following code using dplyr and stringr, but even though the regular expression works as expected on https://regexr.com/, in R it does not (it just selected rows in which the var column is empty). What am I doing wrong?
d_filtered <- d %>%
filter(!str_detect(var, "\\{(.*?)\\}"))
Your pattern is saying "match anything where there are brackets, with or without stuff between them". Then you negate it with !, so you filter out anything that has a { followed by a } anywhere in the string.
It sounds like you want to keep strings that have something before or after the brackets, so let's match that. A . matches any (single) character, so a pattern for "something before the opening bracket" is ".\\{". Similarly, a pattern for "something after the closing bracket" is "\\}.". We can connect them with | for "or". In your filter, use
filter(str_detect(var, ".\\{|\\}."))
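The same filter can be phrased the other way round: drop the strings that consist of exactly one bracketed token. The following base R sketch assumes the column has been pulled out into a character vector called var; the names only_bracketed and kept are introduced here for illustration.

```r
var <- c("{apple}", "{orange}>s>", "{pine--apple}", "{kiwi}", "{strawberry}>s>")

# TRUE when the whole string is a single {...} token with nothing
# before the opening bracket or after the closing bracket.
only_bracketed <- grepl("^\\{[^{}]*\\}$", var)

# Keep everything that is not just a bracketed token.
kept <- var[!only_bracketed]
print(kept)
```

On this sample data the anchored pattern and the ".\\{|\\}." pattern select the same rows; they can differ on strings that contain more than one pair of brackets.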
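Alternatively, this tests whether every character of each element is a letter, { or }, and keeps the elements that fail the test. Note that it also keeps "{pine--apple}", because the hyphens are not in the allowed set:
cl <- c("{apple}",
        "{orange}>s>",
        "{pine--apple}",
        "{kiwi}",
        "{strawberry}>s>")
# TRUE when every character of x is a letter, "{" or "}"
find <- function(x) {
  x <- unlist(strsplit(x, ""))
  poss <- c(letters, LETTERS, "{", "}")
  all(x %in% poss)
}
# keep only the elements that contain some other character
cl <- cl[!sapply(cl, find)]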
One can also use grep of base R:
> d = c("<s{apple}", "{orange}>s>", "{pine--apple}", "{kiwi}", "{strawberry}>s>")
# I have added "<s" before {apple} in above vector
> d[grep(".\\{|}.", d)]
[1] "<s{apple}" "{orange}>s>" "{strawberry}>s>"

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector: Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021"). I want to remove anything before SS and anything after T0 (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can keep a pattern and remove everything outside it in two steps. Step 1 is to put the part we'd like to keep in parentheses, which makes it a capture group. Step 2 is to refer to that group by number in the replacement, since a pattern may contain several groups. For example:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
By the way, in .* the dot matches any single character and the * repeats it zero or more times, so .* matches any run of characters. If you'd like to learn more about using regex in R, here's a reference that can get you started.
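One thing to watch with the capture-group approach: when the pattern does not match at all, sub() and gsub() return the input unchanged rather than an empty string. The sketch below adds a made-up third element, "no_marker_here", to show this, and uses grepl() with the same pattern to drop such elements. The names pattern, has_marker and extracted are introduced here for illustration.

```r
# The third element is invented; it contains neither SS nor T0.
Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021", "no_marker_here")

pattern <- "^.*(SS.*)T0.*$"

# Replace the whole string with the first capture group.
extracted <- sub(pattern, "\\1", Target)
print(extracted)

# Keep only the results that came from an actual match.
has_marker <- grepl(pattern, Target)
print(extracted[has_marker])
```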

grepl not searching correctly in R

I want to search ".com" in a vector, but grepl isn't working out for me. Anyone know why? I am doing the following
vector <- c("fdsfds.com","fdsfcom")
grepl(".com",vector)
This returns
[1] TRUE TRUE
I want it to strictly refer to "fdsfds.com"
As @user20650 said in the comments above, use grepl("\\.com", vector). The dot (.) is a special character in regular expressions that matches any character, so it matches the second "f" in "fdsfcom". The "\\" before the . escapes the dot so that it is treated literally. Alternatively, you could use grepl(".com", vector, fixed = TRUE), which searches for the literal string instead of a regular expression.
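For comparison, here is a sketch running the unescaped pattern next to three ways of matching a literal dot. The "[.]" character-class form is an additional option not mentioned above, and the result variable names are introduced for illustration.

```r
vector <- c("fdsfds.com", "fdsfcom")

# Unescaped: the dot matches any character, so both elements match.
unescaped <- grepl(".com", vector)

# Escaped with a backslash (written as \\ inside an R string).
escaped <- grepl("\\.com", vector)

# Inside a character class the dot is literal.
char_class <- grepl("[.]com", vector)

# fixed = TRUE turns off regular expressions altogether.
literal <- grepl(".com", vector, fixed = TRUE)

print(unescaped)
print(escaped)
print(char_class)
print(literal)
```

All three literal variants should agree; which one to use is mostly a matter of readability.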
