R sanitize pattern for regular expression detection [duplicate] - r

Here is sample string
x<-"My name is XYZ, I'm from ABc, working at PQR"
and want to detect "," in the string and using two forms:
> str_detect(x,",")
[1] TRUE
>
> str_detect(x,fixed(","))
[1] TRUE
Both returning same result. Then what is the difference b/w these two?

For this, we may need to use a different example with regex. Here, we are trying to check whether the upper case letter 'R' is at the end ($) of the string. With fixed as wrapper, it checks whether we have R$ as characterand without it, it evaluates$` as the end of the string as it is a metacharacter.
str_detect(x,fixed("R$"))
#[1] FALSE
str_detect(x,"R$")
#[1] TRUE
The , is not a metacharacter and is evaluated as , whether we are using with fixed or without fixed. In general, if we are specifically looking for finding the literal character, use the fixed wrapper and it should be fast as well.

Related

Crazy unexpected behavior of grepl

What explains the following very unexpected behavior of grepl?
I am using grepl for basic string matching here, and I think the default behavior as illustrated below is dangerous.
> grepl('a','a')
[1] TRUE
> grepl('a ()','a ()')
[1] TRUE
> grepl('a (b)','a (b)')
[1] FALSE
Adding fixed=TRUE fixes it. The documentation says:
pattern: character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector.
The average user should get from the message above that the default usage of grepl is NOT string matching but regular expression matching, which is not super clear. Someone unaware of regular expressions may not realize the dangers of leaving fixed to its default value. I think a warning should be added about this.
Posting here mainly to alert the community about this behavior. It took me a couple of hours of debugging to narrow down the issue I was experiencing in my Shiny app to this function. I would have never thought that grepl could be dangerous like this.
pattern: a ()
Breakdown: An a followed by a space and then a captured null/empty character ie Nothing.
The a and space matches the first part of the string. Thus the WHOLE pattern can be found in the string. RESULTS in TRUE
second part:
pattern: a (b)
Breakdown. Literally means a b ie a then space then b. But we capture the b hence the parenthesis around b.
String has a (b). Since b does not follow the space, the whole pattern cannot be obtained in the string hence FALSE

I need help figuring out why my regex does not match with what I am looking for

I am working on a R script aiming to check if a data.frame is correctly made and contains the right information at the right place.
I need to make sure a row contains the right information, so I want to use a regular expression to compare with each case of said row.
I thought maybe it did not work because I compared the regex to the value by calling the value directly from the table, but it did not work.
I used regex101.com to make sure my regular expression was correct, and it matched when the test string was put between quotes.
Then I added as.character() to the value, but it came out FALSE.
To sum up, the regex works on regex101.com, but never did on my R script
test = c("b40", "b40")
".[ab][0-8]{2}." == test[1]
FALSE
I expect the output to be TRUE, but it is always FALSE
The == is for fixed full string match and not used for substring match. For that, we can use grep
grepl("^[ab][0-8]{2}", test[1])
#[1] TRUE
Here, we match either 'a' or 'b' at the start (^) of the string followed by two digits ranging from 0 to 8 (if it should be at the end - use $)

grepping special characters in R

I have a variable named full.path.
And I am checking if the string contained in it is having certain special character or not.
From my code below, I am trying to grep some special character. As the characters are not there, still the output that I get is true.
Could someone explain and help. Thanks in advance.
full.path <- "/home/xyz"
#This returns TRUE :(
grepl("[?.,;:'-_+=()!##$%^&*|~`{}]", full.path)
By plugging this regex into https://regexr.com/ I was able to spot the issue: if you have - in a character class, you will create a range. The range from ' to _ happens to include uppercase letters, so you get spurious matches.
To avoid this behaviour, you can put - first in the character class, which is how you signal you want to actually match - and not a range:
> grepl("[-?.,;:'_+=()!##$%^&*|~`{}]", full.path)
[1] FALSE

How to search for strings with parentheses in R

Using R, I have a long list of keywords that I'm searching for in a dataset. One of the keywords needs to have parentheses around it in order to be included.
I've been attempting to replace the parenthesis in the keywords list with \\ then the parentheses, but have not been successful. If there is a way to modify the grepl() function to recognize them, that would also be helpful. Here is an example of what I'm trying to accomplish:
patterns<-c("dog","cat","(fish)")
data<-c("brown dog","black bear","salmon (fish)","red fish")
patterns2<- paste(patterns,collapse="|")
grepl(patterns2,data)
[1] TRUE FALSE TRUE TRUE
I would like salmon (fish) to give TRUE, and red fish to give FALSE.
Thank you!
As noted by #joran in the comments, the pattern should look like so:
patterns<-c("dog","cat","\\(fish\\)")
The \\s will tell R to read the parentheses literally when searching for the pattern.
Easiest way to achieve this if you don't want to make the change manually:
patterns <- gsub("([()])","\\\\\\1", patterns)
Which will result in:
[1] "dog" "cat" "\\(fish\\)"
If you're not very familiar with regular expressions, what happens here is that it looks for any one character within the the square brackets. The round brackets around that tell it to save whatever it finds that matches the contents. Then, the first four slashes in the second argument tell it to replace what it found with two slashes (each two slashes translate into one slash), and the \\1 tells it to add whatever it saved from the first argument - i.e., either ( or ).
Another option is to forget regex and use grepl with fixed = T
rowSums(sapply(patterns, grepl, data, fixed = T)) > 0
# [1] TRUE FALSE TRUE FALSE

grepl not searching correctly in R

I want to search ".com" in a vector, but grepl isn't working out for me. Anyone know why? I am doing the following
vector <- c("fdsfds.com","fdsfcom")
grepl(".com",vector)
This returns
[1] TRUE TRUE
I want it to strictly refer to "fdsfds.com"
As #user20650 said in the comments above, use grepl("\\.com",vector). the dot (.) is a special character in regular expressions that matches any character, so it's matching the second "f" in "fdsfcom". The "\\" before the . "escapes" the dot so it's treated literally. Alternatively, you could use grepl(".com",vector, fixed = TRUE), which searches literally, not using regular expressions.

Resources