Crazy unexpected behavior of grepl - r

What explains the following very unexpected behavior of grepl?
I am using grepl for basic string matching here, and I think the default behavior as illustrated below is dangerous.
> grepl('a','a')
[1] TRUE
> grepl('a ()','a ()')
[1] TRUE
> grepl('a (b)','a (b)')
[1] FALSE
Adding fixed=TRUE fixes it. The documentation says:
pattern: character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector.
The average user should get from the message above that the default usage of grepl is NOT string matching but regular expression matching, which is not super clear. Someone unaware of regular expressions may not realize the dangers of leaving fixed to its default value. I think a warning should be added about this.
Posting here mainly to alert the community about this behavior. It took me a couple of hours of debugging to narrow down the issue I was experiencing in my Shiny app to this function. I would have never thought that grepl could be dangerous like this.

pattern: a ()
Breakdown: an a, followed by a space, followed by an empty capture group, which matches nothing.
The a and the space match the start of the string, and the empty group matches the empty string right after them. Thus the WHOLE pattern can be found in the string, which results in TRUE.
Second part:
pattern: a (b)
Breakdown: as a regular expression this literally means a b, i.e. a, then a space, then b. The parentheses are not matched as characters; they only mark b as a capture group.
The string is a (b). There the space is followed by ( rather than b, so the whole pattern cannot be found in the string, hence FALSE.
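
To make the breakdown above concrete, here is a minimal sketch comparing the default regex interpretation with two ways of matching the parentheses literally. The variable name x is only for illustration.

```r
x <- "a (b)"

# Default: the pattern is a regular expression, so "(b)" is a capture group
# and the pattern effectively means "a b".
grepl("a (b)", x)                 # FALSE
grepl("a (b)", "a b")             # TRUE

# Literal matching: treat the pattern as a plain string.
grepl("a (b)", x, fixed = TRUE)   # TRUE

# Or keep regex matching and escape the parentheses.
grepl("a \\(b\\)", x)             # TRUE
```

If the pattern comes from user input (as it might in a Shiny app), fixed = TRUE is probably the safer default, since no character in the input is then treated specially.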

Related

I need help figuring out why my regex does not match with what I am looking for

I am working on a R script aiming to check if a data.frame is correctly made and contains the right information at the right place.
I need to make sure a row contains the right information, so I want to use a regular expression to compare with each case of said row.
I thought maybe it did not work because I compared the regex to the value by calling the value directly from the table, but it did not work.
I used regex101.com to make sure my regular expression was correct, and it matched when the test string was put between quotes.
Then I added as.character() to the value, but it came out FALSE.
To sum up, the regex works on regex101.com, but never in my R script.
test = c("b40", "b40")
".[ab][0-8]{2}." == test[1]
FALSE
I expect the output to be TRUE, but it is always FALSE
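The == operator tests for exact equality between two strings; it does not treat either side as a regular expression, so it cannot be used for pattern or substring matching. For that, we can use grepl: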
grepl("^[ab][0-8]{2}", test[1])
#[1] TRUE
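Here, we match either 'a' or 'b' at the start (^) of the string, followed by two digits in the range 0 to 8 (to anchor the match at the end of the string, use $). Note that the original pattern also has a . on each side, which requires one extra character before and after, so it would not match "b40" even with grepl.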
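
As a rough illustration of the difference between == and grepl, and of the two anchors, the following sketch reuses the test vector from the question. The extra input "b409" is made up for the example.

```r
test <- c("b40", "b40")

# == compares the two strings as they are; no regex is involved.
".[ab][0-8]{2}." == test[1]          # FALSE

# grepl() treats its first argument as a regular expression.
grepl("^[ab][0-8]{2}", test[1])      # TRUE: starts with a/b and two digits 0-8
grepl("^[ab][0-8]{2}$", test[1])     # TRUE: the whole string has this shape
grepl("^[ab][0-8]{2}$", "b409")      # FALSE: extra trailing character

# The original pattern requires one character before and one after,
# which the three-character string "b40" does not have.
grepl(".[ab][0-8]{2}.", test[1])     # FALSE
```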

grepping special characters in R

I have a variable named full.path.
And I am checking whether the string it contains has certain special characters or not.
In my code below, I am trying to grep for some special characters. Although none of these characters are present, the output that I get is still TRUE.
Could someone explain and help. Thanks in advance.
full.path <- "/home/xyz"
#This returns TRUE :(
grepl("[?.,;:'-_+=()!##$%^&*|~`{}]", full.path)
By plugging this regex into https://regexr.com/ I was able to spot the issue: if you have - in a character class, you will create a range. The range from ' to _ covers, in ASCII order, the digits, the uppercase letters, and several punctuation characters including /, so the slashes in /home/xyz give a spurious match.
To avoid this behaviour, you can put - first (or last) in the character class, which is how you signal you want to actually match - and not a range:
> grepl("[-?.,;:'_+=()!##$%^&*|~`{}]", full.path)
[1] FALSE
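
The effect of the hyphen can be checked in isolation with a smaller class. This is only a sketch; it assumes ranges are compared by code point, which is consistent with the result reported in the question.

```r
path <- "/home/xyz"

# Code points: "/" (47) lies between "'" (39) and "_" (95).
utf8ToInt("'")
utf8ToInt("/")
utf8ToInt("_")

# "'-_" is read as a range, so the slash matches.
grepl("['-_]", path)      # TRUE

# A hyphen placed first or last in the class is a literal hyphen.
grepl("[-'_]", path)      # FALSE
grepl("['_-]", path)      # FALSE
grepl("[-'_]", "a-b")     # TRUE
```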

How to search for strings with parentheses in R

Using R, I have a long list of keywords that I'm searching for in a dataset. One of the keywords needs to have parentheses around it in order to be included.
I've been attempting to replace the parenthesis in the keywords list with \\ then the parentheses, but have not been successful. If there is a way to modify the grepl() function to recognize them, that would also be helpful. Here is an example of what I'm trying to accomplish:
patterns<-c("dog","cat","(fish)")
data<-c("brown dog","black bear","salmon (fish)","red fish")
patterns2<- paste(patterns,collapse="|")
grepl(patterns2,data)
[1] TRUE FALSE TRUE TRUE
I would like salmon (fish) to give TRUE, and red fish to give FALSE.
Thank you!
As noted by @joran in the comments, the pattern should look like so:
patterns<-c("dog","cat","\\(fish\\)")
The \\ before each parenthesis tells R to read the parentheses literally when searching for the pattern.
The easiest way to achieve this, if you don't want to make the change manually:
patterns <- gsub("([()])","\\\\\\1", patterns)
Which will result in:
[1] "dog" "cat" "\\(fish\\)"
If you're not very familiar with regular expressions, what happens here is that it looks for any one character within the square brackets. The round brackets around that tell it to save whatever it finds that matches the contents. Then, the first four backslashes in the second argument tell it to insert one literal backslash (R's string parser turns each pair of backslashes into one, and the replacement syntax then turns the remaining pair into one), and the \\1 tells it to add whatever it saved from the first argument, i.e. either ( or ). The result prints as \\( because print() escapes the backslash again.
Another option is to forget regex and use grepl with fixed = TRUE on the original, unescaped patterns:
rowSums(sapply(patterns, grepl, data, fixed = TRUE)) > 0
# [1] TRUE FALSE TRUE FALSE
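
Putting the two approaches side by side, a minimal end-to-end sketch might look like this. The names escaped and hits are introduced here for illustration and are not part of the original answers; vapply is used in place of sapply only to make the result type explicit.

```r
patterns <- c("dog", "cat", "(fish)")
data <- c("brown dog", "black bear", "salmon (fish)", "red fish")

# Option 1: escape the parentheses, then combine into one regex.
escaped <- gsub("([()])", "\\\\\\1", patterns)
cat(escaped, sep = "\n")     # cat() shows the actual characters: \(fish\)
grepl(paste(escaped, collapse = "|"), data)
# TRUE FALSE TRUE FALSE

# Option 2: skip regex and test each keyword as a literal string.
hits <- vapply(patterns, function(p) grepl(p, data, fixed = TRUE),
               logical(length(data)))
rowSums(hits) > 0
# TRUE FALSE TRUE FALSE
```

Option 1 only escapes parentheses; keywords containing other metacharacters such as . or + would need those escaped as well, which is one reason Option 2 may be easier to maintain.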

How to write a regex OR statement within strapply in R

I have been using strapplyc in R to select different portions of a string that match one particular set of criteria. These have worked successfully until I found a portion of the string where the required portion could be defined one of two ways.
Here is an example of the string which is liberally sprinkled with \t:
\t\t\tsome words here\t\t\tDefect: some more words here Action: more words
I can write the strapply statement to capture the text between Defect: and the start of Action:
strapplyc(record[i], "Defect:(.*?)Action")
This works and selects the chosen text between Defect: and Action. In some cases there is no action section to the string and I've used the following code to capture these cases.
strapplyc(record[i], "Defect:(.*?)$")
What I have been trying to do is capture the text that either ends with Action, or with the end of the string (using $).
This is the bit that keeps failing. It returns nothing for either option. Here is my failing code:
strapplyc(record[i], "Defect:(.*?)Action|$")
Any idea where I'm going wrong, or a better solution would be much appreciated.
If you are up for a more efficient solution, you could drop the .*? matching and unroll your pattern like:
Defect:((?:[^A]+|A(?!ction))*)
This matches Defect: followed by any number of characters that are either not an A, or are an A that is not followed by ction. This avoids the step-by-step expansion that lazy dot matching requires. It works for both cases, as it stops matching when it hits Action or the end of your string.
As suggested by Wiktor, you can also use
Defect:([^A]*(?:A(?!ction)[^A]*)*)
Which is a little bit faster when there are many As in the string.
You might want to consider using A(?!ction:) or A(?!ction\s*:) to avoid false early matches.
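
For readers without gsubfn at hand, the unrolled pattern can be tried in base R roughly as follows. This is a sketch, not the answerer's code: it swaps strapplyc for regexec/regmatches, the second record is a made-up example without an Action part, and perl = TRUE is needed because base R's default engine does not support the (?!...) lookahead.

```r
records <- c(
  "\t\t\tsome words here\t\t\tDefect: some more words here Action: more words",
  "\t\t\tsome words here\t\t\tDefect: An issue with no action part"
)

# Unrolled pattern: non-A characters, or an A not followed by "ction".
pattern <- "Defect:([^A]*(?:A(?!ction)[^A]*)*)"
m <- regmatches(records, regexec(pattern, records, perl = TRUE))

# Element 2 of each match is the first capture group.
captured <- vapply(m, function(x) x[2], character(1))
trimws(captured)
# "some more words here"  "An issue with no action part"
```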
The alternation operator | is the regex operator with the lowest precedence. That means the regex Defect:(.*?)Action|$ is actually an alternation between Defect:(.*?)Action and $. Since an empty string is a valid match for $, your regex returns the empty string whenever the first alternative fails.
To solve that, you should combine the regexes Defect:(.*?)Action and Defect:(.*?)$ with an OR:
Defect:(.*?)Action|Defect:(.*?)$
Or you can enclose Action|$ in a group as Sebastian Proske said in the comments:
Defect:(.*?)(?:Action|$)
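
The precedence issue can be reproduced in base R with a short sketch. The two input strings are simplified, hypothetical records, and regexpr/regexec with perl = TRUE stand in for strapplyc here.

```r
records <- c("Defect: some more words here Action: more words",
             "Defect: no action part here")

# Ungrouped: "Defect:(.*?)Action" OR "$". On the second record the first
# alternative fails, so the empty match at the end of the string wins.
bad <- regmatches(records[2],
                  regexpr("Defect:(.*?)Action|$", records[2], perl = TRUE))
bad
# ""

# Grouped: "Defect:", lazy capture, then either "Action" or end of string.
good <- regmatches(records,
                   regexec("Defect:(.*?)(?:Action|$)", records, perl = TRUE))
captured <- trimws(vapply(good, function(x) x[2], character(1)))
captured
# "some more words here"  "no action part here"
```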

Matching emails format using R

I was having an intro class at datacamp.com and ran into a problem.
Goal: find the right emails using grep. "Right emails" are defined as containing an "@" and ending with ".edu".
Emails vector:
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
"invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")
I was thinking of
grep("@*\\.edu$",emails)
and it gave me
[1] 1 4 5
because I thought "*" matches "multiple characters". Later I found that it doesn't work like that.
It turned out the right code is
grep("@.*\\.edu$",emails)
I googled some documentation and only have a vague sense of how to get the correct answer. Can someone explain how exactly R matches the right emails? Thanks a bunch!!
You've already been advised that the asterisk quantifier wasn't giving you the specificity you needed, so use the "+" quantifier, which forces at least one such match. I decided to make the problem more complex by adding some addresses with duplicated at-signs:
emails <- c("john.doe@@ivyleague.edu", "education@@world.gov", "dalai.lama@peace.org",
"invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")
grep( "^[^@]+@[^@]+\\.edu$", emails)
#[1] 5
That uses the regex character-class syntax, where items inside the flanking square brackets are taken as literals, except when there is an initial caret ("^"), in which case the class is negated, i.e. in this case it matches any character except "@". This will also exclude situations where the at-sign is the first character. Thanks to KonradRudolph, who pointed out that adding "^" as the first character in the pattern (which anchors the match at the start of the string) prevents items with an initial "@@" from being matched.
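
To see how the quantifiers differ, the three patterns can be run side by side. This sketch assumes the addresses use the usual @ separator; the last input vector is made up to show the edge cases.

```r
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
            "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")

# "@*" means zero or more "@", so "invalid.edu" slips through.
grep("@*\\.edu$", emails)              # 1 4 5

# "@.*" means one "@" followed by any characters.
grep("@.*\\.edu$", emails)             # 1 5

# Stricter: one or more non-"@", one "@", one or more non-"@", then ".edu".
strict <- "^[^@]+@[^@]+\\.edu$"
grep(strict, emails)                   # 1 5

# Doubled or leading at-signs are rejected by the strict pattern.
grep(strict, c("a@@b.edu", "@b.edu"))  # integer(0)
```

The key point is that a quantifier applies to the item directly before it: in "@*" that item is the at-sign itself, while in "@.*" it is the dot, which stands for any character.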

Resources