Regex match ONLY the first occurrence of a character with R [duplicate] - r

When analyzing logs, you often need to find all lines in a log file that contain a specific word. The problem is that a regular search in Notepad++ returns the same line multiple times if the word appears at several positions on it. To avoid that, I switch to regex search and use the following expression:
(.*\K)(text)
Here .* greedily consumes as much of the line as it can, \K discards everything matched so far, and then (text) matches the last occurrence of text on the line.
This method looks ugly and is not very fast. Is there any better way to do it?

To match only the first occurrence, you have to consume as few characters as possible from the beginning of the line, discard them from the match, and then match the text you are looking for.
The following regex does exactly that.
Regex: (^.*?)\Ktrue
Here, true is the text being searched for.
Dummy Input
Log date 12/12/2015
Sr No desc amount status
1 true $10000 true
2 true $10000 false
3 true $10000 true
4 true $10000 false
5 true $10000 true
Regex101 Demo
Notepad++ Demo
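Since the question is tagged r, the same idea can be sketched in R. This is a minimal illustration, assuming perl = TRUE (needed for \K) and a made-up <true> marker so the matched occurrence is visible; the variable names are hypothetical.

```r
# Sample log lines taken from the dummy input above.
log_lines <- c(
  "Log date 12/12/2015",
  "1 true $10000 true",
  "2 true $10000 false"
)

# Lazy ^.*? stops at the first "true"; \K drops the prefix from the match,
# so only that first "true" is replaced by the marker.
first_marked <- sub("^.*?\\Ktrue", "<true>", log_lines, perl = TRUE)

# Greedy ^.* runs to the last "true" on the line instead.
last_marked <- sub("^.*\\Ktrue", "<true>", log_lines, perl = TRUE)

print(first_marked)
print(last_marked)
```

Lines without the word are returned unchanged, and each matching line is touched exactly once.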

Related

Regex validation of two patterns multiple comma-separated doesn't work

In an ASP.NET form I am trying to find a pattern allowing multiple comma-separated elements but it doesn't seem to work. I need to allow either 4 letters and 2 digits (JEAN01) or 2 digits and 4 letters (01JEAN) any number of times: JEAN01,JEAN02,03JEAN,JEAN04
My first attempt (see https://regex101.com/r/E4JZVv/1) is:
/^([a-z-A-Z]{4}[0-9]{2}|[a-z-A-Z]{2}[0-9]{4})(,[a-z-A-Z]{4}[0-9]{2}|[a-z-A-Z]{2}[0-9]{4})*$
My second attempt (https://regex101.com/r/HU9cOS/1) is
((^|[,])[a-z-A-Z]{4}[0-9]{2}|[a-z-A-Z]{2}[0-9]{4})+
The first accepts only a couple of elements.
That should do the trick:
^((\w{4}\d{2}|\d{2}\w{4})(,|$))+
See: https://regex101.com/r/8hoNXl/1
Explanation:
^: Asserts position at start of the string
\w{4}\d{2}: 4 word characters and 2 digits (note that \w also matches digits and underscores, not only letters)
\d{2}\w{4}: 2 digits and 4 word characters
(,|$): Either , or the end of the string
+: Repeat between once and an unlimited number of times (greedy)
The setup of your first try is good, with two exceptions:
You want to match 4 letters and 2 digits or 2 digits and 4 letters, but the second alternative, [a-z-A-Z]{2}[0-9]{4}, matches 2 letters followed by 4 digits.
In the repeated group in the second part, the comma belongs only to the first alternative. Put the alternation in its own group so that the comma is prepended to both alternatives.
Using a case insensitive match:
(?i)^(?:[a-z]{4}[0-9]{2}|[0-9]{2}[a-z]{4})(?:,(?:[a-z]{4}[0-9]{2}|[0-9]{2}[a-z]{4}))*$
See a regex demo.
For the second pattern that you tried, the assertion for the start of the string should not be in an alternation, or else you can have partial matches as it will also allow a comma.
You can repeat the alternation, preceded each time by either a comma or the start of the string. If you don't want to allow the string to start with a comma, you can, for example, use a negative lookahead (?!,) right after asserting the start of the string.
(?i)^(?!,)((?:,|^)(?:[a-z]{4}[0-9]{2}|[0-9]{2}[a-z]{4}))+$
See another regex demo.
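The case-insensitive pattern above can be checked quickly outside ASP.NET. The sketch below uses R purely as a test harness (an assumption, since the question targets ASP.NET); item, pattern, inputs and is_valid are hypothetical names, and the pattern is assembled from one item sub-pattern so the alternation is written only once.

```r
# One element: 4 letters + 2 digits, or 2 digits + 4 letters.
item <- "(?:[a-z]{4}[0-9]{2}|[0-9]{2}[a-z]{4})"

# One element, then any number of ",element" repetitions, anchored at both ends.
pattern <- paste0("(?i)^", item, "(?:,", item, ")*$")

inputs <- c(
  "JEAN01,JEAN02,03JEAN,JEAN04",  # valid list
  "jean01",                       # valid, case-insensitive
  "JEAN01,",                      # trailing comma
  ",JEAN01",                      # leading comma
  "JE0001",                       # 2 letters + 4 digits
  "JEAN01;JEAN02"                 # wrong separator
)

is_valid <- grepl(pattern, inputs, perl = TRUE)
print(data.frame(inputs, is_valid))
```

Only the first two inputs are accepted; the comma is allowed solely between two complete elements.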

Crazy unexpected behavior of grepl

What explains the following very unexpected behavior of grepl?
I am using grepl for basic string matching here, and I think the default behavior as illustrated below is dangerous.
> grepl('a','a')
[1] TRUE
> grepl('a ()','a ()')
[1] TRUE
> grepl('a (b)','a (b)')
[1] FALSE
Adding fixed=TRUE fixes it. The documentation says:
pattern: character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector.
The average user should get from the message above that the default usage of grepl is NOT string matching but regular expression matching, which is not super clear. Someone unaware of regular expressions may not realize the dangers of leaving fixed to its default value. I think a warning should be added about this.
Posting here mainly to alert the community about this behavior. It took me a couple of hours of debugging to narrow down the issue I was experiencing in my Shiny app to this function. I would have never thought that grepl could be dangerous like this.
pattern: a ()
Breakdown: an a followed by a space and then an empty capture group, i.e. nothing.
The a and the space match the first part of the string, so the WHOLE pattern can be found in the string. This results in TRUE.
second part:
pattern: a (b)
Breakdown: as a regex this literally means a b, i.e. a, then a space, then b. The parentheses around b only capture it; they are not matched as characters.
The string is a (b). Since b does not directly follow the space (a ( comes first), the whole pattern cannot be found in the string, hence FALSE.
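The breakdown above can be verified with a few calls. This is a small sketch using only base R; the variable names are made up for illustration.

```r
x <- "a (b)"

# As a regex, "a (b)" means: a, space, b (with b captured).
as_regex <- grepl("a (b)", x)                  # no "a b" inside "a (b)"
regex_on_plain <- grepl("a (b)", "a b")        # but it does match "a b"

# Two ways to match the parentheses literally.
as_fixed <- grepl("a (b)", x, fixed = TRUE)    # treat the pattern as plain text
as_escaped <- grepl("a \\(b\\)", x)            # escape the metacharacters

print(c(as_regex, regex_on_plain, as_fixed, as_escaped))
```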

R sanitize pattern for regular expression detection [duplicate]

Here is a sample string:
x<-"My name is XYZ, I'm from ABc, working at PQR"
I want to detect "," in the string, using two forms:
> str_detect(x,",")
[1] TRUE
>
> str_detect(x,fixed(","))
[1] TRUE
Both return the same result. So what is the difference between these two?
For this, we may need a different example that involves a regex metacharacter. Here, we are checking whether the upper-case letter 'R' is at the end ($) of the string. With the fixed wrapper, it checks whether the string contains the literal characters R$; without it, $ is evaluated as the end of the string, because it is a metacharacter.
str_detect(x,fixed("R$"))
#[1] FALSE
str_detect(x,"R$")
#[1] TRUE
The , is not a metacharacter, so it is evaluated as a literal , with or without fixed. In general, if we are specifically looking for a literal string, use the fixed wrapper; it should be faster as well.
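The same distinction shows up with other metacharacters. The sketch below uses base R's grepl, where the fixed = TRUE argument plays a role comparable to stringr's fixed() wrapper; the prices vector is invented for illustration.

```r
prices <- c("3.50", "350", "3,50")

# As a regex, "." matches any single character, so every element matches.
as_regex <- grepl(".", prices)

# As a fixed string, "." only matches a literal dot.
as_literal <- grepl(".", prices, fixed = TRUE)

print(rbind(as_regex, as_literal))
```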

Regex to find words from list, when specific words not appear 3 words before

I want to find all matches of specific words from a list, but only when certain other words do not appear within the 3 words before them.
For example:
Find every occurrence of the words "good|best|better" in the text, provided that none of the words "no|not|none" appears within the 3 words before it.
I tried something like this:
(?<!\sno|\snot(\s|\s\w\s|\s\w\s\w\s))(\bgood\b|\bbest\b|\bbetter\b)
But it's not working.
You may be able to use this PCRE regex in R with perl=TRUE option:
\b(?:not?|none)(?:\s+\S+){0,2}\s+(good|best|better)\b(*SKIP)(*F)|\b(?:good|best|better)\b
RegEx Demo
In your R code use:
gregexpr("\\b(?:not?|none)(?:\\s+\\S+){0,2}\\s+(good|best|better)\\b(*SKIP)(*F)|\\b(?:good|best|better)\\b", mystr, perl=TRUE)
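In PCRE, the verbs (*SKIP)(*F) are used to skip and fail a match that we don't want. Here, the left alternative matches a negation word followed by at most two other words and then one of the target words, and (*SKIP)(*F) discards it; the right alternative then matches the remaining target words.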
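To see which words survive, the gregexpr call can be combined with regmatches. This is a minimal sketch with an invented test sentence; mystr, pattern and hits are illustrative names.

```r
mystr <- "good food, not very good service, but the best view"

pattern <- paste0(
  # negation word, up to two words in between, then a target word: skip it
  "\\b(?:not?|none)(?:\\s+\\S+){0,2}\\s+(good|best|better)\\b(*SKIP)(*F)",
  # otherwise, match any remaining target word
  "|\\b(?:good|best|better)\\b"
)

# regmatches() extracts the matched substrings from the gregexpr() result.
hits <- regmatches(mystr, gregexpr(pattern, mystr, perl = TRUE))[[1]]
print(hits)
```

The second good is dropped because not appears two words before it, while the first good and best are kept.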
If we only wanted to reject lines containing no and its variants, we could start with a simple expression such as:
^(?!.*no).*times.*$
Then we would add word boundaries where necessary and expand that to:
^(?!.*\bno\b|.*\bnot\b|.*\bnone\b).*times.*$
Demo 1
and finally we would add our desired words using:
^(?!.*\bno\b|.*\bnot\b|.*\bnone\b)(?=.*\bgood\b|.*\bbest\b|.*\bbetter\b).*times.*$
Demo 2
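The final lookahead expression can be tried on a few lines with grepl. This is only a sketch with made-up sample lines; note that this variant tests the whole line rather than a three-word window, and that it only accepts lines containing the literal word times.

```r
pattern <- paste0(
  "^(?!.*\\bno\\b|.*\\bnot\\b|.*\\bnone\\b)",        # no negation word anywhere
  "(?=.*\\bgood\\b|.*\\bbest\\b|.*\\bbetter\\b)",    # at least one target word
  ".*times.*$"                                       # the line must contain "times"
)

sample_lines <- c(
  "good times",
  "not good times",
  "bad times",
  "the best of times"
)

keep <- grepl(pattern, sample_lines, perl = TRUE)
print(sample_lines[keep])
```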

How to search for strings with parentheses in R

Using R, I have a long list of keywords that I'm searching for in a dataset. One of the keywords needs to have parentheses around it in order to be included.
I've been attempting to replace the parenthesis in the keywords list with \\ then the parentheses, but have not been successful. If there is a way to modify the grepl() function to recognize them, that would also be helpful. Here is an example of what I'm trying to accomplish:
patterns<-c("dog","cat","(fish)")
data<-c("brown dog","black bear","salmon (fish)","red fish")
patterns2<- paste(patterns,collapse="|")
grepl(patterns2,data)
[1] TRUE FALSE TRUE TRUE
I would like salmon (fish) to give TRUE, and red fish to give FALSE.
Thank you!
As noted by @joran in the comments, the pattern should look like so:
patterns<-c("dog","cat","\\(fish\\)")
The double backslashes (\\) tell R to read the parentheses literally when searching for the pattern.
Easiest way to achieve this if you don't want to make the change manually:
patterns <- gsub("([()])","\\\\\\1", patterns)
Which will result in:
[1] "dog" "cat" "\\(fish\\)"
If you're not very familiar with regular expressions, what happens here is that it looks for any one character within the square brackets. The round brackets around that tell it to save whatever it finds that matches the contents. Then, the first four backslashes in the second argument produce a single literal backslash in the result (R's string parser reduces them to two, and the replacement syntax reduces those to one, which print() then displays as \\), and the \\1 tells it to add whatever it saved from the first argument, i.e., either ( or ).
Another option is to forget regex and use grepl with fixed = TRUE
rowSums(sapply(patterns, grepl, data, fixed = TRUE)) > 0
# [1] TRUE FALSE TRUE FALSE
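If the keyword list may contain other metacharacters besides parentheses, the escaping step can be generalized. The escape_regex helper below is hypothetical (it is not part of base R), and it assumes the keywords are meant to be matched literally.

```r
# Hypothetical helper: put a backslash in front of every regex metacharacter.
escape_regex <- function(x) {
  gsub("([\\\\.|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

patterns <- c("dog", "cat", "(fish)")
data <- c("brown dog", "black bear", "salmon (fish)", "red fish")

# Escape each keyword, then join them into one alternation.
combined <- paste(escape_regex(patterns), collapse = "|")
result <- grepl(combined, data)

print(combined)
print(result)
```

This keeps the single grepl call over one combined pattern, while fixed = TRUE requires one call per keyword.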

Resources