Full word match with grepl

I would like to have TRUE FALSE instead of the following. Any suggestion?
testLines <- c("buried","medium-buried")
grepl('\\<buried\\>',testLines)
[1] TRUE TRUE

Perhaps this?
testLines <- c("buried","medium-buried")
grepl('^buried$',testLines)
#[1] TRUE FALSE
My understanding (and regex is not my forte) is that ^ denotes the start of the string and $ the end.
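The word-boundary version returns TRUE for both elements because \\< and \\> mark transitions between word and non-word characters. The hyphen in "medium-buried" is a non-word character, so "buried" still counts as a whole word there. The sketch below compares the options. The third test string and the hyphen-aware lookaround pattern are illustrative assumptions, not part of the original question.

```r
# Third element added purely for illustration
testLines <- c("buried", "medium-buried", "deeply buried rock")

# Word boundaries: hyphens and spaces both delimit a word
word_match <- grepl("\\<buried\\>", testLines)
# TRUE TRUE TRUE

# Anchors: the whole string must be exactly "buried"
exact_match <- grepl("^buried$", testLines)
# TRUE FALSE FALSE

# Assumed variant: treat "-" as part of a word, so "buried" may appear
# inside a longer string but not as part of a hyphenated compound
hyphen_aware <- grepl("(?<![\\w-])buried(?![\\w-])", testLines, perl = TRUE)
# TRUE FALSE TRUE
```

If only whole-string equality is needed, testLines == "buried" avoids regular expressions altogether.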

Related

Match multiple patterns in a string in R

I'm looking for a way in R to match multiple patterns in a string. For example:
test <- c("abcdefg", "defabc", "abcghdeft" , "abegrabc", "ghdefab", "dabce rdeft", "dedef abceg")
I want to look for 2 exact patterns "abc" and "def" in the string, and return TRUE if both of them are in the string regardless of position and order. So based on that the result would be:
TRUE TRUE TRUE FALSE FALSE TRUE TRUE
I can't seem to find the AND operator in regex like the OR operator |, I've tried other combinations like abc.*def|def.*abc but they didn't work.
Thank you in advance for your help!

We can use grepl
grepl("abc.*def|def.*abc", test)
#[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE
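The alternation has to spell out every possible order, which gets unwieldy with more than two patterns. Two alternatives that scale better are sketched here: one lookahead per pattern, or one grepl call per pattern combined with &. The variable names are made up for illustration, and fixed = TRUE assumes the patterns are literal strings rather than regular expressions.

```r
test <- c("abcdefg", "defabc", "abcghdeft", "abegrabc", "ghdefab",
          "dabce rdeft", "dedef abceg")

# One lookahead per required pattern; order in the string does not matter
both_lookahead <- grepl("^(?=.*abc)(?=.*def)", test, perl = TRUE)
# TRUE TRUE TRUE FALSE FALSE TRUE TRUE

# One grepl call per pattern, combined element-wise with a logical AND
patterns <- c("abc", "def")
both_combined <- Reduce(`&`, lapply(patterns, grepl, x = test, fixed = TRUE))
# TRUE TRUE TRUE FALSE FALSE TRUE TRUE
```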

Grep when pattern is found exactly n times [duplicate]

I am looking for a regex expression to capture strings where the pattern is repeated n times. Here is an example with expected output.
# find sentences with 2 occurrences of the word "is"
z = c("this is what it is and is not", "this is not", "this is it it is")
regex_function(z)
[1] FALSE FALSE TRUE
I have gotten this far:
grepl("(.*\\bis\\b.*){2}",z)
[1] TRUE FALSE TRUE
But this will return TRUE if there are at least 2 matches. How can I force it to look for strings with exactly 2 occurrences?

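To find the strings that contain the word "is" exactly twice, you can remove every "is" with gsub and compare the string lengths with nchar. Each removed "is" shortens the string by 2 characters, so a difference of 4 means two occurrences.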
nchar(z) - nchar(gsub("(\\bis\\b)", "", z)) == 4
#[1] FALSE FALSE TRUE
or count the hits of gregexpr like:
sapply(gregexpr("\\bis\\b", z), function(x) sum(x>0)) == 2
#[1] FALSE FALSE TRUE
or with a regex in grepl
grepl("^(?!(.*\\bis\\b){3})(.*\\bis\\b){2}.*$", z, perl=TRUE)
#[1] FALSE FALSE TRUE

This is an option that works but needs 2 regex calls. I am still looking for a compact regex call which correctly solves this issue.
grepl("(.*\\bis\\b.*){2}",z) & !grepl("(.*\\bis\\b.*){3}",z)
Basically, this adds a second grepl call for n+1 occurrences and keeps only the strings that satisfy the first call but not the second.

library(stringi)
stri_count_regex(z, "\\bis\\b") == 2L
# [1] FALSE FALSE TRUE

with stringr:
library(stringr)
library(magrittr)
regex_function <- function(str) {
  # Extract every whole-word "is", then test whether exactly two were found
  str_extract_all(str, "\\bis\\b") %>%
    lapply(function(x) length(x) == 2) %>%
    unlist()
}
regex_function(z)
[1] FALSE FALSE TRUE
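The counting approaches above generalize to any pattern and any n. Here is a minimal base-R sketch, where has_exactly_n is a hypothetical helper name rather than an existing function:

```r
# Hypothetical helper: TRUE where `pattern` occurs exactly `n` times
# in each element of `x` (non-overlapping matches, PCRE syntax)
has_exactly_n <- function(x, pattern, n) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  lengths(hits) == n
}

z <- c("this is what it is and is not", "this is not", "this is it it is")

# Number of whole-word "is" matches per string
counts <- lengths(regmatches(z, gregexpr("\\bis\\b", z, perl = TRUE)))
# 3 1 2

exactly_two <- has_exactly_n(z, "\\bis\\b", 2)
# FALSE FALSE TRUE
```

Counting matches directly avoids having to encode "exactly n" inside the regular expression itself.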

Find strings start with alpha except multi specific characters

The goal is to find the 1st and 2nd elements, which start with a letter other than "H" or "G".
DD <- c("DD2123","QD2123","HC12231","HCEF","GC2123","1232","--",NA)
grepl("^[[:alpha:]][^H|G]",DD)
This matches every element that starts with a letter, including those starting with "H" and "G".
How can I achieve this?
grepl("^D|Q",DD) is not what I need, actual data has other alpha patterns.

You may use a PCRE regex like ^(?![HG])\p{L} or ^(?![HG])[[:alpha:]]:
> DD <- c("DD2123","QD2123","HC12231","HCEF","GC2123","1232","--",NA)
> grepl("^(?![HG])\\p{L}",DD, perl=TRUE)
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Or ^[^\\P{L}HG]:
> grepl("^[^\\P{L}HG]",DD, perl=TRUE)
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
The ^(?![HG])[[:alpha:]] pattern matches
^ - start of string
(?![HG]) - no H or G is allowed immediately to the right of the current location
[[:alpha:]] or \p{L} - a letter.
The ^[^\P{L}HG] matches the start of a string (^) and then matches any char other than a non-letter, H and G.

Just as an alternative. Wiktor's solution is more general and practical.
grepl("^[a-zA-FI-Z][0-9a-zA-Z]+$",DD)
You could define a class of values that are allowed to appear in the first place and then define the following positions.
If everything else is allowed to follow, simply use:
grepl("^[a-zA-FI-Z]",DD)
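The original attempt fails because [^H|G] is a second bracket expression, so it tests the second character, and inside brackets | is a literal character rather than "or". If the excluded letters vary, the lookahead pattern can be built from a vector. This is a sketch: the excluded vector and variable names are illustrative, and the substr variant only covers ASCII letters.

```r
DD <- c("DD2123", "QD2123", "HC12231", "HCEF", "GC2123", "1232", "--", NA)

# Assumed to be plain letters with no special meaning inside [...]
excluded <- c("H", "G")

# Build "^(?![HG])[[:alpha:]]" from the vector of excluded letters
pattern <- paste0("^(?![", paste(excluded, collapse = ""), "])[[:alpha:]]")
by_regex <- grepl(pattern, DD, perl = TRUE)
# TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE

# Regex-free variant: compare the first character against an allowed set
first_char <- substr(DD, 1, 1)
by_substr <- first_char %in% setdiff(c(LETTERS, letters), excluded)
# TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```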

Regex optional character preceded by Negative Lookback in R

Suppose I have a set of strings:
test <- c('MTB', 'NOT MTB', 'TB', 'NOT TB')
I want to write a regular expression to match either 'TB' or 'MTB' (e.g., the expression "M?TB") strictly when it is NOT preceded by the phrase "NOT " (space included).
My intended result, therefore, is
TRUE FALSE TRUE FALSE
So far I have tried a couple of variations of
grepl("(?<!NOT )M?TB", test, perl = T)
TRUE TRUE TRUE FALSE
Unsuccessfully. As you can see, the phrase 'NOT MTB' meets the criteria for my regular expression.
It seems that including the optional character "M?" makes R treat the negative lookbehind as optional too. I have looked into using parentheses to group the patterns, such as
grepl("(?<!NOT )(M?TB)", test, perl = TRUE)
TRUE TRUE TRUE FALSE
which also fails to exclude the phrase 'NOT MTB'. Admittedly, I am unclear on how parentheses work in regex, or even what "grouping" means in this context. I have had trouble finding a question about how to group, require, and "optionalize" different parts of a regex so that I can match a phrase that begins with an optional character and is not preceded by a given phrase. What is the proper way to write an expression like this?

We could use the start (^) and end ($) to match only those words
grepl("^M?TB$", test)
#[1] TRUE FALSE TRUE FALSE
If there are other strings, as @Wiktor Stribiżew mentioned in the comments, then one option would be
test1 <- c(test, "THIS MTB")
!grepl("\\bNOT M?TB\\b", test1) & grepl("\\bM?TB\\b", test1)
#[1] TRUE FALSE TRUE FALSE TRUE

test = c("MTB", "NOT MTB", "TB", "NOT TB", "THIS TB", "THIS NOT TB")
grepl("\\b(?<!NOT\\s)M?TB\\b",test,perl = TRUE)
[1] TRUE FALSE TRUE FALSE TRUE FALSE
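To see why the original lookbehind appeared to be ignored, it helps to look at what the pattern actually matched. In 'NOT MTB' the engine rejects the match starting at "M", then retries one position later, skips the optional "M", and matches "TB", which is preceded by "M" rather than by "NOT ". Adding \\b stops the match from starting in the middle of a word. The sketch below uses only base R; the variable names are illustrative.

```r
test <- c("MTB", "NOT MTB", "TB", "NOT TB")

# Inspect what the original pattern matched in each string
m <- regexpr("(?<!NOT )M?TB", test, perl = TRUE)
matched <- rep(NA_character_, length(test))
matched[m > 0] <- regmatches(test, m)
matched
# "MTB" "TB" "TB" NA   (in "NOT MTB" only the trailing "TB" matched)

# With a word boundary the match can no longer start after the "M"
fixed_result <- grepl("(?<!NOT )\\bM?TB\\b", test, perl = TRUE)
# TRUE FALSE TRUE FALSE
```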

It is not entirely clear what the question intends, but here is some code to try depending on what is wanted.
Added: Poster clarified that #2 and #3 are along the lines looked for.
1) This can be done without regular expressions like this:
test %in% c("TB", "MTB")
## [1] TRUE FALSE TRUE FALSE
2) If the problem is not about exact matches then return matches to M?TB which do not also match NOT M?TB:
grepl("M?TB", test) & !grepl("NOT M?TB",test)
## [1] TRUE FALSE TRUE FALSE
3) Another alternative is to replace NOT M?TB with X and then grepl on M?TB:
grepl("M?TB", sub("NOT M?TB", "X", test))
## [1] TRUE FALSE TRUE FALSE

Partial string matching with grep and regular expressions

I have a vector of three-character strings, and I'm trying to write a command that will find which members of the vector have a particular letter as the second character.
As an example, say I have this vector of 3-letter strings...
example = c("AWA","WOO","AZW","WWP")
I can use grepl and glob2rx to find strings with W as the first or last character.
> grepl(glob2rx("W*"),example)
[1] FALSE TRUE FALSE TRUE
> grepl(glob2rx("*W"),example)
[1] FALSE FALSE TRUE FALSE
However, I don't get the right result when I try using it with glob2rx("*W*")
> grepl(glob2rx("*W*"),example)
[1] TRUE TRUE TRUE TRUE
I am sure my understanding of regular expressions is lacking, however this seems like a pretty straightforward problem and I can't seem to find the solution. I'd really love some assistance!
For future reference, I'd also really like to know if I could extend this to the case where I have longer strings. Say I have strings that are 5 characters long, could I use grepl in such a way to return strings where W is the third character?

I would have thought that this was the regex way:
> grepl("^.W",example)
[1] TRUE FALSE FALSE TRUE
If you wanted a particular position that is prespecified then:
> grepl("^.{1}W",example)
[1] TRUE FALSE FALSE TRUE
This would allow programmatic calculation:
pos= 2
n=pos-1
grepl(paste0("^.{",n,"}W"),example)
[1] TRUE FALSE FALSE TRUE

If you have 3-character strings and need to check the second character, you could just test the appropriate substring instead of using regular expressions:
example = c("AWA","WOO","AZW","WWP")
substr(example, 2, 2) == "W"
# [1] TRUE FALSE FALSE TRUE
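The glob "*W*" means "W anywhere", which is why it matched every element. In glob syntax "?" stands for exactly one character, so "?W*" expresses "W in second position". For the follow-up about longer strings, both the substr and the regex approach extend directly. In this sketch, has_char_at is a hypothetical helper and the 5-character strings are made up.

```r
example <- c("AWA", "WOO", "AZW", "WWP")

# Glob version: one arbitrary character, then W, then anything
second_is_w <- grepl(glob2rx("?W*"), example)
# TRUE FALSE FALSE TRUE

# Hypothetical helper: is character `ch` at position `pos` of each string?
has_char_at <- function(x, ch, pos) substr(x, pos, pos) == ch

five <- c("AAWAA", "WAAAA", "ABWCD")  # made-up 5-character strings
third_by_substr <- has_char_at(five, "W", 3)
# TRUE FALSE TRUE

# Regex equivalent: skip exactly two characters, then require W
third_by_regex <- grepl("^.{2}W", five)
# TRUE FALSE TRUE
```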
