Partial string matching with grep and regular expressions

I have a vector of three character strings, and I'm trying to write a command that will find which members of the vector have a particular letter as the second character.
As an example, say I have this vector of 3-letter strings:
example = c("AWA","WOO","AZW","WWP")
I can use grepl and glob2rx to find strings with W as the first or last character.
> grepl(glob2rx("W*"),example)
[1] FALSE TRUE FALSE TRUE
> grepl(glob2rx("*W"),example)
[1] FALSE FALSE TRUE FALSE
However, I don't get the right result when I try using it with glob2rx("*W*"):
> grepl(glob2rx("*W*"),example)
[1] TRUE TRUE TRUE TRUE
I am sure my understanding of regular expressions is lacking, however this seems like a pretty straightforward problem and I can't seem to find the solution. I'd really love some assistance!
For future reference, I'd also really like to know if I could extend this to the case where I have longer strings. Say I have strings that are 5 characters long, could I use grepl in such a way to return strings where W is the third character?

I would have thought that this was the regex way:
> grepl("^.W",example)
[1] TRUE FALSE FALSE TRUE
If you wanted a particular position that is prespecified then:
> grepl("^.{1}W",example)
[1] TRUE FALSE FALSE TRUE
This would allow programmatic calculation:
pos= 2
n=pos-1
grepl(paste0("^.{",n,"}W"),example)
[1] TRUE FALSE FALSE TRUE
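As a side note on why glob2rx("*W*") returned TRUE everywhere: in a glob, "*" may stand for zero or more characters, so "*W*" only asks for a W somewhere in the string. The glob wildcard for exactly one character is "?". The sketch below shows the glob version and wraps the paste0() idea from above in a small function for longer strings. The name has_char_at and the vector five are made up for illustration; they are not part of base R or of the question.

```r
example <- c("AWA", "WOO", "AZW", "WWP")

# Glob version: "?" is exactly one character, "*" is any run of characters,
# so "?W*" reads as: one character, then W, then anything.
second_is_w <- grepl(utils::glob2rx("?W*"), example)
second_is_w
# [1]  TRUE FALSE FALSE  TRUE

# Hypothetical helper: TRUE where `ch` sits at 1-based position `pos`.
# `ch` is pasted into the regex unchanged, so it is assumed to be a plain
# letter or digit rather than a regex metacharacter.
has_char_at <- function(x, ch, pos) {
  grepl(paste0("^.{", pos - 1, "}", ch), x)
}

# Made-up 5-character strings: is W the third character?
five <- c("ABWDE", "WABCD", "ABCDW", "XYWZW")
third_is_w <- has_char_at(five, "W", 3)
third_is_w
# [1]  TRUE FALSE FALSE  TRUE
```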

If you have 3-character strings and need to check the second character, you could just test the appropriate substring instead of using regular expressions:
example = c("AWA","WOO","AZW","WWP")
substr(example, 2, 2) == "W"
# [1] TRUE FALSE FALSE TRUE
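The same idea extends to the 5-character case from the question by changing the start and stop positions. One difference worth knowing is how missing values come out: the comparison keeps NA, while grepl turns it into FALSE. A small sketch with made-up data (the vector five is an assumption, not from the question):

```r
# Made-up 5-character strings plus a missing value
five <- c("ABWDE", "WABCD", "ABCDW", NA)

# Third character via substr: NA in, NA out
by_substr <- substr(five, 3, 3) == "W"
by_substr
# [1]  TRUE FALSE FALSE    NA

# Third character via regex: grepl reports FALSE for NA
by_regex <- grepl("^.{2}W", five)
by_regex
# [1]  TRUE FALSE FALSE FALSE
```

Which behaviour is preferable depends on whether missing values should stay visible in the result.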

Related

RegEx R: match strings with same character exact number of times anywhere in string

I guess this is straightforward, but I can't manage to get my RegEx right and haven't found an exact example yet...
How can I match only strings that contain a specific character an exact number of times (not necessarily consecutively)?
Let's look at a data set
terms<-c('breeding'
,'foraging'
,'prey'
,'breeding_season'
,'foraging_time'
,'seabird_ecology'
,'annual_reproductive_success'
,'sea_surface_temperature'
,'mean_chick_weight')
I want to select all strings that have exactly two underscores ('_')
I can try:
stringr::str_detect(terms, "_{2}")
no luck
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or
terms[stringr::str_detect(terms, "._.{2,}")]
gives
[1] "breeding_season" "foraging_time" "seabird_ecology"
[4] "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
but I want only
[1] "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
Thank you RegEx masters

What you're missing is the negated character class. You want to match something that is not an underscore and then an underscore. It's generally like [^X].
/^[^_]*_[^_]*_[^_]*$/
# or
/^(?:[^_]*_){2}[^_]*$/
That is:
beginning of string
anything not an underscore
underscore
anything not an underscore
underscore
anything not an underscore
end of string
This is just one way to do it.
HTH
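The patterns above are written in /.../ notation, so for R the slashes are dropped and the pattern is passed as a string. A minimal sketch using the terms vector from the question; perl = TRUE is used for the second pattern on the assumption that the non-capturing group (?:...) is safest under the PCRE engine:

```r
terms <- c("breeding", "foraging", "prey", "breeding_season",
           "foraging_time", "seabird_ecology",
           "annual_reproductive_success", "sea_surface_temperature",
           "mean_chick_weight")

# Spelled-out form: non-underscores, "_", non-underscores, "_", non-underscores
two_a <- terms[grepl("^[^_]*_[^_]*_[^_]*$", terms)]

# Repeated-group form: ("non-underscores then _") twice, then non-underscores
two_b <- terms[grepl("^(?:[^_]*_){2}[^_]*$", terms, perl = TRUE)]

two_a
# [1] "annual_reproductive_success" "sea_surface_temperature"
# [3] "mean_chick_weight"
```

In the second form the {2} is the only thing that needs changing to ask for a different number of underscores.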

Here's why your efforts were failing:
"_{2}" matches two underscores next to each other
"._.{2,}" matches any character, then "_", then two or more further characters of any kind
The simplest pattern with grepl or str_detect (but not quite what you asked for, since it also accepts three or more underscores) would be:
grepl("_.*_", terms)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
The ".*" allows an arbitrary number of characters, including further underscores, so use a negated character class before, after and in the middle to get exactly two underscores in the whole string:
grepl("^[^_]*_[^_]*_[^_]*$", terms)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Added the "^" and the "$" to indicate the beginning and end. The "^" operator is different in meaning inside and outside of character-class brackets.

Here is a solution using the stringr package. It is a little more straightforward because it avoids a complex regular expression: the str_count function counts the number of matches in each string.
terms<-c('breeding'
,'foraging'
,'prey'
,'breeding_season'
,'foraging_time'
,'seabird_ecology'
,'annual_reproductive_success'
,'sea_surface_temperature'
,'mean_chick_weight')
library(stringr)
terms[str_count(terms, "_") == 2]
#> [1] "annual_reproductive_success" "sea_surface_temperature"
#> [3] "mean_chick_weight"
Created on 2021-03-08 by the reprex package (v0.3.0)
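If adding a package is not an option, the same counting idea can be sketched in base R. The helper name count_char is made up for this example; it relies only on gregexpr, regmatches and lengths, and treats the character literally via fixed = TRUE:

```r
terms <- c("breeding", "foraging", "prey", "breeding_season",
           "foraging_time", "seabird_ecology",
           "annual_reproductive_success", "sea_surface_temperature",
           "mean_chick_weight")

# Hypothetical helper: number of literal occurrences of `ch` in each string.
# gregexpr finds every match, regmatches extracts them, lengths counts them.
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

counts <- count_char(terms, "_")
counts
# [1] 0 0 0 1 1 1 2 2 2

two_underscores <- terms[counts == 2]
```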

Using grep with value = TRUE :
grep('^\\w+_\\w+_\\w+$', terms, value = TRUE)
#[1] "annual_reproductive_success" "sea_surface_temperature"
#[3] "mean_chick_weight"
In stringr you can use str_subset with the same regex:
stringr::str_subset(terms, '^\\w+_\\w+_\\w+$')
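One caveat with this pattern: \w also matches the underscore itself, so it accepts strings with more than two underscores. It gives the wanted result on the data above only because no term has three. A short sketch with made-up strings (the vector x is an assumption for illustration), next to a variant that uses a negated character class instead:

```r
# Made-up strings with two and three underscores
x <- c("a_b_c", "a_b_c_d")

# \w includes "_", so the three-underscore string also matches
with_w <- grepl("^\\w+_\\w+_\\w+$", x)
with_w
# [1] TRUE TRUE

# [^_] excludes "_", so exactly two underscores are required
with_negated <- grepl("^[^_]+_[^_]+_[^_]+$", x)
with_negated
# [1]  TRUE FALSE
```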

Match multiple patterns in a string in R

I'm looking for a way in R to match multiple patterns in a string. For example:
test <- c("abcdefg", "defabc", "abcghdeft" , "abegrabc", "ghdefab", "dabce rdeft", "dedef abceg")
I want to look for 2 exact patterns "abc" and "def" in the string, and return TRUE if both of them are in the string regardless of position and order. So based on that the result would be:
TRUE TRUE TRUE FALSE FALSE TRUE TRUE
I can't seem to find the AND operator in regex like the OR operator |, I've tried other combinations like abc.*def|def.*abc but they didn't work.
Thank you in advance for your help!

We can use grepl
grepl("abc.*def|def.*abc", test)
#[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE
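The alternation works for two patterns but grows quickly with more of them, since every ordering has to be listed. Two other ways to express AND are sketched below: combining separate grepl calls with &, and stacking lookaheads under perl = TRUE. The helper contains_all is a made-up name, not an existing function.

```r
test <- c("abcdefg", "defabc", "abcghdeft", "abegrabc",
          "ghdefab", "dabce rdeft", "dedef abceg")

# AND as two independent literal searches
both_amp <- grepl("abc", test, fixed = TRUE) & grepl("def", test, fixed = TRUE)

# AND as stacked lookaheads: from the start, "abc" must be found somewhere
# ahead and "def" must be found somewhere ahead
both_look <- grepl("^(?=.*abc)(?=.*def)", test, perl = TRUE)

# Hypothetical helper for any number of literal patterns
contains_all <- function(x, patterns) {
  Reduce(`&`, lapply(patterns, grepl, x = x, fixed = TRUE))
}
both_fun <- contains_all(test, c("abc", "def"))

both_amp
# [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
```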

Full word match with grepl

I would like to have TRUE FALSE instead of the following. Any suggestion?
testLines <- c("buried","medium-buried")
grepl('\\<buried\\>',testLines)
[1] TRUE TRUE

Perhaps this?
testLines <- c("buried","medium-buried")
grepl('^buried$',testLines)
#[1] TRUE FALSE
My understanding (and regex is not my forte) is that ^ denotes the start of the string and $ the end.
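That understanding is right. The reason \\< and \\> matched "medium-buried" is that a hyphen is not a word character, so a word boundary sits between "-" and "b". Anchoring with ^ and $ requires the whole string to be "buried"; if the word may also appear inside a longer sentence, one option is to rule out hyphens on either side with lookarounds. A sketch under that assumption, with two extra made-up test strings:

```r
testLines <- c("buried", "medium-buried", "deeply buried site", "buried-deep")

# Whole string must equal the word; no regex needed
exact <- testLines == "buried"
exact
# [1]  TRUE FALSE FALSE FALSE

# Word may sit inside a sentence, but must not touch a word character
# or a hyphen on either side
standalone <- grepl("(?<![\\w-])buried(?![\\w-])", testLines, perl = TRUE)
standalone
# [1]  TRUE FALSE  TRUE FALSE
```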

Find strings start with alpha except multi specific characters

The goal is to find the 1st and 2nd elements, which start with a letter other than "H" or "G".
DD <- c("DD2123","QD2123","HC12231","HCEF","GC2123","1232","--",NA)
grepl("^[[:alpha:]][^H|G]",DD)
This matched every element that starts with a letter, including those starting with "H" and "G".
How can I achieve this?
grepl("^D|Q",DD) is not what I need, because the actual data has other letter patterns.

You may use a PCRE regex like ^(?![HG])\p{L} or ^(?![HG])[[:alpha:]]:
> DD <- c("DD2123","QD2123","HC12231","HCEF","GC2123","1232","--",NA)
> grepl("^(?![HG])\\p{L}",DD, perl=TRUE)
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Or ^[^\\P{L}HG]:
> grepl("^[^\\P{L}HG]",DD, perl=TRUE)
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
The ^(?![HG])[[:alpha:]] pattern matches
^ - start of string
(?![HG]) - no H or G is allowed immediately to the right of the current location
[[:alpha:]] or \p{L} - a letter.
The ^[^\P{L}HG] matches the start of a string (^) and then matches any char other than a non-letter, H and G.
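For comparison, it may help to see why the pattern in the question did not work: [^H|G] is a second character class, so it constrains the second character rather than the first, and inside brackets the | is a literal pipe, not an OR. The sketch below runs the original pattern next to the lookahead version, plus a plain base-R combination of two tests that avoids perl = TRUE:

```r
DD <- c("DD2123", "QD2123", "HC12231", "HCEF", "GC2123", "1232", "--", NA)

# Original attempt: first char is a letter, SECOND char is not H, | or G
original <- grepl("^[[:alpha:]][^H|G]", DD)
original
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

# Lookahead version: first char is a letter and is not H or G
lookahead <- grepl("^(?![HG])[[:alpha:]]", DD, perl = TRUE)

# Same condition as two separate tests combined with & and !
two_tests <- grepl("^[[:alpha:]]", DD) & !grepl("^[HG]", DD)

lookahead
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```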

Just as an alternative. Wiktor's solution is more general and practical.
grepl("^[a-zA-FI-Z][0-9a-zA-Z]+$",DD)
You could define a class of values that are allowed to appear in the first place and then define the following positions.
If everything else is allowed to follow, simply use:
grepl("^[a-zA-FI-Z]",DD)

Regex optional character preceded by Negative Lookback in R

Suppose I have a set of strings:
test <- c('MTB', 'NOT MTB', 'TB', 'NOT TB')
I want to write a regular expression to match either 'TB' or 'MTB' (e.g., the expression "M?TB") strictly when it is NOT preceded by the phrase "NOT " (space included).
My intended result, therefore, is
TRUE FALSE TRUE FALSE
So far I have tried a couple of variations of
grepl("(?<!NOT )M?TB", test, perl = T)
TRUE TRUE TRUE FALSE
This is unsuccessful: as you can see, the string 'NOT MTB' still matches my regular expression.
It seems that including the optional character "M?" makes R treat the negative lookbehind as optional too. I have been looking into using parentheses to group the patterns, such as
grepl("(?<!NOT )(M?TB)", test, perl = TRUE)
TRUE TRUE TRUE FALSE
which also fails to exclude the string 'NOT MTB'. Admittedly, I am unclear on how parentheses work in regex or even what "grouping" means in this context. I have had trouble finding a question about how to group, require, and "optionalize" different parts of a regex so that I can match a phrase beginning with an optional character and preceded by a negative lookbehind. What is the proper way to write an expression like this?

We could use the start (^) and end ($) to match only those words
grepl("^M?TB$", test)
#[1] TRUE FALSE TRUE FALSE
If there are other strings, as @Wiktor Stribiżew mentioned in the comments, then one option would be
test1 <- c(test, "THIS MTB")
!grepl("\\bNOT M?TB\\b", test1) & grepl("\\bM?TB\\b", test1)
#[1] TRUE FALSE TRUE FALSE TRUE

test = c("MTB", "NOT MTB", "TB", "NOT TB", "THIS TB", "THIS NOT TB")
grepl("\\b(?<!NOT\\s)M?TB\\b",test,perl = TRUE)
[1] TRUE FALSE TRUE FALSE TRUE FALSE
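To see why the pattern in the question matched 'NOT MTB', it helps to look at where the match starts. The lookbehind is not optional; the engine simply skips the M, and at the T the four characters to the left are "OT M" rather than "NOT ", so the lookbehind is satisfied with "M?" matching nothing. The \\b in the pattern above closes that gap, because there is no word boundary between M and T. A small sketch using the vector from the question:

```r
test <- c("MTB", "NOT MTB", "TB", "NOT TB")

# Where does the original pattern match? (-1 means no match)
m <- regexpr("(?<!NOT )M?TB", test, perl = TRUE)
starts <- as.integer(m)
starts
# [1]  1  6  1 -1

# The matched text: for "NOT MTB" it is only "TB", starting at position 6
matched <- regmatches(test, m)
matched
# [1] "MTB" "TB"  "TB"

# Requiring a word boundary before M?TB stops the engine from skipping the M
fixed <- grepl("(?<!NOT )\\bM?TB", test, perl = TRUE)
fixed
# [1]  TRUE FALSE  TRUE FALSE
```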

It is not entirely clear what the question intends, but here is some code to try depending on what is wanted.
Added: Poster clarified that #2 and #3 are along the lines looked for.
1) This can be done without regular expressions like this:
test %in% c("TB", "MTB")
## [1] TRUE FALSE TRUE FALSE
2) If the problem is not about exact matches then return matches to M?TB which do not also match NOT M?TB:
grepl("M?TB", test) & !grepl("NOT M?TB",test)
## [1] TRUE FALSE TRUE FALSE
3) Another alternative is to replace NOT M?TB with X and then grepl on M?TB:
grepl("M?TB", sub("NOT M?TB", "X", test))
## [1] TRUE FALSE TRUE FALSE
