Find strings that start with a letter, except for several specific letters

The goal is to find the 1st and 2nd elements, i.e. those that start with a letter other than "H" or "G".
DD <- c("DD2123","QD2123","HC12231","HCEF","GC2123","1232","--",NA)
grepl("^[[:alpha:]][^H|G]",DD)
This matched every string that starts with a letter, including those starting with "H" and "G".
How can I achieve this?
grepl("^D|Q",DD) is not what I need; the actual data has other letter patterns.

You may use a PCRE regex like ^(?![HG])\p{L} or ^(?![HG])[[:alpha:]]:
> DD <- c("DD2123","QD2123","HC12231","HCEF","GC2123","1232","--",NA)
> grepl("^(?![HG])\\p{L}",DD, perl=TRUE)
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Or ^[^\\P{L}HG]:
> grepl("^[^\\P{L}HG]",DD, perl=TRUE)
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
The ^(?![HG])[[:alpha:]] pattern matches
^ - start of string
(?![HG]) - no H or G is allowed immediately to the right of the current location
[[:alpha:]] or \p{L} - a letter.
The ^[^\P{L}HG] matches the start of a string (^) and then matches any char other than a non-letter, H and G.
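To see why the original attempt also matched the "H" and "G" strings, it can help to run both patterns side by side. This is a minimal sketch using the DD vector from the question; the names attempt and fixed are only illustrative.

```r
DD <- c("DD2123", "QD2123", "HC12231", "HCEF", "GC2123", "1232", "--", NA)

# Original attempt: [^H|G] is a second bracket expression, so it constrains
# the SECOND character (anything except "H", "|" or "G"), not the first one.
attempt <- grepl("^[[:alpha:]][^H|G]", DD)

# Lookahead version: (?![HG]) is checked at the start of the string, so it
# constrains the FIRST character. Lookaheads require perl = TRUE.
fixed <- grepl("^(?![HG])[[:alpha:]]", DD, perl = TRUE)

DD[attempt]  # "DD2123" "QD2123" "HC12231" "HCEF" "GC2123"
DD[fixed]    # "DD2123" "QD2123"
```

Note that "|" inside a bracket expression is a literal pipe character, not an alternation operator.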

Just as an alternative. Wiktor's solution is more general and practical.
grepl("^[a-zA-FI-Z][0-9a-zA-Z]+$",DD)
You could define a class of values that are allowed to appear in the first place and then define the following positions.
If everything else is allowed to follow, simply use:
grepl("^[a-zA-FI-Z]",DD)
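If the list of excluded first letters changes often, the allowed class could be built programmatically instead of being typed by hand. The following is only a sketch: the excluded vector is a hypothetical input, and it assumes ASCII letters only.

```r
DD <- c("DD2123", "QD2123", "HC12231", "HCEF", "GC2123", "1232", "--", NA)

excluded <- c("H", "G")  # hypothetical: first letters to reject

# All ASCII letters minus the excluded ones, pasted into one bracket expression
allowed <- setdiff(c(LETTERS, letters), excluded)
pattern <- paste0("^[", paste(allowed, collapse = ""), "]")

result <- grepl(pattern, DD)
result  # TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

Like ^[a-zA-FI-Z], this still accepts lowercase "h" and "g" as a first character.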

Related

RegEx R: match strings with same character exact number of times anywhere in string

I guess this is straightforward, but I can't manage to get my RegEx right and haven't found an exact example yet...
How can I match only strings that contain a specific character an exact number of times (not necessarily consecutively)?
Let's look at a data set
terms<-c('breeding'
,'foraging'
,'prey'
,'breeding_season'
,'foraging_time'
,'seabird_ecology'
,'annual_reproductive_success'
,'sea_surface_temperature'
,'mean_chick_weight')
I want to select all strings that have exactly two underscores ('_')
I can try:
stringr::str_detect(terms, "_{2}")
no luck
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or
terms[stringr::str_detect(terms, "._.{2,}")]
gives
[1] "breeding_season" "foraging_time" "seabird_ecology"
[4] "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
but I want only
[1] "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
Thank you RegEx masters

What you're missing is the negated character class. You want to match something that is not an underscore and then an underscore. It's generally like [^X].
/^[^_]*_[^_]*_[^_]*$/
# or
/^(?:[^_]*_){2}[^_]*$/
That is:
beginning of string
anything not an underscore
underscore
anything not an underscore
underscore
anything not an underscore
end of string
This is just one way to do it.
HTH
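The patterns above are written with slash delimiters; in R the same idea might look like the sketch below. The variable n and the use of sprintf to build the pattern are my own additions, and perl = TRUE is used so that the (?:...) group is handled by PCRE.

```r
terms <- c("breeding", "foraging", "prey", "breeding_season", "foraging_time",
           "seabird_ecology", "annual_reproductive_success",
           "sea_surface_temperature", "mean_chick_weight")

n <- 2  # required number of underscores (assumed parameter)

# n repetitions of "non-underscores followed by one underscore",
# then only non-underscores until the end of the string
pattern <- sprintf("^(?:[^_]*_){%d}[^_]*$", n)

hits <- terms[grepl(pattern, terms, perl = TRUE)]
hits  # "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
```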

Here's why your efforts were failing:
"_{2}" #matches two underscores beside each other
"._.{2,}" #matches any char, then "_", then two or more of any char
Simplest (but not quite what you asked for) with grepl or str_detect would be:
grepl("_.*_", terms)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
The "*" allows an arbitrary number of characters, so this matches two or more underscores. To get exactly 2 underscores, use a negated character class with "*" before, between and after them:
grepl("^[^_]*_[^_]*_[^_]*$", terms)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Added the "^" and the "$" to indicate the beginning and end. The "^" operator is different in meaning inside and outside of character-class brackets.
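On the terms data both patterns give the same result, because no string has more than two underscores. A small sketch with a made-up vector x shows where "at least two" and "exactly two" differ:

```r
x <- c("a_b", "a_b_c", "a_b_c_d")  # hypothetical inputs with 1, 2 and 3 underscores

at_least_two <- grepl("_.*_", x)
exactly_two  <- grepl("^[^_]*_[^_]*_[^_]*$", x)

at_least_two  # FALSE TRUE TRUE
exactly_two   # FALSE TRUE FALSE
```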

Solution using the stringr package. It is a little more straightforward and avoids a complex regular expression. Here we use the str_count function to count the number of matches in each string.
terms<-c('breeding'
,'foraging'
,'prey'
,'breeding_season'
,'foraging_time'
,'seabird_ecology'
,'annual_reproductive_success'
,'sea_surface_temperature'
,'mean_chick_weight')
library(stringr)
terms[str_count(terms, "_") == 2]
#> [1] "annual_reproductive_success" "sea_surface_temperature"
#> [3] "mean_chick_weight"
Created on 2021-03-08 by the reprex package (v0.3.0)
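If adding stringr is not an option, the same count can be obtained in base R. This is a sketch of one possible approach, not part of the original answer: delete everything that is not an underscore and measure what is left.

```r
terms <- c("breeding", "foraging", "prey", "breeding_season", "foraging_time",
           "seabird_ecology", "annual_reproductive_success",
           "sea_surface_temperature", "mean_chick_weight")

# Remove all non-underscore characters, then count the remaining ones
n_underscores <- nchar(gsub("[^_]", "", terms))

n_underscores               # 0 0 0 1 1 1 2 2 2
terms[n_underscores == 2]   # the three strings with exactly two underscores
```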

Using grep with value = TRUE :
grep('^\\w+_\\w+_\\w+$', terms, value = TRUE)
#[1] "annual_reproductive_success" "sea_surface_temperature"
#[3] "mean_chick_weight"
In stringr you can use str_subset with same regex :
stringr::str_subset(terms, '^\\w+_\\w+_\\w+$')
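One caveat with this pattern: \w also matches the underscore itself, so it accepts strings with more than two underscores. It gives the desired result on terms only because no such string is present. A short sketch with a made-up vector x, using [[:alnum:]] as one possible stricter replacement:

```r
x <- c("a_b_c", "a_b_c_d")  # hypothetical inputs with 2 and 3 underscores

# \w includes "_", so the first \w+ can swallow "a_b" in the second string
with_w <- grepl("^\\w+_\\w+_\\w+$", x)

# [[:alnum:]] excludes "_", so exactly two underscores are required
with_alnum <- grepl("^[[:alnum:]]+_[[:alnum:]]+_[[:alnum:]]+$", x)

with_w      # TRUE TRUE
with_alnum  # TRUE FALSE
```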

Full word match with grepl

I would like to have TRUE FALSE instead of the following. Any suggestion?
testLines <- c("buried","medium-buried")
grepl('\\<buried\\>',testLines)
[1] TRUE TRUE

Perhaps this?
testLines <- c("buried","medium-buried")
grepl('^buried$',testLines)
#[1] TRUE FALSE
My understanding (and regex is not my forte) is that ^ denotes the start of the string and $ the end.
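The word-boundary version matches "medium-buried" because the hyphen is not a word character, so a boundary exists right before "buried". The anchored version works when the whole string must equal the word. If the word may also appear inside a longer line, one possible alternative (my assumption about the intent, not something stated in the question) is to require whitespace or a string edge on both sides, using PCRE lookarounds:

```r
testLines <- c("buried", "medium-buried", "deeply buried here")  # third element added for illustration

word_boundary <- grepl("\\<buried\\>", testLines)
whole_string  <- grepl("^buried$", testLines)
# (?<!\S) = not preceded by a non-space; (?!\S) = not followed by a non-space
space_delim   <- grepl("(?<!\\S)buried(?!\\S)", testLines, perl = TRUE)

word_boundary  # TRUE TRUE TRUE
whole_string   # TRUE FALSE FALSE
space_delim    # TRUE FALSE TRUE
```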

Regex optional character preceded by Negative Lookback in R

Suppose I have a set of strings:
test <- c('MTB', 'NOT MTB', 'TB', 'NOT TB')
I want to write a regular expression to match either 'TB' or 'MTB' (e.g., the expression "M?TB") strictly when it is NOT preceded by the phrase "NOT " (space included).
My intended result, therefore, is
TRUE FALSE TRUE FALSE
So far I have tried a couple of variations of
grepl("(?<!NOT )M?TB", test, perl = T)
TRUE TRUE TRUE FALSE
Unsuccessfully. As you can see, the phrase 'NOT MTB' still matches my regular expression.
It seems like including the optional character "M?" makes R think that the negative lookbehind is also optional. I have been looking into using parentheses to group the patterns, such as
grepl("(?<!NOT )(M?TB)", test, perl = TRUE)
TRUE TRUE TRUE FALSE
Which also fails to exclude the phrase 'NOT MTB'. Admittedly, I am unclear on how parentheses work in regex or even what "grouping" means in this context. I have had trouble finding a question related to how to group, require, and "optionalize" different parts of a regex so that I can match a phrase beginning with an optional character and preceded by a negative lookbehind. What is the proper way to write an expression like this?

We could use the start (^) and end ($) to match only those words
grepl("^M?TB$", test)
#[1] TRUE FALSE TRUE FALSE
If there are other strings, as @Wiktor Stribiżew mentioned in the comments, then one option would be
test1 <- c(test, "THIS MTB")
!grepl("\\bNOT M?TB\\b", test1) & grepl("\\bM?TB\\b", test1)
#[1] TRUE FALSE TRUE FALSE TRUE

test = c("MTB", "NOT MTB", "TB", "NOT TB", "THIS TB", "THIS NOT TB")
grepl("\\b(?<!NOT\\s)M?TB\\b",test,perl = TRUE)
[1] TRUE FALSE TRUE FALSE TRUE FALSE
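The lookbehind in the question is not being treated as optional. The engine fails at the "M" of 'NOT MTB', moves one character forward, lets M? match nothing, and then finds "TB" preceded by "OT M", which is not "NOT ". Looking at the match positions makes this visible. A minimal sketch (the vector x mirrors the question's test data):

```r
x <- c("MTB", "NOT MTB", "TB", "NOT TB")

# Start position of the first match in each string (-1 means no match)
pos_plain <- as.integer(regexpr("(?<!NOT )M?TB", x, perl = TRUE))

# \b forces the match to start at a word boundary, so it can no longer
# start in the middle of "MTB"
pos_bound <- as.integer(regexpr("\\b(?<!NOT\\s)M?TB\\b", x, perl = TRUE))

pos_plain  # 1  6  1 -1   (in "NOT MTB" the match is the "TB" at position 6)
pos_bound  # 1 -1  1 -1
```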

There is some question on what the question intends but here is some code to try depending on what is wanted.
Added: Poster clarified that #2 and #3 are along the lines looked for.
1) This can be done without regular expressions like this:
test %in% c("TB", "MTB")
## [1] TRUE FALSE TRUE FALSE
2) If the problem is not about exact matches then return matches to M?TB which do not also match NOT M?TB:
grepl("M?TB", test) & !grepl("NOT M?TB",test)
## [1] TRUE FALSE TRUE FALSE
3) Another alternative is to replace NOT M?TB with X and then grepl on M?TB:
grepl("M?TB", sub("NOT M?TB", "X", test))
## [1] TRUE FALSE TRUE FALSE

How to match strings matching [a-z_]* but with non repetitive symbol "_"

I would like to match strings:
That are composed of [a-z_];
That don't start or end with "_";
That don't include a repeated "_" symbol.
So for example the expected matching results would be :
"x"; "x_x" > TRUE
"_x"; "x_"; "_x_"; "x__x" > FALSE
My problem is that I can exclude strings ending or starting with "_", but my regexp also excludes strings of length 1.
grepl("^[a-z][a-z_]*[a-z]$", my.string)
My second issue is that I don't know how to negate a match for doubled characters, grepl("(_)\\1", my.string), or how to integrate it with the first part of my regexp.
If possible I would like to do this with perl = FALSE.

You need to use the following TRE regex:
grepl("^[a-z]+(?:_[a-z]+)*$", my.string)
Details:
^ - start of string
[a-z]+ - one or more ASCII letters
(?:_[a-z]+)* - zero or more sequences (*) of
_ - an underscore
[a-z]+ - one or more ASCII letters
$ - end of string.
See R demo:
my.string <- c("x" ,"x_x", "x_x_x_x_x","_x", "x_", "_x_", "x__x")
grepl("^[a-z]+(?:_[a-z]+)*$", my.string)
## => [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE

This seems to identify the items correctly:
dat <- c("x" ,"x_x","_x", "x_", "_x_", "x__x")
grep("^_|__|_$", dat, invert=TRUE)
[1] 1 2
So try:
!grepl("^_|__|_$", dat)
[1] TRUE TRUE FALSE FALSE FALSE FALSE
Just uses negation and a pattern with three conditions separated by the regex logical OR operator "|".
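This negated pattern only rejects the three underscore violations; it does not check that the remaining characters are in [a-z]. If that matters, it could be combined with a character check. A sketch with a made-up vector dat2:

```r
dat2 <- c("x", "x_x", "x1", "x__x")  # hypothetical inputs; "x1" contains a digit

underscores_ok <- !grepl("^_|__|_$", dat2)
chars_ok       <- grepl("^[a-z_]+$", dat2)

underscores_ok             # TRUE TRUE TRUE FALSE
underscores_ok & chars_ok  # TRUE TRUE FALSE FALSE
```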

Another regex, which uses grouping with ( ) and the * quantifier for repetition.
myString <- c("x_", "x", "_x", "x_x_x", "x_x", "x__x")
grepl("^([a-z]_)*[a-z]$", myString)
[1] FALSE TRUE FALSE TRUE TRUE FALSE
So ^([a-z]_)* matches 0 or more pairs of "[a-z]_" at the beginning of the string and [a-z]$ assures that the final character is a lower case alphabetical character.
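Because each [a-z] here matches exactly one character, this pattern only accepts single letters between underscores. If longer segments should be allowed, a "+" can be added to both classes. A sketch with a made-up vector s:

```r
s <- c("x_x", "ab", "ab_cd", "x__x")  # hypothetical inputs with multi-letter segments

single_letter <- grepl("^([a-z]_)*[a-z]$", s)
multi_letter  <- grepl("^([a-z]+_)*[a-z]+$", s)

single_letter  # TRUE FALSE FALSE FALSE
multi_letter   # TRUE TRUE TRUE FALSE
```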

Partial string matching with grep and regular expressions

I have a vector of three character strings, and I'm trying to write a command that will find which members of the vector have a particular letter as the second character.
As an example, say I have this vector of 3-letter strings...
example = c("AWA","WOO","AZW","WWP")
I can use grepl and glob2rx to find strings with W as the first or last character.
> grepl(glob2rx("W*"),example)
[1] FALSE TRUE FALSE TRUE
> grepl(glob2rx("*W"),example)
[1] FALSE FALSE TRUE FALSE
However, I don't get the right result when I try using it with glob2rx("*W*")
> grepl(glob2rx("*W*"),example)
[1] TRUE TRUE TRUE TRUE
I am sure my understanding of regular expressions is lacking, however this seems like a pretty straightforward problem and I can't seem to find the solution. I'd really love some assistance!
For future reference, I'd also really like to know whether I could extend this to longer strings. Say I have strings that are 5 characters long: could I use grepl in such a way as to return strings where W is the third character?

I would have thought that this was the regex way:
> grepl("^.W",example)
[1] TRUE FALSE FALSE TRUE
If you wanted a particular position that is prespecified then:
> grepl("^.{1}W",example)
[1] TRUE FALSE FALSE TRUE
This would allow programmatic calculation:
pos= 2
n=pos-1
grepl(paste0("^.{",n,"}W"),example)
[1] TRUE FALSE FALSE TRUE
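The glob attempt in the question fails because "*" stands for any number of characters, including none, so "*W*" only asks for a W somewhere in the string. In glob syntax a single character is written "?", so the glob route can also work. A minimal sketch:

```r
example <- c("AWA", "WOO", "AZW", "WWP")

# "?" = exactly one character, "*" = anything (including nothing)
second_is_w <- grepl(utils::glob2rx("?W*"), example)
second_is_w  # TRUE FALSE FALSE TRUE
```

By the same logic, "??W*" would test the third character.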

If you have 3-character strings and need to check the second character, you could just test the appropriate substring instead of using regular expressions:
example = c("AWA","WOO","AZW","WWP")
substr(example, 2, 2) == "W"
# [1] TRUE FALSE FALSE TRUE
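For the follow-up about 5-character strings with W in the third position, both the substring test and the anchored regex extend naturally. A sketch with made-up data (example5 and pos are illustrative names):

```r
example5 <- c("ABWDE", "WBCDE", "ABCDW")  # hypothetical 5-character strings
pos <- 3                                  # position to test

by_substr <- substr(example5, pos, pos) == "W"
by_regex  <- grepl(paste0("^.{", pos - 1, "}W"), example5)

by_substr                       # TRUE FALSE FALSE
identical(by_substr, by_regex)  # TRUE
```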
