How to match strings matching [a-z_]* but with non repetitive symbol "_" - r

I would like to match strings:
That are composed of characters in [a-z_];
That don't start or end with "_";
That don't contain consecutive "_" symbols.
So for example the expected matching results would be :
"x"; "x_x" > TRUE
"_x"; "x_"; "_x_"; "x__x" > FALSE
My problem is that I can exclude strings starting or ending with "_", but my regexp then also excludes strings of length 1:
grepl("^[a-z][a-z_]*[a-z]$", my.string)
My second issue is that I don't know how to negate a match for doubled characters, grepl("(_)\\1", my.string), or how to integrate that with the first part of my regexp.
If possible I would like to do this with perl = FALSE.

You can use the following TRE regex:
grepl("^[a-z]+(?:_[a-z]+)*$", my.string)
Details:
^ - start of string
[a-z]+ - one or more ASCII letters
(?:_[a-z]+)* - zero or more sequences (*) of
_ - an underscore
[a-z]+ - one or more ASCII letters
$ - end of string.
See R demo:
my.string <- c("x" ,"x_x", "x_x_x_x_x","_x", "x_", "_x_", "x__x")
grepl("^[a-z]+(?:_[a-z]+)*$", my.string)
## => [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE
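A minimal sketch of wrapping this pattern in a reusable check. The helper name is_valid_name and the extra test strings are hypothetical additions, not part of the question; the pattern is the one above and runs with the default engine (perl = FALSE).

```r
# Hypothetical helper: TRUE when x consists of runs of lowercase letters
# separated by single underscores (no leading, trailing or doubled "_").
is_valid_name <- function(x) {
  grepl("^[a-z]+(?:_[a-z]+)*$", x)  # default TRE engine, perl = FALSE
}

# Extra test strings (assumptions): a multi-letter name, a digit, an empty string.
candidates <- c("x", "ab_cd", "x_", "x__x", "x1", "")
res <- is_valid_name(candidates)
print(res)
## => [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
```

Because the first [a-z]+ requires at least one letter, the empty string is rejected as well.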

This seems to identify the items correctly:
dat <- c("x" ,"x_x","_x", "x_", "_x_", "x__x")
grep("^_|__|_$", dat, invert=TRUE)
[1] 1 2
So try:
!grepl("^_|__|_$", dat)
[1] TRUE TRUE FALSE FALSE FALSE FALSE
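This just uses negation and a pattern with three alternatives separated by the regex alternation operator "|". Note that it only checks the underscore rules; it does not verify that the remaining characters are in [a-z].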
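If the input may contain characters outside [a-z_], the negated test can be combined with a character-set test. This is a sketch only; the extra strings "x1" and "x-1" are made up for illustration.

```r
# Test data: the last two strings (assumptions) contain characters outside [a-z_].
dat <- c("x", "x_x", "_x", "x__x", "x1", "x-1")

# Underscore rules only: this also accepts "x1" and "x-1".
underscore_ok <- !grepl("^_|__|_$", dat)

# Additionally require that every character is a lowercase letter or "_".
charset_ok <- grepl("^[a-z_]+$", dat)

both_ok <- underscore_ok & charset_ok
print(underscore_ok)
## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE
print(both_ok)
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
```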

Another regex, which uses grouping with ( ) and the * quantifier for repetition:
myString <- c("x_", "x", "_x", "x_x_x", "x_x", "x__x")
grepl("^([a-z]_)*[a-z]$", myString)
[1] FALSE TRUE FALSE TRUE TRUE FALSE
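So ^([a-z]_)* matches zero or more "[a-z]_" pairs at the beginning of the string, and [a-z]$ ensures that the final character is a lowercase letter. As written, this allows only a single letter between underscores; use ^([a-z]+_)*[a-z]+$ if runs of several letters are allowed.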
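A short sketch comparing the single-letter pattern with a variant that puts + after each [a-z]. The variant and the test strings "ab_cd" and "abc" are additions for illustration, not part of the original answer.

```r
# "ab_cd" and "abc" (assumptions) have more than one letter per segment.
myString <- c("x", "x_x", "ab_cd", "abc")

# Exactly one letter per segment.
single <- grepl("^([a-z]_)*[a-z]$", myString)

# One or more letters per segment.
multi <- grepl("^([a-z]+_)*[a-z]+$", myString)

print(single)
## [1]  TRUE  TRUE FALSE FALSE
print(multi)
## [1] TRUE TRUE TRUE TRUE
```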

Related

RegEx R: match strings with same character exact number of times anywhere in string

I guess this is straightforward, but I can't manage to get my RegEx right and haven't found an exact example yet...
How can I match only strings with a specific character an exact number of times (not necessarily repeating!)
Let's look at a data set
terms<-c('breeding'
,'foraging'
,'prey'
,'breeding_season'
,'foraging_time'
,'seabird_ecology'
,'annual_reproductive_success'
,'sea_surface_temperature'
,'mean_chick_weight')
I want to select all strings that have exactly two underscores ('_')
I can try:
stringr::str_detect(terms, "_{2}")
no luck
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or
terms[stringr::str_detect(terms, "._.{2,}")]
gives
[1] "breeding_season" "foraging_time" "seabird_ecology"
[4] "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
but I want only
[1] "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
Thank you RegEx masters
What you're missing is the negated character class. You want to match something that is not an underscore and then an underscore. It's generally like [^X].
/^[^_]*_[^_]*_[^_]*$/
# or
/^(?:[^_]*_){2}[^_]*$/
That is:
beginning of string
anything not an underscore
underscore
anything not an underscore
underscore
anything not an underscore
end of string
This is just one way to do it.
HTH
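The patterns above are written with /.../ delimiters; in R the delimiters are dropped and the pattern is passed as a string. A minimal sketch on a shortened term list (the string "a_b_c_d" is a made-up counterexample with three underscores):

```r
# Shortened term list; "a_b_c_d" (assumption) has three underscores.
terms <- c("breeding", "breeding_season", "sea_surface_temperature",
           "mean_chick_weight", "a_b_c_d")

# Two "non-underscores then underscore" groups, then only non-underscores.
exactly_two <- grepl("^(?:[^_]*_){2}[^_]*$", terms)

print(terms[exactly_two])
## [1] "sea_surface_temperature" "mean_chick_weight"
```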
Here's why your attempts failed:
"_{2}" matches two underscores next to each other
"._.{2,}" matches any character, then "_", then two or more of any character (underscores included)
Simplest (but not quite what you asked for) with grepl or str_detect would be:
grepl("_.*_", terms)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
The ".*" allows any number of arbitrary characters, further underscores included, so use a negated character class before, after and between the underscores, each followed by "*", to get exactly two underscores separated by non-underscores:
grepl("^[^_]*_[^_]*_[^_]*$", terms)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Added the "^" and the "$" to indicate the beginning and end. The "^" operator is different in meaning inside and outside of character-class brackets.
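The same idea can be generalised to any positive count by building the pattern with sprintf. This is a sketch only; the helper name has_n_underscores is hypothetical.

```r
# Hypothetical helper: TRUE when x contains exactly n underscores (n >= 1).
has_n_underscores <- function(x, n) {
  # n repetitions of "underscore followed by non-underscores",
  # preceded by an optional run of non-underscores.
  pattern <- sprintf("^[^_]*(_[^_]*){%d}$", n)
  grepl(pattern, x)
}

terms <- c("prey", "foraging_time", "mean_chick_weight")
one <- has_n_underscores(terms, 1)
two <- has_n_underscores(terms, 2)
print(one)
## [1] FALSE  TRUE FALSE
print(two)
## [1] FALSE FALSE  TRUE
```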
A solution using the stringr package. It is a little more straightforward because it avoids a complex regular expression: the str_count function counts the number of matches in each string.
terms<-c('breeding'
,'foraging'
,'prey'
,'breeding_season'
,'foraging_time'
,'seabird_ecology'
,'annual_reproductive_success'
,'sea_surface_temperature'
,'mean_chick_weight')
library(stringr)
terms[str_count(terms, "_") == 2]
#> [1] "annual_reproductive_success" "sea_surface_temperature"
#> [3] "mean_chick_weight"
Created on 2021-03-08 by the reprex package (v0.3.0)
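If stringr is not available, the same count can be sketched in base R with gregexpr and regmatches. This is an equivalent under the assumption that only a literal "_" is being counted.

```r
terms <- c("prey", "foraging_time", "sea_surface_temperature")

# gregexpr finds every literal "_"; regmatches extracts them;
# lengths() then gives the number of matches per string (0 when none).
n_underscores <- lengths(regmatches(terms, gregexpr("_", terms, fixed = TRUE)))

print(n_underscores)
## [1] 0 1 2
print(terms[n_underscores == 2])
## [1] "sea_surface_temperature"
```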
Using grep with value = TRUE :
grep('^\\w+_\\w+_\\w+$', terms, value = TRUE)
#[1] "annual_reproductive_success" "sea_surface_temperature"
#[3] "mean_chick_weight"
In stringr you can use str_subset with same regex :
stringr::str_subset(terms, '^\\w+_\\w+_\\w+$')
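One caveat with this pattern: \w also matches "_", so it accepts strings with two or more underscores rather than exactly two. It gives the expected result for the data above, but the sketch below (with made-up strings) shows the difference from a class that excludes the underscore.

```r
# Made-up strings with two, three and three (one doubled) underscores.
x <- c("a_b_c", "a_b_c_d", "a__b_c")

# \w includes "_", so extra underscores are absorbed by \w+.
with_w <- grepl("^\\w+_\\w+_\\w+$", x)

# [[:alnum:]] excludes "_", so exactly two underscores are required.
with_alnum <- grepl("^[[:alnum:]]+_[[:alnum:]]+_[[:alnum:]]+$", x)

print(with_w)
## [1] TRUE TRUE TRUE
print(with_alnum)
## [1]  TRUE FALSE FALSE
```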

Find strings start with alpha except multi specific characters

The goal is to find the 1st and 2nd elements, which start with a letter other than "H" or "G".
DD <- c("DD2123","QD2123","HC12231","HCEF","GC2123","1232","--",NA)
grepl("^[[:alpha:]][^H|G]",DD)
This matches every element that starts with a letter, including those starting with "H" and "G".
How can I achieve this ?
grepl("^D|Q",DD) is not what I need, actual data has other alpha patterns.
You may use a PCRE regex like ^(?![HG])\p{L} or ^(?![HG])[[:alpha:]]:
> DD <- c("DD2123","QD2123","HC12231","HCEF","GC2123","1232","--",NA)
> grepl("^(?![HG])\\p{L}",DD, perl=TRUE)
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Or ^[^\\P{L}HG]:
> grepl("^[^\\P{L}HG]",DD, perl=TRUE)
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
The ^(?![HG])[[:alpha:]] pattern matches
^ - start of string
(?![HG]) - no H or G is allowed immediately to the right of the current location
[[:alpha:]] or \p{L} - a letter.
The ^[^\P{L}HG] matches the start of a string (^) and then matches any char other than a non-letter, H and G.
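To see why the original attempt fails, note that [^H|G] is a second character class, so it constrains the second character (anything except "H", "|" or "G") and not the first. A minimal side-by-side sketch:

```r
DD <- c("DD2123", "QD2123", "HC12231", "HCEF", "GC2123", "1232", "--", NA)

# Original attempt: first character is a letter, SECOND is not H, "|" or G.
attempt <- grepl("^[[:alpha:]][^H|G]", DD)

# Lookahead version: the FIRST character is a letter other than H or G.
fixed <- grepl("^(?![HG])[[:alpha:]]", DD, perl = TRUE)

print(attempt)
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
print(fixed)
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

grepl returns FALSE for the NA element in both cases.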
Just as an alternative. Wiktor's solution is more general and practical.
grepl("^[a-zA-FI-Z][0-9a-zA-Z]+$",DD)
You could define a class of values that are allowed to appear in the first place and then define the following positions.
If everything else is allowed to follow, simply use:
grepl("^[a-zA-FI-Z]",DD)

Match elements from a character range n times

Assume I have a string like this:
id = "ce91ffbe-8218-e211-86da-000c29e211a0"
What regex can I write in R that will verify that this string is 36 characters long and only contains letters, numbers, and dashes?
There is nothing in the documentation on how to use a character range (e.g. [0-9A-z-]) with a quantifier (e.g. {36}). The following code is always returning TRUE regardless of the quantifier. I'm sure I'm missing something simple here...
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"
grepl("[0-9A-z-]{36}", id)
#> [1] TRUE
grepl("[0-9A-z-]{34}", id)
#> [1] TRUE
This behavior only starts when I add the check for the numbers 0-9 in the character range.
Could you please try following:
grepl("^[0-9a-zA-Z-]{36}$",id)
OR
grepl("^[[:alnum:]-]{36}$",id)
After running it we will get following output.
grepl("^[0-9a-zA-Z-]{36}$",id)
[1] TRUE
Explanation of the pattern, piece by piece:
grepl(" ## grepl returns TRUE or FALSE depending on whether the regex matches.
^ ## ^ anchors the match at the start of the string.
[[:alnum:]-] ## The character class [[:alnum:]] plus a dash (-) matches any letter, digit or dash.
{36} ## Exactly 36 occurrences of the preceding class.
$", ## $ anchors the match at the end of the string, so the whole value (from ^ to $) must be those 36 characters.
id) ## The value to test.
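A short sketch of why the unanchored calls in the question returned TRUE: without ^ and $, the pattern only needs to match some substring, and a 36-character string contains a run of 34 matching characters. The last line also illustrates, using the PCRE engine, that the A-z range from the question spans the ASCII punctuation between "Z" and "a", including "_".

```r
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"
print(nchar(id))
## [1] 36

# Unanchored: any run of 34 allowed characters inside the string is enough.
unanchored_34 <- grepl("[0-9a-zA-Z-]{34}", id)

# Anchored: the whole string must be exactly that many characters.
anchored_34 <- grepl("^[0-9a-zA-Z-]{34}$", id)
anchored_36 <- grepl("^[0-9a-zA-Z-]{36}$", id)

print(c(unanchored_34, anchored_34, anchored_36))
## [1]  TRUE FALSE  TRUE

# With perl = TRUE, A-z covers "_" (and other punctuation) as well as letters.
underscore_in_range <- grepl("^[A-z]+$", "a_b", perl = TRUE)
print(underscore_in_range)
## [1] TRUE
```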
You want to use:
^[0-9a-z-]{36}$
^ Assert position start of line.
[0-9a-z-] Character set for numbers, letters a to z and dashes -.
{36} Match preceding pattern 36 times.
$ Assert position end of line.
If the string can have other characters before or after the target characters, try
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"
grepl("^[^[:alnum:]-]*[[:alnum:]-]{36}[^[:alnum:]-]*$", id)
#[1] TRUE
grepl("^[^[:alnum:]-]*[[:alnum:]-]{34}[^[:alnum:]-]*$", id)
#[1] FALSE
And this will still work.
id2 <- paste0(":+)!#", id)
grepl("^[^[:alnum:]-]*[[:alnum:]-]{36}[^[:alnum:]-]*$", id2)
#[1] TRUE
grepl("^[^[:alnum:]-]*[[:alnum:]-]{34}[^[:alnum:]-]*$", id2)
#[1] FALSE

R regex extract numbers from string depending of context

s <- c('abc_1_efg', 'efg_2', 'hi2jk_lmn', 'opq')
How can I use a regex to get the numbers that are beside at least one underscore ("_"). In effect I would like to get outputs like this :
> output # The result
[1] 1 2
> output_l # Alternatively
[1] TRUE TRUE FALSE FALSE
We can use regex lookarounds
grep("(?<=_)\\d+", s, perl = TRUE)
grepl("(?<=_)\\d+", s, perl = TRUE)
#[1] TRUE TRUE FALSE FALSE
If you need to get just indices, use grep with a simple TRE regex (no lookarounds are necessary):
> grep("_\\d+", s)
[1] 1 2
To get the numbers themselves, use a PCRE regex with a positive lookbehind with regmatches / gregexpr:
> unlist(regmatches(s, gregexpr("(?<=_)[0-9]+", s, perl=TRUE)))
[1] "1" "2"
Details:
(?<=_) - a positive lookbehind that requires _ to appear immediately to the left of the current position
[0-9]+ - 1+ digits
EDIT: If the digits to the left of _ should also be considered, use 1) "(^|_)\\d|\\d(_|$)" with grep solution and 2) "(?<![^_])\\d+|\\d+(?![^_])" with the number extraction solution.
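A sketch of the two patterns from the EDIT in use. The extra element "3_xyz" is a made-up example with a digit to the left of an underscore. Note that both patterns also treat the start or end of the string like an underscore.

```r
# "3_xyz" (assumption) has its digit on the left of an underscore.
s <- c("abc_1_efg", "efg_2", "hi2jk_lmn", "opq", "3_xyz")

# Indices of strings with a digit next to "_" (or at a string boundary).
idx <- grep("(^|_)\\d|\\d(_|$)", s)
print(idx)
## [1] 1 2 5

# Extract the digit runs that are not preceded, or not followed,
# by a character other than "_".
nums <- unlist(regmatches(s, gregexpr("(?<![^_])\\d+|\\d+(?![^_])", s, perl = TRUE)))
print(nums)
## [1] "1" "2" "3"
```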
Use this regex:
[_]([0-9]){1}
Selecting group 1 gives you a single digit. To capture multi-digit numbers, use
[_]([0-9]+)
Neither pattern matches the last two strings.
You can use this tool : https://regex101.com/
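In base R, "selecting group 1" can be sketched with regexec and regmatches, which return the full match followed by each capture group. The variable names here are illustrative only.

```r
s <- c("abc_1_efg", "efg_2", "hi2jk_lmn", "opq")

# Each list element is c(full match, group 1), or character(0) if no match.
m <- regmatches(s, regexec("_([0-9]+)", s))

# Keep group 1 where it exists, NA otherwise.
digits <- vapply(
  m,
  function(x) if (length(x) >= 2) x[2] else NA_character_,
  character(1)
)
print(digits)
## [1] "1" "2" NA  NA
```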
with stringr:
s <- c('abc_1_efg', 'efg_2', 'hi2jk_lmn', 'opq', 'a_1_b')
library(stringr)
which(!is.na(str_match(s, '_\\d|\\d_')))
# [1] 1 2 5

Check if a string contains at least one numeric character in R [duplicate]

I have tried the following, however, it goes wrong when the string contains any other character, say a space. As you can see below, there is a string called "subway 10", which does contain numeric characters, however, it is reported as false because of the space.
My string may contain any other character, but if it contains at least a single digit, I would like to get the indices of those strings from the array.
> mywords<- c("harry","met","sally","subway 10","1800Movies","12345")
> numbers <- grepl("^[[:digit:]]+$", mywords)
> letters <- grepl("^[[:alpha:]]+$", mywords)
> both <- grepl("^[[:digit:][:alpha:]]+$", mywords)
>
> mywords[xor((letters | numbers), both)] # letters & numbers mixed
[1] "1800Movies"
using \\d works for me:
grepl("\\d", mywords)
[1] FALSE FALSE FALSE TRUE TRUE TRUE
so does [[:digit:]]:
grepl("[[:digit:]]", mywords)
[1] FALSE FALSE FALSE TRUE TRUE TRUE
As @nrussel mentioned, your patterns test whether the strings contain only digits (or only letters) from the beginning ^ of the string to the end $.
You could also check if the strings contain something else than letters, using ^ inside brackets to negate the letters, but then "something else" is not only digits:
grepl("[^a-zA-Z]", mywords)
[1] FALSE FALSE FALSE TRUE TRUE TRUE
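Since the question asks for the indices of the matching strings, the logical vector can be turned into positions with which(), or grep() can be used directly. A minimal sketch:

```r
mywords <- c("harry", "met", "sally", "subway 10", "1800Movies", "12345")

# Logical vector: does the string contain at least one digit anywhere?
has_digit <- grepl("[[:digit:]]", mywords)

# Positions of the matching strings, two equivalent ways.
idx_which <- which(has_digit)
idx_grep <- grep("[[:digit:]]", mywords)

print(idx_which)
## [1] 4 5 6
print(mywords[has_digit])
## [1] "subway 10"  "1800Movies" "12345"
```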

Resources