Test if String contains Letters and Characters - r

I got this vector:
bar <- c("aaa:something", "111:something", "a1a1:something", "1a:something")
I want to check whether before the colon (:) there are letters and numbers. It can be abitrarily many, but both need to be in there, so the result should be
FALSE, FALSE, TRUE, TRUE
How can I do that?

Assuming the numbers and letters will be in any order you can do :
grepl('([a-zA-Z]+[0-9]+)|([0-9]+[a-zA-Z]+):', bar)
#[1] FALSE FALSE TRUE TRUE

You can combine two grepl like:
grepl("[[:digit:]].*:", bar) & grepl("[[:alpha:]].*:", bar)
#[1] FALSE FALSE TRUE TRUE
#grepl("[0-9].*:", bar) & grepl("[a-zA-Z].*:", bar) #Alternative
To make it in one go you can use a non consuming expression:
grepl("(?=.*[[:digit:]]).*[[:alpha:]].*:", bar, perl=TRUE)
#[1] FALSE FALSE TRUE TRUE

grepl("[a-z]+\\d+.*\\:|\\d+[a-z]+.*\\:", bar, ignore.case = TRUE)

Related

regex a before b with and without whitespaces

I am trying to find all the string which include 'true' when there is no 'act' before it.
An example of possible vector:
vector = c("true","trueact","acttrue","act true","act really true")
What I have so far is this:
grepl(pattern="(?<!act)true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE TRUE TRUE
what I'm hopping for is
[1] TRUE TRUE FALSE FALSE FALSE
May be this works - i.e. to SKIP the match when there is 'act' as preceding substring but match true otherwise
grepl("(act.*true)(*SKIP)(*FAIL)|\\btrue", vector,
perl = TRUE, ignore.case = TRUE)
[1] TRUE TRUE FALSE FALSE FALSE
Here is one way to do so:
grepl(pattern="^(.(?<!act))*?true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE FALSE FALSE
^: start of the string
.: matches any character
(?<=): negative lookbehind
act: matches act
*?: matches .(?<!act) between 0 and unlimited times
true: matches true
see here for the regex demo

Match multiple patterns in a string in R

I'm looking for a way in R to match multiple patterns in a string. For example:
test <- c("abcdefg", "defabc", "abcghdeft" , "abegrabc", "ghdefab", "dabce rdeft", "dedef abceg")
I want to look for 2 exact patterns "abc" and "def" in the string, and return TRUE if both of them are in the string regardless of position and order. So based on that the result would be:
TRUE TRUE TRUE FALSE FALSE TRUE TRUE
I can't seem to find the AND operator in regex like the OR operator |, I've tried other combinations like abc.*def|def.*abc but they didn't work.
Thank you in advance for your help!
We can use grepl
grepl("abc.*def|def.*abc", test)
#[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE

str_detect, validate phone number with specific pattern

I want to check if the numbers I have in the list matches specific formatting (nnn.nnn.nnnn). I am expecting the code to return a boolean (FALSE, TRUE, FALSE, TRUE, FALSE, FALSE) but the last element returns TRUE when I want it to be FALSE.
library(stringr)
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222', '222 333
4444', '2345.234.2345')
str_detect(numbers, "[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}")
If I use:
str_detect(numbers, "[:digit:]{4}\\.[:digit:]{3}\\.[:digit:]{4}")
I get (FALSE, FALSE, FALSE, FALSE, FALSE, TRUE), so I know the pattern for the exact matches work but I am not sure why the first block of code returns TRUE for the last element when there are 4 numbers and not 3 before the '.'
It is because that last value has `345.234.2345' at the end and you don't have a requirement that your pattern start and end with the matching values.
Try this pattern:
"^[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}$"
If you wanted to match with a string possibly inside or one that was separate at the end or beginning by a space it might be more general to use:
"(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)"
Testing:
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222', '222 333
4444', '2345.234.2345', "interior test 456.456.4566 other",
'456.456.4566 beginning test', "end test 456.456.4566")
str_detect(numbers, "(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)")
#[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
And as Wictor is pointing out you could also use the word boundary operator as long as you double escape it in R patterns.
grepl("\\b[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}\\b", numbers)
[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Caveat: The stringr functions (which if I remember correctly are based on stringi functions) appear to be different than the "ordinary" R regex functions in that they allow using the special character classes without double bracketing.
grepl("(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)", numbers)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
grepl("(^|[ ])[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}([ ]|$)", numbers)
[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Apparently this is via an implicit setting of "fixed" to TRUE.

Regex optional character preceded by Negative Lookback in R

Suppose I have a set of strings:
test <- c('MTB', 'NOT MTB', 'TB', 'NOT TB')
I want to write a regular expression to match either 'TB' or 'MTB' (e.g., the expression "M?TB") strictly when this FAILS to be preceeded by the phrase "NOT " (space included).
My intended result, therefore, is
TRUE FALSE TRUE FALSE
So far I have tried a couple of variations of
grepl("(?<!NOT )M?TB", test, perl = T)
TRUE TRUE TRUE FALSE
Unsuccessfully. As you can see, the phrase 'NOT MTB' meets the criteria for my regular expression.
It seems like including the optional character "M?" seems to make R think that the negative lookbehind is also optional. I have been looking into using parentheses to group the patterns, such as
grepl("(?<!NOT )(M?TB)")
TRUE TRUE TRUE FALSE
Which also fails to exclude the phrase 'NOT MTB'. Admittedly, I am unclear on how parentheses work in regex or eeven what "grouping" means in this context. I have had trouble finding a question related to how to group, require, and "optionalize" different parts of a regex so that I can match a phrase beginning with an optional character and preceeded by a negative lookback. What is the proper way to write an expression like this?
We could use the start (^) and end ($) to match only those words
grepl("^M?TB$", test)
#[1] TRUE FALSE TRUE FALSE
If there are other strings as #Wiktor Stribiżew mentioned in the comments, then one option would be
test1 <- c(test, "THIS MTB")
!grepl("\\bNOT M?TB\\b", test1) & grepl("\\bM?TB\\b", test1)
#[1] TRUE FALSE TRUE FALSE TRUE
test = c("MTB", "NOT MTB", "TB", "NOT TB", "THIS TB", "THIS NOT TB")
grepl("\\b(?<!NOT\\s)M?TB\\b",test,perl = TRUE)
[1] TRUE FALSE TRUE FALSE TRUE FALSE
There is some question on what the question intends but here is some code to try depending on what is wanted.
Added: Poster clarified that #2 and #3 are along the lines looked for.
1) This can be done without regular expressions like this:
test %in% c("TB", "MTB")
## [1] TRUE FALSE TRUE FALSE
2) If the problem is not about exact matches then return matches to M?TB which do not also match NOT M?TB:
grepl("M?TB", test) & !grepl("NOT M?TB",test)
## [1] TRUE FALSE TRUE FALSE
3) Another alternative is to replace NOT M?TB with X and then grepl on M?TB:
grepl("M?TB", sub("NOT M?TB", "X", test))
## [1] TRUE FALSE TRUE FALSE

regex to detect string separated by non alphabet characters (or nothing)

I'd like to write a regex to detect the string "el" (stands for "eliminated" and is inside a bunch of poorly formatted score data).
For example
tests <- c("el", "hello", "123el", "el/27")
Here I'm looking for the result TRUE, FALSE, TRUE, TRUE. My sad attempts which don't work for obvious reasons:
library(stringr)
str_detect(tests, "el") # TRUE TRUE TRUE TRUE
str_detect(tests, "[^a-z]el") # FALSE FALSE TRUE FALSE
Use the regex (\\b|[^[:alpha:]])el(\\b|[^[:alpha:]]) along with grepl:
> tests <- c("el", "hello", "123el", "el/27")
> y <- grepl("(\\b|[^[:alpha:]])el(\\b|[^[:alpha:]])", tests)
> y
[1] TRUE FALSE TRUE TRUE
Your condition for whether el appears as an entity is that both sides either have a word boundary (\b) or a non alpha character (represented by the character class [^[:alpha:]] in R).

Resources