str_detect, validate phone number with specific pattern - r

I want to check whether the numbers in my vector match a specific format (nnn.nnn.nnnn). I expect the code to return a logical vector (FALSE, TRUE, FALSE, TRUE, FALSE, FALSE), but the last element returns TRUE when I want it to be FALSE.
library(stringr)
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222', '222 333 4444', '2345.234.2345')
str_detect(numbers, "[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}")
If I use:
str_detect(numbers, "[:digit:]{4}\\.[:digit:]{3}\\.[:digit:]{4}")
I get (FALSE, FALSE, FALSE, FALSE, FALSE, TRUE), so I know the pattern works for exact matches, but I am not sure why the first pattern returns TRUE for the last element when there are 4 digits, not 3, before the first '.'.

That is because the last value contains '345.234.2345' at its end, and your pattern does not require the match to span the whole string from start to end.
Try this pattern:
"^[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}$"
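To see exactly which part of the last element the unanchored pattern latches onto, it can help to extract the match. The sketch below is an illustration rather than part of the original answer: it uses base R (so the character class is written [[:digit:]]) and a shortened, hypothetical vector x.

```r
# Hypothetical two-element vector, for illustration only
x <- c("456.456.4566", "2345.234.2345")

unanchored <- "[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}"
anchored   <- paste0("^", unanchored, "$")

# What the unanchored pattern actually matched in each element
matched <- regmatches(x, regexpr(unanchored, x))
print(matched)
# [1] "456.456.4566" "345.234.2345"

# Anchoring forces the whole string to fit the pattern
whole <- grepl(anchored, x)
print(whole)
# [1]  TRUE FALSE
```

The second element matches without anchors only because the engine is free to start at the second character.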
If you want to match the number anywhere inside a longer string, as long as it is delimited by a space or by the start or end of the string, a more general pattern is:
"(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)"
Testing:
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222', '222 333 4444',
'2345.234.2345', "interior test 456.456.4566 other",
'456.456.4566 beginning test', "end test 456.456.4566")
str_detect(numbers, "(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)")
#[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
And as Wiktor points out, you could also use the word-boundary operator, as long as you double-escape it in R patterns.
grepl("\\b[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}\\b", numbers)
[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Caveat: The stringr functions (which, if I remember correctly, are based on stringi functions) differ from the "ordinary" R regex functions in that they allow the special character classes to be used without double bracketing.
grepl("(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)", numbers)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
grepl("(^|[ ])[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}([ ]|$)", numbers)
[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
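The difference comes from the regex engine rather than from any "fixed" setting: stringr uses ICU regular expressions (via stringi), which accept [:digit:] on its own, whereas base R's default engine reads "[:digit:]" as a bracket expression matching any one of the characters ':', 'd', 'i', 'g', 't'.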
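One way to see what base R does with the single-bracket form is to test it on individual characters. This is a minimal sketch assuming base R's default regex engine (no perl = TRUE); the vector chars is made up for illustration.

```r
# Hypothetical test characters
chars <- c("5", "d", ":", "x")

# Single brackets: a bracket expression listing the characters : d i g t
single <- grepl("[:digit:]", chars)

# Double brackets: the POSIX digit class inside a bracket expression
double <- grepl("[[:digit:]]", chars)

print(rbind(single, double))
#         [,1]  [,2]  [,3]  [,4]
# single FALSE  TRUE  TRUE FALSE
# double  TRUE FALSE FALSE FALSE
```

With the single-bracket form, "5" is not matched at all, which is why the grepl call on the phone numbers returns FALSE everywhere.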

Related

regex a before b with and without whitespaces

I am trying to find all the strings that include 'true' with no 'act' anywhere before it.
An example of possible vector:
vector = c("true","trueact","acttrue","act true","act really true")
What I have so far is this:
grepl(pattern="(?<!act)true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE TRUE TRUE
What I'm hoping for is
[1] TRUE TRUE FALSE FALSE FALSE
Maybe this works: it skips any match where 'act' precedes 'true' and matches 'true' otherwise.
grepl("(act.*true)(*SKIP)(*FAIL)|\\btrue", vector,
perl = TRUE, ignore.case = TRUE)
[1] TRUE TRUE FALSE FALSE FALSE
Here is one way to do so:
grepl(pattern="^(.(?<!act))*?true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE FALSE FALSE
^: start of the string
.: matches any character
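(?<!act): negative lookbehind; fails if the characters immediately before the current position are act
act: matches act
*?: repeats .(?<!act) between 0 and unlimited times, as few times as possible (lazy)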
true: matches true
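The same condition can also be checked in two steps without any lookbehind, which some readers may find easier to follow. This is a sketch of an alternative approach, not the answerer's method; prefix and result are names introduced here for illustration.

```r
vector <- c("true", "trueact", "acttrue", "act true", "act really true")

# Text before the first "true" (the whole string if "true" is absent)
prefix <- sub("true.*$", "", vector, ignore.case = TRUE)

# Keep strings that contain "true" and have no "act" in that prefix
result <- grepl("true", vector, ignore.case = TRUE) &
  !grepl("act", prefix, ignore.case = TRUE)
print(result)
# [1]  TRUE  TRUE FALSE FALSE FALSE
```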

Test if String contains Letters and Characters

I got this vector:
bar <- c("aaa:something", "111:something", "a1a1:something", "1a:something")
I want to check whether there are both letters and numbers before the colon (:). There can be arbitrarily many of each, but both need to be present, so the result should be
FALSE, FALSE, TRUE, TRUE
How can I do that?
Assuming the numbers and letters can appear in any order, you can do:
grepl('^[^:]*([a-zA-Z][0-9]|[0-9][a-zA-Z])[^:]*:', bar)
#[1] FALSE FALSE TRUE TRUE
You can combine two grepl calls:
grepl("[[:digit:]].*:", bar) & grepl("[[:alpha:]].*:", bar)
#[1] FALSE FALSE TRUE TRUE
#grepl("[0-9].*:", bar) & grepl("[a-zA-Z].*:", bar) #Alternative
To do it in one go, you can use a non-consuming expression (a lookahead):
grepl("(?=.*[[:digit:]]).*[[:alpha:]].*:", bar, perl=TRUE)
#[1] FALSE FALSE TRUE TRUE
grepl("[a-z]+\\d+.*\\:|\\d+[a-z]+.*\\:", bar, ignore.case = TRUE)
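If letters or digits can also appear after the colon, it may be safer to cut the string at the first colon and test only that part. The following is a hedged sketch of that idea; the fifth element of bar and the names has_colon, prefix and result are hypothetical additions for illustration.

```r
bar <- c("aaa:something", "111:something", "a1a1:something", "1a:something",
         "abc:item42")  # last element is a hypothetical extra case

# Require a colon, then keep only the text before the first one
has_colon <- grepl(":", bar, fixed = TRUE)
prefix <- sub(":.*$", "", bar)

# Both a letter and a digit must occur in the prefix
result <- has_colon & grepl("[[:alpha:]]", prefix) & grepl("[[:digit:]]", prefix)
print(result)
# [1] FALSE FALSE  TRUE  TRUE FALSE
```

The extra element has digits only after the colon, so it is rejected.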

extract words from a string into different strings

I'm very new to coding, and I have to clean a table with string variables. One of the columns I'm trying to clean packs several variables into a single string. One row of that column looks like this:
string<- ("'casual': True,'classy': False,'divey': False,'hipster': False,'intimate': False,'romantic': False,'touristy': False,'trendy': False,'upscale': False")
I'm trying to extract the Boolean values for each of the categories into separate columns. So my outcome should have 9 columns (one for each category), and the rows should contain the True/False values.
What am I supposed to use in this case?
An option is to use str_extract_all to extract the word (\\w+) that follows a colon and a space:
library(stringr)
as.logical(str_extract_all(string, "(?<=: )\\w+")[[1]])
#[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
If we need to parse into a data.frame, it would be better to use fromJSON from jsonlite
library(jsonlite)
lst1 <- fromJSON(paste0("{", gsub("'", "", gsub("\\b(\\w+)\\b",
'"\\1"', string)), "}"))
data.frame(lapply(lst1, as.logical))
# casual classy divey hipster intimate romantic touristy trendy upscale
#1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or in base R
as.logical(regmatches(string, gregexpr("(?<=: )\\w+", string, perl = TRUE))[[1]])
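To get one named column per category in base R, the keys can be pulled out alongside the values. This is a minimal sketch under the assumption that every entry has the form 'key': Value and that no key contains a comma; it uses a shortened version of the string, and pairs, keys, vals and df are names introduced here.

```r
# Shortened example string, for illustration
string <- "'casual': True,'classy': False,'divey': False"

# Split into 'key': Value pairs
pairs <- strsplit(string, ",", fixed = TRUE)[[1]]

# Key is the text between the single quotes; value is what follows the colon
keys <- sub("^'([^']+)'.*$", "\\1", pairs)
vals <- as.logical(sub("^.*:\\s*", "", pairs))

# One-row data frame with one logical column per category
df <- as.data.frame(as.list(setNames(vals, keys)))
print(df)
```

For a whole column of such strings, the same steps could be wrapped in a function and applied row by row.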

Regex optional character preceded by Negative Lookback in R

Suppose I have a set of strings:
test <- c('MTB', 'NOT MTB', 'TB', 'NOT TB')
I want to write a regular expression that matches either 'TB' or 'MTB' (e.g., the expression "M?TB") only when it is NOT preceded by the phrase "NOT " (space included).
My intended result, therefore, is
TRUE FALSE TRUE FALSE
So far I have tried a couple of variations of
grepl("(?<!NOT )M?TB", test, perl = T)
TRUE TRUE TRUE FALSE
This is unsuccessful: as you can see, the phrase 'NOT MTB' is still matched by my regular expression.
Including the optional character "M?" seems to make R treat the negative lookbehind as optional too. I have looked into using parentheses to group the patterns, such as
grepl("(?<!NOT )(M?TB)", test, perl = T)
TRUE TRUE TRUE FALSE
This also fails to exclude the phrase 'NOT MTB'. Admittedly, I am unclear on how parentheses work in regex, or even what "grouping" means in this context. I have had trouble finding a question about how to group, require, and "optionalize" different parts of a regex so that I can match a phrase that begins with an optional character and is preceded by a negative lookbehind. What is the proper way to write an expression like this?
We could use the start (^) and end ($) anchors to match only those words:
grepl("^M?TB$", test)
#[1] TRUE FALSE TRUE FALSE
If there are other strings, as Wiktor Stribiżew mentioned in the comments, then one option would be
test1 <- c(test, "THIS MTB")
!grepl("\\bNOT M?TB\\b", test1) & grepl("\\bM?TB\\b", test1)
#[1] TRUE FALSE TRUE FALSE TRUE
test = c("MTB", "NOT MTB", "TB", "NOT TB", "THIS TB", "THIS NOT TB")
grepl("\\b(?<!NOT\\s)M?TB\\b",test,perl = TRUE)
[1] TRUE FALSE TRUE FALSE TRUE FALSE
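To see why the original pattern accepts 'NOT MTB', it can help to look at what it actually matched. The sketch below is illustrative only; m, matched and fixed are names introduced here. It suggests that the lookbehind is not optional: the engine simply retries one character later, where M? matches nothing and the text before the match is no longer "NOT ".

```r
test <- c("MTB", "NOT MTB", "TB", "NOT TB")

# Position of the first match per element (-1 means no match)
m <- regexpr("(?<!NOT )M?TB", test, perl = TRUE)

# Matched substring per element, NA where nothing matched
matched <- rep(NA_character_, length(test))
matched[m > 0] <- regmatches(test, m)
print(setNames(matched, test))
#     MTB NOT MTB      TB  NOT TB
#   "MTB"    "TB"    "TB"      NA

# A leading word boundary stops the match from starting mid-word
fixed <- grepl("\\b(?<!NOT )M?TB\\b", test, perl = TRUE)
print(fixed)
# [1]  TRUE FALSE  TRUE FALSE
```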
It is not entirely clear what the question intends, but here is some code to try, depending on what is wanted.
Added: Poster clarified that #2 and #3 are along the lines looked for.
1) This can be done without regular expressions like this:
test %in% c("TB", "MTB")
## [1] TRUE FALSE TRUE FALSE
2) If the problem is not about exact matches then return matches to M?TB which do not also match NOT M?TB:
grepl("M?TB", test) & !grepl("NOT M?TB",test)
## [1] TRUE FALSE TRUE FALSE
3) Another alternative is to replace NOT M?TB with X and then grepl on M?TB:
grepl("M?TB", sub("NOT M?TB", "X", test))
## [1] TRUE FALSE TRUE FALSE

find where in boolean vector TRUE followed by FALSE

I have a boolean vector and need to indicate where it changes (from TRUE to FALSE).
input <- c(rep(TRUE,3), rep(FALSE,2), TRUE, FALSE)
input
[1] TRUE TRUE TRUE FALSE FALSE TRUE FALSE
The result should be c(4, 7). Does something for doing this already exist (in base)?
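One possible base R approach, offered as a sketch rather than a definitive answer, is to combine diff() and which(). The name drops is introduced here for illustration.

```r
input <- c(rep(TRUE, 3), rep(FALSE, 2), TRUE, FALSE)

# diff() treats logicals as 0/1, so a TRUE -> FALSE step shows up as -1.
# Adding 1 shifts from the index of the last TRUE to the first FALSE.
drops <- which(diff(input) == -1) + 1
print(drops)
# [1] 4 7
```

An equivalent formulation without arithmetic would be which(head(input, -1) & !tail(input, -1)) + 1.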

Resources