regex to detect string separated by non alphabet characters (or nothing)

I'd like to write a regex to detect the string "el" (stands for "eliminated" and is inside a bunch of poorly formatted score data).
For example
tests <- c("el", "hello", "123el", "el/27")
Here I'm looking for the result TRUE, FALSE, TRUE, TRUE. My sad attempts which don't work for obvious reasons:
library(stringr)
str_detect(tests, "el") # TRUE TRUE TRUE TRUE
str_detect(tests, "[^a-z]el") # FALSE FALSE TRUE FALSE

Use the regex (\\b|[^[:alpha:]])el(\\b|[^[:alpha:]]) along with grepl:
> tests <- c("el", "hello", "123el", "el/27")
> y <- grepl("(\\b|[^[:alpha:]])el(\\b|[^[:alpha:]])", tests)
> y
[1] TRUE FALSE TRUE TRUE
Your condition for whether el appears as an entity is that both sides either have a word boundary (\b) or a non alpha character (represented by the character class [^[:alpha:]] in R).
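The same condition can also be written with lookarounds, which only assert that no letter sits directly before or after el and do not consume a character. This is a minimal sketch, assuming base R with perl = TRUE; the last two test strings are made up for illustration.

```r
tests <- c("el", "hello", "123el", "el/27", "elbow", "7-el-9")

# No letter immediately before "el" and no letter immediately after it.
pattern <- "(?<![[:alpha:]])el(?![[:alpha:]])"

result <- grepl(pattern, tests, perl = TRUE)
result
#[1] TRUE FALSE TRUE TRUE FALSE TRUE
```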

Related

extract words from a string into different strings

I'm very new to coding, and I have to clean a table with string variables. One of the columns I'm trying to clean includes several variables in itself. So if I take one row from my column, it looks like this:
string<- ("'casual': True,'classy': False,'divey': False,'hipster': False,'intimate': False,'romantic': False,'touristy': False,'trendy': False,'upscale': False")
I'm trying to extract the Boolean values for each of the categories into separate columns. So my outcome should have 9 columns (one per category), and the rows should contain the True/False values.
What am I supposed to use in this case?

An option is to use str_extract_all to extract the word (\\w+) that follows a colon and a space (the lookbehind (?<=: )):
library(stringr)
as.logical(str_extract_all(string, "(?<=: )\\w+")[[1]])
#[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
If we need to parse into a data.frame, it would be better to use fromJSON from jsonlite
library(jsonlite)
lst1 <- fromJSON(paste0("{", gsub("'", "", gsub("\\b(\\w+)\\b",
'"\\1"', string)), "}"))
data.frame(lapply(lst1, as.logical))
# casual classy divey hipster intimate romantic touristy trendy upscale
#1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or in base R
as.logical(regmatches(string, gregexpr("(?<=: )\\w+", string, perl = TRUE))[[1]])
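If the category names should travel with the values, the keys can be pulled out the same way and attached as names. This is only a sketch in base R, assuming every key is a run of word characters wrapped in single quotes, as in the sample row; the names keys, vals and flags are made up here.

```r
string <- "'casual': True,'classy': False,'divey': False,'hipster': False,'intimate': False,'romantic': False,'touristy': False,'trendy': False,'upscale': False"

# Keys: word characters sitting directly between two single quotes.
keys <- regmatches(string, gregexpr("(?<=')\\w+(?=')", string, perl = TRUE))[[1]]

# Values: the word that follows a colon and a space.
vals <- regmatches(string, gregexpr("(?<=: )\\w+", string, perl = TRUE))[[1]]

# Named logical vector; as.list() turns it into one column per category.
flags <- setNames(as.logical(vals), keys)
as.data.frame(as.list(flags))
```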

Find strings in R with this specific pattern "digitsXdigits"

I'm trying to clean a list of strings by finding strings with a particular pattern, but do not know how to write the regex to find them.
I am using grepl(), but do not know how to define the pattern.
The pattern is digits then [must include x, maybe special characters, letter] then digits again.
Here are some examples: OUTPUT from grepl()
"kills kld ldks 2087x-2714" TRUE
"sdlsn dklsk 4.75x25" TRUE
"dkks klsdk 3x4x135" TRUE
"djnlsdkl250shd" FALSE
"kdls, skfndkl 24gx.75" TRUE
"ski lsdkcm lskd 12.6" FALSE
"klslc ksldml 3.0 dnjsl 67n030" FALSE
It's a little bit of a complicated pattern. Basically it must include digits on both sides of the x, but can also have special characters and numbers in the mix.

It seems like there's no real restriction on what can occur on either side of the x, apart from at least some digits being present. So we can use [^ ] to match anything that's not a space:
grepl("[^ ]*\\d+[^ ]*x[^ ]*\\d+[^ ]*", x, perl = TRUE)
This gives your expected output on the example, but I can't guarantee that it'll work for all cases unless you can narrow down the restrictions.
As ikegami suggests, if all you need to do is detect these patterns (and not pull them out of the string), you can simplify this to:
grepl("\\d[^ ]*x[^ ]*\\d", x, perl = TRUE)
This could be a lot faster depending on your input, because the leading and trailing [^ ]* give the regex engine extra work without changing whether a match exists (search "regex backtracking" for an overview).
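A quick way to check that the shorter pattern detects the same strings as the longer one, at least on the sample data from the question (the variable names full and short are just for this sketch):

```r
x <- c("kills kld ldks 2087x-2714", "sdlsn dklsk 4.75x25",
       "dkks klsdk 3x4x135", "djnlsdkl250shd",
       "kdls, skfndkl 24gx.75", "ski lsdkcm lskd 12.6",
       "klslc ksldml 3.0 dnjsl 67n030")

full  <- grepl("[^ ]*\\d+[^ ]*x[^ ]*\\d+[^ ]*", x, perl = TRUE)
short <- grepl("\\d[^ ]*x[^ ]*\\d", x, perl = TRUE)

# Both patterns flag exactly the same elements of x.
identical(full, short)
#[1] TRUE
```

Agreement on seven strings is not a proof, but for pure detection the optional runs at both ends can always match nothing, so dropping them should not change the result.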

Using str_detect from the stringr package. I've added two additional test strings at the end of x.
The pattern is: a digit, zero or 1 occurrence of something that isn't a space, an x, zero or 1 occurrence of something that isn't a space, a digit
x <- c("kills kld ldks 2087x-2714",
"sdlsn dklsk 4.75x25",
"dkks klsdk 3x4x135",
"djnlsdkl250shd",
"kdls, skfndkl 24gx.75",
"ski lsdkcm lskd 12.6",
"klslc ksldml 3.0 dnjsl 67n030",
"5x25",
"kdls skfndkl x24g.75")
library(stringr)
str_detect(x, "\\d\\S?x\\S?\\d")
#[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE

Maybe you can use this pattern
grepl("\\d.*x.*\\d",x)
#[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE
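One thing to keep in mind is that .* also matches spaces, so the digits and the x may sit in different words. A small sketch of where this pattern and a [^ ]* version diverge; the inputs in y are invented for illustration and are not from the question.

```r
# Hypothetical inputs: a digit, an "x" and another digit spread over several words.
y <- c("3 boxes 4", "size 3x4", "2 extra 10")

loose  <- grepl("\\d.*x.*\\d", y)                      # .* may cross spaces
strict <- grepl("\\d[^ ]*x[^ ]*\\d", y, perl = TRUE)   # stays inside one token

loose
#[1] TRUE TRUE TRUE
strict
#[1] FALSE TRUE FALSE
```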
data
x <- c("kills kld ldks 2087x-2714","sdlsn dklsk 4.75x25",
"dkks klsdk 3x4x135","djnlsdkl250shd",
"kdls, skfndkl 24gx.75","ski lsdkcm lskd 12.6",
"klslc ksldml 3.0 dnjsl 67n030")

str_detect, validate phone number with specific pattern

I want to check whether the numbers in my list match a specific format (nnn.nnn.nnnn). I am expecting the code to return a boolean vector (FALSE, TRUE, FALSE, TRUE, FALSE, FALSE), but the last element returns TRUE when I want it to be FALSE.
library(stringr)
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222', '222 333 4444', '2345.234.2345')
str_detect(numbers, "[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}")
If I use:
str_detect(numbers, "[:digit:]{4}\\.[:digit:]{3}\\.[:digit:]{4}")
I get (FALSE, FALSE, FALSE, FALSE, FALSE, TRUE), so I know the pattern for exact matches works, but I am not sure why the first block of code returns TRUE for the last element when there are 4 digits and not 3 before the first '.'.

It is because that last value has '345.234.2345' at the end, and your pattern does not require the match to start at the beginning of the string and finish at its end.
Try this pattern:
"^[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}$"
If you wanted to match a number that may also sit inside a longer string, separated from the rest by a space at its beginning or end, it might be more general to use:
"(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)"
Testing:
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222', '222 333 4444',
'2345.234.2345', "interior test 456.456.4566 other",
'456.456.4566 beginning test', "end test 456.456.4566")
str_detect(numbers, "(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)")
#[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
And as Wiktor is pointing out, you could also use the word boundary operator, as long as you double-escape it in R patterns:
grepl("\\b[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}\\b", numbers)
[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Caveat: The stringr functions (which if I remember correctly are based on stringi functions) appear to be different than the "ordinary" R regex functions in that they allow using the special character classes without double bracketing.
grepl("(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)", numbers)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
grepl("(^|[ ])[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}([ ]|$)", numbers)
[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
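This has nothing to do with fixed = TRUE: stringr uses the ICU regex engine, which accepts [:digit:] on its own, while in base R's default engine [:digit:] without the outer brackets is an ordinary bracket list that matches any one of the characters :, d, i, g and t.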
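A short base R check makes the single-bracket behaviour visible. The probe strings are made up for this sketch; the point is only which of them each pattern accepts with grepl's default engine.

```r
probe <- c("5", "d", ":", "x")

# Single brackets: a bracket list of the literal characters : d i g t.
single <- grepl("[:digit:]", probe)
single
#[1] FALSE TRUE TRUE FALSE

# Double brackets: the POSIX digit class inside a bracket list.
double <- grepl("[[:digit:]]", probe)
double
#[1] TRUE FALSE FALSE FALSE
```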

Regex optional character preceded by Negative Lookback in R

Suppose I have a set of strings:
test <- c('MTB', 'NOT MTB', 'TB', 'NOT TB')
I want to write a regular expression to match either 'TB' or 'MTB' (e.g., the expression "M?TB") strictly when it is NOT preceded by the phrase "NOT " (space included).
My intended result, therefore, is
TRUE FALSE TRUE FALSE
So far I have tried a couple of variations of
grepl("(?<!NOT )M?TB", test, perl = T)
TRUE TRUE TRUE FALSE
Unsuccessfully. As you can see, the phrase 'NOT MTB' meets the criteria for my regular expression.
It seems like including the optional character "M?" makes R think that the negative lookbehind is also optional. I have been looking into using parentheses to group the patterns, such as
grepl("(?<!NOT )(M?TB)", test, perl = TRUE)
TRUE TRUE TRUE FALSE
Which also fails to exclude the phrase 'NOT MTB'. Admittedly, I am unclear on how parentheses work in regex or even what "grouping" means in this context. I have had trouble finding a question related to how to group, require, and "optionalize" different parts of a regex so that I can match a phrase beginning with an optional character and preceded by a negative lookbehind. What is the proper way to write an expression like this?

We could use the start (^) and end ($) to match only those words
grepl("^M?TB$", test)
#[1] TRUE FALSE TRUE FALSE
If there are other strings, as @Wiktor Stribiżew mentioned in the comments, then one option would be
test1 <- c(test, "THIS MTB")
!grepl("\\bNOT M?TB\\b", test1) & grepl("\\bM?TB\\b", test1)
#[1] TRUE FALSE TRUE FALSE TRUE

test = c("MTB", "NOT MTB", "TB", "NOT TB", "THIS TB", "THIS NOT TB")
grepl("\\b(?<!NOT\\s)M?TB\\b",test,perl = TRUE)
[1] TRUE FALSE TRUE FALSE TRUE FALSE
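The lookbehind in the original attempt is not optional. The engine simply retries one character later, lets M? match nothing, and finds a TB that is preceded by "OT M" instead of "NOT ". The word boundary removes that second starting point. A small sketch that shows what actually gets matched (the names loose_hit and strict_hit are made up here):

```r
s <- "NOT MTB"

# Without \\b the match starts at "T": M? matches nothing there and
# the four characters before it are "OT M", so the lookbehind passes.
loose_hit <- regmatches(s, regexpr("(?<!NOT )M?TB", s, perl = TRUE))
loose_hit
#[1] "TB"

# With \\b the match may only start at a word boundary, i.e. at "M",
# where the lookbehind sees "NOT " and rejects it.
strict_hit <- regmatches(s, regexpr("\\b(?<!NOT\\s)M?TB\\b", s, perl = TRUE))
strict_hit
#character(0)
```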

There is some question on what the question intends but here is some code to try depending on what is wanted.
Added: Poster clarified that #2 and #3 are along the lines looked for.
1) This can be done without regular expressions like this:
test %in% c("TB", "MTB")
## [1] TRUE FALSE TRUE FALSE
2) If the problem is not about exact matches then return matches to M?TB which do not also match NOT M?TB:
grepl("M?TB", test) & !grepl("NOT M?TB",test)
## [1] TRUE FALSE TRUE FALSE
3) Another alternative is to replace NOT M?TB with X and then grepl on M?TB:
grepl("M?TB", sub("NOT M?TB", "X", test))
## [1] TRUE FALSE TRUE FALSE

Partial string matching with grep and regular expressions

I have a vector of three character strings, and I'm trying to write a command that will find which members of the vector have a particular letter as the second character.
As an example, say I have this vector of 3-letter strings...
example = c("AWA","WOO","AZW","WWP")
I can use grepl and glob2rx to find strings with W as the first or last character.
> grepl(glob2rx("W*"),example)
[1] FALSE TRUE FALSE TRUE
> grepl(glob2rx("*W"),example)
[1] FALSE FALSE TRUE FALSE
However, I don't get the right result when I try using it with glob2rx("*W*")
> grepl(glob2rx("*W*"),example)
[1] TRUE TRUE TRUE TRUE
I am sure my understanding of regular expressions is lacking, however this seems like a pretty straightforward problem and I can't seem to find the solution. I'd really love some assistance!
For future reference, I'd also really like to know if I could extend this to the case where I have longer strings. Say I have strings that are 5 characters long, could I use grepl in such a way to return strings where W is the third character?

I would have thought that this was the regex way:
> grepl("^.W",example)
[1] TRUE FALSE FALSE TRUE
If you wanted a particular position that is prespecified then:
> grepl("^.{1}W",example)
[1] TRUE FALSE FALSE TRUE
This would allow programmatic calculation:
pos= 2
n=pos-1
grepl(paste0("^.{",n,"}W"),example)
[1] TRUE FALSE FALSE TRUE
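As for why glob2rx("*W*") matched everything: in a glob, * stands for any run of characters, including an empty one, so "*W*" only asks for a W somewhere in the string. The glob wildcard for exactly one character is ?. The sketch below shows that, plus a small wrapper around the paste0 idea for longer strings. has_char_at is a hypothetical helper name, and it assumes ch is a plain letter with no special meaning in a regex.

```r
example <- c("AWA", "WOO", "AZW", "WWP")

# "?" = exactly one character, "*" = any run of characters.
second_is_w <- grepl(glob2rx("?W*"), example)
second_is_w
#[1] TRUE FALSE FALSE TRUE

# Hypothetical helper: is `ch` the character at position `pos`?
has_char_at <- function(x, ch, pos) {
  grepl(paste0("^.{", pos - 1, "}", ch), x)
}

five <- c("AAWAA", "WAAAA", "ABWCD")
third_is_w <- has_char_at(five, "W", 3)
third_is_w
#[1] TRUE FALSE TRUE
```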

If you have 3-character strings and need to check the second character, you could just test the appropriate substring instead of using regular expressions:
example = c("AWA","WOO","AZW","WWP")
substr(example, 2, 2) == "W"
# [1] TRUE FALSE FALSE TRUE
