I'm trying to clean a list of strings by finding strings with a particular pattern, but do not know how to write the regex to find them.
I am using grepl(), but do not know how to define the pattern.
The pattern is digits then [must include x, maybe special characters, letter] then digits again.
Here are some examples, with the desired output from grepl():
"kills kld ldks 2087x-2714" TRUE
"sdlsn dklsk 4.75x25" TRUE
"dkks klsdk 3x4x135" TRUE
"djnlsdkl250shd" FALSE
"kdls, skfndkl 24gx.75" TRUE
"ski lsdkcm lskd 12.6" FALSE
"klslc ksldml 3.0 dnjsl 67n030" FALSE
It's a little bit of a complicated pattern. Basically it must include digits on both sides of the x, but can also have special characters and numbers in the mix.
It seems like there's no real restriction on what can occur on either side of the x, apart from at least some digits being present. So we can use [^ ] to match anything that's not a space:
grepl("[^ ]*\\d+[^ ]*x[^ ]*\\d+[^ ]*", x, perl = TRUE)
This gives your expected output on the example, but I can't guarantee that it'll work for all cases unless you can narrow down the restrictions.
As ikegami suggests, if all you need to do is detect these patterns (and not pull them out of the string), you can simplify this to:
grepl("\\d[^ ]*x[^ ]*\\d", x, perl = TRUE)
This could be a lot faster depending on your input, because greedy runs like [^ ]* at the start of a pattern can cause a lot of backtracking (search for "regex backtracking" to get an overview).
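A quick way to check that the shorter pattern flags the same strings as the longer one, and to see what the longer one actually captures, is sketched below. It reuses the example strings from the question; the names full, short, same and tokens are only for illustration.

```r
x <- c("kills kld ldks 2087x-2714", "sdlsn dklsk 4.75x25",
       "dkks klsdk 3x4x135", "djnlsdkl250shd",
       "kdls, skfndkl 24gx.75", "ski lsdkcm lskd 12.6",
       "klslc ksldml 3.0 dnjsl 67n030")

full  <- "[^ ]*\\d+[^ ]*x[^ ]*\\d+[^ ]*"  # matches the whole space-delimited token
short <- "\\d[^ ]*x[^ ]*\\d"              # matches only digit ... x ... digit

# Both patterns flag the same strings
same <- identical(grepl(full, x, perl = TRUE), grepl(short, x, perl = TRUE))
same
# [1] TRUE

# The longer pattern is still useful when the token itself is needed
tokens <- regmatches(x, regexpr(full, x, perl = TRUE))
tokens
# [1] "2087x-2714" "4.75x25"    "3x4x135"    "24gx.75"
```

So the shorter form is enough for detection, while the longer form is the one to keep if the matching token has to be extracted.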
Using str_detect from the stringr package. I've added two additional test strings at the end of x.
The pattern is: a digit, at most one non-whitespace character, an x, at most one non-whitespace character, a digit.
x <- c("kills kld ldks 2087x-2714",
"sdlsn dklsk 4.75x25",
"dkks klsdk 3x4x135",
"djnlsdkl250shd",
"kdls, skfndkl 24gx.75",
"ski lsdkcm lskd 12.6",
"klslc ksldml 3.0 dnjsl 67n030",
"5x25",
"kdls skfndkl x24g.75")
stringr::str_detect(x, "\\d\\S?x\\S?\\d")
#[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
Maybe you can use this pattern
grepl("\\d.*x.*\\d",x)
#[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE
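One caveat that may matter on real data: . also matches spaces, so this pattern lets the digits and the x sit in different words. The sketch below uses a made-up string (not from the question) to show how that differs from a pattern that excludes spaces.

```r
# Hypothetical input: digits and an "x" occur, but in separate words
y <- "12 boxes 3"

crosses_words <- grepl("\\d.*x.*\\d", y)                     # "." may cross spaces
single_token  <- grepl("\\d[^ ]*x[^ ]*\\d", y, perl = TRUE)  # must stay in one token

crosses_words
# [1] TRUE
single_token
# [1] FALSE
```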
data
x <- c("kills kld ldks 2087x-2714","sdlsn dklsk 4.75x25",
"dkks klsdk 3x4x135","djnlsdkl250shd",
"kdls, skfndkl 24gx.75","ski lsdkcm lskd 12.6",
"klslc ksldml 3.0 dnjsl 67n030")
I guess this is straightforward, but I can't manage to get my RegEx right and haven't found an exact example yet...
How can I match only strings that contain a specific character an exact number of times (not necessarily consecutively)?
Let's look at a data set
terms<-c('breeding'
,'foraging'
,'prey'
,'breeding_season'
,'foraging_time'
,'seabird_ecology'
,'annual_reproductive_success'
,'sea_surface_temperature'
,'mean_chick_weight')
I want to select all strings that have exactly two underscores ('_')
I can try:
stringr::str_detect(terms, "_{2}")
no luck
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or
terms[stringr::str_detect(terms, "._.{2,}")]
gives
[1] "breeding_season" "foraging_time" "seabird_ecology"
[4] "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
but I want only
[1] "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
Thank you RegEx masters
What you're missing is the negated character class. You want to match something that is not an underscore and then an underscore. It's generally like [^X].
/^[^_]*_[^_]*_[^_]*$/
# or
/^(?:[^_]*_){2}[^_]*$/
That is:
beginning of string
anything not an underscore
underscore
anything not an underscore
underscore
anything not an underscore
end of string
This is just one way to do it.
HTH
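In R the surrounding slashes are dropped and the pattern is passed as a string. A minimal sketch, assuming the terms vector from the question and perl = TRUE so that the non-capturing group (?:...) is accepted; the name two_underscores is only for illustration:

```r
terms <- c("breeding", "foraging", "prey", "breeding_season",
           "foraging_time", "seabird_ecology",
           "annual_reproductive_success", "sea_surface_temperature",
           "mean_chick_weight")

# Exactly two underscores: two "non-underscores then underscore" groups,
# followed by non-underscores up to the end of the string
two_underscores <- grepl("^(?:[^_]*_){2}[^_]*$", terms, perl = TRUE)
terms[two_underscores]
# [1] "annual_reproductive_success" "sea_surface_temperature"
# [3] "mean_chick_weight"
```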
Here's why your efforts were failing:
"_{2}"    # matches two underscores right next to each other
"._.{2,}" # matches any character, an underscore, then two or more of any character
Simplest (but not quite what you asked for) with grepl or str_detect would be:
grepl("_.*_", terms)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
The ".*" allows an arbitrary number of arbitrary characters, including further underscores, so use a negated character class before, after and in the middle to get exactly two underscores separated by non-underscores:
grepl("^[^_]*_[^_]*_[^_]*$", terms)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Added the "^" and the "$" to indicate the beginning and end. The "^" operator is different in meaning inside and outside of character-class brackets.
Solution using the stringr package. It is a little more straightforward because it avoids a complex regular expression. Here we use the str_count function to count the number of matches in each string.
terms<-c('breeding'
,'foraging'
,'prey'
,'breeding_season'
,'foraging_time'
,'seabird_ecology'
,'annual_reproductive_success'
,'sea_surface_temperature'
,'mean_chick_weight')
library(stringr)
terms[str_count(terms, "_") == 2]
#> [1] "annual_reproductive_success" "sea_surface_temperature"
#> [3] "mean_chick_weight"
Created on 2021-03-08 by the reprex package (v0.3.0)
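If stringr is not available, the same count can be sketched in base R. This is an alternative to str_count, not something from the answer above; the name n_underscores is only for illustration.

```r
terms <- c("breeding", "foraging", "prey", "breeding_season",
           "foraging_time", "seabird_ecology",
           "annual_reproductive_success", "sea_surface_temperature",
           "mean_chick_weight")

# gregexpr() locates every "_", regmatches() extracts them,
# and lengths() counts how many were found in each string
n_underscores <- lengths(regmatches(terms, gregexpr("_", terms, fixed = TRUE)))
n_underscores
# [1] 0 0 0 1 1 1 2 2 2

terms[n_underscores == 2]
# [1] "annual_reproductive_success" "sea_surface_temperature"
# [3] "mean_chick_weight"
```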
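Using grep with value = TRUE (note that \\w also matches an underscore, so this pattern really means "at least two underscores"; it gives the expected result here because no term has more than two):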
grep('^\\w+_\\w+_\\w+$', terms, value = TRUE)
#[1] "annual_reproductive_success" "sea_surface_temperature"
#[3] "mean_chick_weight"
In stringr you can use str_subset with same regex :
stringr::str_subset(terms, '^\\w+_\\w+_\\w+$')
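Because \w covers letters, digits and the underscore itself, it may be worth checking how this pattern behaves on a string with more than two underscores. The sketch below uses a made-up term that is not part of the question's data.

```r
# Hypothetical term with three underscores
extra <- "a_b_c_d"

with_w   <- grepl("^\\w+_\\w+_\\w+$", extra)     # \w can swallow "_" as well
with_neg <- grepl("^[^_]+_[^_]+_[^_]+$", extra)  # each part must be underscore-free

with_w
# [1] TRUE
with_neg
# [1] FALSE
```

If the data can contain terms with three or more underscores, the negated character class is the safer choice.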
Suppose I have a set of strings:
test <- c('MTB', 'NOT MTB', 'TB', 'NOT TB')
I want to write a regular expression to match either 'TB' or 'MTB' (e.g., the expression "M?TB"), but only when it is NOT preceded by the phrase "NOT " (space included).
My intended result, therefore, is
TRUE FALSE TRUE FALSE
So far I have tried a couple of variations of
grepl("(?<!NOT )M?TB", test, perl = T)
TRUE TRUE TRUE FALSE
Unsuccessfully. As you can see, the phrase 'NOT MTB' meets the criteria for my regular expression.
It seems that including the optional character "M?" makes R treat the negative lookbehind as optional too. I have looked into using parentheses to group the patterns, such as
grepl("(?<!NOT )(M?TB)", test, perl = TRUE)
TRUE TRUE TRUE FALSE
which also fails to exclude the phrase 'NOT MTB'. Admittedly, I am unclear on how parentheses work in regex or even what "grouping" means in this context. I have had trouble finding a question about how to group, require, and "optionalize" different parts of a regex so that I can match a phrase beginning with an optional character and preceded by a negative lookbehind. What is the proper way to write an expression like this?
We could use the start (^) and end ($) to match only those words
grepl("^M?TB$", test)
#[1] TRUE FALSE TRUE FALSE
If there are other strings, as @Wiktor Stribiżew mentioned in the comments, then one option would be
test1 <- c(test, "THIS MTB")
!grepl("\\bNOT M?TB\\b", test1) & grepl("\\bM?TB\\b", test1)
#[1] TRUE FALSE TRUE FALSE TRUE
test = c("MTB", "NOT MTB", "TB", "NOT TB", "THIS TB", "THIS NOT TB")
grepl("\\b(?<!NOT\\s)M?TB\\b",test,perl = TRUE)
[1] TRUE FALSE TRUE FALSE TRUE FALSE
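To see why the question's original pattern let 'NOT MTB' through, it can help to look at where the match starts. The sketch below, using the four strings from the question, suggests that the engine simply skips the optional M and matches the bare TB, which is preceded by "OT M" rather than "NOT ". A leading \\b closes that gap because there is no word boundary between M and T. The names starts and with_boundary are only for illustration.

```r
test <- c("MTB", "NOT MTB", "TB", "NOT TB")

# Start position of the first match (-1 means no match)
starts <- as.integer(regexpr("(?<!NOT )M?TB", test, perl = TRUE))
starts
# [1]  1  6  1 -1
# In "NOT MTB" the match starts at position 6, i.e. at "TB", not at "MTB"

# With a leading \\b the match can no longer start in the middle of "MTB"
with_boundary <- grepl("\\b(?<!NOT )M?TB", test, perl = TRUE)
with_boundary
# [1]  TRUE FALSE  TRUE FALSE
```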
It is not entirely clear what the question intends, but here is some code to try depending on what is wanted.
Added: Poster clarified that #2 and #3 are along the lines looked for.
1) This can be done without regular expressions like this:
test %in% c("TB", "MTB")
## [1] TRUE FALSE TRUE FALSE
2) If the problem is not about exact matches then return matches to M?TB which do not also match NOT M?TB:
grepl("M?TB", test) & !grepl("NOT M?TB",test)
## [1] TRUE FALSE TRUE FALSE
3) Another alternative is to replace NOT M?TB with X and then grepl on M?TB:
grepl("M?TB", sub("NOT M?TB", "X", test))
## [1] TRUE FALSE TRUE FALSE
I'd like to write a regex to detect the string "el" (stands for "eliminated" and is inside a bunch of poorly formatted score data).
For example
tests <- c("el", "hello", "123el", "el/27")
Here I'm looking for the result TRUE, FALSE, TRUE, TRUE. My sad attempts which don't work for obvious reasons:
library(stringr)
str_detect(tests, "el") # TRUE TRUE TRUE TRUE
str_detect(tests, "[^a-z]el") # FALSE FALSE TRUE FALSE
Use the regex (\\b|[^[:alpha:]])el(\\b|[^[:alpha:]]) along with grepl:
> tests <- c("el", "hello", "123el", "el/27")
> y <- grepl("(\\b|[^[:alpha:]])el(\\b|[^[:alpha:]])", tests)
> y
[1] TRUE FALSE TRUE TRUE
Your condition for whether el appears as an entity is that both sides either have a word boundary (\b) or a non alpha character (represented by the character class [^[:alpha:]] in R).
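The same condition can also be written with lookarounds, which check the neighbouring characters without consuming them. This is an alternative sketch rather than the pattern given above, and it needs perl = TRUE; the name is_el is only for illustration.

```r
tests <- c("el", "hello", "123el", "el/27")

# "el" that is neither preceded nor followed by a letter
is_el <- grepl("(?<![[:alpha:]])el(?![[:alpha:]])", tests, perl = TRUE)
is_el
# [1]  TRUE FALSE  TRUE  TRUE
```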
Despite reading the help page of R regex, which says:
"Finally, to include a literal -, place it first or last (or, for perl = TRUE only, precede it by a backslash)."
I can't understand the difference between
grepl(pattern=paste("^thing1\\-",sep=""),x="thing1-thing2")
and
grepl(pattern=paste("^thing1-",sep=""),x="thing1-thing2")
Both return TRUE. Should I escape or not here? What is the best practice?
The hyphen is mostly a normal character in regular expressions.
You do not need to escape the hyphen outside of a character class; it has no special meaning.
Within a character class [ ] you can place a hyphen as the first or last character of the class. If you place the hyphen anywhere else, you need to escape it in order to add it to your class.
Examples:
grepl('^thing1-', x='thing1-thing2')
[1] TRUE
grepl('[-a-z]+', 'foo-bar')
[1] TRUE
grepl('[a-z-]+', 'foo-bar')
[1] TRUE
grepl('[a-z\\-\\d]+', 'foo-bar', perl = TRUE)  # backslash-escaped hyphen needs perl = TRUE
[1] TRUE
Note: It is more common to find a hyphen placed first or last within a character class.
To see what it means for - to have a special meaning inside of a character class (and how putting it last gives it its literal meaning), try the following:
grepl("[w-y]", "x")
# [1] TRUE
grepl("[w-y]", "-")
# [1] FALSE
grepl("[wy-]", "-")
# [1] TRUE
grepl("[wy-]", "x")
# [1] FALSE
They are both matching the exact same text in these instances. I.e.:
x <- "thing1-thing2"
regmatches(x,regexpr("^thing1\\-",x))
#[1] "thing1-"
regmatches(x,regexpr("^thing1-",x))
#[1] "thing1-"
The - is a special character in certain situations though, namely for specifying ranges of values, such as the characters between a and z, when it appears inside [], e.g.:
regmatches(x,regexpr("[a-z]+",x))
#[1] "thing"
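When the text to find is purely literal, one way to sidestep the escaping question altogether is to skip regular expressions. The sketch below is an aside using base R functions rather than something suggested in the answers; note that with fixed = TRUE the "^" would be taken literally, so the anchor is left out. The names anywhere and at_start are only for illustration.

```r
x <- "thing1-thing2"

anywhere <- grepl("thing1-", x, fixed = TRUE)  # literal substring anywhere in x
at_start <- startsWith(x, "thing1-")           # literal prefix, like the "^" anchor

anywhere
# [1] TRUE
at_start
# [1] TRUE
```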
I have a vector of three character strings, and I'm trying to write a command that will find which members of the vector have a particular letter as the second character.
As an example, say I have this vector of 3-letter strings...
example = c("AWA","WOO","AZW","WWP")
I can use grepl and glob2rx to find strings with W as the first or last character.
> grepl(glob2rx("W*"),example)
[1] FALSE TRUE FALSE TRUE
> grepl(glob2rx("*W"),example)
[1] FALSE FALSE TRUE FALSE
However, I don't get the right result when I try using it with glob2rx("*W*")
> grepl(glob2rx("*W*"),example)
[1] TRUE TRUE TRUE TRUE
I am sure my understanding of regular expressions is lacking, however this seems like a pretty straightforward problem and I can't seem to find the solution. I'd really love some assistance!
For future reference, I'd also really like to know if I could extend this to the case where I have longer strings. Say I have strings that are 5 characters long, could I use grepl in such a way to return strings where W is the third character?
I would have thought that this was the regex way:
> grepl("^.W",example)
[1] TRUE FALSE FALSE TRUE
If you wanted a particular position that is prespecified then:
> grepl("^.{1}W",example)
[1] TRUE FALSE FALSE TRUE
This would allow programmatic calculation:
pos= 2
n=pos-1
grepl(paste0("^.{",n,"}W"),example)
[1] TRUE FALSE FALSE TRUE
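The programmatic version can be wrapped in a small helper. The sketch below uses a hypothetical function name, has_char_at, and assumes the character is a plain letter or digit with no special regex meaning. It also shows a glob form: in glob syntax "*" stands for any number of characters (including none), which is why "*W*" matched every string containing a W, whereas "?" stands for exactly one character.

```r
example <- c("AWA", "WOO", "AZW", "WWP")

# Hypothetical helper: TRUE where character `ch` sits at position `pos`
# (assumes `ch` has no special meaning in a regular expression)
has_char_at <- function(x, ch, pos) {
  grepl(paste0("^.{", pos - 1, "}", ch), x)
}

second <- has_char_at(example, "W", 2)
third  <- has_char_at(example, "W", 3)
second
# [1]  TRUE FALSE FALSE  TRUE
third
# [1] FALSE FALSE  TRUE FALSE

# Glob equivalent of "W as the second character"
via_glob <- grepl(glob2rx("?W*"), example)
via_glob
# [1]  TRUE FALSE FALSE  TRUE
```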
If you have 3-character strings and need to check the second character, you could just test the appropriate substring instead of using regular expressions:
example = c("AWA","WOO","AZW","WWP")
substr(example, 2, 2) == "W"
# [1] TRUE FALSE FALSE TRUE
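The same idea carries over to the longer strings mentioned in the question. A minimal sketch, assuming a made-up vector of 5-character strings (the vector five is not from the question):

```r
# Hypothetical 5-character strings
five <- c("ABWDE", "WABCD", "ABCDW", "AWWWA")

# Is the third character a "W"?
by_substr <- substr(five, 3, 3) == "W"
by_regex  <- grepl("^.{2}W", five)

by_substr
# [1]  TRUE FALSE FALSE  TRUE
identical(by_substr, by_regex)
# [1] TRUE
```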