Can't figure out the regex pattern for matching aa but not aaa.
x <- c("ab", "abc", "abcc", "abccc", "abcccc", "abccccc")
grep(pattern="c{2,3}", x, value=TRUE, perl=TRUE)
## [1] "abcc" "abccc" "abcccc" "abccccc"
grep(pattern="^((?!c{4,}).)*$", x, value=TRUE, perl=TRUE)
## [1] "ab" "abc" "abcc" "abccc"
But what's the pattern to yield:
grep(pattern=..., x, value=TRUE, perl=TRUE)
## [1] "abcc" "abccc"
This should work for your test cases:
^[^c]+c{2,3}$
You need to ensure that the 2 or 3 cs are neither preceded nor followed by a c:
grep(pattern="(^|[^c])c{2,3}([^c]|$)", x, value=TRUE)
Details:
(^|[^c]) - an alternation group matching start of string (^ anchor) or any char other than c (with a negated character class (bracket expression) [^c])
c{2,3} - 2 or 3 cs
([^c]|$) - an alternation group matching end of string ($ anchor) or any char other than c
Alternatively, use a PCRE regex with lookarounds if your c is actually a placeholder for a multicharacter substring:
grep(pattern="(?<!c)c{2,3}(?!c)", x, value=TRUE, perl=TRUE)
See the R demo
The (?<!c) negative lookbehind will fail the match if there is a c right before 2 or 3 cs and (?!c) negative lookahead fails the match if there is a c right after the 2 or 3 cs.
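As a quick check, the sketch below runs both patterns on the question's vector and then tries the lookaround form on a multicharacter unit. The "ab" unit and the vector y are made up for illustration and do not come from the original question.

```r
x <- c("ab", "abc", "abcc", "abccc", "abcccc", "abccccc")

# TRE (default engine): a non-c char or an anchor is required on each side
res_tre <- grep("(^|[^c])c{2,3}([^c]|$)", x, value = TRUE)

# PCRE: zero-width lookarounds, so nothing besides the cs is consumed
res_pcre <- grep("(?<!c)c{2,3}(?!c)", x, value = TRUE, perl = TRUE)

# Hypothetical multicharacter unit "ab": 2 or 3 repeats that are not part of a longer run
y <- c("ab", "abab", "ababab", "abababab")
res_multi <- grep("(?<!ab)(?:ab){2,3}(?!ab)", y, value = TRUE, perl = TRUE)

print(res_tre)
print(res_pcre)
print(res_multi)
```

The lookaround version is probably the easier one to adapt, since only the unit inside the lookarounds and the quantified group has to change.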
Related
I'm relatively new to regex, so bear with me if the question is trivial. I'd like to place a comma between every letter of a string using regex, e.g.:
x <- "ABCD"
I want to get
"A,B,C,D"
It would be nice if I could do that using gsub, sub or related on a vector of strings of arbitrary number of characters.
I tried
> sub("(\\w)", "\\1,", x)
[1] "A,BCD"
> gsub("(\\w)", "\\1,", x)
[1] "A,B,C,D,"
> gsub("(\\w)(\\w{1})$", "\\1,\\2", x)
[1] "ABC,D"
Try:
x <- 'ABCD'
gsub('\\B', ',', x, perl = TRUE)
Prints:
[1] "A,B,C,D"
I might have misread the question; the OP wants to add commas between letters only. Therefore try:
gsub('(\\p{L})(?=\\p{L})', '\\1,', x, perl = TRUE)
(\p{L}) - Match any kind of letter from any language in a 1st group;
(?=\p{L}) - A positive lookahead that requires another letter immediately to the right, without consuming it.
We can use the backreference to this capture group in the replacement.
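To see what "between letters only" means in practice, here is a small sketch on mixed input. The test vector is an assumption made for illustration; only "ABCD" comes from the question.

```r
# Only "ABCD" is from the question; the other two strings are illustrative
x <- c("ABCD", "AB CD", "A1B")

# A comma is added after a letter only when the next char is also a letter
res_letters <- gsub("(\\p{L})(?=\\p{L})", "\\1,", x, perl = TRUE)
print(res_letters)
```

Spaces and digits break the letter-to-letter adjacency, so no comma is inserted next to them.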
You can use
> gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
[1] "A,B,C,D"
The (.)(?=.) regex matches any char, capturing it into Group 1 (with (.)), that must be followed by any single char ((?=.) is a positive lookahead that requires a char immediately to the right of the current location).
Variations of the solution:
> gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
## Or with stringr:
## stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
[1] "A,B,C,D"
Here, (?!$) fails the match if the end of string is immediately to the right of the current location.
See the R demo online:
x <- "ABCD"
gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
# => [1] "A,B,C,D"
A non-regex answer:
paste(strsplit(x, "")[[1]], collapse = ",")
#[1] "A,B,C,D"
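The question also asks for a vector of strings, while the [[1]] above handles only the first element. One possible way to extend the same idea, sketched here with a made-up input vector, is to collapse each split result separately:

```r
# Illustrative input; the original question only shows "ABCD"
x <- c("ABCD", "xy", "Z")

# strsplit returns one character vector per input string
chars <- strsplit(x, "")

# Collapse each vector of single chars back into one comma-separated string
res <- vapply(chars, paste, character(1), collapse = ",")
print(res)
```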
Another option is to use a positive lookbehind and a positive lookahead to assert that there is both a preceding and a following character:
library(stringr)
str_replace_all(x, "(?<=.)(?=.)", ",")
[1] "A,B,C,D"
Let's say I have a string in R:
str <- "abc abc cde cde"
and I use regmatches and gregexpr to find how many "b"s there are in my string
regmatches(str, gregexpr("b",str))
but I want an output of everything that contains the letter b.
So an output like: "abc", "abc".
Thank you!
tmp <- "abc abc cde cde"
Split the string up into separate elements, grep for "b", return elements:
grep("b", unlist(strsplit(tmp, split = " ")), value = TRUE)
Look for non-space before and after, something like:
regmatches(str, gregexpr("\\S*b\\S*", str))
# [[1]]
# [1] "abc" "abc"
Special regex characters are documented in ?regex. For this case, \\s matches any space-like character, and \\S is its negation, i.e. any non-space-like character. You could be more specific, such as \\w ('word' character, same as [[:alnum:]_]). The * means zero or more, and + means one or more (forcing at least one character).
I suppose you mean you want to find words that contain b. One regex that does this is
\w*b\w*
\w* matches 0 or more word characters, which is a-z, A-Z, 0-9, and the underscore character.
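In an R string literal the backslashes have to be doubled. A minimal sketch of how this pattern might be used with regmatches and gregexpr follows; the second input string, with punctuation, is made up to show where \w and \S would differ.

```r
str <- "abc abc cde cde"
# Every run of word chars that contains at least one b
res_words <- regmatches(str, gregexpr("\\w*b\\w*", str))[[1]]
print(res_words)

# Illustrative input: \w stops at punctuation, so "," and "!" are not part of the matches
str2 <- "abc, bcd! cde"
res_punct <- regmatches(str2, gregexpr("\\w*b\\w*", str2))[[1]]
print(res_punct)
```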
Here is a base R option using strsplit and grepl:
str <- "abc abc cde cde"
words <- strsplit(str, "\\s+")[[1]]
idx <- grepl("b", words)  # grepl is vectorized, so no sapply is needed
matches <- words[idx]
matches
[1] "abc" "abc"
Alternation with quantifiers in the gregexpr and str_extract_all functions
require(stringr)
gregexpr(pattern = "(h|a)*", "xxhx")
[[1]]
[1] 1 2 3 4
attr(,"match.length")
[1] 0 0 1 0
attr(,"useBytes")
[1] TRUE
str_extract_all(pattern = "(h|a)*", "xxhx")
[[1]]
[1] "" "" "h" "" ""
Why does gregexpr report 3 empty matches while str_extract_all reports 4?
This is the difference between how TRE (gregexpr) and ICU (str_extract_all) regex engines deal with empty (also called "zero length") regex matches. TRE regex advances the regex index after a zero length match, while ICU allows testing the same position twice.
It becomes obvious what positions are tried by both engines if you use replacing functions:
> gsub("(h|a)*", "-\\1", "xxhx")
[1] "-x-x-hx-"
> str_replace_all("xxhx", "(h|a)*", "-\\1")
[1] "-x-x-h-x-"
The TRE engine matched h and moved the index past the following x, while the ICU engine matched h, stopped right after it, and then matched the empty location before x.
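If the empty matches are not actually wanted, one way to sidestep the engine difference is to require at least one character with + instead of *. The sketch below assumes that this is acceptable for the use case and uses only base R, so it compares TRE and PCRE rather than ICU.

```r
txt <- "xxhx"

# With +, a match must contain at least one h or a, so no zero-length matches occur
m_tre  <- gregexpr("(h|a)+", txt)               # default TRE engine
m_pcre <- gregexpr("(h|a)+", txt, perl = TRUE)  # PCRE engine

res_tre  <- regmatches(txt, m_tre)[[1]]
res_pcre <- regmatches(txt, m_pcre)[[1]]
print(res_tre)
print(res_pcre)
```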
s <- c('abc_1_efg', 'efg_2', 'hi2jk_lmn', 'opq')
How can I use a regex to get the numbers that are beside at least one underscore ("_")? In effect, I would like to get output like this:
> output # The result
[1] 1 2
> output_l # Alternatively
[1] TRUE TRUE FALSE FALSE
We can use regex lookarounds
grep("(?<=_)\\d+", s, perl = TRUE)
grepl("(?<=_)\\d+", s, perl = TRUE)
#[1] TRUE TRUE FALSE FALSE
If you need to get just indices, use grep with a simple TRE regex (no lookarounds are necessary):
> grep("_\\d+", s)
[1] 1 2
To get the numbers themselves, use a PCRE regex with a positive lookbehind with regmatches / gregexpr:
> unlist(regmatches(s, gregexpr("(?<=_)[0-9]+", s, perl=TRUE)))
[1] "1" "2"
Details:
(?<=_) - a positive lookbehind that requires _ to appear immediately to the left of the current position
[0-9]+ - 1+ digits
EDIT: If the digits to the left of _ should also be considered, use 1) "(^|_)\\d|\\d(_|$)" with grep solution and 2) "(?<![^_])\\d+|\\d+(?![^_])" with the number extraction solution.
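Putting the pieces together, both requested outputs can be produced from the same input. This is a sketch of one possible arrangement; the as.integer conversion is an addition, assumed to be useful if numeric values are needed.

```r
s <- c("abc_1_efg", "efg_2", "hi2jk_lmn", "opq")

# Logical vector: does the element contain digits right after an underscore?
has_num <- grepl("_\\d+", s)

# The digits themselves; the lookbehind keeps "_" out of the match
nums <- as.integer(unlist(regmatches(s, gregexpr("(?<=_)[0-9]+", s, perl = TRUE))))

print(has_num)
print(nums)
```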
Use this regex:
[_]([0-9]){1}
Selecting Group 1 gives you the digit. If you want to capture more than one digit, use
[_]([0-9]+)
It will not match the last two strings.
You can use this tool : https://regex101.com/
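In R, "selecting Group 1" could be done with regexec and regmatches, which return the full match followed by each capture group. The sketch below is one possible approach and returns NA where nothing matches; the helper variable names are arbitrary.

```r
s <- c("abc_1_efg", "efg_2", "hi2jk_lmn", "opq")

# regexec finds the first match per string and records the capture groups
m <- regmatches(s, regexec("_([0-9]+)", s))

# Element 1 is the whole match, element 2 is Group 1; no match gives character(0)
group1 <- vapply(m, function(v) if (length(v) >= 2) v[2] else NA_character_,
                 character(1))
print(group1)
```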
with stringr:
s <- c('abc_1_efg', 'efg_2', 'hi2jk_lmn', 'opq', 'a_1_b')
library(stringr)
which(!is.na(str_match(s, '_\\d|\\d_')))
# [1] 1 2 5
I would like to extract cat and dog in any order
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
What I have now extracts cat and dog, but also the text in-between
stringr::str_extract(string1, "cat.*dog|dog.*cat")
I would like the output to be
cat dog
and
dog cat
for string1 and string2, respectively
You may use sub with the following PCRE regex:
.*(?|(dog).*(cat)|(cat).*(dog)).*
See the regex demo.
Details
.* - any 0+ chars other than line break chars (to match all chars add (?s) at the pattern start)
(?|(dog).*(cat)|(cat).*(dog)) - a branch reset group (?|...|...) matching either of the two alternatives:
(dog).*(cat) - Group 1 capturing dog, then any 0+ chars as many as possible, and Group 2 capturing cat
| - or
(cat).*(dog) - Group 1 capturing cat, then any 0+ chars as many as possible, and Group 2 capturing dog (in a branch reset group, group IDs reset to the value before the group + 1)
.* - any 0+ chars other than line break chars
The \1 \2 replacement pattern inserts Group 1 and Group 2 values into the resulting string (so that the result is just dog or cat, a space, and a cat or dog).
See an R demo online, too:
x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")
sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", x, perl=TRUE)
## => [1] "cat dog" "dog cat"
To return NA in case of no match, use a regex to either match the specific pattern, or the whole string, and use it with gsubfn to apply custom replacement logic:
> library(gsubfn)
> gsubfn("^(?:.*((dog).*(giraffe)|(giraffe).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "NA" "NA"
> gsubfn("^(?:.*((dog).*(cat)|(cat).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "cat dog" "dog cat"
Here,
^ - start of the string anchor
(?:.*((dog).*(cat)|(cat).*(dog)).*|.*) - a non-capturing group that matches either of the two alternatives:
.*((dog).*(cat)|(cat).*(dog)).*:
.* - any 0+ chars as many as possible
((dog).*(cat)|(cat).*(dog)) - a capturing group matching either of the two alternatives:
(dog).*(cat) - dog (Group 2, assigned to the a variable), any 0+ chars as many as possible, and then cat (Group 3, assigned to the b variable)
|
(cat).*(dog) - cat (Group 4, assigned to the y variable), any 0+ chars as many as possible, and then dog (Group 5, assigned to the z variable)
.* - any 0+ chars as many as possible
| - or
.* - any 0+ chars
$ - end of the string anchor.
The x in the anonymous function represents the Group 1 value, which is only "technical" here: we check with nchar whether the Group 1 match is non-empty. If it is, we apply the custom replacement logic; if Group 1 is empty, we replace with NA.
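If a real NA (rather than the string "NA") is wanted and adding a package is not desirable, a base R alternative might look like the sketch below. It reuses the branch reset pattern from above; the third input string is made up to exercise the no-match case.

```r
x <- c("aasdfadsf cat asdfadsf dog",
       "asfdadsfads dog asdfasdfadsf cat",
       "no pets here")  # illustrative string with no match

pat <- ".*(?|(dog).*(cat)|(cat).*(dog)).*"

# Test for a match first, then replace only where the pattern applies
hit <- grepl(pat, x, perl = TRUE)
out <- ifelse(hit, sub(pat, "\\1 \\2", x, perl = TRUE), NA_character_)
print(out)
```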
We can use str_extract_all from the stringr package with capture groups.
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
string3 <- "asfdadsfads asfdadsfadf"
library(stringr)
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)")
# [[1]]
# [1] "cat" "dog"
#
# [[2]]
# [1] "dog" "cat"
#
# [[3]]
# character(0)
We can also set simplify = TRUE. The output would be a matrix.
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)", simplify = TRUE)
# [,1] [,2]
# [1,] "cat" "dog"
# [2,] "dog" "cat"
# [3,] "" ""
Or,
> regmatches(string1,gregexpr("cat|dog",string1))
[[1]]
[1] "cat" "dog"
> regmatches(string2,gregexpr("cat|dog",string2))
[[1]]
[1] "dog" "cat"
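The same regmatches call also works on a whole vector at once. The sketch below collapses each element's matches into a single string and returns NA when there are none. Note the assumption that each string contains cat and dog at most once; repeated occurrences would all appear in the result.

```r
strings <- c("aasdfadsf cat asdfadsf dog",
             "asfdadsfads dog asdfasdfadsf cat",
             "asfdadsfads asfdadsfadf")

# One character vector of matches per input string, in order of appearance
m <- regmatches(strings, gregexpr("cat|dog", strings))

# Join the matches with a space; use NA where nothing matched
res <- vapply(m, function(v) if (length(v)) paste(v, collapse = " ") else NA_character_,
              character(1))
print(res)
```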