I want to combine the following commands using an AND operator:
grep("^ab", strings, value = TRUE)
grep("ab$", strings, value = TRUE)
Here is an example for the OR operator:
http://r.789695.n4.nabble.com/grep-for-multiple-pattern-td4685244.html#a4685247
Would you please advise?
The search for an AND operator in regex (whether in R or elsewhere) can be a long and sad search. The boolean AND means that both of two statements have to be true. How would you apply that to regex? Consider the regex pattern "ab", in grep("ab", strings). Even this simple pattern has several requirements, ALL of which have to be true. It has to have an "a", AND it has to have a "b", AND the "b" has to follow the "a" directly.
strings <- c("abraham, not ahab", "no it was ahab",
"abraham was the one they left on ceti alpha V",
"You're talking about Sherlock Holmes", "He tasks me", "ab")
grep("ab", strings, value = TRUE)
# [1] "abraham, not ahab"
# [2] "no it was ahab"
# [3] "abraham was the one they left on ceti alpha V"
# [4] "You're talking about Sherlock Holmes"
# [5] "ab"
If what you'd like is to match strings that BOTH start with "ab" AND end with "ab", then @r2evans's pattern will work for you: grep("^ab.*ab$", strings, value = TRUE) will show them to you. This means the string starts with "ab", has zero or more other characters, and then ends with "ab".
grep("^ab.*ab$", strings, value = TRUE)
# [1] "abraham, not ahab"
# NOTICE THAT THIS DOESN'T MATCH "ab", despite "ab" being at the beginning
# AND the end
If what you'd like is to match all the strings that start with an "a" immediately followed by a "b", AND ALSO all those that end with an "a" immediately followed by a "b", then you actually want grep("(^ab)|(ab$)", strings, value = TRUE)
grep("^ab|ab$", strings, value = TRUE)
# [1] "abraham, not ahab"
# [2] "no it was ahab"
# [3] "abraham was the one they left on ceti alpha V"
# [4] "ab"
So what about that solitary "ab" case? What regex pattern would match that and only that?
grep("^ab$", strings, value = TRUE)
# [1] "ab"
In this case, we wanted all of the matches to BOTH start AND end with "ab", but it had to be the same "ab". Of course, we could combine this with the other "AND" version, and get all of the matches where ab was at the start and ab was at the end:
grep("^ab$|^ab.*ab$", strings, value = TRUE)
# [1] "abraham, not ahab" "ab"
..and one more thing:
We can use @r2evans's comment to demonstrate a sort of De Morgan's law with regex. Notice that the pattern with the | metacharacter, "^ab$|^ab.*ab$", produces the same result as subsetting the strings object with the logical vector you get by combining both individual regex matches with a boolean AND:
strings[grepl("^ab", strings) & grepl("ab$", strings)]
# [1] "abraham, not ahab" "ab"
Here grepl returns a logical vector, and we use it twice. The first call is TRUE for every element of strings that matches "^ab", and the second is TRUE for every element that matches "ab$". Combining those logical vectors with the & operator produces the same result as the "^ab$|^ab.*ab$" pattern with its | metacharacter.
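The two-call approach extends to any number of conditions. Below is a minimal sketch, assuming a hypothetical helper named grepl_all (it is not part of base R) that ANDs one grepl result per pattern. It also shows a single-pattern alternative that relies on PCRE lookaheads (perl = TRUE), which is one common way to express AND inside one regex.

```r
strings <- c("abraham, not ahab", "no it was ahab",
             "abraham was the one they left on ceti alpha V",
             "You're talking about Sherlock Holmes", "He tasks me", "ab")

# grepl_all is a hypothetical helper, not a base R function:
# TRUE only where every pattern in `patterns` matches.
grepl_all <- function(patterns, x, ...) {
  Reduce(`&`, lapply(patterns, grepl, x = x, ...))
}

and_two_calls <- strings[grepl_all(c("^ab", "ab$"), strings)]

# Same condition in one pattern: each lookahead is tested from the
# start of the string and consumes nothing, so both must hold.
and_lookahead <- grep("^(?=ab)(?=.*ab$)", strings, value = TRUE, perl = TRUE)

and_two_calls
# [1] "abraham, not ahab" "ab"
identical(and_two_calls, and_lookahead)
# [1] TRUE
```

The helper keeps each condition readable on its own; the lookahead form is handy when an API accepts only a single pattern.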
You may use
grep("^ab(.*ab)?$", strings, value = TRUE)
The pattern matches a string that starts with ab, is optionally followed by a substring ending with ab, and then ends:
^ - start of string
ab - an ab substring
(.*ab)? - 1 or 0 repetitions (due to ? quantifier) of
.* - any 0+ chars, as many as possible
ab - an ab substring
$ - end of string.
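As a quick check of the breakdown above, the sketch below runs the optional-group pattern against the sample vector used earlier in this thread and compares it with the two-branch pattern "^ab$|^ab.*ab$". The variable names are illustrative only.

```r
strings <- c("abraham, not ahab", "no it was ahab",
             "abraham was the one they left on ceti alpha V",
             "You're talking about Sherlock Holmes", "He tasks me", "ab")

# Optional group: "ab", then optionally "anything ending in ab"
optional_group <- grep("^ab(.*ab)?$", strings, value = TRUE)

# Two explicit branches: exactly "ab", or "ab ... ab"
two_branches <- grep("^ab$|^ab.*ab$", strings, value = TRUE)

optional_group
# [1] "abraham, not ahab" "ab"
identical(optional_group, two_branches)
# [1] TRUE
```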
I want to use stri_replace_all_regex to replace strings as follows.
It's known that R defaults to greedy matching, so why does the matching appear lazy here?
library(stringi)
a <- c('abc2','xycd2','mnb345')
b <- c('ab','abc','xyc','mnb','mn')
stri_replace_all_regex(a, "\\b" %s+% b %s+% "\\S+", b, vectorize_all=FALSE)
The result is [1] "ab" "xyc" "mn", which is not what I want. I expected "abc" "xyc" "mnb".
You are calling stri_replace_all_regex with four arguments:
a is length 3. That's the str argument.
"\\b" %s+% b %s+% "\\S+" is length 5. (It would be a lot easier to read if you had used paste0("\\b", b, "\\S+"), but that's beside the point.) That's the pattern argument.
b is length 5. That's the replacement argument.
The last argument is vectorize_all=FALSE.
What it tries to do is documented as follows:
However, for stri_replace_all*, if vectorize_all is FALSE, then each
substring matching any of the supplied patterns is replaced by a
corresponding replacement string. In such a case, the vectorization is
over str, and - independently - over pattern and replacement. In other
words, this is equivalent to something like for (i in 1:npatterns) str <- stri_replace_all(str, pattern[i], replacement[i]). Note that you
must set length(pattern) >= length(replacement).
That's pretty sloppy documentation (I want to know what it does, not "something like" what it does!), but I think the process is as follows:
Your first pattern is "\\bab\\S+". That says "word boundary followed by ab followed by one or more non-whitespace chars". That matches all of a[1], so a[1] is replaced by b[1], which is "ab". It then tries the four other patterns, but none of them match, so you get "ab" as output.
The handling of a[3] is more complicated. The first match replaces it with "mnb", based on pattern[4]. Then a second replacement happens, because "mnb" matches pattern[5], and it gets changed again to "mn".
When you say R defaults to greedy matching, that's when doing a single regular expression match. You're doing five separate greedy matches, not one big greedy match.
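The sequential behaviour described above can be traced with a plain loop. This is only a base R stand-in for the loop quoted from the documentation: it uses gsub with perl = TRUE rather than stringi's ICU engine, so treat it as an approximation of the process, not as stringi's actual implementation. The names patterns and out are illustrative.

```r
a <- c("abc2", "xycd2", "mnb345")
b <- c("ab", "abc", "xyc", "mnb", "mn")
patterns <- paste0("\\b", b, "\\S+")

# Apply each pattern/replacement pair in turn to the running result,
# printing the intermediate state after every step.
out <- a
for (i in seq_along(patterns)) {
  out <- gsub(patterns[i], b[i], out, perl = TRUE)
  cat(sprintf("after pattern %d (%s): %s\n",
              i, patterns[i], paste(out, collapse = " ")))
}

out
# [1] "ab"  "xyc" "mn"
```

The trace shows "mnb345" becoming "mnb" at step 4 and then "mn" at step 5, which is the double replacement described above.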
EDITED to add:
I don't know the stringi functions well, but in the base regex functions you can do this with just one regex:
a <- c('abc2','xycd2','mnb345')
b <- c('ab','abc','xyc','mnb','mn')
# Build a big pattern:
# "|" means "or", "(" ... ")" captures the match
pattern <- paste0("\\b(", b, ")\\S+", collapse = "|")
pattern
#> [1] "\\b(ab)\\S+|\\b(abc)\\S+|\\b(xyc)\\S+|\\b(mnb)\\S+|\\b(mn)\\S+"
# \\1 etc contain whatever matched the parenthesized
# patterns. Only one will match, the rest will be empty
gsub(pattern, "\\1\\2\\3\\4\\5", a)
#> [1] "ab" "xyc" "mnb"
# I would have guessed the greedy rule would have found "abc"
# Try again:
pattern <- paste0("\\b(", b[c(2, 1, 3:5)], ")\\S+", collapse = "|")
pattern
#> [1] "\\b(abc)\\S+|\\b(ab)\\S+|\\b(xyc)\\S+|\\b(mnb)\\S+|\\b(mn)\\S+"
gsub(pattern, "\\1\\2\\3\\4\\5", a)
#> [1] "abc" "xyc" "mnb"
Created on 2023-02-13 with reprex v2.0.2
It appears the "|" takes the first match, not the greedy match. I don't think the R docs specify it one way or the other.
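Rather than reordering by hand, one option is to sort the keys so that longer ones come first. The sketch below assumes perl = TRUE, where PCRE tries alternatives left to right, and uses a single capture group so only "\\1" is needed. It also assumes the keys contain no regex metacharacters; keys with special characters would need escaping first. The name b_sorted is illustrative.

```r
a <- c("abc2", "xycd2", "mnb345")
b <- c("ab", "abc", "xyc", "mnb", "mn")

# Longer alternatives first, so "abc" is tried before "ab"
# and "mnb" before "mn".
b_sorted <- b[order(nchar(b), decreasing = TRUE)]

# One group holding all alternatives, followed by the tail to drop.
pattern <- paste0("\\b(", paste(b_sorted, collapse = "|"), ")\\S+")

result <- gsub(pattern, "\\1", a, perl = TRUE)
result
# [1] "abc" "xyc" "mnb"
```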
I'm looking to wrap parts of a string in R, following certain rules, in a vectorised way.
Put simply, if I had a vector:
c("x^2", "x^2:z", "z", "x:z", "z:x:b", "z:x^2:b")
the function would sweep through each element and wrap I() around those parts where there is an exponent, resulting in the following output:
c("I(x^2)", "I(x^2):z", "z", "x:z", "z:x:b", "z:I(x^2):b")
I've tried various approaches where I first split by : and then gsub, but this isn't particularly scalable.
Something like below?
> gsub("(x(\\^\\d+))", "I(\\1)", c("x^2", "x^2:z", "z", "x:z", "z:x:b", "z:x^2:b"))
[1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
[6] "z:I(x^2):b"
These seem reasonably general. They don't assume that the variable in the complex fields must be named x but handle any names made of word characters, and they don't assume that the arithmetic expression must be an exponential but handle any arithmetic expression that includes non-word characters. For example, they would surround y+pi with I(...).
1) This one-liner captures each field and processes it using the indicated function, expressed in formula notation. It surrounds each field that contains a non-word character with I(...). It works with any variables whose names are made from word characters.
library(gsubfn)
gsubfn("[^:]+", ~ if (grepl("\\W", x)) sprintf("I(%s)", x) else x, s)
## [1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
## [6] "z:I(x^2):b"
2) This surrounds any field containing a character that is not :, letter or number with I(...)
gsub("([^:]*[^:[:alnum:]][^:]*)", "I(\\1)", s)
## [1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
## [6] "z:I(x^2):b"
3) In this alternative we split the strings at colon, then surround fields containing a non-word character with I(...) and paste them back together.
surround <- function(x) ifelse(grepl("\\W", x), sprintf("I(%s)", x), x)
s |>
strsplit(":") |>
sapply(function(x) paste(surround(x), collapse = ":"))
## [1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
## [6] "z:I(x^2):b"
Note
The input used is the following:
s <- c("x^2", "x^2:z", "z", "x:z", "z:x:b", "z:x^2:b")
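The three approaches agree on the input above, but approaches 2 and 3 use slightly different tests, which shows up with underscores. The sketch below uses base R only; the wrapper names wrap_gsub and wrap_split and the input s2 are made up for illustration.

```r
s2 <- c("y+pi:z", "x_1:z")

# Approach 2: a field is wrapped if it has any character that is
# not a colon and not alphanumeric (so "_" counts as special).
wrap_gsub <- function(s) gsub("([^:]*[^:[:alnum:]][^:]*)", "I(\\1)", s)

# Approach 3: a field is wrapped if it has a non-word character
# ("_" is a word character, so it does not trigger wrapping).
wrap_split <- function(s) {
  surround <- function(x) ifelse(grepl("\\W", x), sprintf("I(%s)", x), x)
  sapply(strsplit(s, ":", fixed = TRUE),
         function(x) paste(surround(x), collapse = ":"))
}

wrap_gsub(s2)
# [1] "I(y+pi):z" "I(x_1):z"
wrap_split(s2)
# [1] "I(y+pi):z" "x_1:z"
```

Which behaviour is preferable depends on whether variable names such as x_1 should be left alone; for model formulas the \W-based test is probably the safer choice.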
If I have a sentence separated by spaces
s <- c("C java", "C++ java")
grep("C",s)
gives output as
[1] [2]
while I only require
[1]
How can I do that? (I have used c\++ to identify C++ separately, but matching with C gives both [1] and [2] as the output.)
If we want to match element 1 only, then we can use the start (^) and end ($) of the string to denote that there are no characters before or after 'C'
grep("^C$",s)
#[1] 1
data
s<- c("C","C++","java")
s<-c("C","C++","java")
which(s %in% "C")
grep() gives a positive result for any match within a string
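For the space-separated sentences in the question, anchoring the whole string is too strict, and a word boundary does not help because "+" is not a word character. A minimal sketch of two alternatives follows, assuming tokens are separated by single spaces; the name has_C is illustrative.

```r
s <- c("C java", "C++ java")

grep("C", s)
# [1] 1 2

# \\b sees a boundary between "C" and "+", so "C++" still matches
grep("\\bC\\b", s)
# [1] 1 2

# Require a space or a string edge on both sides of "C"
grep("(^| )C( |$)", s)
# [1] 1

# Or split into tokens and test for an exact token
has_C <- vapply(strsplit(s, " ", fixed = TRUE),
                function(words) "C" %in% words, logical(1))
which(has_C)
# [1] 1
```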
I want to split the string after each "a", provided that the "a" is not followed by "b".
string <- "abcgualoo87ahhabta"
I should get output as
string <- [1]abcgua
[2]loo87a
[3]hhabta
You can split your string with the pattern "a not followed by b" with the regex a(?=[^b]) in strsplit:
split_str <- strsplit("abcgualoo87ahhabta", "a(?=[^b])", perl=TRUE)[[1]]
split_str
#[1] "abcgu" "loo87" "hhabta"
Explanation of the split pattern: a lookahead ((?=...)) is used whose look-ahead pattern is any character except b ([^b]; inside the brackets, the ^ sign indicates negation). For the lookahead to be interpreted, we need to set the perl parameter to TRUE.
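The positive lookahead (?=[^b]) and the negative lookahead (?!b) differ at the end of the string: the first needs some character to follow the "a", the second does not. A small sketch of the difference, with illustrative variable names:

```r
x <- "abcgualoo87ahhabta"

# "a" followed by some character other than "b":
# the final "a" has nothing after it, so it is kept.
pos_look <- strsplit(x, "a(?=[^b])", perl = TRUE)[[1]]
pos_look
# [1] "abcgu"  "loo87"  "hhabta"

# "a" not followed by "b": the final "a" also qualifies,
# so it is consumed as a separator.
neg_look <- strsplit(x, "a(?!b)", perl = TRUE)[[1]]
neg_look
# [1] "abcgu" "loo87" "hhabt"
```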
Then you can add the removed "a" back to the end of each split part, except the last:
split_str <- paste0(split_str, c(rep("a", length(split_str)-1), ""))
split_str
#[1] "abcgua" "loo87a" "hhabta"
A nice one-step alternative provided by @nicola in the comments:
split_str <- strsplit("abcgualoo87ahhabta","(?<=a)(?!b)", perl=TRUE)[[1]]
#[1] "abcgua" "loo87a" "hhabta"
string <- "abcgualoo87ahhabta"
unlist(strsplit(gsub("a([^b])", "a \\1", string), split=" "))
# [1] "abcgua" "loo87a" "hhabta"
Suppose there is a vector of sequences of the form "foo" or "foo|baz|bar" (one single word or multiple words separated by special character like "|"), and we are also given a word and we want to find to which items of the vector it has a whole word match.
For example the word "foo" has a whole match in "foo|baz|bar", but not a whole match in either "foobaz|bar" or "bazfoo".
First I tried to use "\\b" that indicates either the start or the end edges of a whole word and it works successfully:
grep("\\bfoo\\b", "foo") # match
grep("\\bfoo\\b", "foobaz|bar") # mismatch
grep("\\bfoo\\b", "bazfoo") # mismatch
Then I tried to add "|" as the other possible separator of both ends, and group it with "\\b" using [ and ]:
grep("[|\\b]foo[|\\b]", "foo|baz|bar") # mismatch!
grep("[|\\b]foo[|\\b]", "foo") # mismatch!
Later I found that \\b does not indicate the start or end of the character string, but the start or end of a whole word (so many characters, like space and ,|-^. but not digits or the underscore _, separate whole words). So "\\bfoo\\b" matches all of these strings: "foo", "foo|bar|baz", "foo-bar", "baz foo|bar", but not "foo_bar" or "foo2".
But my question still remains: Why "[|\\b]foo[|\\b]" pattern fails to match with "foo"?
You could use strsplit:
> "foo" %in% unlist(strsplit("foo|baz|bar", split = "|", fixed = TRUE))
[1] TRUE
Which you can vectorize:
> z <- c("foo|baz|bar", "foobaz|bar", "bazfoo")
> x <- c("foo", "foot")
> sapply(strsplit(z, split = "|", fixed = TRUE), function(x,y)y %in% x, x)
      [,1]  [,2]  [,3]
[1,]  TRUE FALSE FALSE
[2,] FALSE FALSE FALSE
\b matches at the following positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character. (Word characters are a-zA-Z0-9_)
Since | stands for alternation operator in regex, you will have to escape it.
So the regex \bfoo\b would match foo in foo|bar because | is a non word character. There is no need to use the character set [\b\|]
Edit: As flodel pointed out below, \b inside the character set represents the backspace character. So [\b\|] would match the | but not a word boundary.
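Since a word boundary cannot go inside a character set, one way to say "a separator or a string edge" is to use alternation in a group instead. A minimal sketch, assuming "|" is the only separator; the test vector z is made up for illustration:

```r
z <- c("foo", "foo|baz|bar", "baz|foo", "foobaz|bar", "bazfoo", "foo-bar")

# Word boundaries: also accept "-" (or any non-word character)
# as a separator.
by_boundary <- grepl("\\bfoo\\b", z)
by_boundary
# [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE

# Explicit edges: start of string or a literal "|" before "foo",
# a literal "|" or end of string after it.
by_separator <- grepl("(^|\\|)foo(\\||$)", z)
by_separator
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```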
Since | has special meaning in a regular expression, you need to escape it, i.e. use \\|:
ptn <- "\\bfoo[\\|\\b]"
grep(ptn, "foo|baz|bar")
[1] 1
grep(ptn, "foo")
integer(0)
This would also work:
gregexpr("foo|", "foo|baz|bar", fixed = TRUE)[[c(1, 1)]] > 0
gregexpr("foo|", "foobaz|bar", fixed = TRUE)[[c(1, 1)]] > 0
gregexpr("foo|", "bazfoo", fixed = TRUE)[[c(1, 1)]] > 0
This approach is different in that you can include spaces in the string you supply to gregexpr, which lets you find terms consisting of two words:
gregexpr("foo|", "baz foo|", fixed = TRUE)[[c(1, 1)]] > 0
gregexpr(" foo|", "baz foo|", fixed = TRUE)[[c(1, 1)]] > 0
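A fixed search for "foo|" only checks the trailing separator, so it can miss a last field and accept a longer field that merely ends in "foo". One possible refinement, sketched below with grepl and fixed = TRUE, is to pad both the word and the strings with the separator. The test vector z is made up for illustration.

```r
z <- c("foo|baz|bar", "foobaz|bar", "bazfoo", "baz|foo", "bazfoo|bar")

# Trailing separator only: misses "baz|foo", accepts "bazfoo|bar"
trailing_only <- grepl("foo|", z, fixed = TRUE)
trailing_only
# [1]  TRUE FALSE FALSE FALSE  TRUE

# Pad both sides so every field is delimited by "|" on both ends
padded <- grepl("|foo|", paste0("|", z, "|"), fixed = TRUE)
padded
# [1]  TRUE FALSE FALSE  TRUE FALSE
```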