Extract 2 words in any order - r

I would like to extract cat and dog in any order
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
What I have now extracts cat and dog, but also the text in-between
stringr::str_extract(string1, "cat.*dog|dog.*cat")
I would like the output to be
cat dog
and
dog cat
for string1 and string2, respectively

You may use sub with the following PCRE regex:
.*(?|(dog).*(cat)|(cat).*(dog)).*
Details
.* - any 0+ chars other than line break chars (to match all chars add (?s) at the pattern start)
(?|(dog).*(cat)|(cat).*(dog)) - a branch reset group (?|...|...) matching either of the two alternatives:
(dog).*(cat) - Group 1 capturing dog, then any 0+ chars as many as possible, and Group 2 capturing cat
| - or
(cat).*(dog) - Group 1 capturing cat, then any 0+ chars as many as possible, and Group 2 capturing dog (in a branch reset group, group IDs reset to the value before the group + 1)
.* - any 0+ chars other than line break chars
The \1 \2 replacement pattern inserts Group 1 and Group 2 values into the resulting string (so that the result is just dog or cat, a space, and a cat or dog).
In R:
x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")
sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", x, perl=TRUE)
## => [1] "cat dog" "dog cat"
To return NA in case of no match, use a regex to either match the specific pattern, or the whole string, and use it with gsubfn to apply custom replacement logic:
> gsubfn("^(?:.*((dog).*(giraffe)|(giraffe).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "NA" "NA"
> gsubfn("^(?:.*((dog).*(cat)|(cat).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "cat dog" "dog cat"
Here,
^ - start of the string anchor
(?:.*((dog).*(cat)|(cat).*(dog)).*|.*) - a non-capturing group that matches either of the two alternatives:
.*((dog).*(cat)|(cat).*(dog)).*:
.* - any 0+ chars as many as possible
((dog).*(cat)|(cat).*(dog)) - a capturing group matching either of the two alternatives:
(dog).*(cat) - dog (Group 2, assigned to the a variable), any 0+ chars as many as possible, and then cat (Group 3, assigned to the b variable)
| - or
(cat).*(dog) - cat (Group 4, assigned to the y variable), any 0+ chars as many as possible, and then dog (Group 5, assigned to the z variable)
.* - any 0+ chars as many as possible
| - or
.* - any 0+ chars
$ - end of the string anchor.
The x in the anonymous function holds the Group 1 value, which is only "technical" here: we check with nchar whether the Group 1 match is non-empty. If it is, we build the replacement with the custom logic; if Group 1 is empty, we replace with NA.
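If gsubfn is not available, the same NA-on-no-match behaviour can be sketched in base R by testing the pattern with grepl before substituting. This is only a minimal sketch: the third input string and the names pat and res are made up for illustration.

```r
# Third element is a hypothetical input that contains neither word pair
x <- c("aasdfadsf cat asdfadsf dog",
       "asfdadsfads dog asdfasdfadsf cat",
       "no pets here")

# Branch reset pattern from the answer above (needs perl = TRUE)
pat <- ".*(?|(dog).*(cat)|(cat).*(dog)).*"

# Substitute only where the pattern matches; otherwise return NA
res <- ifelse(grepl(pat, x, perl = TRUE),
              sub(pat, "\\1 \\2", x, perl = TRUE),
              NA_character_)
res
# [1] "cat dog" "dog cat" NA
```

Unlike the gsubfn output above, the missing value here is a real NA rather than the string "NA".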

We can use str_extract_all from the stringr package with capture groups.
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
string3 <- "asfdadsfads asfdadsfadf"
library(stringr)
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)")
# [[1]]
# [1] "cat" "dog"
#
# [[2]]
# [1] "dog" "cat"
#
# [[3]]
# character(0)
We can also set simplify = TRUE. The output would be a matrix.
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)", simplify = TRUE)
# [,1] [,2]
# [1,] "cat" "dog"
# [2,] "dog" "cat"
# [3,] "" ""

Or,
> regmatches(string1,gregexpr("cat|dog",string1))
[[1]]
[1] "cat" "dog"
> regmatches(string2,gregexpr("cat|dog",string2))
[[1]]
[1] "dog" "cat"
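To get one string per input, as asked in the question, the list returned by regmatches can be collapsed. The following is a rough sketch in base R; the names strings, hits and out are illustrative, and returning NA for inputs without a match is an assumption about what is wanted.

```r
strings <- c("aasdfadsf cat asdfadsf dog",
             "asfdadsfads dog asdfasdfadsf cat",
             "asfdadsfads asfdadsfadf")

# One character vector of matches per input string
hits <- regmatches(strings, gregexpr("cat|dog", strings))

# Join the matches with a space; use NA when nothing was found
out <- vapply(hits,
              function(m) if (length(m)) paste(m, collapse = " ") else NA_character_,
              character(1))
out
# [1] "cat dog" "dog cat" NA
```

Note that "cat|dog" also matches inside longer words such as "scattered"; wrapping the alternation in word boundaries, "\\b(cat|dog)\\b", would restrict it to whole words.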

Related

R: using \\b and \\B in regex

I read about regex and came across word boundaries. I found a question that is about the difference between \b and \B. Using the code from this question does not give the expected output. Here:
grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.
grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.
I use the code as described in the question but with two backslashes as suggested here. Using one backslash does not work either. What am I doing wrong?
You can use regexpr and regmatches to get the match. grep only reports which elements contain a match (their indices, or the elements themselves with value = TRUE). You can also use sub.
x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"
x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
For more than 1 match use gregexpr:
x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
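The difference between reporting a hit and extracting it can be seen by running grep and grepl on a small vector. This is a minimal sketch; the second sentence in x is a made-up example.

```r
# Second element is a hypothetical sentence without the word "cat"
x <- c("The cat scattered his food all over the room.",
       "No felines here.")

idx  <- grep("\\bcat\\b", x)                # indices of matching elements
val  <- grep("\\bcat\\b", x, value = TRUE)  # the matching elements themselves
flag <- grepl("\\bcat\\b", x)               # one TRUE/FALSE per element

idx
# [1] 1
flag
# [1]  TRUE FALSE
```

None of these return the matched substring; that is what regmatches is for.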
grep returns the whole string because it only checks whether the match is present in the string. If you want to extract cat, you need to use other functions such as str_extract from the stringr package:
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
The difference between \\b and \\B is that \\b marks a word boundary whereas \\B is its negation. That is, \\bcat\\b matches only if cat is not adjacent to other word characters (for example, when it is surrounded by white space or punctuation), whereas \\Bcat\\B matches only if cat is inside a word. For example:
str_extract_all("The forgot his education and scattered his food all over the room.", "\\Bcat\\B")
[[1]]
[1] "cat" "cat"
These two matches are from education and scattered.
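The same distinction can be checked in base R by looking at where each pattern matches in the sentence from the question. This is a small sketch; perl = TRUE is used here as an assumption to pin down the regex engine, and the variable names are illustrative.

```r
x <- "The cat scattered his food all over the room."

# Start position of "cat" as a whole word
whole_word  <- gregexpr("\\bcat\\b", x, perl = TRUE)[[1]]
# Start position of "cat" with word characters on both sides
inside_word <- gregexpr("\\Bcat\\B", x, perl = TRUE)[[1]]

as.integer(whole_word)
# [1] 5
as.integer(inside_word)
# [1] 10
```

Position 5 is the standalone word, and position 10 is the cat inside scattered.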

Get characters after and before a pattern match in R

Let's say I have a string in R:
str <- "abc abc cde cde"
and I use regmatches and gregexpr to find how many "b"s there are in my string
regmatches(str, gregexpr("b",str))
but I want an output of everything that contains the letter b.
So an output like: "abc", "abc".
Thank you!
tmp <- "abc abc cde cde"
Split the string up into separate elements, grep for "b", return elements:
grep("b", unlist(strsplit(tmp, split = " ")), value = TRUE)
Look for non-space before and after, something like:
regmatches(str, gregexpr("\\S*b\\S*", str))
# [[1]]
# [1] "abc" "abc"
Special regex characters are documented in ?regex. For this case, \\s matches "any space-like character", and \\S is its negation, so any non-space-like character. You could be more specific, such as \\w ('word' character, same as [[:alnum:]_]). The * means zero-or-more, and + means one-or-more (requiring at least one character).
I suppose you mean you want to find words that contain b. One regex that does this is
\w*b\w*
\w* matches 0 or more word characters, which is a-z, A-Z, 0-9, and the underscore character.
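In R the backslashes have to be doubled inside the string literal. A minimal sketch of applying this pattern with regmatches and gregexpr follows; the name b_words is illustrative.

```r
str <- "abc abc cde cde"

# \\w*b\\w* : a run of word characters that contains at least one b
b_words <- regmatches(str, gregexpr("\\w*b\\w*", str))[[1]]
b_words
# [1] "abc" "abc"
```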
Here is a base R option using strsplit and grepl:
str <- "abc abc cde cde"
words <- strsplit(str, "\\s+")[[1]]
idx <- sapply(words, function(x) { grepl("b", x)})
matches <- words[idx]
matches
[1] "abc" "abc"

Capture entire substring using regex if there is a match on a number

I have been unable to find the answer to this specific question. I am using R to clean some survey data.
I have some messy survey data with question names as columns, that sometimes include a number and sometimes don't. When they include a number, it will often contain some subcharacters as well indicating the question. Example, I have this vector:
questions <- c(
"1 question 1 what do you think?",
"1.a. question 1a further details on what you think",
"Please explain",
"2 question 2 what is your motivation",
"2.a. further details",
"2.b. even further details",
"Please explain")
I want to extract the substrings that contain numbers, and return no results if there is no such match. Desired result (using R)
"1"
"1.a."
NA
"2"
"2.a."
"2.b."
NA
I know I can capture the first number, using
stri_extract_first_regex(questions, "[0-9]+")
But I am at a loss how to modify it to capture the whole string until the first whitespace if it finds a match using this pattern.
For your example data you might use:
[0-9]+(?:\.[a-z]\.)?
That will match:
[0-9]+ Match 1+ digits
(?: Non capturing group
\.[a-z]\. Match a dot, lowercase character and a dot
)? Close non capturing group and make it optional
For example:
questions <- c(
"1 question 1 what do you think?",
"1.a. question 1a further details on what you think",
"Please explain",
"2 question 2 what is your motivation",
"2.a. further details",
"2.b. even further details",
"Please explain")
library(stringi)  # provides stri_extract_first_regex
print(stri_extract_first_regex(questions, "[0-9]+(?:\\.[a-z]\\.)?"))
# [1] "1" "1.a." NA "2" "2.a." "2.b." NA
This might work:
hasnumber <- grepl("[0-9]+",questions)
firstspaces <- sapply(gregexpr(" ", questions), function(x) x[[1]])
res <- ifelse(hasnumber, substr(questions,1,firstspaces-1), NA)
> res
[1] "1" "1.a." NA "2" "2.a." "2.b." NA
The most difficult part, I guess, is finding the position of the first space in each question, which could be done with a loop or, as here, with sapply.
You may use
questions <- sub("^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.*", "\\1", questions)
questions[questions==""] <- NA
questions
# => [1] "1" "1.a." NA "2" "2.a." "2.b." NA
The ^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.* matches
^ - start of string
(\\d+(?:\\.[a-z0-9]+)*\\.?) - Capturing group 1:
\\d+ - 1+ digits
(?:\\.[a-z0-9]+)* - 0 or more repetitions of
\\. - a dot
[a-z0-9]+ - 1 or more lowercase ASCII letters or digits
\\.? - an optional dot
.* - any 0+ chars to the end of the string
| - or
.* - the whole string.
The match is replaced with the contents of Group 1. If the second alternative matches, the result is an empty string; questions[questions==""] <- NA then replaces these elements with NA.
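Another way to read the question ("capture everything up to the first whitespace if the string starts with a number") can be sketched with regexpr and regmatches in base R. This is only one possible interpretation; the pattern ^[0-9]\\S* and the names m, found and res are assumptions made for illustration.

```r
questions <- c(
  "1 question 1 what do you think?",
  "1.a. question 1a further details on what you think",
  "Please explain",
  "2 question 2 what is your motivation",
  "2.a. further details",
  "2.b. even further details",
  "Please explain")

# A leading digit followed by any run of non-whitespace characters
m <- regexpr("^[0-9]\\S*", questions)
found <- m != -1            # regexpr returns -1 where there is no match

# Start with NA everywhere, then fill in the matched prefixes
res <- rep(NA_character_, length(questions))
res[found] <- regmatches(questions, m)
res
# [1] "1"    "1.a." NA     "2"    "2.a." "2.b." NA
```

regmatches returns only the elements that matched, which is why the result is assigned back through the found index.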

Substitute word with same word without initial # in R

I am trying to do a dataframe string substitution in R. I need to find all the words preceded by '#' (without space, e.g. #word) and change the '#' for '!' (e.g. from #word to !word). At the same time, it leaves intact the other instances of '#' (e.g. # or ## or #[#]). For example, this is my original dataframe (to change: #def, #jkl, #stu):
> df = data.frame(number = 1:4, text = c('abc #def ghi', '#jkl # mno', '#[#] pqr #stu', 'vwx ### yz'))
> df
number text
1 1 abc #def ghi
2 2 #jkl # mno
3 3 #[#] pqr #stu
4 4 vwx ### yz
And this is what I need it to look like:
> df_result = data.frame(number = 1:4, text = c('abc !def ghi', '!jkl # mno', '#[#] pqr !stu', 'vwx ### yz'))
> df_result
number text
1 1 abc !def ghi
2 2 !jkl # mno
3 3 #[#] pqr !stu
4 4 vwx ### yz
I have tried with
> gsub('#.+[a-z] ', '!', df$text)
[1] "abc !ghi" "!# mno" "!#stu" "vwx ### yz"
But the result is not the desired one. Any help is much appreciated.
Thank you.
How about
gsub("(^| )#(\\w)", "\\1!\\2", df$text)
# [1] "abc !def ghi" "!jkl # mno" "#[#] pqr !stu" "vwx ### yz"
This matches a # symbol at the beginning of the string or after a space. We then capture the word character after the # symbol and replace # with !.
Explanation courtesy of regex101.com:
(^| ) is the 1st Capturing Group; ^ asserts position at start of the string; | denotes "or"; blank space matches the space character literally
# matches the character # literally (case sensitive)
(\\w) is the 2nd Capturing Group, it denotes a word character
The replacement string \\1!\\2 replaces the regular expression match with the first capturing group (\\1), followed by !, followed by the second capturing group (\\2).
You can use a positive lookahead (?=...)
gsub("#(?=[A-Za-z])", "!", df$text, perl = TRUE)
[1] "abc !def ghi" "!jkl # mno" "#[#] pqr !stu" "vwx ### yz"
From the "Regular Expressions as used in R" documentation page:
Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed.
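The two answers agree on the data from the question but can differ on other input. The sketch below runs both on the question's strings plus one made-up string, "x#y", where the # sits in the middle of a word; the variable names are illustrative.

```r
# First four strings are from the question; "x#y" is a hypothetical extra case
text <- c("abc #def ghi", "#jkl # mno", "#[#] pqr #stu", "vwx ### yz", "x#y")

# Capture-group version: # must be at the start or follow a space
by_group <- gsub("(^| )#(\\w)", "\\1!\\2", text)

# Lookahead version: any # directly followed by a letter
by_lookahead <- gsub("#(?=[A-Za-z])", "!", text, perl = TRUE)

by_group
# [1] "abc !def ghi"  "!jkl # mno"    "#[#] pqr !stu" "vwx ### yz"    "x#y"
by_lookahead
# [1] "abc !def ghi"  "!jkl # mno"    "#[#] pqr !stu" "vwx ### yz"    "x!y"
```

Which behaviour is right depends on whether a # inside a word should count as "preceding a word".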

Regex finding aa but not aaa with grep

Can't figure out the regex pattern for matching aa but not aaa.
x <- c("ab", "abc", "abcc", "abccc", "abcccc", "abccccc")
grep(pattern="c{2,3}", x, value=TRUE, perl=TRUE)
## [1] "abcc" "abccc" "abcccc" "abccccc"
grep(pattern="^((?!c{4,}).)*$", x, value=TRUE, perl=TRUE)
## [1] "ab" "abc" "abcc" "abccc"
But what's the pattern to yield:
grep(pattern=..., x, value=TRUE, perl=TRUE)
## [1] "abcc" "abccc"
This should work for your test cases:
^[^c]+c{2,3}$
But what's the pattern to yield [1] "abcc" "abccc"
You need to make sure that the 2 or 3 cs are neither preceded nor followed by a c:
grep(pattern="(^|[^c])c{2,3}([^c]|$)", x, value=TRUE)
Details:
(^|[^c]) - an alternation group matching start of string (^ anchor) or any char other than c (with a negated character class (bracket expression) [^c])
c{2,3} - 2 or 3 cs
([^c]|$) - an alternation group matching end of string ($ anchor) or any char other than c
Alternatively, use a PCRE regex with lookarounds if your c is actually a placeholder for a multicharacter substring:
grep(pattern="(?<!c)c{2,3}(?!c)", x, value=TRUE, perl=TRUE)
The (?<!c) negative lookbehind will fail the match if there is a c right before 2 or 3 cs and (?!c) negative lookahead fails the match if there is a c right after the 2 or 3 cs.
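As a quick check, both patterns can be run on the vector from the question, plus a made-up vector y in which the run of cs sits in the middle of the string. This is a minimal sketch; y and the result names are assumptions for illustration.

```r
x <- c("ab", "abc", "abcc", "abccc", "abcccc", "abccccc")

# Character-class version (default regex engine)
with_classes <- grep("(^|[^c])c{2,3}([^c]|$)", x, value = TRUE)
# Lookaround version (PCRE)
with_lookarounds <- grep("(?<!c)c{2,3}(?!c)", x, value = TRUE, perl = TRUE)

with_classes
# [1] "abcc"  "abccc"

# Hypothetical inputs where the cs are followed by another character
y <- c("accb", "accccb")
mid_string <- grep("(?<!c)c{2,3}(?!c)", y, value = TRUE, perl = TRUE)
mid_string
# [1] "accb"
```

An end-anchored pattern such as ^[^c]+c{2,3}$ would not match "accb", because there the cs are not at the end of the string.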

Resources