R regex Grouping Not Working as Expected [duplicate] - r

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?
Example: consider a regex capturing digits preceded by "xy":
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
Desired result:
[1] "1234" "567"
First attempt: gregexpr:
regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567"
Not what I want because it returns the substrings matching the entire pattern.
Second try: regexec:
regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234"
Not what I want because it returns only the first occurrence of a match for the entire pattern and the capture group.
If there was a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.
So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?
Note: the pattern for r given above is just a silly example, it must remain arbitrary.

For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567"

Not sure about doing this in base, but here's a package for your needs:
library(stringr)
str_match_all(s, r)
#[[1]]
# [,1] [,2]
#[1,] "xy1234" "1234"
#[2,] "xy567" "567"
Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.
For instance, here's a simplified version of how the above works, using base R:
sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
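The base R idea above can be wrapped in a small helper. This is only a sketch: `capture_all` is a hypothetical name, not a base R function, and it assumes the pattern still matches when re-applied to each extracted match (which may not hold for patterns that rely on lookarounds or anchors).

```r
# Hypothetical helper (not part of base R): one row per match,
# column 1 = full match, remaining columns = capture groups.
capture_all <- function(pattern, x) {
  # Step 1: pull out every full match from the single string x
  full <- regmatches(x, gregexpr(pattern, x))[[1]]
  # Step 2: re-apply the pattern to each match to split out the groups
  parts <- regmatches(full, regexec(pattern, full))
  # Step 3: stack the per-match vectors into a matrix (NULL if no matches)
  do.call(rbind, parts)
}

s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
m <- capture_all(r, s)
m[, 2]
# [1] "1234" "567"
```

The matrix layout mirrors what str_match_all returns, so the second column holds the first capture group.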

strapplyc in the gsubfn package does that:
> library(gsubfn)
> strapplyc(s, r)
[[1]]
[1] "1234" "567"
Try ?strapplyc for additional info and examples.
Related Functions
1) A generalization of strapplyc is strapply in the same package. It takes a function which inputs the captured portions of each match and returns the output of the function. When the function is c it reduces to strapplyc. For example, suppose we wish to return results as numeric:
> strapply(s, r, as.numeric)
[[1]]
[1] 1234 567
2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function inputs the captured portions and outputs the replacement. The replacement replaces the match in the input string. If a formula is used, as in this example, the right hand side of the formula is regarded as the function body. In this example we replace the match with XY{#} where # is twice the matched input number.
> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"
UPDATE: Added strapply and gsubfn examples.

Since R 4.1.0, there is gregexec:
regmatches(s,gregexec(r,s))[[1]][2, ]
[1] "1234" "567"
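To see why the `[2, ]` index is needed, it may help to look at the whole matrix that regmatches builds from a gregexec result. The sketch below assumes R 4.1.0 or later; the variable name `m` is only for illustration.

```r
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

# One matrix per input string: rows = full match and groups, columns = matches
m <- regmatches(s, gregexec(r, s))[[1]]

m[1, ]  # full matches
m[2, ]  # first capture group

# Transposing gives one row per match, similar to stringr's layout
t(m)
```

Row 1 always holds the full match, so capture group k sits in row k + 1.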

Related

subset strings without a pattern stringr

I want to extract elements of a character vector which do not match a given pattern. See the example:
x<-c("age_mean","n_aitd","n_sle","age_sd","n_poly","n_sero","child_age")
x_age<-str_subset(x,"age")
x_notage<-setdiff(x,x_age)
In this example I want to extract the strings that do not match the pattern "age". How can I achieve this in a single call to str_subset? What is the appropriate syntax for the pattern "not age"? As you can see, I am not very experienced with regex. Thanks for any comments.
In this case there seems to be no reason to use stringr (efficiency perhaps). You may simply use grep:
grep("age", x, invert = TRUE, value = TRUE)
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
If, however, you want to stick with stringr, note that (from ?str_subset)
str_subset() is a wrapper around x[str_detect(x, pattern)], and is equivalent to grep(pattern, x, value = TRUE).
So,
x[!str_detect(x, "age")]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
or also
x[!grepl("age", x)]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
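For the "not age" pattern itself, one possible sketch is a negative lookahead anchored at the start of the string. In base R this needs perl = TRUE. The second part assumes stringr 1.4.0 or later is installed, where str_subset has a negate argument; it is guarded so the base R part runs on its own.

```r
x <- c("age_mean", "n_aitd", "n_sle", "age_sd", "n_poly", "n_sero", "child_age")

# Base R: "^(?!.*age)" matches only strings with no "age" anywhere
not_age <- grep("^(?!.*age)", x, perl = TRUE, value = TRUE)
not_age
# [1] "n_aitd" "n_sle"  "n_poly" "n_sero"

# stringr (assumed >= 1.4.0): negate = TRUE inverts the match in a single call
if (requireNamespace("stringr", quietly = TRUE)) {
  stringr::str_subset(x, "age", negate = TRUE)
}
```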

R - Replace Group 1 match in regex but not full match

Suppose I want to extract all letters between the letter a and c. I've been so far using the stringr package which gives a clear idea of the full matches and the groups. The package for example would give the following.
library(stringr)
str_match_all("abc", "a([a-z])c")
# [[1]]
# [,1] [,2]
# [1,] "abc" "b"
Suppose I only want to replace the group, and not the full match---in this case the letter b. The following would, however, replace the full match.
str_replace_all("abc", "a([a-z])c", "z")
[1] "z"
# Desired result: "azc"
Is there a good way to replace only the capture group? Suppose I also wanted to handle multiple matches.
str_match_all("abcdef", "a([a-z])c|d([a-z])f")
# [[1]]
# [,1] [,2] [,3]
# [1,] "abc" "b" NA
# [2,] "def" NA "e"
str_replace_all("abcdef", "a([a-z])c|d([a-z])f", "z")
# [1] "zz"
# Desired result: "azcdzf"
Matching groups was easy enough, but I haven't found a solution when a replacement is desired.
It is not the way regex was designed. Capturing is a mechanism to get the parts of strings you need and when replacing, it is used to keep parts of matches, not to discard.
Thus, a natural solution is to wrap what you need to keep with capturing groups.
In this case here, use
str_replace_all("abc", "(a)[a-z](c)", "\\1z\\2")
Or with lookarounds (if the lookbehind is a fixed/known width pattern):
str_replace_all("abc", "(?<=a)[a-z](?=c)", "z")
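The lookaround idea also covers the two-alternative case from the question. Here is a sketch in base R, where lookarounds require perl = TRUE; the variable names `one` and `two` are only for illustration.

```r
# Lookarounds check the surrounding characters without consuming them,
# so only the middle letter is replaced.
one <- gsub("(?<=a)[a-z](?=c)", "z", "abc", perl = TRUE)
one
# [1] "azc"

# Each alternative carries its own lookbehind and lookahead
two <- gsub("(?<=a)[a-z](?=c)|(?<=d)[a-z](?=f)", "z", "abcdef", perl = TRUE)
two
# [1] "azcdzf"
```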
Usually, when I want to replace a certain pattern of characters in a string, I use the grep family of functions, which work with regular expressions.
You can use sub, one of those functions, to make replacements in strings.
Example:
sub("b","z","abc")
[1] "azc"
Replacement tasks can get more involved, and the grep family offers plenty of functionality for them:
replacing the first run of characters other than a and c:
sub("[^ac]+","z","abBbbbc")
[1] "azc"
replacing the first run of two consecutive b's:
sub("b{2}","z","abBbbbc")
[1] "abBzbc"
replacing all characters after the pattern:
sub("b.*","z","abc")
[1] "az"
the same, but requiring a character other than c after the b (there is no match here, so the string is unchanged):
sub("b.*[^c]","z","abc")
[1] "abc"
And so on.
Searching online for "regular expressions in R using grep" turns up many more ways to work with regular expressions.
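One detail worth keeping in mind with the examples above: sub replaces only the first match, while gsub replaces every match. A minimal sketch (the input string is made up for illustration):

```r
# sub stops after the first match
first_only <- sub("b", "z", "abcabc")
first_only
# [1] "azcabc"

# gsub replaces all matches
all_hits <- gsub("b", "z", "abcabc")
all_hits
# [1] "azcazc"
```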

Regular expression in R - extract only match

My strings look like as follows:
crb_gdp_g_100000_16_16_ftv_all.txt
crb_gdp_g_100000_16_20_fweo2_all.txt
crb_gdp_g_100000_4_40_fweo2_galt_1.txt
I only want to extract the part between f and the following underscore (in these three cases "tv", "weo2" and "weo2").
My regular expression is:
regex.f = "_f([[:alnum:]]+)_"
There is no string with more than one part matching the pattern. Why does the following command not work?
sub(regex.f, "\\1", "crb_gdp_g_100000_16_16_ftv_all.txt")
The command only removes "_f" from the string and returns the remaining string.
This can easily be achieved with qdapRegex:
df <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
library(qdapRegex)
rm_between(df, "_f", "_", extract=TRUE)
We can use sub to extract the strings by matching the character f followed by one or more characters that are not an underscore or a digit ([^_0-9]+), captured as a group ((...)), followed by zero or more digits (\\d*), then an _ and any other characters. Replace with the backreference (\\1) of the captured group. Note that this drops trailing digits, so it returns "weo" rather than "weo2".
sub(".*_f([^_0-9]+)\\d*_.*", "\\1", str1)
#[1] "tv" "weo" "weo"
data
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
My usual regex for extracting the text between two characters comes from https://stackoverflow.com/a/13499594/1017276, which specifically looks at extracting text between parentheses. This approach only changes the parentheses to f and _.
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
"crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")
regmatches(x,gregexpr("(?<=_f).*?(?=_)", x, perl=TRUE))
Or with the stringr package.
library(stringr)
str_extract(x, "(?<=_f).*?(?=_)")
edited to start the match on _f instead of f.
NOTE
akrun's answer runs a few milliseconds faster than the stringr approach, and about ten times faster than the base approach. The base approach clocks in at about 100 milliseconds for a character vector of 10,000 elements.
update: capture match using str_match
library(stringr)
m <- str_match("crb_gdp_g_100000_16_20_fweo2_all.txt", "_f([[:alnum:]]+)_")
print(m[[2]])
# weo2
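Your regex does not work because it lacks a leading and a trailing .* to consume the rest of the string, so sub only replaces the matched part. You can also use \w as a shorthand close to [[:alnum:]]; it additionally matches the underscore, hence the lazy +? below: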
sub(".*_f(\\w+?)_.*", "\\1", "crb_gdp_g_100000_16_20_fweo2_all.txt")
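To make the explanation concrete, the sketch below contrasts the original call with the padded pattern, using the three strings from the question. The names `partial`, `whole` and `via_regexec` are only for illustration.

```r
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.txt",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
regex.f <- "_f([[:alnum:]]+)_"

# sub replaces only the matched part ("_ftv_"); the rest of the string stays
partial <- sub(regex.f, "\\1", x[1])
partial
# [1] "crb_gdp_g_100000_16_16tvall.txt"

# Padding with .* makes the match cover the whole string
whole <- sub(paste0(".*", regex.f, ".*"), "\\1", x)
whole
# [1] "tv"   "weo2" "weo2"

# Alternative without rewriting the pattern: take element 2 of each regexec match
via_regexec <- sapply(regmatches(x, regexec(regex.f, x)), `[`, 2)
via_regexec
# [1] "tv"   "weo2" "weo2"
```

The regexec variant returns NA for strings without a match, whereas the sub variant returns the input unchanged.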
We could use the package unglue :
library(unglue)
txt <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
pattern <-
"crb_gdp_g_100000_{=\\d+}_{=\\d+}_f{x}_{=.+?}.txt"
unglue_vec(txt,pattern)
#> [1] "tv" "weo2" "weo2"
Created on 2019-10-09 by the reprex package (v0.3.0)

Check which words show up at least once within words from another vector

Let's say we have a list of words:
words = c("happy","like","chill")
Now I have another string variable:
s = "happyMeal"
I wanted to check which word in words has the matching part in s.
So s could be "happyTime", "happyFace", "happyHour", but as long as there's "happy" in there, I want my result to return the index of word "happy" in the string vector words.
This question is similar but not identical to the one asked in the post: Find a string in another string in R.
You can loop through each of the words that you're searching for with sapply, using grepl to determine if that word appears in s:
sapply(words, grepl, s)
# happy like chill
# TRUE FALSE FALSE
If s is a single word then sapply with grepl returns a logical vector that you can use to determine the words that matched:
words[sapply(words, grepl, s)]
# [1] "happy"
When s contains multiple words, then sapply with grepl returns a logical matrix, and you can use the column sums to determine which words showed up at least once:
s <- c("happyTime", "chilling", "happyFace")
words[colSums(sapply(words, grepl, s)) > 0]
# [1] "happy" "chill"
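Since the question asks for the index rather than the word, the logical vector can be passed to which. This is a sketch; fixed = TRUE is an added assumption here, treating each word as a literal string rather than a regular expression.

```r
words <- c("happy", "like", "chill")
s <- "happyMeal"

# One TRUE/FALSE per word: does the word occur anywhere in s?
hit <- vapply(words, function(w) grepl(w, s, fixed = TRUE), logical(1))

# Positions in words of those that matched (names carry the words themselves)
idx <- which(hit)
idx
# happy
#     1
```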
Here is an option using stri_detect from stringi
library(stringi)
words[stri_detect_regex(s, words)]
#[1] "happy"

Extract websites links from a text in R

I have multiple texts, each of which may contain references to one or more web links. For example:
text1 = "s#1212a as www.abcd.com asasa11"
How do I extract:
"www.abcd.com"
from this text in R? In other words, I am looking to extract patterns that start with www and end with .com
regmatches. This approach uses regexpr/gregexpr and regmatches. I expanded the test data to include more examples.
text1 <- c("s#1212a www.abcd.com www.cats.com",
"www.boo.com",
"asdf",
"blargwww.test.comasdf")
# Regular expressions take some practice.
# check out ?regex or the wikipedia page on regular expressions
# for more info on creating them yourself.
pattern <- "www\\..*?\\.com"
# Get information about where the pattern matches text1
m <- gregexpr(pattern, text1)
# Extract the matches from text1
regmatches(text1, m)
Which gives
> regmatches(text1, m) ##
[[1]]
[1] "www.abcd.com" "www.cats.com"
[[2]]
[1] "www.boo.com"
[[3]]
character(0)
[[4]]
[1] "www.test.com"
Notice it returns a list. If we want a vector you can just use unlist on the result. This is because we used gregexpr which implies there could be multiple matches in our string. If we know there is at most one match we could use regexpr instead
> m <- regexpr(pattern, text1)
> regmatches(text1, m)
[1] "www.abcd.com" "www.boo.com" "www.test.com"
Notice, however, that this returns all results as a vector and only returns a single result from each string (note that www.cats.com isn't in the results). On the whole, though, I think either of these two methods is preferable to the gsub method because that way will return the entire input if there is no result found. For example take a look:
> gsub(text1, pattern=".*(www\\..*?\\.com).*", replace="\\1")
[1] "www.abcd.com" "www.boo.com" "asdf" "www.test.com"
And that's even after modifying the pattern to be a little more robust. We still get 'asdf' in the results even though it clearly doesn't match the pattern.
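If a result aligned with the input is needed (one entry per string, with a placeholder where nothing matched), one possible sketch is to fill a vector of NA from the regexpr result. The name `first_link` is only for illustration.

```r
text1 <- c("s#1212a www.abcd.com www.cats.com",
           "www.boo.com",
           "asdf",
           "blargwww.test.comasdf")
pattern <- "www\\..*?\\.com"

# regexpr returns -1 for strings without a match
m <- regexpr(pattern, text1)

# Start with NA everywhere, then fill in the strings that did match
first_link <- rep(NA_character_, length(text1))
first_link[m != -1] <- regmatches(text1, m)
first_link
# [1] "www.abcd.com" "www.boo.com"  NA             "www.test.com"
```

Unlike the gsub approach, a non-matching input such as "asdf" shows up as NA instead of being returned unchanged.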
Shameless silly self promotion: regmatches was introduced with R 2.14 so if you're stuck with an earlier version of R you might be out of luck. Unless you're able to install the future2.14 package from my github repo which provides some support for functions introduced in 2.14 to earlier versions of R.
strapplyc. An alternative which gives the same result as ## above is:
library(gsubfn)
strapplyc(text1, pattern)
The regular expression. Here is some explanation on how to decipher the regular expression:
pattern <- "www\\..*?\\.com"
Explanation:
www matches the www portion
\\. We need to escape an actual 'dot' character using \\ because a plain . represents "any character" in regular expressions.
.*? The . represents any character, the * tells to match 0 or more times, and the ? following the * tells it to not be greedy. Otherwise "asdf www.cats.com www.dogs.com asdf" would match all of "www.cats.com www.dogs.com" as a single match instead of recognizing that there are two matches in there.
\\. Once again we need to escape an actual dot character
com This part matches the ending 'com' that we want to match
Putting it all together it says: start with www. then match any characters until you reach the first ".com"
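The effect of the ? after the * can be checked directly on the example string from the explanation. A small sketch (the names `greedy` and `lazy` are only for illustration):

```r
x <- "asdf www.cats.com www.dogs.com asdf"

# Greedy: .* runs to the last ".com", giving a single long match
greedy <- regmatches(x, gregexpr("www\\..*\\.com", x))[[1]]
greedy
# [1] "www.cats.com www.dogs.com"

# Non-greedy: .*? stops at the first ".com", giving two matches
lazy <- regmatches(x, gregexpr("www\\..*?\\.com", x))[[1]]
lazy
# [1] "www.cats.com" "www.dogs.com"
```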
Check out the gsub function:
x = "s#1212a as www.abcd.com asasa11"
gsub(x=x, pattern=".*(www.*com).*", replace="\\1")
The basic idea is to surround the text you want to retain in parentheses, then replace the entire line with it. The replace parameter of gsub, "\\1", refers to what was found in the parentheses.
The solutions here are great and in base. For those who want a quick solution, you can use qdap's genXtract. This function basically takes left and right element(s) and extracts everything in between. Setting with = TRUE includes those elements in the result:
text1 <- c("s#1212a www.abcd.com www.cats.com",
"www.boo.com",
"asdf",
"http://www.talkstats.com/ and http://stackoverflow.com/",
"blargwww.test.comasdf")
library(qdap)
genXtract(text1, "www.", ".com", with=TRUE)
## > genXtract(text1, "www.", ".com", with=TRUE)
## $`www. : .com1`
## [1] "www.abcd.com" "www.cats.com"
##
## $`www. : .com2`
## [1] "www.boo.com"
##
## $`www. : .com3`
## character(0)
##
## $`www. : .com4`
## [1] "www.talkstats.com"
##
## $`www. : .com5`
## [1] "www.test.com"
PS if you look at code for the function it is a wrapper for Dason's solution.

Resources