How to match multiple capture groups with str_match(..., regex) - r

I am using str_match from the stringr package to capture text in between brackets.
library(stringr)
strs = c("P5P (abcde) + P5P (fghij)", "Glcext (abcdef)")
str_match(strs, "\\(([a-z]+)\\)")
gives me only the matches "abcde" and "abcdef". How can I capture the "fghij" as well with still using the same regex for both strings?

str_extract_all(strs, "\\(([a-z]+)\\)")
or as #JoshO'Brien mentions in his comment,
str_match_all(strs, "\\(([a-z]+)\\)")
This can just as easily be accomplished with base R:
regmatches(strs, gregexpr("\\(([a-z]+)\\)", strs))

Related

Regex get string between intervals underscores

I've seen a lot of similar questions, but I wasn't able to get the desired output.
I have a string means_variab_textimput_x2_200.txt and I want to catch ONLY what is between the third and fourth underscores: textimput
I'm using R, stringr, I've tried many things, but none solved the issue:
my_string <- "means_variab_textimput_x2_200.txt"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*')
"means_variab_textimput"
str_extract(my_string, '^(?:([^_]+)_){4}')
"means_variab_textimput_x2_"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*\\.') ## the closer I got was this
"_textimput_x2_200."
Any ideas? Ps: I'm VERY new to Regex, so details would be much appreciated :)
additional question: can I also get only a "part" of the word? let's say, instead of textimput only text but without counting the words? It would be good to know both possibilities
this this one this one were helpful, but I couldn't get the final expected results. Thanks in advance.
stringr uses ICU based regular expressions. Therefore, an option would be to use regex lookarounds, but here the length is not fixed, thus (?<= wouldn't work. Another option is to either remove the substrings with str_remove or use str_replace to match and capture the third word which doesn't have the _ ([^_]+) and replace with the backreference (\\1) of the captured word
library(stringr)
str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1")
[1] "textimput"
If we need only the substring
str_replace(my_string, "^[^_]+_[^_]+_([^_]{4}).*", "\\1")
[1] "text"
In base R, it is easier with strsplit and get the third word with indexing
strsplit(my_string, "_")[[1]][3]
# [1] "textimput"
Or use perl = TRUE in regexpr
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]+", my_string, perl = TRUE))
# [1] "textimput"
For the substring
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]{4}", my_string, perl = TRUE))
[1] "text"
Following up on question asked in comment about restricting the size of the extracted word, this can easily be achieved using quantification. If, for example, you want to extract only the first 4 letters:
sub("[^_]+_[^_]+_([^_]{4}).*$", "\\1", my_string)
[1] "text"

Add symbol between the letter S and any number in a column dataframe

I am trying to add a - between letter S and any number in a column of a data frame. So, this is an example:
VariableA
TRS34
MMH22
GFSR104
GS23
RRTM55
P3
S4
My desired output is:
VariableA
TRS-34
MMH22
GFSR104
GS-23
RRTM55
P3
S-4
I was trying yo use gsub:
gsub('^([a-z])-([0-9]+)$','\\1d\\2',myDF$VariableA)
but this is not working.
How can I solve this?
Thanks!
Your ^([a-z])-([0-9]+)$ regex attempts to match strings that start with a letter, then have a - and then one or more digits. This can't work as there are no hyphens in the strings, you want to introduce it into the strings.
You can use
gsub('(S)([0-9])', '\\1-\\2', myDF$VariableA)
The (S)([0-9]) regex matches and captures S into Group 1 (\1) and then any digit is captured into Group 2 (\2) and the replacement pattern is a concatenation of group values with a hyphen in between.
If there is only one substitution expected, replace gsub with sub.
See the regex demo and the online R demo.
Other variations:
gsub('(S)(\\d)', '\\1-\\2', myDF$VariableA) # \d also matches digits
gsub('(?<=S)(?=\\d)', '-', myDF$VariableA, perl=TRUE) # Lookarounds make backreferences redundant
Here is the version I like using sub:
myDF$VariableA <- gsub('S(\\d)', 'S-\\1', myDF$VariableA)
This requires using only one capture group.
Using stringr package
library(stringr)
str_replace_all(myDF$VariableA, 'S(\\d)', 'S-\\1')
You could also use lookbehinds if you set perl=TRUE:
> gsub('(?<=S)([0-9]+)', '-\\1', myDF$VariableA, perl=TRUE)
[1] "TRS-34" "MMH22" "GFSR104" "GS-23" "RRTM55" "P3" "S-4"
>

How to get str_replace (and other stringr) functions to ignore regex (treat string literally)?

How can we make stringr functions (like str_replace) treat arguments as string literals rather than regex without escaping special characters manually?
Example
Escaping special characters works as expected
"dsfj?lddkfj" %>% str_replace("\\?ld", "AAA")
[1] "dsfjAAAdkfj"
But if we don't escape the special character(s), stringr functions treat the pattern as a regex. In this case we get an error.
"dsfj?lddkfj" %>% str_replace("?ld", "AAA")
Quesiton
Is there some standard way to force stringr functions to treat patterns as string literals (without escaping special characters manually)?
Notes
I can see perl = T is an option used for the function sub()
Wrap the pattern in fixed()
library(stringr)
"dsfj?lddkfj" %>% str_replace(fixed("?ld"), "AAA")
#[1] "dsfjAAAdkfj"
We can use sub in base R
sub("?ld", "AAA", "dsfj?lddkfj", fixed = TRUE)
#[1] "dsfjAAAdkfj"

Regular expression in R - extract only match

My strings look like as follows:
crb_gdp_g_100000_16_16_ftv_all.txt
crb_gdp_g_100000_16_20_fweo2_all.txt
crb_gdp_g_100000_4_40_fweo2_galt_1.txt
I only want to extract the part between f and the following underscore (in these three cases "tv", "weo2" and "weo2").
My regular expression is:
regex.f = "_f([[:alnum:]]+)_"
There is no string with more than one part matching the pattern. Why does the following command not work?
sub(regex.f, "\\1", "crb_gdp_g_100000_16_16_ftv_all.txt")
The command only removes "_f" from the string and returns the remaining string.
Can easily be achived with qdapRegex
df <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
library(qdapRegex)
rm_between(df, "_f", "_", extract=TRUE)
We can use sub extract the strings by matching the characterf followed by one or more characters that are not an underscore or numbers ([^_0-9]+), capture as a group ((...)), followed by 0 or more numbers (\\d*) followed by an _ and other characters. Replace with the backreference (\\1) of the captured group
sub(".*_f([^_0-9]+)\\d*_.*", "\\1", str1)
#[1] "tv" "weo" "weo"
data
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
My usual regex for extracting the text between two characters comes from https://stackoverflow.com/a/13499594/1017276, which specifically looks at extracting text between parentheses. This approach only changes the parentheses to f and _.
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
"crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")
regmatches(x,gregexpr("(?<=_f).*?(?=_)", x, perl=TRUE))
Or with the stringr package.
library(stringr)
str_extract(x, "(?<=_f).*?(?=_)")
edited to start the match on _f instead of f.
NOTE
akrun's answer runs a few milliseconds faster than the stringr approach, and about ten times faster than the base approach. The base approach clocks in at about 100 milliseconds for a character vector of 10,000 elements.
update: capture match using str_match
library(stringr)
m <- str_match("crb_gdp_g_100000_16_20_fweo2_all.txt", "_f([[:alnum:]]+)_")
print(m[[2]])
# weo2
your regex not work because missing starting and ending match .* and use \w for shorthand [:alnum:]
sub(".*_f(\\w+?)_.*", "\\1", "crb_gdp_g_100000_16_20_fweo2_all.txt")
We could use the package unglue :
library(unglue)
txt <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
pattern <-
"crb_gdp_g_100000_{=\\d+}_{=\\d+}_f{x}_{=.+?}.txt"
unglue_vec(txt,pattern)
#> [1] "tv" "weo2" "weo2"
Created on 2019-10-09 by the reprex package (v0.3.0)

Extract a part of string on a particular reference

I need to extract number that comes after "&r=" in the below link.
http://asdf.com/product/eyewear/eyeglasses?Brand[]=Allen%20Solly&r=472020&ck-source=google-adwords&ck-campaign=eyeglasses-cat-brand-broad&ck-adgroup=eyeglasses-dersdc-cat-brand-broad&keyword={keyword}&matchtype={matchtype}&network={network}&creative={creative}&adposition={adposition}
Here's what i tried
C has my link stored in.
sub(".*&r=", "",c)
"472020&ck-source=google-adwords&ck-campaign=eyeglasses-cat-brand-broad&ck-adgroup=eyeglasses-dersdc-cat-brand-broad&keyword={keyword}&matchtype={matchtype}&network={network}&creative={creative}&adposition={adposition}"
This only gives me whole after part of the string .
I only need the number i.e 472020 .
Any idea?
Here is how to get it using sub
sub(".*=(\\d+)&.*", "\\1", z)
#[1] "472020"
or
as.integer(sub(".*=(\\d+)&.*", "\\1", z))
#[1] 472020
For completeness sake, here it is with the base R regmatches/regexpr combo:
regmatches(z, regexpr("(?<=\\&r\\=)\\d+",z,perl=TRUE))
It uses the same Perl-flavoured regex as #akrun's stringr version. regexpr (or gregexpr if several matches of the same pattern are expected in the same string) matches the pattern, while regmatches extracts it (it is vectorized so several strings can be matched/extracted at once).
> as.integer(regmatches(z,regexpr("(?<=\\&r\\=)\\d+",z,perl=TRUE)))
#[1] 472020
We can use str_extract
library(stringr)
as.numeric(str_extract(z, "(?<=\\&r\\=)\\d+"))
#[1] 472020
If there are several matches use str_extract_all in place of str_extract

Resources