Extract a part of string on a particular reference - r

I need to extract the number that comes after "&r=" in the link below.
http://asdf.com/product/eyewear/eyeglasses?Brand[]=Allen%20Solly&r=472020&ck-source=google-adwords&ck-campaign=eyeglasses-cat-brand-broad&ck-adgroup=eyeglasses-dersdc-cat-brand-broad&keyword={keyword}&matchtype={matchtype}&network={network}&creative={creative}&adposition={adposition}
Here's what I tried.
The link is stored in c.
sub(".*&r=", "",c)
"472020&ck-source=google-adwords&ck-campaign=eyeglasses-cat-brand-broad&ck-adgroup=eyeglasses-dersdc-cat-brand-broad&keyword={keyword}&matchtype={matchtype}&network={network}&creative={creative}&adposition={adposition}"
This returns everything after "&r=", not just the number.
I only need the number, i.e. 472020.
Any ideas?

Here is how to get it using sub (z holds the link):
sub(".*=(\\d+)&.*", "\\1", z)
#[1] "472020"
or
as.integer(sub(".*=(\\d+)&.*", "\\1", z))
#[1] 472020
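One caveat worth illustrating: the pattern above keys on any "=digits&" sequence, and because the leading .* is greedy it picks the last such sequence in the string. The sketch below uses a made-up URL (the page=2 parameter is an assumption added for illustration, not part of the original link) to show the difference when the pattern is tied to the r parameter explicitly.

```r
# Hypothetical URL: like the original, but with an extra numeric parameter.
z2 <- "http://asdf.com/product?Brand[]=Allen%20Solly&r=472020&page=2&ck-source=google-adwords"

# Greedy .* backtracks to the LAST "=digits&", so this picks up page=2.
greedy <- sub(".*=(\\d+)&.*", "\\1", z2)

# Anchoring on "?r=" or "&r=" targets the intended parameter.
anchored <- sub(".*[?&]r=(\\d+).*", "\\1", z2)

greedy    # "2"
anchored  # "472020"
```

For the link in the question both patterns give the same answer, since r is the only parameter with a purely numeric value followed by "&".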

For completeness' sake, here it is with the base R regmatches/regexpr combo:
regmatches(z, regexpr("(?<=\\&r\\=)\\d+",z,perl=TRUE))
It uses the same Perl-flavoured regex as @akrun's stringr version. regexpr finds the first match of the pattern in each string (use gregexpr if several matches are expected in the same string), and regmatches extracts the matched text. Both are vectorized, so several strings can be matched and extracted at once.
as.integer(regmatches(z,regexpr("(?<=\\&r\\=)\\d+",z,perl=TRUE)))
#[1] 472020
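When this is applied to a vector, note that regmatches returns only the strings that matched, so the result can be shorter than the input. A minimal sketch of one way to keep the output aligned with the input, assuming a made-up second URL that has no r parameter:

```r
# Hypothetical input: the second URL has no "&r=" parameter.
urls <- c("http://asdf.com/p?Brand[]=Allen%20Solly&r=472020&ck-source=google",
          "http://asdf.com/p?x=1")

m <- regexpr("(?<=&r=)\\d+", urls, perl = TRUE)  # -1 marks "no match"

# regmatches() silently drops the non-matching element ...
matched_only <- regmatches(urls, m)

# ... so fill a pre-sized vector to keep one result per input string.
out <- rep(NA_character_, length(urls))
out[m != -1] <- regmatches(urls, m)

matched_only  # "472020"
out           # "472020" NA
```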

We can use str_extract
library(stringr)
as.numeric(str_extract(z, "(?<=\\&r\\=)\\d+"))
#[1] 472020
If there are several matches, use str_extract_all in place of str_extract.

Related

Regex get string between intervals underscores

I've seen a lot of similar questions, but I wasn't able to get the desired output.
I have a string means_variab_textimput_x2_200.txt and I want to catch ONLY what is between the second and third underscores: textimput
I'm using R, stringr, I've tried many things, but none solved the issue:
my_string <- "means_variab_textimput_x2_200.txt"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*')
"means_variab_textimput"
str_extract(my_string, '^(?:([^_]+)_){4}')
"means_variab_textimput_x2_"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*\\.') ## the closer I got was this
"_textimput_x2_200."
Any ideas? Ps: I'm VERY new to Regex, so details would be much appreciated :)
Additional question: can I also get only a "part" of the word? Let's say, instead of textimput, only text, but without counting the words? It would be good to know both possibilities.
Several related answers were helpful, but I couldn't get the final expected result. Thanks in advance.
stringr uses ICU-based regular expressions. One option would be a regex lookbehind, but the text that precedes the target word has no fixed length here, so (?<= is not a good fit. Another option is to remove the unwanted substrings with str_remove, or to use str_replace: match the whole string, capture the third word, i.e. a run of characters that are not _ (([^_]+)), and replace everything with the backreference (\\1) to the captured word.
library(stringr)
str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1")
[1] "textimput"
If we need only the first four characters of that word:
str_replace(my_string, "^[^_]+_[^_]+_([^_]{4}).*", "\\1")
[1] "text"
In base R, it is easier to split the string with strsplit and pick the third word by indexing:
strsplit(my_string, "_")[[1]][3]
# [1] "textimput"
Or use perl = TRUE in regexpr
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]+", my_string, perl = TRUE))
# [1] "textimput"
For the substring
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]{4}", my_string, perl = TRUE))
[1] "text"
Following up on the question asked in the comments about restricting the size of the extracted word: this can be achieved with a quantifier. If, for example, you want to extract only the first 4 letters:
sub("[^_]+_[^_]+_([^_]{4}).*$", "\\1", my_string)
[1] "text"
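The two quantifiers used in these answers ({2} for the fields to skip, {4} for the characters to keep) can be turned into parameters. This is only a sketch: field_prefix is a hypothetical helper name, not a base R or stringr function, and it assumes "_" as the separator.

```r
# Hypothetical helper: return up to `width` leading characters of the
# n-th underscore-separated field of each string in x.
field_prefix <- function(x, n, width) {
  # Skip n - 1 fields, then capture 1..width non-underscore characters.
  pattern <- sprintf("^(?:[^_]*_){%d}([^_]{1,%d}).*$", n - 1L, width)
  sub(pattern, "\\1", x, perl = TRUE)
}

my_string <- "means_variab_textimput_x2_200.txt"
field_prefix(my_string, 3, 4)  # "text"
field_prefix(my_string, 3, 9)  # "textimput"
```

As with any sub-based extraction, a string that does not match the pattern (for example one with fewer than n fields) is returned unchanged, so it may be worth checking with grepl first.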

Regex replace everything after first digit including digit with another string

I have some strings with this pattern: abc_def_10_cat_dog and I want to use gsub to replace everything after abc_def with _diff. So in the end it should be abc_def_diff. What regex expression would I need to do this? Another way of thinking about it could be "How do I keep all the values before the first digit and then add _diff?" I am using dplyr, gsub, on R.
I know '^[^0-9]*' keeps what I want, but I'm not sure how to just keep those characters and drop the stuff afterwards. I tried using str_extract as well but it kept saying 'object not found'. My object is just a list of names.
object <- df %>% select(ends_with("10_cat_dog")) %>% names()
You can match an underscore followed by a digit, plus any characters up to the end of the string, and then replace the match with _diff:
library(stringr)
str_replace("abc_def_10_cat_dog", "_\\d.*$", "_diff")
#> [1] "abc_def_diff"
Created on 2020-10-30 by the reprex package (v0.3.0)
You can use base R sub :
x <- 'abc_def_10_cat_dog'
sub('\\d+.*', 'diff', x)
#[1] "abc_def_diff"
Another option with sub
sub("[0-9]+.*", "diff", x)
#[1] "abc_def_diff"
data
x <- 'abc_def_10_cat_dog'
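Since the question mentions a whole vector of names, here is a small sketch of the same idea applied to several strings at once. The second and third names are made up for illustration.

```r
# Hypothetical column names; only the first comes from the question.
nms <- c("abc_def_10_cat_dog", "xyz_20_bird", "no_digits")

# Replace the first "_<digit>" and everything after it with "_diff".
renamed <- sub("_[0-9].*$", "_diff", nms)

renamed  # "abc_def_diff" "xyz_diff" "no_digits"
```

Names without a digit are left untouched because sub returns the input unchanged when the pattern does not match. sub is enough here (rather than gsub) because the pattern already consumes everything up to the end of the string.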

Regular expression in R - extract only match

My strings look as follows:
crb_gdp_g_100000_16_16_ftv_all.txt
crb_gdp_g_100000_16_20_fweo2_all.txt
crb_gdp_g_100000_4_40_fweo2_galt_1.txt
I only want to extract the part between f and the following underscore (in these three cases "tv", "weo2" and "weo2").
My regular expression is:
regex.f = "_f([[:alnum:]]+)_"
There is no string with more than one part matching the pattern. Why does the following command not work?
sub(regex.f, "\\1", "crb_gdp_g_100000_16_16_ftv_all.txt")
The command only removes "_f" from the string and returns the remaining string.
This can easily be achieved with qdapRegex:
df <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
library(qdapRegex)
rm_between(df, "_f", "_", extract=TRUE)
We can use sub to extract the strings: match the character f followed by one or more characters that are not underscores or digits ([^_0-9]+), captured as a group ((...)), followed by zero or more digits (\\d*), then an _ and the remaining characters. Replace with the backreference (\\1) to the captured group. Note that this drops trailing digits, so it returns "weo" rather than the "weo2" asked for in the question; capturing ([^_]+) instead would keep them.
sub(".*_f([^_0-9]+)\\d*_.*", "\\1", str1)
#[1] "tv" "weo" "weo"
data
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
My usual regex for extracting the text between two characters comes from https://stackoverflow.com/a/13499594/1017276, which specifically looks at extracting text between parentheses. This approach only changes the parentheses to f and _.
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
"crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")
regmatches(x,gregexpr("(?<=_f).*?(?=_)", x, perl=TRUE))
Or with the stringr package.
library(stringr)
str_extract(x, "(?<=_f).*?(?=_)")
edited to start the match on _f instead of f.
NOTE
akrun's answer runs a few milliseconds faster than the stringr approach, and about ten times faster than the base approach. The base approach clocks in at about 100 milliseconds for a character vector of 10,000 elements.
update: capture match using str_match
library(stringr)
m <- str_match("crb_gdp_g_100000_16_20_fweo2_all.txt", "_f([[:alnum:]]+)_")
print(m[[2]])
# weo2
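Your regex does not work because it lacks a leading and trailing .* to match the rest of the string, so sub replaces only the matched part. You can also use \\w in place of [[:alnum:]]; \\w matches underscores as well, which is why the quantifier below is lazy (+?):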
sub(".*_f(\\w+?)_.*", "\\1", "crb_gdp_g_100000_16_20_fweo2_all.txt")
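To make the difference concrete, here is a short sketch comparing the call from the question with a version that wraps the same pattern in .* on both sides. The variable names partial and full are just for illustration.

```r
s <- "crb_gdp_g_100000_16_16_ftv_all.txt"
regex.f <- "_f([[:alnum:]]+)_"

# sub() replaces only the matched part ("_ftv_") and keeps the rest.
partial <- sub(regex.f, "\\1", s)

# With .* on both sides the match spans the whole string,
# so only the captured group survives the replacement.
full <- sub(paste0(".*", regex.f, ".*"), "\\1", s)

partial  # "crb_gdp_g_100000_16_16tvall.txt"
full     # "tv"
```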
We could use the package unglue:
library(unglue)
txt <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
pattern <-
"crb_gdp_g_100000_{=\\d+}_{=\\d+}_f{x}_{=.+?}.txt"
unglue_vec(txt,pattern)
#> [1] "tv" "weo2" "weo2"
Created on 2019-10-09 by the reprex package (v0.3.0)

Extracting part of string by position in R

I have a vector of strings which look like this
ABC_EFG_HIG_ADF_AKF_MNB
Now from each of these elements I want to extract the 3rd set of characters (from the left), i.e. in this case HIG. How can I achieve this in R?
substr extracts a substring by position:
substr('ABC_EFG_HIG_ADF_AKF_MNB', 9, 11)
returns
[1] "HIG"
Here's one more possibility:
strsplit(str1,"_")[[1]][3]
#[1] "HIG"
The command strsplit() does what its name suggests: it splits a string. The second parameter is the character on which the string is split, wherever it is found within the string.
Perhaps somewhat surprisingly, strsplit() returns a list. So we can either use unlist() to access the resulting split parts of the original string, or in this case address them with the index of the list [[1]] since the list in this example has only one member, which consists of six character strings (cf. the output of str(strsplit(str1,"_"))).
To access the third entry of this list, we can specify [3] at the end of the command.
The string str1 is defined here as in the answer by @akrun.
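Because strsplit() returns one list element per input string, the [[1]][3] indexing only covers a single string. For a vector of strings, one possible sketch is to index each list element in turn; the second input string below is made up for illustration.

```r
# Hypothetical vector; only the first element comes from the question.
strings <- c("ABC_EFG_HIG_ADF_AKF_MNB", "AAA_BBB_CCC_DDD")

parts <- strsplit(strings, "_", fixed = TRUE)  # list with one vector per string

# Take the 3rd piece of every element; vapply guarantees a character vector.
third <- vapply(parts, `[`, character(1), 3)

third  # "HIG" "CCC"
```

A string with fewer than three pieces yields NA for that position instead of an error.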
We can use sub. We match one or more characters that are not _ ([^_]+) followed by a _, and keep them in a capture group. As we want to extract the third set of non-_ characters, we repeat that group 2 times ({2}), followed by another capture group of one or more non-_ characters, and then the rest of the characters, indicated by .*. In the replacement, we use the backreference to the second capture group (\\2).
sub("^([^_]+_){2}([^_]+).*", "\\2", str1)
#[1] "HIG"
Or another option is with scan
scan(text=str1, sep="_", what="", quiet=TRUE)[3]
#[1] "HIG"
A similar option, as mentioned by @RHertel, would be to use read.table/read.csv on the string
read.table(text=str1,sep = "_", stringsAsFactors=FALSE)[,3]
data
str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
If you know the position of the pattern you are looking for, and you know that it is fixed (here, from the 9th to the 11th character), you can simply use str_sub() from the stringr package.
library(stringr)
MyString = 'ABC_EFG_HIG_ADF_AKF_MNB'
str_sub(MyString, 9, 11)
A newer option is the function str_split_i from stringr (in the development version at the time of writing, and released in stringr 1.5.0), which splits each string on a separator and extracts the piece at a given position. Here is a reproducible example:
# devtools::install_github("tidyverse/stringr")
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 3)
#> [1] "HIG"
Created on 2022-09-10 with reprex v2.0.2
As you can see, it extracted the third string. If you want the 6th, change the 3 to 6 like this:
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 6)
#> [1] "MNB"
Created on 2022-09-10 with reprex v2.0.2

selective removal of characters following a pattern using R

How to selectively remove characters from a string following a pattern?
I wish to remove the 7 figures and the preceding colon.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function does string replacement and replaces all matches found in the string (see ?gsub). An alternative, if you want something with a friendlier name, is str_replace_all from the stringr package.
The regular expression makes use of \\d{n}, which matches exactly n consecutive digits. So \\d{1} matches a single digit and \\d{6} matches a run of six digits.
gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"
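The two patterns differ in how strict they are about the number format. The sketch below uses a made-up tree string with branch lengths of varying precision to show the difference; for the string in the question both give the same result.

```r
# Hypothetical input: branch lengths are not all of the form d.dddddd.
tree <- "((A:0.005926,B:0.12)N19:10.5"

# Fixed-width pattern: only removes ":<1 digit>.<6 digits>".
strict <- gsub(":\\d{1}[.]\\d{6}", "", tree)

# Character-class pattern: removes ":" plus any run of digits and dots.
lenient <- gsub(":[0-9.]+", "", tree)

strict   # "((A,B:0.12)N19:10.5"
lenient  # "((A,B)N19"
```

The strict version is safer when other colon-prefixed numbers must be preserved; the lenient one is more convenient when the precision of the values is not known in advance.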
Another approach to solve this problem
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"

Resources