Regular expression in R - extract only match - r

My strings look as follows:
crb_gdp_g_100000_16_16_ftv_all.txt
crb_gdp_g_100000_16_20_fweo2_all.txt
crb_gdp_g_100000_4_40_fweo2_galt_1.txt
I only want to extract the part between f and the following underscore (in these three cases "tv", "weo2" and "weo2").
My regular expression is:
regex.f = "_f([[:alnum:]]+)_"
There is no string with more than one part matching the pattern. Why does the following command not work?
sub(regex.f, "\\1", "crb_gdp_g_100000_16_16_ftv_all.txt")
The command only removes "_f" from the string and returns the remaining string.

This can easily be achieved with qdapRegex:
df <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
library(qdapRegex)
rm_between(df, "_f", "_", extract=TRUE)

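We can use sub to extract the strings by matching the character f followed by one or more characters that are not an underscore or a digit ([^_0-9]+), captured as a group ((...)), followed by zero or more digits (\\d*), then an _ and the remaining characters. The whole match is replaced with the backreference (\\1) to the captured group. Because the trailing digits sit outside the group, this returns "weo" rather than the "weo2" asked for in the question; capture ([^_]+) instead to keep the digits.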
sub(".*_f([^_0-9]+)\\d*_.*", "\\1", str1)
#[1] "tv" "weo" "weo"
data
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
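If the trailing digits should be kept, as the question asks, the same sub approach can capture everything up to the next underscore. This is a minimal sketch on the data above, not part of the original answer; the name with_digits is only illustrative.

```r
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
          "crb_gdp_g_100000_16_20_fweo2_all.xml",
          "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

# Capture every character after "_f" that is not an underscore,
# so digits such as the "2" in "weo2" stay inside the group.
with_digits <- sub(".*_f([^_]+)_.*", "\\1", str1)
print(with_digits)
#[1] "tv"   "weo2" "weo2"
```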

My usual regex for extracting the text between two characters comes from https://stackoverflow.com/a/13499594/1017276, which specifically looks at extracting text between parentheses. This approach only changes the parentheses to f and _.
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
"crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")
regmatches(x,gregexpr("(?<=_f).*?(?=_)", x, perl=TRUE))
Or with the stringr package.
library(stringr)
str_extract(x, "(?<=_f).*?(?=_)")
edited to start the match on _f instead of f.
NOTE
akrun's answer runs a few milliseconds faster than the stringr approach, and about ten times faster than the base approach. The base approach clocks in at about 100 milliseconds for a character vector of 10,000 elements.
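Note that regmatches combined with gregexpr returns a list, with one element per input string. If a plain character vector with NA for non-matching strings is preferred, something like the following sketch could be used. The helper name extract_first and the extra non-matching string are assumptions made for illustration only.

```r
# Hypothetical helper: first match per string, NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- as.vector(m) != -1          # regexpr returns -1 for "no match"
  out[hit] <- regmatches(x, m)       # regmatches drops the non-matches
  out
}

x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.xml",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
       "crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml",
       "no_match_here.txt")

result <- extract_first(x, "(?<=_f).*?(?=_)")
print(result)
```

The result has the same length as the input, which makes it easier to store as a column next to the original strings.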

Update: capture the match using str_match
library(stringr)
m <- str_match("crb_gdp_g_100000_16_20_fweo2_all.txt", "_f([[:alnum:]]+)_")
print(m[[2]])
# weo2
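Your regex does not work because sub replaces only the part of the string that matches and leaves the rest in place. Add .* at the start and the end of the pattern so that the whole string is matched and then replaced by the group. \w can serve as a shorthand for [[:alnum:]_]; since it also matches the underscore, it is made lazy (+?) here: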
sub(".*_f(\\w+?)_.*", "\\1", "crb_gdp_g_100000_16_20_fweo2_all.txt")
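To see the difference, it may help to run the original pattern and the padded pattern side by side. The following is a minimal base R sketch; the variable names partial and full are illustrative.

```r
x <- "crb_gdp_g_100000_16_20_fweo2_all.txt"

# Without .* on both sides, only the matched piece "_fweo2_" is replaced
# by the group; everything else in the string is left in place.
partial <- sub("_f([[:alnum:]]+)_", "\\1", x)

# With .* on both sides the pattern consumes the whole string,
# so only the captured group remains after the replacement.
full <- sub(".*_f([[:alnum:]]+)_.*", "\\1", x)

print(partial)
print(full)
```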

We could use the package unglue :
library(unglue)
txt <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
pattern <-
"crb_gdp_g_100000_{=\\d+}_{=\\d+}_f{x}_{=.+?}.txt"
unglue_vec(txt,pattern)
#> [1] "tv" "weo2" "weo2"
Created on 2019-10-09 by the reprex package (v0.3.0)

Related

Regex get string between intervals underscores

I've seen a lot of similar questions, but I wasn't able to get the desired output.
I have a string means_variab_textimput_x2_200.txt and I want to catch ONLY what is between the second and third underscores: textimput
I'm using R, stringr, I've tried many things, but none solved the issue:
my_string <- "means_variab_textimput_x2_200.txt"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*')
"means_variab_textimput"
str_extract(my_string, '^(?:([^_]+)_){4}')
"means_variab_textimput_x2_"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*\\.') ## the closest I got was this
"_textimput_x2_200."
Any ideas? Ps: I'm VERY new to Regex, so details would be much appreciated :)
Additional question: can I also get only a "part" of the word? Let's say, instead of textimput, only text, but without counting the words? It would be good to know both possibilities.
Several related questions were helpful, but I couldn't get the final expected results. Thanks in advance.
stringr uses ICU-based regular expressions. One option would be regex lookarounds, but the length of the text before the target is not fixed here, so a lookbehind ((?<=...)) wouldn't work. Another option is to either remove the unwanted substrings with str_remove, or use str_replace to match the whole string, capture the third word, i.e. a run of characters other than _ ([^_]+), and replace the match with the backreference (\\1) to the captured word:
library(stringr)
str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1")
[1] "textimput"
If we need only the substring
str_replace(my_string, "^[^_]+_[^_]+_([^_]{4}).*", "\\1")
[1] "text"
In base R, it is easier to split with strsplit and pick the third word by indexing:
strsplit(my_string, "_")[[1]][3]
# [1] "textimput"
Or use perl = TRUE in regexpr
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]+", my_string, perl = TRUE))
# [1] "textimput"
For the substring
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]{4}", my_string, perl = TRUE))
[1] "text"
Following up on the question asked in the comments about restricting the size of the extracted word: this can easily be achieved with a quantifier. If, for example, you want to extract only the first 4 letters:
sub("[^_]+_[^_]+_([^_]{4}).*$", "\\1", my_string)
[1] "text"
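The strsplit approach shown above works on one string at a time. For a whole vector of file names, a sketch along these lines could be used; the second example string and the variable names are made up for illustration.

```r
my_strings <- c("means_variab_textimput_x2_200.txt",
                "a_b_c_d.txt")

# strsplit returns one character vector per input string;
# vapply picks the third piece of each and guarantees a character result.
third <- vapply(strsplit(my_strings, "_", fixed = TRUE),
                function(pieces) pieces[3],
                character(1))

# Keep only the first four characters of each extracted word.
first_four <- substr(third, 1, 4)

print(third)
print(first_four)
```

Using fixed = TRUE treats the separator as a literal string rather than a regular expression, which is sufficient for a single underscore.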

How to create a regex expression to get a substring between 2 pipes

I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:
ENST00000000233.10|ENSG00000004059.11|OTTHUMG000
I want to get the text between the first and second pipes, that being ENSG00000004059.11. I've tried several different regex expressions, but I can't really figure out the correct syntax. What should the correct regex expression be?
Here is a regex.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
#> [1] "ENSG00000004059.11"
Created on 2022-05-03 by the reprex package (v2.0.1)
Explanation:
^ beginning of string;
[^\\|]* not the pipe character zero or more times;
\\| the pipe character needs to be escaped since it's a meta-character;
^[^\\|]*\\| the 3 above combined mean to match anything but the pipe character at the beginning of the string zero or more times until a pipe character is found;
([^\\|]+) group match anything but the pipe character at least once;
\\|.*$ the second pipe plus anything until the end of the string.
Then replace the 1st (and only) group with itself, "\\1", thus removing everything else.
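The pieces explained above can be tried out on more than one string at once. Inside a bracket expression the pipe is already literal, so the sketch below writes [^|] and [|] instead of the escaped forms; this is only an equivalent spelling, and the second test string is an assumption added for illustration.

```r
ids <- c("ENST00000000233.10|ENSG00000004059.11|OTTHUMG000",
         "a|b|c|d")

# ^[^|]*   everything before the first pipe
# [|]      the first pipe itself
# ([^|]+)  the field we want, captured as group 1
# [|].*$   the second pipe and the rest of the string
second_field <- sub("^[^|]*[|]([^|]+)[|].*$", "\\1", ids)
print(second_field)
```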
Another option is to get the second item after splitting the string on |.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
strsplit(x, "\\|")[[1]][[2]]
# strsplit(x, "[|]")[[1]][[2]]
# [1] "ENSG00000004059.11"
Or with tidyverse:
library(tidyverse)
str_split(x, "\\|") %>% map_chr(`[`, 2)
# [1] "ENSG00000004059.11"
Alternatively, use lookahead and lookbehind to extract the string that is enclosed by two "|" characters.
The regex means: match one or more characters, as few as possible (.+?), that are preceded by "|" ((?<=\\|)) and followed by "|" ((?=\\|)).
library(stringr)
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
str_extract(x, "(?<=\\|).+?(?=\\|)")
[1] "ENSG00000004059.11"
Try this: \|.*\| or, in R, \\|.*\\| since the backslashes themselves need to be escaped. (It is an escaped pipe, followed by any character (.) repeated any number of times (*), followed by another escaped pipe.) Because .* is greedy, the match runs up to the last pipe in the string; if there can be more than two pipes, use the lazy .*? instead.
Extract the match, then wrap it in str_sub(MyString, 2, -2) to get rid of the pipes if you don't want them.
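The difference between the greedy and the lazy version only shows up when the string has more than two pipes. Here is a small base R sketch of that case; the test string "a|b|c|d" is an assumption chosen for illustration.

```r
x <- "a|b|c|d"

# Greedy: .* runs from the first pipe to the last pipe.
greedy <- regmatches(x, regexpr("\\|.*\\|", x))

# Lazy: .*? stops at the first pipe that follows.
lazy <- regmatches(x, regexpr("\\|.*?\\|", x, perl = TRUE))

# Drop the surrounding pipes, the base R counterpart of str_sub(s, 2, -2).
between <- substr(lazy, 2, nchar(lazy) - 1)

print(greedy)
print(lazy)
print(between)
```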

Add symbol between the letter S and any number in a column dataframe

I am trying to add a - between letter S and any number in a column of a data frame. So, this is an example:
VariableA
TRS34
MMH22
GFSR104
GS23
RRTM55
P3
S4
My desired output is:
VariableA
TRS-34
MMH22
GFSR104
GS-23
RRTM55
P3
S-4
I was trying to use gsub:
gsub('^([a-z])-([0-9]+)$','\\1d\\2',myDF$VariableA)
but this is not working.
How can I solve this?
Thanks!
Your ^([a-z])-([0-9]+)$ regex attempts to match strings that start with a single lowercase letter, followed by a - and then one or more digits up to the end of the string. This can't work: there are no hyphens in the strings yet, since the hyphen is what you want to introduce.
You can use
gsub('(S)([0-9])', '\\1-\\2', myDF$VariableA)
The (S)([0-9]) regex matches and captures S into Group 1 (\1) and then any digit is captured into Group 2 (\2) and the replacement pattern is a concatenation of group values with a hyphen in between.
If there is only one substitution expected, replace gsub with sub.
Other variations:
gsub('(S)(\\d)', '\\1-\\2', myDF$VariableA) # \d also matches digits
gsub('(?<=S)(?=\\d)', '-', myDF$VariableA, perl=TRUE) # Lookarounds make backreferences redundant
Here is the version I like, using gsub:
myDF$VariableA <- gsub('S(\\d)', 'S-\\1', myDF$VariableA)
This requires using only one capture group.
Using stringr package
library(stringr)
str_replace_all(myDF$VariableA, 'S(\\d)', 'S-\\1')
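To try the substitution on the data from the question, a self-contained sketch could look as follows. The data frame is rebuilt here from the values listed in the question; only base R is used.

```r
myDF <- data.frame(
  VariableA = c("TRS34", "MMH22", "GFSR104", "GS23", "RRTM55", "P3", "S4"),
  stringsAsFactors = FALSE
)

# Insert a hyphen only where an S is immediately followed by a digit.
# "GFSR104" is untouched because its S is followed by R, not by a digit.
myDF$VariableA <- gsub("S([0-9])", "S-\\1", myDF$VariableA)
print(myDF)
```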
You could also use lookbehinds if you set perl=TRUE:
gsub('(?<=S)([0-9]+)', '-\\1', myDF$VariableA, perl=TRUE)
[1] "TRS-34" "MMH22" "GFSR104" "GS-23" "RRTM55" "P3" "S-4"

Extract a certain pattern string from the text by R

I have a column of texts that look like this:
str1 = "ABCID 123456789 is what I'm looking for, could you help me to check this Item's status?"
I want to use the gsub function in R to extract "ABCID 123456789" from it. The number may differ from text to text, but ABCID is constant. Does anyone know a solution for this? Thanks very much!
We can use str_extract to select the fixed word followed by space and one or more numbers (\\d+)
library(stringr)
str_extract(df1$col1, "ABCID \\d+")
If there are multiple instances, use str_extract_all
str_extract_all(df1$col1, "ABCID \\d+")
NOTE: The OP wants to extract "ABCID 123456789" from the text.
If the number has a constant length (9 digits), you could use a positive lookbehind:
sub("(?<=ABCID \\d{9}).*", "", str1, perl = TRUE)
# [1] "ABCID 123456789"
Match the beginning of the string (^), the leading letters (ABCID), a space, the digits (\d+) and everything else (.*), and replace it all with the captured portion, i.e. the portion within parentheses. Note that we want to use sub, not gsub, here because there is only one substitution.
sub("^(ABCID \\d+).*", "\\1", str1)
## [1] "ABCID 123456789"
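One thing to keep in mind with the sub approach is that a string without a match comes back unchanged, whereas regmatches simply drops it. The sketch below illustrates this in base R; the second, non-matching text is an assumption added for the comparison.

```r
str1 <- "ABCID 123456789 is what I'm looking for, could you help me to check this Item's status?"
texts <- c(str1, "no identifier in this text")

# sub replaces the whole string by the group when the pattern matches ...
hit <- sub("^(ABCID [0-9]+).*", "\\1", texts[1])
# ... and returns the input untouched when it does not.
miss <- sub("^(ABCID [0-9]+).*", "\\1", texts[2])

# regmatches keeps only the strings that actually contain a match.
extracted <- regmatches(texts, regexpr("ABCID [0-9]+", texts))

print(hit)
print(miss)
print(extracted)
```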

Extracting part of string by position in R

I have a vector of strings which look like this
ABC_EFG_HIG_ADF_AKF_MNB
Now from each of these elements I want to extract the 3rd set of characters (from the left), i.e. in this case HIG. How can I achieve this in R?
substr extracts a substring by position:
substr('ABC_EFG_HIG_ADF_AKF_MNB', 9, 11)
returns
[1] "HIG"
Here's one more possibility:
strsplit(str1,"_")[[1]][3]
#[1] "HIG"
The command strsplit() does what its name suggests: it splits a string. The second parameter is the character on which the string is split, wherever it is found within the string.
Perhaps somewhat surprisingly, strsplit() returns a list. So we can either use unlist() to access the resulting split parts of the original string, or in this case address them with the index of the list [[1]] since the list in this example has only one member, which consists of six character strings (cf. the output of str(strsplit(str1,"_"))).
To access the third entry of this list, we can specify [3] at the end of the command.
The string str1 is defined here as in the answer by @akrun.
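The list structure described above can be inspected directly. This is a small sketch of the two ways of reaching the third piece; the variable names are illustrative.

```r
str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"

parts <- strsplit(str1, "_")
str(parts)   # a list with one element: a character vector of six pieces

# Either index into the first list element ...
third_a <- parts[[1]][3]
# ... or flatten the list first (only sensible for a single input string).
third_b <- unlist(parts)[3]

print(third_a)
print(third_b)
```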
We can use sub. We match one or more characters that are not _ ([^_]+) followed by a _, and keep this in a capture group. As we want to extract the third set of non-_ characters, we repeat that group 2 times ({2}), followed by another capture group of one or more non-_ characters, and then the rest of the characters, indicated by .*. In the replacement, we use the backreference to the second capture group (\\2).
sub("^([^_]+_){2}([^_]+).*", "\\2", str1)
#[1] "HIG"
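The same idea can be generalised to any position by building the repeat count into the pattern. The helper nth_field below is hypothetical and only a sketch; it uses a non-capturing group ((?:...)) with perl = TRUE, so the wanted field is always group 1.

```r
# Hypothetical helper: return the n-th underscore-separated field of x.
nth_field <- function(x, n) {
  # Skip n - 1 fields (each followed by "_"), then capture the next field.
  pattern <- sprintf("^(?:[^_]+_){%d}([^_]+).*", n - 1)
  sub(pattern, "\\1", x, perl = TRUE)
}

str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
first <- nth_field(str1, 1)
third <- nth_field(str1, 3)
sixth <- nth_field(str1, 6)

print(c(first, third, sixth))
```

As with any sub-based extraction, a string with fewer than n fields is returned unchanged, so the input may need checking first.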
Or another option is with scan
scan(text=str1, sep="_", what="", quiet=TRUE)[3]
#[1] "HIG"
A similar option to the one mentioned by @RHertel would be to use read.table/read.csv on the string
read.table(text=str1,sep = "_", stringsAsFactors=FALSE)[,3]
data
str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
If you know the position of the pattern you are looking for, and you know that it is fixed (here, from the 9th to the 11th character), you can simply use str_sub() from the stringr package.
library(stringr)
MyString = 'ABC_EFG_HIG_ADF_AKF_MNB'
str_sub(MyString, 9, 11)
A newer option is the function str_split_i from stringr (added in version 1.5.0, which was still the development version when this was written). It splits a string on a pattern and extracts the piece at a given position. Here is a reproducible example:
# devtools::install_github("tidyverse/stringr")
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 3)
#> [1] "HIG"
Created on 2022-09-10 with reprex v2.0.2
As you can see, it extracted the third string. If you want the 6th, you can replace the 3 with 6 like this:
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 6)
#> [1] "MNB"
Created on 2022-09-10 with reprex v2.0.2

Resources