Extracting part of string by position in R

I have a vector of strings whose elements look like this:
ABC_EFG_HIG_ADF_AKF_MNB
From each element I want to extract the third underscore-separated piece (counting from the left), i.e. HIG in this case. How can I achieve this in R?

substr extracts a substring by position:
substr('ABC_EFG_HIG_ADF_AKF_MNB', 9, 11)
returns
[1] "HIG"

Here's one more possibility:
strsplit(str1,"_")[[1]][3]
#[1] "HIG"
The command strsplit() does what its name suggests: it splits a string. The second parameter is the character on which the string is split, wherever it is found within the string.
Perhaps somewhat surprisingly, strsplit() returns a list. So we can either use unlist() to access the resulting split parts of the original string, or in this case address them with the index of the list [[1]] since the list in this example has only one member, which consists of six character strings (cf. the output of str(strsplit(str1,"_"))).
To access the third entry of this list, we can specify [3] at the end of the command.
The string str1 is defined here as in the answer by @akrun.
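Because the question mentions a vector of strings, the strsplit() approach can be applied to every element at once. This is a minimal sketch; the second input element and the names strings, parts and third are made up for illustration.

```r
# Example input; the second element is invented for illustration
strings <- c("ABC_EFG_HIG_ADF_AKF_MNB", "AAA_BBB_CCC_DDD")

# strsplit() returns a list with one character vector per input string
parts <- strsplit(strings, "_", fixed = TRUE)

# Take the third piece of each element; p[3] is NA if there are fewer than three pieces
third <- vapply(parts, function(p) p[3], character(1))
third
#[1] "HIG" "CCC"
```

Using vapply() with character(1) makes the result a plain character vector of the same length as the input.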

We can use sub. We match one or more characters that are not _ ([^_]+) followed by a _, and keep that in a capture group. As we want to extract the third set of non-_ characters, we repeat the previously enclosed group 2 times ({2}), followed by another capture group of one or more non-_ characters, and then the rest of the characters, indicated by .*. In the replacement, we use the backreference for the second capture group (\\2).
sub("^([^_]+_){2}([^_]+).*", "\\2", str1)
#[1] "HIG"
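The same idea can be generalised to an arbitrary position by building the repeat count into the pattern. The helper nth_field below is hypothetical (it is not part of base R), and it is a variation on the pattern above: it uses a non-capturing group and [^_]* so that empty fields are allowed.

```r
# Hypothetical helper: return the n-th "_"-separated field of each string
nth_field <- function(x, n) {
  # Skip n - 1 fields (each followed by "_"), capture the next one, drop the rest
  pattern <- sprintf("^(?:[^_]*_){%d}([^_]*).*$", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
nth_field(str1, 3)
#[1] "HIG"
nth_field(str1, 6)
#[1] "MNB"
```

A string with fewer than n fields does not match the pattern, so sub() returns it unchanged; check for that case if it can occur in your data.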
Or another option is with scan
scan(text=str1, sep="_", what="", quiet=TRUE)[3]
#[1] "HIG"
A similar option, as mentioned by @RHertel, would be to use read.table/read.csv on the string
read.table(text=str1,sep = "_", stringsAsFactors=FALSE)[,3]
data
str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"

If you know the position of the substring you are looking for, and you know that it is fixed (here, from the 9th to the 11th character), you can simply use str_sub() from the stringr package.
library(stringr)
MyString = 'ABC_EFG_HIG_ADF_AKF_MNB'
str_sub(MyString, 9, 11)

A newer option is the function str_split_i from the development version of stringr, which splits a string on a given separator and returns the piece at a given position. Here is a reproducible example:
# devtools::install_github("tidyverse/stringr")
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 3)
#> [1] "HIG"
Created on 2022-09-10 with reprex v2.0.2
As you can see, it extracted the third piece. If you want the sixth, replace the 3 with 6 like this:
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 6)
#> [1] "MNB"
Created on 2022-09-10 with reprex v2.0.2

Related

find position of bracket in a string

I'm looking to find the position of a bracket within a string.
mystring <- "VAR_c(1:9)_XYZ"
I'd like to find the position of "(".
You could use gregexpr to find the bracket (note that it has to be escaped with \\) and unlist the result. If the string contains several brackets, all positions are returned:
mystring <- "VAR_c(1:9)_XYZ"
unlist(gregexpr('\\(', mystring))
#> [1] 6
Example with a second bracket, showing that all positions are returned:
mystring2 <- "VAR_c(1:9)_XYZ("
unlist(gregexpr('\\(', mystring2))
#> [1] 6 15
Created on 2023-02-16 with reprex v2.0.2
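If escaping the bracket feels error-prone, base R's regexpr() and gregexpr() also accept fixed = TRUE, which treats the pattern as a literal string. This is a small sketch along those lines; the variable names first_pos and all_pos are made up here.

```r
mystring <- "VAR_c(1:9)_XYZ"

# Position of the first "(", matched literally (no regex escaping needed)
first_pos <- as.integer(regexpr("(", mystring, fixed = TRUE))
first_pos
#[1] 6

# Positions of every "(" in a string with two brackets
all_pos <- as.integer(gregexpr("(", "VAR_c(1:9)_XYZ(", fixed = TRUE)[[1]])
all_pos
#[1]  6 15
```

Both functions return -1 when the pattern is not found, so that value is worth checking before the position is used as an index.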
Split the string into a vector of single characters, then use grep to find the character "(" (which has to be escaped, hence the \\) in this vector.
grep("\\(", strsplit(mystring, "")[[1]])
As an alternative to the answers above, if you want to find multiple occurrences you can use stri_locate_all() from the stringi package:
stringi::stri_locate_all(regex = "\\(", "VAR_c(1:9)_(XYZ")
or faster for simple patterns like yours above:
stringi::stri_locate_all(fixed = "(", "VAR_c(1:9)_(XYZ")
You could look for a sub-string that ends with "(" and then count how long that string is using nchar(). The (.*\\() matches a string of any characters that ends with an open bracket. The .* after indicates that there may be other characters following the open bracket, but that those should not be captured. The gsub() function does replacement, so what you're really doing is replacing the full string with the sub-string that ends with an open bracket. Using "\\1" in the replacement argument means you want to replace the string with the match of the sub-string in the first set of un-escaped brackets, in this case .*\\(.
mystring <- "VAR_c(1:9)_XYZ"
nchar(gsub("(.*\\().*", "\\1", mystring))
#> [1] 6
Created on 2023-02-16 by the reprex package (v2.0.1)

How to create a regex expression to get a substring between 2 pipes

I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:
ENST00000000233.10|ENSG00000004059.11|OTTHUMG000
I want to get the text between the first and second pipes, that being ENSG00000004059.11. I've tried several different regex expressions, but I can't really figure out the correct syntax. What should the correct regex expression be?
Here is a regex.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
#> [1] "ENSG00000004059.11"
Created on 2022-05-03 by the reprex package (v2.0.1)
Explanation:
^ beginning of string;
[^\\|]* not the pipe character zero or more times;
\\| the pipe character needs to be escaped since it's a meta-character;
^[^\\|]*\\| the 3 above combined mean to match anything but the pipe character at the beginning of the string zero or more times until a pipe character is found;
([^\\|]+) group match anything but the pipe character at least once;
\\|.*$ the second pipe plus anything until the end of the string.
Then replace the 1st (and only) group with itself, "\\1", thus removing everything else.
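One caveat of the sub() approach explained above is that a string which does not match is returned unchanged. As a possible alternative, here is a sketch that pulls out the capture group with regexec() and regmatches() and yields NA for non-matching input. The second element of x and the name between_pipes are invented for illustration.

```r
# Second element is a made-up string without pipes
x <- c("ENST00000000233.10|ENSG00000004059.11|OTTHUMG000", "no pipes here")

# regexec() records the full match and each capture group
m <- regmatches(x, regexec("^[^|]*\\|([^|]+)\\|", x))

# Element 2 of each match is the capture group; no match gives character(0)
between_pipes <- vapply(
  m,
  function(g) if (length(g) >= 2) g[2] else NA_character_,
  character(1)
)
between_pipes
#[1] "ENSG00000004059.11" NA
```

For comparison, sub() with the pattern from the answer leaves "no pipes here" as it is.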
Another option is to get the second item after splitting the string on |.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
strsplit(x, "\\|")[[1]][[2]]
# strsplit(x, "[|]")[[1]][[2]]
# [1] "ENSG00000004059.11"
Or with tidyverse:
library(tidyverse)
str_split(x, "\\|") %>% map_chr(`[`, 2)
# [1] "ENSG00000004059.11"
Maybe use regex lookahead and lookbehind to extract the string that is surrounded by two "|".
The regex means: match one or more characters, as few as possible (.+?), that are preceded by "|" ((?<=\\|)) and followed by "|" ((?=\\|)).
library(stringr)
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
str_extract(x, "(?<=\\|).+?(?=\\|)")
[1] "ENSG00000004059.11"
Try this: \|.*\| or, in R, \\|.*\\| since the backslash itself has to be escaped in an R string. (It is an escaped pipe, followed by any character (.) repeated any number of times (*), followed by another escaped pipe.)
Then wrap in str_sub(MyString, 2, -2) to get rid of the pipes if you don't want them.
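Note that .* is greedy, so the pattern above runs from the first pipe to the last one. That is fine for the example, which contains exactly two pipes, but not for strings with more. The sketch below uses a made-up input y and base R's substr() in place of str_sub() to show the difference from the lazy form .*?.

```r
# Made-up input with three pipes
y <- "a|b|c|d"

# Greedy: spans from the first pipe to the last pipe
greedy <- regmatches(y, regexpr("\\|.*\\|", y, perl = TRUE))
greedy
#[1] "|b|c|"

# Lazy: stops at the first pipe that closes the match
lazy <- regmatches(y, regexpr("\\|.*?\\|", y, perl = TRUE))
lazy
#[1] "|b|"

# Strip the surrounding pipes
inner <- substr(lazy, 2, nchar(lazy) - 1)
inner
#[1] "b"
```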

Regex replace everything after first digit including digit with another string

I have some strings with this pattern: abc_def_10_cat_dog and I want to use gsub to replace everything after abc_def with _diff. So in the end it should be abc_def_diff. What regex expression would I need to do this? Another way of thinking about it could be "How do I keep all the values before the first digit and then add _diff?" I am using dplyr, gsub, on R.
I know '^[^0-9]*' keeps what I want, but I'm not sure how to just keep those characters and drop the stuff afterwards. I tried using str_extract as well but it kept saying 'object not found'. My object is just a list of names.
object <- df %>% select(vars(ends_with("10_cat_dog")) %>% names()
You can match an underscore followed by a digit and any characters up to the end of the string, and then replace the match with _diff:
library(stringr)
str_replace("abc_def_10_cat_dog", "_\\d.*$", "_diff")
#> [1] "abc_def_diff"
Created on 2020-10-30 by the reprex package (v0.3.0)
You can use base R sub :
x <- 'abc_def_10_cat_dog'
sub('\\d+.*', 'diff', x)
#[1] "abc_def_diff"
Another option with sub
sub("[0-9]+.*", "diff", x)
#[1] "abc_def_diff"
data
x <- 'abc_def_10_cat_dog'
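The pattern '^[^0-9]*' from the question can also be used directly. Here is a sketch of two ways to keep what it matches and append the new suffix; the variable name prefix is made up here.

```r
x <- "abc_def_10_cat_dog"

# Extract everything before the first digit, then append the suffix
prefix <- regmatches(x, regexpr("^[^0-9]*", x))
prefix
#[1] "abc_def_"
paste0(prefix, "diff")
#[1] "abc_def_diff"

# Same result in one step: capture the prefix and drop the rest
sub("^([^0-9]*).*$", "\\1diff", x)
#[1] "abc_def_diff"
```

The prefix already ends with an underscore, which is why the suffix added here is diff rather than _diff.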

Regular expression in R - extract only match

My strings look as follows:
crb_gdp_g_100000_16_16_ftv_all.txt
crb_gdp_g_100000_16_20_fweo2_all.txt
crb_gdp_g_100000_4_40_fweo2_galt_1.txt
I only want to extract the part between f and the following underscore (in these three cases "tv", "weo2" and "weo2").
My regular expression is:
regex.f = "_f([[:alnum:]]+)_"
There is no string with more than one part matching the pattern. Why does the following command not work?
sub(regex.f, "\\1", "crb_gdp_g_100000_16_16_ftv_all.txt")
The command only removes "_f" from the string and returns the remaining string.
This can easily be achieved with qdapRegex:
df <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
library(qdapRegex)
rm_between(df, "_f", "_", extract=TRUE)
We can use sub to extract the strings by matching the character f followed by one or more characters that are not an underscore or a digit ([^_0-9]+), captured as a group ((...)), followed by 0 or more digits (\\d*), followed by an _ and other characters. Replace with the backreference (\\1) of the captured group. Note that this drops the trailing digit, so "weo2" comes out as "weo".
sub(".*_f([^_0-9]+)\\d*_.*", "\\1", str1)
#[1] "tv" "weo" "weo"
data
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
My usual regex for extracting the text between two characters comes from https://stackoverflow.com/a/13499594/1017276, which specifically looks at extracting text between parentheses. This approach only changes the parentheses to f and _.
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
"crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")
regmatches(x,gregexpr("(?<=_f).*?(?=_)", x, perl=TRUE))
Or with the stringr package.
library(stringr)
str_extract(x, "(?<=_f).*?(?=_)")
edited to start the match on _f instead of f.
NOTE
akrun's answer runs a few milliseconds faster than the stringr approach, and about ten times faster than the base approach. The base approach clocks in at about 100 milliseconds for a character vector of 10,000 elements.
update: capture match using str_match
library(stringr)
m <- str_match("crb_gdp_g_100000_16_20_fweo2_all.txt", "_f([[:alnum:]]+)_")
print(m[[2]])
# weo2
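Your regex does not work because it is missing the leading and trailing .* needed to match (and therefore remove) the rest of the string. You can also use \w as a shorthand; note that \w also matches the underscore, which [[:alnum:]] does not, hence the lazy +? below: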
sub(".*_f(\\w+?)_.*", "\\1", "crb_gdp_g_100000_16_20_fweo2_all.txt")
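To see why the original command returns most of the string, recall that sub() only replaces the part of the string that the pattern matches and leaves everything else in place. This is a small sketch comparing the unanchored pattern from the question with a version padded with .* on both sides; the names s, partial and full are made up.

```r
s <- "crb_gdp_g_100000_16_16_ftv_all.txt"

# Only "_ftv_" is matched, so only that part is replaced by "tv"
partial <- sub("_f([[:alnum:]]+)_", "\\1", s)
partial
#[1] "crb_gdp_g_100000_16_16tvall.txt"

# With .* on both sides the whole string is matched and replaced by the group
full <- sub(".*_f([[:alnum:]]+)_.*", "\\1", s)
full
#[1] "tv"
```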
We could use the package unglue:
library(unglue)
txt <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
pattern <-
"crb_gdp_g_100000_{=\\d+}_{=\\d+}_f{x}_{=.+?}.txt"
unglue_vec(txt,pattern)
#> [1] "tv" "weo2" "weo2"
Created on 2019-10-09 by the reprex package (v0.3.0)

selective removal of characters following a pattern using R

How to selectively remove characters from a string following a pattern?
I wish to remove the 7 figures and the preceding colon.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function does string replacement and replaces all matches found in the string (see ?gsub). An alternative, if you want something with a friendlier name, is str_replace_all from the stringr package.
The regular expression uses \\d{n}, which matches exactly n consecutive digits. So \\d{1} matches a single digit and \\d{6} matches a run of six digits.
gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"
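The two patterns above give the same result on the example, but they differ once the numbers no longer have exactly one digit before and six after the decimal point. A sketch with a made-up input x2 shows this:

```r
# Made-up input whose numbers have varying lengths
x2 <- "(A:0.5,B:12.25)C:3"

# Fixed-width pattern: nothing matches, so the string is returned unchanged
strict <- gsub("[:]\\d{1}[.]\\d{6}", "", x2)
strict
#[1] "(A:0.5,B:12.25)C:3"

# Flexible pattern: a colon followed by any run of digits and dots
loose <- gsub("[:][0-9.]+", "", x2)
loose
#[1] "(A,B)C"
```

Which one is preferable depends on whether the fixed number format is a guarantee of the data or just a property of the example.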
Another approach to solve this problem
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"
