How to extract parts from a string - r

I have a string called PATTERN:
PATTERN <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
I would like to parse it with a pattern-matching function such as grep or sub to obtain a string variable MODEL equal to "Name.model", a string variable OUTCOME equal to "any.outcome", and an integer variable IMP equal to number.
If MODEL, OUTCOME and IMP were all integers, I could get the values using function sub:
PATTERN <- "MODEL_002-OUTCOME_007-IMP_001"
pattern_build <- "MODEL_([0-9]+)-OUTCOME_([0-9]+)-IMP_([0-9]+)"
MODEL <- as.integer(sub(pattern_build, "\\1", PATTERN))
OUTCOME <- as.integer(sub(pattern_build, "\\2", PATTERN))
IMP <- as.integer(sub(pattern_build, "\\3", PATTERN))
Do you have any idea of how to match the string contained in variable PATTERN?
Possible tricky patterns are:
PATTERN <- "MODEL_PS2-OUTCOME_stroke_i-IMP_001"
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"

A solution which is also able to deal with the 'tricky' patterns:
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"
lst <- strsplit(PATTERN, '([A-Z]+_)')[[1]][2:4]
lst <- sub('-$','',lst)
which gives:
> lst
[1] "linear-model" "stroke_i" "001"
And if you want that in a dataframe:
df <- as.data.frame.list(lst)
names(df) <- c('MODEL','OUTCOME','IMP')
which gives:
> df
MODEL OUTCOME IMP
1 linear-model stroke_i 001
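If several such strings need parsing at once and IMP should end up as an integer, the same split can be applied to a whole vector. A minimal sketch, assuming every string carries exactly the three MODEL_/OUTCOME_/IMP_ fields and that no value contains a run of uppercase letters followed by an underscore (the split pattern would consume that too); the names patterns and parts are only illustrative:

```r
# Two of the example strings from the question
patterns <- c("MODEL_PS2-OUTCOME_stroke_i-IMP_001",
              "MODEL_linear-model-OUTCOME_stroke_i-IMP_001")

# Split on the uppercase prefixes, keep pieces 2 to 4, drop the trailing "-"
parts <- lapply(strsplit(patterns, "[A-Z]+_"),
                function(p) sub("-$", "", p[2:4]))

# One row per input string
df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(df) <- c("MODEL", "OUTCOME", "IMP")

# IMP was requested as an integer ("001" becomes 1L)
df$IMP <- as.integer(df$IMP)
df
```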

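A minimal-regex approach (shown for the "MODEL_PS2-..." pattern; because it splits on every '-', it does not handle a model name that itself contains a hyphen):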
sapply(strsplit(PATTERN, '-'), function(i) sub('(.*?_){1}', '', i))
# [,1]
#[1,] "PS2"
#[2,] "stroke_i"
#[3,] "001"

You may use a pattern with capturing groups matching any chars, as few as possible between known delimiting substrings:
MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)
Note that the last .* is greedy, so it captures all the rest of the string.
You may tighten this pattern to allow only the expected characters (say, to match only digits in the last capturing group, use ([0-9]+) rather than (.*)).
Use it with, say, str_match from stringr:
> library(stringr)
> x <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
> res <- str_match(x, "MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)")
> res[,2]
[1] "Name.model"
> res[,3]
[1] "any.outcome"
> res[,4]
[1] "number"
>
A base R solution using the same regex can combine regmatches with regexec:
> res <- regmatches(x, regexec("MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)", x))[[1]]
> res[2]
[1] "Name.model"
> res[3]
[1] "any.outcome"
> res[4]
[1] "number"
>
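To get the three variables the question asks for, with IMP as an integer, the capture-group pattern can be wrapped in a small helper. This is only a sketch: parse_pattern is a hypothetical name, the last group is tightened to digits as suggested above, and perl = TRUE is my own choice so that the lazy quantifiers follow PCRE rules.

```r
# Hypothetical helper: returns MODEL, OUTCOME (character) and IMP (integer)
parse_pattern <- function(x) {
  rx <- "^MODEL_(.*?)-OUTCOME_(.*?)-IMP_([0-9]+)$"
  m <- regmatches(x, regexec(rx, x, perl = TRUE))[[1]]
  # regmatches yields character(0) when the pattern does not match
  if (length(m) == 0) stop("string does not follow the expected layout")
  list(MODEL = m[2], OUTCOME = m[3], IMP = as.integer(m[4]))
}

# One of the 'tricky' inputs from the question
res <- parse_pattern("MODEL_linear-model-OUTCOME_stroke_i-IMP_001")
res$MODEL    # "linear-model"
res$OUTCOME  # "stroke_i"
res$IMP      # 1
```

Anchoring with ^ and $ makes a malformed string fail loudly instead of silently returning the input, which is what sub does when nothing matches.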

Related

How to get the number between two characters in R

I have a vector a.
I want to extract the number between PUBMED and \nREFERENCE, which here is 32634600.
I don't know how to code it using str_extract().
a = "234 4dfd 123PUBMED 32634600\nREFERENCE"
# expected output is 32634600
Using a lookbehind and stringr:
library(stringr)
str_extract_all(a, "(?<=PUBMED )[0-9]+")
[[1]]
[1] "32634600"
We can use sub() here with a capture group:
a <- "234 4dfd 123PUBMED 32634600\nREFERENCE"
num <- sub(".*PUBMED\\s*(\\d+)\\s*\\bREFERENCE\\b.*", "\\1", a)
num
[1] "32634600"
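The same lookbehind also works without stringr. A minimal base R sketch, assuming (as in the example) exactly one space between PUBMED and the number; pubmed_id is just an illustrative name:

```r
a <- "234 4dfd 123PUBMED 32634600\nREFERENCE"

# perl = TRUE is required for the lookbehind (?<=...)
pubmed_id <- regmatches(a, regexpr("(?<=PUBMED )[0-9]+", a, perl = TRUE))
pubmed_id              # "32634600"

# Convert if a number rather than a string is needed
as.numeric(pubmed_id)  # 32634600
```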

how to extract part of a string matching pattern with separation in r

I'm trying to extract part of a file name that matches a set of letters with variable length. The file names consist of several parameters separated by "_", but they vary in the number of parts. I'm trying to pull some of the parameters out to use separately.
Example file names:
a = "Vel_Mag_ft_modelExisting_350cfs_blah3.tif"
b = "Depth_modelDesign_11000cfs_blah2.tif"
I'm trying to pull out the parts that start with "model" so I end up with
"modelExisting"
"modelDesign"
The filenames are stored as a variable in a data.frame
I've tried
library(tidyverse)
tibble(files = c(a,b))%>%
mutate(attempt1 = str_extract(files, "model"),
attempt2 = str_match(str_split(files, "_"), "model"))
but just ended up with the "model" in all cases and not the "model...." that I need.
The pieces I need are a consistent number of pieces from the end, but I couldn't figure out how to specify that either. I tried
str_split(files, "_")[-3]
but this threw an error that it must be size 480 or 1 not size 479
We can write a function that captures the run of letters sitting between an underscore and a following underscore plus one or more digits, and returns that captured group through the backreference (\\1) in the replacement:
f1 <- function(x) sub(".*_([[:alpha:]]+)_\\d+.*", "\\1", x)
Testing:
> f1(a)
[1] "modelExisting"
> f1(b)
[1] "modelDesign"
We can use strsplit or regmatches like below
> s <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif", "Depth_modelDesign_11000cfs_blah2.tif")
> lapply(strsplit(s, "_"), function(x) x[which(grepl("^\\d+", x)) - 1])
[[1]]
[1] "modelExisting"
[[2]]
[1] "modelDesign"
> regmatches(s, gregexpr("[[:alpha:]]+(?=_\\d+)", s, perl = TRUE))
[[1]]
[1] "modelExisting"
[[2]]
[1] "modelDesign"
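The two things tried in the question can also be expressed directly. The first attempt returned only "model" because the pattern stopped there; extending it with [^_]* takes everything up to the next underscore. The second idea, counting pieces from the end, can be done by indexing relative to length(). A base R sketch, assuming the wanted piece is always third from the end (by_prefix and by_position are illustrative names):

```r
files <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif",
           "Depth_modelDesign_11000cfs_blah2.tif")

# "model" followed by any characters that are not an underscore
# (note: regmatches drops elements that have no match at all)
by_prefix <- regmatches(files, regexpr("model[^_]*", files))

# Third piece counted from the end of each split file name
by_position <- vapply(strsplit(files, "_", fixed = TRUE),
                      function(p) p[length(p) - 2],
                      character(1))

by_prefix    # "modelExisting" "modelDesign"
by_position  # "modelExisting" "modelDesign"
```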

regex strsplit expression in R so it only applies once to the first occurrence of a specific character in each string?

I have a list filled with strings:
string<- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")
I need to split the strings so they appear like:
"SPG_L", "subgenual_ACC_R", "SPG_R", "MTG_L_pole", "MTG_L_pole", "CerebellumGM_L"
I tried using the following regex expression to split the strings:
str_split(string,'(?<=[[RL]|pole])_')
But this leads to:
"SPG_L", "subgenual" "ACC_R", "SPG_R", "MTG_L", "pole", "MTG_L", "pole", "CerebellumGM_L"
How do I edit the regex expression so it splits each string element at the "_" after the first occurrence of "R", "L" unless the first occurrence of "R" or "L" is followed by "pole", then it splits the string element after the first occurrence of "pole" and only splits each string element once?
I suggest a matching approach using
^(.*?[RL](?:_pole)?)_(.*)
Details:
^ - start of string
(.*?[RL](?:_pole)?) - Group 1:
    .*? - any zero or more chars other than line break chars, as few as possible
    [RL](?:_pole)? - R or L, optionally followed by _pole
_ - an underscore
(.*) - Group 2: any zero or more chars other than line break chars, as many as possible
In R:
library(stringr)
x <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L", "SFG_pole_R_IFG_triangularis_L", "SFG_pole_R_IFG_opercularis_L" )
res <- str_match_all(x, "^(.*?[RL](?:_pole)?)_(.*)")
lapply(res, function(x) x[-1])
Output:
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
[[4]]
[1] "SFG_pole_R" "IFG_triangularis_L"
[[5]]
[1] "SFG_pole_R" "IFG_opercularis_L"
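Another option is to split in two passes: first on the "_" after "pole", then, for strings that did not split, on the "_" after the first R or L:
library(stringr)
# Second pass: split after the first R or L, but only if the first pass did not split
split_again <- function(x) {
  if (length(x) > 1) {
    return(x)
  }
  str_split(string = x, pattern = "(?<=[RL])_", n = 2)
}
# First pass: split once on the "_" that follows "pole"
str_split(string = string, pattern = "(?<=pole)_", n = 2) %>%
  lapply(split_again) %>%
  unlist()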
You could use sub then strsplit as shown:
strsplit(sub("^.*?[LR](?:_pole)?\\K_",":",string,perl=TRUE),":")
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
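For completeness, the matching approach with two capture groups can also be run without stringr. This is a sketch under the assumption that every string contains at least one "R" or "L" followed by an underscore; perl = TRUE is used for the lazy quantifier and the non-capturing group:

```r
string <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")

rx <- "^(.*?[RL](?:_pole)?)_(.*)$"

# regexec returns the full match plus both groups; drop the full match
halves <- lapply(regmatches(string, regexec(rx, string, perl = TRUE)),
                 function(m) m[-1])
halves

# Flatten if a single character vector is wanted
unlist(halves)
```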

Split a string every 5 characters

Suppose I have a long string:
"XOVEWVJIEWNIGOIWENVOIWEWVWEW"
How do I split this to get every 5 characters followed by a space?
"XOVEW VJIEW NIGOI WENVO IWEWV WEW"
Note that the last one is shorter.
I can write a loop that counts and builds a new string character by character, but surely there must be something better, no?
Using regular expressions:
gsub("(.{5})", "\\1 ", "XOVEWVJIEWNIGOIWENVOIWEWVWEW")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
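One caveat with the gsub pattern: when the string length is an exact multiple of 5, a space is also appended after the final group. A lookahead that requires at least one more character avoids that. A small sketch; add_spaces and its n argument are illustrative names, not part of the answer above:

```r
# Insert a space after every n characters, except at the very end
add_spaces <- function(x, n = 5) {
  # (?=.) only lets a group match when another character follows it
  gsub(sprintf("(.{%d})(?=.)", n), "\\1 ", x, perl = TRUE)
}

add_spaces("XOVEWVJIEWNIGOIWENVOIWEWVWEW")  # "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
add_spaces("ABCDEFGHIJ")                    # "ABCDE FGHIJ" (no trailing space)
```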
Using sapply
> string <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
> sapply(seq(from=1, to=nchar(string), by=5), function(i) substr(string, i, i+4))
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
You can try something like the following:
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW" # Original string
l <- seq(from=5, to=nchar(s), by=5) # Calculate the location where to chop
# Add sentinels 0 (beginning of string) and nchar(s) (end of string)
# and take substrings. (Thanks to @flodel for the condensed expression)
mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))
Output:
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
Now you can paste the resulting vector (with collapse=' ') to obtain a single string with spaces.
A stringi solution without *apply:
library(stringi)
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
stri_sub(x, seq(1, stri_length(x), by=5), length=5)
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
This extracts substrings just like in @Jilber's answer, but the stri_sub function is vectorized, so we don't need *apply here.
You can also use substring without a loop; substring is a vectorized counterpart of substr:
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
n <- seq(1, nc <- nchar(x), by = 5)
paste(substring(x, n, c(n[-1]-1, nc)), collapse = " ")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"

R: splitting a string between two characters using strsplit()

Let's say I have the following string:
s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
I would like to recover the strings between ";" and "=" to get the following output:
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
Can I use strsplit() with more than one split element?
1) strsplit with matrix. Split on both ";" and "=", arrange the pieces in a two-row matrix, and take the second row:
> matrix(strsplit(s, "[;=]")[[1]], 2)[2,]
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
2) strsplit with gsub. Remove each key together with its "=" first, then split on ";":
> strsplit(gsub("[^=;]+=", "", s), ";")[[1]]
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
3) strsplit with sub. Split on ";" first, then strip everything up to the "=" in each piece:
> sub(".*=", "", strsplit(s, ";")[[1]])
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
4) strapplyc. This extracts the consecutive non-semicolon characters after each equals sign:
> library(gsubfn)
> strapplyc(s, "=([^;]+)", simplify = unlist)
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
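If the keys are worth keeping as well, the two split steps can build a named vector, so that a value can be looked up by its key. A base R sketch, assuming every field has the form key=value with no "=" or ";" inside the value; pairs and values are illustrative names:

```r
s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"

# First split into fields on ";", then split each field on "="
pairs <- strsplit(strsplit(s, ";", fixed = TRUE)[[1]], "=", fixed = TRUE)

# Second element of each pair is the value, first element is the key
values <- setNames(vapply(pairs, `[`, character(1), 2),
                   vapply(pairs, `[`, character(1), 1))

unname(values)    # the four values, in order
values[["Name"]]  # "hsa-miR-6859-5p"
```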
I know this is an old question, but I found the usage of lookaround regular expressions quite elegant for this problem:
library(stringr)
your_string <- '/this/file/name.txt'
result <- str_extract(string = your_string, pattern = "(?<=/)[^/]*(?=\\.)")
result
In words,
The (?<=...) part looks before the desired string for a... (in this case a forward slash).
The [^/]* then looks for as many characters in a row that are not a forward slash; the lookahead that follows makes it stop before the period, so in this case it matches name rather than name.txt.
The (?=...) then looks after the desired string for a ... (in this case the special period character, which needs to be escaped as \\.).
This also works on dataframes:
library(dplyr)
strings <- c('/this/file/name1.txt', 'tis/other/file/name2.csv')
df <- as.data.frame(strings) %>%
mutate(name = str_extract(string = strings, pattern = "(?<=/)[^/]*(?=\\.)"))
# Optional
names <- df %>% pull(name)
Or, in your case:
your_string <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
result <- str_extract(string = your_string, pattern = "(?<=;Alias=)[^;]*(?=;)")
result # Outputs 'MIMAT0027618'
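Note that the lookahead (?=;) requires a semicolon after the value, so the same pattern would not find the last field (Derives_from), which ends the string. A capture group sidesteps that. A minimal base R sketch; get_field is a hypothetical helper, and it assumes key contains no regex metacharacters:

```r
# Hypothetical helper: value of one key in a "k1=v1;k2=v2" string, NA if absent
get_field <- function(s, key) {
  # The key must start the string or follow a ";"; the value runs to the next ";"
  rx <- paste0("(?:^|;)", key, "=([^;]*)")
  m <- regmatches(s, regexec(rx, s, perl = TRUE))[[1]]
  if (length(m) == 0) NA_character_ else m[2]
}

your_string <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
get_field(your_string, "Alias")         # "MIMAT0027618"
get_field(your_string, "Derives_from")  # "MI0022705"
get_field(your_string, "Missing")       # NA
```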

Resources