How to extract text between two separators in R?

I have a vector of strings like so:
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
I only want the part between the two forward slashes, i.e., "10g" and "6g".

You could sub() here with a capture group:
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
sub(".*/([^/]+)/.*", "\\1", mystr)
[1] "10g" "6g"
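One caveat with the sub() approach is that elements that do not match the pattern are returned unchanged rather than as missing values. The following is a minimal sketch of one way to guard against that; the third element of the input and the choice of NA as the fallback are assumptions added for illustration.

```r
# Third element is a made-up input without any slashes
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv", "no-slashes.csv")
pattern <- ".*/([^/]+)/.*"

# sub() leaves non-matching strings untouched, so test with grepl() first
# and return NA where the pattern does not apply
out <- ifelse(grepl(pattern, mystr), sub(pattern, "\\1", mystr), NA_character_)
out
```

Without the grepl() check, the third element would come back as "no-slashes.csv".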

Similar to Tim Biegeleisen's answer, but with a lookbehind and lookahead, using str_extract from stringr:
library(stringr)
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
str_extract(mystr,"(?<=/)[^/]+(?=/)")
[1] "10g" "6g"

More simply you can capitalize on the fact that the desired substring is one or more digits followed by literal g:
library(stringr)
str_extract(mystr, "\\d+g")
[1] "10g" "6g"

Here are a few alternatives. They use no packages and the first two do not use any regular expressions.
basename(dirname(mystr))
## [1] "10g" "6g"
read.table(text = mystr, sep = "/")[[2]]
## [1] "10g" "6g"
trimws(trimws(mystr,, "[^/]"),, "/")
## [1] "10g" "6g"
We could also reformulate these using the native pipe (R 4.1 or later):
mystr |> dirname() |> basename()
## [1] "10g" "6g"
read.table(text = mystr, sep = "/") |> (`[[`)(2)
## [1] "10g" "6g"
mystr |> trimws(, "[^/]") |> trimws(, "/")
## [1] "10g" "6g"
Note
From the question the input is
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
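Another base R possibility, sketched here under the assumption that the wanted piece is always the second /-separated field, is to split on the slash and pick that field by position. The names parts and second are illustrative only.

```r
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")

# fixed = TRUE splits on a literal "/" without involving the regex engine
parts <- strsplit(mystr, "/", fixed = TRUE)

# Take the second field of each split path
second <- vapply(parts, function(p) p[2], character(1))
second
```

Paths with fewer than two fields would yield NA here, which may or may not be what is wanted.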

Related

Convert string into vector

How can I convert this string into a vector?
"c(HJ229, HJ230, HJ231)"
The desired result is "HJ229" "HJ230" "HJ231".
I have tried using stringr, however the ( causes an issue because of regex.
t <- "c(HJ229, HJ230, HJ231)"
strsplit(str_remove(t, "c"), "(")[[1]]
You need to escape the parentheses to remove them with regex using \\ and provide multiple patterns to match separated by | (or).
library(stringr)
t <- "c(HJ229, HJ230, HJ231)"
str_split(str_remove_all(t, "c|\\(|\\)"), ", ")[[1]]
#> [1] "HJ229" "HJ230" "HJ231"
Created on 2022-02-25 by the reprex package (v2.0.1)
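If escaping feels error-prone, a possible alternative is to switch off regex interpretation with fixed = TRUE, so that ( and ) are treated as ordinary characters. This is only a sketch in base R; the variable names are made up for the example.

```r
txt <- "c(HJ229, HJ230, HJ231)"

# With fixed = TRUE the pattern is matched literally, so no escaping is needed
no_open  <- sub("c(", "", txt, fixed = TRUE)
no_close <- sub(")", "", no_open, fixed = TRUE)

ids <- strsplit(no_close, ", ", fixed = TRUE)[[1]]
ids
```

stringr offers the same idea through its fixed() pattern wrapper.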
Another possible solution:
library(stringr)
s <-"c(HJ229, HJ230, HJ231)"
str_extract_all(s, "[A-Z]{2}\\d{3}")[[1]]
#> [1] "HJ229" "HJ230" "HJ231"
Using base R:
t = "c(HJ229, HJ230, HJ231)"
strsplit(gsub("[c()]", "", t), ", ")[[1]]
[1] "HJ229" "HJ230" "HJ231"
Using stringr:
library(stringr)
str_split(str_remove_all(t, "[c()]"), ", ")[[1]]
[1] "HJ229" "HJ230" "HJ231"
We can try scan with strip.white = TRUE, which drops the space left after each comma:
> scan(text = gsub("c\\((.*)\\)", "\\1", s), what = "", quiet = TRUE, sep = ",", strip.white = TRUE)
[1] "HJ229" "HJ230" "HJ231"

how to change the order of words before and after underscore in R

How can I change the order of the words before and after an underscore?
For example
hello_wor -> wor_hello
hi_everyone -> everyone_hi
We could do:
sub("(.*)_(.*)", "\\2_\\1", str1)
[1] "wor_hello" "everyone_hi"
We can use a regex to do this, i.e. capture the words ((\\w+)) before and after the _ and rearrange the backreferences in the replacement:
sub("^(\\w+)_(\\w+)$", "\\2_\\1", str1)
[1] "wor_hello" "everyone_hi"
data
str1 <- c("hello_wor", "hi_everyone")
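For comparison, here is a regex-free sketch that splits on the underscore and reverses the pieces. The helper name swap_parts is hypothetical, and the sketch assumes that reversing every piece is acceptable when a string contains more than one underscore.

```r
str1 <- c("hello_wor", "hi_everyone")

# Split each string on a literal "_", reverse the pieces, and glue them back
swap_parts <- function(x) {
  parts <- strsplit(x, "_", fixed = TRUE)
  vapply(parts, function(p) paste(rev(p), collapse = "_"), character(1))
}

swapped <- swap_parts(str1)
swapped
```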
With a tidyverse approach:
library(tidyverse)
words <- c("Peter_Gabriel", "Tina_Turner")
map_chr(words, ~ str_extract_all(.x, "\\w+(?=_)|(?<=_)\\w+") %>%
  flatten %>% rev %>% paste0(collapse = "_"))
#> [1] "Gabriel_Peter" "Turner_Tina"

gsub / sub to extract between certain characters

How can I extract the numbers / ID from the following string in R?
link <- "D:/temp/sample_data/0000098618-13-000011.htm"
I want to just extract 0000098618-13-000011
That is discard the .htm and the D:/temp/sample_data/.
I have tried grep and gsub without much luck.
1) basename Use basename followed by sub:
sub("\\..*", "", basename(link))
## [1] "0000098618-13-000011"
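2) file_path_sans_ext Strip the directory with basename first, since file_path_sans_ext only removes the extension:
library(tools)
file_path_sans_ext(basename(link))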
## [1] "0000098618-13-000011"
3) sub
sub(".*/(.*)\\..*", "\\1", link)
## [1] "0000098618-13-000011"
4) gsub
gsub(".*/|\\.[^.]*$", "", link)
## [1] "0000098618-13-000011"
5) strsplit
sapply(strsplit(link, "[/.]"), function(x) tail(x, 2)[1])
## [1] "0000098618-13-000011"
6) read.table. If link is a vector, this will only work if all elements have the same number of /-separated components. It also assumes that the only dot is the one separating the extension.
DF <- read.table(text = link, sep = "/", comment = ".", as.is = TRUE)
DF[[ncol(DF)]]
## [1] "0000098618-13-000011"
Using stringr:
library(stringr)
str_extract(link, "[0-9-]+")
# "0000098618-13-000011"
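The [0-9-]+ pattern returns the first run of digits and hyphens, which would pick up the wrong piece if a directory name contained a digit. A stricter sketch in base R follows; the 10-2-6 digit layout is an assumption inferred from the single example in the question, not a documented format.

```r
link <- "D:/temp/sample_data/0000098618-13-000011.htm"

# Assumed ID layout: 10 digits, 2 digits, 6 digits, separated by hyphens
id_pattern <- "[0-9]{10}-[0-9]{2}-[0-9]{6}"

# regexpr() locates the first match; regmatches() pulls it out
id <- regmatches(link, regexpr(id_pattern, link))
id
```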

Opposite of Hmisc::escapeRegex

The function Hmisc::escapeRegex escapes any special characters in a string.
library(Hmisc)
string <- "this\\(system) {is} [full]."
escapeRegex(string)
It is based on gsub and a regular expression:
escapestring <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
escapestring
[1] "this\\\\\\(system\\) \\{is\\} \\[full\\]\\."
How to remove the backslashes from escapestring so that the original string is retrieved?
You actually only need to keep the character after each \ to un-escape.
string <- "this\\(system) {is} [full]."
library(Hmisc)
gsub("\\\\(.)", "\\1", escapeRegex(string))
#> [1] "this\\(system) {is} [full]."
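To check the idea without loading Hmisc, the round trip can be sketched in base R. The escaping pattern is the one quoted in the question; the function names escape_regex and unescape_regex are made up for this example.

```r
# Escape: prefix every regex metacharacter with a backslash
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# Un-escape: drop each backslash and keep the character that follows it
unescape_regex <- function(x) gsub("\\\\(.)", "\\1", x)

string  <- "this\\(system) {is} [full]."
escaped <- escape_regex(string)
restored <- unescape_regex(escaped)
identical(restored, string)
```

Because an escaped backslash is itself a backslash followed by a backslash, the pair is consumed together and collapses back to a single one.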
Alternatively rex may make both escaping and un-escaping a little simpler.
library(rex)
re_substitutes(escape(string), rex("\\", capture(any)), "\\1", global = TRUE)
#> [1] "this\\(system) {is} [full]."
How about the regex
\\\\([.|()\\^{}+$*?]|\\[|\\])
replacing with capture group \1
Example usage
escapestring <- "this\\\\\\(system\\) \\{is\\} \\[full\\]\\."
string <- gsub("\\\\([.|()\\^{}+$*?]|\\[|\\])", "\\1", escapestring)
string
[1] "this\\(system) {is} [full]."
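Maybe this also helps. Note that it keeps any backslash directly followed by ( and drops all others, so it reproduces this particular string rather than un-escaping in general: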
gsub("\\\\[(](*SKIP)(*F)|\\\\", '', escapestring, perl=TRUE)
#[1] "this\\(system) {is} [full]."

R: splitting a string between two characters using strsplit()

Let's say I have the following string:
s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
I would like to recover the strings between ";" and "=" to get the following output:
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
Can I use strsplit() with more than one split element?
1) strsplit with matrix Try this:
> matrix(strsplit(s, "[;=]")[[1]], 2)[2,]
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
2) strsplit with gsub Or remove each key with gsub and then split on the semicolons:
> strsplit(gsub("[^=;]+=", "", s), ";")[[1]]
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
3) strsplit with sub Or split on the semicolons first and then remove each key with sub:
> sub(".*=", "", strsplit(s, ";")[[1]])
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
4) strapplyc Or use strapplyc, which extracts consecutive non-semicolons after equal signs:
> library(gsubfn)
> strapplyc(s, "=([^;]+)", simplify = unlist)
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
ADDED additional strsplit solutions.
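If the keys are needed as well as the values, one possible extension is to build a named vector. This is a sketch only; the names fields, keys, values and attrs are illustrative.

```r
s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"

# One "key=value" string per field
fields <- strsplit(s, ";", fixed = TRUE)[[1]]

# Everything before the first "=" is the key, everything after it the value
keys   <- sub("=.*", "", fields)
values <- sub("^[^=]*=", "", fields)

attrs <- setNames(values, keys)
attrs[["Name"]]
```

unname(attrs) gives back the plain vector of values asked for in the question.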
I know this is an old question, but I found the usage of lookaround regular expressions quite elegant for this problem:
library(stringr)
your_string <- '/this/file/name.txt'
result <- str_extract(string = your_string, pattern = "(?<=/)[^/]*(?=\\.)")
result
In words,
The (?<=...) part looks before the desired string for ... (in this case a forward slash).
The [^/]* then matches as many consecutive characters as possible that are not a forward slash (here it first grabs name.txt, then backtracks to name so that the lookahead can succeed).
The (?=...) then looks after the desired string for ... (in this case a literal period, which needs to be escaped as \\.).
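The same lookaround pattern can also be tried without stringr. Base R's default regex engine does not support lookarounds, so the sketch below passes perl = TRUE; the variable names m and result_base are illustrative.

```r
your_string <- "/this/file/name.txt"

# perl = TRUE enables lookbehind (?<=...) and lookahead (?=...)
m <- regexpr("(?<=/)[^/]*(?=\\.)", your_string, perl = TRUE)

# regmatches() extracts the matched substring
result_base <- regmatches(your_string, m)
result_base
```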
This also works on data frames:
library(dplyr)
strings <- c('/this/file/name1.txt', 'tis/other/file/name2.csv')
df <- as.data.frame(strings) %>%
  mutate(name = str_extract(string = strings, pattern = "(?<=/)[^/]*(?=\\.)"))
# Optional
names <- df %>% pull(name)
Or, in your case:
your_string <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
result <- str_extract(string = your_string, pattern = "(?<=;Alias=)[^;]*(?=;)")
result # Outputs 'MIMAT0027618'
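The pattern above returns only the Alias value. To recover every value, as the question asks, a lookbehind on the equal sign alone may be enough. The sketch below uses base R with perl = TRUE and assumes that values never contain ; or =.

```r
s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"

# (?<==) requires an "=" just before the match; [^;]+ runs up to the next ";"
all_values <- regmatches(s, gregexpr("(?<==)[^;]+", s, perl = TRUE))[[1]]
all_values
```

With stringr, str_extract_all with the same pattern should behave equivalently.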

Resources