Deleting nth delimiter in R

I am trying to delete the 5th delimiter in this string:
"Bacteria_Firmicutes_Clostridia_Clostridiales_Rumino_coccaceae_Ruminococcus_Ruminococcus_albus"
so it becomes:
"Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
This seems to work, but I feel like there should be a more elegant solution, possibly with regex and str_replace:
library(stringr)
name <- "Bacteria_Firmicutes_Clostridia_Clostridiales_Rumino_coccaceae_Ruminococcus_Ruminococcus_albus"
index <- str_locate_all(name, "_")[[1]]
str_sub(name, index[5, "start"], index[5, "end"]) <- ""
name

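Try gsub (use sub instead if a string may contain ten or more delimiters, since gsub would then also remove the 10th):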
> gsub("((?:[^_]+_){4}[^_]+)_", "\\1", name)
[1] "Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
Or a less "pretty" way:
> gsub("([^_]*_[^_]*_[^_]*_[^_]*_[^_]*)_", "\\1", name)
[1] "Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
Or with the strex library:
> library(strex)
> paste(str_before_nth(name, "_", 5), str_after_nth(name, "_", 5), sep="")
[1] "Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
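If this has to be reused for other delimiters or positions, the same idea can be wrapped in a small base R function. This is only a sketch: remove_nth_delim is a hypothetical helper name, and it assumes the delimiter is a literal string rather than a regex.

```r
# Hypothetical helper: drop the n-th occurrence of a literal delimiter.
# Strings with fewer than n delimiters are returned unchanged.
remove_nth_delim <- function(x, delim = "_", n = 5) {
  vapply(x, function(s) {
    # Start positions of every literal occurrence of delim (-1 if none)
    pos <- gregexpr(delim, s, fixed = TRUE)[[1]]
    if (pos[1] == -1 || length(pos) < n) return(s)
    start <- pos[n]
    # Glue the text before the n-th delimiter to the text after it
    paste0(substr(s, 1, start - 1),
           substr(s, start + nchar(delim), nchar(s)))
  }, character(1), USE.NAMES = FALSE)
}

name <- "Bacteria_Firmicutes_Clostridia_Clostridiales_Rumino_coccaceae_Ruminococcus_Ruminococcus_albus"
fixed_name <- remove_nth_delim(name, "_", 5)
fixed_name
```

Because it works on positions instead of a repeated regex group, only the n-th delimiter is ever touched, however many delimiters follow it.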

Related

Extract numbers that appear last in the string

Is there a way to extract the numbers that appear last in these strings?
asd <- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf654")
asd1 <- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf65421")
Expected output
new_asd
654
new_asd1
65421
This code always extracts the last run of digits in a string:
(stringr::str_extract(asd, stringr::regex("(\\d+)(?!.*\\d)")))
"654"
(stringr::str_extract(asd1, stringr::regex("(\\d+)(?!.*\\d)")))
"65421"
If you want the number only when the very last character of the string is a digit, you could add a simple ifelse condition that checks for that specifically, e.g.:
x <- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf654f")
# as.numeric() on a non-digit character gives NA (with a coercion warning), which the test relies on
ifelse(!is.na(as.numeric(substr(x, nchar(x), nchar(x)))),
(stringr::str_extract(x, stringr::regex("(\\d+)(?!.*\\d)"))),
NA)
NA #returns NA because the last character of the string is not numeric ("f")
A single regex is sufficient for your situation.
stringr::str_extract(asd, "(\\d+$)")
The $ anchors the capture group to the end of the string.
I would use sub combined with ifelse here:
x <- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf654", "abc", "123")
nums <- ifelse(grepl("\\d$", x), sub(".*?(\\d+)$", "\\1", x), "")
nums
[1] "654" "" "123"
You can do something like this:
library(stringr)
val <- str_extract_all(asd1, "\\d+")[[1]]
tail(val, 1)
"65421"
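OR, if the string contains only one number (this strips every non-digit, so for asd it would concatenate both digit runs into 54654):
as.numeric(gsub("[^\\d]+", "", asd, perl=TRUE))
Or, with base R's regmatches: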
val <- regmatches(asd1, gregexpr("[[:digit:]]+", asd1))[[1]]
tail(val, 1)
"65421"
One solution first splits the string on whitespace, then takes the last substring and removes any lowercase letters from it. This should work as long as the strings contain only lowercase letters and digits.
library(stringr)
get_num = function(x) {
str_remove_all(rev(unlist(str_split(x, " ")))[1], "[a-z]")
}
> get_num(asd)
[1] "654"
> get_num(asd1)
[1] "65421"
If the string always ends with digits, we can try gsub:
> x <- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf654", "asdf sfsfsd54 sdfsdfsdf sdfsdfsf65421")
> gsub(".*\\D", "", x, perl = TRUE)
[1] "654" "65421"
We can also use stri_extract_last_regex from stringi
> library(stringi)
> stri_extract_last_regex(asd, "\\d+")
[1] "654"
> stri_extract_last_regex(asd1, "\\d+")
[1] "65421"
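The answers above differ mainly in what they return when the string does not end in a digit, or contains no digit at all. A minimal base R sketch of the two behaviours follows; extract_or_na, last_number and trailing_number are hypothetical helper names, not functions from any package.

```r
# Hypothetical helper: first match of a PCRE pattern per element, NA if none.
extract_or_na <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)   # regmatches() returns matched elements only
  out
}

# Last run of digits anywhere in the string
last_number <- function(x) extract_or_na(x, "\\d+(?!.*\\d)")
# Run of digits only if it sits at the very end of the string
trailing_number <- function(x) extract_or_na(x, "\\d+$")

x <- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf654",
       "asdf sfsfsd54 sdfsdfsdf sdfsdfsf654f",
       "abc")
last_number(x)      # "654" "654" NA
trailing_number(x)  # "654" NA    NA
```

Wrap either result in as.numeric() if a number rather than a string is needed.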

gsub / sub to extract between certain characters

How can I extract the numbers / ID from the following string in R?
link <- "D:/temp/sample_data/0000098618-13-000011.htm"
I want to just extract 0000098618-13-000011
That is discard the .htm and the D:/temp/sample_data/.
I have tried grep and gsub without much luck.
1) basename Use basename followed by sub:
sub("\\..*", "", basename(link))
## [1] "0000098618-13-000011"
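2) file_path_sans_ext This only strips the extension, so combine it with basename to drop the directory as well:
library(tools)
file_path_sans_ext(basename(link))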
## [1] "0000098618-13-000011"
3) sub
sub(".*/(.*)\\..*", "\\1", link)
## [1] "0000098618-13-000011"
4) gsub
gsub(".*/|\\.[^.]*$", "", link)
## [1] "0000098618-13-000011"
5) strsplit
sapply(strsplit(link, "[/.]"), function(x) tail(x, 2)[1])
## [1] "0000098618-13-000011"
6) read.table. If link is a vector this will only work if all elements have the same number of /-separated components. This also assumes that the only dot is the one separating the extension.
DF <- read.table(text = link, sep = "/", comment = ".", as.is = TRUE)
DF[[ncol(DF)]]
## [1] "0000098618-13-000011"
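Using stringr (this returns the first run of digits and hyphens, so it assumes no digit appears earlier in the path):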
library(stringr)
str_extract(link , "[0-9-]+")
# "0000098618-13-000011"
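The variants above agree on the example but can differ on other paths. The sketch below compares a few of them in base R on a small vector; the second path is made up for illustration, and the 10-2-6 digit layout of the ID is an assumption read off the single example in the question.

```r
links <- c("D:/temp/sample_data/0000098618-13-000011.htm",
           "D:/temp/v1.2/0000098618-13-000012.tar.gz")  # hypothetical second path

# Cut at the first dot of the file name
first_dot <- sub("\\..*", "", basename(links))
# Cut only the last extension
last_ext <- sub("\\.[^.]*$", "", basename(links))
# Match the ID itself (assumes 10 digits, 2 digits, 6 digits)
by_shape <- regmatches(links, regexpr("[0-9]{10}-[0-9]{2}-[0-9]{6}", links))

first_dot  # "0000098618-13-000011" "0000098618-13-000012"
last_ext   # "0000098618-13-000011" "0000098618-13-000012.tar"
by_shape   # "0000098618-13-000011" "0000098618-13-000012"
```

Note that regmatches silently drops elements without a match, so by_shape can be shorter than links if some paths contain no ID.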

substring replace nth positions R

I need to replace the 6th, 7th and 8th characters with a single "_". I gave substring the start and stop positions, but it didn't work.
> a=c("UHI786KJRH2V", "TYR324FHASJKDG","DHA927NFSYFN34")
> substring(a, 6,8) <- "_"
> a
[1] "UHI78_KJRH2V" "TYR32_FHASJKDG" "DHA92_NFSYFN34"
I need UHI78_RH2V TYR32_ASJKDG DHA92_SYFN34
Using sub, we can match on the pattern (?<=^.{5}).{3}, and then replace it by a single underscore:
a <- c("UHI786KJRH2V", "TYR324FHASJKDG","DHA927NFSYFN34")
out <- sub("(?<=^.{5}).{3}", "_", a, perl=TRUE)
out
[1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
We could also try doing substring operations here, but we would have to do some splicing:
out <- paste0(substr(a, 1, 5), "_", substr(a, 9, nchar(a)))
1) str_sub<- The str_sub<- replacement function in the stringr package can do that.
library(stringr)
str_sub(a, 6, 8) <- "_"
a
## [1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
2) Base R With only base R you could do this. It replaces the entire string with the match to the first capture group, an underscore and the match to the second capture group.
sub("(.....)...(.*)", "\\1_\\2", a)
## [1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
That regex could also be written as "(.{5}).{3}(.*)" .
3) separate/unite If a is a column in a data frame then we could use dplyr and tidyr to do this:
library(dplyr)
library(tidyr)
DF <- data.frame(a)
DF %>%
separate(a, into = c("pre", "junk", "post"), sep = c(5, 8)) %>%
select(-junk) %>%
unite(a)
giving:
a
1 UHI78_RH2V
2 TYR32_ASJKDG
3 DHA92_SYFN34
From the documentation:
If the portion to be replaced is longer than the replacement string, then only the portion the length of the string is replaced.
So we could do something like this:
substring(a, 6,8) <- "_##"
sub("#+", "", a)
[1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
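To make the difference explicit: substring<- overwrites characters in place and never changes the length of the string, whereas the goal here is to swap a three-character span for a one-character replacement. A small base R sketch of the splicing approach, with replace_span as a hypothetical helper name:

```r
a <- c("UHI786KJRH2V", "TYR324FHASJKDG", "DHA927NFSYFN34")

# substring<- only overwrites as many characters as the replacement has
b <- a
substring(b, 6, 8) <- "_"
b  # "UHI78_KJRH2V" "TYR32_FHASJKDG" "DHA92_NFSYFN34"

# Hypothetical helper: replace characters start..stop with a string of any length
replace_span <- function(x, start, stop, replacement) {
  paste0(substr(x, 1, start - 1), replacement, substr(x, stop + 1, nchar(x)))
}

out <- replace_span(a, 6, 8, "_")
out  # "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
```

Unlike the "_##" trick, this does not depend on the placeholder character being absent from the input.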

How to extract parts from a string

I have a string called PATTERN:
PATTERN <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
and I would like to parse the string using a pattern matching function, like grep, sub, ... to obtain a string variable MODEL equal to "Name.model", a string variable OUTCOME equal to "any.outcome" and an integer variable IMP equal to number.
If MODEL, OUTCOME and IMP were all integers, I could get the values using function sub:
PATTERN <- "MODEL_002-OUTCOME_007-IMP_001"
pattern_build <- "MODEL_([0-9]+)-OUTCOME_([0-9]+)-IMP_([0-9]+)"
MODEL <- as.integer(sub(pattern_build, "\\1", PATTERN))
OUTCOME <- as.integer(sub(pattern_build, "\\2", PATTERN))
IMP <- as.integer(sub(pattern_build, "\\3", PATTERN))
Do you have any idea of how to match the string contained in variable PATTERN?
Possible tricky patterns are:
PATTERN <- "MODEL_PS2-OUTCOME_stroke_i-IMP_001"
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"
A solution which is also able to deal with the 'tricky' patterns:
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"
lst <- strsplit(PATTERN, '([A-Z]+_)')[[1]][2:4]
lst <- sub('-$','',lst)
which gives:
> lst
[1] "linear-model" "stroke_i" "001"
And if you want that in a dataframe:
df <- as.data.frame.list(lst)
names(df) <- c('MODEL','OUTCOME','IMP')
which gives:
> df
MODEL OUTCOME IMP
1 linear-model stroke_i 001
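A minimal-regex approach (shown for PATTERN <- "MODEL_PS2-OUTCOME_stroke_i-IMP_001"; it assumes the values themselves contain no hyphen):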
sapply(strsplit(PATTERN, '-'), function(i) sub('(.*?_){1}', '', i))
# [,1]
#[1,] "PS2"
#[2,] "stroke_i"
#[3,] "001"
You may use a pattern with capturing groups matching any chars, as few as possible between known delimiting substrings:
MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)
Note that the last .* is greedy, since it should capture the rest of the string.
You may tighten this pattern to allow only the expected characters (say, to match only digits in the last capturing group, use ([0-9]+) rather than (.*)).
Use it with, say, str_match from stringr:
> library(stringr)
> x <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
> res <- str_match(x, "MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)")
> res[,2]
[1] "Name.model"
> res[,3]
[1] "any.outcome"
> res[,4]
[1] "number"
A base R solution using the same regex can use regmatches with regexec:
> res <- regmatches(x, regexec("MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)", x))[[1]]
> res[2]
[1] "Name.model"
> res[3]
[1] "any.outcome"
> res[4]
[1] "number"
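Putting the pieces together, the capture-group pattern can be wrapped in a function that returns the three variables with the types the question asks for. This is a sketch only: parse_pattern is a hypothetical name, and it assumes IMP is always a run of digits, so the placeholder string ending in IMP_number would be rejected.

```r
# Hypothetical helper: split "MODEL_<x>-OUTCOME_<y>-IMP_<digits>" into its parts.
parse_pattern <- function(x) {
  rx <- "^MODEL_(.*?)-OUTCOME_(.*?)-IMP_([0-9]+)$"
  # regexec() keeps the capture groups; element 1 is the full match
  m <- regmatches(x, regexec(rx, x, perl = TRUE))[[1]]
  if (length(m) == 0) stop("string does not match the expected layout")
  list(MODEL = m[2], OUTCOME = m[3], IMP = as.integer(m[4]))
}

res <- parse_pattern("MODEL_linear-model-OUTCOME_stroke_i-IMP_001")
res$MODEL    # "linear-model"
res$OUTCOME  # "stroke_i"
res$IMP      # 1
```

Anchoring on the literal -OUTCOME_ and -IMP_ markers is what lets hyphens and underscores appear inside the values.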

R: splitting a string between two characters using strsplit()

Let's say I have the following string:
s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
I would like to recover the strings between ";" and "=" to get the following output:
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
Can I use strsplit() with more than one split element?
1) strsplit with matrix Try this:
> matrix(strsplit(s, "[;=]")[[1]], 2)[2,]
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
2) strsplit with gsub or this use of strsplit with gsub:
> strsplit(gsub("[^=;]+=", "", s), ";")[[1]]
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
3) strsplit with sub or this use of strsplit with sub:
> sub(".*=", "", strsplit(s, ";")[[1]])
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
4) strapplyc or this which extracts consecutive non-semicolons after equal signs:
> library(gsubfn)
> strapplyc(s, "=([^;]+)", simplify = unlist)
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
ADDED additional strsplit solutions.
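If the keys are also of interest, the strsplit with sub idea extends naturally to a named vector. A minimal sketch, assuming every field has the form key=value and that ";" never appears inside a value; parse_kv is a hypothetical helper name.

```r
# Hypothetical helper: turn "k1=v1;k2=v2" into a named character vector.
parse_kv <- function(s) {
  fields <- strsplit(s, ";", fixed = TRUE)[[1]]
  keys   <- sub("=.*", "", fields)       # text before the first "="
  values <- sub("^[^=]*=", "", fields)   # text after the first "="
  setNames(values, keys)
}

s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
kv <- parse_kv(s)
kv[["Name"]]  # "hsa-miR-6859-5p"
unname(kv)    # the four values, in order
```

Looking values up by key avoids relying on the order of the fields.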
I know this is an old question, but I found the usage of lookaround regular expressions quite elegant for this problem:
library(stringr)
your_string <- '/this/file/name.txt'
result <- str_extract(string = your_string, pattern = "(?<=/)[^/]*(?=\\.)")
result
In words,
The (?<=...) part looks before the desired string for a... (in this case a forward slash).
The [^/]* then matches as many consecutive characters as possible that are not a forward slash (here it first takes name.txt, then backtracks to name so that the lookahead below can match).
The (?=...) then looks after the desired string for a ... (in this case the special period character, which needs to be escaped as \\.).
This also works on dataframes:
library(dplyr)
strings <- c('/this/file/name1.txt', 'tis/other/file/name2.csv')
df <- as.data.frame(strings) %>%
mutate(name = str_extract(string = strings, pattern = "(?<=/)[^/]*(?=\\.)"))
# Optional
names <- df %>% pull(name)
Or, in your case:
your_string <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
result <- str_extract(string = your_string, pattern = "(?<=;Alias=)[^;]*(?=;)")
result # Outputs 'MIMAT0027618'
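The same lookbehind idea can return every value at once, which is what the question asked for. The sketch below uses base R with perl = TRUE instead of stringr; the trailing (?=;) lookahead is dropped so that the last field, which has no semicolon after it, is matched too. It assumes values never contain "=" or ";".

```r
s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"

# (?<==) : the match must be preceded by "="
# [^;]+  : then take everything up to the next ";" (or the end of the string)
values <- regmatches(s, gregexpr("(?<==)[^;]+", s, perl = TRUE))[[1]]
values
# "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
```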

Resources