Extract numbers that appear last in the string - r

Is there a way to extract the numbers that appear last in each string?
asd <- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf654")
asd1 <- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf65421")
Expected output
new_asd
654
new_asd1
65421

This code always extracts the last run of digits in a string:
(stringr::str_extract(asd, stringr::regex("(\\d+)(?!.*\\d)")))
"654"
(stringr::str_extract(asd1, stringr::regex("(\\d+)(?!.*\\d)")))
"65421"
If you want to get only the number when the very last character of the string is a number then you could implement a simple ifelse condition to check for that specifically, e.g.:
x<- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf654f")
ifelse(grepl("\\d$", x), # TRUE only if the last character is a digit; avoids the coercion warning from as.numeric("f")
(stringr::str_extract(x, stringr::regex("(\\d+)(?!.*\\d)"))),
NA)
NA #returns NA because last entry of string is not numeric ("f")

A single regex is sufficient for your situation.
stringr::str_extract(asd, "(\\d+$)")
The $ anchors the capture group to the end of the string.
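As a rough base R sketch of the same idea (the helper name extract_trailing_digits is made up for illustration), regexpr with the anchored pattern returns the trailing digits and leaves NA where the string does not end in a digit:

```r
# Hypothetical helper: trailing digits of each string, NA if the string
# does not end in a digit.
extract_trailing_digits <- function(x) {
  m <- regexpr("[0-9]+$", x)           # -1 where there is no match
  hit <- !is.na(m) & m > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches returns only the matched elements
  out
}

extract_trailing_digits(c("sdfsdfsf654", "sdfsdfsf654f", "123"))
# [1] "654" NA    "123"
```

This is vectorised over x, so it can be applied to a whole column at once.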

I would use sub combined with ifelse here:
x <- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf654", "abc", "123")
nums <- ifelse(grepl("\\d$", x), sub(".*?(\\d+)$", "\\1", x), "")
nums
[1] "654" "" "123"

You can do something like this:
library(stringr)
val <- str_extract_all(asd1, "\\d+")[[1]]
tail(val, 1)
"65421"
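Or, in base R:
val <- regmatches(asd1, gregexpr("[[:digit:]]+", asd1))[[1]]
tail(val, 1)
"65421"
Note that as.numeric(gsub("[^\\d]+", "", asd, perl=TRUE)) is not a substitute here: it strips every non-digit and returns 54654, i.e. all digits in the string concatenated.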
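The [[1]] / tail() approach above handles one string at a time. A possible vectorised variant in base R (last_number is a hypothetical name, and returning NA for strings without digits is an assumption about the desired behaviour):

```r
# Hypothetical helper: last run of digits in each element of x.
last_number <- function(x) {
  all_runs <- regmatches(x, gregexpr("[0-9]+", x))  # list: one character vector per string
  vapply(
    all_runs,
    function(runs) if (length(runs) > 0) runs[length(runs)] else NA_character_,
    character(1)
  )
}

asd  <- "asdf sfsfsd54 sdfsdfsdf sdfsdfsf654"
asd1 <- "asdf sfsfsd54 sdfsdfsdf sdfsdfsf65421"
last_number(c(asd, asd1, "abc"))
# [1] "654"   "65421" NA
```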

One solution first splits the string on whitespace, then takes the last substring and removes any letters. This should work as long as the strings contain only lowercase letters and digits.
library(stringr)
get_num = function(x) {
str_remove_all(rev(unlist(str_split(x, " ")))[1], "[a-z]")
}
> get_num(asd)
[1] "654"
> get_num(asd1)
[1] "65421"

If the string always ends with digits, we can try gsub:
> x <- c("asdf sfsfsd54 sdfsdfsdf sdfsdfsf654", "asdf sfsfsd54 sdfsdfsdf sdfsdfsf65421")
> gsub(".*\\D", "", x, perl = TRUE)
[1] "654" "65421"

We can also use stri_extract_last_regex from stringi
> library(stringi)
> stri_extract_last_regex(asd, "\\d+")
[1] "654"
> stri_extract_last_regex(asd1, "\\d+")
[1] "65421"

Related

Deleting nth delimiter in R

I am trying to delete the 5th delimiter in this string:
"Bacteria_Firmicutes_Clostridia_Clostridiales_Rumino_coccaceae_Ruminococcus_Ruminococcus_albus"
so it becomes:
"Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
This seems to work, but I feel there should be a more elegant solution, possibly with regex and str_replace:
library(stringr)
name <- "Bacteria_Firmicutes_Clostridia_Clostridiales_Rumino_coccaceae_Ruminococcus_Ruminococcus_albus"
index <- str_locate_all(name, "_")[[1]]
str_sub(name, index[5, "start"], index[5, "end"]) <- ""
name
Try gsub (sub would also do, since only the first match needs replacing):
> gsub("((?:[^_]+_){4}[^_]+)_", "\\1", name)
[1] "Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
Or a less "pretty" way:
> gsub("([^_]*_[^_]*_[^_]*_[^_]*_[^_]*)_", "\\1", name)
[1] "Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
Or with the strex library:
> library(strex)
> paste(str_before_nth(name, "_", 5), str_after_nth(name, "_", 5), sep="")
[1] "Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
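For a regex-free alternative, here is a minimal base R sketch that locates the delimiters with gregexpr and cuts out the n-th one. The function name remove_nth_delim is hypothetical, and leaving the string unchanged when it has fewer than n delimiters is an assumption:

```r
# Hypothetical helper: remove the n-th occurrence of a literal delimiter
# from a single string.
remove_nth_delim <- function(x, delim, n) {
  pos <- gregexpr(delim, x, fixed = TRUE)[[1]]   # start positions of all delimiters
  if (pos[1] == -1 || length(pos) < n) return(x) # fewer than n delimiters: unchanged
  start <- pos[n]
  len <- attr(pos, "match.length")[n]
  paste0(substr(x, 1, start - 1), substr(x, start + len, nchar(x)))
}

name <- "Bacteria_Firmicutes_Clostridia_Clostridiales_Rumino_coccaceae_Ruminococcus_Ruminococcus_albus"
remove_nth_delim(name, "_", 5)
```

Because the delimiter is matched with fixed = TRUE, characters such as "." or "|" can be used without escaping.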

Removing special characters and spaces from strings

name1 <- "Adam & Eve"
name2 <- "Spartacus"
name3 <- "Fitness and Health"
I want to remove all spaces, special characters such as %&,. and the word "and" from the names, and then capitalize each string, so that the names become:
name1 <- "ADAMEVE"
name2 <- "SPARTACUS"
name3 <- "FITNESSHEALTH"
We can use sub to remove the string "and", then use gsub to remove everything other than letters (the ^ negates the character class) and convert the result to upper case with toupper:
f1 <- function(x) toupper(gsub("[^A-Za-z]", "", sub("and", "", x, fixed = TRUE)))
-testing
> f1(name1)
[1] "ADAMEVE"
> f1(name2)
[1] "SPARTACUS"
> f1(name3)
[1] "FITNESSHEALTH"
Inspired by akrun's answer we could create a function and apply it to the vectors:
library(stringr)
my_function <- function(x){
x <- str_replace_all(x, "[^A-Za-z0-9]","")
x <- toupper(x)
x <- str_remove_all(x, "AND")
return(x)
}
my_function(c(name1, name2,name3))
Output:
[1] "ADAMEVE" "SPARTACUS" "FITNESSHEALTH"
You can use stringr's str_remove_all() with the patterns for "any non-word character" ("\\W") and the word "and" (use word boundaries here, \\b), and then change everything to upper case with toupper():
library(stringr)
name1 %>% str_remove_all("\\W|\\band\\b") %>% toupper
If you want to define a function for that, you can do it as follows:
my_function <- function(x) { x %>% str_remove_all("\\W|\\band\\b") %>% toupper }
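The answers above differ in how they treat "and" inside longer words. A base R sketch that removes only the standalone, lower-case word and then keeps letters only (clean_name is a hypothetical name, and dropping digits is an assumption based on the expected output in the question):

```r
# Hypothetical helper: drop the standalone word "and", drop everything
# that is not a letter, then upper-case the result.
clean_name <- function(x) {
  no_and <- gsub("\\band\\b", "", x, perl = TRUE)  # \\b keeps "Sandy" intact
  toupper(gsub("[^A-Za-z]", "", no_and))
}

clean_name(c("Adam & Eve", "Spartacus", "Fitness and Health", "Sandy and Andy"))
# [1] "ADAMEVE"       "SPARTACUS"     "FITNESSHEALTH" "SANDYANDY"
```

Removing "AND" after upper-casing, as in one of the functions above, would instead turn "Sandy" into "SY".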

How to sort strings with integers by numeric ordering?

I have a column in a data.table full of strings in the format string+integer. e.g.
string1, string2, string3, string4, string5,
When I use sort(), the strings end up in the wrong order:
string1, string10, string11, string12, string13, ..., string2, string20,
string21, string22, string23, ....
How would I sort these to be in the order
string01, string02, string03, string04, string05, ..., string10, string11,
string12, etc.
One method could be to add a leading 0 to each single-digit integer (1-9)?
I suspect you would extract the string part with str_extract(dt$string_column, "[a-z]+") and then somehow add a 0 to each single-digit integer with sprintf().
We can remove the characters that are not numbers to do the sorting
dt[order(as.integer(gsub("\\D+", "", col1)))]
You could go for mixedsort in gtools:
vec <- c("string1", "string10", "string11", "string12", "string13","string2",
"string20", "string21", "string22", "string23")
library(gtools)
mixedsort(vec)
#[1] "string1" "string2" "string10" "string11" "string12" "string13"
# "string20" "string21" "string22" "string23"
You could use the str_extract of stringr package to obtain the digits and order according to that
x = c("string1","string3","stringZ","string2","stringX","string10")
library(stringr)
c(x[grepl("\\d+",x)][order(as.integer(str_extract(x[grepl("\\d+",x)],"\\d+")))],
sort(x[!grepl("\\d+",x)]))
#[1] "string1" "string2" "string3" "string10" "stringX" "stringZ"
Assuming the string is something like below:
library(data.table)
library(stringr)
xstring <- data.table(x = c("string1","string11","string2",'string10',"stringx"))
extracts <- str_extract(xstring$x,"(?<=string)(\\d*)")
y_string <- ifelse(nchar(extracts)==2 | extracts=="",extracts,paste0("0",extracts))
fin_string <- str_replace(xstring$x,"(?<=string)(\\d*)",y_string)
sort(fin_string)
Output:
> sort(fin_string)
[1] "string01" "string02" "string10" "string11"
[5] "stringx"
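The zero-padding idea from the question can also be sketched in base R with sprintf(). The helper name pad_trailing_number and its width argument are hypothetical, and the sketch assumes the number sits at the very end of each string:

```r
# Hypothetical helper: left-pad the trailing integer of each string with zeros.
pad_trailing_number <- function(x, width = 2) {
  prefix <- sub("[0-9]+$", "", x)             # everything before the trailing digits
  digits <- substring(x, nchar(prefix) + 1)   # trailing digits, "" if there are none
  has_num <- nzchar(digits)
  fmt <- paste0("%s%0", width, "d")           # e.g. "%s%02d"
  out <- x
  out[has_num] <- sprintf(fmt, prefix[has_num], as.integer(digits[has_num]))
  out
}

pad_trailing_number(c("string1", "string10", "string2", "stringx"))
# [1] "string01" "string10" "string02" "stringx"
```

Sorting the padded strings then gives the numeric order, provided width is at least the number of digits in the largest integer.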

How to extract parts from a string

I have a string called PATTERN:
PATTERN <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
and I would like to parse the string using a pattern matching function, like grep, sub, ... to obtain a string variable MODEL equal to "Name.model", a string variable OUTCOME equal to "any.outcome" and an integer variable IMP equal to number.
If MODEL, OUTCOME and IMP were all integers, I could get the values using function sub:
PATTERN <- "MODEL_002-OUTCOME_007-IMP_001"
pattern_build <- "MODEL_([0-9]+)-OUTCOME_([0-9]+)-IMP_([0-9]+)"
MODEL <- as.integer(sub(pattern_build, "\\1", PATTERN))
OUTCOME <- as.integer(sub(pattern_build, "\\2", PATTERN))
IMP <- as.integer(sub(pattern_build, "\\3", PATTERN))
Do you have any idea of how to match the string contained in variable PATTERN?
Possible tricky patterns are:
PATTERN <- "MODEL_PS2-OUTCOME_stroke_i-IMP_001"
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"
A solution which is also able to deal with the 'tricky' patterns:
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"
lst <- strsplit(PATTERN, '([A-Z]+_)')[[1]][2:4]
lst <- sub('-$','',lst)
which gives:
> lst
[1] "linear-model" "stroke_i" "001"
And if you want that in a dataframe:
df <- as.data.frame.list(lst)
names(df) <- c('MODEL','OUTCOME','IMP')
which gives:
> df
MODEL OUTCOME IMP
1 linear-model stroke_i 001
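A minimal-regex approach (shown for PATTERN <- "MODEL_PS2-OUTCOME_stroke_i-IMP_001"; it splits on every hyphen, so it does not handle values that contain one, such as linear-model):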
sapply(strsplit(PATTERN, '-'), function(i) sub('(.*?_){1}', '', i))
# [,1]
#[1,] "PS2"
#[2,] "stroke_i"
#[3,] "001"
You may use a pattern with capturing groups matching any chars, as few as possible between known delimiting substrings:
MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)
Note that the last .* is greedy, since it captures the whole rest of the string.
You may refine this pattern to match only the expected characters (say, to match digits in the last capturing group, use ([0-9]+) rather than (.*)).
Use it with, say, str_match from stringr:
> library(stringr)
> x <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
> res <- str_match(x, "MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)")
> res[,2]
[1] "Name.model"
> res[,3]
[1] "any.outcome"
> res[,4]
[1] "number"
A base R solution using the same regex involves regmatches / regexec:
> res <- regmatches(x, regexec("MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)", x))[[1]]
> res[2]
[1] "Name.model"
> res[3]
[1] "any.outcome"
> res[4]
[1] "number"
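To get the three variables the question asks for in one step, the regmatches / regexec approach could be wrapped roughly as follows. parse_pattern is a hypothetical name; the sketch assumes IMP is always made of digits and uses perl = TRUE in regexec, which needs a reasonably recent R version:

```r
# Hypothetical helper: split "MODEL_x-OUTCOME_y-IMP_n" into its three parts.
parse_pattern <- function(x) {
  rx <- "^MODEL_(.*?)-OUTCOME_(.*?)-IMP_([0-9]+)$"
  m <- regmatches(x, regexec(rx, x, perl = TRUE))[[1]]
  if (length(m) == 0) stop("string does not match the expected layout")
  list(MODEL = m[2], OUTCOME = m[3], IMP = as.integer(m[4]))
}

res <- parse_pattern("MODEL_linear-model-OUTCOME_stroke_i-IMP_001")
res$MODEL    # "linear-model"
res$OUTCOME  # "stroke_i"
res$IMP      # 1
```

The lazy groups stop at the first "-OUTCOME_" and "-IMP_", which is what lets hyphens and underscores inside the values through.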

R: splitting a string between two characters using strsplit()

Let's say I have the following string:
s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
I would like to recover the strings between ";" and "=" to get the following output:
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
Can I use strsplit() with more than one split element?
1) strsplit with matrix Try this:
> matrix(strsplit(s, "[;=]")[[1]], 2)[2,]
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
2) strsplit with gsub or this use of strsplit with gsub:
> strsplit(gsub("[^=;]+=", "", s), ";")[[1]]
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
3) strsplit with sub or this use of strsplit with sub:
> sub(".*=", "", strsplit(s, ";")[[1]])
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
4) strapplyc or this which extracts consecutive non-semicolons after equal signs:
> library(gsubfn)
> strapplyc(s, "=([^;]+)", simplify = unlist)
[1] "MIMAT0027618" "MIMAT0027618" "hsa-miR-6859-5p" "MI0022705"
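If the keys are needed as well, the same strsplit idea can be extended to a named vector. This is only a sketch: parse_attributes is a hypothetical name, and it assumes every field has the form key=value with no ";" inside the values:

```r
# Hypothetical helper: turn "k1=v1;k2=v2" into c(k1 = "v1", k2 = "v2").
parse_attributes <- function(s) {
  pairs  <- strsplit(s, ";", fixed = TRUE)[[1]]
  keys   <- sub("=.*$", "", pairs)       # text before the first "="
  values <- sub("^[^=]*=", "", pairs)    # text after the first "="
  setNames(values, keys)
}

s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
attrs <- parse_attributes(s)
attrs[["Name"]]
# [1] "hsa-miR-6859-5p"
```

unname(attrs) gives the same four values as the solutions above.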
I know this is an old question, but I found the usage of lookaround regular expressions quite elegant for this problem:
library(stringr)
your_string <- '/this/file/name.txt'
result <- str_extract(string = your_string, pattern = "(?<=/)[^/]*(?=\\.)")
result
In words,
The (?<=...) part looks before the desired string for a... (in this case a forward slash).
The [^/]* then matches as many consecutive characters that are not a forward slash as possible (here it first takes name.txt, then backtracks to name so that the lookahead can succeed).
The (?=...) then looks after the desired string for a ... (in this case the special period character, which needs to be escaped as \\.).
This also works on dataframes:
library(dplyr)
strings <- c('/this/file/name1.txt', 'tis/other/file/name2.csv')
df <- as.data.frame(strings) %>%
mutate(name = str_extract(string = strings, pattern = "(?<=/)[^/]*(?=\\.)"))
# Optional
names <- df %>% pull(name)
Or, in your case:
your_string <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"
result <- str_extract(string = your_string, pattern = "(?<=;Alias=)[^;]*(?=;)")
result # Outputs 'MIMAT0027618'

Resources