Regex in R, matching strings

I have strings like this: "X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2", and I would like to match only the numbers 1, 2 and 3 between the underscores, without the underscores themselves. The best solution I could come up with is str_match(sample_names, "_+[1-3]?"). I would really appreciate the help.

The simplest method is to use sub with a backreference:
Data:
d <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
Solution:
sub(".*_(\\d)_.*", "\\1", d)
Here, (\\d) defines a capturing group for a single digit (if the number in question can have more than one digit, use \\d+); the group is 'recalled' by the backreference \\1 in sub's replacement argument.
Alternatively, use str_extract with positive lookarounds:
library(stringr)
str_extract(d, "(?<=_)\\d(?=_)")
(?<=_) is positive lookbehind which can be glossed as "If you see _ on the left..."
\\d is the number to be matched
(?=_) is positive lookahead, which can be glossed as "If you see _ on the right..."
Result:
[1] "1" "2" "3"
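As a minimal sketch of the \\d+ variant mentioned above, assuming the number between the underscores may have several digits (the vector d2 and the non-matching string are made up for illustration and are not from the question):

```r
# Hypothetical inputs with multi-digit numbers between the underscores
d2 <- c("X96HE6.10nMBI_12_2", "X96HE6.10nMBI_7_2")

# \\d+ captures one or more digits; \\1 puts the captured text back
multi <- sub(".*_(\\d+)_.*", "\\1", d2)
print(multi)
# [1] "12" "7"

# When nothing matches, sub() returns the input unchanged rather than NA
unmatched <- sub(".*_(\\d+)_.*", "\\1", "no_digits_here")
print(unmatched)
# [1] "no_digits_here"
```

The second call is worth keeping in mind: with sub, a string that does not fit the pattern passes through unchanged, so it may be worth checking with grepl first if the input is not guaranteed to match.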

You can use lookarounds. I personally rely heavily on the stringr cheatsheets for this kind of regex, because the syntax is a bit hard to remember. On the RStudio cheatsheets page, look for stringr -> LOOK AROUNDS.
library(tidyverse)
codes <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
codes %>%
str_extract("(?<=_)[:digit:]+(?=_)")
#> [1] "1" "2" "3"
Created on 2020-06-14 by the reprex package (v0.3.0)

Using x in the Note at the end, read it in using read.table and pick off the second field. No packages or regular expressions are used.
read.table(text = x, sep = "_")[[2]]
## [1] 1 2 3
Note
x <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
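One detail of the read.table approach that may matter downstream: the field comes back as a number, not as a string. A small sketch comparing it with a strsplit version that keeps the field as character (the strsplit variant is an alternative added here, not part of the answer above):

```r
x <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")

# read.table() guesses column types, so the second field is read as integer
fields <- read.table(text = x, sep = "_")
second_int <- fields[[2]]
print(class(second_int))
# [1] "integer"

# strsplit() keeps every piece as character; fixed = TRUE treats "_" literally
second_chr <- vapply(strsplit(x, "_", fixed = TRUE), `[`, character(1), 2)
print(second_chr)
# [1] "1" "2" "3"
```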

No need for any third-party package:
strings <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
pattern <- "(?<=_)(\\d+)(?=_)"
unlist(regmatches(strings, gregexpr(pattern, strings, perl = TRUE)))
Which yields:
[1] "1" "2" "3"
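A caveat with unlist here: elements without a match contribute nothing, so the result can end up shorter than the input. The sketch below, assuming one string in the vector has no underscore-delimited number (the middle element is invented for illustration), keeps one value per input instead:

```r
strings <- c("X96HE6.10nMBI_1_2", "no-underscores", "X96HE6.10nMBI_3_2")
pattern <- "(?<=_)\\d+(?=_)"

hits <- regmatches(strings, gregexpr(pattern, strings, perl = TRUE))

# unlist() silently drops the element that had no match
flat <- unlist(hits)
print(flat)
# [1] "1" "3"

# Taking the first hit per element (or NA) keeps the result aligned with the input
first_hit <- vapply(
  hits,
  function(h) if (length(h) > 0) h[1] else NA_character_,
  character(1)
)
print(first_hit)
# [1] "1" NA  "3"
```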

Related

Extract a number from a string which precedes a phrase in R

I am in R and would like to extract a two digit number 38y from the following string:
"/Users/files/folder/file_number_23a_version_38y_Control.txt"
I know that _Control always comes after the 38y and that 38y is preceded by an underscore. How can I use strsplit or other R commands to extract the 38y?
You could use
x <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"
regmatches(x, regexpr("[^_]+(?=_Control)", x, perl = TRUE))
# [1] "38y"
or equivalently
stringr::str_extract(x, "[^_]+(?=_Control)")
# [1] "38y"
Using gsub.
gsub('.*_(.*)_Control.*', '\\1', x)
# [1] "38y"
See demo with detailed explanation.
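The gsub pattern can be read piece by piece. The following is a sketch with the path from the question plus one invented path that lacks the _Control marker, to show what happens when the pattern does not match:

```r
x <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"

# .*         greedy: consumes as much as possible before the last usable "_"
# _(.*)_     captures whatever sits between that "_" and "_Control"
# Control.*  the literal marker plus the rest of the string
token <- gsub(".*_(.*)_Control.*", "\\1", x)
print(token)
# [1] "38y"

# Hypothetical path without the marker: gsub() returns it unchanged
no_marker <- "/Users/files/folder/file_number_23a.txt"
untouched <- gsub(".*_(.*)_Control.*", "\\1", no_marker)
print(identical(untouched, no_marker))
# [1] TRUE
```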
A possible solution:
library(stringr)
text <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"
str_extract(text, "(?<=_)\\d+\\D(?=_Control)")
#> [1] "38y"
You can find an explanation of the regex part at:
https://regex101.com/r/PQSZHX/1

Is it possible to use R's base::strsplit() without consuming pattern

I have a string that consists entirely of simple repeating patterns of a [:digit:]+[A-Z] for instance 12A432B4B.
I want to use base::strsplit() to get:
[1] "12A" "432B" "4B"
I thought I could use lookahead to split by a LETTER and keep this pattern with unlist(strsplit("12A432B4B", "(?<=.)(?=[A-Z])", perl = TRUE)) but as can be seen I get the split wrongly:
[1] "12" "A432" "B4" "B"
I can't get my mind around a pattern that works with this strsplit strategy. Explanations would be really appreciated.
Bonus:
I also failed to use a backreference in gsub (e.g. this pattern does not work: gsub("([[:digit:]]+[A-Z])+", "\\1", "12A432B4B")). Also, can you retrieve more than the \\1 to \\9 groups, say if [:digit:]+[A-Z] repeats more than 9 times?
We can use regex lookaround to split between an upper case letter and a digit
strsplit(str1, "(?<=[A-Z])(?=[0-9])", perl = TRUE)[[1]]
#[1] "12A" "432B" "4B"
data
str1 <- "12A432B4B"
The pattern mentioned in the post can be used as is with stringr's str_extract_all:
library(stringr)
str_extract_all(string, '[[:digit:]]+[A-Z]')[[1]]
#[1] "12A" "432B" "4B"
Or in base R :
regmatches(string, gregexpr('[[:digit:]]+[A-Z]', string))[[1]]
where string is :
string <- '12A432B4B'
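On the bonus question: a repeated group such as (...)+ is still a single capture group, so it does not yield one backreference per repetition. A sketch of one way around that, assuming it is acceptable to collect all matches rather than numbered groups (the 12-piece string is generated here purely for illustration):

```r
# Build a string with 12 digit+letter pieces: "1A2B3C...12L"
long_string <- paste0(1:12, LETTERS[1:12], collapse = "")

# gregexpr() finds every non-overlapping match, however many there are
pieces <- regmatches(long_string, gregexpr("[[:digit:]]+[A-Z]", long_string))[[1]]

print(length(pieces))
# [1] 12
print(pieces[10:12])
# [1] "10J" "11K" "12L"
```

Because the pieces come back as an ordinary character vector, there is no limit tied to the number of backreferences.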

Extract characters between first and third period

Basically what the title says, I have a vector of character strings and for each element I want to extract everything between the first and third period. E.g.
s <- c("random.0.0.word.1.0", "different.0.02.words.15.6", "different.0.1.words.4.2")
The result should be:
"0.0" "0.02" "0.1"
I have tried adapting code from here and here but failed. Any advice much appreciated!
We can match the characters before the first dot ([^.]+) from the start (^) of the string, followed by a dot (\\.), then capture as a group ((...)) everything between the first and the third dot, and use the backreference (\\1) to that group in the replacement.
sub("^[^.]+\\.([^.]+\\.[^.]+)\\..*", "\\1", s)
#[1] "0.0" "0.02" "0.1"
Or it can also be done with substring after getting the positions of the dots
lst1 <- gregexpr('.', s, fixed = TRUE)
substring(s, sapply(lst1, `[`, 1) + 1, sapply(lst1, `[`, 3) - 1)
#[1] "0.0" "0.02" "0.1"
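To make the position-based approach easier to follow, here is a sketch for a single string that prints the dot positions before cutting (the variable names are chosen for this illustration):

```r
s1 <- "random.0.0.word.1.0"

# fixed = TRUE makes "." a literal dot rather than "any character"
dots <- as.integer(gregexpr(".", s1, fixed = TRUE)[[1]])
print(dots)
# [1]  7  9 11 16 18

# Keep what lies strictly between the 1st and the 3rd dot
between <- substring(s1, dots[1] + 1, dots[3] - 1)
print(between)
# [1] "0.0"
```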
An alternative way to do this, without using any fancy regex features, is just to split on . and then grab the bits we need:
library(stringr)
library(purrr)
str_split(s, "\\.") %>%
map_chr(~ paste0(.[2:3], collapse = "."))
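The same split-and-rejoin idea can be sketched in base R, for cases where loading stringr and purrr is not wanted. This assumes every string has at least three dot-separated fields:

```r
s <- c("random.0.0.word.1.0", "different.0.02.words.15.6", "different.0.1.words.4.2")

# fixed = TRUE splits on a literal dot, so no escaping is needed
parts <- strsplit(s, ".", fixed = TRUE)

# Rejoin the 2nd and 3rd fields of each element
middle <- vapply(parts, function(p) paste(p[2:3], collapse = "."), character(1))
print(middle)
# [1] "0.0"  "0.02" "0.1"
```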
We can use sub to capture as little as possible between 1st and 3rd period.
sub(".*?\\.(.*?\\..*?)\\..*", "\\1", s)
#[1] "0.0" "0.02" "0.1"
Here's a way with unglue, which some might find less intimidating :
library(unglue)
s <- c("random.0.0.word.1.0", "different.0.02.words.15.6", "different.0.1.words.4.2")
unglue_vec(s, "{=[^.]+}.{x}.{=[^.]+}.{=[^.]+}.{=[^.]+}")
#> [1] "0.0" "0.02" "0.1"
Created on 2020-01-16 by the reprex package (v0.3.0)
The subpatterns [^.]+ are sequences of "non dots", not named (nothing on the lhs of =) because we don't want to extract them.

How to extract everything after a specific string?

I'd like to extract everything after "-" in vector of strings in R.
For example in :
test = c("Pierre-Pomme","Jean-Poire","Michel-Fraise")
I'd like to get
c("Pomme","Poire","Fraise")
Thanks !
With str_extract. \\b is a zero-length token that matches a word boundary, i.e. the position between a word character and a non-word character (here, between the - and the last word):
library(stringr)
str_extract(test, '\\b\\w+$')
# [1] "Pomme" "Poire" "Fraise"
We can also use a back reference with sub. \\1 refers to string matched by the first capture group (.+), which is any character one or more times following a - at the end:
sub('.+-(.+)', '\\1', test)
# [1] "Pomme" "Poire" "Fraise"
This also works with str_replace if that is already loaded:
library(stringr)
str_replace(test, '.+-(.+)', '\\1')
# [1] "Pomme" "Poire" "Fraise"
A third option is to use strsplit and extract the second word from each element of the list (similar to word from @akrun's answer):
sapply(strsplit(test, '-'), `[`, 2)
# [1] "Pomme" "Poire" "Fraise"
stringr also has a str_split variant of this:
str_split(test, '-', simplify = TRUE)[,2]
# [1] "Pomme" "Poire" "Fraise"
We can use sub to match characters (.*) until the - and in the replacement specify ""
sub(".*-", "", test)
Or another option is word
library(stringr)
word(test, 2, sep="-")
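The sub approach behaves differently once a string contains more than one hyphen, because .* is greedy. A short sketch, using a made-up name that is not in the question:

```r
name <- "Jean-Pierre-Poire"  # hypothetical input with two hyphens

# Greedy .* runs up to the LAST hyphen
after_last <- sub(".*-", "", name)
print(after_last)
# [1] "Poire"

# [^-]* cannot cross a hyphen, so this removes only up to the FIRST one
after_first <- sub("^[^-]*-", "", name)
print(after_first)
# [1] "Pierre-Poire"
```

Which of the two is wanted depends on the data; for the single-hyphen strings in the question they give the same result.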
I think the other answers might be what you're looking for, but if you don't want to lose the original context you can try something like this:
library(tidyverse)
tibble(test) %>%
separate(test, c("first", "last"), remove = FALSE)
This will return a dataframe containing the original strings plus components, which might be more useful down the road:
# A tibble: 3 x 3
  test          first  last
  <chr>         <chr>  <chr>
1 Pierre-Pomme  Pierre Pomme
2 Jean-Poire    Jean   Poire
3 Michel-Fraise Michel Fraise
For some reason the responses here didn't work for my particular string. I found this response more helpful (i.e., using Stringr's lookbehind function): stringr str_extract capture group capturing everything.

How to extract everything until first occurrence of pattern

I'm trying to use the stringr package in R to extract everything from a string up until the first occurrence of an underscore.
What I've tried
str_extract("L0_123_abc", ".+?(?<=_)")
> "L0_"
Close but no cigar. How do I get this one? Also, ideally I'd like something that's easy to extend so that I can get the information between the 1st and 2nd underscores and the information after the 2nd underscore.
To get L0, you may use
> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"
The [^_]+ matches 1 or more chars other than _.
Also, you may split the string with _:
x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"
This way, you will have all the substrings you need.
The same can be achieved with
> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"
The regex lookaround should be
str_extract("L0_123_abc", ".+?(?=_)")
#[1] "L0"
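The difference between the two lookarounds can be seen side by side. This sketch uses base R with perl = TRUE instead of stringr, which is a substitution made here so the example runs without extra packages:

```r
s <- "L0_123_abc"

# Lookbehind: the match must END just after an "_", so the "_" is included
with_lookbehind <- regmatches(s, regexpr(".+?(?<=_)", s, perl = TRUE))
print(with_lookbehind)
# [1] "L0_"

# Lookahead: the match must be FOLLOWED by "_", which is not consumed
with_lookahead <- regmatches(s, regexpr(".+?(?=_)", s, perl = TRUE))
print(with_lookahead)
# [1] "L0"
```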
Using gsub...
gsub("(.+?)(\\_.*)", "\\1", "L0_123_abc")
You can use sub from base using _.* taking everything starting from _.
sub("_.*", "", "L0_123_abc")
#[1] "L0"
Or using [^_], which matches any character except _.
sub("([^_]*).*", "\\1", "L0_123_abc")
#[1] "L0"
or using substr with regexpr.
substr("L0_123_abc", 1, regexpr("_", "L0_123_abc")-1)
#substr("L0_123_abc", 1, regexpr("_", "L0_123_abc", fixed=TRUE)-1) #More performant alternative
#[1] "L0"
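For the extension asked about in the question (the part between the 1st and 2nd underscore, and the part after the 2nd), one possible sketch is to split once and index by column. It assumes every string has exactly three underscore-separated fields; the second input string is invented for illustration:

```r
x <- c("L0_123_abc", "L1_456_def")  # second element is hypothetical

# One row per input string, one column per field
parts <- do.call(rbind, strsplit(x, "_", fixed = TRUE))

before_first <- parts[, 1]
between      <- parts[, 2]
after_second <- parts[, 3]

print(before_first)
# [1] "L0" "L1"
print(between)
# [1] "123" "456"
print(after_second)
# [1] "abc" "def"
```

If the number of fields varies between strings, rbind would recycle values, so in that case it is probably safer to keep the list returned by strsplit and index each element separately.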
