Is it possible to use R's base::strsplit() without consuming pattern - r

I have a string that consists entirely of simple repeating patterns of a [:digit:]+[A-Z] for instance 12A432B4B.
I want to use base::strsplit() to get:
[1] "12A" "432B" "4B"
I thought I could use lookahead to split by a LETTER and keep this pattern with unlist(strsplit("12A432B4B", "(?<=.)(?=[A-Z])", perl = TRUE)) but as can be seen I get the split wrongly:
[1] "12" "A432" "B4" "B"
I can't wrap my head around a pattern that works with this strsplit strategy. Explanations would be really appreciated.
Bonus:
I also failed to use a back-reference in gsub; for example, this pattern does not work: gsub("([[:digit:]]+[A-Z])+", "\\1", "12A432B4B"). Also, can you retrieve more than the \\1 to \\9 groups, say if [:digit:]+[A-Z] repeats more than 9 times?

We can use regex lookaround to split between an upper case letter and a digit
strsplit(str1, "(?<=[A-Z])(?=[0-9])", perl = TRUE)[[1]]
#[1] "12A" "432B" "4B"
data
str1 <- "12A432B4B"
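To see why the lookahead from the question splits in the wrong place, it can help to run both patterns side by side. The following is a minimal sketch; the variable names attempt and fixed_split are only illustrative. (?=[A-Z]) asserts a position just before a letter, so the letter ends up at the start of the next piece, whereas (?<=[A-Z])(?=[0-9]) asserts a position just after a letter and just before a digit.

```r
str1 <- "12A432B4B"

# Pattern from the question: splits BEFORE each letter, so the letter
# is carried over into the following piece.
attempt <- strsplit(str1, "(?<=.)(?=[A-Z])", perl = TRUE)[[1]]

# Pattern from the answer: splits AFTER a letter and before a digit.
fixed_split <- strsplit(str1, "(?<=[A-Z])(?=[0-9])", perl = TRUE)[[1]]

print(attempt)
print(fixed_split)
```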

The pattern mentioned in the post can be used as it is in str_extract_all from the stringr package:
library(stringr)
str_extract_all(string, '[[:digit:]]+[A-Z]')[[1]]
#[1] "12A" "432B" "4B"
Or in base R :
regmatches(string, gregexpr('[[:digit:]]+[A-Z]', string))[[1]]
where string is :
string <- '12A432B4B'
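On the bonus question, extracting matches sidesteps the back-reference limit altogether, because no numbered groups are involved. The sketch below builds a made-up input with 12 repetitions (long_string and pieces are illustrative names) and extracts every piece with the same pattern. It does not show whether gsub can address groups beyond \\9.

```r
# Build a string with 12 repetitions of digits followed by a letter:
# "1A2B3C...12L"
long_string <- paste0(1:12, LETTERS[1:12], collapse = "")

# gregexpr() finds every match, regmatches() pulls them out;
# the number of repetitions does not matter.
pieces <- regmatches(long_string, gregexpr("[[:digit:]]+[A-Z]", long_string))[[1]]

print(long_string)
print(length(pieces))
```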

Related

Extracting all numbers in a string that are surrounded by a certain pattern in R

I'd like to extract all numbers in a string that are flanked by two markers/patterns. However, regular expressions in R are my bane.
I have something like this:
string <- "<img src='images/stimuli/32.png' style='width:341.38790035587186px;height: 265px;'><img src='images/stimuli/36.png' style='width:341.38790035587186px;height: 265px;'>"
marker1 <- "images/stimuli/"
marker2 <- ".png"
and want something like this
gsub(paste0(".*", marker1, "*(.*?) *", marker2, ".*"), "\\1", string)
[1] "32" "36"
However I get this:
[1] "32"
PS If someone has a good guide to understand how regular expressions work here, please let me know. I am pretty sure that the answer is pretty simple but I just don't get regex :(
You may use
string <- "<img src='images/stimuli/32.png' style='width:341.38790035587186px;height: 265px;'><img src='images/stimuli/36.png' style='width:341.38790035587186px;height: 265px;'>"
regmatches(string, gregexpr("images/stimuli/\\K\\d+(?=\\.png)", string, perl=TRUE))[[1]]
# => [1] "32" "36"
NOTE: If there can be anything, not just numbers, you may replace \\d+ with .*?.
regmatches combined with gregexpr extracts all matches found in the input.
The regex matches:
images/stimuli/ - a literal string
\K - a match reset operator discarding all text matched so far
\d+ - 1+ digits
(?=\.png) - a positive lookahead that requires a .png substring immediately to the right without adding it to the match (. is a special character, so it needs escaping).
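The same idea can be wrapped so that the two markers from the question are passed in as plain strings. This is only a sketch: extract_between is a hypothetical helper, not a function from any package, and it assumes the markers do not themselves contain the sequence \E. It relies on PCRE's \Q...\E quoting so that characters such as . in the markers are treated literally.

```r
# Hypothetical helper: return everything found between two literal markers.
# \Q...\E makes PCRE treat the markers as literal text, so the "." in
# ".png" needs no manual escaping.
extract_between <- function(x, marker1, marker2) {
  pattern <- paste0("\\Q", marker1, "\\E\\K.*?(?=\\Q", marker2, "\\E)")
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

string <- "<img src='images/stimuli/32.png'><img src='images/stimuli/36.png'>"
result <- extract_between(string, "images/stimuli/", ".png")[[1]]
print(result)
```

The helper returns a list with one element per input string, which is why the usage line takes [[1]].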
You can use str_extract_all from the package stringr:
library(stringr)
str_extract_all(string, "(?<=images/stimuli/)\\d+(?=\\.png)")
[[1]]
[1] "32" "36"
This solution uses a positive lookbehind, (?<=images/stimuli/), and a positive lookahead, (?=\\.png). Both are zero-width assertions: they check the surrounding text without consuming it, so the match consists only of the one or more digits, \\d+, sitting between the two.

Split at entire delimiter but not each component of delimiter

I want to split a string and keep the text where it's being split.
str = 'Glenn: $53 Sutter: $44'
strsplit(str, '[0-9]\\s+[A-Z]', perl = TRUE)
# [[1]]
# [1] "Glenn: $5" "utter: $44" ## taking out what was matched
strsplit(str, '(?=[0-9]\\s+[A-Z])', perl = TRUE)
# [[1]]
# [1] "Glenn: $5" "3" " Sutter: $44" ## splitting at each component of the match
Is there a way to split it at the entire delimiter? So it returns:
# [1] "Glenn: $53" "Sutter: $44"
We can use regex lookarounds to split at one or more spaces (\\s+) that come after a digit and before an upper case letter:
strsplit(str, "(?<=[0-9])\\s+(?=[A-Z])", perl = TRUE)[[1]]
#[1] "Glenn: $53" "Sutter: $44"
My understanding is that you wish to split on spaces that follow a dollar sign and one or more digits, provided the spaces are followed by a letter.
By setting perl = TRUE, you use the PCRE (Perl-compatible) regex engine, which supports \K; it effectively discards everything matched so far from the reported match. You could therefore use the following regex (with the case-insensitive flag set):
\$\d+\K\s+(?=[a-z])
In some cases, as here, \K can be used as a substitute for a variable-length lookbehind. Alas, most regex engines, including Perl's, do not support variable-length lookbehinds.
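Both suggestions can be tried in R as follows. This is a sketch under a few assumptions: the input is stored in txt rather than str (to avoid masking base::str), backslashes are doubled for R string literals, and the case-insensitive flag is written inline as (?i).

```r
txt <- "Glenn: $53 Sutter: $44"

# Lookbehind version: the digit and the letter are only asserted,
# so just the whitespace between them is consumed by the split.
by_lookaround <- strsplit(txt, "(?<=[0-9])\\s+(?=[A-Z])", perl = TRUE)[[1]]

# \K version: "$53" is matched but then dropped from the reported match,
# leaving only the whitespace as the delimiter. (?i) is the inline
# case-insensitive flag.
by_reset <- strsplit(txt, "(?i)\\$\\d+\\K\\s+(?=[a-z])", perl = TRUE)[[1]]

print(by_lookaround)
print(by_reset)
```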

How to extract from string with regex, a word and or condition

I'd like to extract a word from a string, but don't know how to proceed :
Say I have these character strings :
a_toto_matthew
a_tutu_matthew
In both cases, I'd like to extract matthew
I tried
gsub("^a_[toto|tutu]_(.*)$", "\\1", "a_toto_matthew")
But it doesn't work.
I could have done :
gsub("^a_.*_(.*)$", "\\1", "a_toto_matthew")
But I find it less elegant. I'd like to know the syntax for mentioning "toto" or "tutu" in the regexpr
Thanks in advance for any guidance,
Mathieu
Another option is to capture the u or o in a group and repeat it with a backreference, so that only toto or tutu can match. For the word itself, use \w+, or [^\W_]+ to match any word character except an underscore.
^a_t([uo])t\1_([^\W_]+)$
In the replacement, use group 2.
Try
gsub('a_(toto|tutu)_(.*)', '\\2', x)
#[1] "matthew" "matthew"
Using stringr with positive lookbehind
x <- c("a_toto_matthew", "a_tutu_matthew")
stringr::str_extract(x, "(?<=(toto|tutu)_)\\w+")
#[1] "matthew" "matthew"
Or using non-capturing group in str_match
stringr::str_match(x, "(?:toto|tutu)_(\\w+)")[,2]
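The reason the original attempt fails is that square brackets define a character class rather than a group with alternatives. A small sketch of the difference in base R (class_attempt and grouped are illustrative names; perl = TRUE is used here so that the non-capturing group syntax is certainly available):

```r
x <- c("a_toto_matthew", "a_tutu_matthew")

# [toto|tutu] is a character class: it matches ONE character out of
# t, o, |, u, so the pattern never matches and the input comes back unchanged.
class_attempt <- sub("^a_[toto|tutu]_(.*)$", "\\1", x)

# (?:toto|tutu) is a non-capturing group with alternation, so the
# name is still capture group 1.
grouped <- sub("^a_(?:toto|tutu)_(.*)$", "\\1", x, perl = TRUE)

print(class_attempt)
print(grouped)
```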

How to extract everything until first occurrence of pattern

I'm trying to use the stringr package in R to extract everything from a string up until the first occurrence of an underscore.
What I've tried
str_extract("L0_123_abc", ".+?(?<=_)")
> "L0_"
Close but no cigar. How do I get this one? Also, ideally I'd like something that's easy to extend so that I can get the information in between the 1st and 2nd underscore and get the information after the 2nd underscore.
To get L0, you may use
> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"
The [^_]+ matches 1 or more chars other than _.
Also, you may split the string with _:
x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"
This way, you will have all the substrings you need.
The same can be achieved with
> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"
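Since the question also asks for something that extends to the second and third pieces, the splitting approach can be wrapped in a small function. The sketch below uses base R only; nth_field is a hypothetical helper name, and the second input string is made up to show what happens when a field is missing.

```r
# Hypothetical helper: return the n-th field of each string, or NA when
# a string has fewer than n fields. fixed = TRUE treats sep literally.
nth_field <- function(x, n, sep = "_") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) p[n], character(1))
}

ids <- c("L0_123_abc", "A_B")
first  <- nth_field(ids, 1)
second <- nth_field(ids, 2)
third  <- nth_field(ids, 3)

print(first)
print(second)
print(third)
```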
The lookaround should be a lookahead, (?=_), rather than a lookbehind, so that the underscore is not included in the match:
str_extract("L0_123_abc", ".+?(?=_)")
#[1] "L0"
Using gsub...
gsub("(.+?)(\\_.*)", "\\1", "L0_123_abc")
You can use sub from base R with the pattern _.*, which matches everything from the first _ onward, and replace it with an empty string.
sub("_.*", "", "L0_123_abc")
#[1] "L0"
Or using [^_], which matches any character except _.
sub("([^_]*).*", "\\1", "L0_123_abc")
#[1] "L0"
or using substr with regexpr.
substr("L0_123_abc", 1, regexpr("_", "L0_123_abc")-1)
#substr("L0_123_abc", 1, regexpr("_", "L0_123_abc", fixed=TRUE)-1) #More performant alternative
#[1] "L0"

Retrieving a specific part of a string in R

I have the following vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that comes after the equal sign, meaning I would like to get a vector like this:
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results as a vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, whose ICU regex engine supports look-behind (base R's default engine does not, although perl = TRUE enables it).
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
library(stringi)
stri_extract_all_regex(str1, pattern = "(?<==).+$", simplify = TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any word character (letters, digits, and the underscore), so you could use "(?<==)\\w+$" (the \ must be escaped so you end up with \\w).
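For comparison, the same extraction can be sketched in base R without extra packages. The variable names ids_regex and ids_split are illustrative. The first variant reuses the look-behind pattern with perl = TRUE; the second is a vectorised form of the strsplit idea that returns an unnamed vector.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# Same look-behind pattern, evaluated by base R's PCRE engine.
# Note: regmatches() with regexpr() silently drops elements without a match.
ids_regex <- regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))

# Split on a literal "=" and take the second piece of every element;
# vapply() returns a plain unnamed character vector.
ids_split <- vapply(strsplit(str1, "=", fixed = TRUE), `[`, character(1), 2)

print(ids_regex)
print(ids_split)
```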

Resources