split by lookaround - r

I want to split a string into individual chars except those surrounded by < and >. Hence <a>bc<d>e would become <a> b c <d> e. I tried (?<!<)(?!>), which seems to work in a regex tester but not in the following code in R. What did I do wrong?
X = '<a>bc<d>e'
Y = '(?<!<)(?!>)'
unlist(strsplit(X,Y,perl=TRUE))
[1] "<" "a" ">" "b" "c" "<" "d" ">" "e"

Use positive lookarounds instead of negative ones:
strsplit('<a>bc<d>e', '(?<=[^<])(?=[^>])', perl=TRUE)
## [[1]]
## [1] "<a>" "b" "c" "<d>" "e"
Details
(?<=[^<]) - a positive lookbehind that requires a char other than < immediately to the left of the current location
(?=[^>]) - a positive lookahead that requires a char other than > immediately to the right of the current location.

(<[^>]+>|\S)
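This seems to work. It first tries to match a pair of angle brackets together with everything they enclose; failing that, it matches a single non-whitespace character. Note that this extracts the tokens by matching rather than by splitting.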
regmatches(X, gregexpr("<[^>]+>|\\S",X))[[1]]
#> [1] "<a>" "b" "c" "<d>" "e"
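The matching approach above extends naturally to a vector of strings, because regmatches() returns one character vector per input. A minimal sketch, assuming a hypothetical helper name tokenize (not part of base R) and adding perl = TRUE so that \S is interpreted by PCRE:

```r
# Hypothetical helper: bracketed groups are kept whole,
# every other non-whitespace character becomes its own token.
tokenize <- function(x) {
  regmatches(x, gregexpr("<[^>]+>|\\S", x, perl = TRUE))
}

tokens <- tokenize(c("<a>bc<d>e", "xy<z>"))
tokens[[1]]
tokens[[2]]
```

An unclosed < (as in "a<b") cannot match the first alternative, so it falls through to the single-character branch and is returned on its own.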

Related

R: strsplit on negative lookaround

Say I need to strsplit caabacb into individual letters except when a letter is followed by a b, thus resulting in "c" "a" "ab" "a" "cb". I tried using the following line, which looks OK on regex tester but does not work in R. What did I do wrong?
strsplit('caabacb','(?!b)',perl=TRUE)
[[1]]
[1] "c" "a" "a" "b" "a" "c" "b"
You could also add a prefix positive lookbehind that matches any character (?<=.). The positive lookbehind (?<=.) would split the string at every character (without removal of characters), but the negative lookahead (?!b) excludes splits where a character is followed by a b:
strsplit('caabacb', '(?<=.)(?!b)', perl = TRUE)
#> [[1]]
#> [1] "c" "a" "ab" "a" "cb"
strsplit() probably needs something to split. You could insert e.g. a ";" with gsub().
strsplit(gsub("(?!^.|b|\\b)", ";", "caabacb", perl=TRUE), ";", perl=TRUE)
# [[1]]
# [1] "c" "a" "ab" "a" "cb"
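Another option, offered here as a sketch rather than as part of the answers above, is to match the tokens instead of splitting: each token is one character followed by any run of b. The pattern .b* is an assumption about the intended rule (a b always stays attached to whatever precedes it).

```r
x <- "caabacb"

# One character, plus every "b" that directly follows it.
tokens <- regmatches(x, gregexpr(".b*", x, perl = TRUE))[[1]]
tokens
```

Because this uses matching, it does not depend on how strsplit() treats zero-length matches.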

Extract substring after the final dot

I want to implement a regex to extract the substring after the final dot.
For example,
a = c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
gsub(".*(\\..*)$", "\\1", a)
The code returns
".d" ".e" "c" ".e" ".z"
How do I modify the code to get
"d" "e" "" "e" "z"
That is, if the string contains a dot, return the part after the final dot (without the dot itself); if the string doesn't contain a dot, return "".
Here is a way to do this using sub without capture groups. We can replace everything up to and including the final dot with an empty string.
a = c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
sub(".*\\.", "", a)
[1] "d" "e" "c" "e" "z"
If you want to return an empty string when the input has no dot, then we can use ifelse with grepl:
input <- "Hello World!"
output <- ifelse(grepl("\\.", input), sub(".*\\.", "", input), "")
The reason for the verbose code above is that sub by default returns the original string unchanged when no match is found. In your case, you want different behavior.
Move the \\. outside the capture group, since you don't want the dot in the result:
sub(".*\\.(.*)", "\\1", a)
#[1] "d" "e" "c" "e" "z"
This will capture everything after the last dot.
For strings where we have no dots, we could check for it using grepl and then extract
ifelse(grepl("\\.", a), sub(".*\\.(.*)", "\\1", a), "")
#[1] "d" "e" "" "e" "z"
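The two cases can also be folded into a single sub() call by adding a second alternative that matches a whole string containing no dot. This is a sketch of that idea, not something taken from the answers above; perl = TRUE is used so the alternation is tried left to right.

```r
a <- c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")

# First alternative: everything up to and including the last dot.
# Second alternative: an entire string that contains no dot at all.
last_part <- sub("^.*\\.|^[^.]*$", "", a, perl = TRUE)
last_part
```

A string that ends in a dot, such as "a.", also yields "" because the first alternative consumes all of it.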

How do zero-width negative lookahead assertions work in R?

The output of
strsplit('abc dcf', split = '(?=c)', perl = T)
is as expected.
However, the output of
strsplit('abc dcf', split = '(?!c)', perl = T)
is
[[1]]
[1] "a" "b" "c" " " "d" "c" "f"
while my expectation is
[[1]]
[1] "a" "b" "c " "d" "cf"
because I thought the string wouldn't be split where the last character of the previous chunk is c. Is my understanding of negative lookahead wrong?
We can try
strsplit('abc dcf', "(?![c ])\\s*\\b", perl=TRUE)
#[[1]]
#[1] "a" "b" "c " "d" "cf"
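As far as I understand ?strsplit, an empty match at the very start of the remaining string makes strsplit() emit a single character and move on, which is why (?!c) ends up splitting at almost every character. A lookbehind states the intended rule ("split after any character that is not c") more directly. The sketch below is an alternative to the pattern above, not a claim about what the original answerer intended:

```r
x <- "abc dcf"

# Split after every character except "c"; a lookbehind can never
# match at the very start, so the one-character fallback is avoided.
parts <- strsplit(x, "(?<=[^c])", perl = TRUE)[[1]]
parts
```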

Regexes work on their own, but not when used together in strsplit

I'm trying to split a string in R using strsplit and a perl regex. The string consists of various alphanumeric tokens separated by periods or hyphens, e.g. "WXYZ-AB-A4K7-01A-13B-J29Q-10". I want to split the string:
wherever a hyphen appears.
wherever a period appears.
between the second and third character of a token that is exactly 3 characters long and consists of 2 digits followed by 1 capital letter, e.g "01A" produces ["01", "A"] (but "012A", "B1A", "0A1", and "01A2" are not split).
For example, "WXYZ-AB-A4K7-01A-13B-J29Q-10" should produce ["WXYZ", "AB", "A4K7", "01", "A", "13", "B", "J29Q", "10"].
My current regex is ((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-] and it works perfectly in this online regex tester.
Furthermore, the two parts of the alternative, ((?<=[-.]\\d{2})(?=[A-Z][-.])) and [.-], both serve to split the string as intended in R, when they are used separately:
#correctly splits on periods and hyphens
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
#correctly splits tokens where a letter follows two digits
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))", perl=T)
[[1]]
[1] "WXYZ-AB-A4K7-01" "A-13" "B-J29Q-10"
But when I try and combine them using an alternative, the second regex stops working, and the string is only split on periods and hyphens:
#only second alternative is used
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
Why is this happening? Is it a problem with my regex, or with strsplit? How can I achieve the desired behavior?
Desired output:
## [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
An alternative that spares you from having to consider how the strsplit algorithm works is to use your original regex with gsub to insert a simple splitting character in all the right places, then use strsplit to do the straightforward splitting.
x <- "XYZ-02-01C-33D-2285"  # example input, inferred from the output shown below
strsplit(
  gsub("((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", "-", x, perl = TRUE),
  "-",
  fixed = TRUE)
#[[1]]
#[1] "XYZ" "02" "01" "C" "33" "D" "2285"
Of course, RichScriven's answer and Wiktor Stribiżew's comment are probably better since they only have one function call.
You may use a consuming version of a positive lookbehind (a match reset operator \K) to make sure strsplit works correctly in R and avoid the problem of using a negative lookbehind inside a positive one.
"(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]"
strsplit("XYZ-02-01C-33D-2285", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "XYZ" "02" "01" "C" "33" "D" "2285"
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
Here, the pattern matches:
(?<![^.-])\d{2}\K(?=[A-Z](?:[.-]|$)) - a sequence of:
(?<![^.-])\d{2} - 2 digits (\d{2}) that are not preceded with a char other than . and - (i.e. that are preceded with . or - or start of string, it is a common trick to avoid alternation inside a lookaround)
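\K - the match reset operator that makes the regex engine discard the text matched so far and go on matching the subsequent subpatterns, if any
(?=[A-Z](?:[.-]|$)) - a positive lookahead that requires an uppercase letter followed by ., - or the end of the string immediately to the right of the current location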
| - or
[.-] - matches . or -.
Thanks to Rich Scriven and Jota I was able to solve the problem. Every time strsplit finds a match, it removes the match and everything to its left before looking for the next match. This means that regexes that rely on lookbehinds may not function as expected when the lookbehind overlaps with a previous match. In my case, the hyphens between tokens were removed upon being matched, meaning that the second regex could not use them to detect the beginning of the token:
#first match found
"WXYZ-AB-A4K7-01A-13B-J29Q-10"
     ^
#match + left removed
"AB-A4K7-01A-13B-J29Q-10"
#further matches found and removed
"01A-13B-J29Q-10"
#second regex fails to match because of missing hyphen in lookbehind:
#((?<=[-.]\\d{2})(?=[A-Z][-.]))
# ^^^^^^^^
"01A-13B-J29Q-10"
#algorithm continues
"13B-J29Q-10"
This was fixed by replacing the [-.] classes that detected the token edges inside the lookarounds with word-boundary anchors (\b), as per Jota's suggestion:
> strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
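The removal behaviour described above can be imitated in a few lines of R. The following is only a sketch of that loop, built on regexpr(); strsplit_sketch is a hypothetical helper name, and the real implementation is written in C and may differ in edge cases.

```r
# Simplified model of strsplit(x, pattern, perl = TRUE) for a single string.
# Hypothetical helper for illustration; not part of base R.
strsplit_sketch <- function(x, pattern) {
  tokens <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)            # 1-based match start, -1 if none
    len <- attr(m, "match.length")    # 0 for an empty match
    if (start == -1L) {
      # No match left: the remainder is the final token.
      tokens <- c(tokens, x)
      break
    }
    if (start == 1L && len == 0L) {
      # Empty match at the very start: emit one character and move on.
      tokens <- c(tokens, substr(x, 1, 1))
      x <- substr(x, 2, nchar(x))
    } else {
      # Emit the text left of the match, then drop it and the match.
      tokens <- c(tokens, substr(x, 1, start - 1))
      x <- substr(x, start + len, nchar(x))
    }
  }
  tokens
}

x <- "WXYZ-AB-A4K7-01A-13B-J29Q-10"
strsplit_sketch(x, "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]")
#> [1] "WXYZ" "AB"   "A4K7" "01A"  "13B"  "J29Q" "10"
```

Each search runs on the shortened remainder only, so a lookbehind cannot see the hyphen that was removed with the previous match. That is the effect shown in the step-by-step trace.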

Substring first character from right

I want to be able to extract the first letter from the right-hand side of each element of a vector:
ABC20
BCD3
B1
AB2222
BX4444
So for the group above I would want C, D, B, B, X. Is there an easy way to do this? I know there is substr and a numindex/charindex, so I think I can use these, but I am not sure exactly how in R.
You can use the stringi package:
stringi::stri_extract_last_regex(x, '[A-Z]')
#[1] "C" "D" "B" "B" "X"
DATA
x <- c('ABC20', 'BCD3', 'B1', 'AB2222', 'BX4444')
Try this:
Your data:
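strings <- c("ABC20", "BCD3", "B1", "AB2222", "BX4444")
Identify position
# gregexpr() returns every digit position; keep only the first one per string
number_pos <- gregexpr(pattern = "[0-9]", strings)
number_first <- unlist(lapply(number_pos, `[[`, 1))
Extraction
# the wanted letter sits immediately before the first digit
substr(strings, number_first - 1, number_first - 1)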
[1] "C" "D" "B" "B" "X"
We can use sub to capture the last upper case letter (([A-Z])) followed by zero or more digits (\\d*) until the end ($) of the string and replace it with the backreference (\\1) of the captured group
sub(".*([A-Z])\\d*$", "\\1", x)
#[1] "C" "D" "B" "B" "X"
data
x <- c("ABC20", "BCD3", "B1", "AB2222", "BX4444")
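For a base R variant that extracts the letter directly instead of replacing around it, a lookahead can pin the match to the last uppercase letter. This is a sketch under the assumption that every element contains at least one uppercase letter; regmatches() silently drops elements with no match, so the result could otherwise be shorter than the input.

```r
x <- c("ABC20", "BCD3", "B1", "AB2222", "BX4444")

# An uppercase letter that is followed only by non-uppercase
# characters up to the end of the string, i.e. the last one.
m <- regexpr("[A-Z](?=[^A-Z]*$)", x, perl = TRUE)
last_letter <- regmatches(x, m)
last_letter
```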

Resources