How do zero-width negative lookahead assertions work in R? [duplicate]

This question already has answers here:
Why does strsplit use positive lookahead and lookbehind assertion matches differently?
(3 answers)
Closed 6 years ago.
The output of
strsplit('abc dcf', split = '(?=c)', perl = T)
is as expected.
However, the output of
strsplit('abc dcf', split = '(?!c)', perl = T)
is
[[1]]
[1] "a" "b" "c" " " "d" "c" "f"
while my expectation is
[[1]]
[1] "a" "b" "c " "d" "cf"
because I thought the string wouldn't be split if the last character of the previous chunk matches the char c. Is my understanding of negative lookahead wrong?

We can try
strsplit('abc dcf', "(?![c ])\\s*\\b", perl=TRUE)
#[[1]]
#[1] "a" "b" "c " "d" "cf"
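To see where the character-by-character output in the question comes from, here is a rough emulation of the loop that strsplit appears to follow for one string: search the remaining text, keep what precedes the match, drop the match, repeat. The helper emulate_strsplit is hypothetical (not part of base R), and the special handling of an empty match at the very start is an assumption inferred from the observed output, so treat it as a sketch rather than the actual implementation.

```r
# Hypothetical helper (not part of base R): emulates, for a single string,
# the splitting loop that strsplit(perl = TRUE) appears to follow.
emulate_strsplit <- function(x, pattern) {
  pieces <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {
      pieces <- c(pieces, x)          # no match left: the rest is the last piece
      break
    }
    match_end <- start + attr(m, "match.length") - 1L
    if (match_end > 0L) {
      # The match ends past the start: keep the text before it, drop the match.
      pieces <- c(pieces, substr(x, 1L, start - 1L))
      x <- substr(x, match_end + 1L, nchar(x))
    } else {
      # Assumption: an empty match at the very start emits one character.
      pieces <- c(pieces, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  pieces
}

res_neg <- emulate_strsplit("abc dcf", "(?!c)")
res_pos <- emulate_strsplit("abc dcf", "(?=c)")
print(res_neg)   # "a" "b" "c" " " "d" "c" "f", as in the question
print(res_pos)
```

Under this reading, '(?!c)' matches an empty string at almost every position, so nearly every search succeeds right at the start of the remaining text and peels off a single character.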


str_extract in a vector by position [duplicate]

This question already has answers here:
Extracting the nth character from a vector of strings [duplicate]
(2 answers)
Extract first N characters from each string in a vector
(1 answer)
Closed 1 year ago.
I have a vector like this
cod <- c("6W41_CH", "6W41_CL" ,"6WPS_AH", "7C01_BC", "7C01_BD", "7C01_BL", "7C2L_AH", "7C2L_BI", "7C2L_CJ",
"7C8V_BA", "7C8W_BA", "7CAH_AD")
I'm iterating over the vector and, at each step, I'd like to get only the letter in the 6th position of cod[i]. I was trying to use str_extract. How can I do this?
Position-based extraction should be faster and more efficient with substr. We can provide the start and stop positions, which are both 6
substr(cod, 6, 6)
#[1] "C" "C" "A" "B" "B" "B" "A" "B" "C" "B" "B" "A"
or with str_sub
library(stringr)
str_sub(cod, 6, 6)
#[1] "C" "C" "A" "B" "B" "B" "A" "B" "C" "B" "B" "A"
If we need to use str_extract, specify a regex lookbehind so that we extract the character that follows the first 5 characters
str_extract(cod, '(?<=^.{5}).')
#[1] "C" "C" "A" "B" "B" "B" "A" "B" "C" "B" "B" "A"
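Since the question mentions iterating over cod[i], it may help to note that substr is vectorised, so an explicit loop should not be needed. A minimal sketch on a shortened copy of the vector (the names sixth and sixth_loop are only for illustration):

```r
# First three elements of the vector from the question.
cod <- c("6W41_CH", "6W41_CL", "6WPS_AH")

# Vectorised: one call handles every element.
sixth <- substr(cod, 6, 6)

# Equivalent explicit loop, shown only for comparison.
sixth_loop <- character(length(cod))
for (i in seq_along(cod)) {
  sixth_loop[i] <- substr(cod[i], 6, 6)
}

print(sixth)                        # "C" "C" "A"
print(identical(sixth, sixth_loop)) # TRUE
```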

split by lookaround

I want to split a string into individual chars except those surrounded by < and >. Hence <a>bc<d>e would become <a> b c <d> e. I tried (?<!<)(?!>), which seems to work in a regex tester but not in the following code in R. What did I do wrong?
X = '<a>bc<d>e'
Y = '(?<!<)(?!>)'
unlist(strsplit(X,Y,perl=TRUE))
[1] "<" "a" ">" "b" "c" "<" "d" ">" "e"
Use positive lookarounds instead of negative ones:
strsplit('<a>bc<d>e', '(?<=[^<])(?=[^>])', perl=TRUE)
## [[1]]
## [1] "<a>" "b" "c" "<d>" "e"
Details
(?<=[^<]) - a positive lookbehind that requires a char other than < immediately to the left of the current location
(?=[^>]) - a positive lookahead that requires a char other than > immediately to the right of the current location.
(<[^>]+>|\S)
Seems to work. This first tries to match a pair of angle brackets together with everything they enclose and, failing that, matches a single non-whitespace character.
regmatches(X, gregexpr("<[^>]+>|\\S",X))[[1]]
#> [1] "<a>" "b" "c" "<d>" "e"
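As for why the negative lookarounds fail in strsplit but look fine in a regex tester, one likely explanation (an assumption inferred from the observed output, not something the answers above state) is that strsplit searches the remaining text afresh after each piece. At the start of that remaining text there is no preceding character, so a negative lookbehind such as (?<!<) succeeds trivially, while a positive lookbehind such as (?<=[^<]) cannot match there. The sketch below checks this on the text that is left once the leading "<" has been split off:

```r
# Text left over after the first "<" has been split off.
rest <- "a>bc<d>e"

# Negative lookarounds: succeed right at the start (nothing precedes "a").
neg <- regexpr("(?<!<)(?!>)", rest, perl = TRUE)

# Positive lookarounds: need a real character on both sides.
pos <- regexpr("(?<=[^<])(?=[^>])", rest, perl = TRUE)

print(as.integer(neg))           # 1: empty match at the very start
print(attr(neg, "match.length")) # 0
print(as.integer(pos))           # 3: between ">" and "b"
```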

R splitting string based on double-character delimiter [duplicate]

This question already has an answer here:
strsplit with vertical bar (pipe)
(1 answer)
Closed 3 years ago.
I have a string "Test||Test1||test2" that I want to tokenize by ||. However, what I got is always the individual characters (with 2 empty chars at both ends):
"" "T" "e" "s" "t" "1" "|" "|" "T" "e" "s" "t" "2" "|" "|" "T" "e" "s" "t" "3" ""
I have tried both: strsplit(myString, "||") and str_split(myString, "||") from the library tidyverse (from this tutorial, seems like it should work) but got the same incorrect result.
How do I tokenize a string based on a double/multiple-character delimiter?
We can wrap the pattern in fixed(), since | is a regex metacharacter meaning OR
library(stringr)
str_split(myString, fixed("||"))[[1]]
#[1] "Test" "Test1" "test2"
Or another option is to escape each | with \\ (as @joran mentioned in the comments) or place it inside square brackets
data
myString <- "Test||Test1||test2"
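The escaping and square-bracket options are mentioned above without code, so here is a minimal base R sketch of both, plus the fixed = TRUE argument of strsplit as a base counterpart to stringr's fixed(). The variable names are only for illustration.

```r
myString <- "Test||Test1||test2"

# Escape each pipe so it is treated literally.
by_escape <- strsplit(myString, "\\|\\|")[[1]]

# Put the pipe in a character class, where it loses its special meaning.
by_class <- strsplit(myString, "[|]{2}")[[1]]

# Skip regex interpretation altogether.
by_fixed <- strsplit(myString, "||", fixed = TRUE)[[1]]

print(by_escape)  # "Test" "Test1" "test2"
print(by_class)   # "Test" "Test1" "test2"
print(by_fixed)   # "Test" "Test1" "test2"
```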

R: strsplit on negative lookaround

Say I need to strsplit caabacb into individual letters except when a letter is followed by a b, thus resulting in "c" "a" "ab" "a" "cb". I tried using the following line, which looks OK on regex tester but does not work in R. What did I do wrong?
strsplit('caabacb','(?!b)',perl=TRUE)
[[1]]
[1] "c" "a" "a" "b" "a" "c" "b"
You could also add a prefix positive lookbehind that matches any character (?<=.). The positive lookbehind (?<=.) would split the string at every character (without removal of characters), but the negative lookahead (?!b) excludes splits where a character is followed by a b:
strsplit('caabacb', '(?<=.)(?!b)', perl = TRUE)
#> [[1]]
#> [1] "c" "a" "ab" "a" "cb"
strsplit() probably needs something to split. You could insert e.g. a ";" with gsub().
strsplit(gsub("(?!^.|b|\\b)", ";", "caabacb", perl=TRUE), ";", perl=TRUE)
# [[1]]
# [1] "c" "a" "ab" "a" "cb"
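Another way to sidestep the splitting behaviour is to match the pieces instead of the gaps between them. This sketch is an alternative technique, not what the answers above use: it extracts "any character, optionally followed by a b" with regmatches and gregexpr.

```r
x <- "caabacb"

# Each piece is one character plus an optional trailing "b".
pieces <- regmatches(x, gregexpr(".b?", x))[[1]]

print(pieces)  # "c" "a" "ab" "a" "cb"
```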

Why does asterisk wildcard fail with sub() command? [r]

When using the sub() function in R, how do we use an asterisk wildcard to replace all characters after (or before) an indicator?
If we want to remove an underscore and all arbitrary text afterward:
x <- c("a_101", "a_275", "b_133", "b_277")
The following code removes nothing:
sub(pattern = "_*", replacement = "", x = x)
[1] "a_101" "a_275" "b_133" "b_277"
Desired output:
"a" "a" "b" "b"
Why does the wildcard fail?
If using sub, you have to specify everything you want to replace, and what you want to replace it with. Here I've done that using a capture group for the letter of interest.
sub('([a-z])_\\d+', replacement = '\\1', x)
[1] "a" "a" "b" "b"
Using the wild card will work too.
sub('([a-z])_.*', replacement = '\\1', x)
[1] "a" "a" "b" "b"
And finally more along the lines of what you were thinking:
sub('_.*', replacement = "", x)
[1] "a" "a" "b" "b"
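As for why the original pattern removes nothing: in a regular expression, * is a quantifier on the preceding token rather than a shell-style wildcard, so "_*" means "zero or more underscores". That pattern is satisfied by an empty match at the very start of each string, and sub then replaces that empty match with an empty string. A small sketch that inspects the match:

```r
x <- c("a_101", "a_275", "b_133", "b_277")

# Where does "_*" first match, and how long is the match?
m <- regexpr("_*", x)
print(as.integer(m))            # 1 1 1 1: at the start of each string
print(attr(m, "match.length"))  # 0 0 0 0: the match is empty

# "." matches any character, so ".*" is the regex form of "anything".
cleaned <- sub("_.*", "", x)
print(cleaned)                  # "a" "a" "b" "b"
```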
