R: strsplit on negative lookaround

Say I need to strsplit caabacb into individual letters, except that a letter followed by a b stays attached to that b, resulting in "c" "a" "ab" "a" "cb". I tried the following line, which looks OK in a regex tester but does not work in R. What did I do wrong?
strsplit('caabacb','(?!b)',perl=TRUE)
[[1]]
[1] "c" "a" "a" "b" "a" "c" "b"

You can prefix the pattern with a positive lookbehind that matches any character, (?<=.). On its own, (?<=.) splits the string after every character (without removing any characters); adding the negative lookahead (?!b) excludes the split positions that are immediately followed by a b:
strsplit('caabacb', '(?<=.)(?!b)', perl = TRUE)
#> [[1]]
#> [1] "c" "a" "ab" "a" "cb"

strsplit() behaves most predictably when the separator actually consumes characters. One workaround is to insert a delimiter such as ";" with gsub() and then split on it:
strsplit(gsub("(?!^.|b|\\b)", ";", "caabacb", perl=TRUE), ";", perl=TRUE)
# [[1]]
# [1] "c" "a" "ab" "a" "cb"
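As an alternative to splitting, the tokens can be matched directly. The sketch below is not taken from the answers above; it assumes every token is a single character optionally followed by one b, and uses the pattern ".b?" with regmatches() and gregexpr(). For inputs with several consecutive b's, the result may differ from the lookaround-based split.

```r
# Match each token instead of splitting: any single character,
# optionally followed by one "b" (assumption: at most one trailing b per token).
x <- "caabacb"
tokens <- regmatches(x, gregexpr(".b?", x))[[1]]
print(tokens)
#> [1] "c"  "a"  "ab" "a"  "cb"
```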

Related

split by lookaround

I want to split a string into individual chars except those surrounded by < and >. Hence <a>bc<d>e would become <a> b c <d> e. I tried (?<!<)(?!>), which seems to work in a regex tester but not in the following code in R. What did I do wrong?
X = '<a>bc<d>e'
Y = '(?<!<)(?!>)'
unlist(strsplit(X,Y,perl=TRUE))
[1] "<" "a" ">" "b" "c" "<" "d" ">" "e"
Use positive lookarounds instead of negative ones:
strsplit('<a>bc<d>e', '(?<=[^<])(?=[^>])', perl=TRUE)
## [[1]]
## [1] "<a>" "b" "c" "<d>" "e"
Details
(?<=[^<]) - a positive lookbehind that requires a char other than < immediately to the left of the current location
(?=[^>]) - a positive lookahead that requires a char other than > immediately to the right of the current location.
Alternatively, match the tokens instead of splitting on lookarounds:
(<[^>]+>|\S)
This first tries to match a pair of angle brackets together with everything they enclose and, failing that, matches a single non-whitespace character.
regmatches(X, gregexpr("<[^>]+>|\\S",X))[[1]]
#> [1] "<a>" "b" "c" "<d>" "e"
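The matching approach can be wrapped in a small function for reuse. This is a minimal sketch; tokenize_tags is a hypothetical name chosen for illustration. Note that, because the pattern uses \S, whitespace is skipped rather than returned as a token, which is a difference from the strsplit() approach.

```r
# Hypothetical helper: return bracketed groups such as "<a>" as one token
# and every other non-whitespace character as its own token.
tokenize_tags <- function(x) {
  regmatches(x, gregexpr("<[^>]+>|\\S", x))[[1]]
}

print(tokenize_tags("<a>bc<d>e"))
#> [1] "<a>" "b"   "c"   "<d>" "e"
print(tokenize_tags("<a> b c"))   # spaces are not matched by \\S, so they are dropped
#> [1] "<a>" "b"   "c"
```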

Remove specific words and symbol from a df

I have a data frame with 39 rows, structured like this:
text.
"A" OR "B" OR "C"
"C" OR "D" OR "E"
and a "black list" of about 200 words that I want to delete, each beginning and ending with the symbol ". Here is an example:
blackList
"A"
"D"
I want to remove them from the original data frame, obtaining:
text.
OR "B" OR "C"
"C" OR OR "E"
How can I do this? I tried removeWords, but it does not handle the symbol ".
We could create a single pattern by pasting all the blacklisted items together, using "|" as the collapse argument, and then remove every match:
df$text <- gsub(paste0(blacklist$blackList, collapse = "|"), "", df$text)
df
# text
#1 OR "B" OR "C"
#2 "C" OR OR "E"
data
df <- data.frame(text = c('"A" OR "B" OR "C"','"C" OR "D" OR "E"'))
blacklist <- data.frame(blackList = c('"A"', '"D"'))
Alternatively, escape the quotes with a backslash and use gsub() (inside a single-quoted string the backslashes are optional):
gsub('\"A\"', "", '"A" OR "B" OR "C"')
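If blacklist entries might contain regex metacharacters (for example . or +), pasting them into one pattern could match more than intended. The following is a minimal sketch, assuming the same df and blacklist as above, that removes each entry as a literal string with fixed = TRUE and then tidies the leftover whitespace. The variable name cleaned is illustrative only.

```r
df <- data.frame(text = c('"A" OR "B" OR "C"', '"C" OR "D" OR "E"'),
                 stringsAsFactors = FALSE)
blacklist <- data.frame(blackList = c('"A"', '"D"'),
                        stringsAsFactors = FALSE)

cleaned <- df$text
for (word in blacklist$blackList) {
  # fixed = TRUE treats `word` as a literal string, not a regular expression
  cleaned <- gsub(word, "", cleaned, fixed = TRUE)
}
# Trim the ends and collapse the whitespace left behind by the removed words
cleaned <- gsub("\\s+", " ", trimws(cleaned))
print(cleaned)
```

With around 200 entries the loop is still cheap for 39 rows; the single pasted pattern is more compact but relies on the entries being safe to use as regular expressions.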

Why does asterisk wildcard fail with sub() command? [r]

When using the sub() function in R, how do we use an asterisk wildcard to replace all characters after (or before) an indicator?
If we want to remove an underscore and all arbitrary text afterward:
x <- c("a_101", "a_275", "b_133", "b_277")
The following code removes nothing:
sub(pattern = "_*", replacement = "", x = x)
[1] "a_101" "a_275" "b_133" "b_277"
Desired output:
"a" "a" "b" "b"
Why does the wildcard fail?
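In a regular expression, * is a quantifier rather than a wildcard: "_*" means zero or more underscores, so its first match is the empty string at the start of each element, and replacing that with "" changes nothing. With sub(), the pattern has to describe everything you want to replace. Here that is done with a capture group for the letter of interest: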
sub('([a-z])_\\d+', replacement = '\\1', x)
[1] "a" "a" "b" "b"
Using .* (any character, zero or more times) works too:
sub('([a-z])_.*', replacement = '\\1', x)
[1] "a" "a" "b" "b"
And finally more along the lines of what you were thinking:
sub('_.*', replacement = "", x)
[1] "a" "a" "b" "b"
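To see where each pattern actually matches, regexpr() can be used to inspect the first match. This is a small illustrative check, not part of the original answer.

```r
x <- c("a_101", "a_275", "b_133", "b_277")

# "_*" means "zero or more underscores", so the first match is an
# empty match at position 1 of every element.
m <- regexpr("_*", x)
print(as.integer(m))              # start positions of the first match
#> [1] 1 1 1 1
print(attr(m, "match.length"))    # lengths of those matches
#> [1] 0 0 0 0

# "_.*" means "an underscore followed by any characters".
print(sub("_.*", "", x))
#> [1] "a" "a" "b" "b"
```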

How does zero-width negative lookahead assertions work in R? [duplicate]

The output of
strsplit('abc dcf', split = '(?=c)', perl = TRUE)
is as expected.
However, the output of
strsplit('abc dcf', split = '(?!c)', perl = TRUE)
is
[[1]]
[1] "a" "b" "c" " " "d" "c" "f"
while my expectation is
[[1]]
[1] "a" "b" "c " "d" "cf"
because I thought the string would not be split where the next character is c. Is my understanding of negative lookahead wrong?
We can try
strsplit('abc dcf', "(?![c ])\\s*\\b", perl=TRUE)
#[[1]]
#[1] "a" "b" "c " "d" "cf"
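The outputs above are consistent with strsplit() searching only the not-yet-consumed remainder of the string and, when the match is empty and sits at the very start of that remainder, emitting a single character and moving on. The following is a rough emulation of that behaviour, written for illustration only; split_like_strsplit is a hypothetical helper, not part of R, and it is a sketch under that assumption rather than R's actual implementation.

```r
# Hypothetical emulation of how strsplit(perl = TRUE) appears to consume a string.
split_like_strsplit <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {            # no further match: the rest is the last token
      out <- c(out, x)
      break
    }
    end <- start + attr(m, "match.length") - 1L
    if (end > 0L) {
      # Token is everything before the match; drop the match and all to its left
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Empty match at the very start: emit one character and advance by one
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

print(split_like_strsplit("abc dcf", "(?!c)"))
print(split_like_strsplit("abc dcf", "(?![c ])\\s*\\b"))
```

Because each search starts from a fresh remainder, a zero-width pattern such as (?!c) matches immediately at the start of most remainders, which is why the string falls apart into single characters.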

Subsetting on all but empty grep returns empty vector

Suppose I have some character vector, which I'd like to subset to elements that don't match some regular expression. I might use the - operator to remove the subset that grep matches:
> vec <- letters[1:5]
> vec
[1] "a" "b" "c" "d" "e"
> vec[-grep("d", vec)]
[1] "a" "b" "c" "e"
I'm given back everything except the entries that matched "d". But if I search for a regular expression that isn't found, instead of getting everything back as I would expect, I get nothing back:
> vec[-grep("z", vec)]
character(0)
Why does this happen?
It's because grep returns an integer vector, and when there's no match, it returns integer(0).
> grep("d", vec)
[1] 4
> grep("z", vec)
integer(0)
Since the - operator works elementwise and integer(0) has no elements, negating it leaves the vector unchanged:
> -integer(0)
integer(0)
so vec[-grep("z", vec)] evaluates to vec[-integer(0)] which in turn evaluates to vec[integer(0)], which is character(0).
You will get the behavior you expect with invert = TRUE:
> vec[grep("d", vec, invert = TRUE)]
[1] "a" "b" "c" "e"
> vec[grep("z", vec, invert = TRUE)]
[1] "a" "b" "c" "d" "e"
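Another common way to avoid the problem, shown here as a short sketch, is logical indexing with grepl(). It returns one TRUE/FALSE per element, so negating it with ! is safe even when nothing matches.

```r
vec <- letters[1:5]

# grepl() returns a logical vector the same length as vec,
# so !grepl() keeps every element that does not match.
print(vec[!grepl("d", vec)])
#> [1] "a" "b" "c" "e"
print(vec[!grepl("z", vec)])
#> [1] "a" "b" "c" "d" "e"
```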
