R / stringr: split string, but keep the delimiters in the output

I searched for a solution, but there does not seem to be a clear one for R.
I am trying to split a string on a pattern of, say, a space followed by a capital letter, and I am using the stringr package for that.
x <- "Foobar foobar, Foobar foobar"
str_split(x, " [:upper:]")
Normally I would get:
[[1]]
[1] "Foobar foobar," "oobar foobar"
The output I would like to get, however, should include the letter from the delimiter:
[[1]]
[1] "Foobar foobar," "Foobar foobar"
There is probably no out-of-the-box solution in stringr, such as back-referencing, so I would be happy to get any help.

You may split on 1+ whitespace chars that are followed by an uppercase letter:
> str_split(x, "\\s+(?=[[:upper:]])")
[[1]]
[1] "Foobar foobar," "Foobar foobar"
Here,
\\s+ - 1 or more whitespaces
(?=[[:upper:]]) - a positive lookahead (a non-consuming pattern) that only checks for an uppercase letter immediately to the right of the current location in string without adding it to the match value, thus, preserving it in the output.
Note that \s matches various whitespace chars, not just plain regular spaces. Also, it is safer to use [[:upper:]] rather than [:upper:] - if you plan to use the patterns with other regex engines (like PCRE, for example).
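The same idea can be sketched in base R, assuming strsplit with perl=TRUE is acceptable in place of stringr. The variable names consumed and kept are only illustrative. The first call uses a consuming pattern and loses the letter; the second uses the lookahead and keeps it.

```r
x <- "Foobar foobar, Foobar foobar"

# Consuming pattern: the uppercase letter is part of the delimiter and is lost.
consumed <- strsplit(x, " [[:upper:]]")[[1]]

# Lookahead pattern: only the whitespace is consumed, so the letter survives.
kept <- strsplit(x, "\\s+(?=[[:upper:]])", perl = TRUE)[[1]]

print(consumed)  # "Foobar foobar," "oobar foobar"
print(kept)      # "Foobar foobar," "Foobar foobar"
```

Lookarounds are not supported by the default regex engine of strsplit, which is why perl=TRUE is needed here.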

We could use regex lookarounds to split at the space between a comma and an uppercase character:
str_split(x, "(?<=,) (?=[A-Z])")[[1]]
#[1] "Foobar foobar," "Foobar foobar"

Related

Why does stringr::str_replace() match every character in a string as "."?

library(stringr)
y4=c("yes i do")
str_replace_all(y4,".","_")
[1] "________"
str_replace_all(y4," ","_")
[1] "yes_i_do"
y4=c("yes i do.")
str_replace_all(y4," ","_")
[1] "yes_i_do."
If you attempt to replace "." in a string, every character is replaced.
stringr uses regular expressions (regex) by default. In a regex, . is a wildcard that matches any character except a newline. To match a literal . you have to escape it with a backslash, i.e. \. in regex terms. Because R also treats the backslash as an escape character inside string literals, you need a second backslash to escape the first, so the pattern is written as \\.
For your example:
library(stringr)
y4 <- c("yes i do.") #added a period so we can see the replacement.
str_replace_all(y4,"\\.","_")
[1] "yes i do_"
Alternatively, to match a fixed string without regex syntax, you could use:
str_replace_all(y4, fixed("."),"_")
[1] "yes i do_"
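For comparison, here is a minimal base R sketch of the same escaping options, assuming gsub is an acceptable stand-in for str_replace_all. The variable names are illustrative only.

```r
y4 <- "yes i do."

# Unescaped: . is a wildcard, so every character is replaced.
as_wildcard <- gsub(".", "_", y4)

# Three ways to match a literal period.
escaped   <- gsub("\\.", "_", y4)              # regex \. written as an R string
bracketed <- gsub("[.]", "_", y4)              # . is literal inside brackets
literal   <- gsub(".", "_", y4, fixed = TRUE)  # no regex interpretation at all

print(escaped)  # "yes i do_"

# What the regex engine actually receives: two characters, \ and .
writeLines("\\.")
```

The bracketed form avoids backslashes entirely, which some people find easier to read.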

R regex - extract words beginning with # symbol

I'm trying to extract twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so
library(stringr)
# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")
[[1]]
character(0)
[[2]]
[1] "Ahello" "Ame"
Great. Now let's try the same thing using "#" instead of "A"
str_extract_all(c("h#i", "hi #hello #me"), "(?<=\\b)\\#[^\\s]+")
[[1]]
[1] "#i"
[[2]]
character(0)
Why does this example give the opposite of the result I was expecting, and how can I fix it?
It looks like you probably mean
str_extract_all(c("h#i", "hi #hello #me", "#twitter"), "(?<=^|\\s)#[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "#hello" "#me"
# [[3]]
# [1] "#twitter"
The \b in a regular expression is a word boundary, and it occurs "Between two characters in the string, where one is a word character and the other is not a word character." Since a space and "#" are both non-word characters, there is no boundary before the "#".
With this revision you match either the start of the string or values that come after spaces.
A couple of things about your regex:
(?<=\b) is the same as \b because a word boundary is already a zero width assertion
\# is the same as #, as # is not a special regex metacharacter and you do not have to escape it
[^\s]+ is the same as \S+, almost all shorthand character classes have their negated counterparts in regex.
So, your regex, \b#\S+, matches #i in h#i because there is a word boundary between h (a letter, a word char) and # (a non-word char, i.e. not a letter, digit or underscore).
\b is an ambiguous pattern whose meaning depends on the regex context. In your case, you might want to use \B, a non-word boundary, that is, \B#\S+, and it will match # that are either preceded with a non-word char or at the start of the string.
x <- c("h#i", "hi #hello #me")
regmatches(x, gregexpr("\\B#\\S+", x))
## => [[1]]
## character(0)
##
## [[2]]
## [1] "#hello" "#me"
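To see where \b and \B succeed in front of a #, here is a small sketch, assuming the PCRE engine (perl=TRUE). The three test strings are made up for illustration.

```r
s <- c("h#i", " #i", "#i")

# \b# needs a word char on the left of #, so only "h#i" qualifies.
word_boundary <- grepl("\\b#", s, perl = TRUE)

# \B# needs a non-word char, or the start of the string, on the left of #.
non_word_boundary <- grepl("\\B#", s, perl = TRUE)

print(word_boundary)      # TRUE FALSE FALSE
print(non_word_boundary)  # FALSE TRUE TRUE
```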
If you want to get rid of this \b/\B ambiguity, use unambiguous word boundaries using lookarounds with stringr methods or base R regex functions with perl=TRUE argument:
regmatches(x, gregexpr("(?<!\\w)#\\S+", x, perl=TRUE))
regmatches(x, gregexpr("(?<!\\S)#\\S+", x, perl=TRUE))
where:
(?<!\w) - an unambiguous starting word boundary - is a negative lookbehind that makes sure there is a non-word char immediately to the left of the current location or start of string
(?<!\S) - a whitespace starting word boundary - is a negative lookbehind that makes sure there is a whitespace char immediately to the left of the current location or start of string.
Note that the corresponding right hand boundaries are (?!\w) and (?!\S).
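The difference between the two left-hand boundaries, and the effect of adding a right-hand one, can be sketched as follows. The input strings and variable names are made up for illustration, and the third pattern uses \w+ instead of \S+ so that the right-hand boundary has something to reject.

```r
x <- "see,#tag and #other"

# (?<!\w): a comma on the left is fine, because it is not a word char.
after_non_word <- regmatches(x, gregexpr("(?<!\\w)#\\S+", x, perl = TRUE))[[1]]

# (?<!\S): only whitespace or the start of the string is allowed on the left.
after_whitespace <- regmatches(x, gregexpr("(?<!\\S)#\\S+", x, perl = TRUE))[[1]]

print(after_non_word)    # "#tag" "#other"
print(after_whitespace)  # "#other"

# Adding the right-hand boundary (?!\S) rejects "#a", which is followed by a comma.
y <- "#a,b #ok"
whole_tokens <- regmatches(y, gregexpr("(?<!\\S)#\\w+(?!\\S)", y, perl = TRUE))[[1]]
print(whole_tokens)      # "#ok"
```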
The answer above should suffice. This will remove the # symbol in case you are trying to get the users' names only.
str_extract_all(c("#tweeter tweet", "h#is", "tweet #tweeter2"), "(?<=\\B\\#)[^\\s]+")
[[1]]
[1] "tweeter"
[[2]]
character(0)
[[3]]
[1] "tweeter2"
While I am no expert with regex, the issue seems to be that the # symbol is not a word character, so a word boundary (\\b) cannot match between a space and #: a boundary only exists where a word character meets a non-word character.
Here are two great regex resources in case you hadn't seen them:
stat545
Stringr's Regex page, also available as a vignette:
vignette("regular-expressions", package = "stringr")

gsub regex in R - ignore newline symbol

Here's a reproducible example
S0 <- "\n3 4 5"
S1 <- "\n3 5"
I want to use gsub and the following regex pattern (outside of R it works - tested in regex101) to return the digits. This regex should ignore \ and n whether they occur together or not.
([^\\n])(\s{1})?
I am not looking for a way to match the digits with a fundamentally different pattern - I'd like to know how to get the above pattern to work in R. The following do not work for me
gsub("([^\\\n])(\\s{1})?", "\\1", S0)
gsub("([^[\\\]n])(\\s{1})?", "\\1", S1)
The output should be
#S0 - 345
#S1 - 3 5
Since you specifically want that regex to work, you could match an optional \n (using (\n)?):
gsub("(\n)?([^\\n])(\\s{1})", "\\2", S0)
#[1] "345"
gsub("(\n)?([^\\n])(\\s{1})", "\\2", S1)
#[1] "3 5"
Note that you were right, if you use a regex tester like: https://regex101.com/ it works without the extra "(\n)?". However, I think in R you have to match more for capture groups to work properly.
Your ([^\\n])(\s{1})? pattern in regex101 (PCRE) matches different strings than the same pattern used in gsub without perl=TRUE (that is, when it is handled by the TRE regex library). They would work the same if you used perl=TRUE and use gsub("([^\\\\n])(\\s{1})?", "\\1", S1, perl=TRUE).
What is so peculiar about the PCRE regex ([^\\n])(\s{1})?
This pattern in a regex tester with PCRE option matches:
([^\\n]) - any char other than \ and n (put into Group 1)
(\s{1})? - matches and captures into Group 2 any single whitespace char, optionally, 1 or 0 times.
Note this pattern does not match any non-newline char with the first capturing group, it would match any non-newline if it were [^\n].
Now, the same regex with gsub will be
gsub("([^\n])(\\s{1})?", "\\1", S1) # OR
gsub("([^\\\\n])(\\s{1})?", "\\1", S1, perl=TRUE)
Why different number of backslashes? Because the first regex is handled with TRE regex library and in these patterns, inside bracket expressions, no regex escapes are parsed as such, the \ and n are treated as 2 separate chars. In a PCRE pattern, the one with perl=TRUE, the [...] are called character classes and inside them, you can define regex escapes, and thus the \ regex escape char should be doubled (that is, inside the R string literal it should be quadrupled as you need a \ to escape \ for the R engine to "see" a backslash).
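The PCRE half of this can be checked with a short sketch. It only exercises perl=TRUE and makes no claim about how TRE treats the same patterns. The test string s is made up so that it contains a backslash, an n, and a real newline.

```r
s <- "a\\n\nb"   # five characters: a, backslash, n, LF, b
print(nchar(s))  # 5

# R string "[^\\\\n]" is the regex [^\\n]: anything except a backslash or n.
keep_backslash_n <- gsub("[^\\\\n]", "", s, perl = TRUE)

# R string "[^\\n]" is the regex [^\n]: anything except a newline.
keep_newline <- gsub("[^\\n]", "", s, perl = TRUE)

print(nchar(keep_backslash_n))  # 2 (the backslash and the n)
print(nchar(keep_newline))      # 1 (the LF)
```

Printing a pattern with writeLines is a quick way to see what the regex engine receives after R has processed the string literal.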
Actually, if you want to match a newline, you just need to use \n in the regex pattern, you may either use "\n" or "\\n" as both TRE and PCRE regex engines parse LF and a \n regex escape as a newline char matching pattern. These four are equivalent:
gsub("\n([^\n])(\\s{1})?", "\\1", S1)
gsub("\\n([^\n])(\\s{1})?", "\\1", S1)
gsub("\n([^\\\\n])(\\s{1})?", "\\1", S1, perl=TRUE)
gsub("\\n([^\\\\n])(\\s{1})?", "\\1", S1, perl=TRUE)
If the \n must be optional, just add ? quantifier after it, no need wrapping it with a group:
gsub("\n?([^\n])(\\s{1})?", "\\1", S1)
And simplifying it further:
gsub("\n?([^\n])\\s?", "\\1", S1)
And also, if by [^\n] you want to match any char but a newline, just use . with (?n) inline modifier:
gsub("(?n)(.)(\\s{1})?", "\\1", S1)
A couple of issues. There is no backslash in your S0 object (the \ in "\n" is an escape operator rather than a character), and there is a predefined digit character class that can be negated:
gsub("[^[:digit:]]", "", S0)
[1] "345"
If, on the other hand, you wanted to exclude the newline character and the spaces, it would be done by removing one of the escape operators (writing "\n" rather than "\\n"), since they are not needed except for the small group of special characters that exist in the character class context:
gsub("[\n ]", "", S0)
[1] "345"

regular expression to find exact matching containing a space and a punctuation

I am going through a dataset containing text values (names) that are formatted like this example :
M.Joan (13-2)
A.Alfred (20-13)
F.O'Neil (12-231)
D.Dan Fun (23-3)
T.Collins (51-82) J.Maddon (12-31)
Some strings have two names in it like
M.Joan (13-2) A.Alfred (20-13)
I only want to extract the name from the string.
Some names are easy to extract because they don't have spaces or anything.
However some are hard because they have a space like the last one above.
name_pattern = "[A-Z][.][^ (]{1,}"
base <- str_extract_all(baseball1$Managers, name_pattern)
When I use this code to extract the names, it works well even for names with spaces or punctuations. However, the extracted names have a space at the end. I was wondering if I can find the exact pattern of " (", a space and a parenthesis.
Output:
[[1]]
[1] "Z.Taylor "
[[2]]
[1] "Z.Taylor "
[[3]]
[1] "Z.Taylor "
[[4]]
[1] "Z.Taylor "
[[5]]
[1] "Y.Berra "
[[6]]
[1] "Y.Berra "
You may use
x <- c("M.Joan (13-2) ", "A.Alfred (20-13)", "F.O'Neil (12-231)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)", "T.Hillman (12-34) and N.Yost (23-45)")
regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl=TRUE))
Or the str_extract_all version:
str_extract_all(baseball1$Managers, "\\p{Lu}.*?(?=\\s*\\()")
It matches
\p{Lu} - an uppercase letter
.*? - any 0+ chars other than line break chars, as few as possible, up to (but not including) the first position where the lookahead below succeeds; (?=...) is a non-consuming construct, so the text it checks is not added to the match
(?=\\s*\\() - positive lookahead that, immediately to the right of the current location, requires the presence of:
\\s* - 0+ whitespace chars
\\( - a literal (.
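Applied to a few of the sample strings, the pattern behaves as sketched below. The vector is a shortened, illustrative subset of the data above, and managers is a hypothetical variable name.

```r
x <- c("M.Joan (13-2) ",
       "D.Dan Fun (23-3)",
       "T.Collins (51-82) J.Maddon (12-31)")

# Lazy match from an uppercase letter up to, but not including,
# the optional whitespace and the opening parenthesis.
managers <- regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl = TRUE))

print(managers[[1]])  # "M.Joan"
print(managers[[2]])  # "D.Dan Fun"
print(managers[[3]])  # "T.Collins" "J.Maddon"
```

Because \s* sits inside the lookahead, the space before the parenthesis is never part of the match, so no trimming is needed afterwards.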

R Strsplit keep delimiter in second element

I have been trying to solve this little issue for almost 2 hours, but without success. I simply want to separate a string by the delimiter: one space followed by any character. In the second element I want to keep the delimiter, whereas in the first element it shall not appear. Example:
x <- "123123 123 A123"
strsplit(x," [A-Z]")
results in:
"123123 123" "A123"
However, this does not keep the letter A in the second element.
I have tried using
strsplit(x,"(?<=[A-Z])",perl=T)
but this does not really work for my issue. It would also be okay, if there is a space in the second element, it just need the character in it.
If you want to follow your approach, you need to match (and thus consume) 1+ whitespaces that are followed by a letter; the letter is checked with a lookahead so that it is not consumed:
> strsplit(x,"\\s+(?=[A-Z])",perl=T)
[[1]]
[1] "123123 123" "A123"
Details:
\s+ - 1 or more whitespaces (put into the match value and thus will be removed during splitting)
(?=[A-Z]) - the uppercase ASCII letter must appear immediately to the right of the current location, else fail the match (the letter is not part of the match value, and will be kept in the result)
You may also match up to the last non-whitespace char followed with 1+ whitespaces and use \K match reset operator to discard the match before the whitespace:
> strsplit(x,"^.*\\S\\K\\s+",perl=T)
[[1]]
[1] "123123 123" "A123"
If the string contains line breaks, add a DOTALL flag since a dot in a PCRE regex does not match line breaks by default: "(?s)^.*\\S\\K\\s+".
Details:
^ - start of string
.* - any 0+ chars up to the last occurrence of the subsequent subpatterns (that is, \S\s+)
\\S - a non-whitespace
\\K - here, drop all the text matched so far
\\s+ - 1 or more whitespaces.
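One way to see exactly which text each pattern consumes is to replace the match with a visible marker instead of splitting on it. This is only a sketch; the | marker and the variable names are arbitrary choices.

```r
x <- "123123 123 A123"

# \K discards everything matched before it, so only the final run of
# whitespace is replaced.
via_reset <- sub("^.*\\S\\K\\s+", "|", x, perl = TRUE)

# The lookahead variant consumes the same whitespace and leaves the letter alone.
via_lookahead <- sub("\\s+(?=[A-Z])", "|", x, perl = TRUE)

print(via_reset)      # "123123 123|A123"
print(via_lookahead)  # "123123 123|A123"
```

Both patterns pick the same delimiter for this input. They differ on other inputs: the \K version always targets the last whitespace run that follows a non-whitespace char, whatever comes after it, while the lookahead version targets any whitespace run followed by an uppercase ASCII letter.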
I would go with the stringi package:
library(stringi)
x <- c("123123 123 A123","34512 321 B521")#some modified input data
l1<-stri_split(x,fixed=" ")
[1] "123123" "123" "A123"
Then:
lapply(l1, function(p) c(paste(p[1], p[2]), p[3])) # rejoin the first two pieces, keep the third
[[1]]
[1] "123123 123" "A123"
[[2]]
[1] "34512 321" "B521"
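If a dependency on stringi is not wanted, a similar result can be sketched in base R by splitting only at the last space of each string. This is a different technique from the one above, and it assumes single spaces as separators; parts is an illustrative name.

```r
x <- c("123123 123 A123", "34512 321 B521")

# Split on a space that is followed only by non-space chars up to the end,
# i.e. the last space in each string.
parts <- strsplit(x, " (?=[^ ]*$)", perl = TRUE)

print(parts[[1]])  # "123123 123" "A123"
print(parts[[2]])  # "34512 321" "B521"
```

Unlike the rejoin step above, this does not assume that each string has exactly three space-separated pieces.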

Resources