R Strsplit keep delimiter in second element - r

I have been trying to solve this little issue for almost 2 hours, but without success. I simply want to separate a string by the delimiter: one space followed by any character. In the second element I want to keep the delimiter, whereas in the first element it shall not appear. Example:
x <- "123123 123 A123"
strsplit(x," [A-Z]")
results in:
"123123 123" "A123"
However, this does not keep the letter A in the second element.
I have tried using
strsplit(x,"(?<=[A-Z])",perl=T)
but this does not really work for my issue. It would also be okay, if there is a space in the second element, it just need the character in it.

If you want to follow your approach, you need to match 1+ whitespaces followed (i.e. you need a lookahead here) with a letter to consume the whitespaces:
> strsplit(x,"\\s+(?=[A-Z])",perl=T)
[[1]]
[1] "123123 123" "A123"
See the PCRE regex demo.
Details:
\s+ - 1 or more whitespaces (put into the match value and thus will be removed during splitting)
(?=[A-Z]) - the uppercase ASCII letter must appear immediately to the right of the current location, else fail the match (the letter is not part of the match value, and will be kept in the result)
You may also match up to the last non-whitespace char followed with 1+ whitespaces and use \K match reset operator to discard the match before the whitespace:
> strsplit(x,"^.*\\S\\K\\s+",perl=T)
[[1]]
[1] "123123 123" "A123"
If the string contains line breaks, add a DOTALL flag since a dot in a PCRE regex does not match line breaks by default: "(?s)^.*\\S\\K\\s+".
Details:
^ - start of string
.* - any 0+ chars up to the last occurrence of the subsequent subpatterns (that is, \S\s+)
\\S - a non-whitespace
\\K - here, drop all the text matched so far
\\s+ - 1 or more whitespaces.
See another PCRE regex demo.

I would go with stringi package:
library(stringi)
x <- c("123123 123 A123","34512 321 B521")#some modified input data
l1<-stri_split(x,fixed=" ")
[1] "123123" "123" "A123"
Then:
lapply(seq_along(1:length(l1)), function(x) c(paste0(l1[[x]][1]," ",l1[[x]][2]),l1[[x]][3]))
[[1]]
[1] "123123 123" "A123"
[[2]]
[1] "34512 321" "B521"

Related

How to replace `qux$foo$bar` with `qux[["foo"]][["bar]]`? (dollar subsetting to brackets subsetting)

In a R script (that I would read with readLines), I want to replace every occurence of qux$foo$bar with qux[["foo"]][["bar]]. But I'm not a regex master.
I started with this regex:
> gsub("(\\w*)(\\$)(\\w*)", '\\1[["\\3"]]', "qux$foo$bar; input$test$a$a") %>% cat
qux[["foo"]][["bar"]]; input[["test"]][["a"]][["a"]]
Nice. But I also want to handle the case of backticks. So I tried:
> gsub("(\\w*)(\\$)`{0,1}(\\w*)`{0,1}", '\\1[["\\3"]]', "qux$`foo`; bar$`baz`; x$uvw") %>% cat
qux[["foo"]]; bar[["baz"]]; x[["uvw"]]
Looks correct. But between the backticks, there could be a space, and the previous way does not work in this case. So I tried the following, which neither does not work:
gsub("(\\w*)(\\$)`{0,1}(.*)`{0,1}", '\\1[["\\3"]]', "qux$`fo o`") %>% cat
qux[["fo o`"]]
Could you help to find the right regex pattern? It seems that instead of \\w I need something which means match a "word that can contain spaces".
You can use
gsub('(\\w*)(?|\\$`([^`]*)`|\\$([^\\s$]+))', '\\1[["\\2"]]', x, perl=TRUE)
## Or
gsub('\\$`([^`]*)`|\\$([^\\s$]+)', '[["\\1\\2"]]', x, perl=TRUE)
See the regex #1 demo and regex #2 demo. Details:
(\w*) - Group 1 (\1): zero or more word chars
(?|$`([^`]*)`|$([^\s$]+)) - a branch reset group matching either
$`([^`]*)` - $, backtick, Group 2 (\2) capturing zero or more non-backtick chars, and a backtick.
| - or
$([^\s$]+) - $, then Group 2 (\2) capturing one or more chars other than whitespace and $
See the R demo:
x <- c('qux$foo$bar','qux$foo$bar; input$test$a$a','qux$`foo`; bar$`baz`; x$uvw','qux$`fo o`', 'q_ux$f_o_o$b.a_r')
gsub('(\\w*)(?|\\$`([^`]*)`|\\$([^\\s$]+))', '\\1[["\\2"]]', x, perl=TRUE)
## Or
## gsub('\\$`([^`]*)`|\\$([^\\s$]+)', '[["\\1\\2"]]', x, perl=TRUE)
Output:
[1] "qux[[\"foo\"]][[\"bar\"]]"
[2] "qux[[\"foo\"]][[\"bar;\"]] input[[\"test\"]][[\"a\"]][[\"a\"]]"
[3] "qux[[\"foo\"]]; bar[[\"baz\"]]; x[[\"uvw\"]]"
[4] "qux[[\"fo o\"]]"
[5] "q_ux[[\"f_o_o\"]][[\"b.a_r\"]]"
Note: backslashes in the output are console artifacts to keep the double quoted strings valid string literals, they are not part of the plain text output.
You might repeat optional spaces before and after matching 1 or more word characters.
You don't need a capture group for the $ but instead you could use a capture group to pair up the backtick in case it is there or not using a backreference to group 2.
To repeat 0+ whitespace chars you can also use \s but that could also match a newline.
Note that \w* matches optional word chars, and {0,1} can be written as ?
(\w*)\$(`?)( *\w+(?: +\w+)* *)\2
The pattern matches:
(\w*) Capture group 1 Match optional word characters
\$ Match $
(`?) Capture group 2, optionally match a backtick
( *\w+(?: +\w+)* *) Capture group 3 Match repetitions of word characters between spaces
\2 Backreference to what is captured in group 2 (yes or no backtick)
Regex demo
gsub("(\\w*)\\$(`?)( *\\w+(?: +\\w+)* *)\\2", '\\1[["\\3"]]', "qux$fo o$bar", perl=TRUE)
Output
[1] "qux[[\"fo o\"]][[\"bar\"]]"

R regex - extract words beginning with # symbol

I'm trying to extract twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so
library(stringr)
# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")
[[1]]
character(0)
[[2]]
[1] "Ahello" "Ame"
Great. Now let's try the same thing using "#" instead of "A"
str_extract_all(c("h#i", "hi #hello #me"), "(?<=\\b)\\#[^\\s]+")
[[1]]
[1] "#i"
[[2]]
character(0)
Why does this example give the opposite result that I was expecting and how can I fix it?
It looks like you probably mean
str_extract_all(c("h#i", "hi #hello #me", "#twitter"), "(?<=^|\\s)#[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "#hello" "#me"
# [[3]]
# [1] "#twitter"
The \b in a regular expression is a boundary and it occurs "Between two characters in the string, where one is a word character and the other is not a word character." see here. Since an space and "#" are both non-word characters, there is no boundary before the "#".
With this revision you match either the start of the string or values that come after spaces.
A couple of things about your regex:
(?<=\b) is the same as \b because a word boundary is already a zero width assertion
\# is the same as #, as # is not a special regex metacharacter and you do not have to escape it
[^\s]+ is the same as \S+, almost all shorthand character classes have their negated counterparts in regex.
So, your regex, \b#\S+, matches #i in h#i because there is a word boundary between h (a letter, a word char) and # (a non-word char, not a letter, digit or underscore). Check this regex debugger.
\b is an ambiguous pattern whose meaning depends on the regex context. In your case, you might want to use \B, a non-word boundary, that is, \B#\S+, and it will match # that are either preceded with a non-word char or at the start of the string.
x <- c("h#i", "hi #hello #me")
regmatches(x, gregexpr("\\B#\\S+", x))
## => [[1]]
## character(0)
##
## [[2]]
## [1] "#hello" "#me"
See the regex demo.
If you want to get rid of this \b/\B ambiguity, use unambiguous word boundaries using lookarounds with stringr methods or base R regex functions with perl=TRUE argument:
regmatches(x, gregexpr("(?<!\\w)#\\S+", x, perl=TRUE))
regmatches(x, gregexpr("(?<!\\S)#\\S+", x, perl=TRUE))
where:
(?<!\w) - an unambiguous starting word boundary - is a negative lookbehind that makes sure there is a non-word char immediately to the left of the current location or start of string
(?<!\S) - a whitespace starting word boundary - is a negative lookbehind that makes sure there is a whitespace char immediately to the left of the current location or start of string.
See this regex demo and another regex demo here.
Note that the corresponding right hand boundaries are (?!\w) and (?!\S).
The answer above should suffice. This will remove the # symbol in case you are trying to get the users' names only.
str_extract_all(c("#tweeter tweet", "h#is", "tweet #tweeter2"), "(?<=\\B\\#)[^\\s]+")
[[1]]
[1] "tweeter"
[[2]]
character(0)
[[3]]
[1] "tweeter2"
While I am no expert with regex, it seems like the issue may be that the # symbol does not correspond to a word character, and thus matching the empty string at the beginning of a word (\\b) does not work because there is no empty string when # is preceding the word.
Here are two great regex resources in case you hadn't seen them:
stat545
Stringr's Regex page, also available as a vignette:
vignette("regular-expressions", package = "stringr")

Can't figure out why regex group is not working in str_match

I have the following code with a regex
CHARACTER <- ^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$
str_match("WILL (V.O.)",CHARACTER)[1,2]
I thought this should match the value of "WILL " but it is returning blank.
I tried the RegEx in a different language and the group is coming back blank in that instance also.
What do I have to add to this regex to pull back just the value "WILL"?
You formed a repeated capturing group by placing + outside a group. Put it back:
CHARACTER <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
^
Note you may trim Will if you use a lazy match with \s* after the group:
CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
See this regex demo.
> library(stringr)
> CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> str_match("WILL (V.O.)",CHARACTER)[1,2]
[1] "WILL"
Alternatively, you may just extract Will with
> str_extract("WILL (V.O.)", "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)")
[1] "WILL"
Or the same with base R:
> regmatches(x, regexpr("^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)", x, perl=TRUE))
[1] "WILL"
Here,
^ - matches the start of a string
.*? - any 0+ chars other than line break chars as few as possible
(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$) - makes sure that, immediately to the right of the current location, there is
\\s* - 0+ whitespaces
(?:\\(V\\.O\\.\\))? - an optional (V.O.) substring
(?:\\(O\\.S\\.\\))? - an optional (O.S.) substring
(?:\\(CONT'D\\))? - an optional (CONT'D) substring
$ - end of string.

R / stringr: split string, but keep the delimiters in the output

I tried to search for the solution, but it appears that there is no clear one for R.
I try to split the string by the pattern of, let's say, space and capital letter and I use stringr package for that.
x <- "Foobar foobar, Foobar foobar"
str_split(x, " [:upper:]")
Normally I would get:
[[1]]
[1] "Foobar foobar," "oobar foobar"
The output I would like to get, however, should include the letter from the delimiter:
[[1]]
[1] "Foobar foobar," "Foobar foobar"
Probably there is no out of box solution in stringr like back-referencing, so I would be happy to get any help.
You may split with 1+ whitespaces that are followed with an uppercase letter:
> str_split(x, "\\s+(?=[[:upper:]])")
[[1]]
[1] "Foobar foobar," "Foobar foobar"
Here,
\\s+ - 1 or more whitespaces
(?=[[:upper:]]) - a positive lookahead (a non-consuming pattern) that only checks for an uppercase letter immediately to the right of the current location in string without adding it to the match value, thus, preserving it in the output.
Note that \s matches various whitespace chars, not just plain regular spaces. Also, it is safer to use [[:upper:]] rather than [:upper:] - if you plan to use the patterns with other regex engines (like PCRE, for example).
We could use a regex lookaround to split at the space between a , and upper case character
str_split(x, "(?<=,) (?=[A-Z])")[[1]]
#[1] "Foobar foobar," "Foobar foobar"

extract text between certain characters in R

I need to capture TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer] from the following string, basically from - to # sign.
i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
I've tried this:
str_match(i, ".*-([^\\.]*)\\#.*")[,2]
I am getting NA, any ideas?
1) gsub Replace everything up to and including -, i.e. .* -, and everything after and including #, i.e. #.*, with a zero length string. No packages are needed:
gsub(".* - |#.*", "", i)
## "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
2) sub This would also work. It matches everything to space, minus, space (i.e. .* -) and then captures everything until # (i.e. (.*)# ) followed by whatever is left (.*) and replaces that with the capture group, i.e. the part within parens. It also uses no packages.
sub(".*- (.*)#.*", "\\1", i)
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Note: We used this as input i:
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"
The following should work:
extract <- unlist(strsplit(i,"- |#"))[2]
You may use
-\s*([^#]+)
See the regex demo
Details:
- - a hyphen
\s* - zero or more whitespaces
([^#]+) - Group 1 capturing 1 or more chars other than #.
R demo:
> library(stringr)
> i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
> str_match(i, "-\\s*([^#]+)")[,2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
The same pattern can be used with base R regmatches/regexec:
> regmatches(i, regexec("-\\s*([^#]+)", i))[[1]][2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
If you prefer a replacing approach you may use a sub:
> sub(".*?-\\s*([^#]+).*", "\\1", i)
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Here, .*? matches any 0+ chars, as few as possible, up to the first -, then -, 0+ whitespaces (\\s*), then 1+ chars other than # are captured into Group 1 (see ([^#]+)) and then .* matches the rest of the string. The \1 in the replacement pattern puts the contents of Group 1 back into the replacement result.

Resources