How to strip down comma-separated strings to unique substrings

I'm struggling to strip down comma-separated strings to unique substrings in a clean fashion:
x <- c("Anna & x, Anna & x", #
"Alb, Berta 222, Alb",
"Al Pacino",
"Abb cd xy, Abb cd xy, C123, C123, B")
I seem to be doing fine with this combination of a negated character class, a negative lookahead and a backreference; however, what bothers me is that many of the substrings contain unwanted whitespace:
library(stringr)
str_extract_all(x, "([^,]+)(?!.*\\1)")
[[1]]
[1] " Anna & x"
[[2]]
[1] " Berta 222" " Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] " Abb cd xy" " C123" " B"
How can the pattern be refined so that no unwanted whitespace gets extracted?
Desired result:
#> [[1]]
#> [1] "Anna & x"
#> [[2]]
#> [1] "Alb" "Berta 222"
#> [[3]]
#> [1] "Al Pacino"
#> [[4]]
#> [1] "Abb cd xy" "C123" "B"
EDIT:
Just wanted to share this solution with double negative lookahead, which also works well (and thanks for the many useful solutions proposed!)
str_extract_all(x, "((?!\\s)[^,]+)(?!.*\\1)")

You can use str_split to get the individual substrings, followed by unique to remove repeated strings. For example:
library(tidyverse)
str_split(x, ", ?") %>% map(unique)
[[1]]
[1] "Anna & x"
[[2]]
[1] "Alb" "Berta 222"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"
If you want the output as a single vector of unique strings, you could do:
unique(unlist(str_split(x, ", ?")))
[1] "Anna & x" "Alb" "Berta 222" "Al Pacino" "Abb cd xy" "C123" "B"
In the code above we used the regex ", ?" to split at a comma or a comma followed by a space so that we don't end up with a leading space. For future reference, if you do need to get rid of leading or trailing whitespace, you can use str_trim. For example, if we had used "," in str_split we could do the following:
str_split(x, ",") %>%
map(str_trim) %>%
map(unique)
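For readers without the tidyverse, the same split, trim and deduplicate idea can be sketched in base R. This is only an illustration: it assumes x is the vector from the question, and unique_parts is a name introduced here rather than anything from the answers.

```r
# The vector from the question
x <- c("Anna & x, Anna & x",
       "Alb, Berta 222, Alb",
       "Al Pacino",
       "Abb cd xy, Abb cd xy, C123, C123, B")

# Split on literal commas, trim surrounding whitespace, then drop repeats.
# unique() keeps the first occurrence, so the original order is preserved.
unique_parts <- lapply(strsplit(x, ",", fixed = TRUE),
                       function(parts) unique(trimws(parts)))
unique_parts[[2]]
# [1] "Alb"       "Berta 222"
```

Because unique keeps the first occurrence of each item, the order matches the desired result in the question ("Alb" before "Berta 222"), whereas the lookahead-based patterns keep the last occurrence.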

Change your pattern to the one below:
str_extract_all(x, "(\\b[^,]+)(?!.*\\1)")
[[1]]
[1] "Anna & x"
[[2]]
[1] "Berta 222" "Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"

You need to start matching from a char other than a whitespace and a comma, then optionally match any zero or more chars other than a comma up to a char other than whitespace and a comma:
str_extract_all(x, "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)")
See the regex demo and an R demo online. Mind that if your strings contain line breaks, you need to prepend the pattern with (?s): str_extract_all(x, "(?s)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)").
If you need to make it case insensitive (e.g. Abb cd xy and ABB cD Xy are considered duplicates), add the i flag: str_extract_all(x, "(?i)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)") (or str_extract_all(x, "(?si)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)") if the DOTALL behavior is needed).
Details:
([^\s,](?:[^,]*[^\s,])?) - Group 1:
  [^\s,] - a char other than whitespace and a comma
  (?:[^,]*[^\s,])? - an optional sequence of
    [^,]* - zero or more chars other than a comma
    [^\s,] - a char other than whitespace and a comma
(?!.*\1) - a negative lookahead that fails the match if there are zero or more chars, as many as possible, followed with the Group 1 value.
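One consequence of the lookahead described above is worth keeping in mind: it checks whether the captured text occurs anywhere later in the string, not whether the same whole item occurs again. The sketch below runs the pattern through base R (regmatches/gregexpr with perl = TRUE, swapped in for str_extract_all so it runs without stringr) and compares it with a plain split. The helper extract_all and the input y are made up for this illustration.

```r
pattern <- "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)"

# Base R stand-in for str_extract_all(); perl = TRUE is needed for the lookahead
extract_all <- function(s) regmatches(s, gregexpr(pattern, s, perl = TRUE))

extract_all("Alb, Berta 222, Alb")[[1]]
# [1] "Berta 222" "Alb"

# Hypothetical input: "Al" is a distinct item, but also a prefix of "Alb"
y <- "Al, Alb"
extract_all(y)[[1]]
# [1] "Alb"

# Splitting and deduplicating treats the items as whole units
unique(trimws(strsplit(y, ",", fixed = TRUE)[[1]]))
# [1] "Al"  "Alb"
```

If items can be substrings of later items, the split-based approach is probably the safer choice.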

Not exactly what you asked for, but NLP frameworks can be helpful when the problems get more complex.
library(tidytext)
library(dplyr)
library(tibble)
tibble(text = x) %>%
rowid_to_column("stringid") %>%
unnest_regex(substring, text, pattern = ",", to_lower = FALSE) %>%
distinct(stringid, substring = trimws(substring))
# # A tibble: 7 x 2
# stringid substring
# <int> <chr>
# 1 1 Anna & x
# 2 2 Alb
# 3 2 Berta 222
# 4 3 Al Pacino
# 5 4 Abb cd xy
# 6 4 C123
# 7 4 B

Just add lapply(..., str_trim) to your code:
library(stringr)
lapply(str_extract_all(x, "([^,]+)(?!.*\\1)"), str_trim)
[[1]]
[1] "Anna & x"
[[2]]
[1] "Berta 222" "Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"

Related

readr::parse_number with leading zero

I would like to parse numbers that have a leading zero.
I tried readr::parse_number, however, it omits the leading zero.
library(readr)
parse_number("thankyouverymuch02")
#> [1] 2
Created on 2022-12-30 with reprex v2.0.2
The desired output would be 02

The simplest and most naive would be:
gsub("\\D", "", "thankyouverymuch02")
[1] "02"
The regex special "\\d" matches a single 0-9 character only; the inverse is "\\D" which matches a single character that is anything except 0-9.
If you have strings with multiple patches of numbers and you want to keep them distinct, neither parse_number nor this simple gsub is going to work:
vec <- c("thankyouverymuch02", "thank03youverymuch02")
gsub("\\D", "", vec)
# [1] "02" "0302"
For that, the result must always be a list, since we don't necessarily know a priori how many elements have 0, 1 or more number groups.
regmatches(vec, gregexpr("\\d+", vec))
# [[1]]
# [1] "02"
# [[2]]
# [1] "03" "02"
#### equivalently
stringr::str_extract_all(vec, "\\d+")
# [[1]]
# [1] "02"
# [[2]]
# [1] "03" "02"
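A leading zero can only survive in a character value, since a number carries no padding; that is why the approaches above return strings. If the digits have already been converted to a number, a fixed width can be restored when formatting. A minimal sketch, assuming a width of two digits is wanted (the width is an assumption made here, not something stated in the question):

```r
digits <- gsub("\\D", "", "thankyouverymuch02")
digits
# [1] "02"

# Converting to a number discards the padding
n <- as.integer(digits)
n
# [1] 2

# Re-pad to two digits when turning it back into text
padded <- sprintf("%02d", n)
padded
# [1] "02"
```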

How to flag missing left-hand collocates with NA

I want to compute collocates of the lemma GO, including all its forms such as go, goes, gone, etc.:
go <- c("go after it", "here we go", "he went bust", "go get it go", "i 'm gon na go", "she 's going berserk")
The lemma forms are stored in this vector:
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
and this vector turns them into an alternation pattern:
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")
However, when using the pattern with str_extract_all to extract the immediately left-hand collocate of GO, the extraction misses out on those strings where GO is the first word in the string and reoccurs later in the string:
library(stringr)
str_extract_all(go, paste0("'?\\b[a-z']+\\b(?=\\s?", pattern_GO, ")"))
[[1]]
character(0)
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
The expected result is this:
[[1]]
[1] NA
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] NA "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
How can the extraction be mended to also return NA in the absence of a left-hand collocate?

You can add an alternative so that the pattern matches either your consuming pattern or the start of the string:
str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
See the regex demo.
See the R demo:
go <- c("go after it", "here we go", "he went bust", "go get it go", "i 'm gon na go", "she 's going berserk")
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")
library(stringr)
str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
Output:
[[1]]
[1] ""
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] "" "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
If you want, you can turn all empty items into NA using
res <- str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
res <- lapply(res, function(x) ifelse(x=="", NA, x))
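One detail of the ifelse step may matter downstream: ifelse takes its shape and type from the test, so an element consisting only of "" comes back as a logical NA instead of a character NA. The sketch below shows an alternative that keeps every element a character vector. It assumes res looks like the output shown above (hard-coded here so it runs without stringr); blank_to_na is a hypothetical helper name.

```r
# Stand-in for the str_extract_all() result shown above
res <- list("", "we", "he", c("", "it"), c("'m", "na"), "'s")

# Replace "" by NA in place, so the vector stays of type character
blank_to_na <- function(v) {
  v[v == ""] <- NA_character_
  v
}
res_na <- lapply(res, blank_to_na)
res_na[[4]]
# [1] NA   "it"

# For comparison: ifelse() yields a logical NA for an all-blank element
class(ifelse("" == "", NA, ""))
# [1] "logical"
class(blank_to_na(""))
# [1] "character"
```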

Handling zero length character vectors as empty strings

By way of example, see the extraction of Twitter handles below. The target is a character vector that parallels tweets but contains only the handles, separated by commas. str_extract_all yields empty vectors when no matches are found, and that threw some unexpected errors further down the track.
library(purrr)
library(stringr)
tweets <- c(
"",
"This tweet has no handles",
"This is a tweet for #you",
"This is another tweet for #you and #me",
"This, #bla, is another tweet for #me and #you"
)
mention_rx <- "#\\w+"
This was my first attempt:
map_chr(tweets, ~str_c(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))
#> Error: Result 1 must be a single string, not a character vector of length 0
Then I played around with things:
mentions <- map(tweets, ~str_c(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))
mentions
#> [[1]]
#> character(0)
#>
#> [[2]]
#> character(0)
#>
#> [[3]]
#> [1] "#you"
#>
#> [[4]]
#> [1] "#you, #me"
#>
#> [[5]]
#> [1] "#bla, #me, #you"
as.character(mentions)
#> [1] "character(0)" "character(0)" "#you" "#you, #me"
#> [5] "#bla, #me, #you"
Until it dawned on me that paste could also be used here:
map_chr(tweets, ~paste(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))
#> "" "" "#you" "#you, #me" "#bla, #me, #you"
My questions are:
Is there a more elegant way of getting there?
Why doesn't str_c behave the same as paste with an identical collapse argument?
Why don't as.character and map_chr recognise a character vector of length zero as equivalent to an empty string but paste does?
I found some good references on str(i)_c, paste, and the difference between them; but none of these addressed the situation with empty strings.

You don't need to map over tweets; str_extract_all is vectorised:
library(stringr)
str_extract_all(tweets, mention_rx)
#[[1]]
#character(0)
#[[2]]
#character(0)
#[[3]]
#[1] "#you"
#[[4]]
#[1] "#you" "#me"
#[[5]]
#[1] "#bla" "#me" "#you"
Now, if you need one comma-separated string per tweet, you can use map_chr with toString:
purrr::map_chr(str_extract_all(tweets, mention_rx), toString)
#[1] "" "" "#you" "#you, #me" "#bla, #me, #you"
To answer the "why" questions, we can look at the documentation of paste and str_c functions.
From ?paste
Vector arguments are recycled as needed, with zero-length arguments being recycled to "".
From ?str_c
Zero length arguments are removed.
Hence, by default str_c removes zero-length arguments, so the output is a zero-length character vector rather than an empty string. That fails for map_chr, which requires exactly one string per element, but it works with map, because map returns a list:
map(tweets, ~str_c(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))
#[[1]]
#character(0)
#[[2]]
#character(0)
#[[3]]
#[1] "#you"
#[[4]]
#[1] "#you, #me"
#[[5]]
#[1] "#bla, #me, #you"
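The distinction between a zero-length character vector and an empty string can be checked directly in base R. The following is only a sketch: it uses regmatches/gregexpr and vapply in place of the stringr and purrr calls above so that it runs without packages; found and mentions are names chosen for this example.

```r
tweets <- c("",
            "This tweet has no handles",
            "This is a tweet for #you",
            "This is another tweet for #you and #me",
            "This, #bla, is another tweet for #me and #you")

# A zero-length character vector and an empty string are different objects
length(character(0))  # 0: no elements at all
length("")            # 1: one element that happens to be empty

# paste() with collapse always returns exactly one string
paste(character(0), collapse = ", ")
# [1] ""

# One string per tweet; toString() gives "" when nothing matched
found <- regmatches(tweets, gregexpr("#\\w+", tweets))
mentions <- vapply(found, toString, character(1))
mentions
# "" "" "#you" "#you, #me" "#bla, #me, #you"
```

vapply with character(1) plays the same role as map_chr here: it insists on exactly one string per element, which toString always delivers.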

How to split a string in R with a regular expression when parts of the regular expression are to be kept in the resulting substrings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!

One option would be str_split
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"
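The lookahead is what keeps the first character of the second piece: (?=[a-z0-9]) only inspects the next character without consuming it, so the split removes just the ", ". Base R's strsplit has no counterpart to n = 2, so a rough base R sketch has to locate the first match and cut there. split_first is a hypothetical helper written for this illustration, not a function from any package.

```r
x <- c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff")

# Split a single string at the first match of `pattern` only
split_first <- function(s, pattern) {
  m <- regexpr(pattern, s, perl = TRUE)   # perl = TRUE enables the lookahead
  start <- as.integer(m)
  if (start == -1L) return(s)             # no split point: keep the string whole
  len <- attr(m, "match.length")          # 2 here: the lookahead adds no width
  c(substr(s, 1L, start - 1L),
    substr(s, start + len, nchar(s)))
}

pieces <- lapply(x, split_first, pattern = ", (?=[a-z0-9])")
pieces[[3]]
# [1] "ABC, DEF" "2 stems"
pieces[[4]]
# [1] "DE"                        "other comments, and stuff"
```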

R substring based on Regular Expression

I have strings like:
myString = "2 word1 & 4 word2"
myString = "4 word2"
myString = "2 word1"
I would like to get the number before the word1 and the number before word2
number1 = 2
number2 = 4
How can I do this with a regular expression in R?
I tried something like this, but it only gets the first number:
gsub("([0-9]+).*", "\\1", myString)

You may extract specific number before a specific string using a regex with a lookahead:
> word1_res <- str_extract_all(myString, "\\d+(?=\\s*word1)")
> word1_res
[[1]]
[1] "2"
[[2]]
character(0)
[[3]]
[1] "2"
The results for word2 can be retrieved similarly:
word2_res <- str_extract_all(myString, "\\d+(?=\\s*word2)")
Details
\d+ - 1 or more digits...
(?=\s*word2) - if immediately followed with:
  \s* - 0+ whitespaces
  word2 - a literal word2 substring.
A base R equivalent is
regmatches(myString, gregexpr("\\d+(?=\\s*word1)", myString, perl=TRUE))
regmatches(myString, gregexpr("\\d+(?=\\s*word2)", myString, perl=TRUE))
A sub almost equivalent solution would be
> sub(".*?(\\d+)\\s*word1.*|.*","\\1",myString)
[1] "2" "" "2"
> sub(".*?(\\d+)\\s*word2.*|.*","\\1",myString)
[1] "4" "4" ""
Note that this implies there is only one result per string, while str_extract_all will get all occurrences from the string.
To extract any chunk of 1+ digits as a whole word, use a stringr solution with str_extract_all:
library(stringr)
str_extract_all(myString, "\\b\\d+\\b")
or a base R one with regmatches/gregexpr:
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
regmatches(myString, gregexpr("\\b\\d+\\b", myString))
See an online R demo. Output:
[[1]]
[1] "2" "4"
[[2]]
[1] "4"
[[3]]
[1] "2"
Details
\b - a word boundary
\d+ - 1 or more digits
\b - a word boundary.
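Since the question asks for number1 and number2 as numbers, the lookahead approach can be wrapped so that it returns one numeric value per string and NA where the word is absent. This is a sketch under assumptions: number_before is a hypothetical helper, it keeps only the first match in each string, and word is pasted into the regex as-is, so it is assumed to contain no regex metacharacters.

```r
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")

# Hypothetical helper: the number immediately before `word`, or NA if absent
number_before <- function(strings, word) {
  pattern <- paste0("\\d+(?=\\s*", word, ")")
  m <- regexpr(pattern, strings, perl = TRUE)  # first match per string, -1 if none
  hit <- as.integer(m) != -1L
  out <- rep(NA_character_, length(strings))
  out[hit] <- regmatches(strings, m)           # regmatches() skips non-matches
  as.numeric(out)
}

number_before(myString, "word1")
# [1]  2 NA  2
number_before(myString, "word2")
# [1]  4  4 NA
```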

Try:
myString = "2 word1 & 4 word2"
number1 = gsub("([0-9]+).*", "\\1", myString)
myString = "4 word2"
number2 = gsub("([0-9]+).*", "\\1", myString)
myString = "2 word1"
number3 = gsub("([0-9]+).*", "\\1", myString)
print(number1)
print(number2)
print(number3)
If you assign a string to myString three times in a row, myString will only contain the last one.

This removes each occurrence of a letter or ampersand possibly followed by other non-space characters and then scans in what is left. The scan also converts them to numeric. No packages are used.
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
lapply(myString, function(x) scan(text = gsub("[[:alpha:]&]\\S*", "", x), quiet = TRUE))
giving:
[[1]]
[1] 2 4
[[2]]
[1] 4
[[3]]
[1] 2
