By way of example, see the extraction of Twitter handles below. The goal is a character vector that parallels tweets but contains only the handles, separated by commas. str_extract_all yields empty vectors when no matches are found, and that threw some unexpected errors further down the track.
library(purrr)
library(stringr)
tweets <- c(
"",
"This tweet has no handles",
"This is a tweet for #you",
"This is another tweet for #you and #me",
"This, #bla, is another tweet for #me and #you"
)
mention_rx <- "#\\w+"
This was my first attempt:
map_chr(tweets, ~str_c(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))
#> Error: Result 1 must be a single string, not a character vector of length 0
Then I played around with things:
mentions <- map(tweets, ~str_c(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))
mentions
#> [[1]]
#> character(0)
#>
#> [[2]]
#> character(0)
#>
#> [[3]]
#> [1] "#you"
#>
#> [[4]]
#> [1] "#you, #me"
#>
#> [[5]]
#> [1] "#bla, #me, #you"
as.character(mentions)
#> [1] "character(0)" "character(0)" "#you" "#you, #me"
#> [5] "#bla, #me, #you"
Until it dawned on me that paste could also be used here:
map_chr(tweets, ~paste(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))
#> [1] "" "" "#you" "#you, #me" "#bla, #me, #you"
My questions are:
Is there a more elegant way of getting there?
Why doesn't str_c behave the same as paste with an identical collapse argument?
Why don't as.character and map_chr recognise a character vector of length zero as equivalent to an empty string but paste does?
I found some good references on str(i)_c, paste, and the difference between them; but none of these addressed the situation with empty strings.
You don't need to map over tweets; str_extract_all is vectorised over its input:
library(stringr)
str_extract_all(tweets, mention_rx)
#[[1]]
#character(0)
#[[2]]
#character(0)
#[[3]]
#[1] "#you"
#[[4]]
#[1] "#you" "#me"
#[[5]]
#[1] "#bla" "#me" "#you"
If you need one comma-separated string per tweet, apply toString to each element with map_chr:
purrr::map_chr(str_extract_all(tweets, mention_rx), toString)
#[1] "" "" "#you" "#you, #me" "#bla, #me, #you"
To answer the "why" questions, we can look at the documentation of paste and str_c functions.
From ?paste
Vector arguments are recycled as needed, with zero-length arguments being recycled to "".
From ?str_c
Zero length arguments are removed.
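Hence, by default str_c drops zero-length arguments, so the result is a zero-length character vector (character(0)) rather than an empty string. That fails in map_chr, which requires each result to be a single string, but it works with map because map returns a list: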
map(tweets, ~str_c(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))
#[[1]]
#character(0)
#[[2]]
#character(0)
#[[3]]
#[1] "#you"
#[[4]]
#[1] "#you, #me"
#[[5]]
#[1] "#bla, #me, #you"
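As a quick check of the paste behaviour quoted above, here is a minimal base R sketch (no stringr or purrr needed). It shows that paste() and toString() collapse a zero-length vector to a single empty string, and it sketches a vapply() variant of the extraction. The names `empty`, `matches` and `mentions` are only illustrative.

```r
tweets <- c(
  "",
  "This tweet has no handles",
  "This is a tweet for #you",
  "This is another tweet for #you and #me",
  "This, #bla, is another tweet for #me and #you"
)
mention_rx <- "#\\w+"

# A zero-length input still collapses to exactly one (empty) string
empty <- character(0)
length(paste(empty, collapse = ", "))  # 1
identical(toString(empty), "")         # TRUE

# Base R extraction: one list element per tweet, character(0) when nothing matches
matches <- regmatches(tweets, gregexpr(mention_rx, tweets))

# vapply() enforces "exactly one string per element", much like map_chr()
mentions <- vapply(matches, toString, character(1))
mentions
```

Because toString() is paste(x, collapse = ", ") under the hood, the empty matches come out as "" instead of raising an error.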
I would like to parse numbers that have a leading zero.
I tried readr::parse_number, however, it omits the leading zero.
library(readr)
parse_number("thankyouverymuch02")
#> [1] 2
Created on 2022-12-30 with reprex v2.0.2
The desired output would be 02
The simplest and most naive would be:
gsub("\\D", "", "thankyouverymuch02")
[1] "02"
The regex special "\\d" matches a single 0-9 character only; the inverse is "\\D" which matches a single character that is anything except 0-9.
If you have strings with multiple runs of digits and you want to keep them distinct, neither parse_number nor this simple gsub is going to work:
vec <- c("thankyouverymuch02", "thank03youverymuch02")
gsub("\\D", "", vec)
# [1] "02" "0302"
For that, the result must always be a list (since we don't necessarily know a priori how many elements have 0, 1 or more number groups).
regmatches(vec, gregexpr("\\d+", vec))
# [[1]]
# [1] "02"
# [[2]]
# [1] "03" "02"
#### equivalently
stringr::str_extract_all(vec, "\\d+")
# [[1]]
# [1] "02"
# [[2]]
# [1] "03" "02"
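The underlying reason parse_number drops the zero is that a numeric value has no notion of leading zeros; only a string can carry them. A small sketch of that point, assuming a fixed width of two digits is wanted if the value ever has to pass through a numeric type:

```r
# Leading zeros exist only in the character representation
digits <- gsub("\\D", "", "thankyouverymuch02")
digits               # "02"
as.numeric(digits)   # 2

# Re-pad after numeric work; the width of 2 is an assumption for this example
padded <- sprintf("%02d", as.integer(digits))
padded               # "02"
```

If the zeros matter (IDs, codes, and so on), keeping the value as character throughout is usually the simpler choice.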
I'm struggling to strip down comma-separated strings to unique substrings in a clean fashion:
x <- c("Anna & x, Anna & x",
"Alb, Berta 222, Alb",
"Al Pacino",
"Abb cd xy, Abb cd xy, C123, C123, B")
I seem to be doing fine with this combination of negated character class, negative lookahead and backreference; however, what bothers me is that many of the extracted substrings carry unwanted leading whitespace:
library(stringr)
str_extract_all(x, "([^,]+)(?!.*\\1)")
[[1]]
[1] " Anna & x"
[[2]]
[1] " Berta 222" " Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] " Abb cd xy" " C123" " B"
How can the pattern be refined so that no unwanted whitespace gets extracted?
Desired result:
#> [[1]]
#> [1] "Anna & x"
#> [[2]]
#> [1] "Alb" "Berta 222"
#> [[3]]
#> [1] "Al Pacino"
#> [[4]]
#> [1] "Abb cd xy" "C123" "B"
EDIT:
Just wanted to share this solution with double negative lookahead, which also works well (and thanks for the many useful solutions proposed!)
str_extract_all(x, "((?!\\s)[^,]+)(?!.*\\1)")
You can use str_split to get the individual substrings, followed by unique to remove repeated strings. For example:
library(tidyverse)
str_split(x, ", ?") %>% map(unique)
[[1]]
[1] "Anna & x"
[[2]]
[1] "Alb" "Berta 222"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"
If you want the output as a single vector of unique strings, you could do:
unique(unlist(str_split(x, ", ?")))
[1] "Anna & x" "Alb" "Berta 222" "Al Pacino" "Abb cd xy" "C123" "B"
In the code above we used the regex ", ?" to split at a comma or a comma followed by a space so that we don't end up with a leading space. For future reference, if you do need to get rid of leading or trailing whitespace, you can use str_trim. For example, if we had used "," in str_split we could do the following:
str_split(x, ",") %>%
map(str_trim) %>%
map(unique)
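For comparison, the same split, trim and deduplicate idea can be sketched in base R alone. This assumes items are separated by a comma plus optional whitespace; `parts` and `uniq` are illustrative names.

```r
x <- c("Anna & x, Anna & x",
       "Alb, Berta 222, Alb",
       "Al Pacino",
       "Abb cd xy, Abb cd xy, C123, C123, B")

# Split on a comma followed by any amount of whitespace
parts <- strsplit(x, ",\\s*")

# trimws() guards against stray whitespace at the ends;
# unique() keeps the first occurrence of each item
uniq <- lapply(parts, function(p) unique(trimws(p)))
uniq
```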
Change your pattern to the one below:
str_extract_all(x, "(\\b[^,]+)(?!.*\\1)")
[[1]]
[1] "Anna & x"
[[2]]
[1] "Berta 222" "Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"
You need to start matching from a char other than a whitespace and a comma, then optionally match any zero or more chars other than a comma up to a char other than whitespace and a comma:
str_extract_all(x, "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)")
See the regex demo and an R demo online. Mind that if your strings contain line breaks, you need to prepend the pattern with (?s): str_extract_all(x, "(?s)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)").
If you need to make it case insensitive (e.g. Abb cd xy and ABB cD Xy are considered duplicates), add the i flag: str_extract_all(x, "(?i)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)") (or str_extract_all(x, "(?si)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)") if the DOTALL behavior is needed).
Details:
([^\s,](?:[^,]*[^\s,])?) - Group 1:
    [^\s,] - a char other than whitespace and a comma
    (?:[^,]*[^\s,])? - an optional sequence of
        [^,]* - zero or more chars other than a comma
        [^\s,] - a char other than whitespace and a comma
(?!.*\1) - a negative lookahead that fails the match if there are zero or more chars, as many as possible, followed with the Group 1 value.
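One caveat worth checking against your own data: the lookahead tests whether the Group 1 text occurs anywhere later in the string, not whether it occurs as a whole item. The sketch below uses a made-up input ("Al, Alb", which is not in the question) and base R's PCRE engine via perl = TRUE to compare the regex with a split-and-deduplicate approach.

```r
x <- c("Al, Alb", "Alb, Berta 222, Alb")
pat <- "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)"

# Lookahead approach (PCRE via perl = TRUE)
by_regex <- regmatches(x, gregexpr(pat, x, perl = TRUE))

# Split-and-deduplicate approach, which compares whole items
by_split <- lapply(strsplit(x, ",\\s*"), unique)

by_regex[[1]]  # "Alb": "Al" is dropped because "Al" also occurs inside "Alb"
by_split[[1]]  # "Al" "Alb"
```

So if one item can be a substring of a later item, splitting and calling unique() is the safer route.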
Not exactly what you asked for, but the NLP frameworks can be helpful when the problems get more complex.
library(tidytext)
library(dplyr)
library(tibble)
tibble(text = x) %>%
rowid_to_column("stringid") %>%
unnest_regex(substring, text, pattern = ",", to_lower = FALSE) %>%
distinct(stringid, substring = trimws(substring))
# # A tibble: 7 x 2
# stringid substring
# <int> <chr>
# 1 1 Anna & x
# 2 2 Alb
# 3 2 Berta 222
# 4 3 Al Pacino
# 5 4 Abb cd xy
# 6 4 C123
# 7 4 B
Just add lapply(..., str_trim) to your code:
library(stringr)
lapply(str_extract_all(x, "([^,]+)(?!.*\\1)"), str_trim)
[[1]]
[1] "Anna & x"
[[2]]
[1] "Berta 222" "Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"
I'm working on speaking turns in conversation. My interest is in the words that get repeated from a prior turn to a next turn:
turnsX <- data.frame(
speaker = c("A","B","A","B"),
speech = c("let's have a look",
"yeah let's take a look",
"yeah okay so where to start",
"let's start here"), stringsAsFactors = FALSE
)
I want to extract the repeated word forms. To this end I've run a for loop, iteratively defining each speech turn as a regex pattern for the next speech turn and str_extracting the words that get repeated from turn to turn:
library(stringr)
pattern <- c()
extracted <- c()
for(i in 1:nrow(turnsX)){
pattern[i] <- paste0(unlist(str_split(turnsX$speech[i], " ")), collapse = "|")
extracted[i+1] <- str_extract_all(turnsX$speech[i+1], pattern[i])
}
The result however is partly incorrect:
extracted
[[1]]
NULL
[[2]]
[1] "a" "let's" "a" "a" "look"
[[3]]
[1] "yeah" "a" "a"
[[4]]
[1] "start"
[[5]]
[1] NA
The correct result should be:
extracted
[[1]]
NULL
[[2]]
[1] "let's" "a" "look"
[[3]]
[1] "yeah"
[[4]]
[1] "start"
Where's the mistake? How can the code be mended, or what other approach is there, to get the correct result?
Maybe you can use Map and %in%.
x <- strsplit(turnsX$speech, " ")
Map(function(y,z) y[y %in% z], x[-length(x)], x[-1])
#[[1]]
#[1] "let's" "a" "look"
#
#[[2]]
#[1] "yeah"
#
#[[3]]
#[1] "start"
Here's a base R approach using Map:
tmp <- strsplit(turnsX$speech, ' ')
c(NA, Map(intersect, tmp[-1], tmp[-length(tmp)]))
#[[1]]
#[1] NA
#[[2]]
#[1] "let's" "a" "look"
#[[3]]
#[1] "yeah"
#[[4]]
#[1] "start"
You need word boundaries ("\\b"), so that short words such as "a" are not matched inside longer words like "yeah" or "take":
library(stringr)
pattern <- c()
extracted <- c()
for(i in 2:nrow(turnsX)){
pattern[i - 1] <- paste0(unlist(str_split(turnsX$speech[i - 1], " ")), collapse = "|\\b")
extracted[i] <- str_extract_all(turnsX$speech[i], pattern[i - 1])
}
# [[1]]
# NULL
#
# [[2]]
# [1] "let's" "a" "look"
#
# [[3]]
# [1] "yeah"
#
# [[4]]
# [1] "start"
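Note that joining with "|\\b" only puts a boundary in front of each word except the first, and none behind. That is enough for the sample data, but a hedged sketch with made-up turns (`prev_turn` and `next_turn` are not from the question) shows how wrapping the whole alternation in boundaries behaves differently. The words are not regex-escaped here, which is an assumption that holds only for plain words.

```r
prev_turn <- "take a look"
next_turn <- "an apple looks good"
words <- strsplit(prev_turn, " ", fixed = TRUE)[[1]]

# Boundary only in front of words (and not the first one), as in the loop above
leading_only <- paste0(words, collapse = "|\\b")

# Boundary on both sides of the whole alternation
both_sides <- paste0("\\b(?:", paste(words, collapse = "|"), ")\\b")

hits_leading <- regmatches(next_turn, gregexpr(leading_only, next_turn, perl = TRUE))[[1]]
hits_both    <- regmatches(next_turn, gregexpr(both_sides, next_turn, perl = TRUE))[[1]]

hits_leading  # "a" "a" "look": partial hits inside "an", "apple" and "looks"
hits_both     # character(0): no whole word is repeated
```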
I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
One option would be str_split
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"
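The n = 2 argument matters here: without it, the fourth string would also be split at ", and". Base R's strsplit has no such argument, so a base R version needs a different route. One possible sketch locates only the first qualifying comma with regexpr and cuts the string there; `split_first` is a hypothetical helper name.

```r
x <- c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff")

# Position of the first ", " followed by a lowercase letter or digit (-1 if none)
pos <- as.integer(regexpr(", (?=[a-z0-9])", x, perl = TRUE))

split_first <- function(s, p) {
  if (p < 0) return(s)            # no split point: keep the string whole
  c(substr(s, 1, p - 1),          # text before the comma
    substr(s, p + 2, nchar(s)))   # text after the ", "
}

# unname() drops the names that Map() derives from the character input
result <- unname(Map(split_first, x, pos))
result
```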
I would like to count the number of underscores and split the string into two different strings at the middle underscore.
strings <- c('aa_bb_cc_dd_ee_ff', 'cc_hh_ff_zz', 'bb_dd')
Desired Output:
First Last
"aa_bb_cc" "dd_ee_ff"
"cc_hh" "ff_zz"
"bb" "dd"
Here's a kludgy solution that assumes that there is always an odd number of underscores.
# Load libraries
library(stringr)
# Define function
even_split <- function(s){
# Split string
tmp <- str_split(s, "_")
lapply(tmp, function(x){
# Patch string back together in two pieces
c(paste(x[1:(length(x)/2)], collapse = "_"),
paste(x[(1+length(x)/2):length(x)], collapse = "_"))
})
}
# Example
strings <- c('aa_bb_cc_dd_ee_ff', 'cc_hh_ff_zz', 'bb_dd')
# Test function
even_split(strings)
#> [[1]]
#> [1] "aa_bb_cc" "dd_ee_ff"
#>
#> [[2]]
#> [1] "cc_hh" "ff_zz"
#>
#> [[3]]
#> [1] "bb" "dd"
Created on 2019-01-18 by the reprex package (v0.2.1)
Adapting nhahtdh's answer here, all you need to do is add a step that counts the underscores (done here with str_count) and halves the count with integer division, which gives the number of underscores to skip before the middle one.
library(stringr)
strsplit(
strings,
paste0("^[^_]*(?:_[^_]*){", str_count(strings, '_') %/% 2, "}\\K_"),
perl = TRUE)
# [[1]]
# [1] "aa_bb_cc" "dd_ee_ff"
#
# [[2]]
# [1] "cc_hh" "ff_zz"
#
# [[3]]
# [1] "bb" "dd"
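The same idea, count the underscores and then cut at the middle one, can also be sketched without building a regex per string. This is a base R sketch; `split_middle` is a hypothetical helper, and for an even number of underscores it simply picks the lower of the two middle candidates, which is an assumption rather than something the question specifies.

```r
strings <- c("aa_bb_cc_dd_ee_ff", "cc_hh_ff_zz", "bb_dd")

split_middle <- function(s) {
  pos <- gregexpr("_", s, fixed = TRUE)[[1]]     # positions of all underscores
  if (pos[1] == -1) return(c(s, NA_character_))  # no underscore: nothing to split
  mid <- pos[(length(pos) + 1) %/% 2]            # position of the middle underscore
  c(substr(s, 1, mid - 1), substr(s, mid + 1, nchar(s)))
}

# Number of underscores per string
n_underscores <- lengths(regmatches(strings, gregexpr("_", strings, fixed = TRUE)))
n_underscores  # 5 3 1

halves <- lapply(strings, split_middle)
halves
```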
This assumes an odd number of underscores, and 99 or fewer.
library(stringr)
library(strex)
strings <- c('aa_bb_cc_dd_ee_ff', 'cc_hh_ff_zz', 'bb_dd')
splitMiddleUnderscore <- function(x){
nUnderscore <- str_count(x, '_')
middleUnderscore <- match(nUnderscore, seq(1, 99, 2))
str1 <- str_before_nth(x, '_', middleUnderscore)
str2 <- str_after_nth(x, '_', middleUnderscore)
c(str1, str2)
}
lapply(strings, splitMiddleUnderscore)
#[[1]]
#[1] "aa_bb_cc" "dd_ee_ff"
#[[2]]
#[1] "cc_hh" "ff_zz"
#[[3]]
#[1] "bb" "dd"