split vector of strings with partial match - r

If I have a list of some elements:
x = c('abc', 'bbc', 'cd', 'hj', 'aa', 'zz', 'd9', 'jk')
I'd like to split it every time there's an 'a' to create a nested list:
[1][[1]] 'abc', 'bbc', 'cd', 'hj'
[2][[1]] 'aa', 'zz', 'd9', 'jk'
I tried
split(x, 'a')
but split doesn't look for partial matches.

We can build a grouping vector in two steps: grepl flags the elements that contain the substring 'a' as a logical vector, and cumsum turns those flags into a numeric group id that increases at every match. That id is then passed to split:
split(x, cumsum(grepl('a', x)))
#$`1`
#[1] "abc" "bbc" "cd" "hj"
#$`2`
#[1] "aa" "zz" "d9" "jk"
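To make the grouping step easier to follow, here is a minimal sketch that prints each intermediate value. The names starts, groups and res are only illustrative and do not come from the answer; fixed = TRUE is an added assumption that 'a' is meant as a literal character rather than a regular expression.

```r
x <- c('abc', 'bbc', 'cd', 'hj', 'aa', 'zz', 'd9', 'jk')

# Step 1: TRUE wherever the element contains the literal character "a"
starts <- grepl('a', x, fixed = TRUE)

# Step 2: the running count of TRUE values gives one group id per element
groups <- cumsum(starts)

# Step 3: split the vector by that group id
res <- split(x, groups)
print(starts)
print(groups)
print(res)
```

Note that any elements before the first match would end up in a group named "0", because cumsum is still 0 at that point.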

Another base R solution using split + findInterval (the code is not as short as the answer by @akrun):
split(x,findInterval(seq_along(x),grep("a",x)))
such that
> split(x,findInterval(seq_along(x),grep("a",x)))
$`1`
[1] "abc" "bbc" "cd" "hj"
$`2`
[1] "aa" "zz" "d9" "jk"
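The findInterval call may be easier to read when its two inputs are looked at separately. This is only a sketch; breaks and idx are hypothetical names introduced here.

```r
x <- c('abc', 'bbc', 'cd', 'hj', 'aa', 'zz', 'd9', 'jk')

# Positions of the elements that contain "a"; these act as interval starts
breaks <- grep("a", x)

# For each position 1..length(x), count how many interval starts are <= it
idx <- findInterval(seq_along(x), breaks)

print(breaks)
print(idx)
print(split(x, idx))
```

Here findInterval plays the same role as cumsum(grepl(...)): it assigns every position to the most recent match at or before it.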

Another base R possibility could be:
split(x, cumsum(nchar(sub("a", "", x, fixed = TRUE)) - nchar(x) != 0))
$`1`
[1] "abc" "bbc" "cd" "hj"
$`2`
[1] "aa" "zz" "d9" "jk"

Related

How can I apply str_split to a tibble with one character column?

Suppose I have a tibble with one character column, and I want to transform it into the target shown below using the str_split function. Passing the tibble directly does not work. Any suggestions?
> as_tibble(c("sdsd/ffg","fdfd/rrrr/rrr","dfd/ww/rrr/ww"))%>%str_split("/")
[[1]]
[1] "c(\"sdsd" "ffg\", \"fdfd" "rrrr" "rrr\", \"dfd" "ww"
[6] "rrr" "ww\")"
Warning message:
In stri_split_regex(string, pattern, n = n, simplify = simplify, :
argument is not an atomic vector; coercing
> target <- str_split(c("sdsd/ffg","fdfd/rrrr/rrr","dfd/ww/rrr/ww"),"/")
> target
[[1]]
[1] "sdsd" "ffg"
[[2]]
[1] "fdfd" "rrrr" "rrr"
[[3]]
[1] "dfd" "ww" "rrr" "ww"
You could pull the column to get it as a vector and then apply str_split
tibble::as_tibble(c("sdsd/ffg","fdfd/rrrr/rrr","dfd/ww/rrr/ww")) %>%
dplyr::pull(value) %>%
stringr::str_split("/")
#[[1]]
#[1] "sdsd" "ffg"
#[[2]]
#[1] "fdfd" "rrrr" "rrr"
#[[3]]
#[1] "dfd" "ww" "rrr" "ww"
We can also call str_split inside mutate, which turns value into a list column; pull(value) then extracts that list
library(dplyr)
library(stringr)
as_tibble(c("sdsd/ffg","fdfd/rrrr/rrr","dfd/ww/rrr/ww")) %>%
mutate(value = str_split(value, "/")) %>%
pull(value)
#[[1]]
#[1] "sdsd" "ffg"
#[[2]]
#[1] "fdfd" "rrrr" "rrr"
#[[3]]
#[1] "dfd" "ww" "rrr" "ww"
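Both answers come down to handing the splitting function a character vector rather than the whole table. As a hedged illustration of the same idea without any packages, the sketch below uses a plain data frame and base R's strsplit in place of a tibble and str_split; df and parts are hypothetical names.

```r
# A plain data frame standing in for the one-column tibble
df <- data.frame(value = c("sdsd/ffg", "fdfd/rrrr/rrr", "dfd/ww/rrr/ww"),
                 stringsAsFactors = FALSE)

# Extract the column as a character vector first, then split each element
parts <- strsplit(df$value, "/", fixed = TRUE)

print(parts)
print(lengths(parts))  # number of pieces per original string
```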

Split a character to letters and numbers

I have a single character string in which each letter is followed by a number, for instance: A1B10C5
I would like to split it into letter <- c("A", "B", "C") and number <- c(1, 10, 5) using R.
We can use regex lookarounds to split between the letters and numbers
v1 <- strsplit(str1, "(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", perl = TRUE)[[1]]
v1[c(TRUE, FALSE)]
#[1] "A" "B" "C"
as.numeric(v1[c(FALSE, TRUE)])
#[1] 1 10 5
data
str1 <- "A1B10C5"
str_extract_all is another way to do this:
library(stringr)
> str <- "A1B10C5"
> str
[1] "A1B10C5"
> str_extract_all(str, "[0-9]+")
[[1]]
[1] "1" "10" "5"
> str_extract_all(str, "[a-zA-Z]+")
[[1]]
[1] "A" "B" "C"
To extract letters and numbers at same time, you can use str_match_all to get letters and numbers in two separate columns:
library(stringr)
str_match_all("A1B10C5", "([a-zA-Z]+)([0-9]+)")[[1]][,-1]
# [,1] [,2]
#[1,] "A" "1"
#[2,] "B" "10"
#[3,] "C" "5"
You can also use the base R regmatches with gregexpr:
this <- "A1B10C5"
regmatches(this, gregexpr("[0-9]+", this))
[[1]]
[1] "1" "10" "5"
regmatches(this, gregexpr("[A-Z]+", this))
[[1]]
[1] "A" "B" "C"
These return lists with a single element, a character vector. As akrun does, you can extract the list item using [[1]] and can also convert the vector of digits to numeric like this:
as.numeric(regmatches(this, gregexpr("[0-9]+", this))[[1]])
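Putting the two base R extractions together, a possible sketch that pairs each letter with its number is shown below. The names letters_found, numbers_found and pairs are illustrative only, and the sketch assumes the string strictly alternates letters and digits as in the question.

```r
this <- "A1B10C5"

# Runs of letters and runs of digits, each taken from the single list element
letters_found <- regmatches(this, gregexpr("[A-Za-z]+", this))[[1]]
numbers_found <- as.numeric(regmatches(this, gregexpr("[0-9]+", this))[[1]])

# Pair them up row by row; this relies on both vectors having equal length
pairs <- data.frame(letter = letters_found, number = numbers_found,
                    stringsAsFactors = FALSE)
print(pairs)
```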

Split a list whose elements are multiple element lists

Say I have a list a which is defined as:
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")
I want to split this list by semicolon ;, get only unique values, and return another list. So far I have split the list using str_split():
a <- str_split(a, ";")
which gives me
> a
[[1]]
[1] "aaa" "bbb"
[[2]]
[1] "aaa"
[[3]]
[1] "bbb"
[[4]]
[1] "aaa" "ccc"
How can I manipulate this list (using unique()?) to give me something like
[[1]]
[1] "aaa"
[[2]]
[1] "bbb"
[[3]]
[1] "ccc"
or more simply,
[[1]]
[1] "aaa" "bbb" "ccc"
One option is to use list() with unique() and unlist() inside your list.
# So first you use your code
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")
# Load required library
library(stringr) # load str_split
a <- str_split(a, ";")
# Finally use list() with unique() and unlist()
list(unique(unlist(a)))
# And the output
[[1]]
[1] "aaa" "bbb" "ccc"
One alternative in base R is to use rapply which applies a function to each of the inner most elements in a nested list and returns the most simplified object possible by default. Here, it returns a vector of characters.
unique(rapply(a, strsplit, split=";"))
[1] "aaa" "bbb" "ccc"
To return a list, wrap the output in list
list(unique(rapply(a, strsplit, split=";")))
[[1]]
[1] "aaa" "bbb" "ccc"
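For completeness, the same result can be sketched with strsplit alone, which also makes it easy to produce the first layout asked for in the question (one value per list element). pieces and uniq are hypothetical names; the sketch assumes every list element is a single string.

```r
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")

# Flatten the list to a character vector, then split every string on ";"
pieces <- strsplit(unlist(a), ";", fixed = TRUE)

# Flatten again and keep the first occurrence of each value
uniq <- unique(unlist(pieces))

print(as.list(uniq))  # one value per list element
print(list(uniq))     # all values in a single list element
```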

all strings of length k that can be formed from a set of n characters

This question has been asked for other languages but I'm looking for the most idiomatic way to find all strings of length k that can be formed from a set of n characters in R
Example input and output:
input <- c('a', 'b')
output <- c('aa', 'ab', 'ba', 'bb')
A little more complicated than I'd like. I think outer() only works for n=2. combn doesn't include repeats.
allcomb <- function(input = c('a', 'b'), n = 2) {
  args <- rep(list(input), n)
  gr <- do.call(expand.grid, args)
  return(do.call(paste0, gr))
}
Thanks to @thelatemail for improvements ...
allcomb(n=4)
## [1] "aaaa" "baaa" "abaa" "bbaa" "aaba" "baba" "abba"
## [8] "bbba" "aaab" "baab" "abab" "bbab" "aabb" "babb"
## [15] "abbb" "bbbb"
Adapting AK88's answer, outer can be used for arbitrary values of k, although it's not necessarily the most efficient solution:
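input <- c('a', 'b')
k <- 5
perms <- input
# Each pass appends one more character to every string built so far
for (i in seq_len(k - 1)) {
  perms <- outer(perms, input, paste, sep = "")
}
result <- as.vector(perms)
# For k = 2 a single call to outer is enough:
m <- outer(input, input, paste, sep = "")
output <- as.vector(m)
## "aa" "ba" "ab" "bb"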
I'm not proud of how this looks, but it works...
allcombs <- function(x, k) {
  apply(expand.grid(split(t(replicate(k, x)), seq_len(k))), 1, paste, collapse = "")
}
allcombs(letters[1:2], 2)
#> [1] "aa" "ba" "ab" "bb"
allcombs(letters[1:2], 4)
#> [1] "aaaa" "baaa" "abaa" "bbaa" "aaba" "baba" "abba" "bbba" "aaab" "baab"
#> [11] "abab" "bbab" "aabb" "babb" "abbb" "bbbb"
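All of the answers above return the strings with the first character varying fastest ("aa" "ba" "ab" "bb"), whereas the question lists them as 'aa', 'ab', 'ba', 'bb'. If that order matters, one possible variation is to reverse the expand.grid columns before pasting. allcomb_sorted is a hypothetical name, not part of any answer.

```r
allcomb_sorted <- function(input, k) {
  # One copy of the character set per position in the string
  grid <- expand.grid(rep(list(input), k), stringsAsFactors = FALSE)
  # expand.grid varies the first column fastest; reversing the column
  # order makes the last position of the string vary fastest instead
  grid <- grid[rev(seq_along(grid))]
  do.call(paste0, unname(as.list(grid)))
}

print(allcomb_sorted(c('a', 'b'), 2))
print(length(allcomb_sorted(c('a', 'b', 'c'), 3)))  # n^k strings in total
```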

How do I paste string columns in data.frame [duplicate]

Suppose we have:
mydf <- data.frame(a = LETTERS, b = LETTERS, c = LETTERS)
Now we want to add a new column containing the concatenation of all columns, so that its rows read "AAA", "BBB", ...
I expected the following to work:
mydf[,"Concat"] <- apply(mydf, 1, paste0)
In addition to @akrun's answer, here is a short explanation of why your code didn't work.
apply passes each row to paste0 as a single character vector, and paste and paste0 leave the elements of a single vector separate:
> paste0(c("A","A","A"))
[1] "A" "A" "A"
Indeed, to concatenate a vector, you need to use argument collapse:
> paste0(c("A","A","A"), collapse="")
[1] "AAA"
Consequently, your code should have been:
> apply(mydf, 1, paste0, collapse="")
[1] "AAA" "BBB" "CCC" "DDD" "EEE" "FFF" "GGG" "HHH" "III" "JJJ" "KKK" "LLL" "MMM" "NNN" "OOO" "PPP" "QQQ" "RRR" "SSS" "TTT" "UUU" "VVV"
[23] "WWW" "XXX" "YYY" "ZZZ"
We can use do.call with paste0 for faster execution
mydf[, "Concat"] <- do.call(paste0, mydf)
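To see concretely what the original attempt produces, the sketch below inspects the shape of the apply result and then applies the do.call approach. The variable m is hypothetical, and stringsAsFactors = FALSE is added here only so the columns are plain character vectors on every R version.

```r
mydf <- data.frame(a = LETTERS, b = LETTERS, c = LETTERS,
                   stringsAsFactors = FALSE)

# Without collapse, each row comes back as three separate strings,
# so apply returns a 3 x 26 matrix rather than 26 concatenated strings
m <- apply(mydf, 1, paste0)
print(dim(m))

# do.call passes the columns as separate arguments, so paste0 is
# vectorised across rows and returns one string per row
mydf$Concat <- do.call(paste0, mydf[c("a", "b", "c")])
print(head(mydf, 3))
```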
