Combination of two lists with partial string matching (in R)

I am trying to find all the combinations of two lists. However, the second list is essentially a repetition of the first list's variables with added brackets etc., as shown below.
other_cols <- c("C", "D", "E", "F")
other_colsRnd <- c("(1|C)", "(1|D)", "(1|E)", "(1|F)")
# I have some code to do combinations from one list:
combos = do.call(c, lapply(seq_along(other_cols), function(y) {
arrangements::combinations(other_cols, y, layout = "l")
}))
theBigList = sapply(combos, paste, collapse = " + ")
> theBigList
[1] "C" "D" "E" "F" "C + D" "C + E" "C + F" "D + E" "D + F"
[10] "E + F" "C + D + E" "C + D + F" "C + E + F" "D + E + F" "C + D + E + F"
I would like theBigList to hold the full list of combinations drawn from both vectors together, without any combination that contains both C and (1|C).
########
edit
C or D etc. are shorthand versions of the "real" variables, which look more like:
other_cols <- c("Charlie", "Delta", "Echo", "Foxtrot")
other_colsRnd <- c("(1|Charlie)", "(1|Delta)", "(1|Echo)", "(1|Foxtrot)")
########
The expected outcome is something like this, though stored order will not be important.
theBigList
"C" "(1|C)" "D" "(1|D)" "E" "(1|E)" "F" "(1|F)" "C + D"
"C + (1|D)" "C + E" "C + (1|E)" "C + F" "C + (1|F)"
"D + E" "D + (1|E)" "D + F" "D + (1|F)"
"E + F" "E + (1|F)"
"C + D + E" "(1|C) + D + E" "(1|C) + (1|D) + E" "(1|C) + (1|D) + (1|E)" etc.
Is there a way to put the lapply inside the lapply?
Or, I am currently thinking I can build combosRnd, e.g.
combosRnd = do.call(c, lapply(seq_along(other_cols), function(y) {
arrangements::combinations(other_colsRnd, y, layout = "l")
}))
and then take inspiration from here using var_comb <- expand.grid(combos, combosRnd) with some sort of if and grep to detect the "same" variables, that I haven't worked out yet.
edit
I think I can add the combos, e.g. something like
theBigList = sapply(combos, paste, collapse = " + ")
theBigListRnd = sapply(combosRnd, paste, collapse = " + ")
comboBigList = c(theBigList, theBigListRnd)
var_comb <- expand.grid(combos, combosRnd)
var_comb2 <- expand.grid(theBigList, theBigListRnd)
... so comboBigList has all the ones where there is no crossover whatsoever, and then I can remove any "lines" in either var_comb or var_comb2 that have anything matching in the var columns.
Yes, this is a smaller easier chunk of my previously asked question here, however I have refined it to the bare necessity for me to get this infernal analysis done, as it seems that I may have been biting off more than I can chew on that one. I will brute force the nestings I need with this as a supplement (hopefully).

Why not combine other_cols and other_colsRnd and use the same code that you already have?
combine_vec <- c(other_cols, other_colsRnd)
combos <- do.call(c, lapply(seq_along(combine_vec), function(y) {
arrangements::combinations(combine_vec, y, layout = "l")
}))
theBigList = sapply(combos, paste, collapse = " + ")
theBigList
# [1] "C"
# [2] "D"
# [3] "E"
# [4] "F"
# [5] "(1|C)"
# [6] "(1|D)"
# [7] "(1|E)"
# [8] "(1|F)"
# [9] "C + D"
# [10] "C + E"
# [11] "C + F"
# [12] "C + (1|C)"
#...
#...
From this theBigList you can drop the variable + (1|variable) combination using the following code.
library(stringr)
finalList <- theBigList[!mapply(function(x, y) any(x %in% y) || any(y %in% x),
str_extract_all(theBigList, '\\b[A-Z](?!\\))'),
str_extract_all(theBigList, '(?<=1\\|)[A-Z]'))]
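The two regular expressions above match single capital letters, so they fit the shorthand names C to F; the longer names from the question's edit (Charlie, Delta, ...) would likely need different patterns. As an alternative that avoids pattern matching altogether, here is a minimal base R sketch (not part of the original answer) that treats every variable as being in one of three states: absent, fixed, or random. The names states, grid and terms are made up for illustration.

```r
other_cols <- c("Charlie", "Delta", "Echo", "Foxtrot")

# Each variable is either absent (NA), fixed ("Charlie") or random ("(1|Charlie)").
states <- lapply(other_cols, function(v) c(NA, v, sprintf("(1|%s)", v)))

# One row per way of choosing a state for every variable.
grid <- expand.grid(states, stringsAsFactors = FALSE)

# Paste the non-missing terms of each row; the all-absent row becomes "".
terms <- apply(grid, 1, function(r) paste(r[!is.na(r)], collapse = " + "))

# Drop the empty combination.
finalList <- terms[nzchar(terms)]
length(finalList)
```

Because every variable contributes at most one term, a variable and its (1|variable) form can never appear in the same combination, and with four variables the result has 3^4 - 1 = 80 entries.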

Related

Split string into multiple two-word strings

I have a very long string (~1000 words) and I would like to split it into two-word phrases.
I have this:
string <- "A B C D E F"
and I would like this:
"A B"
"B C"
"C D"
"D E"
"E F"
The long string has already been cleaned and stemmed, and stop-words have been removed.
I tried to use str_split, but (I think) this needs a separator, which here is complicated because I don't want to separate A from B only "A B" from "C D", and "B C" from "D E", etc.
Split on space, then paste with shift:
s <- unlist(strsplit(string, " ", fixed = TRUE))
sl <- length(s)
paste(s[1:(sl-1)], s[2:sl])
# [1] "A B" "B C" "C D" "D E" "E F"
tmp <- strsplit(string, " ")[[1]]
tmp
# [1] "A" "B" "C" "D" "E" "F"
sapply(seq_along(tmp)[-1], function(z) paste(tmp[z-1:0], collapse = " "))
# [1] "A B" "B C" "C D" "D E" "E F"
If you already use some text mining package (as cleaned, stemmed and removed stop-words would suggest), there's most likely something to generate n-grams (and not just bigrams). For example quanteda::tokens_ngrams() or tidytext::unnest_ngrams():
string <- "A B C D E F"
quanteda::tokens_ngrams(quanteda::tokens(string), concatenator = " ")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "A B" "B C" "C D" "D E" "E F"
data.frame(s = string) |>
tidytext::unnest_ngrams(input = "s", output = "bigrams", n = 2)
#> bigrams
#> 1 a b
#> 2 b c
#> 3 c d
#> 4 d e
#> 5 e f
Created on 2023-01-31 with reprex v2.0.2
An option would be to use a regex with look ahead.
string <- "A B C D E F"
. <- gregexpr("\\S+\\s+(?=(\\S+))", string, perl=TRUE)[[1]]
attr(.,"match.length") <- attr(.,"match.length") + attr(., "capture.length")
regmatches(string, list(.))[[1]]
#[1] "A B" "B C" "C D" "D E" "E F"

combine strings with combn lists in R

I want to loop over combinations created by combn().
Input:
"a" "b" "c" "d"
Desired Output:
[1] "a" "b" "c" "d"
[1] "a and b" "a and c" "a and d" "b and c" "b and d" "c and d"
[1] "a and b and c" "a and b and d" "a and c and d" "b and c and d"
[1] "a and b and c and d"
What I tried:
classes <- letters[1:4]
cl <- lapply(1:length(classes), combn, x = classes)
apply(cl[[1]], 2, paste, collapse = " and ")
apply(cl[[2]], 2, paste, collapse = " and ")
apply(cl[[3]], 2, paste, collapse = " and ")
apply(cl[[4]], 2, paste, collapse = " and ")
Basically, my question is: what is the best way to loop over the last part, apply(cl[[NR]], 2, paste, collapse = " and ")?
I thought about lapply, but then I would assign FUN twice, and it seems odd to combine lapply and apply in one call. A for loop is possible, but maybe there is a more efficient way.
If the question is better suited for Code Review, I am happy to migrate it.
You can iterate over the length of your vector and use the function argument of combn() to collapse the output using paste():
vec <- letters[1:4]
lapply(seq_along(vec), function(x) combn(vec, x, FUN = paste, collapse = " and "))
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "a and b" "a and c" "a and d" "b and c" "b and d" "c and d"
[[3]]
[1] "a and b and c" "a and b and d" "a and c and d" "b and c and d"
[[4]]
[1] "a and b and c and d"

How to disambiguate repeated strings by appending varying length strings?

I saw the clever code submitted by Gabor G. in response to this question about disambiguation of strings. His answer, slightly modified, is:
uniqName <- function(x){
thenames <- ave(x,x,FUN = function(z){
znam <- if (length(z) == 1) z else sprintf("%s%02d", z, seq_along(z))
return(znam)
})
return(thenames)
}
I wanted to go for an "invisible" version of that, and tried to come up with a compact function that would append N spaces to the (N+1)th occurrence of a name.
(Gabor's code calculates an integer and appends that, so the number of characters appended is constant). The best I could do was the following clunky function ("fatit")
spacify <- function (x){
fatit <-function(x){
k = vector(length=length(x))
for(jp in 1:length(x)){
k[jp]=sprintf('%s%s',x[jp],paste0(rep(' ',jp),collapse=''))
}
return(k)
}
spaceOut <- ave(x,x, FUN = function(z) if (length(z) == 1) z else fatit(z) )
return(spaceOut)
}
Is there some cleaner, more compact way to set the number of characters to append based on length(z) in the fatit function?
Note:
uniqName(foo)
[1] "a01" "b01" "c01" "a02" "b02" "a03" "c02" "d" "e"
spacify(foo)
[1] "a " "b " "c " "a  " "b  " "a   " "c  " "d" "e"
We can take advantage of make.unique by stripping out the numbers it appends to make the elements unique, and using them (... + 1) as the number of characters to append, i.e.
i1 <- as.numeric(gsub('\\D+', '', make.unique(x)))
i1[is.na(i1)] <- 0 #because where there is no number it returns NA
paste0(x, sapply(i1 + 1, function(i) paste(rep(' ', each = i), collapse = '')))
#[1] "a " "b " "c " "a  " "b  " "a   " "c  " "d " "e "
We can take advantage of the stri_pad_right function from stringi:
library(stringi)
f <- function(x){
ave(x, x, FUN = function(z){
if(length(z) == 1) z else stri_pad_right(z, nchar(z[1]) + seq_along(z))
})
}
x <- c('a', 'b', 'c', 'a', 'b', 'a', 'c', 'd', 'e')
f(x)
# [1] "a " "b " "c " "a  " "b  " "a   " "c  " "d" "e"
Using stringr::str_pad(..., side = 'right') is conceptually similar.
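For completeness, a base R variant is also possible. The sketch below is not from the answers above; it assumes R 3.3.0 or later for strrep(), and the function name spacify2 is made up for illustration. It keeps the ave() grouping from the question and replaces the inner loop with a vectorised strrep() call.

```r
spacify2 <- function(x) {
  ave(x, x, FUN = function(z) {
    # Unique names stay untouched; otherwise the k-th occurrence gets k spaces.
    if (length(z) == 1) z else paste0(z, strrep(" ", seq_along(z)))
  })
}

x <- c("a", "b", "c", "a", "b", "a", "c", "d", "e")
res <- spacify2(x)
nchar(res)
# [1] 2 2 2 3 3 4 3 1 1
```

Printing nchar(res) rather than res itself makes the otherwise invisible padding easy to check.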

Concatenating groups of vector character elements

I don't know the proper technical terms for this kind of operation, so it has been difficult to search for existing solutions. I thought I would try to post my own question and hopefully someone can help me out (or point me in the right direction).
I have a vector of characters and I want to collect them in groups of twos and threes. To illustrate, here is a simplified version:
The table I have:
"a"
"b"
"c"
"d"
"e"
"f"
I want to run through the vector and concatenate groups of two and three elements. This is the end result I want:
"a b"
"b c"
"c d"
"d e"
"e f"
And
"a b c"
"b c d"
"c d e"
"d e f"
I solved this the simplest and dirtiest way possible by using for-loops, but it takes a long time to run and I am convinced it can be done more efficiently.
Here is my ghetto-hack:
t1 <- c("a", "b", "c", "d", "e", "f")
t2 <- rep("", length(t1)-1)
# seq_len() avoids a precedence trap: 1:length(t1)-1 parses as (1:length(t1))-1 and starts at 0
for (i in seq_len(length(t1)-1)) {
t2[i] = paste(t1[i], t1[i+1])
}
t3 <- rep("", length(t1)-2)
for (i in seq_len(length(t1)-2)) {
t3[i] = paste(t1[i], t1[i+1], t1[i+2])
}
I was looking into sapply and tapply etc. but I can't seem to figure out how to use "the following element" in the vector.
Any help will be rewarded with my eternal gratitude!
-------------- Edit --------------
Run times of the suggestions using input data with ~ 3 million rows:
START: [1] "2016-11-20 19:24:50 CET"
For-loop: [1] "2016-11-20 19:28:26 CET"
rollapply: [1] "2016-11-20 19:38:55 CET"
apply(matrix): [1] "2016-11-20 19:42:15 CET"
paste t1[-length...]: [1] "2016-11-20 19:42:37 CET"
grep: [1] "2016-11-20 19:44:30 CET"
Have you considered the zoo package? For example
library('zoo')
input<-c('a','b','c','d','e','f')
output<-rollapply(data=input, width=2, FUN=paste, collapse=" ")
output
will return
"a b" "b c" "c d" "d e" "e f"
The width argument controls how many elements to concatenate. I expect you'll see improved runtimes here too, but I haven't tested this.
For groups of two, we can do this with
paste(t1[-length(t1)], t1[-1])
#[1] "a b" "b c" "c d" "d e" "e f"
and for higher numbers, one option is shift from data.table
library(data.table)
v1 <- do.call(paste, shift(t1, 0:2, type="lead"))
grep("NA", v1, invert=TRUE, value=TRUE)
#[1] "a b c" "b c d" "c d e" "d e f"
Or
n <- length(t1)
n1 <- 3
apply(matrix(t1, ncol=n1, nrow = n+1)[seq(n-(n1-1)),], 1, paste, collapse=' ')
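The shifted-paste idea used for groups of two extends to any group size without extra packages. The following is a minimal sketch, not taken from the answers above; the helper name ngrams and its arguments are made up for illustration.

```r
ngrams <- function(x, n) {
  stopifnot(n >= 1)
  # Fewer than n elements: there is nothing to group.
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  # Build n shifted copies of x, then paste them position by position.
  shifted <- lapply(seq_len(n) - 1, function(k) x[starts + k])
  do.call(paste, shifted)
}

t1 <- c("a", "b", "c", "d", "e", "f")
ngrams(t1, 2)
# [1] "a b" "b c" "c d" "d e" "e f"
ngrams(t1, 3)
# [1] "a b c" "b c d" "c d e" "d e f"
```

Since paste() is vectorised, this makes n passes over the vector rather than one function call per group, and unlike the grep("NA", ...) filter it cannot accidentally drop elements that happen to contain the text "NA".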

Pasting two strings using paste function and its collapse argument

I am trying to paste two vectors
vector_1 <- c("a", "b")
vector_2 <- c("x", "y")
paste(vector_1, vector_2, collapse = " + ")
The output I get is
"a + b x + y "
My desired output is
"a + b + x + y"
paste with more than one argument pastes its arguments together term by term:
> paste(c("a","b","c"),c("A","B","C"))
[1] "a A" "b B" "c C"
The result has the length of the longest vector, with shorter vectors recycled. That enables things like this to work:
> paste("A",c("1","2","BBB"))
[1] "A 1" "A 2" "A BBB"
> paste(c("1","2","BBB"),"A")
[1] "1 A" "2 A" "BBB A"
Then sep is used between the terms within each element, and collapse is used to join the elements:
> paste(c("a","b","c"),c("A","B","C"))
[1] "a A" "b B" "c C"
> paste(c("a","b","c"),c("A","B","C"),sep="+")
[1] "a+A" "b+B" "c+C"
> paste(c("a","b","c"),c("A","B","C"),sep="+",collapse="#")
[1] "a+A#b+B#c+C"
Note that once you use collapse you get a single result rather than three.
You seem to not want to combine your two vectors element-wise, so you need to turn them into one vector, which you can do with c(), giving us the solution:
> c(vector_1, vector_2)
[1] "a" "b" "x" "y"
> paste(c(vector_1, vector_2), collapse=" + ")
[1] "a + b + x + y"
Note that sep isn't needed - you are just collapsing the individual elements into one string.
