Split string into multiple two-word strings


I have a very long string (~1000 words) and I would like to split it into two-word phrases.
I have this:
string <- "A B C D E F"
and I would like this:
"A B"
"B C"
"C D"
"D E"
"E F"
The long string has already been cleaned and stemmed, and stop-words have been removed.
I tried to use str_split, but (I think) it needs a separator, which is awkward here: I don't want to separate A from B, only "A B" from "C D", "B C" from "D E", and so on.

Split on space, then paste with shift:
s <- unlist(strsplit(string, " ", fixed = TRUE))
sl <- length(s)
paste(s[1:(sl-1)], s[2:sl])
# [1] "A B" "B C" "C D" "D E" "E F"
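The same idea can be wrapped in a small function that also copes with inputs of fewer than two words, where 1:(sl-1) would count backwards. This is only a sketch: the name bigrams is made up for illustration, and it assumes words are separated by runs of whitespace.

```r
# Hypothetical helper: overlapping two-word phrases from one string.
bigrams <- function(string) {
  # Trim the ends and split on any run of whitespace
  words <- strsplit(trimws(string), "\\s+")[[1]]
  # Fewer than two words means there is nothing to pair up
  if (length(words) < 2) return(character(0))
  # Pair every word except the last with every word except the first
  paste(head(words, -1), tail(words, -1))
}

bigrams("A B C D E F")
# [1] "A B" "B C" "C D" "D E" "E F"
bigrams("A")
# character(0)
```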

tmp <- strsplit(string, " ")[[1]]
tmp
# [1] "A" "B" "C" "D" "E" "F"
sapply(seq_along(tmp)[-1], function(z) paste(tmp[z-1:0], collapse = " "))
# [1] "A B" "B C" "C D" "D E" "E F"

If you already use a text mining package (as "cleaned, stemmed and stop-words removed" would suggest), it most likely has something to generate n-grams (and not just bigrams). For example quanteda::tokens_ngrams() or tidytext::unnest_ngrams():
string <- "A B C D E F"
quanteda::tokens_ngrams(quanteda::tokens(string), concatenator = " ")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "A B" "B C" "C D" "D E" "E F"
data.frame(s = string) |>
tidytext::unnest_ngrams(input = "s", output = "bigrams", n = 2)
#> bigrams
#> 1 a b
#> 2 b c
#> 3 c d
#> 4 d e
#> 5 e f
Created on 2023-01-31 with reprex v2.0.2

An option would be to use a regex with look ahead.
string <- "A B C D E F"
. <- gregexpr("\\S+\\s+(?=(\\S+))", string, perl=TRUE)[[1]]
attr(.,"match.length") <- attr(.,"match.length") + attr(., "capture.length")
regmatches(string, list(.))[[1]]
#[1] "A B" "B C" "C D" "D E" "E F"

Related

Combination of two lists with partial string matching (in R)

I am trying to find all the combinations of two lists; however, the second list is essentially a repetition of the first list's variables with added brackets etc., as shown below.
other_cols <- c("C", "D", "E", "F")
other_colsRnd <- c("(1|C)", "(1|D)", "(1|E)", "(1|F)")
# I have some code to do combinations from one list:
combos = do.call(c, lapply(seq_along(other_cols), function(y) {
arrangements::combinations(other_cols, y, layout = "l")
}))
theBigList = sapply(combos, paste, collapse = " + ")
> theBigList
[1] "C" "D" "E" "F" "C + D" "C + E" "C + F" "D + E" "D + F"
[10] "E + F" "C + D + E" "C + D + F" "C + E + F" "D + E + F" "C + D + E + F"
I would like the full list of combinations in theBigList of both of them combined, without any repetition of C and (1|C)
########
edit
C or D etc. are shorthand versions of the "real" variables, which look more like:
other_cols <- c("Charlie", "Delta", "Echo", "Foxtrot")
other_colsRnd <- c("(1|Charlie)", "(1|Delta)", "(1|Echo)", "(1|Foxtrot)")
########
The expected outcome is something like this, though stored order will not be important.
theBigList
"C" "(1|C)" "D" "(1|D)" "E" "(1|E)" "F" "(1|F)" "C + D"
"C + (1|D)" "C + E" "C + (1|E)" "C + F" "C + (1|F)"
"D + E" "D + (1|E)" "D + F" "D + (1|F)"
"E + F" "E + (1|F)"
"C + D + E" "(1|C) + D + E" "(1|C) + (1|D) + E" "(1|C) + (1|D) + (1|E)" etc.
Is there a way to put the lapply inside the lapply?
Or, I am currently thinking I can build combosRnd, e.g.
combosRnd = do.call(c, lapply(seq_along(other_cols), function(y) {
arrangements::combinations(other_colsRnd, y, layout = "l")
}))
and then take inspiration from here using var_comb <- expand.grid(combos, combosRnd) with some sort of if and grep to detect the "same" variables, that I haven't worked out yet.
edit
I think I can combine the combos, e.g. something like
theBigList = sapply(combos, paste, collapse = " + ")
theBigListRnd = sapply(combosRnd, paste, collapse = " + ")
comboBigList = c(theBigList, theBigListRnd)
var_comb <- expand.grid(combos, combosRnd)
var_comb2 <- expand.grid(theBigList, theBigListRnd)
... so comboBigList has all the ones where there is no crossover whatsoever, and then I can remove any rows in either var_comb or var_comb2 whose two columns contain a matching variable.
Yes, this is a smaller easier chunk of my previously asked question here, however I have refined it to the bare necessity for me to get this infernal analysis done, as it seems that I may have been biting off more than I can chew on that one. I will brute force the nestings I need with this as a supplement (hopefully).

Why not combine other_cols and other_colsRnd and use the same code that you already have?
combine_vec <- c(other_cols, other_colsRnd)
combos <- do.call(c, lapply(seq_along(combine_vec), function(y) {
arrangements::combinations(combine_vec, y, layout = "l")
}))
theBigList = sapply(combos, paste, collapse = " + ")
theBigList
# [1] "C"
# [2] "D"
# [3] "E"
# [4] "F"
# [5] "(1|C)"
# [6] "(1|D)"
# [7] "(1|E)"
# [8] "(1|F)"
# [9] "C + D"
# [10] "C + E"
# [11] "C + F"
# [12] "C + (1|C)"
#...
#...
From this theBigList you can drop the variable + (1|variable) combination using the following code.
library(stringr)
finalList <- theBigList[!mapply(function(x, y) any(x %in% y) || any(y %in% x),
str_extract_all(theBigList, '\\b[A-Z](?!\\))'),
str_extract_all(theBigList, '(?<=1\\|)[A-Z]'))]
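The regex above relies on single capital letters, so it would need adjusting for the longer "real" names from the question's edit. A possible alternative is to filter on the underlying variable before pasting, which avoids pattern matching altogether. The following is a sketch under assumptions, not the answer's method: it swaps in base combn() for arrangements::combinations(), and the names terms, base_var and keep are made up for illustration.

```r
other_cols <- c("Charlie", "Delta", "Echo", "Foxtrot")
other_colsRnd <- paste0("(1|", other_cols, ")")

# All candidate terms, plus the underlying variable each term refers to
terms <- c(other_cols, other_colsRnd)
base_var <- c(other_cols, other_cols)

# Index combinations of every size; more terms than variables
# would necessarily repeat a variable, so stop at length(other_cols)
combos <- unlist(lapply(seq_along(other_cols), function(k) {
  combn(seq_along(terms), k, simplify = FALSE)
}), recursive = FALSE)

# Keep only combinations in which no underlying variable appears twice
keep <- vapply(combos, function(idx) !anyDuplicated(base_var[idx]), logical(1))

finalList <- vapply(combos[keep], function(idx) {
  paste(terms[idx], collapse = " + ")
}, character(1))

length(finalList)
# [1] 80
```

Each variable is either absent, present as a fixed term, or present as a (1|...) term, so four variables give 3^4 - 1 = 80 non-empty combinations.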

R all permutations per vector entry

I have a vector
x <- c("a b c", "d e")
with split entries
str_split(x, " ")
I want to get all permutations of each split vector entry, so the result should be
c("a b c", "b c a", "c a b", "a c b", "b a c", "c b a", "d e", "e d")
I tried to use the function
permutations(n, r, v=1:n, set=TRUE, repeats.allowed=FALSE)

After the str_split step, you can use combinat::permn to create all possible permutations of each string and paste them together.
result <- unlist(sapply(strsplit(x, " "), function(x)
combinat::permn(x, paste0, collapse = " ")))
result
#[1] "a b c" "a c b" "c a b" "c b a" "b c a" "b a c" "d e" "e d"

You can try pracma::perms like below
unlist(
Map(
function(v) do.call(paste, as.data.frame(pracma::perms(v))),
strsplit(x, " ")
)
)
which gives
[1] "c b a" "c a b" "b c a" "b a c" "a b c" "a c b" "e d" "d e"
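If neither package is installed, the permutations can also be generated with a short recursive function in base R. This is a sketch only: permute is a made-up helper name, it is not tuned for long inputs (the number of permutations grows factorially), and repeated words in an entry would produce repeated permutations.

```r
# Hypothetical helper: all orderings of a vector, returned as a list
permute <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    # Fix element i in front, then permute the remaining elements
    for (p in permute(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

x <- c("a b c", "d e")
result <- unlist(lapply(strsplit(x, " "), function(words) {
  vapply(permute(words), paste, character(1), collapse = " ")
}))
result
# [1] "a b c" "a c b" "b a c" "b c a" "c a b" "c b a" "d e"   "e d"
```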

combine strings with combn lists in R

I want to loop over combinations created by combn().
Input:
"a" "b" "c" "d"
Desired Output:
[1] "a" "b" "c" "d"
[1] "a and b" "a and c" "a and d" "b and c" "b and d" "c and d"
[1] "a and b and c" "a and b and d" "a and c and d" "b and c and d"
[1] "a and b and c and d"
What I tried:
classes <- letters[1:4]
cl <- lapply(1:length(classes), combn, x = classes)
apply(cl[[1]], 2, paste, collapse = " and ")
apply(cl[[2]], 2, paste, collapse = " and ")
apply(cl[[3]], 2, paste, collapse = " and ")
apply(cl[[4]], 2, paste, collapse = " and ")
Basically, my question is: what is the best way to loop over the last part, apply(cl[[NR]], 2, paste, collapse = " and ")?
I thought about lapply, but then I would assign FUN twice, and it seems odd to combine lapply and apply in one call. A for loop is possible, but maybe there is a more efficient way.
If the question is better suited for Code Review, I am happy to migrate it.

You can iterate over the length of your vector and use the function argument of combn() to collapse the output using paste():
vec <- letters[1:4]
lapply(seq_along(vec), function(x) combn(vec, x, FUN = paste, collapse = " and "))
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "a and b" "a and c" "a and d" "b and c" "b and d" "c and d"
[[3]]
[1] "a and b and c" "a and b and d" "a and c and d" "b and c and d"
[[4]]
[1] "a and b and c and d"
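If a single flat vector is more convenient than one list element per size, the result can be passed through unlist(). A minimal sketch, where by_size and all_combos are illustrative names:

```r
vec <- letters[1:4]

# One character vector per combination size, as in the answer above
by_size <- lapply(seq_along(vec), function(k) {
  combn(vec, k, FUN = paste, collapse = " and ")
})

# Flatten into a single vector; sizes stay in increasing order
all_combos <- unlist(by_size)
length(all_combos)
# [1] 15
```

The length is 2^4 - 1 = 15, that is, every non-empty subset of the four letters.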

Split string using space and capital letter

I'm trying to split my string into multiple rows. String looks like this:
x <- c("C 10.1 C 12.4","C 12", "C 45.5 C 10")
Code snippet:
strsplit(x, "//s")[[3]]
Result:
"C 45.5 C 10"
Expected Output: Split string into multiple rows like this:
"C 10.1"
"C 12.4"
"C 12"
"C 45.5"
"C 10"
The question is how to split the string?
Clue: there is a space and then character which is "C" in our case. Anyone who knows how to do it?

You may use
unlist(strsplit(x, "(?<=\\d)\\s+(?=C)", perl=TRUE))
Output:
[1] "C 10.1" "C 12.4" "C 12" "C 45.5" "C 10"
See the online R demo and a regex demo.
The (?<=\\d)\\s+(?=C) regex matches 1 or more whitespace characters (\\s+) that are immediately preceded with a digit ((?<=\\d)) and that are immediately followed with C.
If C can be any uppercase ASCII letter, replace C with [A-Z].
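As a sketch of that [A-Z] variant, assuming a made-up input in which the leading letter differs between entries:

```r
# Illustrative input: same shape as the question's data, but with varying letters
x <- c("C 10.1 C 12.4", "C 12", "D 45.5 E 10")

# Split on whitespace that follows a digit and precedes an uppercase ASCII letter
parts <- unlist(strsplit(x, "(?<=\\d)\\s+(?=[A-Z])", perl = TRUE))
parts
# [1] "C 10.1" "C 12.4" "C 12"   "D 45.5" "E 10"
```

The space between a letter and its number is left alone because it is preceded by a letter, not a digit.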

A somewhat more complicated expression, but easier on the regex side:
unlist(
sapply(
strsplit(x, " ?C"),
function(x) {
paste0("C", x[nzchar(x)])
}
)
)
"C 10.1" "C 12.4" "C 12" "C 45.5" "C 10"

Concatenating groups of vector character elements

I don't know the proper technical terms for this kind of operation, so it has been difficult to search for existing solutions. I thought I would try to post my own question and hopefully someone can help me out (or point me in the right direction).
I have a vector of characters and I want to collect them in groups of twos and threes. To illustrate, here is a simplified version:
The table I have:
"a"
"b"
"c"
"d"
"e"
"f"
I want to run through the vector and concatenate groups of two and three elements. This is the end result I want:
"a b"
"b c"
"c d"
"d e"
"e f"
And
"a b c"
"b c d"
"c d e"
"d e f"
I solved this the simplest and dirtiest way possible by using for-loops, but it takes a long time to run and I am convinced it can be done more efficiently.
Here is my ghetto-hack:
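t1 <- c("a", "b", "c", "d", "e", "f")
t2 <- rep("", length(t1)-1)
# seq_len() avoids the precedence trap: 1:length(t1)-1 evaluates as (1:length(t1))-1, i.e. it starts at 0
for (i in seq_len(length(t1)-1)) {
t2[i] = paste(t1[i], t1[i+1])
}
t3 <- rep("", length(t1)-2)
for (i in seq_len(length(t1)-2)) {
t3[i] = paste(t1[i], t1[i+1], t1[i+2])
}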
I was looking into sapply and tapply etc. but I can't seem to figure out how to use "the following element" in the vector.
Any help will be rewarded with my eternal gratitude!
-------------- Edit --------------
Run times of the suggestions using input data with ~ 3 million rows:
START: [1] "2016-11-20 19:24:50 CET"
For-loop: [1] "2016-11-20 19:28:26 CET"
rollapply: [1] "2016-11-20 19:38:55 CET"
apply(matrix): [1] "2016-11-20 19:42:15 CET"
paste t1[-length...]: [1] "2016-11-20 19:42:37 CET"
grep: [1] "2016-11-20 19:44:30 CET"

Have you considered the zoo package? For example
library('zoo')
input<-c('a','b','c','d','e','f')
output<-rollapply(data=input, width=2, FUN=paste, collapse=" ")
output
will return
"a b" "b c" "c d" "d e" "e f"
The width argument controls how many elements to concatenate. I expect you'll see improved runtimes here too, but I haven't tested it.

For groups of two, we can do this with
paste(t1[-length(t1)], t1[-1])
#[1] "a b" "b c" "c d" "d e" "e f"
and for higher numbers, one option is shift from data.table
library(data.table)
v1 <- do.call(paste, shift(t1, 0:2, type="lead"))
grep("NA", v1, invert=TRUE, value=TRUE)
#[1] "a b c" "b c d" "c d e" "d e f"
Or
n <- length(t1)
n1 <- 3
apply(matrix(t1, ncol=n1, nrow = n+1)[seq(n-(n1-1)),], 1, paste, collapse=' ')
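The paste-with-shift idea also generalises to any group size in base R by building one shifted copy of the vector per position and pasting them element-wise. This is a sketch under assumptions: ngrams is a made-up helper name, and unlike the grep("NA", ...) filter it does not discard elements that genuinely contain the text "NA".

```r
# Hypothetical helper: overlapping groups of n consecutive elements
ngrams <- function(words, n) {
  len <- length(words)
  # No complete group fits if the vector is shorter than n
  if (n < 1 || len < n) return(character(0))
  # Copy k holds the k-th element of every group
  shifted <- lapply(seq_len(n), function(k) words[k:(len - n + k)])
  # paste() is vectorised, so the copies are joined element-wise
  do.call(paste, shifted)
}

t1 <- c("a", "b", "c", "d", "e", "f")
ngrams(t1, 2)
# [1] "a b" "b c" "c d" "d e" "e f"
ngrams(t1, 3)
# [1] "a b c" "b c d" "c d e" "d e f"
```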
