I'm trying to split my strings into multiple rows. The input looks like this:
x <- c("C 10.1 C 12.4","C 12", "C 45.5 C 10")
Code snippet:
strsplit(x, "//s")[[3]]
Result:
"C 45.5 C 10"
Expected Output: Split string into multiple rows like this:
"C 10.1"
"C 12.4"
"C 12"
"C 45.5"
"C 10"
How can I split the string this way?
Clue: each split point is a space followed by a letter, which is "C" in our case. Does anyone know how to do this?
You may use
unlist(strsplit(x, "(?<=\\d)\\s+(?=C)", perl=TRUE))
Output:
[1] "C 10.1" "C 12.4" "C 12" "C 45.5" "C 10"
The (?<=\\d)\\s+(?=C) regex matches one or more whitespace characters (\\s+) that are immediately preceded by a digit ((?<=\\d)) and immediately followed by C ((?=C)).
If C can be any uppercase ASCII letter, replace C with [A-Z].
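As a minimal sketch of the [A-Z] variant mentioned above, assuming every entry starts with a single uppercase ASCII letter (the vector `y` is made up for illustration and is not from the question):

```r
# Hypothetical input: same shape as x, but with different leading letters
y <- c("C 10.1 D 12.4", "A 12", "B 45.5 C 10")

# Split on whitespace that follows a digit and precedes an uppercase letter
parts <- unlist(strsplit(y, "(?<=\\d)\\s+(?=[A-Z])", perl = TRUE))
parts
# [1] "C 10.1" "D 12.4" "A 12"   "B 45.5" "C 10"
```

The space inside each entry (between the letter and the number) is left alone because it is preceded by a letter, not a digit.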
A somewhat more complicated expression, but easier on the regex side:
unlist(
  sapply(
    strsplit(x, " ?C"),
    function(x) {
      paste0("C", x[nzchar(x)])
    }
  )
)
"C 10.1" "C 12.4" "C 12" "C 45.5" "C 10"
I have a very long string (~1000 words) and I would like to split it into two-word phrases.
I have this:
string <- "A B C D E F"
and I would like this:
"A B"
"B C"
"C D"
"D E"
"E F"
The long string has already been cleaned and stemmed, and stop-words have been removed.
I tried to use str_split, but (I think) it needs a separator, which is tricky here: I don't want to separate A from B, only "A B" from "C D", "B C" from "D E", and so on.
Split on space, then paste with shift:
s <- unlist(strsplit(string, " ", fixed = TRUE))
sl <- length(s)
paste(s[1:(sl-1)], s[2:sl])
# [1] "A B" "B C" "C D" "D E" "E F"
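One caveat with the `1:(sl-1)` indexing: when the string holds a single word, `1:0` counts down instead of being empty, so the result is not an empty vector. A minimal sketch of a guard using head() and tail(), where `bigrams` is a hypothetical helper name and the input is assumed to be separated by single spaces:

```r
# Hypothetical helper: bigrams of a single-space-separated string
bigrams <- function(string) {
  s <- strsplit(string, " ", fixed = TRUE)[[1]]
  # head(s, -1) drops the last word, tail(s, -1) drops the first;
  # both are empty when s has only one word
  paste(head(s, -1), tail(s, -1))
}

bigrams("A B C D E F")
# [1] "A B" "B C" "C D" "D E" "E F"
bigrams("A")
# character(0)
```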
tmp <- strsplit(string, " ")[[1]]
tmp
# [1] "A" "B" "C" "D" "E" "F"
sapply(seq_along(tmp)[-1], function(z) paste(tmp[z-1:0], collapse = " "))
# [1] "A B" "B C" "C D" "D E" "E F"
If you already use a text-mining package (as the cleaning, stemming, and stop-word removal would suggest), it most likely has a function for generating n-grams (and not just bigrams). For example, quanteda::tokens_ngrams() or tidytext::unnest_ngrams():
string <- "A B C D E F"
quanteda::tokens_ngrams(quanteda::tokens(string), concatenator = " ")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "A B" "B C" "C D" "D E" "E F"
data.frame(s = string) |>
tidytext::unnest_ngrams(input = "s", output = "bigrams", n = 2)
#> bigrams
#> 1 a b
#> 2 b c
#> 3 c d
#> 4 d e
#> 5 e f
Created on 2023-01-31 with reprex v2.0.2
Another option is to use a regex with a lookahead.
string <- "A B C D E F"
. <- gregexpr("\\S+\\s+(?=(\\S+))", string, perl=TRUE)[[1]]
attr(.,"match.length") <- attr(.,"match.length") + attr(., "capture.length")
regmatches(string, list(.))[[1]]
#[1] "A B" "B C" "C D" "D E" "E F"
I have a vector
x <- c("a b c", "d e")
with split entries
str_split(x, " ")
I want to get all permutations of each split entry, so the result should be
c("a b c", "b c a", "c a b", "a c b", "b a c", "c b a", "d e", "e d")
I tried to use the function
permutations(n, r, v=1:n, set=TRUE, repeats.allowed=FALSE)
After the str_split step, you can use combinat::permn to create all possible permutations of each entry's words and paste them together.
result <- unlist(sapply(strsplit(x, " "), function(x)
combinat::permn(x, paste0, collapse = " ")))
result
#[1] "a b c" "a c b" "c a b" "c b a" "b c a" "b a c" "d e" "e d"
You can try pracma::perms, as shown below:
unlist(
  Map(
    function(v) do.call(paste, as.data.frame(pracma::perms(v))),
    strsplit(x, " ")
  )
)
which gives
[1] "c b a" "c a b" "b c a" "b a c" "a b c" "a c b" "e d" "d e"
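If neither package is available, the same idea can be sketched in base R with a small recursive function. This is only an illustration: `all_perms` is a hypothetical helper, not part of combinat or pracma, and it does not deduplicate repeated words.

```r
# Hypothetical helper: list of all permutations of a vector
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    # fix v[i] in front, then permute the remaining elements
    for (p in all_perms(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

x <- c("a b c", "d e")
result <- unlist(lapply(strsplit(x, " "), function(words) {
  vapply(all_perms(words), paste, character(1), collapse = " ")
}))
result
# [1] "a b c" "a c b" "b a c" "b c a" "c a b" "c b a" "d e"   "e d"
```

The number of permutations grows factorially with the number of words, so this is only practical for short entries.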
I want to loop over combinations created by combn().
Input:
"a" "b" "c" "d"
Desired Output:
[1] "a" "b" "c" "d"
[1] "a and b" "a and c" "a and d" "b and c" "b and d" "c and d"
[1] "a and b and c" "a and b and d" "a and c and d" "b and c and d"
[1] "a and b and c and d"
What I tried:
classes <- letters[1:4]
cl <- lapply(1:length(classes), combn, x = classes)
apply(cl[[1]], 2, paste, collapse = " and ")
apply(cl[[2]], 2, paste, collapse = " and ")
apply(cl[[3]], 2, paste, collapse = " and ")
apply(cl[[4]], 2, paste, collapse = " and ")
Basically, my question is: what is the best way to loop over the last part, apply(cl[[NR]], 2, paste, collapse = " and ")?
I thought about lapply, but then I would have to pass FUN twice, and it seems odd to combine lapply and apply in one call. A for loop is possible, but maybe there is a more efficient way.
If the question is better suited for Code Review, I am happy to migrate it.
You can iterate over the length of your vector and use the function argument of combn() to collapse the output using paste():
vec <- letters[1:4]
lapply(seq_along(vec), function(x) combn(vec, x, FUN = paste, collapse = " and "))
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "a and b" "a and c" "a and d" "b and c" "b and d" "c and d"
[[3]]
[1] "a and b and c" "a and b and d" "a and c and d" "b and c and d"
[[4]]
[1] "a and b and c and d"
Here is my sample:
a = c("a","b","c")
b = c("1","2","3")
I need to concatenate a and b automatically. The result should be "a 1","a 2","a 3","b 1","b 2","b 3","c 1","c 2","c 3".
For now, I am using the paste function:
paste(a[1],b[1])
I need an automatic way to do this. Besides writing a loop, is there any easier way to achieve this?
c(outer(a, b, paste))
# [1] "a 1" "b 1" "c 1" "a 2" "b 2" "c 2" "a 3" "b 3" "c 3"
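Note that outer() varies `a` fastest, whereas the order requested in the question varies `b` fastest. A minimal sketch of getting the requested order from the same call, by transposing the matrix before flattening it:

```r
a <- c("a", "b", "c")
b <- c("1", "2", "3")

# outer() fills a matrix with rows indexed by a and columns by b;
# transposing first makes c() walk through b for each element of a
res <- c(t(outer(a, b, paste)))
res
# [1] "a 1" "a 2" "a 3" "b 1" "b 2" "b 3" "c 1" "c 2" "c 3"
```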
Other options are:
paste(rep(a, each = length(b)), b)
or:
with(expand.grid(b,a),paste(Var2,Var1))
You can do:
c(sapply(a, function(x) {paste(x,b)}))
[1] "a 1" "a 2" "a 3" "b 1" "b 2" "b 3" "c 1" "c 2" "c 3"
I don't know the proper technical terms for this kind of operation, so it has been difficult to search for existing solutions. I thought I would try to post my own question and hopefully someone can help me out (or point me in the right direction).
I have a vector of characters and I want to collect them in groups of twos and threes. To illustrate, here is a simplified version:
The table I have:
"a"
"b"
"c"
"d"
"e"
"f"
I want to run through the vector and concatenate groups of two and three elements. This is the end result I want:
"a b"
"b c"
"c d"
"d e"
"e f"
And
"a b c"
"b c d"
"c d e"
"d e f"
I solved this the simplest and dirtiest way possible by using for-loops, but it takes a long time to run and I am convinced it can be done more efficiently.
Here is my quick-and-dirty hack:
t1 <- c("a", "b", "c", "d", "e", "f")
t2 <- rep("", length(t1)-1)
# seq_len() avoids the precedence trap: 1:length(t1)-1 means (1:length(t1))-1
for (i in seq_len(length(t1)-1)) {
  t2[i] = paste(t1[i], t1[i+1])
}
t3 <- rep("", length(t1)-2)
for (i in seq_len(length(t1)-2)) {
  t3[i] = paste(t1[i], t1[i+1], t1[i+2])
}
I was looking into sapply and tapply etc. but I can't seem to figure out how to use "the following element" in the vector.
Any help will be rewarded with my eternal gratitude!
-------------- Edit --------------
Run times of the suggestions using input data with ~ 3 million rows:
START: [1] "2016-11-20 19:24:50 CET"
For-loop: [1] "2016-11-20 19:28:26 CET"
rollapply: [1] "2016-11-20 19:38:55 CET"
apply(matrix): [1] "2016-11-20 19:42:15 CET"
paste t1[-length...]: [1] "2016-11-20 19:42:37 CET"
grep: [1] "2016-11-20 19:44:30 CET"
Have you considered the zoo package? For example:
library('zoo')
input<-c('a','b','c','d','e','f')
output<-rollapply(data=input, width=2, FUN=paste, collapse=" ")
output
will return
"a b" "b c" "c d" "d e" "e f"
The width argument controls how many elements to concatenate. I expect you'll see improved runtimes here too, but I haven't tested it.
For groups of two, we can do this with
paste(t1[-length(t1)], t1[-1])
#[1] "a b" "b c" "c d" "d e" "e f"
and for higher numbers, one option is shift from data.table
library(data.table)
v1 <- do.call(paste, shift(t1, 0:2, type="lead"))
grep("NA", v1, invert=TRUE, value=TRUE)
#[1] "a b c" "b c d" "c d e" "d e f"
Or
n <- length(t1)
n1 <- 3
apply(matrix(t1, ncol=n1, nrow = n+1)[seq(n-(n1-1)),], 1, paste, collapse=' ')
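The shift-and-paste idea for groups of two can also be generalised to any group size in base R. The following is a minimal sketch under that assumption; `ngrams_vec` is a hypothetical helper name, not a function from any of the packages mentioned above.

```r
# Hypothetical helper: build all runs of n consecutive elements of v
ngrams_vec <- function(v, n) {
  m <- length(v) - n + 1          # number of n-grams
  if (m < 1) return(character(0)) # vector shorter than n
  # k-th shifted copy: elements (1 + k) .. (m + k), for k = 0 .. n - 1
  shifted <- lapply(seq_len(n) - 1, function(k) v[(1 + k):(m + k)])
  do.call(paste, shifted)         # element-wise paste of the shifted copies
}

t1 <- c("a", "b", "c", "d", "e", "f")
ngrams_vec(t1, 2)
# [1] "a b" "b c" "c d" "d e" "e f"
ngrams_vec(t1, 3)
# [1] "a b c" "b c d" "c d e" "d e f"
```

Because the shifted copies are trimmed to the same length up front, no NA padding is produced, so entries that legitimately contain the text "NA" are not filtered out.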