I've seen several examples on Google, but I still don't quite understand how it works.
Here's what I'm trying to do.
I have a character vector:
> V <- c("aa","bb","cc","dd","ee","ff")
> V
[1] "aa" "bb" "cc" "dd" "ee" "ff"
I would like the output to be a vector of length length(V)-2 (= 4),
composed of
[1] "aabbcc" "bbccdd" "ccddee" "ddeeff"
that is, a vector holding the concatenation of each run of 3 successive elements of V.
I'm thinking of using something like mapply:
mapply(function(x,i){paste(x[i:i+2],sep="",collapse="")},V,1:(length(V)-2))
but that's not the right syntax.
Thanks
Here's a solution in case your project needs many successive elements. There are other approaches; mapply is just one:
mapply(function(x,y) paste(V[x:y], collapse=""), 1:(length(V)-2), 3:length(V))
#[1] "aabbcc" "bbccdd" "ccddee" "ddeeff"
As per your comments, you can create a function and use lapply for a list:
paste2 <- function(vec, n=3) {
mapply(function(x,y) paste(vec[x:y], collapse=""), 1:(length(vec)-(n-1)), n:length(vec))
}
## single vector still works
paste2(V)
#[1] "aabbcc" "bbccdd" "ccddee" "ddeeff"
## with list
lst <- rep(list(V), 2)
lapply(lst, paste2)
#[[1]]
#[1] "aabbcc" "bbccdd" "ccddee" "ddeeff"
#
#[[2]]
#[1] "aabbcc" "bbccdd" "ccddee" "ddeeff"
You don't need any fancy mapply for this:
n = length(V)
paste0(V[1:(n - 2)], V[2:(n - 1)], V[3:n])
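Here's a parametric solution; you still don't need mapply. It fills a matrix with one more row than length(V), so recycling shifts each column by one element and every row holds one window (expect a recycling warning from matrix()):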
i = 3
apply(matrix(V, nrow = length(V) + 1, ncol = i)[1:(length(V) - i + 1), ],
MARGIN = 1, FUN = paste, collapse = "")
You could functionalize this:
f = function(V, i) {
apply(matrix(V, nrow = length(V) + 1, ncol = i)[1:(length(V) - i + 1), ],
MARGIN = 1, FUN = paste, collapse = "")
}
You could then apply it to a list of vectors like this:
lapply(list(c("a", "b", "c", "d"), letters), f, i = 3)
# [[1]]
# [1] "abc" "bcd"
#
# [[2]]
# [1] "abc" "bcd" "cde" "def" "efg" "fgh" "ghi" "hij" "ijk" "jkl" "klm" "lmn" "mno" "nop" "opq"
# [16] "pqr" "qrs" "rst" "stu" "tuv" "uvw" "vwx" "wxy" "xyz"
You would need mapply (and you could use it with the function) if you had several different vectors and for each vector you wanted a concatenation of a different number of elements.
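As a rough sketch of that mapply case, assuming a hypothetical helper roll_paste (not part of the answer above) and made-up inputs, each vector gets paired with its own window width:

```r
# Hypothetical helper: concatenate every run of k successive elements of v.
roll_paste <- function(v, k) {
  starts <- seq_len(max(0, length(v) - k + 1))
  vapply(starts,
         function(s) paste(v[s:(s + k - 1)], collapse = ""),
         character(1))
}

# Illustrative inputs: two vectors, each with a different window width.
vectors <- list(c("a", "b", "c", "d"),
                c("aa", "bb", "cc", "dd", "ee", "ff"))
widths <- c(2, 3)

# mapply walks over both arguments in parallel;
# SIMPLIFY = FALSE keeps the result as a list.
res <- mapply(roll_paste, vectors, widths, SIMPLIFY = FALSE)
```

Here res[[1]] uses width 2 and res[[2]] uses width 3.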
Here's another alternative using apply() and embed()
rev(apply(embed(rev(V),3), 1, paste, collapse=""))
# [1] "aabbcc" "bbccdd" "ccddee" "ddeeff"
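To see why the two rev() calls are there: embed() puts each window in a row, but in reverse order. A small sketch that reverses the columns instead (same V as in the question):

```r
V <- c("aa", "bb", "cc", "dd", "ee", "ff")

# Each row of m is one window of 3 elements, newest first:
# row 1 is "cc" "bb" "aa", row 2 is "dd" "cc" "bb", and so on.
m <- embed(V, 3)

# Flip the column order so each row reads left to right, then paste.
res <- apply(m[, 3:1, drop = FALSE], 1, paste, collapse = "")
```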
I have a list X of three elements. Each element X[[i]] is a list of two – X1 and X2.
I want to construct a list X_new which would be a list of two elements X1 and X2, each element X_new[[i]] is a list of three.
This works perfectly fine:
X_new <- vector(mode = "list", length = 2)
X_new[[1]] <- lapply(X, function(x) x$X1)
X_new[[2]] <- lapply(X, function(x) x$X2)
But what if instead of 2 I have n? I tried this
X_new <- vector(mode = "list", length = n)
ind <- names(X[[1]])
for (i in 1:n) {
X[[i]] <- lapply(X, function(x) x$ind[i])
}
but it doesn't work, I just get null lists.
We can use transpose from purrr
library(purrr)
X_new <- transpose(X)
-output
X_new
#[[1]]
#[[1]][[1]]
#[1] "a"
#[[1]][[2]]
#[1] "a"
#[[1]][[3]]
#[1] "a"
#[[2]]
#[[2]][[1]]
#[1] "b"
#[[2]][[2]]
#[1] "b"
#[[2]][[3]]
#[1] "b"
Refer to this discussion for an answer. In short, you can use paste inside the [[]] list brackets.
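For a base R route, the loop from the question can probably be repaired by indexing with [[ ]] and a name, since x$ind looks for an element literally called "ind". A minimal sketch, assuming X looks like the data implied by the output above:

```r
# Assumed example data: three elements, each a list with X1 and X2.
X <- list(list(X1 = "a", X2 = "b"),
          list(X1 = "a", X2 = "b"),
          list(X1 = "a", X2 = "b"))

ind <- names(X[[1]])

# For each inner name, pull that element out of every entry of X.
X_new <- lapply(ind, function(nm) lapply(X, function(x) x[[nm]]))
names(X_new) <- ind
```

This works for any number of inner names, not just two.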
I have a string such as:
"aabbccccdd"
I want to break this string into a vector of substrings of length 2 :
"aa" "bb" "cc" "cc" "dd"
Here is one way
substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"
or more generally
text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"
Edit: This is much, much faster
sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
It first splits the string into characters. Then it pastes each odd-positioned character together with the even-positioned character that follows it.
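The trick relies on logical index recycling: c(TRUE, FALSE) is repeated along the vector, so it keeps positions 1, 3, 5, ... and c(FALSE, TRUE) keeps positions 2, 4, 6, .... A small sketch with the intermediate steps named (the names odd and even are just for illustration):

```r
sst <- strsplit("aabbccccdd", "")[[1]]

odd  <- sst[c(TRUE, FALSE)]   # characters at positions 1, 3, 5, ...
even <- sst[c(FALSE, TRUE)]   # characters at positions 2, 4, 6, ...

# paste0 is vectorised, so the pairs are glued element by element.
out <- paste0(odd, even)
```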
Timings
text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
sst <- strsplit(text, "")[[1]]
paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#   test replications elapsed relative user.self sys.self user.child sys.child
# 1   g1          100  95.451 79.87531    95.438        0          0         0
# 2   g2          100   1.195  1.00000     1.196        0          0         0
There are two easy possibilities:
s <- "aabbccccdd"
gregexpr and regmatches:
regmatches(s, gregexpr(".{2}", s))[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"
strsplit:
strsplit(s, "(?<=.{2})", perl = TRUE)[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"
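If the width should be a parameter, the regmatches approach can be wrapped roughly like this. The function name chop_regex is made up for this sketch, and the pattern .{1,n} is a swapped-in variant that keeps a shorter final piece rather than dropping it:

```r
# Hypothetical wrapper: split s into pieces of up to n characters.
chop_regex <- function(s, n) {
  pattern <- sprintf(".{1,%d}", as.integer(n))
  regmatches(s, gregexpr(pattern, s))[[1]]
}

pairs   <- chop_regex("aabbccccdd", 2)
triples <- chop_regex("aabbccccdd", 3)
```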
string <- "aabbccccdd"
# total length of string
num.chars <- nchar(string)
# the indices where each substr will start
starts <- seq(1,num.chars, by=2)
# chop it up
sapply(starts, function(ii) {
substr(string, ii, ii+1)
})
Which gives
[1] "aa" "bb" "cc" "cc" "dd"
One can use a matrix to group the characters:
s2 <- function(x) {
m <- matrix(strsplit(x, '')[[1]], nrow=2)
apply(m, 2, paste, collapse='')
}
s2('aabbccddeeff')
## [1] "aa" "bb" "cc" "dd" "ee" "ff"
Unfortunately, this breaks for an input of odd string length, giving a warning:
s2('abc')
## [1] "ab" "ca"
## Warning message:
## In matrix(strsplit(x, "")[[1]], nrow = 2) :
## data length [3] is not a sub-multiple or multiple of the number of rows [2]
More unfortunate is that g1 and g2 from @GSee silently return incorrect results for an input of odd string length:
g1('abc')
## [1] "ab"
g2('abc')
## [1] "ab" "cb"
Here is a function in the spirit of s2 that takes a parameter for the number of characters in each group and leaves the last entry short if necessary:
s <- function(x, n) {
sst <- strsplit(x, '')[[1]]
m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n)
m[seq_along(sst)] <- sst
apply(m, 2, paste, collapse='')
}
s('hello world', 2)
## [1] "he" "ll" "o " "wo" "rl" "d"
s('hello world', 3)
## [1] "hel" "lo " "wor" "ld"
(It is indeed slower than g2, but faster than g1 by about a factor of 7)
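For comparison, a substring-based sketch that also leaves the last piece short. The name chop is hypothetical, and it assumes a single non-empty string:

```r
# Hypothetical helper: cut x into pieces of n characters,
# with a shorter final piece when nchar(x) is not a multiple of n.
chop <- function(x, n) {
  len <- nchar(x)
  starts <- seq(1, len, by = n)
  # pmin caps the last stop position at the end of the string.
  substring(x, starts, pmin(starts + n - 1, len))
}

odd_input <- chop("abc", 2)
words     <- chop("hello world", 3)
```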
Ugly but works
sequenceString <- "ATGAATAAAG"
J=3#maximum sequence length in file
sequenceSmallVecStart <-
substring(sequenceString, seq(1, nchar(sequenceString)-J+1, J),
seq(J,nchar(sequenceString), J))
sequenceSmallVecEnd <-
substring(sequenceString, max(seq(J, nchar(sequenceString), J))+1)
sequenceSmallVec <-
c(sequenceSmallVecStart,sequenceSmallVecEnd)
cat(sequenceSmallVec,sep = "\n")
Gives
ATG
AAT
AAA
G
Given a vector:
vec <- c(LETTERS[1:10])
I would like to be able to combine it in the following manner:
resA <- c("AB", "CD", "EF", "GH", "IJ")
resB <- c("ABCDEF","GHIJ")
where elements of the vector vec are merged together according to the desired size of each element of the resulting vector. This is 2 in the case of resA and 6 in the case of resB.
Desired solution characteristics
The solution should allow for flexibility with respect to the element sizes, i.e. I may want to have vectors with elements of size 2 or 20
There may be not enough elements in the vector to match the desired chunk size, in that case last element should be shortened accordingly (as shown)
This shouldn't make a difference, but the solution should work on words as well
Attempts
Initially, I was thinking of using something on the lines:
c(
paste0(vec[1:2], collapse = ""),
paste0(vec[3:4], collapse = ""),
paste0(vec[5:6], collapse = "")
# ...
)
but this would have to be adapted to jump through the remaining pairs/bigger groups of the vec and handle last group which often would be of a smaller size.
Here is what I came up with. Using Harlan's idea in this question, you can split the vector into chunks of a given size. You can then apply your paste0() idea to each chunk with lapply(). Finally, you unlist the resulting list.
unlist(lapply(split(vec, ceiling(seq_along(vec)/2)), function(x){paste0(x, collapse = "")}))
# 1 2 3 4 5
#"AB" "CD" "EF" "GH" "IJ"
unlist(lapply(split(vec, ceiling(seq_along(vec)/5)), function(x){paste0(x, collapse = "")}))
# 1 2
#"ABCDE" "FGHIJ"
unlist(lapply(split(vec, ceiling(seq_along(vec)/3)), function(x){paste0(x, collapse = "")}))
# 1 2 3 4
#"ABC" "DEF" "GHI" "J"
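The three calls above differ only in the chunk size, so they could be folded into one function. A minimal sketch, where chunk_paste is a made-up name and vapply plus unname are my substitutions for lapply plus unlist:

```r
# Hypothetical wrapper around the split/ceiling idea.
chunk_paste <- function(vec, n) {
  # Elements 1..n get group 1, n+1..2n get group 2, and so on.
  groups <- ceiling(seq_along(vec) / n)
  # Paste each group together; unname drops the group labels.
  unname(vapply(split(vec, groups), paste0, character(1), collapse = ""))
}

vec <- LETTERS[1:10]
by3 <- chunk_paste(vec, 3)
by5 <- chunk_paste(vec, 5)

# Works on words too, not just single letters.
wordy <- chunk_paste(c("one", "two", "three"), 2)
```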
vec <- c(LETTERS[1:10])
f1 <- function(x, n){
f <- function(x) paste0(x, collapse = '')
regmatches(f(x), gregexpr(f(rep('.', n)), f(x)))[[1]]
}
f1(vec, 2)
# [1] "AB" "CD" "EF" "GH" "IJ"
or
f2 <- function(x, n)
apply(matrix(x, nrow = n), 2, paste0, collapse = '')
f2(vec, 5)
# [1] "ABCDE" "FGHIJ"
or
f3 <- function(x, n) {
f <- function(x) paste0(x, collapse = '')
strsplit(gsub(sprintf('(%s)', f(rep('.', n))), '\\1 ', f(x)), '\\s+')[[1]]
}
f3(vec, 4)
# [1] "ABCD" "EFGH" "IJ"
I would say the last is the best of these, since for the others n must divide length(x) evenly or you will get warnings, recycling, or dropped elements
edit - more
f4 <- function(x, n) {
f <- function(x) paste0(x, collapse = '')
Vectorize(substring, USE.NAMES = FALSE)(f(x), which((seq_along(x) %% n) == 1),
which((seq_along(x) %% n) == 0))
}
f4(vec, 2)
# [1] "AB" "CD" "EF" "GH" "IJ"
or
f5 <- function(x, n)
mapply(function(x) paste0(x, collapse = ''),
split(x, c(0, head(cumsum(rep_len(sequence(n), length(x)) %in% n), -1))),
USE.NAMES = FALSE)
f5(vec, 4)
# [1] "ABCD" "EFGH" "IJ"
Here is another way, working with the original array.
A side note: working with words is not straightforward, since there are at least two ways to understand it. You can either keep each word separate or collapse them first and get individual characters. The next function can deal with both options.
vec <- c(LETTERS[1:10])
vec2 <- c("AB","CDE","F","GHIJ")
cuts <- function(x, n, bychar=F) {
if (bychar) x <- unlist(strsplit(paste0(x, collapse=""), ""))
ii <- seq_along(x)
li <- split(ii, ceiling(ii/n))
return(sapply(li, function(y) paste0(x[y], collapse="")))
}
cuts(vec2,2,F)
# 1 2
# "ABCDE" "FGHIJ"
cuts(vec2,2,T)
# 1 2 3 4 5
# "AB" "CD" "EF" "GH" "IJ"
test <- list(a = list("first"= 1, "second" = 2),
b = list("first" = 3, "second" = 4))
In the list above, I would like to reassign the "first" elements to equal, let's say, five. This for loop works:
for(temp in c("a", "b")) {
test[[temp]]$first <- 5
}
Is there a way to do the same using a vectorized operation (lapply, etc)? The following extracts the values, but I can't get them reassigned:
lapply(test, "[[", "first")
Here is a vectorised one-liner using unlist and relist:
relist((function(x) ifelse(grepl("first",names(x)),5,x))(unlist(test)),test)
$a
$a$first
[1] 5
$a$second
[1] 2
$b
$b$first
[1] 5
$b$second
[1] 4
You can do it like this:
test <- lapply(test, function(x) {x$first <- 5; x})
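Another option along the same lines, sketched with modifyList() from the utils package (the result name updated is just for illustration):

```r
test <- list(a = list("first" = 1, "second" = 2),
             b = list("first" = 3, "second" = 4))

# modifyList replaces the named elements given in val
# and leaves everything else in each sub-list untouched.
updated <- lapply(test, modifyList, val = list(first = 5))
```

This may be handy when several elements need to change at once, since val can carry more than one name.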
I use LETTERS most of the time for my factors but today I tried to go beyond 26 characters:
LETTERS[1:32]
I was expecting an automatic recursive extension (AA, AB, AC, ...) but was disappointed. Is this simply a limitation of LETTERS, or is there a way to get what I'm looking for using another function?
Would 702 be enough?
LETTERS702 <- c(LETTERS, sapply(LETTERS, function(x) paste0(x, LETTERS)))
If not, how about 18,278?
MOAR_LETTERS <- function(n=2) {
n <- as.integer(n[1L])
if(!is.finite(n) || n < 2)
stop("'n' must be a length-1 integer >= 2")
res <- vector("list", n)
res[[1]] <- LETTERS
for(i in 2:n)
res[[i]] <- c(sapply(res[[i-1L]], function(y) paste0(y, LETTERS)))
unlist(res)
}
ml <- MOAR_LETTERS(3)
str(ml)
# chr [1:18278] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" ...
This solution uses recursion. Usage is a bit different in the sense MORELETTERS is not a long vector you will have to store and possibly expand as your inputs get larger. Instead, it is a function that converts your numbers into the new base.
extend <- function(alphabet) function(i) {
base10toA <- function(n, A) {
stopifnot(n >= 0L)
N <- length(A)
j <- n %/% N
if (j == 0L) A[n + 1L] else paste0(Recall(j - 1L, A), A[n %% N + 1L])
}
vapply(i-1L, base10toA, character(1L), alphabet)
}
MORELETTERS <- extend(LETTERS)
MORELETTERS(1:1000)
# [1] "A" "B" ... "ALL"
MORELETTERS(c(1, 26, 27, 1000, 1e6, .Machine$integer.max))
# [1] "A" "Z" "AA" "ALL" "BDWGN" "FXSHRXW"
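The same number-to-name conversion can also be written as a loop, which may be easier to follow than the recursive version. This is only a sketch; to_excel_name is a hypothetical name and it handles one positive whole number at a time:

```r
# Hypothetical helper: convert a 1-based index to an Excel-style name.
to_excel_name <- function(i) {
  stopifnot(length(i) == 1, i >= 1)
  out <- character(0)
  while (i > 0) {
    # Shift to 0-based so that 26 maps to "Z" rather than rolling over.
    r <- (i - 1) %% 26
    out <- c(LETTERS[r + 1], out)   # prepend the least significant letter
    i <- (i - 1) %/% 26
  }
  paste(out, collapse = "")
}

labels <- vapply(c(1, 26, 27, 702, 703, 1000), to_excel_name, character(1))
```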
You can make what you want like this:
LETTERS2<-c(LETTERS[1:26], paste0("A",LETTERS[1:26]))
Another solution for excel style column names, generalized to any number of letters
#' Excel Style Column Names
#'
#' @param n maximum number of letters in column name
excel_style_colnames <- function(n){
unlist(Reduce(
function(x, y) as.vector(outer(x, y, 'paste0')),
lapply(1:n, function(x) LETTERS),
accumulate = TRUE
))
}
A variant on eipi10's method (ordered correctly) using data.table:
library(data.table)
BIG_LETTERS <- c(LETTERS,
do.call("paste0",CJ(LETTERS,LETTERS)),
do.call("paste0",CJ(LETTERS,LETTERS,LETTERS)))
Yet another option:
l2 = c(LETTERS, sort(do.call("paste0", expand.grid(LETTERS, LETTERS[1:3]))))
Adjust the two instances of LETTERS inside expand.grid to get the number of letter pairs you'd like.
A function to produce Excel-style column names, i.e.
# A, B, ..., Z, AA, AB, ..., AZ, BA, BB, ..., ..., ZZ, AAA, ...
letterwrap <- function(n, depth = 1) {
args <- lapply(1:depth, FUN = function(x) return(LETTERS))
x <- do.call(expand.grid, args = list(args, stringsAsFactors = F))
x <- x[, rev(names(x)), drop = F]
x <- do.call(paste0, x)
if (n <= length(x)) return(x[1:n])
return(c(x, letterwrap(n - length(x), depth = depth + 1)))
}
letterwrap(26^2 + 52) # through AAZ
## This will take a few seconds:
# x <- letterwrap(1e6)
It's probably not the fastest, but it extends indefinitely and is nicely predictable. Took about 20 seconds to produce through 1 million, BDWGN.
(For a few more details, see here: https://stackoverflow.com/a/21689613/903061)
A little late to the party, but I want to play too.
You can also use sub, and sprintf in place of paste0 and get a length 702 vector.
c(LETTERS, sapply(LETTERS, sub, pattern = " ", x = sprintf("%2s", LETTERS)))
Here's another addition to the list. This seems a bit faster than Gregor's (comparison done on my computer - using length.out = 1e6 his took 12.88 seconds, mine was 6.2), and can also be extended indefinitely. The flip side is that it's 2 functions, not just 1.
make.chars <- function(length.out, case, n.char = NULL) {
if(is.null(n.char))
n.char <- ceiling(log(length.out, 26))
m <- sapply(n.char:1, function(x) {
rep(rep(1:26, each = 26^(x-1)) , length.out = length.out)
})
m.char <- switch(case,
'lower' = letters[m],
'upper' = LETTERS[m]
)
dim(m.char) <- dim(m)
apply(m.char, 1, function(x) paste(x, collapse = ""))
}
get.letters <- function(length.out, case = 'upper'){
max.char <- ceiling(log(length.out, 26))
grp <- rep(1:max.char, 26^(1:max.char))[1:length.out]
unlist(lapply(unique(grp), function(n) make.chars(length(grp[grp == n]), case = case, n.char = n)))
}
##
make.chars(5, "lower", 2)
#> [1] "aa" "ab" "ac" "ad" "ae"
make.chars(5, "lower")
#> [1] "a" "b" "c" "d" "e"
make.chars(5, "upper", 4)
#> [1] "AAAA" "AAAB" "AAAC" "AAAD" "AAAE"
tmp <- get.letters(800)
head(tmp)
#> [1] "A" "B" "C" "D" "E" "F"
tail(tmp)
#> [1] "ADO" "ADP" "ADQ" "ADR" "ADS" "ADT"
Created on 2019-03-22 by the reprex package (v0.2.1)