I have to split a vector into n chunks of equal size in R. I couldn't find a base function that does this, and Google didn't get me anywhere either. Here is what I came up with so far:
x <- 1:10
n <- 3
chunk <- function(x,n) split(x, factor(sort(rank(x)%%n)))
chunk(x,n)
$`0`
[1] 1 2 3
$`1`
[1] 4 5 6 7
$`2`
[1] 8 9 10
A one-liner splitting d into chunks of size 20:
split(d, ceiling(seq_along(d)/20))
More details: I think all you need is seq_along(), split() and ceiling():
> d <- rpois(73,5)
> d
[1] 3 1 11 4 1 2 3 2 4 10 10 2 7 4 6 6 2 1 1 2 3 8 3 10 7 4
[27] 3 4 4 1 1 7 2 4 6 0 5 7 4 6 8 4 7 12 4 6 8 4 2 7 6 5
[53] 4 5 4 5 5 8 7 7 7 6 2 4 3 3 8 11 6 6 1 8 4
> max <- 20
> x <- seq_along(d)
> d1 <- split(d, ceiling(x/max))
> d1
$`1`
[1] 3 1 11 4 1 2 3 2 4 10 10 2 7 4 6 6 2 1 1 2
$`2`
[1] 3 8 3 10 7 4 3 4 4 1 1 7 2 4 6 0 5 7 4 6
$`3`
[1] 8 4 7 12 4 6 8 4 2 7 6 5 4 5 4 5 5 8 7 7
$`4`
[1] 7 6 2 4 3 3 8 11 6 6 1 8 4
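The one-liner above can be wrapped in a small helper. This is only a sketch: chunk_by_size is a made-up name, and it assumes size is a positive whole number.

```r
# Hypothetical helper: split `x` into consecutive chunks of at most `size` elements.
chunk_by_size <- function(x, size) {
  # seq_along(x) / size maps positions 1..size to (0, 1], the next block to (1, 2], ...
  # ceiling() turns that into the group ids 1, 2, 3, ...
  split(x, ceiling(seq_along(x) / size))
}

parts <- chunk_by_size(letters[1:7], 3)
lengths(parts)  # chunk sizes: 3, 3 and 1
```

Because the grouping is built from positions rather than values, this works for any vector type, not only numeric ones.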
chunk2 <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
A simplified version:
n = 3
split(x, sort(x%%n))
NB: This will only work on numeric vectors.
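If x is not numeric, one possible workaround (a sketch, not part of the answer above) is to take the modulus of the positions instead of the values:

```r
x <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
n <- 3

# Group ids come from the positions 1..length(x), so the type of x does not matter.
groups <- sort(seq_along(x) %% n)
res <- split(x, groups)
res
```

As in the numeric version, the groups are labelled 0, 1 and 2 because they are remainders.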
Using base R's rep_len:
x <- 1:10
n <- 3
split(x, rep_len(1:n, length(x)))
# $`1`
# [1] 1 4 7 10
#
# $`2`
# [1] 2 5 8
#
# $`3`
# [1] 3 6 9
And as already mentioned if you want sorted indices, simply:
split(x, sort(rep_len(1:n, length(x))))
# $`1`
# [1] 1 2 3 4
#
# $`2`
# [1] 5 6 7
#
# $`3`
# [1] 8 9 10
Try the ggplot2 function, cut_number:
library(ggplot2)
x <- 1:10
n <- 3
cut_number(x, n) # labels = FALSE if you just want an integer result
#> [1] [1,4] [1,4] [1,4] [1,4] (4,7] (4,7] (4,7] (7,10] (7,10] (7,10]
#> Levels: [1,4] (4,7] (7,10]
# if you want it split into a list:
split(x, cut_number(x, n))
#> $`[1,4]`
#> [1] 1 2 3 4
#>
#> $`(4,7]`
#> [1] 5 6 7
#>
#> $`(7,10]`
#> [1] 8 9 10
This will split it differently to what you have, but is still quite a nice list structure I think:
chunk.2 <- function(x, n, force.number.of.groups = TRUE, len = length(x), groups = trunc(len/n), overflow = len%%n) {
if(force.number.of.groups) {
f1 <- as.character(sort(rep(1:n, groups)))
f <- as.character(c(f1, rep(n, overflow)))
} else {
f1 <- as.character(sort(rep(1:groups, n)))
f <- as.character(c(f1, rep("overflow", overflow)))
}
g <- split(x, f)
if(force.number.of.groups) {
g.names <- names(g)
g.names.ordered <- as.character(sort(as.numeric(g.names)))
} else {
g.names <- names(g[-length(g)])
g.names.ordered <- as.character(sort(as.numeric(g.names)))
g.names.ordered <- c(g.names.ordered, "overflow")
}
return(g[g.names.ordered])
}
Which will give you the following, depending on how you want it formatted:
> x <- 1:10; n <- 3
> chunk.2(x, n, force.number.of.groups = FALSE)
$`1`
[1] 1 2 3
$`2`
[1] 4 5 6
$`3`
[1] 7 8 9
$overflow
[1] 10
> chunk.2(x, n, force.number.of.groups = TRUE)
$`1`
[1] 1 2 3
$`2`
[1] 4 5 6
$`3`
[1] 7 8 9 10
Running a couple of timings using these settings:
set.seed(42)
x <- rnorm(1e7)
n <- 3
Then we have the following results:
> system.time(chunk(x, n)) # your function
user system elapsed
29.500 0.620 30.125
> system.time(chunk.2(x, n, force.number.of.groups = TRUE))
user system elapsed
5.360 0.300 5.663
Note: Changing as.factor() to as.character() made my function twice as fast.
If you don't like split() and you don't like matrix() (with its dangling NAs), there's this:
chunk <- function(x, n) (mapply(function(a, b) (x[a:b]), seq.int(from=1, to=length(x), by=n), pmin(seq.int(from=1, to=length(x), by=n)+(n-1), length(x)), SIMPLIFY=FALSE))
Like split(), it returns a list, but it doesn't waste time or space with labels, so it may be more performant.
A few more variants to the pile...
> x <- 1:10
> n <- 3
Note that you don't need the factor function here, but you still want to sort; otherwise your first vector would be 1 2 3 10:
> chunk <- function(x, n) split(x, sort(rank(x) %% n))
> chunk(x,n)
$`0`
[1] 1 2 3
$`1`
[1] 4 5 6 7
$`2`
[1] 8 9 10
Or you can assign character indices instead of the numbers in backticks above:
> my.chunk <- function(x, n) split(x, sort(rep(letters[1:n], each=n, len=length(x))))
> my.chunk(x, n)
$a
[1] 1 2 3 4
$b
[1] 5 6 7
$c
[1] 8 9 10
Or you can use plain-word names stored in a vector. Note that using sort to get consecutive values in x alphabetizes the labels:
> my.other.chunk <- function(x, n) split(x, sort(rep(c("tom", "dick", "harry"), each=n, len=length(x))))
> my.other.chunk(x, n)
$dick
[1] 1 2 3
$harry
[1] 4 5 6
$tom
[1] 7 8 9 10
Yet another possibility is the splitIndices function from package parallel:
library(parallel)
splitIndices(20, 3)
Gives:
[[1]]
[1] 1 2 3 4 5 6 7
[[2]]
[1] 8 9 10 11 12 13
[[3]]
[1] 14 15 16 17 18 19 20
NB: splitIndices only returns integer indices, not the elements themselves. If you want to split a character vector (or any other vector), you need to use those indices to subset it: lapply(splitIndices(20, 3), \(x) letters[1:20][x])
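The indexing step can be wrapped so that it works for any vector. A minimal sketch, where chunk_by_index is a hypothetical name and only parallel::splitIndices() comes from the package:

```r
library(parallel)

# Hypothetical wrapper: let splitIndices() decide the positions,
# then use them to subset the vector itself.
chunk_by_index <- function(x, n) {
  lapply(splitIndices(length(x), n), function(i) x[i])
}

res <- chunk_by_index(letters[1:20], 3)
lengths(res)
```

The result is an unnamed list of n consecutive pieces that together contain every element of x exactly once.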
You could combine the split/cut, as suggested by mdsummer, with quantile to create even groups:
split(x,cut(x,quantile(x,(0:n)/n), include.lowest=TRUE, labels=FALSE))
This gives the same result for your example, but not for skewed variables.
split(x,matrix(1:n,n,length(x))[1:length(x)])
Perhaps this is clearer, but it is the same idea:
split(x,rep(1:n, ceiling(length(x)/n),length.out = length(x)))
If you want it ordered, wrap a sort around the grouping vector.
Here's another variant.
NOTE: in this version the second parameter is the CHUNK SIZE, not the number of chunks.
All chunks are the same size, except possibly the last one,
which is never bigger than the chunk size.
chunk <- function(x,n)
{
f <- sort(rep(1:(trunc(length(x)/n)+1),n))[1:length(x)]
return(split(x,f))
}
#Test
n<-c(1,2,3,4,5,6,7,8,9,10,11)
c<-chunk(n,5)
q<-lapply(c, function(r) cat(r,sep=",",collapse="|") )
#output
1,2,3,4,5,|6,7,8,9,10,|11,|
I needed the same function and have read the previous solutions; however, I also needed the unbalanced chunk to be at the end, i.e. if I split 10 elements into vectors of 3 each, then my result should have vectors with 3, 3 and 4 elements respectively. So I used the following (I left the code unoptimised for readability; otherwise there is no need for so many variables):
chunk <- function(x,n){
numOfVectors <- floor(length(x)/n)
elementsPerVector <- c(rep(n,numOfVectors-1),n+length(x) %% n)
elemDistPerVector <- rep(1:numOfVectors,elementsPerVector)
split(x,factor(elemDistPerVector))
}
set.seed(1)
x <- rnorm(10)
n <- 3
chunk(x,n)
$`1`
[1] -0.6264538 0.1836433 -0.8356286
$`2`
[1] 1.5952808 0.3295078 -0.8204684
$`3`
[1] 0.4874291 0.7383247 0.5757814 -0.3053884
A simple function for splitting a vector using plain indexes; no need to overcomplicate this:
vsplit <- function(v, n) {
l = length(v)
r = l/n
return(lapply(1:n, function(i) {
s = max(1, round(r*(i-1))+1)
e = min(l, round(r*i))
return(v[s:e])
}))
}
Sorry that this answer comes so late, but maybe it can be useful for someone else. There is actually a very useful solution to this problem, explained at the end of ?split.
> testVector <- c(1:10) #I want to divide it into 5 parts
> VectorList <- split(testVector, 1:5)
> VectorList
$`1`
[1] 1 6
$`2`
[1] 2 7
$`3`
[1] 3 8
$`4`
[1] 4 9
$`5`
[1] 5 10
Credit to @Sebastian for this function
chunk <- function(x,y){
split(x, factor(sort(rank(row.names(x))%%y)))
}
If you don't like split() and you don't mind NAs padding out your short tail:
chunk <- function(x, n) { if((length(x)%%n)==0) {return(matrix(x, nrow=n))} else {return(matrix(append(x, rep(NA, n-(length(x)%%n))), nrow=n))} }
The columns of the returned matrix ([,1:ncol]) are the droids you are looking for.
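The same padding idea can be written a little more compactly. This is a sketch rather than a drop-in replacement: chunk_matrix is a made-up name, and it assumes the second argument is the chunk size (the number of rows).

```r
# Hypothetical variant: pad x with NA up to a multiple of `size`,
# then fill a matrix column by column so each column is one chunk.
chunk_matrix <- function(x, size) {
  pad <- (-length(x)) %% size   # number of NAs needed; 0 if already a multiple
  matrix(c(x, rep(NA, pad)), nrow = size)
}

m <- chunk_matrix(1:10, 4)
m[, 3]  # last chunk: 9 10 NA NA
```

Using the modulus of the negative length avoids the separate if/else branch for the case where no padding is needed.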
I need a function that takes the argument of a data.table (in quotes) and another argument that is the upper limit on the number of rows in the subsets of that original data.table. This function produces whatever number of data.tables that upper limit allows for:
library(data.table)
split_dt <- function(x,y)
{
  dt <- get(x)
  n <- nrow(dt)
  for(i in seq(from=1,to=n,by=y))
  {
    # rows i to i+y-1, capped at n so that chunks do not overlap and no NA rows are appended
    assign(paste0("df_",i), dt[i:min(i + y - 1, n)], envir=globalenv())
  }
}
This function gives me a series of data.tables named df_[number], where the number is the starting row in the original data.table. The last data.table can be shorter than the others. This type of function is useful because certain GIS software limits how many address pins you can import, for example. So slicing up data.tables into smaller chunks may not be recommended, but it may not be avoidable.
I have come up with this solution:
require(magrittr)
create.chunks <- function(x, elements.per.chunk){
# plain R version
# split(x, rep(seq_along(x), each = elements.per.chunk)[seq_along(x)])
# magrittr version - because that's what people use now
x %>% seq_along %>% rep(., each = elements.per.chunk) %>% extract(seq_along(x)) %>% split(x, .)
}
create.chunks(letters[1:10], 3)
$`1`
[1] "a" "b" "c"
$`2`
[1] "d" "e" "f"
$`3`
[1] "g" "h" "i"
$`4`
[1] "j"
The key is to use the each = elements.per.chunk argument of rep() to make it work. Using seq_along acts like rank(x) in my previous solution, but it actually produces the correct result with duplicated entries.
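To illustrate the point about duplicated entries, here is a small sketch comparing a rank-based grouping with a position-based one on a vector of identical values (the variable names are only for illustration):

```r
x <- c(5, 5, 5, 5, 5, 5)
n <- 3

# rank() gives tied values the same (average) rank, so all elements
# end up with the same remainder and therefore in a single group.
by_rank <- split(x, sort(rank(x) %% n))

# seq_along() depends only on position, so ties do not matter.
by_pos <- split(x, sort(seq_along(x) %% n))

length(by_rank)  # 1
length(by_pos)   # 3
```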
Here's yet another one, allowing you to control if you want the result ordered or not:
split_to_chunks <- function(x, n, keep.order=TRUE){
if(keep.order){
return(split(x, sort(rep(1:n, length.out = length(x)))))
}else{
return(split(x, rep(1:n, length.out = length(x))))
}
}
split_to_chunks(x = 1:11, n = 3)
$`1`
[1] 1 2 3 4
$`2`
[1] 5 6 7 8
$`3`
[1] 9 10 11
split_to_chunks(x = 1:11, n = 3, keep.order=FALSE)
$`1`
[1] 1 4 7 10
$`2`
[1] 2 5 8 11
$`3`
[1] 3 6 9
Not sure if this answers the OP's question, but I think %% can be useful here:
df # some data.frame
N_CHUNKS <- 10
I_VEC <- 1:nrow(df)
df_split <- split(df, sort(I_VEC %% N_CHUNKS))
This splits into chunks of size ⌊n/k⌋+1 or ⌊n/k⌋ and does not use the O(n log n) sort.
get_chunk_id<-function(n, k){
r <- n %% k
s <- n %/% k
i<-seq_len(n)
1 + ifelse (i <= r * (s+1), (i-1) %/% (s+1), r + ((i - r * (s+1)-1) %/% s))
}
split(1:10, get_chunk_id(10,3))
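An alternative formulation of the same sort-free idea, shown only as a sketch (chunk_ids is a hypothetical name, not the function above): compute the chunk sizes first, then repeat each chunk id that many times.

```r
# Hypothetical alternative: the first (n %% k) chunks get one extra element.
chunk_ids <- function(n, k) {
  sizes <- rep(n %/% k, k) + (seq_len(k) <= n %% k)
  rep(seq_len(k), times = sizes)
}

ids <- chunk_ids(10, 3)
ids                 # 1 1 1 1 2 2 2 3 3 3
split(1:10, ids)
```

For n = 10 and k = 3 this yields chunk sizes 4, 3 and 3, with the larger chunk first.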
In R, I systematically try to avoid "for" loops and use the lapply() family instead.
But how can I do so when an iteration contains an increment step?
For example: is it possible to obtain the same result as below with an lapply approach?
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
for (i in 1:10){
a <- c(a, sample(c(0,1), 1))
b <- c(b, (paste(a, collapse = "-")))
}
data.frame(a, b)
> data.frame(a, b)
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 0 0-1-0-0
5 1 0-1-0-0-1
6 0 0-1-0-0-1-0
7 0 0-1-0-0-1-0-0
8 0 0-1-0-0-1-0-0-0
9 1 0-1-0-0-1-0-0-0-1
10 1 0-1-0-0-1-0-0-0-1-1
EDIT
My question was very badly worded. The new example below is much more illustrative: is there any way to use the lapply family if each iteration is calculated from the previous one?
a <- c()
b <- c()
for (i in 1:10){
a <- c(a, sample(c(0,1), 1))
b <- c(b, (paste(a, collapse = "-")))
}
data.frame(a, b)
> data.frame(a, b)
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 1 0-1-0-1
5 1 0-1-0-1-1
6 1 0-1-0-1-1-1
7 1 0-1-0-1-1-1-1
8 0 0-1-0-1-1-1-1-0
9 1 0-1-0-1-1-1-1-0-1
10 1 0-1-0-1-1-1-1-0-1-1
For the sake of completeness, there is also the accumulate() function from the purrr package.
So, building on the answers of Sotos and ThomasIsCoding:
df <- data.frame(a = 1:10)
df$b <- purrr::accumulate(df$a, paste, sep = "-")
df
a b
1 1 1
2 2 1-2
3 3 1-2-3
4 4 1-2-3-4
5 5 1-2-3-4-5
6 6 1-2-3-4-5-6
7 7 1-2-3-4-5-6-7
8 8 1-2-3-4-5-6-7-8
9 9 1-2-3-4-5-6-7-8-9
10 10 1-2-3-4-5-6-7-8-9-10
The difference to Reduce() is
that accumulate() is a function verb on its own (no additional parameter accumulate = TRUE required)
and that additional arguments like sep = "-" can be passed on to the mapped function which may help to avoid the creation of an anonymous function.
EDIT
If I understand correctly OP's edit of the question, the OP is asking if a for loop which computes a result iteratively can be replaced by lapply().
This is difficult for me to answer. Here are some thoughts and observations:
First, accumulate() still will work:
set.seed(1L) # required for reproducible data
df <- data.frame(a = sample(0:1, 10L, TRUE))
df$b <- purrr::accumulate(df$a, paste, sep = "-")
df
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 0 0-1-0-0
5 1 0-1-0-0-1
6 0 0-1-0-0-1-0
7 0 0-1-0-0-1-0-0
8 0 0-1-0-0-1-0-0-0
9 1 0-1-0-0-1-0-0-0-1
10 1 0-1-0-0-1-0-0-0-1-1
This is possible because the computation of a can be pulled out of the loop, as it does not depend on b.
IMHO, accumulate() and Reduce() do what the OP is looking for but are not called lapply(): they take the result of the previous iteration and combine it with the current value, for instance
Reduce(`+`, 1:3)
returns the sum of 1, 2, and 3 by iteratively computing ((1 + 2) + 3); without an explicit initial value, the first element serves as the starting point. This can be visualised by using the accumulate parameter
Reduce(`+`, 1:3, accumulate = TRUE)
[1] 1 3 6
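To make the "previous result combined with the current value" idea concrete, here is a hand-written fold. It is only an illustration of what the accumulate variant does conceptually, not how Reduce() is implemented; running is a made-up name and the sketch assumes x is non-empty.

```r
# Illustration only: carry an accumulator through x and keep every intermediate value.
running <- function(f, x) {
  out <- vector("list", length(x))
  acc <- x[[1]]                     # the first element is the starting value
  out[[1]] <- acc
  for (i in seq_along(x)[-1]) {
    acc <- f(acc, x[[i]])           # previous result combined with the current element
    out[[i]] <- acc
  }
  out
}

unlist(running(`+`, 1:3))                                      # 1 3 6
unlist(running(function(p, q) paste(p, q, sep = "-"), 1:3))    # "1" "1-2" "1-2-3"
```

The loop is still there; Reduce() and accumulate() simply hide it behind a function call.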
Second, there is a major difference between a for loop and functions of the lapply() family: lapply(X, FUN, ...) requires a function FUN to be called on each element of X. So, scoping rules for functions apply.
When we transplant the body of the loop into an anonymous function within lapply()
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
lapply(1:10, function(i) {
a <- c(a, sample(c(0,1), 1))
b <- c(b, (paste(a, collapse = "-")))
})
we get
[[1]]
[1] "0"
[[2]]
[1] "1"
[[3]]
[1] "0"
[[4]]
[1] "0"
[[5]]
[1] "1"
[[6]]
[1] "0"
[[7]]
[1] "0"
[[8]]
[1] "0"
[[9]]
[1] "1"
[[10]]
[1] "1"
data.frame(a, b)
data frame with 0 columns and 0 rows
Due to the scoping rules, the assignments to a and b inside the function create variables that are local to that function call. The a and b defined outside of the function are never modified.
This can be fixed by global assignment using the global assignment operator <<-:
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
lapply(1:10, function(i) {
a <<- c(a, sample(c(0,1), 1))
b <<- c(b, (paste(a, collapse = "-")))
})
data.frame(a, b)
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 0 0-1-0-0
5 1 0-1-0-0-1
6 0 0-1-0-0-1-0
7 0 0-1-0-0-1-0-0
8 0 0-1-0-0-1-0-0-0
9 1 0-1-0-0-1-0-0-0-1
10 1 0-1-0-0-1-0-0-0-1-1
However, global assignment is considered bad programming practice and should be avoided, see, e.g., the 6th Circle of Patrick Burns' The R Inferno and many questions on SO.
Third, the loop as written grows vectors inside the loop. This is also considered bad practice because it requires copying the data over and over again, which may slow things down tremendously with increasing size. See, e.g., the 2nd Circle of Patrick Burns' The R Inferno.
However, the original code
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
for (i in 1:10) {
a <- c(a, sample(c(0,1), 1))
b <- c(b, (paste(a, collapse = "-")))
}
data.frame(a, b)
can be re-written as
a <- integer(10)
b <- character(10)
set.seed(1L) # required for reproducible data
for (i in seq_along(a)) {
a[i] <- sample(c(0,1), 1)
b[i] <- if (i == 1L) a[1] else paste(b[i-1], a[i], sep = "-")
}
data.frame(a, b)
Here, vectors are pre-allocated with the required size to hold the result. Elements to update are identified by subscripting.
Calculation of b[i] still depends only on the value of the previous iteration b[i-1] and the current value a[i], as requested by the OP.
Another way is to use Reduce with accumulate = TRUE, i.e.
df$new <- do.call(rbind, Reduce(paste, split(df, seq(nrow(df))), accumulate = TRUE))
which gives,
a new
1 1 1
2 2 1 2
3 3 1 2 3
4 4 1 2 3 4
5 5 1 2 3 4 5
6 6 1 2 3 4 5 6
7 7 1 2 3 4 5 6 7
8 8 1 2 3 4 5 6 7 8
9 9 1 2 3 4 5 6 7 8 9
10 10 1 2 3 4 5 6 7 8 9 10
You can use sapply (lapply would work too, but it returns a list) to iterate over every value of a in df, create a sequence, and paste the values together.
df <- data.frame(a = 1:10)
df$b <- sapply(df$a, function(x) paste(seq(x), collapse = "-"))
df
# a b
#1 1 1
#2 2 1-2
#3 3 1-2-3
#4 4 1-2-3-4
#5 5 1-2-3-4-5
#6 6 1-2-3-4-5-6
#7 7 1-2-3-4-5-6-7
#8 8 1-2-3-4-5-6-7-8
#9 9 1-2-3-4-5-6-7-8-9
#10 10 1-2-3-4-5-6-7-8-9-10
If the data could contain non-numeric values, on which we cannot use seq, such as
df <- data.frame(a =letters[1:10])
then we can use
df$b <- sapply(seq_along(df$a), function(x) paste(df$a[seq_len(x)], collapse = "-"))
df
# a b
#1 a a
#2 b a-b
#3 c a-b-c
#4 d a-b-c-d
#5 e a-b-c-d-e
#6 f a-b-c-d-e-f
#7 g a-b-c-d-e-f-g
#8 h a-b-c-d-e-f-g-h
#9 i a-b-c-d-e-f-g-h-i
#10 j a-b-c-d-e-f-g-h-i-j
Another way of using Reduce, different from the approach by @Sotos
df$b <- Reduce(function(...) paste(...,sep = "-"), df$a, accumulate = TRUE)
such that
> df
a b
1 1 1
2 2 1-2
3 3 1-2-3
4 4 1-2-3-4
5 5 1-2-3-4-5
6 6 1-2-3-4-5-6
7 7 1-2-3-4-5-6-7
8 8 1-2-3-4-5-6-7-8
9 9 1-2-3-4-5-6-7-8-9
10 10 1-2-3-4-5-6-7-8-9-10
add <- c( 2,3,4)
for (i in add){
a <- i +3
b <- a + 3
z <- a + b
print(z)
}
# Result
[1] 13
[1] 15
[1] 17
In R, it can print the result, but I want to save the results for further computation in a vector, data frame or list
Thanks in advance
Try something like:
add <- c(2, 3, 4)
z <- rep(0, length(add))
idx = 1
for(i in add) {
a <- i + 3
b <- a + 3
z[idx] <- a + b
idx <- idx + 1
}
print(z)
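The separate idx counter can be avoided by looping over positions instead of values. A sketch of the same loop using seq_along():

```r
add <- c(2, 3, 4)
z <- numeric(length(add))      # pre-allocate the result vector

for (k in seq_along(add)) {    # k is the position, add[k] the value
  a <- add[k] + 3
  b <- a + 3
  z[k] <- a + b
}

z  # 13 15 17
```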
This is simple algebra; there is no need for a for loop at all
res <- (add + 3)*2 + 3
res
## [1] 13 15 17
Or if you want a data.frame
data.frame(a = add + 3, b = add + 6, c = (add + 3)*2 + 3)
# a b c
# 1 5 8 13
# 2 6 9 15
# 3 7 10 17
Though in general, when you are trying to do something like that, it is better to create a function, for example
myfunc <- function(x) {
a <- x + 3
b <- a + 3
z <- a + b
z
}
myfunc(add)
## [1] 13 15 17
In cases where a loop is actually needed (unlike in your example) and you want to store its results, it is better to use the *apply family for such tasks. For example, use lapply if you want a list back
res <- lapply(add, myfunc)
res
# [[1]]
# [1] 13
#
# [[2]]
# [1] 15
#
# [[3]]
# [1] 17
Or use sapply if you want a vector back
res <- sapply(add, myfunc)
res
## [1] 13 15 17
For a data.frame to keep all the info
add <- c( 2,3,4)
results <- data.frame()
for (i in add){
a <- i +3
b <- a + 3
z <- a + b
#print(z)
results <- rbind(results, cbind(a,b,z))
}
results
a b z
1 5 8 13
2 6 9 15
3 7 10 17
If you just want z then use a vector, no need for lists
add <- c( 2,3,4)
results <- vector()
for (i in add){
a <- i +3
b <- a + 3
z <- a + b
#print(z)
results <- c(results, z)
}
results
[1] 13 15 17
It might be instructive to compare these two results with those of @dugar:
> sapply(add, function(x) c(a=x+3, b=a+3, z=a+b) )
[,1] [,2] [,3]
a 5 6 7
b 10 10 10
z 17 17 17
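That is not what the loop computed. Inside c(), a=x+3 only names an element of the result; it does not create a variable a. So the a and b in the later expressions are looked up outside the function and pick up the values left over from the earlier for loop (a = 7, b = 10). This sometimes trips us up when computing with intermediate values. This next one should give a more expected result: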
> sapply(add, function(x) c(a=x+3, b=(x+3)+3, z=(x+3)+((x+3)+3)) )
[,1] [,2] [,3]
a 5 6 7
b 8 9 10
z 13 15 17
Those results are the transpose of @dugar's. Using sapply or lapply often saves you the effort of setting up a zeroth-case object and then incrementing counters.
> lapply(add, function(x) c(a=x+3, b=(x+3)+3, z=(x+3)+((x+3)+3)) )
[[1]]
a b z
5 8 13
[[2]]
a b z
6 9 15
[[3]]
a b z
7 10 17
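If repeating x+3 in every expression feels clumsy, the intermediate values can be assigned as local variables inside the function body before building the result. A minimal sketch:

```r
add <- c(2, 3, 4)

res <- sapply(add, function(x) {
  a <- x + 3          # local to this call, unlike names given inside c()
  b <- a + 3
  c(a = a, b = b, z = a + b)
})

t(res)  # one row per input value, columns a, b and z
```

Because a and b are now real local variables, the result no longer depends on whatever a and b happen to exist in the global environment.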
I have to split a vector into n chunks of equal size in R. I couldn't find any base function to do that. Also Google didn't get me anywhere. Here is what I came up with so far;
x <- 1:10
n <- 3
chunk <- function(x,n) split(x, factor(sort(rank(x)%%n)))
chunk(x,n)
$`0`
[1] 1 2 3
$`1`
[1] 4 5 6 7
$`2`
[1] 8 9 10
A one-liner splitting d into chunks of size 20:
split(d, ceiling(seq_along(d)/20))
More details: I think all you need is seq_along(), split() and ceiling():
> d <- rpois(73,5)
> d
[1] 3 1 11 4 1 2 3 2 4 10 10 2 7 4 6 6 2 1 1 2 3 8 3 10 7 4
[27] 3 4 4 1 1 7 2 4 6 0 5 7 4 6 8 4 7 12 4 6 8 4 2 7 6 5
[53] 4 5 4 5 5 8 7 7 7 6 2 4 3 3 8 11 6 6 1 8 4
> max <- 20
> x <- seq_along(d)
> d1 <- split(d, ceiling(x/max))
> d1
$`1`
[1] 3 1 11 4 1 2 3 2 4 10 10 2 7 4 6 6 2 1 1 2
$`2`
[1] 3 8 3 10 7 4 3 4 4 1 1 7 2 4 6 0 5 7 4 6
$`3`
[1] 8 4 7 12 4 6 8 4 2 7 6 5 4 5 4 5 5 8 7 7
$`4`
[1] 7 6 2 4 3 3 8 11 6 6 1 8 4
chunk2 <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
A simplified version:
n = 3
split(x, sort(x%%n))
NB: This will only work on numeric vectors.
Using base R's rep_len:
x <- 1:10
n <- 3
split(x, rep_len(1:n, length(x)))
# $`1`
# [1] 1 4 7 10
#
# $`2`
# [1] 2 5 8
#
# $`3`
# [1] 3 6 9
And as already mentioned if you want sorted indices, simply:
split(x, sort(rep_len(1:n, length(x))))
# $`1`
# [1] 1 2 3 4
#
# $`2`
# [1] 5 6 7
#
# $`3`
# [1] 8 9 10
Try the ggplot2 function, cut_number:
library(ggplot2)
x <- 1:10
n <- 3
cut_number(x, n) # labels = FALSE if you just want an integer result
#> [1] [1,4] [1,4] [1,4] [1,4] (4,7] (4,7] (4,7] (7,10] (7,10] (7,10]
#> Levels: [1,4] (4,7] (7,10]
# if you want it split into a list:
split(x, cut_number(x, n))
#> $`[1,4]`
#> [1] 1 2 3 4
#>
#> $`(4,7]`
#> [1] 5 6 7
#>
#> $`(7,10]`
#> [1] 8 9 10
This will split it differently to what you have, but is still quite a nice list structure I think:
chunk.2 <- function(x, n, force.number.of.groups = TRUE, len = length(x), groups = trunc(len/n), overflow = len%%n) {
if(force.number.of.groups) {
f1 <- as.character(sort(rep(1:n, groups)))
f <- as.character(c(f1, rep(n, overflow)))
} else {
f1 <- as.character(sort(rep(1:groups, n)))
f <- as.character(c(f1, rep("overflow", overflow)))
}
g <- split(x, f)
if(force.number.of.groups) {
g.names <- names(g)
g.names.ordered <- as.character(sort(as.numeric(g.names)))
} else {
g.names <- names(g[-length(g)])
g.names.ordered <- as.character(sort(as.numeric(g.names)))
g.names.ordered <- c(g.names.ordered, "overflow")
}
return(g[g.names.ordered])
}
Which will give you the following, depending on how you want it formatted:
> x <- 1:10; n <- 3
> chunk.2(x, n, force.number.of.groups = FALSE)
$`1`
[1] 1 2 3
$`2`
[1] 4 5 6
$`3`
[1] 7 8 9
$overflow
[1] 10
> chunk.2(x, n, force.number.of.groups = TRUE)
$`1`
[1] 1 2 3
$`2`
[1] 4 5 6
$`3`
[1] 7 8 9 10
Running a couple of timings using these settings:
set.seed(42)
x <- rnorm(1:1e7)
n <- 3
Then we have the following results:
> system.time(chunk(x, n)) # your function
user system elapsed
29.500 0.620 30.125
> system.time(chunk.2(x, n, force.number.of.groups = TRUE))
user system elapsed
5.360 0.300 5.663
Note: Changing as.factor() to as.character() made my function twice as fast.
If you don't like split() and you don't like matrix() (with its dangling NAs), there's this:
chunk <- function(x, n) {
  starts <- seq.int(from = 1, to = length(x), by = n)
  ends <- pmin(starts + (n - 1), length(x))
  mapply(function(a, b) x[a:b], starts, ends, SIMPLIFY = FALSE)
}
Here n is the chunk size, not the number of chunks. Like split(), it returns a list, but it doesn't waste time or space with labels, so it may be more performant.
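To make the behaviour concrete, here is a small usage sketch of the mapply() approach on 1:10 with a chunk size of 3. The function body follows the answer above; the variable names starts, ends and res are only illustrative.

```r
# Split x into consecutive pieces of at most n elements each (n = chunk size).
chunk <- function(x, n) {
  starts <- seq.int(from = 1, to = length(x), by = n)
  ends <- pmin(starts + (n - 1), length(x))   # cap the last chunk at length(x)
  mapply(function(a, b) x[a:b], starts, ends, SIMPLIFY = FALSE)
}

res <- chunk(1:10, 3)
length(res)   # 4 chunks
res[[4]]      # the short tail: 10
```

The result is an unnamed list, so chunks are accessed by position (res[[1]], res[[2]], ...) rather than by label.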
A few more variants to the pile...
> x <- 1:10
> n <- 3
Note that you don't need the factor function here, but you still want to sort; otherwise the elements are dealt out round-robin and your first vector would be 3 6 9:
> chunk <- function(x, n) split(x, sort(rank(x) %% n))
> chunk(x,n)
$`0`
[1] 1 2 3
$`1`
[1] 4 5 6 7
$`2`
[1] 8 9 10
Or you can assign character indices instead of the numbers in backticks above:
> my.chunk <- function(x, n) split(x, sort(rep(letters[1:n], each=n, len=length(x))))
> my.chunk(x, n)
$a
[1] 1 2 3 4
$b
[1] 5 6 7
$c
[1] 8 9 10
Or you can use plain-word names stored in a vector. Note that using sort to get consecutive values in x alphabetizes the labels:
> my.other.chunk <- function(x, n) split(x, sort(rep(c("tom", "dick", "harry"), each=n, len=length(x))))
> my.other.chunk(x, n)
$dick
[1] 1 2 3
$harry
[1] 4 5 6
$tom
[1] 7 8 9 10
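If the alphabetical reordering is unwanted, one possible workaround (a sketch, not part of the answer above) is to sort integer chunk ids first and attach the names afterwards through the labels of a factor. The names named.chunk and chunk_names are made up for this illustration.

```r
# Hypothetical helper: one chunk per name, in the order the names are given.
named.chunk <- function(x, chunk_names) {
  # Sorted integer ids keep consecutive values of x together: 1 1 1 1 2 2 2 3 3 3
  ids <- sort(rep_len(seq_along(chunk_names), length(x)))
  # Factor levels follow the id order, so the labels are not alphabetized
  split(x, factor(ids, levels = seq_along(chunk_names), labels = chunk_names))
}

res <- named.chunk(1:10, c("tom", "dick", "harry"))
names(res)   # "tom" "dick" "harry"
res$tom      # 1 2 3 4
```

Because split() orders its output by factor level rather than by label text, the chunks come back in the order the names were supplied.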
Yet another possibility is the splitIndices function from package parallel:
library(parallel)
splitIndices(20, 3)
Gives:
[[1]]
[1] 1 2 3 4 5 6 7
[[2]]
[1] 8 9 10 11 12 13
[[3]]
[1] 14 15 16 17 18 19 20
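NB: splitIndices() only returns index positions; it does not split the values themselves. To split a character vector (or any other vector), index it with the result: lapply(splitIndices(20, 3), \(x) letters[1:20][x]) (the \(x) shorthand needs R 4.1 or later; use function(x) otherwise).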
You could combine the split/cut, as suggested by mdsummer, with quantile to create even groups:
split(x,cut(x,quantile(x,(0:n)/n), include.lowest=TRUE, labels=FALSE))
This gives the same result for your example, but not for skewed variables.
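A small sketch of what "skewed" means here, assuming the comparison is against cut(x, n) with equal-width intervals. The data vector is made up for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)   # one extreme value
n <- 3

# Equal-width intervals: almost everything lands in the first interval
by_width <- split(x, cut(x, n, labels = FALSE))

# Quantile-based breaks: groups hold roughly the same number of values
by_quantile <- split(x, cut(x, quantile(x, (0:n)/n), include.lowest = TRUE, labels = FALSE))

lengths(by_width)      # 9 and 1 (the empty middle interval is dropped)
lengths(by_quantile)   # 4, 3, 3
```

Note that both variants cut on the values of x rather than on their positions, so the grouping follows the sort order of x, not the order in which the elements appear.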
split(x,matrix(1:n,n,length(x))[1:length(x)])
Perhaps this is clearer; it is the same idea:
split(x,rep(1:n, ceiling(length(x)/n),length.out = length(x)))
If you want it ordered, throw a sort around the grouping vector.
Here's another variant.
NOTE: in this version the second parameter is the CHUNK SIZE, not the number of chunks.
All chunks have the same size, except possibly the last;
the last is at worst smaller, never bigger, than the chunk size.
chunk <- function(x,n)
{
f <- sort(rep(1:(trunc(length(x)/n)+1),n))[1:length(x)]
return(split(x,f))
}
#Test
v <- 1:11
chunks <- chunk(v, 5)
# cat() has no collapse argument, so join each chunk with paste() first
cat(paste(sapply(chunks, paste, collapse = ","), collapse = "|"), "\n")
#output
1,2,3,4,5|6,7,8,9,10|11
I needed the same function and read the previous solutions, but I also needed the unbalanced chunk to come last: if I have 10 elements to split into vectors of 3 each, the result should have vectors with 3, 3 and 4 elements respectively. So I used the following (I left the code unoptimised for readability; otherwise there is no need for so many variables):
chunk <- function(x,n){
numOfVectors <- floor(length(x)/n)
elementsPerVector <- c(rep(n,numOfVectors-1),n+length(x) %% n)
elemDistPerVector <- rep(1:numOfVectors,elementsPerVector)
split(x,factor(elemDistPerVector))
}
set.seed(1)
x <- rnorm(10)
n <- 3
chunk(x,n)
$`1`
[1] -0.6264538 0.1836433 -0.8356286
$`2`
[1] 1.5952808 0.3295078 -0.8204684
$`3`
[1] 0.4874291 0.7383247 0.5757814 -0.3053884
A simple function that splits a vector into n parts using nothing but indexes; no need to overcomplicate this:
vsplit <- function(v, n) {
l = length(v)
r = l/n
return(lapply(1:n, function(i) {
s = max(1, round(r*(i-1))+1)
e = min(l, round(r*i))
return(v[s:e])
}))
}
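A quick usage sketch for vsplit(), with the function repeated so the snippet runs on its own. Here n is the number of chunks, and the chunk boundaries come from rounding multiples of length(v)/n.

```r
vsplit <- function(v, n) {
  l <- length(v)
  r <- l / n                              # ideal (fractional) chunk length
  lapply(1:n, function(i) {
    s <- max(1, round(r * (i - 1)) + 1)   # first index of chunk i
    e <- min(l, round(r * i))             # last index of chunk i
    v[s:e]
  })
}

parts <- vsplit(1:10, 3)
lengths(parts)   # 3, 4, 3
```

This appears to assume n <= length(v): with more chunks than elements, s can exceed e, and s:e then counts backwards instead of producing an empty chunk.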
Sorry if this answer comes so late, but maybe it can be useful for someone else. Actually there is a very useful solution to this problem, explained at the end of ?split.
> testVector <- c(1:10) #I want to divide it into 5 parts
> VectorList <- split(testVector, 1:5)
> VectorList
$`1`
[1] 1 6
$`2`
[1] 2 7
$`3`
[1] 3 8
$`4`
[1] 4 9
$`5`
[1] 5 10
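Note that recycling 1:5 deals the elements out round-robin. If neighbouring elements should stay together, a minimal variation (a sketch, assuming the length divides evenly by the number of parts) repeats each group id instead. The names interleaved and contiguous are only illustrative.

```r
testVector <- 1:10

# Recycled ids 1:5 give interleaved parts: (1 6), (2 7), ...
interleaved <- split(testVector, 1:5)

# Repeating each id keeps neighbours together: (1 2), (3 4), ...
contiguous <- split(testVector, rep(1:5, each = 2))

interleaved[[1]]   # 1 6
contiguous[[1]]    # 1 2
```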
Credit to @Sebastian for this function, which splits the rows of a data frame into y chunks:
chunk <- function(x,y){
split(x, factor(sort(rank(row.names(x))%%y)))
}
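A short usage sketch on a made-up data frame, to show that this version chunks rows rather than vector elements. The data frame df and the name parts are illustrative only.

```r
chunk <- function(x, y) {
  # rank() of the row names yields one distinct rank per row; after %% y and
  # sort() the ids run 0,0,...,1,1,..., so consecutive rows share a chunk
  split(x, factor(sort(rank(row.names(x)) %% y)))
}

df <- data.frame(id = 1:10, value = letters[1:10])
parts <- chunk(df, 3)

sapply(parts, nrow)   # rows per chunk: 3, 4, 3
parts[["0"]]$id       # 1 2 3
```

Each element of parts is itself a data frame, so the usual column access works on every chunk.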
If you don't like split() and you don't mind NAs padding out your short tail:
chunk <- function(x, n) {
  if ((length(x) %% n) == 0) {
    return(matrix(x, nrow = n))
  } else {
    return(matrix(append(x, rep(NA, n - (length(x) %% n))), nrow = n))
  }
}
The columns of the returned matrix ([,1:ncol]) are the droids you are looking for.
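For illustration, here is what the matrix looks like for 1:10 with a chunk size of 3, plus one possible way (an addition, not part of the answer) to get back a list without the padding. The names m and chunks are made up.

```r
chunk <- function(x, n) {
  if ((length(x) %% n) == 0) {
    matrix(x, nrow = n)
  } else {
    # pad with NAs so the length becomes a multiple of n
    matrix(append(x, rep(NA, n - (length(x) %% n))), nrow = n)
  }
}

m <- chunk(1:10, 3)
dim(m)    # 3 rows, 4 columns: one column per chunk
m[, 4]    # last chunk, padded: 10 NA NA

# Optional: turn the columns into a list and drop the padding.
# This also drops any genuine NAs in x, so use with care.
chunks <- lapply(seq_len(ncol(m)), function(j) {
  col <- m[, j]
  col[!is.na(col)]
})
```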
I need a function that takes the name of a data.table (in quotes) and a second argument giving the upper limit on the number of rows in each subset of that data.table. It produces as many data.tables as that upper limit requires:
library(data.table)
split_dt <- function(x,y)
{
  tbl <- get(x)
  for(i in seq(from=1,to=nrow(tbl),by=y))
  {
    # rows i..(i + y - 1), capped at the last row, so chunks neither overlap nor get padded with NAs
    assign(paste0("df_",i),tbl[i:min(i + y - 1, nrow(tbl))],envir=globalenv())
  }
}
This function gives me a series of data.tables named df_[number], with the starting row from the original data.table in the name. The last data.table can be shorter than the limit. This type of function is useful because certain GIS software has limits on how many address pins you can import, for example. So slicing up data.tables into smaller chunks may not be recommended, but it may not be avoidable.
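If creating objects in the global environment is not a requirement, a list-returning alternative may be easier to work with. The sketch below uses a plain data.frame and a hypothetical helper name, split_rows; the same call should also work on a data.table, since data.table provides its own split() method, but that is not tested here.

```r
# Hypothetical helper: split a data.frame into a list of pieces with at most
# max_rows rows each, instead of creating df_1, df_2, ... as global objects.
split_rows <- function(df, max_rows) {
  # Row i goes to chunk ceiling(i / max_rows): 1,1,...,2,2,...
  chunk_id <- ceiling(seq_len(nrow(df)) / max_rows)
  split(df, chunk_id)
}

df <- data.frame(id = 1:10, value = letters[1:10])
pieces <- split_rows(df, 4)
sapply(pieces, nrow)   # chunk sizes: 4, 4, 2
```

The pieces can then be written out one by one, for example with lapply() or a for loop over names(pieces).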
I have come up with this solution:
require(magrittr)
create.chunks <- function(x, elements.per.chunk){
# plain R version
# split(x, rep(seq_along(x), each = elements.per.chunk)[seq_along(x)])
# magrittr version - because that's what people use now
x %>% seq_along %>% rep(., each = elements.per.chunk) %>% extract(seq_along(x)) %>% split(x, .)
}
create.chunks(letters[1:10], 3)
$`1`
[1] "a" "b" "c"
$`2`
[1] "d" "e" "f"
$`3`
[1] "g" "h" "i"
$`4`
[1] "j"
The key is to use the each argument of rep() (each = elements.per.chunk) to make it work. Using seq_along acts like rank(x) in my previous solution, but is actually able to produce the correct result with duplicated entries.
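To see why duplicated entries matter, here is a small comparison, assuming the "previous solution" refers to the rank()-based chunk from the question. The vector is made up, and note that n means the number of chunks in the rank-based version but the elements per chunk in the seq_along-based one.

```r
x <- c(5, 5, 5, 1, 1, 1, 2, 2, 2, 9)   # many ties
n <- 3

# rank() gives tied values the same (averaged) rank, so the group ids collapse
rank_based <- split(x, factor(sort(rank(x) %% n)))

# seq_along() depends only on position, so ties in the values do not matter
position_based <- split(x, rep(seq_along(x), each = n)[seq_along(x)])

lengths(rank_based)       # 1 and 9: badly unbalanced
lengths(position_based)   # 3, 3, 3, 1
```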
Here's yet another one, allowing you to control if you want the result ordered or not:
split_to_chunks <- function(x, n, keep.order=TRUE){
if(keep.order){
return(split(x, sort(rep(1:n, length.out = length(x)))))
}else{
return(split(x, rep(1:n, length.out = length(x))))
}
}
split_to_chunks(x = 1:11, n = 3)
$`1`
[1] 1 2 3 4
$`2`
[1] 5 6 7 8
$`3`
[1] 9 10 11
split_to_chunks(x = 1:11, n = 3, keep.order=FALSE)
$`1`
[1] 1 4 7 10
$`2`
[1] 2 5 8 11
$`3`
[1] 3 6 9
Not sure if this answers the OP's question, but I think %% can be useful here:
df # some data.frame
N_CHUNKS <- 10
I_VEC <- 1:nrow(df)
df_split <- split(df, sort(I_VEC %% N_CHUNKS))
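A runnable version of the same idea on a made-up 25-row data frame, split into 4 chunks instead of 10 to keep the output short.

```r
df <- data.frame(id = 1:25)      # stand-in for "some data.frame"
N_CHUNKS <- 4
I_VEC <- seq_len(nrow(df))

# Sorting the remainders turns 1,2,3,0,1,2,... into 0,0,...,1,1,...,
# so each chunk holds consecutive rows
df_split <- split(df, sort(I_VEC %% N_CHUNKS))

sapply(df_split, nrow)   # rows per chunk: 6, 7, 6, 6
df_split[["1"]]$id       # rows 7 to 13
```

The chunks are named after the remainders ("0", "1", ...), and their sizes differ by at most one row.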
This splits into chunks of size ⌊n/k⌋+1 or ⌊n/k⌋ and does not use the O(n log n) sort.
get_chunk_id<-function(n, k){
r <- n %% k
s <- n %/% k
i<-seq_len(n)
1 + ifelse (i <= r * (s+1), (i-1) %/% (s+1), r + ((i - r * (s+1)-1) %/% s))
}
split(1:10, get_chunk_id(10,3))
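As a quick check of the formula, the sketch below repeats the function and looks at the resulting chunk sizes for two small cases. The variable names res and sizes are only illustrative.

```r
get_chunk_id <- function(n, k) {
  r <- n %% k     # number of chunks that get one extra element
  s <- n %/% k    # base chunk size
  i <- seq_len(n)
  # the first r chunks have s + 1 elements, the remaining k - r have s
  1 + ifelse(i <= r * (s + 1), (i - 1) %/% (s + 1), r + ((i - r * (s + 1) - 1) %/% s))
}

res <- split(1:10, get_chunk_id(10, 3))
lengths(res)    # 4, 3, 3

sizes <- lengths(split(1:11, get_chunk_id(11, 4)))
sizes           # 3, 3, 3, 2
```

The larger chunks come first, and every element keeps its original order because the ids are non-decreasing by construction.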