Memoize and vectorize a custom function - r

I want to know how to vectorize and memoize a custom function in R. It seems
my way of thinking is not aligned with R's way of operation. So, I gladly
welcome any links to good reading material. For example, R inferno is a nice
resource, but it didn't help to figure out memoization in R.
More generally, can you provide a relevant usage example for the memoise
or R.cache packages?
I haven't been able to find any other discussions on this subject. Searching
for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
for those keywords at http://r-project.markmail.org/ does not return helpful
discussions. I emailed the mailing list and did not receive a complete
answer.
I am not solely interested in memoizing the GC function, and I am aware of
Bioconductor and the various packages
available there.
Here's my data:
seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")
Some sequences are missing, so they're blank "".
I have a function for calculating GC content:
> GC <- function(s) {
if (!is.character(s)) return(NA)
n <- nchar(s)
if (n == 0) return(NA)
m <- gregexpr('[GCSgcs]', s)[[1]]
if (m[1] < 1) return(0)
return(100.0 * length(m) / n)
}
It works:
> GC('')
[1] NA
> GC('G')
[1] 100
> GC('GAG')
[1] 66.66667
> sapply(seqs, GC)
                  G         C       CCC         T               TTCCT
       NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000        NA
        C       CTC
100.00000  66.66667
I want to memoize it. Then, I want to vectorize it.
Apparently, I must have the wrong mindset for using the memoise or
R.cache R packages:
> system.time(dummy <- sapply(rep(seqs,100), GC))
user system elapsed
0.044 0.000 0.054
>
> library(memoise)
> GCm1 <- memoise(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm1))
user system elapsed
0.164 0.000 0.173
>
> library(R.cache)
> GCm2 <- addMemoization(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm2))
user system elapsed
10.601 0.252 10.926
Notice that the memoized functions are slower, not faster: memoise by roughly a factor of 3, and R.cache by about two orders of magnitude.
I tried the hash package, but things seem to be happening behind the
scenes and I don't understand the output. The sequence C should have a
value of 100, not NULL.
Note that using has.key(s, cache) instead of exists(s, cache) results
in the same output. Also, using cache[s] <<- result instead of
cache[[s]] <<- result results in the same output.
> cache <- hash()
> GCc <- function(s) {
if (!is.character(s) || nchar(s) == 0) {
return(NA)
}
if(exists(s, cache)) {
return(cache[[s]])
}
result <- GC(s)
cache[[s]] <<- result
return(result)
}
> sapply(seqs,GCc)
[[1]]
[1] NA
$G
[1] 100
$C
NULL
$CCC
[1] 100
$T
NULL
[[6]]
[1] NA
$TTCCT
[1] 40
[[8]]
[1] NA
$C
NULL
$CTC
[1] 66.66667
At least I figured out how to vectorize:
> GCv <- Vectorize(GC)
> GCv(seqs)
                  G         C       CCC         T               TTCCT
       NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000        NA
        C       CTC
100.00000  66.66667
Relevant stackoverflow posts:
Options for caching / memoization / hashing in R

While this won't give you memoization across calls, you can use factors to make individual calls a lot faster if there is a fair bit of repetition. For example, using Joshua's GC2 (though I had to remove fixed=T to get it to work):
GC2 <- function(s) {
if(!is.character(s)) stop("'s' must be character")
n <- nchar(s)
m <- gregexpr('[GCSgcs]', s)
len <- sapply(m, length)
neg <- sapply(m, "[[", 1)
len <- len*(neg > 0)
100.0 * len/n
}
One can easily define a wrapper like:
GC3 <- function(s) {
x <- factor(s)
GC2(levels(x))[x]
}
system.time(GC2(rep(seqs, 50000)))
# user system elapsed
# 8.97 0.00 8.99
system.time(GC3(rep(seqs, 50000)))
# user system elapsed
# 0.06 0.00 0.06
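For memoization that does persist across calls, one option is to keep results in an environment and look them up by the input string. The following is a minimal sketch, not a benchmark: `make_memo` and `GC_memo` are hypothetical names introduced here, and `GC` is the scalar function from the question. The lookup passes `inherits = FALSE` so that keys such as "C" or "T" can only match entries in the cache itself and never base R objects of the same name.

```r
# GC as defined in the question (expects a single string).
GC <- function(s) {
  if (!is.character(s)) return(NA)
  n <- nchar(s)
  if (n == 0) return(NA)
  m <- gregexpr("[GCSgcs]", s)[[1]]
  if (m[1] < 1) return(0)
  100.0 * length(m) / n
}

# Hypothetical helper: wrap a function of one string with an environment cache.
make_memo <- function(f) {
  cache <- new.env(hash = TRUE, parent = emptyenv())
  function(s) {
    # Blank or non-string inputs cannot serve as keys, so bypass the cache.
    if (!is.character(s) || length(s) != 1L || is.na(s) || !nzchar(s)) {
      return(f(s))
    }
    if (exists(s, envir = cache, inherits = FALSE)) {
      return(get(s, envir = cache, inherits = FALSE))
    }
    val <- f(s)
    assign(s, val, envir = cache)
    val
  }
}

GC_memo <- make_memo(GC)
seqs <- c("", "G", "C", "CCC", "T", "", "TTCCT", "", "C", "CTC")
res <- sapply(seqs, GC_memo)
```

Whether this beats calling GC directly depends on how expensive the wrapped function is; for something as cheap as GC, the lookup overhead may well dominate, which would be consistent with the timings in the question.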

This doesn't explicitly answer your question, but this function is ~4 times faster than yours.
GC2 <- function(s) {
if(!is.character(s)) stop("'s' must be character")
n <- nchar(s)
m <- gregexpr('[GCSgcs]', s)
len <- sapply(m, length)
neg <- sapply(m, "[[", 1)
len <- len*(neg > 0)
100.0 * len/n
}
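The two sapply calls can also be avoided by counting matches with vectorized string functions. This is a sketch under the assumption that blank sequences should map to NA and that the result is a percentage, as in the original GC; `GC4` is a hypothetical name and no timing is claimed for it.

```r
# Hypothetical vectorized variant: delete every character that is not
# G, C or S and compare the remaining length with the original length.
GC4 <- function(s) {
  if (!is.character(s)) stop("'s' must be character")
  n <- nchar(s)
  hits <- nchar(gsub("[^GCSgcs]", "", s))
  out <- 100 * hits / n
  out[n == 0] <- NA   # blank sequences would otherwise give NaN (0/0)
  out
}

seqs <- c("", "G", "C", "CCC", "T", "", "TTCCT", "", "C", "CTC")
gc4_result <- GC4(seqs)
```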

Related

Are "self-contained" functions more efficient in R?

I'm writing a function that needs to call a function g passed as a parameter to each element of a list, iteratively.
I'm wondering how to make this the fastest possible. I can achieve an acceptable speed using Rcpp and specific kind of g (writing everything in Cpp), but I can't figure out if I can reach similar speed passing an R function as argument.
Was doing some tests to figure out why R is slower and found some really unexpected results:
minus <- function(x) -x
minus_vec <- Vectorize(minus, "x")
Testing with some simple functions to invert signs.
f0 <- function(x) {
sapply(x, minus)
}
f1 <- function(x) {
for(i in seq_along(x)){
x[i] <- -x[i]
}
x
}
f2 <- function(x) {
for(i in seq_along(x)){
x[i] <- minus(x[i])
}
x
}
I got the following results:
a <- 1:10^5
library(rbenchmark)
benchmark(f0(a), f1(a), f2(a), minus_vec(a), minus(a))[,c(1,4)]
test relative
1 f0(a) 454.842
2 f1(a) 25.579
3 f2(a) 178.211
4 minus_vec(a) 523.789
5 minus(a) 1.000
I would like some explanation on the following points:
Why don't f1 and f2 have the same speed? Should writing the piece of code -x[i] and calling the function minus(x[i]) really be so different when they do the exact same thing?
Why is f0 slower than f2? I always thought apply functions were more efficient than for loops, but never really understood why, and now I have even found a counter-example.
Can I make a function as fast as f1 using the function minus?
Why does vectorizing minus (unnecessary since - is already vectorized, but that might not always be the case) make it so slow?
Not a full answer, but here are a few notes
1 minus(x) vs -x: Doing nothing is better than doing something
Your function minus calls `-`, so the extra function call adds computation time. I honestly do not know the specifics, so I can't say how much additional computation time ought to be expected.
Here is an example highlighting it: we have four functions, all squaring numbers
fa <- function (n) n^2
fb <- function (n) fa(n)
fc <- function (n) fb(n)
fd <- function (n) fc(n)
Fa <- function (n) {
for (i in seq_along(n)) n[i] <- fa(i)
n
}
Fb <- function (n) {
for (i in seq_along(n)) n[i] <- fb(i)
n
}
Fc <- function (n) {
for (i in seq_along(n)) n[i] <- fc(i)
n
}
Fd <- function (n) {
for (i in seq_along(n)) n[i] <- fd(i)
n
}
And here are the benchmarking results
n <- 1:10^4
b <- benchmark(Fa(n),Fb(n),Fc(n),Fd(n), replications = 1000L)
b
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 Fa(n) 1000 3.93 1.000 3.85 0.00 NA NA
# 2 Fb(n) 1000 7.08 1.802 6.94 0.02 NA NA
# 3 Fc(n) 1000 10.16 2.585 9.94 0.06 NA NA
# 4 Fd(n) 1000 13.68 3.481 13.56 0.00 NA NA
# looks rather even
diff(b$elapsed)
# [1] 3.15 3.08 3.52
Now back to your minus function
a <- 1:10^5
b <- benchmark(f0(a), f1(a), f2(a), minus_vec(a), minus(a))
b$elapsed[b$test == 'f2(a)'] - b$elapsed[b$test == 'f1(a)']
# [1] 3.39
2 apply vs for vs Vectorize:
@NavyCheng provided some good material on the topic. My understanding is that the apply family (just like Vectorize) loops in R, whereas, if I'm not mistaken, the looping for `-` is done in C.
Again, I do not know the exact details, but if apply/Vectorize use R loops, then in theory (and often in practice) it is possible to write a proper for loop that performs as well or better.
3 A Function as fast as f1:
Ad hoc, the closest I came was by cheating with the Rcpp package (cheating since one writes the function in C++ first).
In C++
#include <RcppArmadillo.h>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector minusCpp(NumericVector x) {
for (int k = 0; k < x.length(); ++k) {
x[k] = -x[k];
}
return x;
}
Now to the benchmarks in R
a <- 1:10^5
b <- benchmark(f0(a), f1(a), f2(a), minus_vec(a), minus(a), minusCpp(a))
b
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 f0(a) 100 9.47 NA 9.22 0.01 NA NA
# 2 f1(a) 100 0.53 NA 0.54 0.00 NA NA
# 3 f2(a) 100 4.23 NA 4.24 0.00 NA NA
# 5 minus(a) 100 0.00 NA 0.00 0.00 NA NA
# 4 minus_vec(a) 100 10.42 NA 10.39 0.02 NA NA
# 6 minusCpp(a) 100 0.05 NA 0.04 0.00 NA NA
Ignoring the difference between -x[i] and minus(x[i]), I summarize the four questions as two:
Why is the apply family slower than a for loop?
Why is Vectorize slower than the apply family?
For the 1st question:
The apply functions are designed to be convenient and clear to read,
not necessarily fast.
The apply family also does more work than a for loop.
The sapply function first uses as.vector(unlist(...)) to convert anything to a vector, and in the end tries to simplify the answer into a suitable form.
You can read here and here for more detail.
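A small sketch of what the simplification step means in practice, using the `minus` function from the question. lapply returns a list, sapply post-processes that list into a vector, and vapply is told the expected result type up front. The input `x` is an illustration value and no timings are claimed here.

```r
minus <- function(x) -x
x <- c(1, 2, 3)

as_list   <- lapply(x, minus)                          # list of three length-1 vectors
as_vector <- sapply(x, minus)                          # same values, simplified to a numeric vector
checked   <- vapply(x, minus, FUN.VALUE = numeric(1))  # result type declared in advance

# An explicit loop with a preallocated result yields the same values.
looped <- numeric(length(x))
for (i in seq_along(x)) looped[i] <- minus(x[i])
```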
For the 2nd question, it's because Vectorize is a wrapper around mapply; if you type Vectorize in RStudio, you'll see the detailed code. You can read this for more help.
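To make the relationship concrete, the sketch below calls the vectorized wrapper, an explicit mapply, and the plain function on the same input and confirms that they agree. The input `a` is a short illustration vector; the point is only that the first two go through one R-level function call per element, while the last is a single call to `-`.

```r
minus <- function(x) -x
minus_vec <- Vectorize(minus, "x")

a <- 1:5
via_vectorize <- minus_vec(a)        # wrapper built by Vectorize
via_mapply    <- mapply(minus, a)    # one call to minus() per element
direct        <- minus(a)            # a single call; `-` itself is vectorized
```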

Faster way of calculating off-diagonal averages in large matrices

I need to calculate the mean of each off-diagonal element in an n × n matrix. The lower and upper triangles are redundant. Here's the code I'm currently using:
A <- replicate(500, rnorm(500))
sapply(1:(nrow(A)-1), function(x) mean(A[row(A) == (col(A) - x)]))
Which seems to work but does not scale well with larger matrices. The ones I have aren't huge, around 2-5000^2, but even with 1000^2 it's taking longer than I'd like:
A <- replicate(1000, rnorm(1000))
system.time(sapply(1:(nrow(A)-1), function(x) mean(A[row(A) == (col(A) - x)])))
> user system elapsed
> 26.662 4.846 31.494
Is there a smarter way of doing this?
edit To clarify, I'd like the mean of each diagonal independently, e.g. for:
1 2 3 4
1 2 3 4
1 2 3 4
1 2 3 4
I would like:
mean(c(1,2,3))
mean(c(1,2))
mean(1)
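You can get significantly faster results just by extracting the diagonals directly using linear addressing. R stores matrices column by column, so element (r, c) of a matrix with n rows sits at linear index (c-1)*n + r. superdiag here extracts the ith superdiagonal from A (i=1 is the principal diagonal):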
superdiag <- function(A,i) {
n<-nrow(A);
len<-n-i+1;
r <- 1:len;
c <- i:n;
indices<-(c-1)*n+r;
A[indices]
}
superdiagmeans <- function(A) {
sapply(2:nrow(A), function(i){mean(superdiag(A,i))})
}
Running this on a 1K square matrix gives a ~800x speedup:
> A <- replicate(1000, rnorm(1000))
> system.time(sapply(1:(nrow(A)-1), function(x) mean(A[row(A) == (col(A) - x)])))
user system elapsed
26.464 3.345 29.793
> system.time(superdiagmeans(A))
user system elapsed
0.033 0.006 0.039
This gives you results in the same order as the original.
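An equivalent formulation, sketched here under the hypothetical name `superdiag_means_idx`, passes (row, column) pairs as a two-column index matrix and lets R do the address computation. It assumes a square matrix and is only checked against the original expression on the small 4 x 4 example; it is not timed.

```r
superdiag_means_idx <- function(A) {
  n <- nrow(A)
  vapply(seq_len(n - 1), function(k) {
    r <- seq_len(n - k)          # rows 1..(n - k)
    mean(A[cbind(r, r + k)])     # elements (r, r + k) form the k-th superdiagonal
  }, numeric(1))
}

A <- matrix(rep(c(1, 2, 3, 4), 4), ncol = 4)
original <- sapply(1:(nrow(A) - 1), function(x) mean(A[row(A) == (col(A) - x)]))
indexed  <- superdiag_means_idx(A)
```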
You can use the following function :
diagmean <- function(x){
id <- row(x) - col(x)
sol <- tapply(x,id,mean)
sol[names(sol)!='0']
}
If we check this on your matrix, the speed gain is substantial:
> system.time(diagmean(A))
user system elapsed
2.58 0.00 2.58
> system.time(sapply(1:(nrow(A)-1), function(x) mean(A[row(A) == (col(A) - x)])))
user system elapsed
38.93 4.01 42.98
Note that this function calculates both upper and lower triangles. You can calculate e.g. only the upper triangle (where row - col is negative, as in your code) using:
diagmean <- function(A){
id <- row(A) - col(A)
id[id>=0] <- NA
tapply(A,id,mean)
}
This results in another speed gain. Note that the solution will be reversed compared to yours:
> A <- matrix(rep(c(1,2,3,4),4),ncol=4)
> sapply(1:(nrow(A)-1), function(x) mean(A[row(A) == (col(A) - x)]))
[1] 2.0 1.5 1.0
> diagmean(A)
-3 -2 -1
1.0 1.5 2.0

What's the fastest way to apply t.test to each column of a large matrix?

Suppose I have a large matrix:
M <- matrix(rnorm(1e7),nrow=20)
Further suppose that each column represents a sample. Say I would like to apply t.test() to each column, is there a way to do this that is much faster than using apply()?
apply(M, 2, t.test)
It took slightly less than 2 minutes to run the analysis on my computer:
> system.time(invisible( apply(M, 2, t.test)))
user system elapsed
113.513 0.663 113.519
You can do better than this with the colttests function from the genefilter package (on Bioconductor).
> library(genefilter)
> M <- matrix(rnorm(40),nrow=20)
> my.t.test <- function(c){
+ n <- sqrt(length(c))
+ mean(c)*n/sd(c)
+ }
> x1 <- apply(M, 2, function(c) my.t.test(c))
> x2 <- colttests(M, gl(1, nrow(M)))[,"statistic"]
> all.equal(x1, x2)
[1] TRUE
> M <- matrix(rnorm(1e7), nrow=20)
> system.time(invisible(apply(M, 2, function(c) my.t.test(c))))
user system elapsed
27.386 0.004 27.445
> system.time(invisible(colttests(M, gl(1, nrow(M)))[,"statistic"]))
user system elapsed
0.412 0.000 0.414
Ref: "Computing thousands of test statistics simultaneously in R", SCGN, Vol 18 (1), 2007, http://stat-computing.org/newsletter/issues/scgn-18-1.pdf.
If you have a multicore machine there are some gains from using all the cores, for example using mclapply.
> library(parallel)  # mclapply is provided by 'parallel' in current R (formerly the 'multicore' package)
> M <- matrix(rnorm(40),nrow=20)
> x1 <- apply(M, 2, t.test)
> x2 <- mclapply(1:dim(M)[2], function(i) t.test(M[,i]))
> all.equal(x1, x2)
[1] "Component 1: Component 9: 1 string mismatch" "Component 2: Component 9: 1 string mismatch"
# str(x1) and str(x2) show that the difference is immaterial
This mini-example shows that things go as we planned. Now scale up:
> M <- matrix(rnorm(1e7), nrow=20)
> system.time(invisible(apply(M, 2, t.test)))
user system elapsed
101.346 0.626 101.859
> system.time(invisible(mclapply(1:dim(M)[2], function(i) t.test(M[,i]))))
user system elapsed
55.049 2.527 43.668
This is using 8 virtual cores. Your mileage may vary. Not a huge gain, but it comes from very little effort.
EDIT
If you only care about the t-statistic itself, extracting the corresponding field ($statistic) makes things a bit faster, in particular in the multicore case:
> system.time(invisible(apply(M, 2, function(c) t.test(c)$statistic)))
user system elapsed
80.920 0.437 82.109
> system.time(invisible(mclapply(1:dim(M)[2], function(i) t.test(M[,i])$statistic)))
user system elapsed
21.246 1.367 24.107
Or even faster, compute the t value directly
my.t.test <- function(c){
n <- sqrt(length(c))
mean(c)*n/sd(c)
}
Then
> system.time(invisible(apply(M, 2, function(c) my.t.test(c))))
user system elapsed
21.371 0.247 21.532
> system.time(invisible(mclapply(1:dim(M)[2], function(i) my.t.test(M[,i]))))
user system elapsed
144.161 8.658 6.313
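The direct computation in my.t.test can also be written so that it works on all columns at once with colMeans and colSums instead of calling a function per column. This is a sketch only: `col_t_stat` is a hypothetical name, the small matrix is generated just for the check, and no timing is claimed.

```r
# One-sample t statistic (against mean 0) for every column at once.
col_t_stat <- function(M) {
  n  <- nrow(M)
  mu <- colMeans(M)
  centered <- sweep(M, 2, mu)                  # subtract each column's mean
  s  <- sqrt(colSums(centered^2) / (n - 1))    # column standard deviations
  mu * sqrt(n) / s
}

set.seed(1)
M <- matrix(rnorm(200), nrow = 20)
t_fast <- col_t_stat(M)
t_ref  <- apply(M, 2, function(x) unname(t.test(x)$statistic))
```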

Vectorize a product calculation which depends on previous elements?

I'm trying to speed up/vectorize some calculations in a time series.
Can I vectorize a calculation in a for loop which can depend on results from an earlier iteration? For example:
z <- c(1,1,0,0,0,0)
zi <- 2:6
for (i in zi) {z[i] <- ifelse (z[i-1]== 1, 1, 0) }
uses the z[i] values updated in earlier steps:
> z
[1] 1 1 1 1 1 1
In my effort at vectorizing this
z <- c(1,1,0,0,0,0)
z[zi] <- ifelse( z[zi-1] == 1, 1, 0)
the element-by-element operations don't use results updated in the operation:
> z
[1] 1 1 1 0 0 0
So this vectorized operation operates in 'parallel' rather than iterative fashion. Is there a way I can write/vectorize this to get the results of the for loop?
ifelse is vectorized and there's a bit of a penalty if you're using it on one element at a time in a for-loop. In your example, you can get a pretty good speedup by using if instead of ifelse.
fun1 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- ifelse(z[i-1]==1, 1, 0)
}
z
}
fun2 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- if(z[i-1]==1) 1 else 0
}
z
}
z <- c(1,1,0,0,0,0)
identical(fun1(z),fun2(z))
# [1] TRUE
system.time(replicate(10000, fun1(z)))
# user system elapsed
# 1.13 0.00 1.32
system.time(replicate(10000, fun2(z)))
# user system elapsed
# 0.27 0.00 0.26
You can get some additional speed gains out of fun2 by compiling it.
library(compiler)
cfun2 <- cmpfun(fun2)
system.time(replicate(10000, cfun2(z)))
# user system elapsed
# 0.11 0.00 0.11
So there's a 10x speedup without vectorization. As others have said (and some have illustrated) there are ways to vectorize your example, but that may not translate to your actual problem. Hopefully this is general enough to be applicable.
The filter function may be useful to you as well if you can figure out how to express your problem in terms of an autoregressive or moving average process.
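As an illustration of the filter idea, a first-order recurrence y[i] = x[i] + phi * y[i-1] can be computed by stats::filter with method = "recursive". The coefficient `phi` and the input `x` below are made-up illustration values, and this covers only linear recurrences, so it does not directly reproduce the ifelse logic in the question.

```r
x   <- c(1, 2, 3, 4, 5)
phi <- 0.5

# Explicit loop: each value depends on the previously computed one.
y_loop <- numeric(length(x))
y_loop[1] <- x[1]
for (i in 2:length(x)) y_loop[i] <- x[i] + phi * y_loop[i - 1]

# The same recurrence without an R-level loop.
y_filter <- as.numeric(stats::filter(x, filter = phi, method = "recursive"))
```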
This is a nice and simple example where Rcpp can shine.
So let us first recast functions 1 and 2 and their compiled counterparts:
library(inline)
library(rbenchmark)
library(compiler)
fun1 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- ifelse(z[i-1]==1, 1, 0)
}
z
}
fun1c <- cmpfun(fun1)
fun2 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- if(z[i-1]==1) 1 else 0
}
z
}
fun2c <- cmpfun(fun2)
We can write an Rcpp variant very easily:
funRcpp <- cxxfunction(signature(zs="numeric"), plugin="Rcpp", body="
Rcpp::NumericVector z = Rcpp::NumericVector(zs);
int n = z.size();
for (int i=1; i<n; i++) {
z[i] = (z[i-1]==1.0 ? 1.0 : 0.0);
}
return(z);
")
This uses the inline package to compile, load and link the five-liner on the fly.
Now we can define our test data, which we make a little longer than the original (as running the original too few times results in unmeasurable times):
R> z <- rep(c(1,1,0,0,0,0), 100)
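R> # identical() compares only its first two arguments, so check each result against fun1(z)
R> all(sapply(list(fun2(z), fun1c(z), fun2c(z), funRcpp(z)), identical, fun1(z)))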
[1] TRUE
R>
All answers are seen as identical.
Finally, we can benchmark:
R> res <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications", "elapsed",
+ "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=1000)
R> print(res)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 1000 0.005 1.0 0.01 0
4 fun2c(z) 1000 0.466 93.2 0.46 0
2 fun2(z) 1000 1.918 383.6 1.92 0
3 fun1c(z) 1000 10.865 2173.0 10.86 0
1 fun1(z) 1000 12.480 2496.0 12.47 0
The compiled version wins by a factor of almost 400 against the best R version, and almost 100 against its byte-compiled variant. For function 1, the byte compilation matters much less and both variants trail the C++ by a factor of well over two-thousand.
It took about one minute to write the C++ version. The speed gain suggests it was a minute well spent.
For comparison, here is the result for the original short vector called more often:
R> z <- c(1,1,0,0,0,0)
R> res2 <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications",
+ "elapsed", "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=10000)
R> print(res2)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 10000 0.046 1.000000 0.04 0
4 fun2c(z) 10000 0.132 2.869565 0.13 0
2 fun2(z) 10000 0.271 5.891304 0.27 0
3 fun1c(z) 10000 1.045 22.717391 1.05 0
1 fun1(z) 10000 1.202 26.130435 1.20 0
The qualitative ranking is unchanged: the Rcpp version dominates and function 2 is second-best, with the byte-compiled version being about twice as fast as the plain R variant but still almost three times slower than the C++ version. The relative differences are smaller here: on the short vector the function call overhead matters more and the actual looping matters less, whereas C++ gets a bigger advantage on the actual loop operations in the longer vectors. That is an important result, as it suggests that for more realistically sized data the compiled version may reap a larger benefit.
Edited to correct two small oversights in the code examples. And edited again with thanks to Josh to catch a setup error relative to fun2c.
I think this is cheating and not generalizable, but: according to the rules you have above, any occurrence of 1 in the vector will make all subsequent elements 1 (by recursion: z[i] is set to 1 if z[i-1] equals 1; therefore z[i] will be set to 1 if z[i-2] equals 1; and so forth). Depending on what you really want to do, there may be such a recursive solution available if you think carefully about it ...
z <- c(1,1,0,0,0,0)
first1 <- min(which(z==1))
z[seq_along(z)>first1] <- 1
edit: this is wrong, but I'm leaving it up to admit my mistakes. Based on a little bit of playing (and less thinking), I think the actual solution to this recursion is more symmetric and even simpler:
rep(z[1],length(z))
Test cases:
z <- c(1,1,0,0,0,0)
z <- c(0,1,1,0,0,0)
z <- c(0,0,1,0,0,0)
Check out the rollapply function in zoo.
I'm not super familiar with it, but I think this does what you want:
> c( 1, rollapply(z,2,function(x) x[1]) )
[1] 1 1 1 1 1 1
I'm sort of kludging it by using a window of 2 and then only using the first element of that window.
For more complicated examples you could perform some calculation on x[1] and return that instead.
Sometimes you just need to think about it totally differently. What you're doing is creating a vector where every item is the same as the first if it's a 1 or 0 otherwise.
z <- c(1,1,0,0,0,0)
if (z[1] != 1) z[1] <- 0
z[2:length(z)] <- z[1]
There is a function that does this particular calculation: cumprod (cumulative product)
> cumprod(z[zi])
[1] 1 0 0 0 0
> cumprod(c(1,2,3,4,0,5))
[1] 1 2 6 24 0 0
Otherwise, vectorize with Rcpp as other answers have shown.
It's also possible to do this with "apply" using the original vector and a lagged version of the vector as the constituent columns of a data frame.
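For recurrences that have no closed form, base R's Reduce with accumulate = TRUE can express "each result depends on the previous result" without an explicit index. The sketch below uses hypothetical names `loop_version` and `reduce_version` and reproduces the rule from the question. Reduce still iterates in R, so this is about expressing the recurrence, not about speed.

```r
# The loop from the question, wrapped in a function.
loop_version <- function(z) {
  for (i in 2:length(z)) z[i] <- if (z[i - 1] == 1) 1 else 0
  z
}

# Same rule via Reduce: `prev` is the previously computed result,
# `cur` is the current input element (ignored by this particular rule).
reduce_version <- function(z) {
  step <- function(prev, cur) if (prev == 1) 1 else 0
  Reduce(step, z[-1], accumulate = TRUE, init = z[1])
}

z <- c(1, 1, 0, 0, 0, 0)
z_reduce <- reduce_version(z)
```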

Create grouping variable for consecutive sequences and split vector

I have a vector, such as c(1, 3, 4, 5, 9, 10, 17, 29, 30) and I would like to group together the 'neighboring' elements that form a regular, consecutive sequence, i.e. an increase by 1, in a ragged vector resulting in:
L1: 1
L2: 3,4,5
L3: 9,10
L4: 17
L5: 29,30
Naive code (of an ex-C programmer):
partition.neighbors <- function(v)
{
result <<- list() #jagged array
currentList <<- v[1] #current series
for(i in 2:length(v))
{
if(v[i] - v [i-1] == 1)
{
currentList <<- c(currentList, v[i])
}
else
{
result <<- c(result, list(currentList))
currentList <<- v[i] #next series
}
}
return(result)
}
Now I understand that a) R is not C (despite the curly brackets), b) global variables are pure evil, and c) that is a horribly inefficient way of achieving the result, so any better solutions are welcome.
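For reference, here is a sketch of the same loop using only local variables. As written above, the function appears to return before the last run is appended to the result, so this version also flushes the final run. The name `partition_neighbors` is hypothetical, and the approach is still the element-by-element one, not a recommended idiom.

```r
partition_neighbors <- function(v) {
  if (length(v) == 0) return(list())
  result  <- list()   # ragged list of runs
  current <- v[1]     # run being built
  for (i in seq_along(v)[-1]) {
    if (v[i] - v[i - 1] == 1) {
      current <- c(current, v[i])               # extend the current run
    } else {
      result[[length(result) + 1]] <- current   # close the run
      current <- v[i]                           # start the next one
    }
  }
  result[[length(result) + 1]] <- current       # flush the final run
  result
}

runs <- partition_neighbors(c(1, 3, 4, 5, 9, 10, 17, 29, 30))
```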
Making heavy use of some R idioms:
> split(v, cumsum(c(1, diff(v) != 1)))
$`1`
[1] 1
$`2`
[1] 3 4 5
$`3`
[1] 9 10
$`4`
[1] 17
$`5`
[1] 29 30
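Broken into steps, the idiom works like this (a sketch with intermediate variables named here only for illustration): diff marks where the step to the next element is not 1, and cumsum over those marks turns each break into a new group number.

```r
v <- c(1, 3, 4, 5, 9, 10, 17, 29, 30)

steps  <- diff(v)                 # gap between each element and its predecessor
breaks <- steps != 1              # TRUE where a new run starts
group  <- cumsum(c(TRUE, breaks)) # the first element always starts run 1
runs   <- split(v, group)         # one list element per run
```

Here `group` comes out as 1 2 2 2 3 3 4 5 5, which is exactly the grouping variable the question asks for.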
daroczig writes "you could write a lot neater code based on diff"...
Here's one way:
split(v, cumsum(diff(c(-Inf, v)) != 1))
EDIT (added timings):
Tommy discovered this could be faster by being careful with types; the reason it got faster is that split is faster on integers, and is actually faster still on factors.
Here's Joshua's solution; the result from the cumsum is a numeric because it's being combined (via c) with the numeric 1, so it's the slowest.
system.time({
a <- cumsum(c(1, diff(v) != 1))
split(v, a)
})
# user system elapsed
# 1.839 0.004 1.848
Just combining with 1L via c(), so that the result is an integer, speeds it up considerably.
system.time({
a <- cumsum(c(1L, diff(v) != 1))
split(v, a)
})
# user system elapsed
# 0.744 0.000 0.746
This is Tommy's solution, for reference; it's also splitting on an integer.
system.time({
a <- cumsum(c(TRUE, diff(v) != 1L))
split(v, a)
})
# user system elapsed
# 0.742 0.000 0.746
Here's my original solution; it also is splitting on an integer.
system.time({
a <- cumsum(diff(c(-Inf, v)) != 1)
split(v, a)
})
# user system elapsed
# 0.750 0.000 0.754
Here's Joshua's, with the result converted to an integer before the split.
system.time({
a <- cumsum(c(1, diff(v) != 1))
a <- as.integer(a)
split(v, a)
})
# user system elapsed
# 0.736 0.002 0.740
All the versions that split on an integer vector are about the same; it could be even faster if that integer vector was already a factor, as the conversion from integer to factor actually takes about half the time. Here I make it into a factor directly; this is not recommended in general because it depends on the structure of the factor class. It's done here for comparison purposes only.
system.time({
a <- cumsum(c(1L, diff(v) != 1))
a <- structure(a, class = "factor", levels = 1L:a[length(a)])
split(v,a)
})
# user system elapsed
# 0.356 0.000 0.357
Joshua and Aaron were spot on. However, their code can still be made more than twice as fast by careful use of the correct types, integers and logicals:
split(v, cumsum(c(TRUE, diff(v) != 1L)))
v <- rep(c(1:5, 19), len = 1e6) # Huge vector...
system.time( split(v, cumsum(c(1, diff(v) != 1))) ) # Joshua's code
# user system elapsed
# 2.64 0.00 2.64
system.time( split(v, cumsum(c(TRUE, diff(v) != 1L))) ) # Modified code
# user system elapsed
# 1.09 0.00 1.12
You could define the cut-points easily:
which(diff(v) != 1)
Based on that try:
v <- c(1,3,4,5,9,10,17,29,30)
cutpoints <- c(0, which(diff(v) != 1), length(v))
ragged.vector <- vector("list", length(cutpoints)-1)
for (i in 2:length(cutpoints)) ragged.vector[[i-1]] <- v[(cutpoints[i-1]+1):cutpoints[i]]
Which results in:
> ragged.vector
[[1]]
[1] 1
[[2]]
[1] 3 4 5
[[3]]
[1] 9 10
[[4]]
[1] 17
[[5]]
[1] 29 30
This algorithm is not a nice one but you could write a lot neater code based on diff :) Good luck!
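The same cut-point idea can be written without the explicit index loop by pairing each run's start with its end and mapping over the pairs. This is a sketch; `starts`, `ends` and `ragged` are names introduced here for illustration.

```r
v <- c(1, 3, 4, 5, 9, 10, 17, 29, 30)
cutpoints <- c(0, which(diff(v) != 1), length(v))

starts <- head(cutpoints, -1) + 1   # first index of each run
ends   <- cutpoints[-1]             # last index of each run
ragged <- Map(function(a, b) v[a:b], starts, ends)
```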
You can create a data.frame and assign the elements to groups using diff, ifelse and cumsum, then aggregate using tapply:
v.df <- data.frame(v = v)
v.df$group <- cumsum(ifelse(c(1, diff(v) - 1), 1, 0))
tapply(v.df$v, v.df$group, function(x) x)
$`1`
[1] 1
$`2`
[1] 3 4 5
$`3`
[1] 9 10
$`4`
[1] 17
$`5`
[1] 29 30
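Another compact variant, sketched here under the assumption that v is sorted, strictly increasing and whole-numbered: within a run of consecutive values, the difference between each value and its position is constant, so that difference can serve directly as the grouping key.

```r
v <- c(1, 3, 4, 5, 9, 10, 17, 29, 30)

key  <- v - seq_along(v)   # constant within each consecutive run
runs <- split(v, key)      # names are the key values, not 1..5
```

If the group labels matter, unname(runs) or the cumsum-based grouping from the other answers gives sequential numbering instead.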

Resources