Speed up R loop [duplicate] - r

Speeding up loops in R can easily be done using a function from the apply family. How can I use an apply function in the code below to speed it up? Note that within the loop, at each iteration, one column is permuted and a function is applied to the new data frame (i.e., the initial data frame with one column permuted). I cannot seem to get apply to work because the new data frame has to be built within the loop.
#x <- data.frame(a=1:10,b=11:20,c=21:30) #small example
x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
y <- rowMeans(x)
start <- Sys.time()
totaldiff <- numeric()
for (i in 1:ncol(x)) {
  x.after <- x
  x.after[,i] <- sample(x[,i])
  diff <- abs(y - rowMeans(x.after))
  totaldiff[i] <- sum(diff)
}
colnames(x)[which.max(totaldiff)]
Sys.time() - start

After working through this and other replies, the optimization strategies (and approximate speed-up) here seem to be
(30x) Choose an appropriate data representation -- matrix, rather than data.frame
(1.5x) Reduce unnecessary data copies -- difference of columns, rather than of rowMeans
Structure for loops as *apply functions (to emphasize code structure, simplify memory management, and provide type consistency)
(2x) Hoist vector operations outside loops -- abs and sum on columns become abs and colSums on a matrix
for an overall speed-up of about 100x. For this size and complexity of code, the use of the compiler or parallel packages would not be effective.
I put your code into a function
f0 <- function(x) {
  y <- rowMeans(x)
  totaldiff <- numeric()
  for (i in 1:ncol(x)) {
    x.after <- x
    x.after[,i] <- sample(x[,i])
    diff <- abs(y - rowMeans(x.after))
    totaldiff[i] <- sum(diff)
  }
  which.max(totaldiff)
}
and here we have
x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
set.seed(123)
system.time(res0 <- f0(x))
## user system elapsed
## 1.065 0.000 1.066
Your data can be represented as a matrix, and operations on R matrices are faster than on data.frames.
m <- matrix(runif(50*100),nrow=50,ncol=100)
set.seed(123)
system.time(res0.m <- f0(m))
## user system elapsed
## 0.036 0.000 0.037
identical(res0, res0.m)
##[1] TRUE
That's probably the biggest speed-up. But for the specific operation here we don't need to calculate the row means of the updated matrix, just the change in the mean caused by shuffling one column:
f1 <- function(x) {
  y <- rowMeans(x)
  totaldiff <- numeric()
  for (i in 1:ncol(x)) {
    diff <- abs(sample(x[,i]) - x[,i]) / ncol(x)
    totaldiff[i] <- sum(diff)
  }
  which.max(totaldiff)
}
The for loop doesn't follow the right pattern for filling up the result vector totaldiff (you want to "pre-allocate and fill", i.e. totaldiff <- numeric(ncol(x))), but we can use sapply and let R worry about that; this memory management is one of the advantages of using the apply family of functions:
f2 <- function(x) {
  totaldiff <- sapply(seq_len(ncol(x)), function(i, x) {
    sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
  }, x)
  which.max(totaldiff)
}
set.seed(123); identical(res0, f1(m))
set.seed(123); identical(res0, f2(m))
The timings are
> library(microbenchmark)
> microbenchmark(f0(m), f1(m), f2(m))
Unit: milliseconds
expr min lq median uq max neval
f0(m) 32.45073 33.07804 33.16851 33.26364 33.81924 100
f1(m) 22.20913 23.87784 23.96915 24.06216 24.66042 100
f2(m) 21.02474 22.60745 22.70042 22.80080 23.19030 100
@flodel points out that vapply can be faster (and provides type safety):
f3 <- function(x) {
  totaldiff <- vapply(seq_len(ncol(x)), function(i, x) {
    sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
  }, numeric(1), x)
  which.max(totaldiff)
}
and that
f4 <- function(x)
  which.max(colSums(abs(apply(x, 2, sample) - x)))
is still faster (ncol(x) is a constant factor, so it was removed). The abs and sum are hoisted outside the sapply, perhaps at the expense of additional memory use. The advice in the comments to compile functions is good in general; here are some further timings:
> microbenchmark(f0(m), f1(m), f1.c(m), f2(m), f2.c(m), f3(m), f4(m))
Unit: milliseconds
expr min lq median uq max neval
f0(m) 32.35600 32.88326 33.12274 33.25946 34.49003 100
f1(m) 22.21964 23.41500 23.96087 24.06587 24.49663 100
f1.c(m) 20.69856 21.20862 22.20771 22.32653 213.26667 100
f2(m) 20.76128 21.52786 22.66352 22.79101 69.49891 100
f2.c(m) 21.16423 21.57205 22.94157 23.06497 23.35764 100
f3(m) 20.17755 21.41369 21.99292 22.10814 22.36987 100
f4(m) 10.10816 10.47535 10.56790 10.61938 10.83338 100
where the ".c" versions are the compiled ones. Compilation is particularly helpful for code written with for loops but doesn't do much for vectorized code; this is shown here, where there's a small but consistent improvement from compiling f1's for loop, but not f2's sapply.
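For reference, the compiled ".c" versions above can be created along these lines (a sketch, assuming the compiler package that ships with R):
library(compiler)
f1.c <- cmpfun(f1)  # byte-compiled version of the for-loop variant
f2.c <- cmpfun(f2)  # byte-compiled version of the sapply variant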

Since you are looking at efficiency/optimization, start by using the rbenchmark package for comparison purposes.
Rewriting your given example as a function (so that it can be replicated and compared)
forFirst <- function(x) {
  y <- rowMeans(x)
  totaldiff <- numeric()
  for (i in 1:ncol(x)) {
    x.after <- x
    x.after[,i] <- sample(x[,i])
    diff <- abs(y - rowMeans(x.after))
    totaldiff[i] <- sum(diff)
  }
  colnames(x)[which.max(totaldiff)]
}
Applying some standard optimizations (pre-allocating totaldiff to the right size, eliminating intermediate variables that are only used once) gives
forSecond <- function(x) {
  y <- rowMeans(x)
  totaldiff <- numeric(ncol(x))
  for (i in 1:ncol(x)) {
    x.after <- x
    x.after[,i] <- sample(x[,i])
    totaldiff[i] <- sum(abs(y - rowMeans(x.after)))
  }
  colnames(x)[which.max(totaldiff)]
}
Not much more can be done, as far as I can see, to improve the algorithm within the loop itself. A better algorithm would help the most, but since this particular problem is just an example, it is not worth spending that time.
The apply version looks very similar.
applyFirst <- function(x) {
  y <- rowMeans(x)
  totaldiff <- sapply(seq_len(ncol(x)), function(i) {
    x[,i] <- sample(x[,i])
    sum(abs(y - rowMeans(x)))
  })
  colnames(x)[which.max(totaldiff)]
}
Benchmarking them gives:
> library("rbenchmark")
> benchmark(forFirst(x),
+ forSecond(x),
+ applyFirst(x),
+ order = "relative")
           test replications elapsed relative user.self sys.self user.child sys.child
1   forFirst(x)          100   16.92    1.000     16.88     0.00         NA        NA
2  forSecond(x)          100   17.02    1.006     16.96     0.03         NA        NA
3 applyFirst(x)          100   17.05    1.008     17.02     0.01         NA        NA
The differences between these is just noise. In fact, running the benchmark again gives a different ordering:
> benchmark(forFirst(x),
+ forSecond(x),
+ applyFirst(x),
+ order = "relative")
           test replications elapsed relative user.self sys.self user.child sys.child
3 applyFirst(x)          100   17.05    1.000     17.02        0         NA        NA
2  forSecond(x)          100   17.08    1.002     17.05        0         NA        NA
1   forFirst(x)          100   17.44    1.023     17.41        0         NA        NA
So these approaches are the same speed. Any real improvement will come from using a better algorithm than just simple looping and copying to create the intermediate results.

Apply functions do not necessarily speed up loops in R. Sometimes they can even slow them down. There's no reason to believe that turning this into an apply family function will speed it up any appreciable amount.
As an aside, this code seems like a relatively pointless endeavour. It's just going to select a random column. I could get the same result by just doing that in the first place. Perhaps this is nested in a larger loop looking for a distribution?

Related

Multiplication of many matrices in R

I want to multiply several matrices of the same size with an initial vector. In the example below, p.state is a vector of m elements and tran.mat is a list where each member is an m x m matrix.
for (i in 1:length(tran.mat)) {
  p.state <- p.state %*% tran.mat[[i]]
}
The code above gives the correct answer but can be slow when length(tran.mat) is large. I was wondering if there was a more efficient way of doing this?
Below is an example with m=3 and length(tran.mat)=10 that generates this:
p.state <- c(1,0,0)
tran.mat<-lapply(1:10,function(y){apply(matrix(runif(9),3,3),1,function(x){x/sum(x)})})
for (i in 1:length(tran.mat)) {
  p.state <- p.state %*% tran.mat[[i]]
}
print(p.state)
NB: tran.mat does not have to be a list; it is just currently written as one.
Edit after a few comments:
Reduce is useful when m is small. However, when m=6 the loop outperformed both of the proposed solutions.
library(rbenchmark)
p.state1 <- p.state <- c(1,0,0,0,0,0)
tran.mat<-lapply(1:10000,function(y){t(apply(matrix(runif(36),6,6),1,function(x){x/sum(x)}))})
tst<-do.call(c, list(list(p.state), tran.mat))
benchmark(
  'loop' = {
    for (i in 1:length(tran.mat)) {
      p.state <- p.state %*% tran.mat[[i]]
    }
  },
  'reduce' = {
    p.state1 %*% Reduce('%*%', tran.mat)
  },
  'reorder' = {
    Reduce(`%*%`, tran.mat, p.state1)
  }
)
This results in
test replications elapsed relative user.self sys.self user.child sys.child
1 loop 100 0.87 1.000 0.87 0 NA NA
2 reduce 100 1.41 1.621 1.39 0 NA NA
3 reorder 100 1.00 1.149 1.00 0 NA NA
A faster way is to use Reduce() to do sequential matrix multiplication on the list of matrices.
You can get approximately a 4x speedup that way. Below is an example of your code tested, with 1000 elements in the list instead of 10 to see the performance improvement more easily.
Code
library(rbenchmark)
p.state <- c(1,0,0)
tran.mat<-lapply(1:1000,function(y){apply(matrix(runif(9),3,3),1,function(x){x/sum(x)})})
benchmark(
  'loop' = {
    for (i in 1:length(tran.mat)) {
      p.state <- p.state %*% tran.mat[[i]]
    }
  },
  'reduce' = {
    p.state %*% Reduce('%*%', tran.mat)
  }
)
Output
test replications elapsed relative user.self sys.self user.child sys.child
1 loop 100 0.23 3.833 0.23 0 NA NA
2 reduce 100 0.06 1.000 0.07 0 NA NA
You can see the reduce method is about 3.8 times faster.
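Since matrix multiplication is associative, both orderings give the same probability vector up to floating-point error; a quick sanity check along these lines (a sketch, reusing tran.mat from above) should return TRUE:
p.loop <- c(1,0,0)
for (i in 1:length(tran.mat)) {
  p.loop <- p.loop %*% tran.mat[[i]]
}
p.red <- c(1,0,0) %*% Reduce('%*%', tran.mat)
all.equal(as.numeric(p.loop), as.numeric(p.red))  # should be TRUE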
I am not sure that this will be any faster but it is shorter:
prod <- Reduce("%*%", L)
all.equal(prod, L[[1]] %*% L[[2]] %*% L[[3]] %*% L[[4]])
## [1] TRUE
Note
We used this test input:
m <- matrix(1:9, 3)
L <- list(m^0, m, m^2, m^3)
I am going to use a function from the Rfast package to reduce the execution time of the multiplication; unfortunately, the time spent in the for loop itself cannot be reduced. The function Rfast::eachcol.apply is a good fit for your purpose. Your multiplication could also be written with crossprod, but that is slow for our purpose.
Here are some helper functions:
mult.list <- function(x, y) {
  for (xm in x) {
    y <- y %*% xm
  }
  y
}
mult.list2 <- function(x, y) {
  for (xm in x) {
    y <- Rfast::eachcol.apply(xm, y, oper = "*", apply = "sum")
  }
  y
}
Here is an example:
x <- list()
y <- rnorm(1000)
for (i in 1:100) {
  x[[i]] <- Rfast::matrnorm(1000, 1000)
}
microbenchmark::microbenchmark(R=a<-mult.list(x,y),Rfast=b<-mult.list2(x,y),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
R 410.067525 532.176979 633.3700627 649.155826 699.721086 916.542414 10
Rfast 239.987159 251.266488 352.1951486 276.382339 458.089342 741.340268 10
all.equal(as.numeric(a),as.numeric(b))
[1] TRUE
The argument oper gives the operation applied to the elements and apply the operation applied to each column. It should be fast on large matrices; I couldn't test bigger matrices on my laptop.
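As a small check of the equivalence the helper above relies on (a sketch, assuming Rfast is installed): eachcol.apply with oper="*" and apply="sum" should reproduce the vector-matrix product y %*% xm.
xm <- Rfast::matrnorm(5, 5)
y <- rnorm(5)
all.equal(as.numeric(y %*% xm),
          as.numeric(Rfast::eachcol.apply(xm, y, oper = "*", apply = "sum")))  # should be TRUE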

Is it possible to speed up my function for creating a correlation matrix?

I have written the following function to estimate the pairwise correlations of multinomial variables using Cramér's V. I use the vcd package for this purpose, but to my knowledge there is no existing function that creates a symmetric correlation matrix of V from a matrix or data.frame, analogous to cor.
The function is:
require(vcd)
get.V <- function(y) {
  col.y <- ncol(y)
  V <- matrix(ncol = col.y, nrow = col.y)
  for (i in 1:col.y) {
    for (j in 1:col.y) {
      V[i,j] <- assocstats(table(y[,i], y[,j]))$cramer
    }
  }
  return(V)
}
However, for large numbers of variables it gets relatively slow.
no.var<-5
y<-matrix(ncol=no.var,sample(1:5,100*no.var,TRUE))
get.V(y)
As you increase no.var, computing time can explode. Since I need to apply this to data.frames with 100 or more variables, my question is whether it is possible to speed up my function with more elegant programming. Thank you.
As well as reducing the number of tests performed, or otherwise optimising the running of the whole function, we might be able to make assocstats faster. We'll start by establishing a test case to make sure we don't accidentally make a faster function that's incorrect.
x <- vcd::Arthritis$Improved
y <- vcd::Arthritis$Treatment
correct <- vcd::assocstats(table(x, y))$cramer
correct
## [1] 0.3942
is_ok <- function(x) stopifnot(all.equal(x, correct))
We'll start by making a version of assocstats that's very close to the original.
cramer1 <- function(x, y) {
  mat <- table(x, y)
  tab <- summary(MASS::loglm(~1 + 2, mat))$tests
  phi <- sqrt(tab[2, 1] / sum(mat))
  cont <- sqrt(phi ^ 2 / (1 + phi ^ 2))
  sqrt(phi ^ 2 / min(dim(mat) - 1))
}
is_ok(cramer1(x, y))
The slowest operation here is going to be loglm, so before we try making that faster, it's worth looking for an alternative approach. A little googling finds a useful blog post. Let's also try that:
cramer2 <- function(x, y) {
  chi <- chisq.test(x, y, correct = FALSE)$statistic[[1]]
  ulength_x <- length(unique(x))
  ulength_y <- length(unique(y))
  sqrt(chi / (length(x) * (min(ulength_x, ulength_y) - 1)))
}
is_ok(cramer2(x, y))
How does the performance stack up:
library(microbenchmark)
microbenchmark(
cramer1(x, y),
cramer2(x, y)
)
## Unit: microseconds
## expr min lq median uq max neval
## cramer1(x, y) 1080.0 1149.3 1182.0 1222.1 2598 100
## cramer2(x, y) 800.7 850.6 881.9 934.6 1866 100
cramer2() is faster. chisq.test() is likely to be the bottleneck, so let's see if we can make that function faster by doing less: chisq.test() does a lot more than compute the test statistic, so it's likely we can speed it up. A few minutes' careful work reduces the function to:
chisq_test <- function(x, y) {
  O <- table(x, y)
  n <- sum(O)
  E <- outer(rowSums(O), colSums(O), "*") / n
  sum((abs(O - E))^2 / E)
}
We can then create a new cramer3() that uses chisq_test().
cramer3 <- function(x, y) {
  chi <- chisq_test(x, y)
  ulength_x <- length(unique(x))
  ulength_y <- length(unique(y))
  sqrt(chi / (length(x) * (min(ulength_x, ulength_y) - 1)))
}
is_ok(cramer3(x, y))
microbenchmark(
cramer1(x, y),
cramer2(x, y),
cramer3(x, y)
)
## Unit: microseconds
## expr min lq median uq max neval
## cramer1(x, y) 1088.6 1138.9 1169.6 1221.5 2534 100
## cramer2(x, y) 796.1 840.6 865.0 906.6 1893 100
## cramer3(x, y) 334.6 358.7 373.5 390.4 1409 100
And now that we have our own simple version of chisq.test(), we can eke out a little more speed by using the result of table() to figure out the number of unique elements in x and y:
cramer4 <- function(x, y) {
  O <- table(x, y)
  n <- length(x)
  E <- outer(rowSums(O), colSums(O), "*") / n
  chi <- sum((abs(O - E))^2 / E)
  sqrt(chi / (length(x) * (min(dim(O)) - 1)))
}
is_ok(cramer4(x, y))
microbenchmark(
cramer1(x, y),
cramer2(x, y),
cramer3(x, y),
cramer4(x, y)
)
## Unit: microseconds
## expr min lq median uq max neval
## cramer1(x, y) 1097.6 1145.8 1183.3 1233.3 2318 100
## cramer2(x, y) 800.7 840.5 860.7 895.5 2079 100
## cramer3(x, y) 334.4 353.1 365.7 384.1 1654 100
## cramer4(x, y) 248.0 263.3 273.2 283.5 1342 100
Not bad - we've made it 4 times faster just using R code. From here, you could try to get even more speed by:
Using tcrossprod() instead of outer() (a sketch follows below)
Making a faster version of table() for this special (2d) case
Using Rcpp to compute the test-statistic from the tabular data
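As a sketch of the first suggestion: for two vectors, tcrossprod() gives exactly the outer product used for the expected counts, so the only change to chisq_test() above is the line that builds E (the name chisq_test2 is just for illustration).
chisq_test2 <- function(x, y) {
  O <- table(x, y)
  n <- sum(O)
  # tcrossprod(a, b) for vectors a and b is a %*% t(b), i.e. outer(a, b, "*")
  E <- tcrossprod(rowSums(O), colSums(O)) / n
  sum((O - E)^2 / E)
}
all.equal(chisq_test(x, y), chisq_test2(x, y))  # should be TRUE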
You are best off using the vectorized version of outer like Tyler suggested. You can still get a performance boost by writing a function that calculates just Cramér's V. The assocstats function calls summary on the table, and that calculates a lot of statistics you don't want. If you replace the call to assocstats with a user-defined function along the lines of
cv <- function(x, y) {
  t <- table(x, y)
  chi <- chisq.test(t)$statistic
  cramer <- sqrt(chi / (NROW(x) * (min(dim(t)) - 1)))
  cramer
}
This new function, by calculating only Cramér's V, runs in about 40% of the time required for assocstats. You could potentially speed it up again by reducing chisq.test to something that only calculates the chi-square test statistic. Even if you just adjust your loop index values to exploit the fact that the matrix is symmetric with 1 on the diagonal, and use this cv function instead of assocstats, you are looking at easily a 5-fold increase in performance.
Edit: As requested, the full code I've been using to get a 4x speed up is
cv <- function(x, y) {
  t <- table(x, y)
  chi <- suppressWarnings(chisq.test(t))$statistic
  cramer <- sqrt(chi / (NROW(x) * (min(dim(t)) - 1)))
  cramer
}
get.V3 <- function(y, fill = TRUE) {
  col.y <- ncol(y)
  V <- matrix(ncol = col.y, nrow = col.y)
  for (i in 1:(col.y - 1)) {
    for (j in (i + 1):col.y) {
      V[i,j] <- cv(y[,i], y[,j])
    }
  }
  diag(V) <- 1
  if (fill) {
    for (i in 1:ncol(V)) {
      V[, i] <- V[i, ]
    }
  }
  V
}
It looks very similar to what Hadley suggests below, although his version of the function that gets Cramér's V uses correct = FALSE in chisq.test. If all the tables are larger than 2x2, the setting of correct doesn't matter. For 2x2 tables, the results will vary depending on the argument. It is probably best to follow his example and set correct = FALSE so that everything is calculated the same way regardless of table size.
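To see the 2x2 point concretely, here is a small illustration with made-up counts: chisq.test() applies Yates' continuity correction to 2x2 tables by default, so the statistic (and hence Cramér's V) changes unless correct = FALSE is given.
tab <- as.table(matrix(c(10, 20, 30, 40), nrow = 2))  # hypothetical 2x2 counts
chisq.test(tab)$statistic                   # with the default continuity correction
chisq.test(tab, correct = FALSE)$statistic  # without it; matches the simplified chisq_test() above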
You could reduce the calculation time by calculating only one half of your matrix:
get.V2 <- function(y) {
  cb <- combn(1:ncol(y), 2, function(i) assocstats(table(y[, i[1]], y[, i[2]]))$cramer)
  m <- matrix(0, ncol(y), ncol(y))
  m[lower.tri(m)] <- cb
  diag(m) <- 1
  ## copy the lower.tri to the upper.tri, as suggested by @iacobus
  for (i in 1:nrow(m)) {
    m[i, ] <- m[, i]
  }
  return(m)
}
EDIT: added @iacobus's suggestion to populate the upper.tri of the matrix and added a little benchmark:
library("vcd")
library("qdapTools")
library("rbenchmark")
## suggested by @TylerRinker
get.V3 <- function(y) v_outer(y, function(i, j) assocstats(table(i, j))$cramer)
set.seed(1)
no.var<-10
y<-matrix(ncol=no.var,sample(1:5,100*no.var,TRUE))
benchmark(get.V(y), get.V2(y), get.V3(y), replications=10, order="relative")
# test replications elapsed relative user.self sys.self user.child sys.child
#2 get.V2(y) 10 0.992 1.000 0.988 0.000 0 0
#1 get.V(y) 10 2.239 2.257 2.232 0.004 0 0
#3 get.V3(y) 10 2.495 2.515 2.484 0.004 0 0
This uses a vectorized version of outer:
library(qdapTools)
y <- matrix(ncol=no.var,sample(1:5,100*no.var,TRUE))
get.V2 <- function(x, y) {
  assocstats(table(x, y))$cramer
}
v_outer(y, get.V2)
## > v_outer(y, get.V2)
## V1 V2 V3 V4 V5
## V1 1.000 0.224 0.158 0.195 0.217
## V2 0.224 1.000 0.175 0.163 0.240
## V3 0.158 0.175 1.000 0.208 0.145
## V4 0.195 0.163 0.208 1.000 0.189
## V5 0.217 0.240 0.145 0.189 1.000
Edit
On 1000 variables these are the system times:
Tyler: Time difference of 38.79437 mins
sgibb: Time difference of 19.54342 mins
Clearly sgibb's approach is superior.

Benchmark analysis and display results of analysis AND benchmark?

Probably I just missed a parameter, but maybe someone can point me to it: how can I run an analysis in R, benchmark it, and still store the result somewhere? I know R functions can only return one single object, but I could either make use of a list here or paste the benchmark results and store the analysis in the function's return value.
But is there any way to evaluate benchmark (or system.time) and the analysis without running the analysis twice, like this?
require(rbenchmark)
bmark <- function(x) {
  res <- list()
  res[[1]] <- benchmark(x^6)
  res[[2]] <- x^6
  res
}
EDIT: I am sorry I caused some confusion about what I really want to do. Maybe the use case makes it clearer: I don't have a typical benchmark situation where I want to check whether my custom function is faster than some other function. It's rather that I run the same thing with different data on different machines. I don't need this in a test environment, but in production – I just want to let users of a script know how long it took. If that's an hour or more, people can plan their lunch break :).
Here's an example using two functions. The first one uses plyr and the second uses data.table.
# dummy data
require(plyr)
require(data.table)
set.seed(45)
x1 <- data.frame(x=rnorm(1e6), grp = sample(letters[1:26], 1e6, replace=T))
x1.dt <- data.table(x1, key="grp")
# function that uses plyr
DF.FUN <- function(x) {
  ddply(x1, .(grp), summarise, m.x = mean(x))
}
# function that uses data.table
DT.FUN <- function(x) {
  x1.dt[, list(m.x = mean(x)), by = grp]
}
require(rbenchmark)
> benchmark( s1 <- DF.FUN(), s2 <- DT.FUN(), order="elapsed", replications=2)
# test replications elapsed relative user.self sys.self user.child sys.child
# 2 s2 <- DT.FUN() 2 0.036 1.000 0.031 0.006 0 0
# 1 s1 <- DF.FUN() 2 0.527 14.639 0.363 0.163 0 0
Now, s1 and s2 contain the results from each function, and the benchmarked results will be displayed on screen.
# > head(s1)
# grp m.x
# 1 a 0.0069312201
# 2 b -0.0002422315
# 3 c -0.0129449586
# 4 d -0.0036275338
# 5 e 0.0013438022
# 6 f -0.0015428427
# > head(s2)
# grp m.x
# 1: a 0.0069312201
# 2: b -0.0002422315
# 3: c -0.0129449586
# 4: d -0.0036275338
# 5: e 0.0013438022
# 6: f -0.0015428427
Is this what you were after?
I read the question a bit differently than Arun. This would be the answer to what I thought was being asked:
> bres <- bmark(2)
> bres
[[1]]
test replications elapsed relative user.self sys.self user.child sys.child
1 x^6 100 0.001 1 0.001 0.001 0 0
[[2]]
[1] 64
The bmark function is returning a result with the default 100 replications. If you wanted to annotate the results you could use paste(), and if you wanted to add a parameter for the number of reps:
bmark2 <- function(x, reps = 100) {
  res <- list()
  res[[1]] <- benchmark(x^6, replications = reps)
  res[[2]] <- paste(reps, " replications of ", x, "to the 6th in", res[[1]]$elapsed)
  res
}
I am unsure of what StackOverflow thinks about answering old questions, but it seems like nobody actually answered after your edit. So here goes:
To time a process in R you can use two methods.
The first one uses system.time(expression) and gives you how much time it took to evaluate the expression within the brackets.
If this is not practical in your case, you can get the system time with Sys.time() before and after the operation and subtract the two.
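A minimal sketch of both approaches, using a trivial placeholder computation (x^6, as in bmark(2) above) in place of the real analysis:
x <- 2
# method 1: wrap the expression in system.time(); the result is still assigned to res
st <- system.time(res <- x^6)
st["elapsed"]
# method 2: take Sys.time() before and after, then subtract
t0 <- Sys.time()
res <- x^6
t1 <- Sys.time()
t1 - t0   # a difftime you can print or message() to the user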
If this finally answers your question please accept the solution :)

Vectorize a product calculation which depends on previous elements?

I'm trying to speed up/vectorize some calculations in a time series.
Can I vectorize a calculation in a for loop which can depend on results from an earlier iteration? For example:
z <- c(1,1,0,0,0,0)
zi <- 2:6
for (i in zi) { z[i] <- ifelse(z[i-1] == 1, 1, 0) }
uses the z[i] values updated in earlier steps:
> z
[1] 1 1 1 1 1 1
In my effort at vectorizing this
z <- c(1,1,0,0,0,0)
z[zi] <- ifelse( z[zi-1] == 1, 1, 0)
the element-by-element operations don't use results updated in the operation:
> z
[1] 1 1 1 0 0 0
So this vectorized operation operates in 'parallel' rather than iterative fashion. Is there a way I can write/vectorize this to get the results of the for loop?
ifelse is vectorized and there's a bit of a penalty if you're using it on one element at a time in a for-loop. In your example, you can get a pretty good speedup by using if instead of ifelse.
fun1 <- function(z) {
  for (i in 2:NROW(z)) {
    z[i] <- ifelse(z[i-1] == 1, 1, 0)
  }
  z
}
fun2 <- function(z) {
  for (i in 2:NROW(z)) {
    z[i] <- if (z[i-1] == 1) 1 else 0
  }
  z
}
z <- c(1,1,0,0,0,0)
identical(fun1(z),fun2(z))
# [1] TRUE
system.time(replicate(10000, fun1(z)))
# user system elapsed
# 1.13 0.00 1.32
system.time(replicate(10000, fun2(z)))
# user system elapsed
# 0.27 0.00 0.26
You can get some additional speed gains out of fun2 by compiling it.
library(compiler)
cfun2 <- cmpfun(fun2)
system.time(replicate(10000, cfun2(z)))
# user system elapsed
# 0.11 0.00 0.11
So there's a 10x speedup without vectorization. As others have said (and some have illustrated) there are ways to vectorize your example, but that may not translate to your actual problem. Hopefully this is general enough to be applicable.
The filter function may be useful to you as well if you can figure out how to express your problem in terms of a autoregressive or moving average process.
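For example, a sketch (not the 0/1 rule above, just the general pattern) of a recursive dependence y[i] = x[i] + a*y[i-1] written with stats::filter():
x <- c(1, 0, 0, 2, 0, 0)
a <- 0.5
y.filter <- stats::filter(x, filter = a, method = "recursive")
# the same recursion written as an explicit loop
y.loop <- numeric(length(x))
prev <- 0
for (i in seq_along(x)) {
  y.loop[i] <- x[i] + a * prev
  prev <- y.loop[i]
}
all.equal(as.numeric(y.filter), y.loop)  # should be TRUE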
This is a nice and simple example where Rcpp can shine.
So let us first recast functions 1 and 2 and their compiled counterparts:
library(inline)
library(rbenchmark)
library(compiler)
fun1 <- function(z) {
  for (i in 2:NROW(z)) {
    z[i] <- ifelse(z[i-1] == 1, 1, 0)
  }
  z
}
fun1c <- cmpfun(fun1)
fun2 <- function(z) {
  for (i in 2:NROW(z)) {
    z[i] <- if (z[i-1] == 1) 1 else 0
  }
  z
}
fun2c <- cmpfun(fun2)
We write a Rcpp variant very easily:
funRcpp <- cxxfunction(signature(zs = "numeric"), plugin = "Rcpp", body = "
  Rcpp::NumericVector z = Rcpp::NumericVector(zs);
  int n = z.size();
  for (int i = 1; i < n; i++) {
    z[i] = (z[i-1] == 1.0 ? 1.0 : 0.0);
  }
  return(z);
")
This uses the inline package to compile, load and link the five-liner on the fly.
Now we can define our test data, which we make a little longer than the original (as running the original too few times results in unmeasurable times):
R> z <- rep(c(1,1,0,0,0,0), 100)
R> identical(fun1(z),fun2(z),fun1c(z),fun2c(z),funRcpp(z))
[1] TRUE
R>
All answers are seen as identical.
Finally, we can benchmark:
R> res <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications", "elapsed",
+ "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=1000)
R> print(res)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 1000 0.005 1.0 0.01 0
4 fun2c(z) 1000 0.466 93.2 0.46 0
2 fun2(z) 1000 1.918 383.6 1.92 0
3 fun1c(z) 1000 10.865 2173.0 10.86 0
1 fun1(z) 1000 12.480 2496.0 12.47 0
The compiled version wins by a factor of almost 400 against the best R version, and almost 100 against its byte-compiled variant. For function 1, the byte compilation matters much less and both variants trail the C++ by a factor of well over two-thousand.
It took about one minute to write the C++ version. The speed gain suggests it was a minute well spent.
For comparison, here is the result for the original short vector called more often:
R> z <- c(1,1,0,0,0,0)
R> res2 <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications",
+ "elapsed", "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=10000)
R> print(res2)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 10000 0.046 1.000000 0.04 0
4 fun2c(z) 10000 0.132 2.869565 0.13 0
2 fun2(z) 10000 0.271 5.891304 0.27 0
3 fun1c(z) 10000 1.045 22.717391 1.05 0
1 fun1(z) 10000 1.202 26.130435 1.20 0
The qualitative ranking is unchanged: the Rcpp version dominates and function 2 is second-best, with the byte-compiled version being about twice as fast as the plain R variant, but still almost three times slower than the C++ version. The relative differences are also lower: relatively speaking, the function call overhead matters less and the actual looping matters more, so C++ gets a bigger advantage on the actual loop operations in the longer vectors. That is an important result, as it suggests that with more real-life-sized data, the compiled version may reap a larger benefit.
Edited to correct two small oversights in the code examples. And edited again with thanks to Josh for catching a setup error relative to fun2c.
I think this is cheating and not generalizable, but: according to the rules you have above, any occurrence of 1 in the vector will make all subsequent elements 1 (by recursion: z[i] is set to 1 if z[i-1] equals 1; therefore z[i] will be set to 1 if z[i-2] equals 1; and so forth). Depending on what you really want to do, there may be such a recursive solution available if you think carefully about it ...
z <- c(1,1,0,0,0,0)
first1 <- min(which(z==1))
z[seq_along(z)>first1] <- 1
edit: this is wrong, but I'm leaving it up to admit my mistakes. Based on a little bit of playing (and less thinking), I think the actual solution to this recursion is more symmetric and even simpler:
rep(z[1],length(z))
Test cases:
z <- c(1,1,0,0,0,0)
z <- c(0,1,1,0,0,0)
z <- c(0,0,1,0,0,0)
Check out the rollapply function in zoo.
I'm not super familiar with it, but I think this does what you want:
> c( 1, rollapply(z,2,function(x) x[1]) )
[1] 1 1 1 1 1 1
I'm sort of kludging it by using a window of 2 and then only using the first element of that window.
For more complicated examples you could perform some calculation on x[1] and return that instead.
Sometimes you just need to think about it totally differently. What you're doing is creating a vector where every item is 1 if the first element is 1, and 0 otherwise.
z <- c(1,1,0,0,0,0)
if (z[1] != 1) z[1] <- 0
z[2:length(z)] <- z[1]
There is a function that does this particular calculation: cumprod (cumulative product)
> cumprod(z[zi])
[1] 1 0 0 0 0
> cumprod(c(1,2,3,4,0,5))
[1] 1 2 6 24 0 0
Otherwise, vectorize with Rcpp as other answers have shown.
It's also possible to do this with "apply" using the original vector and a lagged version of the vector as the constituent columns of a data frame.
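A minimal sketch of that construction; note that, like the vectorized ifelse above, it only sees the original lagged values, so it gives the 'parallel' result rather than the loop's:
z <- c(1,1,0,0,0,0)
zd <- data.frame(prev = z[-length(z)], cur = z[-1])  # lagged and current values side by side
z[-1] <- apply(zd, 1, function(r) if (r["prev"] == 1) 1 else 0)
z  # c(1,1,1,0,0,0), the same 'parallel' result as the vectorized ifelse above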

Is R's apply family more than syntactic sugar?

...regarding execution time and / or memory.
If this is not true, prove it with a code snippet. Note that speedup by vectorization does not count. The speedup must come from apply (tapply, sapply, ...) itself.
The apply functions in R don't provide improved performance over other looping functions (e.g. for). One exception to this is lapply which can be a little faster because it does more work in C code than in R (see this question for an example of this).
But in general, the rule is that you should use an apply function for clarity, not for performance.
I would add to this that apply functions have no side effects, which is an important distinction when it comes to functional programming with R. This can be overridden by using assign or <<-, but that can be very dangerous. Side effects also make a program harder to understand since a variable's state depends on the history.
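A tiny illustration of how <<- can reintroduce a side effect from inside an apply call (a sketch; the variable name total is arbitrary):
total <- 0
invisible(lapply(1:3, function(i) total <<- total + i))  # <<- writes to the enclosing environment
total
## [1] 6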
Edit:
Just to emphasize this with a trivial example that recursively calculates the Fibonacci sequence; this could be run multiple times to get an accurate measure, but the point is that none of the methods have significantly different performance:
> fibo <- function(n) {
+ if ( n < 2 ) n
+ else fibo(n-1) + fibo(n-2)
+ }
> system.time(for(i in 0:26) fibo(i))
user system elapsed
7.48 0.00 7.52
> system.time(sapply(0:26, fibo))
user system elapsed
7.50 0.00 7.54
> system.time(lapply(0:26, fibo))
user system elapsed
7.48 0.04 7.54
> library(plyr)
> system.time(ldply(0:26, fibo))
user system elapsed
7.52 0.00 7.58
Edit 2:
Regarding the usage of parallel packages for R (e.g. rpvm, rmpi, snow), these do generally provide apply family functions (even the foreach package is essentially equivalent, despite the name). Here's a simple example of the sapply function in snow:
library(snow)
cl <- makeSOCKcluster(c("localhost","localhost"))
parSapply(cl, 1:20, get("+"), 3)
This example uses a socket cluster, for which no additional software needs to be installed; otherwise you will need something like PVM or MPI (see Tierney's clustering page). snow has the following apply functions:
parLapply(cl, x, fun, ...)
parSapply(cl, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
parApply(cl, X, MARGIN, FUN, ...)
parRapply(cl, x, fun, ...)
parCapply(cl, x, fun, ...)
It makes sense that apply functions should be used for parallel execution since they have no side effects. When you change a variable value within a for loop, it is globally set. On the other hand, all apply functions can safely be used in parallel because changes are local to the function call (unless you try to use assign or <<-, in which case you can introduce side effects). Needless to say, it's critical to be careful about local vs. global variables, especially when dealing with parallel execution.
Edit:
Here's a trivial example to demonstrate the difference between for and *apply so far as side effects are concerned:
> df <- 1:10
> # *apply example
> lapply(2:3, function(i) df <- df * i)
> df
[1] 1 2 3 4 5 6 7 8 9 10
> # for loop example
> for(i in 2:3) df <- df * i
> df
[1] 6 12 18 24 30 36 42 48 54 60
Note how the df in the parent environment is altered by for but not *apply.
Sometimes the speedup can be substantial, like when you have to nest for loops to get the average based on a grouping of more than one factor. Here you have two approaches that give you the exact same result:
set.seed(1) #for reproducability of the results
# The data
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5],100000,replace=T))
Z <- as.factor(sample(letters[1:10],100000,replace=T))
# the function forloop that averages X over every combination of Y and Z
forloop <- function(x, y, z) {
  # These are for optimization, so the functions
  # levels() and length() don't have to be called more than once.
  ylev <- levels(y)
  zlev <- levels(z)
  n <- length(ylev)
  p <- length(zlev)
  out <- matrix(NA, ncol = p, nrow = n)
  for (i in 1:n) {
    for (j in 1:p) {
      out[i,j] <- mean(x[y == ylev[i] & z == zlev[j]])
    }
  }
  rownames(out) <- ylev
  colnames(out) <- zlev
  return(out)
}
# Used on the generated data
forloop(X,Y,Z)
# The same using tapply
tapply(X,list(Y,Z),mean)
Both give exactly the same result, being a 5 x 10 matrix with the averages and named rows and columns. But:
> system.time(forloop(X,Y,Z))
user system elapsed
0.94 0.02 0.95
> system.time(tapply(X,list(Y,Z),mean))
user system elapsed
0.06 0.00 0.06
There you go. What did I win? ;-)
...and as I just wrote elsewhere, vapply is your friend!
...it's like sapply, but you also specify the return value type which makes it much faster.
foo <- function(x) x+1
y <- numeric(1e6)
system.time({z <- numeric(1e6); for(i in y) z[i] <- foo(i)})
# user system elapsed
# 3.54 0.00 3.53
system.time(z <- lapply(y, foo))
# user system elapsed
# 2.89 0.00 2.91
system.time(z <- vapply(y, foo, numeric(1)))
# user system elapsed
# 1.35 0.00 1.36
Jan. 1, 2020 update:
system.time({z1 <- numeric(1e6); for(i in seq_along(y)) z1[i] <- foo(y[i])})
# user system elapsed
# 0.52 0.00 0.53
system.time(z <- lapply(y, foo))
# user system elapsed
# 0.72 0.00 0.72
system.time(z3 <- vapply(y, foo, numeric(1)))
# user system elapsed
# 0.7 0.0 0.7
identical(z1, z3)
# [1] TRUE
I've written elsewhere that an example like Shane's doesn't really stress the difference in performance among the various kinds of looping syntax because the time is all spent within the function rather than actually stressing the loop. Furthermore, the code unfairly compares a for loop with no memory with apply family functions that return a value. Here's a slightly different example that emphasizes the point.
foo <- function(x) {
  x <- x + 1
}
y <- numeric(1e6)
system.time({z <- numeric(1e6); for(i in y) z[i] <- foo(i)})
# user system elapsed
# 4.967 0.049 7.293
system.time(z <- sapply(y, foo))
# user system elapsed
# 5.256 0.134 7.965
system.time(z <- lapply(y, foo))
# user system elapsed
# 2.179 0.126 3.301
If you plan to save the result then apply family functions can be much more than syntactic sugar.
(The simple unlist of z takes only 0.2 s, so the lapply is much faster. Initializing z inside the timed for-loop expression is quite fast; since I'm giving the average of the last 5 of 6 runs, moving that outside the system.time would hardly affect things.)
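If you do want a plain numeric vector back from lapply(), the conversion mentioned above is just (a sketch):
z <- lapply(y, foo)
system.time(zv <- unlist(z, use.names = FALSE))  # flattens the list into a numeric vector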
One more thing to note, though, is that there is another reason to use apply family functions, independent of their performance, clarity, or lack of side effects. A for loop typically promotes putting as much as possible within the loop. This is because each loop requires setup of variables to store information (among other possible operations). Apply statements tend to be biased the other way. Often you want to perform multiple operations on your data, several of which can be vectorized but some of which might not be. In R, unlike other languages, it is best to separate those operations out and run the ones that are not vectorized in an apply statement (or a vectorized version of the function) and the ones that are vectorized as true vector operations. This often speeds up performance tremendously.
Taking Joris Meys' example, where he replaces a traditional for loop with a handy R function, we can use it to show the efficiency of writing code in a more R-friendly manner, getting a similar speedup without the specialized function.
set.seed(1) #for reproducability of the results
# The data - copied from Joris Meys answer
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5],100000,replace=T))
Z <- as.factor(sample(letters[1:10],100000,replace=T))
# an R way to generate tapply functionality that is fast and
# shows more general principles about fast R coding
YZ <- interaction(Y, Z)
XS <- split(X, YZ)
m <- vapply(XS, mean, numeric(1))
m <- matrix(m, nrow = length(levels(Y)))
rownames(m) <- levels(Y)
colnames(m) <- levels(Z)
m
This winds up being much faster than the for loop and only a little slower than the built-in optimized tapply function. It's not because vapply is so much faster than for, but because it is only performing one operation in each iteration of the loop. In this code everything else is vectorized. In Joris Meys' traditional for loop many (7?) operations occur in each iteration, and there's quite a bit of setup just for it to execute. Note also how much more compact this is than the for version.
When applying functions over subsets of a vector, tapply can be considerably faster than a for loop. Example:
df <- data.frame(id = rep(letters[1:10], 100000),
value = rnorm(1000000))
f1 <- function(x)
  tapply(x$value, x$id, sum)
f2 <- function(x) {
  res <- 0
  for (i in seq_along(l <- unique(x$id)))
    res[i] <- sum(x$value[x$id == l[i]])
  names(res) <- l
  res
}
library(microbenchmark)
> microbenchmark(f1(df), f2(df), times=100)
Unit: milliseconds
expr min lq median uq max neval
f1(df) 28.02612 28.28589 28.46822 29.20458 32.54656 100
f2(df) 38.02241 41.42277 41.80008 42.05954 45.94273 100
apply, however, in most situations doesn't provide any speed increase, and in some cases can even be a lot slower:
mat <- matrix(rnorm(1000000), nrow=1000)
f3 <- function(x)
  apply(x, 2, sum)
f4 <- function(x) {
  res <- 0
  for (i in 1:ncol(x))
    res[i] <- sum(x[,i])
  res
}
> microbenchmark(f3(mat), f4(mat), times=100)
Unit: milliseconds
expr min lq median uq max neval
f3(mat) 14.87594 15.44183 15.87897 17.93040 19.14975 100
f4(mat) 12.01614 12.19718 12.40003 15.00919 40.59100 100
But for these situations we've got colSums and rowSums:
f5 <- function(x)
  colSums(x)
> microbenchmark(f5(mat), times=100)
Unit: milliseconds
expr min lq median uq max neval
f5(mat) 1.362388 1.405203 1.413702 1.434388 1.992909 100
