I know that loops are slow in R and that I should try to do things in a vectorised manner instead.
But, why? Why are loops slow and apply is fast? apply calls several sub-functions -- that doesn't seem fast.
Update: I'm sorry, the question was ill-posed. I was confusing vectorisation with apply. My question should have been,
"Why is vectorisation faster?"
It's not always the case that loops are slow and apply is fast. There's a nice discussion of this in the May, 2008, issue of R News:
Uwe Ligges and John Fox. R Help Desk: How can I avoid this loop or
make it faster? R News, 8(1):46-50, May 2008.
In the section "Loops!" (starting on pg 48), they say:
Many comments about R state that using loops is a particularly bad idea. This is not necessarily true. In certain cases, it is difficult to write vectorized code, or vectorized code may consume a huge amount of memory.
They further suggest:
Initialize new objects to full length before the loop, rather
than increasing their size within the loop. Do not do things in a
loop that can be done outside the loop. Do not avoid loops simply
for the sake of avoiding loops.
They have a simple example where a for loop takes 1.3 sec but apply runs out of memory.
Loops in R are slow for the same reason any interpreted language is slow: every
operation carries around a lot of extra baggage.
Look at R_execClosure in eval.c (this is the function called to call a
user-defined function). It's nearly 100 lines long and performs all sorts of
operations -- creating an environment for execution, assigning arguments into
the environment, etc.
Think how much less happens when you call a function in C (push args on to
stack, jump, pop args).
So that is why you get timings like these (as joran pointed out in the comment,
it's not actually apply that's being fast; it's the internal C loop in mean
that's being fast. apply is just regular old R code):
A = matrix(as.numeric(1:100000))
Using a loop: 0.342 seconds:
system.time({
  Sum = 0
  for (i in seq_along(A)) {
    Sum = Sum + A[[i]]
  }
  Sum
})
Using sum: unmeasurably small:
sum(A)
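(To measure it the same way, wrap it in system.time() as well; the elapsed time is too small to register:)
system.time(sum(A))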
It's a little disconcerting because, asymptotically, the loop is just as good
as sum; there's no practical reason it should be slow; it's just doing more
extra work each iteration.
So consider:
# 0.370 seconds
system.time({
  I = 0
  while (I < 100000) {
    10
    I = I + 1
  }
})
# 0.743 seconds -- double the time just adding parentheses
system.time({
  I = 0
  while (I < 100000) {
    ((((((((((10))))))))))
    I = I + 1
  }
})
(That example was discovered by Radford Neal)
Because ( in R is an operator, and actually requires a name lookup every time you use it:
> `(` = function(x) 2
> (3)
[1] 2
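(If you try this, remove the binding again so ( behaves normally:)
rm("(")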
Or, in general, interpreted operations (in any language) have more steps. Of course, those steps provide benefits as well: you couldn't do that ( trick in C.
The only answer to the question as posed is: loops are not slow if what you need to do is iterate over a set of data applying some function, and that function or operation is not vectorised. A for() loop will, in general, be as quick as apply(), but possibly a little slower than an lapply() call. That last point is well covered on SO, for example in this answer, and applies if the code involved in setting up and operating the loop is a significant part of the overall computational burden of the loop.
The reason many people think for() loops are slow is that they, the users, are writing bad code. In general (though there are several exceptions), if you need to expand or grow an object, that will involve copying, so you incur the overhead of both copying and growing the object. This is not restricted to loops, but if you copy and grow at each iteration of a loop, the loop will of course be slow because you are incurring many copy/grow operations.
The general idiom for using for() loops in R is that you allocate the storage you require before the loop starts, and then fill in the object thus allocated. If you follow that idiom, loops will not be slow. This is what apply() manages for you, but it is just hidden from view.
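For example, a minimal sketch of that preallocate-then-fill idiom:
n <- 1e5
out <- numeric(n)            # allocate the full length up front
for (i in seq_len(n)) {
  out[i] <- i^2              # fill in place; nothing is copied or grown
}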
Of course, if a vectorised function exists for the operation you are implementing with the for() loop, don't do that. Likewise, don't use apply() etc if a vectorised function exists (e.g. apply(foo, 2, mean) is better performed via colMeans(foo)).
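As a quick illustrative check (timings will vary by machine; foo here is just a throwaway example matrix):
foo <- matrix(rnorm(1e6), ncol = 100)
system.time(apply(foo, 2, mean))   # R-level looping over columns
system.time(colMeans(foo))         # internal C code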
Just as a comparison (don't read too much into it!): I ran a (very) simple for loop in R and in JavaScript in Chrome and IE 8.
Note that Chrome does compilation to native code, and R with the compiler package compiles to bytecode.
# In R 2.13.1, this took 500 ms
f <- function() { sum<-0.5; for(i in 1:1000000) sum<-sum+i; sum }
system.time( f() )
# And the compiled version took 130 ms
library(compiler)
g <- cmpfun(f)
system.time( g() )
#Gavin Simpson: Btw, it took 1162 ms in S-Plus...
And the "same" code as JavaScript:
// In IE8, this took 282 ms
// In Chrome 14.0, this took 4 ms
function f() {
  var sum = 0.5;
  for (var i = 1; i <= 1000000; ++i) sum = sum + i;
  return sum;
}
var start = new Date().getTime();
f();
var time = new Date().getTime() - start;
I am struggling with the parallel package. Part of the problem is that I am quite new to parallel computing and I lack a general understanding of what works and what doesn't (and why). So, apologies if what I am about to ask doesn't make sense from the outset or simply can't work in principle (that might well be).
I am trying to optimize a portfolio of securities that consists of individual sub-portfolios. The sub-portfolios are created independently of one another, so this task should be suitable for a parallel approach (the portfolios are combined only at a later stage).
Currently I am using a serial approach: lapply takes care of it and it works just fine. The whole thing is wrapped in a function, although the wrapper doesn't really have a purpose beyond preparing the list over which lapply will iterate, applying FUN.
The (serial) code looks as follows:
assemble_buckets <- function(bucket_categories, ...) {
  optimize_bucket <- function(bucket_category, ...) {}
  SAA_results <- lapply(bucket_categories, FUN=optimize_bucket, ...)
  names(SAA_results) <- bucket_categories
  SAA_results
}
I am testing the performance using a simple loop.
a <- 1000
for (n in 1:a) {
  if (n == 1) {start_time <- Sys.time()}
  x <- assemble_buckets(bucket_categories, ...)
  if (n == a) {print(Sys.time() - start_time)}
}
Time for 1000 replications is ~19.78 minutes. Not too bad, but I need a quicker approach, because I want to let this run on a growing selection of securities.
So naturally, I'd like to use a parallel approach. The (naïve) parallelized code using parLapply looks as follows (it really is my first attempt…):
assemble_buckets_p <- function(cluster_nr, bucket_categories, ...) {
  f1 <- function(...) {}
  f2 <- function(...) {}
  optimize_bucket_p <- function(cluster_nr, bucket_categories, ...) {}
  clusterExport(cluster_nr, varlist=c("optimize_bucket_p", "f1", "f2"), envir=environment())
  clusterCall(cluster_nr, function() library(...))
  SAA_results <- parLapply(cluster_nr, bucket_categories, optimize_bucket_p, ...)
  names(SAA_results) <- bucket_categories
  SAA_results
}
f1 and f2 were previously wrapped inside the optimizer function; they are now outside because the whole thing runs significantly faster with them separate (it would also be interesting to know why that is).
I am again testing the performance using a similar loop structure.
cluster_nr <- makeCluster(min(detectCores(), length(bucket_categories)))
b <- 1000
for (n in 1:b) {
  if (n == 1) {start_time <- Sys.time()}
  x <- assemble_buckets_p(cluster_nr, bucket_categories, ...)
  if (n == b) {print(Sys.time() - start_time)}
}
Runtime here is significantly faster, 5.97 mins, so there is some improvement. As the portfolios grow larger, the benefits should increase further, so I conclude parallelization is worthwhile.
Now I am trying to use the parallelized version of the function inside a wrapper. The wrapper function has multiple layers and is basically, at its top level, a loop that rebalances the whole portfolio (multiple asset classes) for a given point in time.
Here comes the problem: when I let this run, something weird happens. Whilst the parallelized version actually does seem to be working (execution doesn't stop), it takes much, much longer than the serial one - something like a factor of 100 longer.
In fact, the parallel version takes so much longer that it is way too slow to be of any use. What puzzles me is that, as said above, when I use the optimizer function on a standalone basis it actually seems to work - and it keeps getting more enigmatic...
I have been trying to further isolate the issue since an earlier version of this question, and I think I've made some progress. I wrapped my optimizer function into a self-sufficient test function called test_p().
test_p <- function() {
  a <- 1
  for (n in 1:a) {
    if (n == 1) {start_time <- Sys.time()}
    x <- assemble_buckets_p(...)
    if (n == a) {print(Sys.time() - start_time)}
  }
}
test_p() reports its runtime using print(), and I can put it anywhere in the multi-layered wrapper I want; the wrapper structure is as follows:
optimize_SAA <- function(...) {                  # [1]
  construct_portfolio <- function(...) {         # [2]
    construct_assetclass <- function(...) {      # [3]
      assemble_buckets <- function(...) {        # note: this is where I initially wanted to put the parallel part
}}}}
So now here's the thing: when I add test_p() at the [1] or [2] layer, it works just as if it were standalone. It can't do anything useful there because it's in the wrong place, but it yields a result, using multiple CPU cores, within 0.636 secs.
As soon as I put it down at the [3] layer or below, executing the very same function takes 40 seconds. I really have tried everything I could think of, but I have no idea why that is.
To sum up, these are my questions:
Does anyone have an idea what might be the root cause of this problem?
Why does the runtime of the parallel code seem to depend on where the code sits?
Is there anything obvious that I could/should try to fix this?
Many thanks in advance!
Is there an equivalent to numpy's apply_along_axis() (or R's apply()) in Julia? I've got a 3D array and I would like to apply a custom function to each pair of coordinates of dimensions 1 and 2. The results should be in a 2D array.
Obviously, I could do two nested for loops iterating over the first and second dimension and then reshape, but I'm worried about performance.
This example produces the output I desire (I am aware this is slightly pointless for sum(); it's just a dummy here):
test = reshape(collect(1:250), 5, 10, 5)
a = []
# fill column by column (i varies fastest) so reshape(a, 5, 10) puts sum(test[i, j, :]) at [i, j]
for j in 1:10
    for i in 1:5
        push!(a, sum(test[i, j, :]))
    end
end
println(reshape(a, 5, 10))
Any suggestions for a faster version?
Cheers
Julia has the mapslices function which should do exactly what you want. But keep in mind that Julia is different from other languages you might know: library functions are not necessarily faster than your own code, because they may be written to a level of generality higher than what you actually need, and in Julia loops are fast. So it's quite likely that just writing out the loops will be faster.
That said, a couple of tips:
Read the performance tips section of the manual. From that you'd learn to put everything in a function, and to not use untyped arrays like a = [].
The slice or sub function can avoid making a copy of the data.
How about
f = sum # your function here
Int[f(test[i, j, :]) for i in 1:5, j in 1:10]
The last line is a two-dimensional array comprehension.
The Int in front is to guarantee the type of the elements; this should not be necessary if the comprehension is inside a function.
Note that you should (almost) never use untyped (Any) arrays, like your a = [], since this will be slow. You can write a = Int[] instead to create an empty array of Ints.
EDIT: Note that in Julia, loops are fast. The need for creating functions like that in Python and R comes from the inherent slowness of loops in those languages. In Julia it's much more common to just write out the loop.
I have a function that calculates an index in R for a matrix of binary data. The goal of this function is to calculate a person-fit index for binary response data called HT. It divides the covariance between the response vectors of two respondents (e.g. persons i and j) by the maximum possible covariance between the two response patterns, which can be calculated from the means of the response vectors (e.g. Bi). The function is:
fit <- function(Data){
  N <- dim(Data)[1]
  L <- dim(Data)[2]
  r <- rowSums(Data)
  p.cor.n <- (r/L)                 # proportion correct for each response pattern
  sig.ij <- var(t(Data), t(Data))  # covariance of response patterns
  diag(sig.ij) <- 0
  H.num <- apply(sig.ij, 1, sum)
  H.denom1 <- matrix(p.cor.n, N, 1) %*% matrix(1-p.cor.n, 1, N)  # Bi(1-Bj)
  H.denom2 <- matrix(1-p.cor.n, N, 1) %*% matrix(p.cor.n, 1, N)  # (1-Bi)Bj
  H.denomm <- ifelse(H.denom1 > H.denom2, H.denom2, H.denom1)
  diag(H.denomm) <- 0
  H.denom <- apply(H.denomm, 1, sum)
  HT <- H.num / H.denom
  return(HT)
}
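A quick usage sketch on simulated binary data, just to show the call (the data here are arbitrary):
set.seed(1)
Data <- matrix(rbinom(1000 * 20, 1, 0.7), 1000, 20)
ht <- fit(Data)
summary(ht)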
This function works fine with small matrices (e.g. 1000 by 20), but when I increased the number of rows (e.g. to 10000) I ran into a memory limitation problem. The source of the problem is this line in the function:
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
which selects the denominator for each response pattern. Is there any other way to rewrite this line that demands less memory?
P.S.: you can try data<-matrix(rbinom(200000,1,.7),10000,20).
Thanks.
Well, here is one way you could shave a little time off. Overall I still think there might be a better theoretical answer in terms of the approach you take, but here goes. I wrote an Rcpp function that specifically implements ifelse in the sense you use it above. It only works for square matrices like in your example. BTW, I wasn't really trying to optimize R's ifelse, because I'm pretty sure it already calls internal C functions; I was just curious whether a C++ function designed to do exactly what you are trying to do, and nothing more, would be faster. I shaved 11 seconds off. (It keeps the smaller of the two values, matching your ifelse() line.)
C++ Function:
library(Rcpp)
library(inline)
code <-"
Rcpp::NumericMatrix x(xs);
Rcpp::NumericMatrix y(ys);
Rcpp::NumericMatrix ans (x.nrow(), y.ncol());
int ii, jj;
for (ii=0; ii < x.nrow(); ii++){
for (jj=0; jj < x.ncol(); jj++){
if(x(ii,jj) < y(ii,jj)){
ans(ii,jj) = y(ii,jj);
} else {
ans(ii,jj) = x(ii,jj);
}
}
}
return(ans);"
matIfelse <- cxxfunction(signature(xs="numeric",ys="numeric"),
plugin="Rcpp",
body=code)
Now if you replace ifelse in your function above with matIfelse you can give it a try. For example:
H.denomm <- matIfelse(H.denom1,H.denom2)
# Time for old version to run with the matrix you suggested above matrix(rbinom(200000,1,.7),10000,20)
# user system elapsed
# 37.78 3.36 41.30
# Time to run with dedicated Rcpp function
# user system elapsed
# 28.25 0.96 30.22
Not bad - roughly 36% faster. Again, though, I don't claim that this is generally faster than ifelse, just that it is in this very specific instance. Cheers
P.S. I forgot to mention that to use Rcpp you need to have Rtools installed (on Windows), and during the install make sure the environment path variables are added for Rtools and gcc. On my machine those look like: c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin
Edit:
I just noticed that you were running into memory problems... I'm not sure whether you are running a 32- or 64-bit machine, but you probably just need to allow R to use more RAM. I'll assume you are running 32-bit to be safe, so you should be able to let R take at least 2 GB of RAM. Give this a try: memory.limit(size=1900). size is in megabytes, so I just went for 1.9 GB to be safe. I'd imagine this is plenty of memory for what you need.
Do you actually intend to do N x N independent ifelse(H.denom1 > H.denom2, ...) operations?
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
If you really do, look for a library or, alternatively, a better decomposition.
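For what it's worth, if the intent really is just the element-wise smaller of the two denominator matrices, base R's pmin() does that in one vectorised step and typically with less overhead than ifelse() (a possible drop-in, assuming that is the intent):
H.denomm <- pmin(H.denom1, H.denom2)   # element-wise minimum, same result as the ifelse() line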
If you told us in general terms what this code is trying to do, it would help us answer it.
I've been working on a simple collection of functions for my supervisor that do some simple initial genome-scale stats - things that are quick to run and give my team an early indication of which future analyses may be worth more time, for example in RDP4 or BioC (just to explain why I haven't gone straight to BioConductor). I'd like to speed some things up to allow larger contig sizes, so I've decided to use doParallel and foreach to rewrite some for loops. Below is one simple function which identifies invariant sites (columns where all bases are identical) in a sequence matrix and removes them.
strip.invar <- function(x) {
  cat("\nNow removing invariant sites from DNA data matrix, this may take some time...\n")
  prog <- txtProgressBar(min=0, max=ncol(x), style=3)
  removals <- c()
  for (i in 1:ncol(x)){
    setTxtProgressBar(prog, i)
    if (length(unique(x[,i])) == 1) {
      removals <- append(removals, i)
    }
  }
  newDnaMatrix <- x[,-removals]
  return(newDnaMatrix)
}
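For reference, a tiny reproducible call (the 'DNA' matrix here is just simulated characters):
set.seed(1)
dna <- matrix(sample(c("a", "c", "g", "t"), 200, replace = TRUE), nrow = 10)
dna[, 3] <- "a"              # force one invariant site
dim(strip.invar(dna))        # one column fewer: the invariant site is dropped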
After reading the introduction to doParallel and foreach I tried to make a version to accommodate for more cores - on my mac this is 8 - a quad core with two threads per core - 8 virtual cores:
strip.invar <- function(x, coresnum=detectCores()){
  cat("\nNow removing invariant sites from DNA data matrix, this may take some time...\n")
  prog <- txtProgressBar(min=0, max=ncol(x), style=3)
  removals <- c()
  if (coresnum > 1) {
    cl <- makeCluster(coresnum)
    registerDoParallel(cl)
    foreach(i=1:ncol(x)) %dopar% {
      setTxtProgressBar(prog, i)
      if (all(x[,i] == x[[1,i]])){
        removals <- append(removals, i)
      }
    }
  } else {
    for (i in 1:ncol(x)){
      setTxtProgressBar(prog, i)
      if (length(unique(x[,i])) == 1) {
        removals <- append(removals, i)
      }
    }
  }
  newDnaMatrix <- x[,-removals]
  return(newDnaMatrix)
}
However, if I run this with the number of cores set to 8, I'm not entirely sure it works - I can't see the progress bar doing anything, but then I've heard that printing to screen and anything involving graphics devices is tricky with parallel computing in R. It still seems to take some time, and my laptop gets 'very' hot, so I'm not sure if I've done this correctly. I've tried it after seeing a few examples (I successfully ran the nice bootstrap example in the vignette), but I'm bound to hit learning bumps. As an aside, I thought I'd also ask people's opinion: what is the best speed-up for R code bottlenecks where loops or apply are involved - parallelising, or Rcpp?
Thanks.
My other answer was not correct, since the column mean being equal to the first value is not a sufficient test for the number of unique values. So here is another answer:
You can optimize the loop by using apply.
set.seed(42)
dat <- matrix(sample(1:5, 1e5, replace=TRUE), nrow=2)
res1 <- strip.invar(dat)

strip.invar2 <- function(dat) {
  ix <- apply(dat, 2, function(x) length(unique(x)) > 1)
  dat[, ix]
}
res2 <- strip.invar2(dat)

all.equal(res1, res2)
#TRUE

library(microbenchmark)
microbenchmark(strip.invar(dat), strip.invar2(dat), times=10)
#Unit: milliseconds
#              expr       min        lq    median       uq      max neval
#  strip.invar(dat) 2514.7995 2529.2827 2547.6751 2678.464 2792.405    10
# strip.invar2(dat)  933.3701  945.5689  974.7564 1008.589 1018.400    10
This improves performance quite a bit, though not as much as you could achieve if vectorization was possible.
Parallelization won't give better performance here, since each iteration does not require much computation on its own, so the parallelization overhead will actually increase the time needed. However, you could split the data and process chunks in parallel.
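A minimal sketch of that idea (the chunk count and cluster size here are arbitrary; dat and res2 are from the example above):
library(parallel)
cl <- makeCluster(2)
chunks <- split(seq_len(ncol(dat)), cut(seq_len(ncol(dat)), 2, labels = FALSE))
keep <- parLapply(cl, chunks, function(cols, m) {
  apply(m[, cols, drop = FALSE], 2, function(x) length(unique(x)) > 1)
}, m = dat)
stopCluster(cl)
res3 <- dat[, unlist(keep)]
all.equal(res2, res3)   # TRUE: the same columns are kept as in the serial version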
Firstly, try running cl <- makeCluster( coresnum-1 ). The master R process is already using one of your cores and is used to dispatch and receive results from the slave jobs, so you have 7 free cores for the slave jobs. I think you will be effectively queuing one of your foreach loops to wait until one of the previous loops finishes and therefore the job will take longer to complete.
Secondly, what you would normally see on the console when running this function in a non-parallel environment is still printed to the console; it's just that each job's output goes to the slave process's console, so you won't see it. You can, however, save the output from the different foreach loops to a text file to examine it. Here is an example of how to save console output. Stick the code there inside your foreach statement.
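A minimal sketch of one way to capture worker output (this is not the linked example; it just uses the outfile argument of makeCluster):
library(doParallel)
cl <- makeCluster(4, outfile = "worker_log.txt")   # workers' console output is redirected to this file
registerDoParallel(cl)
res <- foreach(i = 1:8) %dopar% {
  cat(sprintf("processing column %d\n", i))        # ends up in worker_log.txt, not on screen
  i^2
}
stopCluster(cl)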
Your laptop will get very hot because all of your cores are working at 100% capacity while you are running this job.
I have found the foreach package to provide an excellent set of functions for simple parallel processing. Rcpp may (will?!) give you much greater performance, but how comfortable are you writing C++ code, what is the runtime of this function, and how often will it be used? I always think about these things first.
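And on that last point, it is worth simply measuring the current function before deciding, e.g. with the example matrix from the other answer:
system.time(strip.invar(dat))   # how long does one call actually take right now?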