Based on what I've read before, vectorization is a form of parallelization known as SIMD. It allows processors to execute the same instruction (such as addition) on an array simultaneously.
However, I got confused when reading The Relationship between Vectorized and Devectorized Code regarding Julia's and R's vectorization performance. The post claims that devectorized Julia code (via loops) is faster than the vectorized code in both Julia and R, because:
This confuses some people who are not familiar with the internals of
R. It is therefore worth noting how one improves the speed of R code.
The process of performance improvement is quite simple: one starts
with devectorized R code, then replaces it with vectorized R code and
then finally implements this vectorized R code in devectorized C code.
This last step is unfortunately invisible to many R users, who
therefore think of vectorization per se as a mechanism for increasing
performance. Vectorization per se does not help make code faster. What
makes vectorization in R effective is that it provides a mechanism for
moving computations into C, where a hidden layer of devectorization
can do its magic.
It claims that R turns vectorized code, written in R, into devectorized code in C. If vectorization is faster (as a form of parallelization), why would R devectorize the code and why is that a plus?
"Vectorization" in R, is a vector processing in R's interpreter's view. Take the function cumsum as an example. On entry, R interpreter sees that a vector x is passed into this function. However, the work is then passed to C language that R interpreter can not analyze / track. While C is doing work, R is just waiting. By the time that R's interpreter comes back to work, a vector has been processed. So in R's view, it has issued a single instruction but processed a vector. This is an analogy to the concept of SIMD - "single instruction, multiple data".
Not just the cumsum function that takes a vector and returns a vector is seen as "vectorization" in R, functions like sum that takes a vector and returns a scalar is also a "vectorization".
Simply put: whenever R calls some compiled code for a loop, it is a "vectorization". If you wonder why this kind of "vectorization" is useful, it is because a loop written by a compiled language is faster than a loop written in an interpreted language. The C loop is translated to machine language that a CPU can understand. However, if a CPU wants to execute an R loop, it needs R's interpreter's help to read it, iteration by iteration. This is like, if you know Chinese (the hardest human language), you can respond to someone speaking Chinese to you faster; otherwise, you need a translator to first translator Chinese to you sentence after sentence in English, then you respond in English, and the translator make it back to Chinese sentence by sentence. The effectiveness of communication is largely reduced.
x <- runif(1e+7)
## R loop
system.time({
sumx <- 0
for (x0 in x) sumx <- sumx + x0
sumx
})
# user system elapsed
# 1.388 0.000 1.347
## C loop
system.time(sum(x))
# user system elapsed
# 0.032 0.000 0.030
Be aware that "vectorization" in R is just an analogy to SIMD but not a real one. A real SIMD uses CPU's vector registers for computations hence is a true parallel computing via data parallelism. R is not a language where you can program CPU registers; you have to write compiled code or assembly code for that purpose.
R's "vectorization" does not care how a loop written in a compiled language is really executed; after all that is beyond R's interpreter's knowledge. Regarding whether these compiled code will be executed with SIMD, read Does R leverage SIMD when doing vectorized calculations?
More on "vectorization" in R
I am not a Julia user, but Bogumił Kamiński has demonstrated an impressive feature of that language: loop fusion. Julia can do this, because, as he points out, "vectorization in Julia is implemented in Julia", not outside the language.
This reveals a downside of R's vectorization: speed often comes at a price of memory usage. I am not saying that Julia won't have this problem (as I don't use it, I don't know), but this is definitely true for R.
Here is an example: Fastest way to compute row-wise dot products between two skinny tall matrices in R. rowSums(A * B) is a "vectorization" in R, as both "*" and rowSums are coded in C language as a loop. However, R can not fuse them into a single C loop to avoid generating the temporary matrix C = A * B into RAM.
Another example is R's recycling rule or any computations relying on such rule. For example, when you add a scalar a to a matrix A by A + a, what really happens is that a is first replicated to be a matrix B that has the same dimension with A, i.e., B <- matrix(a, nrow(A), ncol(A)), then an addition between two matrices are calculated: A + B. Clearly the generation of the temporary matrix B is undesired, but sorry, you can't do it better unless you write your own C function for A + a and call it in R. This is described as "such a fusion is possible only if explicitly implemented" in Bogumił Kamiński's answer.
To deal with the memory effects of many temporary results, R has a sophisticated mechanism called "garbage collection". It helps, but memory can still explode if you generate some really big temporary result somewhere in your code. A good example is the function outer. I have written many answers using this function, but it is particularly memory-unfriendly.
I might have been off-topic in this edit, as I begin to discuss the side effect of "vectorization". Use it with care.
Put memory usage in mind; there might be a more memory efficient vectorized implementation. For example, as mentioned in the linked thread on row-wise dot products between two matrices, c(crossprod(x, y)) is better than sum(x * y).
Be prepared to use CRAN R packages that have compiled code. If you find existing vectorized functions in R limited to do your task, explore CRAN for possible R packages that can do it. You can ask a question with your coding bottleneck on Stack Overflow, and somebody may point you to the right function in the right package.
Be happy to write your own compiled code.
I think it is worth to note that the post you are referring to does not cover all current functionality of vectorization in Julia.
The important thing is that vectorization in Julia is implemented in Julia, as opposed to R, where it is implemented outside of the language. This is explained in detail in this post: https://julialang.org/blog/2017/01/moredots.
The consequence of the fact that Julia can perform fusion of any sequence of broadcasted operations into a single loop. In other languages that provide vectorization such a fusion is possible only if explicitly implemented.
In summary:
In Julia you can expect that vectorized code is as fast as a loop.
If you perform a sequence of vectorized operations then in general you can expect Julia to be faster than R as it can avoid allocation of intermediate results of the computations.
EDIT:
Following the comment of 李哲源 here is an example showing that Julia is able to avoid any allocations if you want to increase all elements of a vector x by 1:
julia> using BenchmarkTools
julia> x = rand(10^6);
julia> #benchmark ($x .+= 1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 819.230 μs (0.00% GC)
median time: 890.610 μs (0.00% GC)
mean time: 929.659 μs (0.00% GC)
maximum time: 2.802 ms (0.00% GC)
--------------
samples: 5300
evals/sample: 1
In the code .+= performs addition in place (adding $ in front of the expression is only needed for benchmarking, in normal code it would be x .+= 1). And we see that no memory allocation was done.
If we compare this to a possible implementation in R:
> library(microbenchmark)
> x <- runif(10^6)
> microbenchmark(x <- x + 1)
Unit: milliseconds
expr min lq mean median uq max neval
x <- x + 1 2.205764 2.391911 3.999179 2.599051 5.061874 30.91569 100
we can see that it not only saves memory, but also leads to a faster execution of the code.
Related
Original motivation behind this is that I have a dynamically sized array of floats that I want to pass to R through Rcpp without either incurring the cost of a zeroing out nor the cost of a deep copy.
Originally I had thought that there might be some way to take heap allocated array, make it aware to R's gc system and then wrap it with other data to create a "Rcpp::NumericVector" but it seems like that that's not possible - or doable with my current knowledge.
However and correct me if I'm wrong it looks like simply constructing a NumericVector with a size N and then using it as an N sized allocation will call R.h's Rf_allocVector and that itself does not either zero out the allocated array - I tested it on a small C program that gets dyn.loaded into R and it looks like garbage values. I also took a peek at the assembly and there doesn't seem to be any zeroing out.
Can anyone confirm this or offer any alternate solution?
Welcome to StackOverflow.
You marked this rcpp but that is a function from the C API of R -- whereas the Rcpp API offers you its constructors which do in fact set the memory tp zero:
> Rcpp::cppFunction("NumericVector goodVec(int n) { return NumericVector(n); }")
> sum(goodVec(1e7))
[1] 0
>
This creates a dynamically allocated vector using R's memory functions. The vector is indistinguishable from R's own. And it has the memory set to zero
as we use R_Calloc, which is documented in Writing R Extension to setting the memory to zero. (We may also use memcpy() explicitly, you can check the sources.)
So in short, you just have yourself confused over what the C API of R, as well as Rcpp offer, and what is easiest to use when. Keep reading documentation, running and writing examples, and studying existing code. It's all out there!
Let
x = matrix(rnorm(1000000), nrow = 5000)
I would like to compute matrix multiplication with its transpose: x %*% t(x).
After googling I found a possible faster way of doing the above is
tcrossprod(x)
And time taken is
user system elapsed
2.975 0.000 2.960
Is there is any other R-function which can do the task faster than the above function?
No. At R level this is already the fastest. But internally it calls level-3 BLAS routine dsyrk. So if you can have a high performance BLAS library this will be a lot faster. Try linking OpenBLAS to your R.
Linking a BLAS library does not require rebuilding R. You may have a read on my question linking R to BLAS library for an overview, which contains several links showing you how to set up alias then switch between different BLAS libraries on the machine.
Alternatively, you can read my extremely long question and answer Without root access, run R with tuned BLAS when it is linked with reference BLAS which gives various ways to use an external BLAS library even if R is linked to reference BLAS library.
As a side note, for a matrix with dimension m * n, dsyrk has FLOP counts n * m ^ 2. (Note, this is the computational costs for tcrossprod. For crossprod it is m * n ^ 2.)
You have m = 5000 and n = 200, and computation takes 2.96s. Thus, computation has speed: (200 * 5000 ^ 2 / 2.96) * 1e-9 = 1.68 GFLOPs. Well, this is an ordinary level of performance so at the moment you are definitely using reference BLAS. With OpenBLAS, performance can reach 10 GFLOPs or more, depending on your CPU. Good luck!
I know that loops are slow in R and that I should try to do things in a vectorised manner instead.
But, why? Why are loops slow and apply is fast? apply calls several sub-functions -- that doesn't seem fast.
Update: I'm sorry, the question was ill-posed. I was confusing vectorisation with apply. My question should have been,
"Why is vectorisation faster?"
It's not always the case that loops are slow and apply is fast. There's a nice discussion of this in the May, 2008, issue of R News:
Uwe Ligges and John Fox. R Help Desk: How can I avoid this loop or
make it faster? R News, 8(1):46-50, May 2008.
In the section "Loops!" (starting on pg 48), they say:
Many comments about R state that using loops is a particularly bad idea. This is not necessarily true. In certain cases, it is difficult to write vectorized code, or vectorized code may consume a huge amount of memory.
They further suggest:
Initialize new objects to full length before the loop, rather
than increasing their size within the loop. Do not do things in a
loop that can be done outside the loop. Do not avoid loops simply
for the sake of avoiding loops.
They have a simple example where a for loop takes 1.3 sec but apply runs out of memory.
Loops in R are slow for the same reason any interpreted language is slow: every
operation carries around a lot of extra baggage.
Look at R_execClosure in eval.c (this is the function called to call a
user-defined function). It's nearly 100 lines long and performs all sorts of
operations -- creating an environment for execution, assigning arguments into
the environment, etc.
Think how much less happens when you call a function in C (push args on to
stack, jump, pop args).
So that is why you get timings like these (as joran pointed out in the comment,
it's not actually apply that's being fast; it's the internal C loop in mean
that's being fast. apply is just regular old R code):
A = matrix(as.numeric(1:100000))
Using a loop: 0.342 seconds:
system.time({
Sum = 0
for (i in seq_along(A)) {
Sum = Sum + A[[i]]
}
Sum
})
Using sum: unmeasurably small:
sum(A)
It's a little disconcerting because, asymptotically, the loop is just as good
as sum; there's no practical reason it should be slow; it's just doing more
extra work each iteration.
So consider:
# 0.370 seconds
system.time({
I = 0
while (I < 100000) {
10
I = I + 1
}
})
# 0.743 seconds -- double the time just adding parentheses
system.time({
I = 0
while (I < 100000) {
((((((((((10))))))))))
I = I + 1
}
})
(That example was discovered by Radford Neal)
Because ( in R is an operator, and actually requires a name lookup every time you use it:
> `(` = function(x) 2
> (3)
[1] 2
Or, in general, interpreted operations (in any language) have more steps. Of course, those steps provide benefits as well: you couldn't do that ( trick in C.
The only Answer to the Question posed is; loops are not slow if what you need to do is iterate over a set of data performing some function and that function or the operation is not vectorized. A for() loop will be as quick, in general, as apply(), but possibly a little bit slower than an lapply() call. The last point is well covered on SO, for example in this Answer, and applies if the code involved in setting up and operating the loop is a significant part of the overall computational burden of the loop.
Why many people think for() loops are slow is because they, the user, are writing bad code. In general (though there are several exceptions), if you need to expand/grow an object, that too will involve copying so you have both the overhead of copying and growing the object. This is not just restricted to loops, but if you copy/grow at each iteration of a loop, of course, the loop is going to be slow because you are incurring many copy/grow operations.
The general idiom for using for() loops in R is that you allocate the storage you require before the loop starts, and then fill in the object thus allocated. If you follow that idiom, loops will not be slow. This is what apply() manages for you, but it is just hidden from view.
Of course, if a vectorised function exists for the operation you are implementing with the for() loop, don't do that. Likewise, don't use apply() etc if a vectorised function exists (e.g. apply(foo, 2, mean) is better performed via colMeans(foo)).
Just as a comparison (don't read too much into it!): I ran a (very) simple for loop in R and in JavaScript in Chrome and IE 8.
Note that Chrome does compilation to native code, and R with the compiler package compiles to bytecode.
# In R 2.13.1, this took 500 ms
f <- function() { sum<-0.5; for(i in 1:1000000) sum<-sum+i; sum }
system.time( f() )
# And the compiled version took 130 ms
library(compiler)
g <- cmpfun(f)
system.time( g() )
#Gavin Simpson: Btw, it took 1162 ms in S-Plus...
And the "same" code as JavaScript:
// In IE8, this took 282 ms
// In Chrome 14.0, this took 4 ms
function f() {
var sum = 0.5;
for(i=1; i<=1000000; ++i) sum = sum + i;
return sum;
}
var start = new Date().getTime();
f();
time = new Date().getTime() - start;
I have a function that calculates an index in R for a matrix of binary data. The goal of this function is to calculate a person-fit index for binary response data called HT. It divides the covariance between response vectors of two respondents (e.g. person i & j) by the maximum possible covariance between the two response patterns which can be calculated using the mean of response vectors(e.g. Bi).The function is:
fit<-function(Data){
N<-dim(Data)[1]
L<-dim(Data)[2]
r <- rowSums(Data)
p.cor.n <- (r/L) #proportion correct for each response pattern
sig.ij <- var(t(Data),t(Data)) #covariance of response patterns
diag(sig.ij) <-0
H.num <- apply(sig.ij,1,sum)
H.denom1 <- matrix(p.cor.n,N,1) %*% matrix(1-p.cor.n,1,N) #Bi(1-Bj)
H.denom2 <- matrix(1-p.cor.n,N,1) %*% matrix(p.cor.n,1,N) #(1-Bi)Bj
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
diag(H.denomm) <-0
H.denom <- apply(H.denomm,1,sum)
HT <- H.num / H.denom
return(HT)
}
This function works fine with small matrices (e.g. 1000 by 20) but when I increased the number of rows (e.g. to 10000) I came across to memory limitation problem. The source of the problem is this line in the function:
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
which selects the denominator for each response pattern.Is there any other way to re-write this line which demands lower memory?
P.S.: you can try data<-matrix(rbinom(200000,1,.7),10000,20).
Thanks.
Well here is one way you could shave a little time off. Overall I still think there might be a better theoretical answer in terms of the approach you take....But here goes. I wrote up an Rcpp function that specifically implements ifelse in the sense you use it in above. It only works for square matrices like in your example. BTW I wasn't really trying to optimize R ifelse because I'm pretty sure it already calls internal C functions. I was just curious if a C++ function designed to do exactly what you are trying to do and nothing more would be faster. I shaved 11 seconds off. (This selects the larger value).
C++ Function:
library(Rcpp)
library(inline)
code <-"
Rcpp::NumericMatrix x(xs);
Rcpp::NumericMatrix y(ys);
Rcpp::NumericMatrix ans (x.nrow(), y.ncol());
int ii, jj;
for (ii=0; ii < x.nrow(); ii++){
for (jj=0; jj < x.ncol(); jj++){
if(x(ii,jj) < y(ii,jj)){
ans(ii,jj) = y(ii,jj);
} else {
ans(ii,jj) = x(ii,jj);
}
}
}
return(ans);"
matIfelse <- cxxfunction(signature(xs="numeric",ys="numeric"),
plugin="Rcpp",
body=code)
Now if you replace ifelse in your function above with matIfelse you can give it a try. For example:
H.denomm <- matIfelse(H.denom1,H.denom2)
# Time for old version to run with the matrix you suggested above matrix(rbinom(200000,1,.7),10000,20)
# user system elapsed
# 37.78 3.36 41.30
# Time to run with dedicated Rcpp function
# user system elapsed
# 28.25 0.96 30.22
Not bad roughly 36% faster, again though I don't claim that this is generally faster than ifelse just in this very specific instance. Cheers
P.s. I forgot to mention that to use Rcpp you need to have Rtools installed and during the install make sure environment path variables are added for Rtools and gcc. On my machine those would look like: c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin
Edit:
I just noticed that you were running into memory problems... So I'm not sure if you are running a 32 or 64 bit machine, but you probably just need to allow R to increase the amount of RAM it can use. I'll assume you are running on 32 bit to be safe. So you should be able to let R take at least 2gigs of RAM. Give this a try: memory.limit(size=1900) size is in megabytes so I just went for 1.9 gigs just to be safe. I'd imagine this is plenty of memory for what you need.
Do you actually intend to do NxL independent ifelse((H.denom1>H.denom2,... operations?
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
If you really do, look for a library or alternatively, a better decomposition.
If you told us in general terms what this code is trying to do, it would help us answer it.
I know that loops are slow in R and that I should try to do things in a vectorised manner instead.
But, why? Why are loops slow and apply is fast? apply calls several sub-functions -- that doesn't seem fast.
Update: I'm sorry, the question was ill-posed. I was confusing vectorisation with apply. My question should have been,
"Why is vectorisation faster?"
It's not always the case that loops are slow and apply is fast. There's a nice discussion of this in the May, 2008, issue of R News:
Uwe Ligges and John Fox. R Help Desk: How can I avoid this loop or
make it faster? R News, 8(1):46-50, May 2008.
In the section "Loops!" (starting on pg 48), they say:
Many comments about R state that using loops is a particularly bad idea. This is not necessarily true. In certain cases, it is difficult to write vectorized code, or vectorized code may consume a huge amount of memory.
They further suggest:
Initialize new objects to full length before the loop, rather
than increasing their size within the loop. Do not do things in a
loop that can be done outside the loop. Do not avoid loops simply
for the sake of avoiding loops.
They have a simple example where a for loop takes 1.3 sec but apply runs out of memory.
Loops in R are slow for the same reason any interpreted language is slow: every
operation carries around a lot of extra baggage.
Look at R_execClosure in eval.c (this is the function called to call a
user-defined function). It's nearly 100 lines long and performs all sorts of
operations -- creating an environment for execution, assigning arguments into
the environment, etc.
Think how much less happens when you call a function in C (push args on to
stack, jump, pop args).
So that is why you get timings like these (as joran pointed out in the comment,
it's not actually apply that's being fast; it's the internal C loop in mean
that's being fast. apply is just regular old R code):
A = matrix(as.numeric(1:100000))
Using a loop: 0.342 seconds:
system.time({
Sum = 0
for (i in seq_along(A)) {
Sum = Sum + A[[i]]
}
Sum
})
Using sum: unmeasurably small:
sum(A)
It's a little disconcerting because, asymptotically, the loop is just as good
as sum; there's no practical reason it should be slow; it's just doing more
extra work each iteration.
So consider:
# 0.370 seconds
system.time({
I = 0
while (I < 100000) {
10
I = I + 1
}
})
# 0.743 seconds -- double the time just adding parentheses
system.time({
I = 0
while (I < 100000) {
((((((((((10))))))))))
I = I + 1
}
})
(That example was discovered by Radford Neal)
Because ( in R is an operator, and actually requires a name lookup every time you use it:
> `(` = function(x) 2
> (3)
[1] 2
Or, in general, interpreted operations (in any language) have more steps. Of course, those steps provide benefits as well: you couldn't do that ( trick in C.
The only Answer to the Question posed is; loops are not slow if what you need to do is iterate over a set of data performing some function and that function or the operation is not vectorized. A for() loop will be as quick, in general, as apply(), but possibly a little bit slower than an lapply() call. The last point is well covered on SO, for example in this Answer, and applies if the code involved in setting up and operating the loop is a significant part of the overall computational burden of the loop.
Why many people think for() loops are slow is because they, the user, are writing bad code. In general (though there are several exceptions), if you need to expand/grow an object, that too will involve copying so you have both the overhead of copying and growing the object. This is not just restricted to loops, but if you copy/grow at each iteration of a loop, of course, the loop is going to be slow because you are incurring many copy/grow operations.
The general idiom for using for() loops in R is that you allocate the storage you require before the loop starts, and then fill in the object thus allocated. If you follow that idiom, loops will not be slow. This is what apply() manages for you, but it is just hidden from view.
Of course, if a vectorised function exists for the operation you are implementing with the for() loop, don't do that. Likewise, don't use apply() etc if a vectorised function exists (e.g. apply(foo, 2, mean) is better performed via colMeans(foo)).
Just as a comparison (don't read too much into it!): I ran a (very) simple for loop in R and in JavaScript in Chrome and IE 8.
Note that Chrome does compilation to native code, and R with the compiler package compiles to bytecode.
# In R 2.13.1, this took 500 ms
f <- function() { sum<-0.5; for(i in 1:1000000) sum<-sum+i; sum }
system.time( f() )
# And the compiled version took 130 ms
library(compiler)
g <- cmpfun(f)
system.time( g() )
#Gavin Simpson: Btw, it took 1162 ms in S-Plus...
And the "same" code as JavaScript:
// In IE8, this took 282 ms
// In Chrome 14.0, this took 4 ms
function f() {
var sum = 0.5;
for(i=1; i<=1000000; ++i) sum = sum + i;
return sum;
}
var start = new Date().getTime();
f();
time = new Date().getTime() - start;