%dopar% or alternative method to speed up sequential stochastic calculation - r

I have written a stochastic process simulator but I would like to speed it up since it's pretty slow.
The main part of the simulator is a for loop which I would like to rewrite as a foreach loop with `%dopar%`.
I have tried doing so with a simplified loop but I'm running into some problems. Suppose my for loop looks like this:
library(foreach)
r <- 0
t <- rep(0, 500)
for(n in 1:500){
  s <- 1/2 + r
  u <- runif(1, min = 0, max = 1)
  if(u < s){
    t[n] <- u
    r <- r + 0.001
  } else {
    r <- r - 0.001
  }
}
which means that at each iteration I update the values of r and s and, in one of the two outcomes, populate my vector t. I have tried several different ways of rewriting it as a foreach loop, but with each iteration my values don't seem to get updated and I get some pretty strange results. I have tried using return but it doesn't seem to work!
This is an example of what I have come up with.
rr <- 0
tt <- foreach(i = 1:500, .combine = c) %dopar% {
  ss <- 1/2 + rr
  uu <- runif(1, min = 0, max = 1)
  if(uu <= ss){
    return(uu)
    rr <- rr + 0.001
  } else {
    return(0)
    rr <- rr - 0.001
  }
}
If it is impossible to use foreach, what other way is there for me to rewrite the loop so that I can use all cores and speed things up?

Since your comments about turning to C were encouraging, and mostly to prove that this isn't a hard task (especially for such operations) and is worth looking into, here is a comparison of two sample functions that accept a number of iterations and perform the steps of your loop:
ffR = function(n)
{
  r = 0
  t = rep(0, n)
  for(i in 1:n) {
    s = 1/2 + r
    u = runif(1)
    if(u < s) {
      t[i] = u
      r = r + 0.001
    } else r = r - 0.001
  }
  return(t)
}
ffC = inline::cfunction(sig = c(R_n = "integer"), body = '
  int n = INTEGER(AS_INTEGER(R_n))[0];
  SEXP ans;
  PROTECT(ans = allocVector(REALSXP, n));
  double r = 0.0, s, u, *pans = REAL(ans);
  GetRNGstate();
  for(int i = 0; i < n; i++) {
    s = 0.5 + r;
    u = runif(0.0, 1.0);
    if(u < s) {
      pans[i] = u;
      r += 0.001;
    } else {
      pans[i] = 0.0;
      r -= 0.001;
    }
  }
  PutRNGstate();
  UNPROTECT(1);
  return(ans);
', includes = "#include <Rmath.h>")
A comparison of results:
set.seed(007); ffR(5)
#[1] 0.00000000 0.39774545 0.11569778 0.06974868 0.24374939
set.seed(007); ffC(5)
#[1] 0.00000000 0.39774545 0.11569778 0.06974868 0.24374939
A comparison of speed:
microbenchmark::microbenchmark(ffR(1e5), ffC(1e5), times = 20)
#Unit: milliseconds
# expr min lq median uq max neval
# ffR(1e+05) 497.524808 519.692781 537.427332 668.875402 692.598785 20
# ffC(1e+05) 2.916289 3.019473 3.133967 3.445257 4.076541 20
And for the sake of completeness:
set.seed(101); ans1 = ffR(1e5)
set.seed(101); ans2 = ffC(1e5)
all.equal(ans1, ans2)
#[1] TRUE
Hope any of this could be helpful in some way.

What you are trying to do doesn't seem to be parallelizable, since every iteration depends on the previous steps of the loop. You are updating the variable r and expecting the other iterations, which are running simultaneously, to know about it, and in fact to wait for the update to happen, which
1) doesn't happen: they won't wait, they'll just take whatever value r happens to have at the time they run, and
2) even if it did, it would be the same as running the loop without %dopar%.
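To make the point concrete, here is a minimal sketch (mine, not from the original answer; it assumes the doParallel package is available as a backend) showing that every worker operates on its own copy of rr, so no iteration ever sees another iteration's update:
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

rr <- 0
res <- foreach(i = 1:4, .combine = c) %dopar% {
  rr <- rr + 0.001   # modifies a private copy on the worker
  rr                 # every iteration therefore returns 0.001
}
stopCluster(cl)

res   # 0.001 0.001 0.001 0.001 -- no accumulation across iterations
rr    # still 0 in the master session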

Related

Does anyone know how I can make my code run faster?

I am trying to calculate the sum from n = 0 to infinity of (-1)^n/(n + 1) as accurately as possible, but my code takes forever and I am not able to see whether my answer is right. Does anyone know how I can make my code faster? The sum is supposed to converge to log(2). My idea is that f(n) will eventually become a very small number (less than 2^-52), and a point would come when R would consider sum + f(n) equal to sum, and that's when I'd want the code to stop running. But clearly that doesn't seem to work: my code takes forever to run and, at least to me, it doesn't seem to ever stop.
f <- function(n)
  return(((-1)^n)/(n + 1))

s <- function(f){
  sum <- 0
  n <- 0
  while(sum != sum + f(n)) {
    sum <- sum + f(n)
    n <- n + 1
  }
  return(c(sum, n))
}
s(f)
library(Rcpp)
cppFunction("
List s(int max_iter) {
  double sum = 0;
  double sum_prec = NA_REAL;
  double n = 0;
  for (; sum != sum_prec && n < max_iter; n++) {
    sum_prec = sum;
    sum += pow(-1, n)/(n + 1);
  }
  return List::create(
    _[\"sum\"] = sum,
    _[\"iterations\"] = n,
    _[\"precision\"] = sum - sum_prec
  );
}")
test <- s(100000000)
test
When you use a huge number of sequential iterations like this, R is not the appropriate tool. However, C++ functions are very easy to use within R, and you can do something like the example above. The function takes a maximum number of iterations and returns a list with your sum, the number of iterations and the precision.
EDIT: By "precision" I only report sum - sum_prec, so this is not the real error interval.
EDIT 2: I left the sum != sum_prec condition in for the example, but unless you have a supercomputer you are not supposed to see the loop terminate on it.
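For a rough sense of why that condition is effectively unreachable, here is a back-of-the-envelope estimate of my own (not from the original answer; it ignores rounding-mode details and only aims at the order of magnitude). The sum converges to log(2) ≈ 0.693, and adding a term smaller than about half the spacing between adjacent doubles near that value no longer changes the stored sum:
eps    <- .Machine$double.eps      # 2^-52
ulp    <- eps / 2                  # spacing of doubles near log(2), which lies in [0.5, 1)
n_stop <- 2 / ulp                  # sum stops changing once 1/(n+1) < ulp/2, i.e. n ~ 1.8e16
n_stop / 1e8 / (3600 * 24 * 365)   # ~5.7 years even at a generous 1e8 iterations per second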
EDIT 3: Typically, a fast base R solution would be something like:
base_sol <- function(n_iter) {
  v <- seq_len(n_iter)
  v <- (-1L)^(v - 1L) / v
  list(
    sum = sum(v),
    iterations = n_iter,
    precision = v[length(v)]
  )
}
This is only about 1.5 times slower than the C++ version, which is pretty fast for an interpreted language, but it has the drawback of holding every term of the sum in RAM (then again, R is made for stats, not for chasing precision down to 2^-52).
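As a quick sanity check (my own usage example, not part of the original answer), the partial sum from base_sol can be compared directly against the known limit log(2); the truncation error of the alternating series is roughly 1/(2 * n_iter):
res <- base_sol(1e7)
res$sum                  # 0.6931471...
abs(res$sum - log(2))    # about 5e-8, i.e. roughly 1/(2 * 1e7)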

R: Fast cartesian product calculation of two numeric matrices

I have two large numeric matrices and want to calculate their cartesian product in R. Is there a way to do it with higher performance and lower memory usage than with my current approach?
EDIT: I added an Rcpp version, which already performs a lot better than my first R-only approach. Since I'm not experienced with Rcpp or RcppArmadillo: is there a faster/more standardized way to write this Rcpp function?
m1 <- matrix(sample(0:9, size=110000, replace = TRUE), ncol = 110)
m2 <- matrix(sample(0:9, size=110000, replace = TRUE), ncol = 110)
#Current approach:
m3 <- apply(m1, 1, function(x) x * t(m2))
matrix(m3, ncol = 110, byrow = TRUE)
#EDIT - Rcpp approach
library(Rcpp)
#assuming ncol(m1) == ncol(m2)
cppFunction('IntegerMatrix cartProd(IntegerMatrix m1, IntegerMatrix m2) {
  int nrow1 = m1.nrow(), ncol = m1.ncol(), nrow2 = m2.nrow();
  int orow = 0;
  IntegerMatrix out(nrow1 * nrow2, ncol);
  for (int r1 = 0; r1 < nrow1; r1++) {
    for (int r2 = 0; r2 < nrow2; r2++) {
      for (int c = 0; c < ncol; c++){
        out(orow, c) = m1(r1, c) * m2(r2, c);
      }
      orow++;
    }
  }
  return out;
}')
m5 <- cartProd(m1, m2)
The best approach, as you have surmised, is to use C++ to perform the cartesian product you desire. Porting the code over to Armadillo yields a minor speed-up compared to the pure Rcpp version, which is in turn significantly faster than the written R version. For details on how well each method did, see the benchmark section at the end.
The first version is almost a direct port into Armadillo and actually performs slightly worse than the initial pure Rcpp function. The second version uses Armadillo's built-in submatrix views and the each_row() member function to exploit in-place evaluation. To achieve parity with the Rcpp version, note the use of pass-by-reference and of a signed integer element type, yielding const arma::imat&. This avoids a deep copy of the two large integer matrices, since the types match and only a reference is established.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// --- Version 1
// [[Rcpp::export]]
arma::imat cartProd_arma(const arma::imat& m1, const arma::imat& m2) {
  int nrow1 = m1.n_rows, ncol = m1.n_cols, nrow2 = m2.n_rows, orow = 0;
  arma::imat out(nrow1 * nrow2, ncol);
  for (int r1 = 0; r1 < nrow1; ++r1) {
    for (int r2 = 0; r2 < nrow2; ++r2) {
      out.row(orow) = m1.row(r1) % m2.row(r2);
      orow++;
    }
  }
  return out;
}

// --- Version 2
// [[Rcpp::export]]
arma::imat cartProd_arma2(const arma::imat& m1, const arma::imat& m2) {
  int nrow1 = m1.n_rows, ncol = m1.n_cols, nrow2 = m2.n_rows, orow = 0;
  arma::imat out(nrow1 * nrow2, ncol);
  for (int r1 = 0; r1 < nrow1; ++r1) {
    out.submat(orow, 0, orow + nrow2 - 1, ncol - 1) = m1.row(r1) % m2.each_row();
    orow += nrow2;
  }
  return out;
}
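Since the Armadillo code above uses Rcpp attributes, one way to compile it (a minimal sketch of mine, not from the original answer; the file name is hypothetical) is to save it to a .cpp file and call Rcpp::sourceCpp, which exports cartProd_arma() and cartProd_arma2() into the R session:
library(Rcpp)
# assumes RcppArmadillo is installed and the C++ code above is saved as "cartprod_arma.cpp"
sourceCpp("cartprod_arma.cpp")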
Quick verification that the implementations match the initial Rcpp product:
all.equal( cartProd(m1, m2), cartProd_arma(m1, m2))
# [1] TRUE
all.equal( cartProd(m1, m2), cartProd_arma2(m1, m2))
# [1] TRUE
To generate the benchmarks, I tidied up the initial function slightly by pre-transposing the matrix to avoid multiple transpose calls each time apply was called per row. Furthermore, I've included the trick shown by @user20650 in the comments.
# OP's initial R-only solution with slight modifications
op_R = function(m1, m2){
  m2 <- t(m2)
  m3 <- matrix(apply(m1, 1, function(x) x * m2), ncol = ncol(m1), byrow = TRUE)
}

# user20650's comment
so_comment <- function(m1, m2){
  m4 <- matrix(rep(t(m1), each = nrow(m1)) * c(m2), ncol = nrow(m1))
}
As a result, we have the following microbenchmark:
library("microbenchmark")
out <- microbenchmark(op_r = op_R(m1, m2), so_comment_r = so_comment(m1, m2),
                      rcpp = cartProd(m1, m2), arma_v1 = cartProd_arma(m1, m2),
                      arma_v2 = cartProd_arma2(m1, m2),
                      times = 50)
out
# Unit: milliseconds
# expr min lq mean median uq max neval
# op_r 1615.6572 1693.0526 1793.0515 1771.7353 1886.0988 2053.7050 50
# so_comment_r 2778.0971 2856.6429 2950.5837 2936.7459 3021.4249 3344.4401 50
# rcpp 463.6743 482.3118 565.0525 582.1660 614.3714 699.3516 50
# arma_v1 597.9004 620.5888 713.4101 726.7572 783.4225 820.3770 50
# arma_v2 384.7205 401.9744 490.5118 503.5007 574.6840 622.9876 50
So, from this, we can see that cartProd_arma2, the submatrix Armadillo implementation, is the best function, followed closely by cartProd, the pure Rcpp implementation.
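As a side note of my own (not from the original answer), the question also asked about memory: whichever implementation is used, the output itself dominates memory consumption, since with 1000-row inputs the result has 1000 * 1000 rows of 110 integers:
1000 * 1000 * 110 * 4 / 2^20   # ~420 MiB for the output matrix alone (4 bytes per integer)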

R recursive function or loop in loop

Simple problem.
I want to check whether the difference between two points (i, j) is greater than a threshold (diff).
If the difference between the points exceeds the threshold, the index should be returned and the next distance is measured, but starting from that new data point. It is a simple cutoff filter where all data points under a predefined threshold are filtered out. The only trick is that the measurement is always performed from the "last" point (the one that was "far enough away" from the point before it).
I first wrote it as two nested loops like:
x <- sample(1:100)
for(i in 1:(length(x) - 1)){
  for(j in (i + 1):length(x)){
    if(abs(x[i] - x[j]) >= cutoff) {
      print(j)
      i <- j # set the index to the current datapoint
      break
    }
  }
}
This solution is kind of intuitive, but it does not work properly. I think the assignment of i inside the loop has no effect: the outer loop just ignores the jump and iterates over all data points.
Well, I did not want to waste time debugging and thought I could do the same with a recursive function.
So I wrote it like:
checkCutOff.f <- function(x, cutoff, i = 1) {
  options(expressions = 500000)
  # Loop through the data and compare the temporarily fixed point 'i' with the looping points 'j'
  for(j in (i + 1):length(x)){
    if( abs(x[i] - x[j]) >= cutoff ){
      break
    }
  }
  # Recursive call to update the new 'i' - stops at the end of the dataset
  if( j < length(x) ) return(c(j, checkCutOff.f(x, cutoff, j)))
  else return(j)
}
x<-sample(1:100000)
checkCutOff.f(x,1)
This code works, but I get a stack overflow with big datasets. That's why I ask myself whether this code is efficient.
For me, having to increase limits etc. is always a hint of inefficient code...
So my question is:
What kind of solution is really efficient?
Thanks!
You should avoid growing your return value with c; that's inefficient. Allocate to the maximum size and subset to the needed size at the end.
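A minimal illustration of that cost (my own sketch, not from the original answer; exact timings will vary):
library(microbenchmark)

grow_with_c <- function(n) {
  out <- integer(0)
  for (i in 1:n) out <- c(out, i)   # copies the whole vector on every iteration
  out
}

preallocate <- function(n) {
  out <- integer(n)                 # allocate once, fill in place
  for (i in 1:n) out[i] <- i
  out
}

microbenchmark(grow_with_c(1e4), preallocate(1e4), times = 10)
# growing with c() is quadratic in n; preallocating stays linear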
Note that your function always includes length(x) in your result, which is wrong:
set.seed(42)
x<-sample(1:10)
checkCutOff.f(x, 100)
#[1] 10
Here is an R solution with a loop:
checkCutOff.f1 <- function(x, cutoff) {
  i <- 1
  j <- 1
  k <- 1
  result <- integer(length(x))
  while(j < length(x)) {
    j <- j + 1
    if (abs(x[i] - x[j]) >= cutoff) {
      result[k] <- j
      k <- k + 1
      i <- j
    }
  }
  result[seq_len(k - 1)]
}
all.equal(checkCutOff.f(x, 4), checkCutOff.f1(x, 4))
#[1] TRUE
#the correct solution includes length(x) here (by chance)
It's easy to translate to Rcpp:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector checkCutOff_f1cpp(NumericVector x, double cutoff) {
  int i = 0;
  int j = 1;
  int k = 0;
  IntegerVector result(x.size());
  while(j < x.size()) {
    if (std::abs(x[i] - x[j]) >= cutoff) {
      result[k] = j + 1;
      k++;
      i = j;
    }
    j++;
  }
  result = result[seq_len(k) - 1];
  return result;
}
Then in R:
all.equal(checkCutOff.f(x, 4), checkCutOff_f1cpp(x, 4))
#[1] TRUE
Benchmarks:
library(microbenchmark)
y <- sample(1:1000)
microbenchmark(
checkCutOff.f(y, 4),
checkCutOff.f1(y, 4),
checkCutOff_f1cpp(y, 4)
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# checkCutOff.f(y, 4) 3665.105 4681.6005 7798.41776 5323.068 6635.9205 41028.930 100 c
# checkCutOff.f1(y, 4) 1384.524 1507.2635 1831.43236 1769.031 2070.7225 3012.279 100 b
# checkCutOff_f1cpp(y, 4) 8.765 10.7035 26.40709 14.240 18.0005 587.958 100 a
I'm sure this can be improved further and more testing should be done.
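One simple way to do more of that testing (my own sketch, not part of the original answer) is to compare the loop and Rcpp implementations on many random inputs; the original recursive function is left out here because of the length(x) issue noted above:
set.seed(1)
for (rep in 1:100) {
  x <- sample(1:1000, 200)
  cutoff <- sample(1:50, 1)
  stopifnot(isTRUE(all.equal(checkCutOff.f1(x, cutoff),
                             checkCutOff_f1cpp(x, cutoff))))
}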

speeding up a loop with loop-carried values in R

I'm trying to speed up code that takes time series data, limits it to a maximum value, and then stretches the excess forward until the sum of the original data and the "stretched" data are the same.
I have a more complicated version of this that is taking 6 hours to run on 100k rows. I don't think this is vectorizable because it uses values calculated on prior rows - is that correct?
x <- c(0,2101,3389,3200,1640,0,0,0,0,0,0,0)
dat <- data.frame(x = x, y = rep(0, length(x)))
remainder <- 0
upperlimit <- 2000
for(i in 1:length(dat$x)){
  if(dat$x[i] >= upperlimit){
    dat$y[i] <- upperlimit
  } else {
    dat$y[i] <- min(remainder, upperlimit)
  }
  remainder <- remainder + dat$x[i] - dat$y[i]
}
dat
I understand you can use ifelse, but I don't think cumsum can be used to carry the remainder forward, and apply doesn't help either as far as I know. Do I need to resort to Rcpp? Thank you greatly.
I went ahead and implemented this in Rcpp and made some adjustments to the R function:
require(Rcpp); require(microbenchmark); require(ggplot2)

limitstretchR <- function(upperlimit, original) {
  remainder <- 0
  out <- vector(length = length(original))
  for(i in 1:length(original)){
    if(original[i] >= upperlimit){
      out[i] <- upperlimit
    } else {
      out[i] <- min(remainder, upperlimit)
    }
    remainder <- remainder + original[i] - out[i]
  }
  out
}
The Rcpp function:
cppFunction('
NumericVector limitstretchC(double upperlimit, NumericVector original) {
  int n = original.size();
  double remainder = 0.0;
  NumericVector out(n);
  for(int i = 0; i < n; ++i) {
    if (original[i] >= upperlimit) {
      out[i] = upperlimit;
    } else {
      out[i] = std::min<double>(remainder, upperlimit);
    }
    remainder = remainder + original[i] - out[i];
  }
  return out;
}
')
Testing them:
x <- c(0,2101,3389,3200,1640,0,0,0,0,0,0,0)
original <- rep(x,20000)
upperlimit <- 2000
system.time(limitstretchR(upperlimit,original))
system.time(limitstretchC(upperlimit,original))
That yielded 80.655 and 0.001 seconds respectively. Native R is quite bad for this. However, I ran a microbenchmark (using a smaller vector) and got some confusing results.
res <- microbenchmark(list =
         list(limitstretchR = limitstretchR(upperlimit, rep(x, 10000)),
              limitstretchC = limitstretchC(upperlimit, rep(x, 10000))),
       times = 110,
       control = list(order = "random", warmup = 10))
print(qplot(y=time, data=res, colour=expr) + scale_y_log10())
boxplot(res)
print(res)
If you were to run that you would see nearly identical results for both functions. This is my first time using microbenchmark, any tips?
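One likely explanation (my reading, not part of the original post): the list= argument above receives the results of calling the two functions, because the calls are evaluated while the list is being built, so microbenchmark ends up timing two pre-computed vectors rather than the functions themselves, hence the near-identical numbers. Passing the expressions directly avoids this, e.g.:
v <- rep(x, 10000)
res <- microbenchmark(
  limitstretchR = limitstretchR(upperlimit, v),
  limitstretchC = limitstretchC(upperlimit, v),
  times = 110,
  control = list(order = "random", warmup = 10)
)
print(res)   # the R loop should now be orders of magnitude slower, matching system.time()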

Speeding up Julia's poorly written R examples

The Julia examples to compare performance against R seem particularly convoluted. https://github.com/JuliaLang/julia/blob/master/test/perf/perf.R
What is the fastest performance you can eke out of the two algorithms below (preferably with an explanation of what you changed to make it more R-like)?
## mandel
mandel = function(z) {
  c = z
  maxiter = 80
  for (n in 1:maxiter) {
    if (Mod(z) > 2) return(n-1)
    z = z^2+c
  }
  return(maxiter)
}

mandelperf = function() {
  re = seq(-2,0.5,.1)
  im = seq(-1,1,.1)
  M = matrix(0.0,nrow=length(re),ncol=length(im))
  count = 1
  for (r in re) {
    for (i in im) {
      M[count] = mandel(complex(real=r,imag=i))
      count = count + 1
    }
  }
  return(M)
}

assert(sum(mandelperf()) == 14791)
## quicksort ##
qsort_kernel = function(a, lo, hi) {
  i = lo
  j = hi
  while (i < hi) {
    pivot = a[floor((lo+hi)/2)]
    while (i <= j) {
      while (a[i] < pivot) i = i + 1
      while (a[j] > pivot) j = j - 1
      if (i <= j) {
        t = a[i]
        a[i] = a[j]
        a[j] = t
      }
      i = i + 1;
      j = j - 1;
    }
    if (lo < j) qsort_kernel(a, lo, j)
    lo = i
    j = hi
  }
  return(a)
}

qsort = function(a) {
  return(qsort_kernel(a, 1, length(a)))
}

sortperf = function(n) {
  v = runif(n)
  return(qsort(v))
}
sortperf(5000)
The key word in this question is "algorithm":
What is the fastest performance you can eke out of the two algorithms below (preferably with an explanation of what you changed to make it more R-like)?
As in "how fast can you make these algorithms in R?" The algorithms in question here are the standard Mandelbrot complex loop iteration algorithm and the standard recursive quicksort kernel.
There are certainly faster ways to compute the answers to the problems posed in these benchmarks – but not using the same algorithms. You can avoid recursion, avoid iteration, and avoid whatever else R isn't good at. But then you're no longer comparing the same algorithms.
If you really wanted to compute Mandelbrot sets in R or sort numbers, yes, this is not how you would write the code. You would either vectorize it as much as possible – thereby pushing all the work into predefined C kernels – or just write a custom C extension and do the computation there. Either way, the conclusion is that R isn't fast enough to get really good performance on its own – you need to have C do most of the work in order to get good performance.
And that's exactly the point of these benchmarks: in Julia you never have to rely on C code to get good performance. You can just write what you want to do in pure Julia and it will have good performance. If an iterative scalar loop algorithm is the most natural way to do what you want to do, then just do that. If recursion is the most natural way to solve the problem, then that's ok too. At no point will you be forced to rely on C for performance – whether via unnatural vectorization or writing custom C extensions. Of course, you can write vectorized code when it's natural, as it often is in linear algebra; and you can call C if you already have some library that does what you want. But you don't have to.
We do want to have the fairest possible comparison of the same algorithms across languages:
If someone does have faster versions in R that use the same algorithm, please submit patches!
I believe that the R benchmarks on the julia site are already byte-compiled, but if I'm doing it wrong and the comparison is unfair to R, please let me know and I will fix it and update the benchmarks.
Hmm, in the Mandelbrot example the matrix M has its dimensions transposed
M = matrix(0.0,nrow=length(im), ncol=length(re))
because it's filled by incrementing count in the inner loop (successive values of im). My implementation creates a vector of complex numbers in mandelperf.1 and operates on all elements, using an index and subsetting to keep track of which elements of the vector have not yet satisfied the condition Mod(z) <= 2
mandel.1 = function(z, maxiter=80L) {
  c <- z
  result <- integer(length(z))
  i <- seq_along(z)
  n <- 0L
  while (n < maxiter && length(z)) {
    j <- Mod(z) <= 2
    if (!all(j)) {
      result[i[!j]] <- n
      i <- i[j]
      z <- z[j]
      c <- c[j]
    }
    z <- z^2 + c
    n <- n + 1L
  }
  result[i] <- maxiter
  result
}

mandelperf.1 = function() {
  re = seq(-2,0.5,.1)
  im = seq(-1,1,.1)
  mandel.1(complex(real=rep(re, each=length(im)),
                   imaginary=im))
}
for a 13-fold speed-up (the results are equal but not identical because the original returns numeric rather than integer values).
> library(rbenchmark)
> benchmark(mandelperf(), mandelperf.1(),
+ columns=c("test", "elapsed", "relative"),
+ order="relative")
test elapsed relative
2 mandelperf.1() 0.412 1.00000
1 mandelperf() 5.705 13.84709
> all.equal(sum(mandelperf()), sum(mandelperf.1()))
[1] TRUE
The quicksort example doesn't actually sort
> set.seed(123L); qsort(sample(5))
[1] 2 4 1 3 5
but my main speed-up was to vectorize the partition around the pivot
qsort_kernel.1 = function(a) {
  if (length(a) < 2L)
    return(a)
  pivot <- a[floor(length(a) / 2)]
  c(qsort_kernel.1(a[a < pivot]), a[a == pivot], qsort_kernel.1(a[a > pivot]))
}

qsort.1 = function(a) {
  qsort_kernel.1(a)
}

sortperf.1 = function(n) {
  v = runif(n)
  return(qsort.1(v))
}
for a 7-fold speedup (in comparison to the uncorrected original)
> benchmark(sortperf(5000), sortperf.1(5000),
+ columns=c("test", "elapsed", "relative"),
+ order="relative")
test elapsed relative
2 sortperf.1(5000) 6.60 1.000000
1 sortperf(5000) 47.73 7.231818
Since in the original comparison Julia is about 30 times faster than R for mandel, and 500 times faster for quicksort, the implementations above are still not really competitive.
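One further, smaller lever (my own addition, not from the original answers): the scalar R functions can be byte-compiled with the compiler package. On recent R versions (3.4 and later) the JIT already does this automatically, so the gain may be modest or nil, but on older versions it typically helps tight loops by a noticeable factor:
library(compiler)

# replace the bindings so mandelperf() calls the compiled mandel() internally
mandel     <- cmpfun(mandel)
mandelperf <- cmpfun(mandelperf)

system.time(mandelperf())   # compare against the elapsed time measured before compiling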

Resources