Is there an efficient way to parallelize mapply? - r

I have many rows and on every row I compute the uniroot of a non-linear function. I have a quad-core Ubuntu machine which hasn't stopped running my code for two days now. Not surprisingly, I'm looking for ways to speed things up ;-)
After some research, I noticed that only one core is currently used and parallelization is the thing to do. Digging deeper, I came to the conclusion (maybe incorrectly?) that the package foreach isn't really meant for my problem because too much overhead is produced (see, for example, SO). A good alternative seems to be multicore for Unix machines. In particular, the pvec function seems to be the most efficient one after I checked the help page.
However, if I understand it correctly, this function only takes one vector and splits it up accordingly. I need a function that can be parallelized but takes multiple vectors (or a data.frame instead), just like the mapply function does. Is there anything out there that I missed?
Here is a small example of what I want to do: (Note that I include a plyr example here because it can be an alternative to the base mapply function and it has a parallelize option. However, it is slower in my implementation and internally, it calls foreach to parallelize, so I think it won't help. Is that correct?)
library(plyr)     # mdply
library(foreach)  # foreach / %do%
n <- 10000
df <- data.frame(P = rnorm(n, mean=100, sd=10),
B0 = rnorm(n, mean=40, sd=5),
CF1 = rnorm(n, mean=30, sd=10),
CF2 = rnorm(n, mean=30, sd=5),
CF3 = rnorm(n, mean=90, sd=8))
get_uniroot <- function(P, B0, CF1, CF2, CF3) {
  uniroot(function(x) {-P + B0 + CF1/x + CF2/x^2 + CF3/x^3},
          lower = 1,
          upper = 10,
          tol = 0.00001)$root
}
system.time(x1 <- mapply(get_uniroot, df$P, df$B0, df$CF1, df$CF2, df$CF3))
#user system elapsed
#0.91 0.00 0.90
system.time(x2 <- mdply(df, get_uniroot))
#user system elapsed
#5.85 0.00 5.85
system.time(x3 <- foreach(P=df$P, B0=df$B0, CF1=df$CF1, CF2=df$CF2, CF3=df$CF3, .combine = "c") %do% {
get_uniroot(P, B0, CF1, CF2, CF3)})
#user system elapsed
# 10.30 0.00 10.36
all.equal(x1, x2$V1) #TRUE
all.equal(x1, x3) #TRUE
Also, I tried to implement Ryan Thompson's function chunkapply from the SO link above (I only removed the doMC part because I couldn't install it; his example still works after adjusting his function), but didn't get it to work. However, since it uses foreach, I thought the same arguments mentioned above apply, so I didn't spend long on it.
#chunkapply(get_uniroot, list(P=df$P, B0=df$B0, CF1=df$CF1, CF2=df$CF2, CF3=df$CF3))
#Error in { : task 1 failed - "invalid function value in 'zeroin'"
PS: I know that I could just increase tol to reduce the number of steps that are necessary to find a uniroot. However, I already set tol as big as possible.

I'd use the parallel package that's built into R 2.14 and work with matrices. You could then simply use mclapply like this:
library(parallel)
dfm <- as.matrix(df)
result <- mclapply(seq_len(nrow(dfm)),
                   function(i) do.call(get_uniroot, as.list(dfm[i, ])),
                   mc.cores = 4L)
This is basically doing the same as mapply does, but in parallel.
Mind you that parallelization always incurs some overhead as well. As I explained in the question you link to, going parallel only pays off if your inner function takes significantly longer to run than the overhead involved. In your case, your uniroot function works pretty fast. You might then consider cutting your data frame into bigger chunks and combining mapply with mclapply. A possible way to do this is:
ncores <- 4
id <- floor(
        quantile(0:nrow(df),
                 1 - (0:ncores)/ncores))  # descending chunk boundaries: nrow(df), ..., 0
idm <- embed(id,2)
mapply_uniroot <- function(id){
  tmp <- df[(id[1]+1):id[2],]
  mapply(get_uniroot, tmp$P, tmp$B0, tmp$CF1, tmp$CF2, tmp$CF3)
}
result <-mclapply(nrow(idm):1,
                  function(x) mapply_uniroot(idm[x,]),
                  mc.cores = ncores)
final <- unlist(result)
This might need some tweaking, but it essentially breaks your df into exactly as many pieces as there are cores and runs mapply on every core. To show this works:
> x1 <- mapply(get_uniroot, df$P, df$B0, df$CF1, df$CF2, df$CF3)
> all.equal(final,x1)
[1] TRUE
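As a minimal sketch of the same chunking idea (assuming a Unix-alike, since mclapply relies on forking; the code falls back to a single core on Windows), parallel::splitIndices can produce the row chunks directly. The toy data frame and root_one below are hypothetical stand-ins for df and get_uniroot, not part of the original question.

```r
library(parallel)

# Hypothetical toy problem: one root per row, in the spirit of get_uniroot
set.seed(1)
n <- 200
toy <- data.frame(a = runif(n, 2, 5), b = runif(n, 1, 3))
root_one <- function(a, b) {
  # f(0) = -a < 0 and f(3) = 27 + 3*b - a > 0, so a root lies in [0, 3]
  uniroot(function(x) x^3 + b * x - a, lower = 0, upper = 3, tol = 1e-8)$root
}

# Forking is unavailable on Windows, so use one core there
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

# One contiguous block of row indices per core
chunks <- splitIndices(n, ncores)

# Each worker runs a plain mapply over its own block of rows
chunk_roots <- mclapply(chunks,
                        function(idx) mapply(root_one, toy$a[idx], toy$b[idx]),
                        mc.cores = ncores)
roots_par <- unlist(chunk_roots)

# Sequential reference for comparison
roots_seq <- mapply(root_one, toy$a, toy$b)
```

Because each worker receives one large block, the per-task overhead is paid once per core rather than once per row.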

It's an old topic, but FYI you now have parallel::mcmapply. Don't forget to set mc.cores in the options. I usually use mc.cores=parallel::detectCores()-1 to leave one CPU free for OS operations.
x4 <- mcmapply(get_uniroot, df$P, df$B0, df$CF1, df$CF2, df$CF3,mc.cores=parallel::detectCores()-1)

This isn't exactly a best practices suggestion, but considerable speed-up can be had by identifying the root for all parameters in a 'vectorized' fashion. For instance,
bisect <-
    function(f, interval, ..., lower=min(interval), upper=max(interval),
             f.lower=f(lower, ...), f.upper=f(upper, ...), maxiter=20)
{
    nrow <- length(f.lower)
    bounds <- matrix(c(lower, upper), nrow, 2, byrow=TRUE)
    for (i in seq_len(maxiter)) {
        ## move lower or upper bound to mid-point, preserving opposite signs
        mid <- rowSums(bounds) / 2
        updt <- ifelse(f(mid, ...) > 0, 0L, nrow) + seq_len(nrow)
        bounds[updt] <- mid
    }
    rowSums(bounds) / 2
}
and then
> system.time(x2 <- with(df, {
+ f <- function(x, PB0, CF1, CF2, CF3)
+ PB0 + CF1/x + CF2/x^2 + CF3/x^3
+ bisect(f, c(1, 10), PB0, CF1, CF2, CF3)
+ }))
user system elapsed
0.180 0.000 0.181
> range(x1 - x2)
[1] -6.282406e-06 6.658593e-06
versus about 1.3s for application of uniroot separately to each. This also combined P and B0 into a single value ahead of time, since that is how they enter the equation.
The bounds on the final value are +/- diff(interval) * (.5 ^ maxiter) or so. A fancier implementation would replace bisection with linear or quadratic interpolation (as in the reference cited in ?uniroot), but then uniform efficient convergence (and in all cases error handling) would be more tricky to arrange.
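To make the error bound concrete, here is a small sketch of a vectorized bisection checked against a problem with a known answer. bisect_vec is a hypothetical, simplified variant (it tracks two vectors rather than a bounds matrix) and the test function x^2 - a is chosen only because its root sqrt(a) is known.

```r
# Hypothetical simplified vectorized bisection: one root per element of '...'
bisect_vec <- function(f, lower, upper, ..., maxiter = 30) {
  lo <- rep(lower, length.out = length(f(lower, ...)))
  hi <- rep(upper, length.out = length(lo))
  lo_negative <- f(lo, ...) < 0          # sign of f at the lower end
  for (i in seq_len(maxiter)) {
    mid <- (lo + hi) / 2
    # TRUE where the mid-point has the same sign as the lower end
    same_side <- (f(mid, ...) < 0) == lo_negative
    lo <- ifelse(same_side, mid, lo)     # root is above mid: raise lower bound
    hi <- ifelse(same_side, hi, mid)     # root is at or below mid: lower upper bound
  }
  (lo + hi) / 2
}

a <- c(2, 9, 30, 50)
maxiter <- 30
r <- bisect_vec(function(x, a) x^2 - a, 0, 10, a, maxiter = maxiter)

# Each step halves the bracket, so the error is at most
# diff(interval) * 0.5^maxiter
err_bound <- (10 - 0) * 0.5^maxiter
max_err <- max(abs(r - sqrt(a)))
```

Every iteration costs one vectorized call to f regardless of how many equations are solved, which is where the speed-up over row-by-row uniroot comes from.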


Speed of Daisy Function

I'm working on improving the speed of a function (for a dissimilarity measure) I'm writing which is quite similar mathematically to the Euclidean distance function. However, when I time my function compared to that implemented in the daisy function from the cluster package, I find quite a significant difference in speed, with daisy performing much better. Given that (I'm assuming) a dissimilarity measure would require O(n x p) time due to the need to compare each object to itself over all variables (where n is number of objects and p is number of variables), I find it difficult to understand how the daisy function performs so well (near constant time, from the few experiments I've done) relative to my simple and direct implementation. I present the code I have used both to implement and test below. I have tried looking through the r source code for the implementation of the daisy function, but I found it difficult to understand. I found no nested for loop. Any help with understanding why this function performs so fast and how I could possibly modify my code to have similar speed would be very highly appreciated.
library(cluster)  # daisy
euclidean <- function (df){
  no_obj <- nrow(df)
  dist <- array(0, dim = c(no_obj, no_obj))
  for (i in 1:no_obj){
    for (j in 1:no_obj){
      dist_v <- 0
      if(i != j){
        for (v in 1:ncol(df)){
          dist_v <- dist_v + sqrt((df[i,v] - df[j,v])^2)
        }
      }
      dist[i,j] <- dist_v
    }
  }
  return(dist)
}
tic <- Sys.time()
dst <- euclidean(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Euclidean]: ", time))
tic <- Sys.time()
dst <- daisy(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Daisy]: ", time))
One option:
library(data.table)  # provides CJ()
euclidean3 <- function(df) {
  n <- nrow(df)
  i <- CJ(1:n, 1:n) # generate all row combinations
  dl <- sapply(df, function(x) sqrt((x[i[[1]]] - x[i[[2]]])^2)) # loop over columns
  dv <- rowSums(dl) # sum values of columns
  d <- matrix(dv, n, n) # fill in matrix
  d
}
dst3 <- euclidean3(iris[,1:4])
all.equal(euclidean(iris[,1:4]), dst3) # TRUE
[1] "Time taken [Euclidean3]: 0.008"
[1] "Time taken [Daisy]: 0.002"
The largest bottleneck in your code is selecting data.frame elements inside the loop (df[j,v]). Converting the data to a matrix first could also improve speed. I believe there are more performant approaches on Stack Overflow; you just need to search with the right keywords.
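One observation that may help: sqrt((a - b)^2) is just abs(a - b), so the loop in the question sums absolute differences per variable, which is the Manhattan distance rather than the Euclidean one. As a sketch using only base R, stats::dist (implemented in compiled code) computes both without any R-level loop:

```r
m <- as.matrix(iris[, 1:4])

# Sum of absolute differences, i.e. what the question's loop accumulates
d_manhattan <- as.matrix(dist(m, method = "manhattan"))

# True Euclidean distance: square root of the summed squared differences
d_euclid <- as.matrix(dist(m))

# Spot check for one pair of rows against the textbook formulas
i <- 1; j <- 51
manual_manhattan <- sum(abs(m[i, ] - m[j, ]))
manual_euclid <- sqrt(sum((m[i, ] - m[j, ])^2))
```

Both results are full n x n matrices with zeros on the diagonal, matching the layout of the question's dist array.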

Optimizing lm() function in a loop

I'm using the R built-in lm() function in a loop for estimating a custom statistic:
for(i in 1:10000)
This becomes really slow as both the loop count (typically we want 10^6 to 10^9 iterations for precision) and the size of Y increase.
Having read the following Stack topic, a very first attempt was to optimize the whole thing using parallel regression (with calm()):
cls = makeCluster(4)
distribsplit(cls, "test")
distribsplit(cls, "x")
for(i in 1:10000) {
  reg2 <- calm(cls, "x ~ test$Y, data = test")
  Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
This ended up much slower (by a factor of 6) than the original, non-parallel loop. My assumption is that the threads are created and destroyed in each loop iteration, and that this slows the process down a lot in R.
A second attempt, according to this Stack topic, was:
for(i in 1:10000)
x<- rnorm(n)
reg2<-, data$Y)
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
It resulted in much faster processing compared to the initial, original version. Such that we now have: < lm() < calm(), speaking of overall processing time.
However, we are still looking for options to improve the efficiency (in terms of processing time) of this code. What are the possible options? I assume that making the loop parallel would save some processing time?
Edit: Minimal Example
Here is a minimal example:
#Import data
sample <- read.csv("sample.txt")
Max <- vector(mode = "numeric", length = 100)
n <- length(sample$AGE)
x <- matrix(rnorm(100 * n), 100)
for(i in 1 : 100) {
  reg <- lm(x[i, ] ~ sample$AGE)
  Max[i] <- max(abs(rstudent(reg)))
}
with the following dataset 'sample.txt':
From here, we made several tests and noted the following:
Following @Karo's contribution, we generate the matrix of normal samples outside the loop to save some execution time. We expected a noticeable improvement, but our tests show the opposite (i.e. a longer execution time). Maybe the effect reverses when the number of simulations increases.
Following @BenBolker's suggestion, we also tested fastlm(); it reduces the execution time, but the results seem to differ (by a factor of 0.05) from those of the typical lm().
We are still struggling to effectively reduce the execution time. Following @Karo's suggestions, we will try to pass a vector directly to lm() and investigate parallelization (which failed with calm() for an unknown reason).
Wide-ranging comments above, but I'll try to answer a few narrower points.
I seem to get the same (i.e., all.equal() is TRUE) results with and fitLmPure, if I'm careful about random-number seeds:
nsim <- 1e3
n <- 1e5
dd <- data.frame(Y=rnorm(n))
testfun <- function(, seed=NULL) {
if (!is.null(seed)) set.seed(seed)
x <- rnorm(n)
reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 - also used to pick y -
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure,seed=102)) ## TRUE
fastLmPure is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
Unit: milliseconds
expr min lq mean median uq max
testfun() 6.603822 8.234967 8.782436 8.332270 8.745622 82.54284
testfun( 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
testfun(fastLmPure) 5.964700 7.358141 7.818624 7.471030 7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions" that might be worth a try. (Unfortunately, providing a matrix that combines many separate X vectors runs a multiple regression, not many univariate regressions ...)
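A minimal sketch of the matrix-response idea, using base lm.fit and simulated data (the sizes, the seed, and the inclusion of an intercept column are assumptions for illustration, not taken from the question): one call fits every column of Y against the same predictor, and the residuals come back as a matrix with one column per response.

```r
set.seed(42)
n <- 50       # observations per regression (illustrative)
nsim <- 20    # number of simulated responses (illustrative)
x <- rnorm(n)
Y <- matrix(rnorm(n * nsim), n, nsim)   # one response per column

# Single fit: design matrix with intercept, matrix-valued response
X <- cbind(1, x)
res_mat <- lm.fit(X, Y)$residuals       # n x nsim matrix of residuals
stat_mat <- apply(res_mat, 2, function(r) max(abs(r) / sd(r)))

# Reference: the same statistic from one lm() call per column
stat_loop <- vapply(seq_len(nsim), function(k) {
  r <- residuals(lm(Y[, k] ~ x))
  max(abs(r) / sd(r))
}, numeric(1))
```

The design matrix is decomposed once and reused for all columns, which is why this tends to beat a loop when only the response varies.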
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000,testfun(fastLmPure), simplify=FALSE))
## user system elapsed
## 38.953 0.072 39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
                     mc.cores=5))
## user system elapsed
## 43.225 0.627 12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
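For what it's worth, a self-contained sketch of parallelizing across replicates might look like the following. one_rep is a hypothetical stand-in for testfun built on base lm.fit, and the core count is deliberately small; switching to the "L'Ecuyer-CMRG" generator is the documented way to give forked workers their own random-number streams.

```r
library(parallel)

# Hypothetical replicate: simulate x and y, fit, return the max scaled residual
one_rep <- function(i, n = 200) {
  x <- rnorm(n)
  y <- rnorm(n)
  r <- lm.fit(cbind(1, x), y)$residuals
  max(abs(r) / sd(r))
}

# mclapply forks, which Windows does not support, so use one core there
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

# Parallel-safe RNG streams for the forked workers
RNGkind("L'Ecuyer-CMRG")
set.seed(123)

nrep <- 100
stats <- unlist(mclapply(seq_len(nrep), one_rep, mc.cores = ncores))
```

Each task here is a whole simulate-and-fit cycle, so the scheduling overhead is small relative to the work done per task.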
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?
Since I still can't comment:
Try to avoid loops in R. For some reason you are recalculating those random numbers every iteration. You can do that without a loop:
duration_loop <- system.time({
  for(i in 1:10000000)
    x <- rnorm(10)
})
duration <- system.time({
  m <- matrix(rnorm(10000000*10), 10000000)
})
Both ways should create 10 random values per iteration/matrix row, with the same number of iterations/rows. Though both ways seem to scale linearly, you should see a difference in execution time: the loop will probably be CPU-bound and the "vectorized" way probably memory-bound.
With that in mind, you probably should, and most likely can, avoid the loop altogether; you can, for instance, pass a vector into the lm function. If you still need to be faster after that, you can definitely parallelise in a number of ways. It would be easier to suggest how with a working example of the data.

How to solve nonlinear equations in R with Controls

I am trying to solve a system of nonlinear equations with bound constraints on the variables.
Here is my code:
library(rootSolve)  # multiroot
fun <- function(x) {
  b0 <- (0.64*1+(1-0.64)*x[1]*(x[2]*x[1]-1)+x[1]*1*(1-x[2])*x[3])/(x[1]-1) -1805*2.85*0.64
  b1plus <- (0.64*1+(1-0.64)*x[1]*(x[2]*x[1]-1.01)+x[1]*1.01*(1-x[2])*x[3])/(1.01*(x[1]-1)) -1805*2.85*0.64*(1+0.00235)
  b1minus <- (0.64*1+(1-0.64)*x[1]*(x[2]*x[1]-0.99)+x[1]*0.99*(1-x[2])*x[3])/(0.99*(x[1]-1)) -1805*2.85*0.64*(1-0.00235)
  c(b0, b1plus, b1minus)  # one residual per equation
}
multiroot(fun,c(1.5, 0, 0))
However, the result I get is far from the expected values. I want to restrict x1 to the range (1.5, 4), x2 to (0, 1), and x3 to (0, 10000). How can I do that?
Thank you!!
Methods like 'Newton-Raphson' in multiroot or nleqslv do not work well together with bounds constraints. One possible approach is to square and sum the components of your function
fun1 <- function(x) sum(fun(x)^2)
and then treat this as a global optimization problem where you hope for a minimum value of 0.0. For example, GenSA provides an implementation of "simulated annealing" that works reasonably well in low dimensions.
library(GenSA)
res <- GenSA(par=NULL, fn=fun1,
             lower=c(1.5,0,0), upper=c(4,1,10000))
res$value; res$par
## [1] 119.7869
## [1] 4.00 0.00 2469.44
Several tries did not find a lower function value than this one, which makes me think there is no common root in the constraint box you requested.
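To illustrate the sum-of-squares reformulation without extra packages, here is a sketch on a toy system with a known root, using base optim with method "L-BFGS-B" in place of GenSA. This is a local optimizer, so unlike simulated annealing it can stall in a local minimum; the system, the start point, and the box are all made up for the example.

```r
# Toy system (hypothetical): x1^2 + x2^2 = 4 and x1 = x2,
# whose root inside the box [0, 3]^2 is (sqrt(2), sqrt(2))
sys <- function(x) c(x[1]^2 + x[2]^2 - 4,
                     x[1] - x[2])

# Scalar objective: zero exactly at a common root of all equations
obj <- function(x) sum(sys(x)^2)

fit <- optim(par = c(1, 0.5), fn = obj, method = "L-BFGS-B",
             lower = c(0, 0), upper = c(3, 3))

fit$par    # candidate root inside the box
fit$value  # near zero means the equations are (almost) satisfied
```

A final objective value that stays well above zero from many different start points is the same kind of evidence as above that no root exists inside the box.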

possible bug in `rbinom()` for large numbers of trials

I've been writing some code that iteratively performs binomial draws (using rbinom), and for some arguments I can end up with size being large, which causes R (3.1.1, both official and Homebrew builds tested, so unlikely to be compiler related) to return an unexpected NA. For example:
is what I'd expect to work, but gives NA back. However, running with size=2^31 or prob≤0.5 works.
The fine manual mentions inversion being used when size < .Machine$integer.max is false, could this be the issue?
Looking at the source rbinom does the equivalent (in C code) of the following for such large sizes:
qbinom(runif(n), size, prob, FALSE)
And indeed:
#[1] 2040095619
qbinom(runif(1), 2^31, 0.95, F)
#[1] 2040095619
#[1] NA
qbinom(runif(1), 2^32, 0.95, F)
#[1] 4080199349
As @BenBolker points out, rbinom returns an integer, and if the return value is larger than .Machine$integer.max (e.g., larger than 2147483647 on my machine), NA gets returned. In contrast, qbinom returns a double. I don't know why, and it doesn't seem to be documented.
So, it seems like there is at least undocumented behavior and you should probably report it.
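The coercion step described above can be reproduced in isolation. This sketch only shows the integer-overflow mechanism and the storage modes for a small size; it does not call rbinom with a huge size, since the outcome of that may depend on the R version.

```r
# A value of the magnitude a draw with size = 2^32, prob = 0.95 would have
big <- 2^32 * 0.95
exceeds_int <- big > .Machine$integer.max

# Coercing a double beyond the integer range yields NA (with a warning)
coerced <- suppressWarnings(as.integer(big))

# Storage modes for an ordinary, small size
draw_is_integer <- is.integer(rbinom(1, 10, 0.5))
quantile_is_double <- is.double(qbinom(0.5, 10, 0.5))
```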
I agree that (in the absence of documentation saying this is a problem) this is a bug. A reasonable workaround would be using the Normal approximation, which should be very very good indeed (and faster) for such large values. (I originally meant this to be short and simple but it ended up getting a little bit out of hand.)
rbinom_safe <- function(n,size,prob,max.size=2^31) {
    maxlen <- max(length(size),length(prob),n)
    prob <- rep(prob,length.out=maxlen)
    size <- rep(size,length.out=maxlen)
    res <- numeric(n)
    bigvals <- size>max.size
    if (nbig <- sum(bigvals>0)) {
        ## Normal approximation for sizes too large for rbinom
        m <- (size*prob)[bigvals]
        sd <- sqrt(size*prob*(1-prob))[bigvals]
        res[bigvals] <- round(rnorm(nbig,mean=m,sd=sd))
    }
    if (nbig<n) {
        res[!bigvals] <- rbinom(n-nbig,size[!bigvals],prob[!bigvals])
    }
    return(res)
}
size <- c(1,5,10,2^31,2^32)
The Normal approximation should work reasonably well whenever the mean is many standard deviations away from 0 or 1 (whichever is closer). For large N this should be OK unless p is very extreme. For example:
n <- 2^31
p <- 0.95
m <- n*p
sd <- sqrt(n*p*(1-p))
set.seed(101)
rr <- rbinom_safe(10000,n,prob=p)
dd <- round(seq(m-5*sd,m+5*sd,length.out=101))
midpts <- (dd[-1]+dd[-length(dd)])/2
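As a rough numerical check of how close the Normal approximation is, one can compare exact binomial quantiles with Normal quantiles, measured in units of the standard deviation. The size of 1e8 below is an assumption chosen to keep the example well inside the integer range; it is not one of the sizes used above.

```r
size <- 1e8      # illustrative large number of trials
prob <- 0.95
m <- size * prob
s <- sqrt(size * prob * (1 - prob))

probs <- c(0.025, 0.5, 0.975)
q_exact <- qbinom(probs, size, prob)          # exact binomial quantiles
q_norm <- qnorm(probs, mean = m, sd = s)      # Normal approximation

# Discrepancy expressed as a fraction of one standard deviation
rel_gap <- abs(q_exact - q_norm) / s
```

For a size this large the gap is a tiny fraction of a standard deviation, consistent with the claim that the approximation is adequate unless p is very extreme.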
This is the intended behaviour, but there are two issues:
1) The NA induced by coercion should raise a warning
2) The fact that discrete random variables have storage mode integer should be documented.
I have fixed 1) and will modify the documentation to fix 2) when I have a little more time.

Surprisingly Slow Standard Deviation in R

I am calculating standard deviations on an expanding window where at each point I recalculate the standard deviation. This seems like a fairly straightforward thing to do that should be relatively fast. However, it takes a lot longer than you might think (~45 seconds). Am I missing something here? In Matlab this is quite fast.
t0 <- proc.time()[[3]]
z <- rep(0, 7000)
x <- rnorm(8000)
for(i in 1000:8000){
  ## print(i)
  z[i] <- sd(x[1:i])
}
print(proc.time()[[3]]- t0)
You might also try an algorithm that updates the standard deviation (well, actually, the sum of squares of differences from the mean) as you go. On my system this reduces the time from ~0.8 seconds to ~0.002 seconds.
n <- length(x)
m <- cumsum(x)/(1:n)
m1 <- c(NA,m[1:(n-1)])
ssd <- (x-m)*(x-m1)
v <- c(0,cumsum(ssd[-1])/(1:(n-1)))
z <- sqrt(v)
See for details.
Also see the answers to this question: Efficient calculation of matrix cumulative standard deviation in r
Edited to fix some typos, sorry.
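A quick way to convince yourself that the update formula is right is to compare it with sd() on every prefix of a short vector. The sketch below repeats the computation with slightly more descriptive names (the names and the vector length are illustrative); the increment (x - m) * (x - m_prev) is the Welford-style update of the sum of squared deviations.

```r
set.seed(1)
x <- rnorm(500)
n <- length(x)

m <- cumsum(x) / seq_len(n)          # running mean after each observation
m_prev <- c(NA, m[-n])               # running mean before each observation
ssd_inc <- (x - m) * (x - m_prev)    # increment to the sum of squared deviations
v <- c(0, cumsum(ssd_inc[-1]) / seq_len(n - 1))   # running sample variance
z_fast <- sqrt(v)

# Reference: recompute sd() from scratch on each prefix
z_ref <- vapply(seq_len(n), function(i) sd(x[seq_len(i)]), numeric(1))
```

The first element is excluded from any comparison because sd() of a single value is NA, while the formula above reports 0 there.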
This takes ~1.3 seconds on my machine:
t0 <- proc.time()[[3]]
x <- rnorm(8000)
z <- sapply(1000:8000,function(y){sd(x[seq_len(y)])})
print(proc.time()[[3]]- t0)
and I'd be willing to bet there are even faster ways of doing this. Avoid explicit for loops!
When a somewhat similar question about a cumulative variance and a cumulative kurtosis operation came up on R-help a few days ago, here is what I offered:
daily <- rnorm(1000000)
mbar <- mean(daily)
cumvar <- cumsum( (daily-cumsum(daily)/1:length(daily) )^2)
cumskew <- cumsum( (daily-cumsum(daily)/1:length(daily))^3)/cumvar^(3/2)
It's certainly faster than the sapply method but may be comparable to Aaron's.
system.time( cumvar <- cumsum( (daily-cumsum(daily)/1:length(daily) )^2) )
user system elapsed
0.037 0.026 0.061
system.time(cumsd <- sqrt(cumvar) )
user system elapsed
0.009 0.005 0.013
