Correct way to generate parallel random booleans

Following on from this question (Cannot understand why random number algorithms give different results), I have some code simulating random booleans. Because I wish to do this a lot, and fast, I want to wrap it in a function like so:
# setup external to function
number <- 5
probs <- rep(0.1, 5)

# core function: simulate 'things' rows of 'number' booleans each
event.sim <- function(var, things) {
  mod.probs <- probs * var  # scale the base probabilities by var
  events <- matrix(rbinom(things * number, 1, mod.probs),
                   ncol = number, byrow = FALSE)
  av.events <- max(rowSums(events))
  return(av.events)
}

library("parallel")
cl <- makeCluster(4)
clusterExport(cl, c("event.sim", "probs", "number"))
test <- clusterMap(cl, event.sim, var = df1$var1, things = df1$things,
                   SIMPLIFY = TRUE)
stopCluster(cl)
and parallelize it using clusterMap() from parallel. This is no problem and I have it working; however, I am concerned that by executing in parallel, my booleans are no longer sufficiently "random". I can find a lot of information online about generating random numbers in parallel, but it all seems to describe generating lots of random numbers at once, and I can't relate that to my function, which draws relatively few random booleans each time it is run. Do I have a problem here, and do I need to do something differently?

You just need to use clusterSetRNGStream(cl) after creating your cluster and before running your function.
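For example, with the cluster from the question (a minimal sketch; the iseed value is arbitrary and only serves to make the whole run reproducible):
cl <- makeCluster(4)
clusterSetRNGStream(cl, iseed = 42)  # each worker gets its own L'Ecuyer-CMRG stream
clusterExport(cl, c("event.sim", "probs", "number"))
test <- clusterMap(cl, event.sim, var = df1$var1, things = df1$things,
                   SIMPLIFY = TRUE)
stopCluster(cl)
Each worker then draws from an independent random-number stream, so the booleans generated in parallel are statistically as sound as those from a serial run.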

Related

Optimizing lm() function in a loop

I'm using the R built-in lm() function in a loop for estimating a custom statistic:
for (i in 1:10000) {
  x <- rnorm(n)
  reg2 <- lm(x ~ data$Y)
  Max[i] <- max(abs(rstudent(reg2)))
}
This is really slow when increasing both the loop counter (typically we want to test 10^6 or even 10^9 iterations for precision reasons) and the size of Y.
Having read the following Stack topic, a first attempt was to optimize the whole thing using parallel regression (with calm()):
library(parallel)
library(partools)  # provides distribsplit() and calm()
cls <- makeCluster(4)
distribsplit(cls, "test")
distribsplit(cls, "x")
for (i in 1:10000) {
  x <- rnorm(n)
  reg2 <- calm(cls, "x ~ test$Y, data = test")
  Max[i] <- max(abs(reg2$residuals / sd(reg2$residuals)))
}
This ended up much slower (by a factor of 6) than the original, unparallelized loop. My assumption is that we ask for creating/destroying the threads in each loop iteration, and that slows the process down a lot in R.
A second attempt was to use .lm.fit() according to this Stack topic:
for (i in 1:10000) {
  x <- rnorm(n)
  reg2 <- .lm.fit(as.matrix(x), data$Y)
  Max[i] <- max(abs(reg2$residuals / sd(reg2$residuals)))
}
This resulted in much faster processing than the initial and original version, such that we now have .lm.fit() < lm() < calm() in terms of overall processing time.
However, we are still looking for options to improve the efficiency (in terms of processing time) of this code. What are the possible options? I assume that making the loop parallel would save some processing time?
Edit: Minimal Example
Here is a minimal example:
# Import data
sample <- read.csv("sample.txt")

# Preallocation
Max <- vector(mode = "numeric", length = 100)
n <- length(sample$AGE)
x <- matrix(rnorm(100 * n), 100)

for (i in 1:100) {
  reg <- lm(x[i, ] ~ sample$AGE)  # one simulated response per iteration
  Max[i] <- max(abs(rstudent(reg)))
}
with the following dataset 'sample.txt':
AGE
51
22
46
52
54
43
61
20
66
27
From here, we made several tests and noted the following:
Following @Karo's contribution, we generated the matrix of normal samples outside the loop to save some execution time. We expected a noticeable impact, but our tests indicate that doing so produces the unexpected inverse result (i.e. a longer execution time). Maybe the effect reverses when increasing the number of simulations.
Following @BenBolker's suggestion, we also tested fastLm() and it reduces the execution time, but the results seem to differ (by a factor of 0.05) compared to the typical lm().
We are still struggling to effectively reduce the execution time. Following @Karo's suggestions, we will try to pass a vector directly to lm() and investigate parallelization (but failed with calm() for an unknown reason).
Wide-ranging comments above, but I'll try to answer a few narrower points.
I seem to get the same (i.e., all.equal() is TRUE) results with .lm.fit and fastLmPure, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)
nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))
testfun <- function(fitFn = .lm.fit, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  x <- rnorm(n)
  reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
  return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 - also used to pick y -
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed = 102), testfun(fastLmPure, seed = 102)) ## TRUE
fastLmPure is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
                       testfun(lm.fit),
                       testfun(fastLmPure),
                       times = 1000))
Unit: milliseconds
                expr      min       lq      mean   median        uq      max
           testfun() 6.603822 8.234967  8.782436 8.332270  8.745622 82.54284
     testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
 testfun(fastLmPure) 5.964700 7.358141  7.818624 7.471030  7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit, rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions" that might be worth a try. (Unfortunately, providing a matrix that combines many separate X vectors runs a multiple regression, not many univariate regressions ...)
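A minimal sketch of that matrix-response idea (the sizes here are made up, and I use lm.fit, which also accepts a matrix of responses, to show the shape of the result):
n <- 1e4
x <- as.matrix(rnorm(n))             # one predictor column, no intercept
Y <- matrix(rnorm(n * 100), n, 100)  # 100 independent response vectors
fit <- lm.fit(x, Y)                  # one call instead of 100 separate fits
dim(fit$residuals)                   # n x 100: one residual column per response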
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000, testfun(fastLmPure), simplify = FALSE))
##    user  system elapsed
##  38.953   0.072  39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
                     mc.cores = 5))
##    user  system elapsed
##  43.225   0.627  12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?
Since I still can't comment:
Try to avoid loops in R. For some reason you are recalculating those random numbers every iteration. You can do that without a loop:
duration_loop <- system.time({
  for (i in 1:10000000) {
    x <- rnorm(10)
  }
})

duration <- system.time({
  m <- matrix(rnorm(10000000 * 10), 10000000)
})
Both ways should create 10 random values per iteration/matrix row, with the same number of iterations/rows. Though both seem to scale linearly, you should see a difference in execution time: the loop will probably be CPU-bound and the "vectorized" way probably memory-bound.
With that in mind, you probably should, and most likely can, avoid the loop altogether; you can for instance pass a vector into the lm function. If you still need to be faster after that, you can definitely parallelize in a number of ways; it would be easier to suggest how with a working example of data.
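Here is a minimal sketch of one way to read that advice (the data are invented): lm() accepts a matrix on the left-hand side of the formula and fits one regression per column, which removes the simulation loop entirely:
n <- 100
X <- matrix(rnorm(n * 50), n)  # 50 simulated response vectors at once
age <- rnorm(n)                # stand-in predictor (hypothetical data)
fit <- lm(X ~ age)             # a single multi-response ("mlm") fit = 50 regressions
dim(residuals(fit))            # n x 50: one column per simulated response
Note that rstudent() may not handle such multi-response fits, so a per-column statistic would have to be computed from residuals(fit) directly.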

Why does the processing time behave differently with these two functions using parallel?

Imagine I have two functions that I want to apply to the rows of a "big" matrix or data frame: one is a simple mean of a sum of squares, and the other, a little more elaborate, computes a regression.
In order to take advantage of multiple cores (on Windows) I tried the parallel package and got very different results for the two functions using the same sequence of commands.
For the apparently more complex function (the regression), the time reduction from using more cores is significant. (Here I show results from a PC with 3 cores and a PC with 12 cores; the behavior is similar with up to 11 cores, and the time reduction decreases with more cores.)
But for the "simple" function, the mean of squares, the execution time is very variable, almost erratic (also tested with up to 11 cores).
First, is there a reason why this is happening? Second, I imagine there are other ways to do this task; can you suggest any?
Here is the code to generate the plots:
library(parallel)
nc <- detectCores() - 1                                 # number of cores
myFun  <- function(z) coef(lm(rep(1, length(z)) ~ z))   # regression
myFun2 <- function(z) sum(z^2) / length(z)              # mean of squares
my.mat <- matrix(rnorm(1000000, .01, 0.4), ncol = 100)  # data

# using FUN = myFun
# Replicate 10 times
for (j in 1:10) {
  ncor <- 2:nc
  timed <- c()
  for (i in seq_along(ncor)) {
    cl <- makeCluster(mc <- getOption("cl.cores", ncor[i]))
    stime <- Sys.time()
    res <- parApply(cl = cl, X = my.mat, MARGIN = 1, FUN = myFun)
    tm <- Sys.time() - stime
    timed[i] <- tm
    stopCluster(cl)
  }
  # no cores
  stime <- Sys.time()
  res <- apply(my.mat, MARGIN = 1, FUN = myFun)
  tm <- Sys.time() - stime
  (dr <- data.frame(nc = c(1, ncor), ts = as.numeric(c(tm, timed))))
  plot(dr, type = "l", col = 3, main = j)
  if (j == 1) fres1 <- dr else fres1 <- merge(fres1, dr, by = "nc")
}
plot(fres1[, 1:2], type = "l", col = 2, ylim = range(fres1[, -1]))
for (i in 3:11) lines(fres1[, i], col = i + 1)
# For the second plot use the same code but change FUN = myFun2

Simulate over an xts object in R

I am looking for a function, or package, that will help me with this goal. I've looked through several packages but can't find what I am looking for:
Let's say I have an xts object with 10 columns and 250 rows.
What I want to do is run a simulation, such that I get a robust calculation of my performance metric over the period.
So, let's say that I have 250 data points. I want to run x number of simulations over random samples of the data, computing the Sharpe ratio using PerformanceAnalytics::SharpeRatio and varying the random sample lengths from 30 to 240, and then find the average. Keep in mind I want to do this for every column, and I'd rather not have to use apply if possible. I'd also like something that processes the information rather quickly.
What package or functions would best serve this purpose?
Thank you!
Subsetting xts objects for the rows you want to randomly sample should be good enough, performance-wise, if that is your main concern. If you want some other concrete examples, you may find it useful to look at the Monte Carlo simulation functions recently added to the R blotter package:
https://github.com/braverock/blotter/blob/master/R/mcsim.R
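A minimal sketch of that subsetting idea, assuming x is the 250-row xts object from the question:
ii <- sort(sample(NROW(x), 40))  # 40 random row indices, kept in time order
xs <- x[ii, ]                    # integer row-subsetting works directly on xts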
Your requirements are quite detailed and a little tricky to follow, but I think this example may be what you're after?
This solution does use apply functions, though, because they just make life easier. If you don't use lapply, the code expands quickly and distracts from achieving the goal (and you risk introducing bugs with longer, messier code; one reason to use apply-family functions where you can).
library(quantmod)
library(PerformanceAnalytics)
# Set up the data:
syms <- c("GOOG", "FB", "TSLA", "SNAP", "MU")
getSymbols(syms)
z <- do.call(merge, lapply(syms, function(s) {
  x <- get(s)
  dailyReturn(Cl(x))
}))
# Here we have 250 rows, 5 columns:
z <- tail(z, 250)
colnames(z) <- paste0(syms, ".rets")
subSample <- function(x, n.sub = 40) {
  # Subsample by row, preserving all returns and the cross-symbol
  # dependence structure at a given timestamp
  ii <- sample(1:NROW(x), size = n.sub, replace = FALSE)
  # sort in order to preserve time ordering
  ii <- sort(ii)
  xs <- x[ii, ]
  xs
}
set.seed(5)
# test:
z2 <- subSample(z, n.sub = 40)
zShrp <- SharpeRatio(z2)[1, ]
# now run simulation:
nSteps <- seq(30, 240, by = 30)
sharpeSimulation <- function(x, n.sub) {
  x <- subSample(x, n.sub)
  SharpeRatio(x)[1, ]
}
res <- lapply(nSteps, FUN = sharpeSimulation, x = z)
res <- do.call(rbind, res)
resMean <- colMeans(res)
resMean
# GOOG.rets FB.rets TSLA.rets SNAP.rets MU.rets
# 0.085353854 0.059577882 0.009783841 0.026328660 0.080846592
Do you realise that SharpeRatio uses sapply? And it's likely other performance metrics you want to use will as well. Since you seem to have something against apply (possibly all apply functions in R), this might be worth noting.

How can I make a parallel operation faster than the serial version?

I'm attempting to "map" a function onto an array. However when trying both simple and complex functions, the parallel version is always slower than the serial version. How can I improve the performance of a parallel computation in R?
Simple parallel example:
library(parallel)

# Number of elements
arrayLength <- 100
# Create data
input <- 1:arrayLength

# A simple computation
foo <- function(x, y) x^y - x^(y - 1)

# Add complexity
iterations <- 5 * 1000 * 1000

# Perform complex computation on each element
compute <- function(x) {
  y <- x
  for (i in 1:iterations) {
    x <- foo(x, y)
  }
  return(x)
}

# Parallelized compute
computeParallel <- function(x) {
  # Create a cluster with 1 fewer cores than are available.
  cl <- makeCluster(detectCores() - 1) # 8-1 cores
  # Send static vars & funcs to all cores
  clusterExport(cl, c('foo', 'iterations'))
  # Map
  out <- parSapply(cl, x, compute)
  # Clean up
  stopCluster(cl)
  return(out)
}

system.time(out <- compute(input))         # 12 seconds using 25% of cpu
system.time(out <- computeParallel(input)) # 160 seconds using 100% of cpu
The problem is that you traded off all of the vectorization for parallelization, and that's a bad trade. You need to keep as much vectorization as possible to have any hope of getting an improvement with parallelization for this kind of problem.
The pvec function in the parallel package can be a good solution to this kind of problem, but it doesn't run in parallel on Windows. A more general solution which works on Windows is to use foreach with the itertools package, which contains functions that are useful for iterating over various objects. Here's an example that uses the isplitVector function to create one subvector for each worker:
library(doParallel)
library(itertools)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

computeChunk <- function(x) {
  foreach(xc = isplitVector(x, chunks = getDoParWorkers()),
          .export = c('foo', 'iterations', 'compute'),
          .combine = 'c') %dopar% {
    compute(xc)
  }
}
This still may not compare very well to the pure vector version, but it should get better as the value of "iterations" increases. It may actually help to decrease the number of workers unless the value of "iterations" is very large.
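Usage would then look something like this (a sketch; input is the vector from the question, and the cluster should be stopped when you're done):
system.time(out <- computeChunk(input))
stopCluster(cl)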
parSapply will run the function on each element of input separately, which means you are giving up the speed you gained from writing foo and compute in a vectorized fashion.
pvec will run a vectorized function on multiple cores by chunks. Try this:
system.time(out <- pvec(input, compute, mc.cores=4))

Parallelizing a double for loop in R

I've been using the parallel package in R to do loops like:
cl <- makeCluster(getOption("cl.cores", 6))
result <- parSapply(cl,1:k,function(i){ ... })
Is there a natural way to parallelize a nested for loop in R using this package? Or perhaps another package? I know there are several ways to implement parallelism in R.
My loop looks something like this. I simplified a bit but it gets the message across:
k <- 100000                                  # number of iterations
sigma <- seq(from = 0.1, to = 10, by = 0.2)
sup_mse <- matrix(0, nrow = k, ncol = length(sigma))
for (i in 1:k) {
  for (j in 1:length(sigma)) {
    sup <- supsmu(x, y)
    sup_mse[i, j] <- mean((m(x) - sup$y)^2)
  }
}
Thanks for making the reproducible example! I prefer snowfall for my parallel processing, so here's how it looks in there.
install.packages('snowfall')
require(snowfall)

### wasn't sure what you were using for x or y
set.seed(1001)
x <- sample(seq(1, 100), 20)
y <- sample(seq(1, 100), 20)
k <- 100
sigma <- seq(0.1, 10, 0.2)

### makes a local cluster on 4 cores and puts the data each core will need onto each
sfInit(parallel = TRUE, cpus = 4, type = "SOCK", socketHosts = rep("localhost", 4))
sfExport('x', 'y', 'k', 'sigma')

answers <- sfSapply(seq(1, k), function(M)
  sapply(seq(1, length(sigma)), function(N)
    mean((mean(x) - supsmu(x, y)$y)^2)  ## wasn't sure what you meant by m(x), so guessed mean
  )
)
sup_mse <- t(answers)  ## will give you a matrix with length(sigma) columns and k rows
sfStop()
I remember reading somewhere that you only want to use sfSapply in the outer loops and then use your regular apply functions inside of that loop. Hope this helps!
