Parallel Computing for nested for loop in R - r

As I am trying to run the following code in R, it took a very long time to execute. Hence would like to check if I could use parallel programming to run this. I see online that people would convert the task to a function first. But not too sure how can i go about doing it.
holder = matrix(0, 1000, 20)
for (x in 1:1000) {
end = x + 99
thedata = dataindataframe[x:end,]
for (y in 1:20) {
m = garchFit(~garch(1,1), data = thedata[,y], trace = FALSE)
holder [x,y] = predict(m, 1)[,3]
}
}
holder
If you could help write the code for parallel programming, that will be of great. Thank you!

This is a pretty simple parallelization scenario.
Without a reproducible example I cannot guarantee this will work. However, this is how I would approach it, that is parallelizing the outermost loop.
library(doParallel)
library(foreach)
library(fGarch)
registerDoParallel(parallel::detectCores()-2) #Or set this to whatever is reasonable for your computer/server
holder <- foreach(x=1:1000, .combine = "rbind", .packages='fGarch') %dopar% {
end = x + 99
thedata = dataindataframe[x:end,]
pred <- numeric(20L)
for (y in 1:20) {
m = garchFit(~garch(1,1), data = thedata[,y], trace = FALSE)
pred[y] = predict(m, 1)[,3]
}
return(pred)
}
Some other resources:
foreach vignette
doparallel vignette

Related

Nested foreach with changing index size

I'm trying to obtain the return of daily prices for each stock I have. The data is cross-sectionnal and very large. Thus I use doParallel and nested foreach.
Here is the code I've been using so far. (this is a reproduceable example)
and here is a reproduce-able example
stock_name <- as.data.frame(sample(x = 1:100, size = 250, replace=TRUE))
price <- as.data.frame(sample(x = 1:100, size = 1000,replace=TRUE))
## Calculating daily returns.
stock_list<-as.tbl(distinct(stock_name))
numStock<- as.integer(count(stock_list)) #150 #as.integer(count(stock_list))
nCPUcores = detectCores()
if (nCPUcores < 3) {
registerDoSEQ()
}else{
cl = makeCluster(nCPUcores - 1)
registerDoParallel(cl)
}
d_ret<-c()
foreach (stock=1:numStock, .packages = c("doParallel","foreach","data.table","plyr","dplyr")) %dopar%{
s<-as.integer(unlist(stock_list[stock,]))
stock_price <- as.matrix(price[which(stock_name[1,]==s),])
u<-nrow(stock_price)
d_ret<-foreach (p=2:u) %:%{
c(d_ret,(stock_price[p,]-stock_price[p-1,])/stock_price[p-1,])
}
}
stopCluster(cl)
##--
But the code doesn't work. After Florian Prive's remark, I checked the library and it seems that I should write nested foreach loops like this:
x <- foreach(b=bvec, .combine='cbind') %:%
foreach(a=avec, .combine='c') %dopar% {
sim(a, b)
}
So what I understand is I shouldn't be writing anything between %:% and the second foreach.
However, in my case, the second loop would change with the first foreach because there aren't the same number of prices for each stocks. Therefore I can't just write ' foreach(a=avec) '.
The second foreach would ideally depend on variable u
u<-nrow(stock_price)
Is this even possible with the foreach library?
Thank you for the help

Produce a matrix using a foreach loop and parallel processing

I am trying to convert a for loop which I am currently using to run a process across a large matrix. The current for loop finds the maximum value within a 30 x 30 section and creates a new matrix with the maximum value.
The current code for the for loop looks like this:
mat <- as.matrix(CHM) # CHM is the original raster image
maxm <- matrix(nrow=nrow(mat)/30, ncol=ncol(mat)/30) # create new matrix with new dimensions
for(i in 1:dim(maxm)[1]) {
for(j in 1:dim(maxm)[2]) {
row <- 30 * (i - 1) + 1
col <- 30 * (j - 1) + 1
maxm[i,j] <- max(CHM[row:(row + 29), col:(col + 29)])
}
}
I want to convert this to a foreach loop to use parallel processing. I've got as far as producing the following code but this dosent work. I'm not sure how to produce the new matrix within the foreach loop:
ro<-nrow(mat)/30
co<-ncol(mat)/30
maxm <- matrix(nrow=nrow(mat)/30, ncol=ncol(mat)/30)
foreach(i=ro, .combine='cbind') %:%
foreach(j=co, .combine='c') %dopar% {
row <- 30 * (i - 1) + 1
col <- 30 * (j - 1) + 1
maxm[i,j]<-(max(CHM[row:(row + 29), col:(col + 29)]))
}
Any suggestions please!
Prior to performing any action in parallel, one should try to see if any vectorizing is possible. And once that is done question 'is parallelization reasonable?'
In this specific example, parallelization is unlikely to be as fast as you expect, as at each iteration you are saving your output into a common object. R does not commonly support this in parallelization, and instead one should seek parallelization in the so called 'embarrassingly parallel-able' problems, until one gets a better understanding of how parallel problems work. In short: Don't perform parallel changes to data in R, unless you know what you're doing. It is unlikely to be faster.
That said in your case it actually becomes quite tricky. You seem to be performing a 'rolling-max window', and the output should be saved in a combined matrix. An alternative method to saving the data directly int othe matrix, is to return a matrix with 3 columns x, i, j, where the latter two are indices that indicate which row/column the value of x should be placed in.
In order for this to work, as Dmitriy noted in his answer, the data needs to be exported to each cluster (parallel session), such that we can use it. Afterwards the following example shows how one can perform the parallization
First: Create a cluster and export the dataset
set.seed(1)
#Generate test example
n <- 3000
dat <- matrix(runif(n^2), ncol = n)
library(foreach)
library(doParallel)
#Create cluster
cl <- parallel::makeCluster(parallel::detectCores())
#Register it for the foreach loop
doParallel::registerDoParallel(cl)
#Export the dataset (could be done directly in the foreach, but this is more explicit)
parallel::clusterExport(cl, "dat")
Next we come to the foreach loop. Note that according to the documentation, nested foreach loops should be seperated using the %:% tag, as shown in my example below:
output <- foreach(i = 1:(nrow(dat)/30), .combine = rbind, .inorder = FALSE) %:%
foreach(j = 1:(ncol(dat)/30), .combine = rbind, .inorder = FALSE) %dopar%{
row <- 30 * (i - 1) + 1
col <- 30 * (j - 1) + 1
c(x = max(dat[row:(row + 29), col:(col + 29)]), i = i, j = j)
}
Note the .inorder = FALSE. As i return the indices i dont care about order, only about speed.
Last but not least, we need to create the matrix. The Matrix package function Matrix::SparseMatrix allows for specifying values and indices.
output <- Matrix::sparseMatrix(output[,"i"], output[,"j"], x = output[,"x"])
This is still rather slow. For n = 3000 it took roughly 6 seconds to perform calculations + a not-insignificant overhead from exporting the data. But it is likely faster than the same method using sequential loops.
Let me try to get an answer here.
As I know, R use cluster system for parallel computation, each node works with an own environment. So, foreach-%dopar%, firstly, copy all current .globalEnv to the each cluster node and after that tried to run your code which is written in the cycle body. With no backcopy after code execution. You'll just get only a result by result = foreach(...) { }. So, the code maxm[i,j]<-(max(CHM[row:(row + 29), col:(col + 29)])) in the each node changes only local copy of your matrix, nothing more.
So, the "correct" code, probably, will be like this:
mat <- as.matrix(CHM);
ro<-nrow(mat)/30;
co<-ncol(mat)/30;
maxm = foreach(i=1:ro, .combine='cbind') %:%
{
result = foreach(j = 1:co, .combine='c') %dopar%
{
row <- 30 * (i - 1) + 1;
col <- 30 * (j - 1) + 1;
max(CHM[row:(row + 29), col:(col + 29)]);
}
result;
}
Maybe it also be need to use as.matrix for maxm.

Foreach code works for %do% but not for %dopar%

This works normally on my computer:
registerDoSNOW(makeCluster(2, type = "SOCK"))
foreach(i = 1:M,.combine = "c") %dopar% {
sum(rnorm(M))
}
So I can say that I can run parallelized code on this computer, right?
Ok. I have a piece of code that I wish to run on parallel with foreach. It runs perfectly when it's written with %do%, but doesn't work properly when I change it to %dopar%. (PS: I have already initialized the cluster with registerDoSNOW(makeCluster(2, type = "SOCK")) in the same way as before.)
My main interest in the code is getting the vector u.varpred. I get it nicely with %do%, but when I run it with %dopar%, the vector comes as a NULL.
Here is the loop with the code that's needed to run it all properly. It uses functions in the geoR package.
#you can pretty much ignore all this, it's just preparation for the loop
N=20
NN=10
set.seed(111);
datap <- grf(N, cov.pars=c(20, 5),nug=1)
grid.o <- expand.grid(seq(0, 1, l=100), seq(0, 1, l=100))
grid.c <- expand.grid(seq(0, 1, l=NN), seq(0,1, l=NN))
beta1=mean(datap$data)
emv<- likfit(datap, ini=c(10,0.4), nug=1)
krieging <- krige.conv(datap, loc=grid.o,
krige=krige.control(type.krige="SK", trend.d="cte",
beta =beta1, cov.pars=emv$cov.pars))
names(grid.c) = names(as.data.frame(datap$coords))
list.geodatas<-list()
valores<-c(datap$data,0)
list.dataframes<-list()
list.krigings<-list(); i=0; u.varpred=NULL;
#here is the foreach code
t<-proc.time()
foreach(i=1:length(grid.c[,1]), .packages='geoR') %do% {
list.dataframes[[i]] <- rbind(datap$coords,grid.c[i,]);
list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]],valores)))
list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc=grid.o,
krige=krige.control(type.krige="SK", trend.d="cte",
beta =beta1, cov.pars=emv$cov.pars));
u.varpred[i] <- mean(krieging$krige.var - list.krigings[[i]]$krige.var)
list.dataframes[[i]]<-0 #i dont need those objects anymore but since they
# are lists i dont want to put <-NULL as it'll ruin their ordering
list.krigings[[i]]<- 0
list.geodatas[[i]] <-0
}
t<-proc.time()-t
t
You can check that this runs nicely (provided you have the following packages: geoR, foreach and doSNOW). But once I use registerDoSNOW(......) and %dopar%, u.varpred comes as a NULL.
Could you guys please try to see if I made a mistake in the foreach statement/process or if it's just the code that can't be parallel? (I thought it could, because any given iteration does not deppend on any of the iterations before it..)
I am sorry both the code and this question are so long. Thanks in advance for taking the time to read it.
My friend helped me directly. Here is a way it works:
u.varpred <- foreach(i = 1:length(grid.c[,1]), .packages = 'geoR', .combine = "c") %dopar% {
list.dataframes[[i]] <- rbind(datap$coords,grid.c[i,]);
list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]],valores)));
list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc = grid.o,
krige = krige.control(type.krige = "SK", trend.d = "cte",
beta = beta1, cov.pars = emv$cov.pars));
u.varpred <- mean(krieging$krige.var - list.krigings[[i]]$krige.var);
list.dataframes[[i]] <- 0;
list.krigings[[i]] <- 0;
list.geodatas[[i]] <- 0;
u.varpred #this makes the results go into u.varpred
}
He gave me an example on why this works:
a <- NULL
foreach(i = 1:10) %dopar% {
a <- 5
}
print(a)
# a is still NULL
a <- NULL
a <- foreach(i = 1:10) %dopar% {
a <- 5
a
}
print(a)
#now it works
Hope this helps anyone.

using "foreach" for running different classifiers in R

I am trying to use foreach to run different classifiers on my data, but it doesn't work. In fact it doesn't return me anything.
my purpose is to parallelize my process. here is the simplified of my code:
library(foreach)
library(doParallel)
no_cores <- detectCores() - 1
cl<-makeCluster(no_cores)
registerDoParallel(cl)
registerDoParallel(no_cores)
model_list<-foreach(i = 1:2,
.combine = c,.packages=c("e1071","randomeForest")) %dopar%
if (i==1){
model1<-svm(x = X,y = as.factor(Y),type = "C-classification",probability = T)
}
if (i==2){
mode2<-randomForest(x = X,y = as.factor(Y), ntree=100, norm.votes=FALSE,importance = T)
}
My way of parallelizing is correct overall?
Thanks indeed.
The main problem is that you're not enclosing the body of the foreach loop in curly braces. Because %dopar% is a binary operator, you have to be careful about precedence, which is why I recommend always using curly braces.
Also, you shouldn't use c as the combine function. Since svm and randomForest return objects, the default behavior of returning the results in a list is appropriate. Combining them with c will give you a garbage result.
Finally, it doesn't make sense to call registerDoParallel twice. It doesn't hurt, but it makes your code confusing.
I suggest:
library(doParallel)
no_cores <- detectCores() - 1
registerDoParallel(no_cores)
model_list <- foreach(i = 1:2,
.packages=c("e1071","randomForest")) %dopar% {
if (i==1) {
svm(x = X,y = as.factor(Y),type = "C-classification",
probability = T)
} else {
randomForest(x = X,y = as.factor(Y), ntree=100, norm.votes=FALSE,
importance = T)
}
}
I also removed the two unnecessary variable assignments to model1 and model2. Those variables won't be defined correctly on the master, and it obscures how the foreach loop really works.

nlsBoot and foreach %dopar%: scoping issues

I would like to do bootstrap of residuals for nls fits in a loop. I use nlsBoot and in order to decrease computation time I would like to do that in parallel (on a Windows 7 system at the moment). Here is some code, which reproduces my problem:
#function for fitting
Falge2000 <- function(GP2000,alpha,PAR) {
(GP2000*alpha*PAR)/(GP2000+alpha*PAR-GP2000/2000*PAR)
}
#some data
PAR <- 10:1600
GPP <- Falge2000(-450,-0.73,PAR) + rnorm(length(PAR),sd=0.0001)
df1 <- data.frame(PAR,GPP)
#nls fit
mod <- nls(GPP~Falge2000(GP2000,alpha,PAR),start=list(GP2000=-450,alpha=-0.73),data=df1, upper=c(0,0),algorithm="port")
#bootstrap of residuals
library(nlstools)
summary(nlsBoot(mod,niter=5))
#works
#now do it several times
#and in parallel
library(foreach)
library(doParallel)
cl <- makeCluster(1)
registerDoParallel(cl)
ttt <- foreach(1:5, .packages='nlstools',.export="df1") %dopar% {
res <- nlsBoot(mod,niter=5)
summary(res)
}
#Error in { :
#task 1 failed - "Procedure aborted: the fit only converged in 1 % during bootstrapping"
stopCluster(cl)
I suspect this an issue with environments and after looking at the code of nlsBoot the problem seems to arise from the use of an anonymous function in a lapply call:
l1 <- lapply(1:niter, function(i) {
data2[, var1] <- fitted1 + sample(scale(resid1, scale = FALSE),
replace = TRUE)
nls2 <- try(update(nls, start = as.list(coef(nls)), data = data2),
silent = TRUE)
if (inherits(nls2, "nls"))
return(list(coef = coef(nls2), rse = summary(nls2)$sigma))
})
if (sum(sapply(l1, is.null)) > niter/2)
stop(paste("Procedure aborted: the fit only converged in",
round(sum(sapply(l1, is.null))/niter), "% during bootstrapping"))
Is there a way to use nlsBoot in a parallel loop? Or do I need to modify the function? (I could try to use a for loop instead of lapply.)
By moving the creation of the mod object into the %dopar% loop, it looks like everything works OK. Also, this automatically exports the df1 object, so you can remove the .export argument.
ttt <- foreach(1:5, .packages='nlstools') %dopar% {
mod <- nls(GPP~Falge2000(GP2000,alpha,PAR),start=list(GP2000=-450,alpha=-0.73),data=df1, upper=c(0,0),algorithm="port")
res <- nlsBoot(mod,niter=5)
capture.output(summary(res))
}
However, you might need to work out what you want returned. Using capture.output was just to see if things were working, since summary(res) seemed to only return NULL.

Resources