using "foreach" for running different classifiers in R - r

I am trying to use foreach to run different classifiers on my data, but it doesn't work; in fact, it doesn't return anything.
My goal is to parallelize the process. Here is a simplified version of my code:
library(foreach)
library(doParallel)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)
registerDoParallel(no_cores)

model_list <- foreach(i = 1:2,
                      .combine = c, .packages = c("e1071", "randomForest")) %dopar%
  if (i == 1) {
    model1 <- svm(x = X, y = as.factor(Y), type = "C-classification", probability = T)
  }
  if (i == 2) {
    model2 <- randomForest(x = X, y = as.factor(Y), ntree = 100, norm.votes = FALSE, importance = T)
  }
Is my way of parallelizing correct overall?
Thanks in advance.

The main problem is that you're not enclosing the body of the foreach loop in curly braces. Because %dopar% is a binary operator, you have to be careful about precedence: only the expression immediately following the operator (here, the first if statement) becomes the loop body, so the second if is evaluated outside the loop entirely. That is why I recommend always using curly braces.
Also, you shouldn't use c as the combine function. Since svm and randomForest return objects, the default behavior of returning the results in a list is appropriate; combining them with c will give you a garbage result.
Finally, it doesn't make sense to call registerDoParallel twice. It doesn't hurt, but it makes your code confusing.
I suggest:
library(doParallel)

no_cores <- detectCores() - 1
registerDoParallel(no_cores)

model_list <- foreach(i = 1:2,
                      .packages = c("e1071", "randomForest")) %dopar% {
  if (i == 1) {
    svm(x = X, y = as.factor(Y), type = "C-classification",
        probability = T)
  } else {
    randomForest(x = X, y = as.factor(Y), ntree = 100, norm.votes = FALSE,
                 importance = T)
  }
}
I also removed the two unnecessary variable assignments to model1 and model2. Those variables won't be defined correctly on the master anyway, and they obscure how the foreach loop really works.
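For reference, here is a quick sketch of how the returned list could be used afterwards (assuming X and Y exist in your session; the variable names below are just illustrative):
# model_list is an ordinary list: element 1 is the svm fit, element 2 the random forest
svm_model <- model_list[[1]]
rf_model  <- model_list[[2]]

# e.g. predictions on the training data
svm_pred <- predict(svm_model, X)
rf_pred  <- predict(rf_model, X)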

Related

Parallel Computing for nested for loop in R

The following code takes a very long time to execute, so I would like to check whether I could use parallel programming to run it. I have seen online that people convert the task to a function first, but I am not too sure how to go about doing that.
holder = matrix(0, 1000, 20)
for (x in 1:1000) {
  end = x + 99
  thedata = dataindataframe[x:end, ]
  for (y in 1:20) {
    m = garchFit(~ garch(1, 1), data = thedata[, y], trace = FALSE)
    holder[x, y] = predict(m, 1)[, 3]
  }
}
holder
If you could help write the code for parallel programming, that would be great. Thank you!
This is a pretty simple parallelization scenario.
Without a reproducible example I cannot guarantee this will work. However, this is how I would approach it, namely by parallelizing the outermost loop.
library(doParallel)
library(foreach)
library(fGarch)

registerDoParallel(parallel::detectCores() - 2) # or set this to whatever is reasonable for your computer/server

holder <- foreach(x = 1:1000, .combine = "rbind", .packages = 'fGarch') %dopar% {
  end = x + 99
  thedata = dataindataframe[x:end, ]
  pred <- numeric(20L)
  for (y in 1:20) {
    m = garchFit(~ garch(1, 1), data = thedata[, y], trace = FALSE)
    pred[y] = predict(m, 1)[, 3]
  }
  pred
}
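One small follow-up: since the backend here is registered with a core count rather than a cluster object, doParallel's stopImplicitCluster() can be used to release the workers when you are done:
# shut down the implicit cluster created by registerDoParallel(cores)
doParallel::stopImplicitCluster()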
Some other resources:
foreach vignette
doParallel vignette

Nested foreach with changing index size

I'm trying to obtain the daily price returns for each stock I have. The data is cross-sectional and very large, so I use doParallel and nested foreach.
Here is the code I have been using so far (a reproducible example):
stock_name <- as.data.frame(sample(x = 1:100, size = 250, replace = TRUE))
price <- as.data.frame(sample(x = 1:100, size = 1000, replace = TRUE))

## Calculating daily returns.
stock_list <- as.tbl(distinct(stock_name))
numStock <- as.integer(count(stock_list))

nCPUcores = detectCores()
if (nCPUcores < 3) {
  registerDoSEQ()
} else {
  cl = makeCluster(nCPUcores - 1)
  registerDoParallel(cl)
}

d_ret <- c()
foreach(stock = 1:numStock, .packages = c("doParallel", "foreach", "data.table", "plyr", "dplyr")) %dopar% {
  s <- as.integer(unlist(stock_list[stock, ]))
  stock_price <- as.matrix(price[which(stock_name[1, ] == s), ])
  u <- nrow(stock_price)
  d_ret <- foreach(p = 2:u) %:% {
    c(d_ret, (stock_price[p, ] - stock_price[p-1, ]) / stock_price[p-1, ])
  }
}
stopCluster(cl)
But the code doesn't work. After Florian Prive's remark, I checked the library documentation, and it seems that I should write nested foreach loops like this:
x <- foreach(b = bvec, .combine = 'cbind') %:%
  foreach(a = avec, .combine = 'c') %dopar% {
    sim(a, b)
  }
So what I understand is that I shouldn't write anything between %:% and the second foreach.
However, in my case the second loop changes with each iteration of the first foreach, because there isn't the same number of prices for each stock, so I can't just write foreach(a = avec). Ideally, the second foreach would depend on the variable u:
u <- nrow(stock_price)
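Conceptually, something like this is what I'm hoping for (just a sketch of the intent, not working code):
d_ret <- foreach(stock = 1:numStock, .combine = "c") %:%
  foreach(p = 2:u) %dopar% {   # u would have to depend on the current stock
    (stock_price[p, ] - stock_price[p-1, ]) / stock_price[p-1, ]
  }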
Is this even possible with the foreach library?
Thank you for the help

How to prevent nested foreach loop using 100% CPU of all cores in R?

I'm running a nested three-layer foreach loop but am unable to prevent the code from occupying 100% of the CPUs on a remote server (Linux, CentOS, 14 physical cores, 56 logical cores). The framework I use is:
library(doParallel)
doParallel::registerDoParallel(20)

outRes <- foreach1(i = seq1, …) %:%
  foreach2(j = seq2, …) %dopar% {
    innerRes <- foreach3(k = seq3, …)
  }
Three questions occur to me:
For nested foreach loops, will the registered backend be passed to each foreach loop and actually result in 20*3 = 60 workers?
What is the mathematical relationship between the number of workers and the CPU utilization percentage?
In my real case, foreach1 and foreach2 are small processes, while foreach3 is a large process. This causes a problem: most of the time the workers sit idle waiting, which wastes them. Is there any solution for this?
PS: a reproducible code example is attached.
library(mlbench)
data("Sonar")
str(Sonar)
table(Sonar$Class)

seed <- 1234

# for cross validation
number_outCV <- 10
repeats_outCV <- 10
number_innerCV <- 10
repeats_innerCV <- 10

# list of numbers of features to model
featureSeq <- c(10, 30, 50)

# for LASSO training
lambda <- exp(seq(-7, 0, 1))
alpha <- 1

dataList <- list(data1 = Sonar, data2 = Sonar, data3 = Sonar, data4 = Sonar, data5 = Sonar, data6 = Sonar)

library(caret) # for trainControl() and train()

# library(doMC)
# doMC::registerDoMC(cores = 20)
library(doParallel)
doParallel::registerDoParallel(20)

nestedCV <- foreach::foreach(clust = 1:length(dataList), .combine = "c", .verbose = TRUE) %:%
  foreach::foreach(outCV = 1:(number_outCV*repeats_outCV), .combine = "c", .verbose = TRUE) %dopar% {
    # prepare data
    dataset <- dataList[[clust]]
    table(dataset$Class)

    # split data into model-developing and testing data in the outer CV: repeated 10-fold CV
    set.seed(seed)
    ResampIndex <- caret::createMultiFolds(y = dataset$Class, k = number_outCV, times = repeats_outCV)
    developIndex <- ResampIndex[[outCV]]
    developX <- dataset[developIndex, !colnames(dataset) %in% c("Class")]
    developY <- dataset$Class[developIndex]
    testX <- dataset[-developIndex, !colnames(dataset) %in% c("Class")]
    testY <- dataset$Class[-developIndex]

    # get a pool of all the features
    features_all <- colnames(developX)

    # train the model with inner repeated 10-fold CV
    # foreach for the nfeature search
    nfeatureRes <- foreach::foreach(featNumIndex = seq(along = featureSeq), .combine = "c", .verbose = TRUE) %dopar% {
      nfeature <- featureSeq[featNumIndex]
      selectedFeatures <- features_all[1:nfeature]

      # train LASSO
      lassoCtrl <- trainControl(method = "repeatedcv",
                                number = number_innerCV,
                                repeats = repeats_innerCV,
                                verboseIter = TRUE, returnResamp = "all", savePredictions = "all",
                                classProbs = TRUE, summaryFunction = twoClassSummary)
      lassofit.cv <- train(x = developX[, selectedFeatures],
                           y = developY,
                           method = "glmnet",
                           metric = "ROC",
                           trControl = lassoCtrl,
                           tuneGrid = expand.grid(lambda = lambda, alpha = alpha),
                           preProcess = c("center", "scale"))

      AUC.test <- pROC::auc(response = testY, predictor = predict(lassofit.cv, newdata = testX[, selectedFeatures], type = "prob")[[2]])
      performance <- data.frame(Class = clust, outCV = outCV, nfeature = nfeature, AUC.cv = max(lassofit.cv$results$ROC), AUC.test = as.numeric(AUC.test))
    }
    # end of the nfeature-search foreach loop

    nfeatureRes
  }
# end of the outCV foreach loop as well as the dataList foreach loop

foreach::registerDoSEQ()
If you want to make sure your code only uses a certain number of cores, you can pin your process to specific cores. This is called "CPU affinity", and in R you can set it with parallel::mcaffinity, e.g.:
parallel::mcaffinity(1:20)
to allow your R process to use only the first 20 cores. This works regardless of which libraries are used inside the process, because it invokes OS-level control over resources (some rare libraries spawn or communicate with other processes, but your code doesn't seem to use anything like that).
%:% is the right way to nest foreach loops: the foreach package will consider both the inner and outer loop in its scheduling, and execute only as many inner bodies at a time as there are registered workers, whether they come from the same outer-loop iteration or not. The wrong way would be e.g. foreach(…) %dopar% { foreach(…) %dopar% { … } }, which would spawn a squared number of computations at a time (20^2 = 400 in your case). foreach(…) %do% { foreach(…) %dopar% { … } } (or the other way around) would be better, but still suboptimal. See foreach's nesting vignette for details.
In your case it would probably be best to keep the two outer loops as they are now (%:% and %dopar%) and change the inner loop to %do%. You still have quite a lot of iterations in the two outer loops to fill 20 workers, and the common rule is that it is better to parallelize outer loops than inner ones when possible.
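As a sketch of that recommendation, applied to the framework from the question (seq1, seq2 and seq3 are the question's placeholder iterators):
library(doParallel)
doParallel::registerDoParallel(20)

outRes <- foreach(i = seq1) %:%
  foreach(j = seq2) %dopar% {
    # the large inner process now runs sequentially inside each worker
    innerRes <- foreach(k = seq3) %do% {
      # ... heavy work for one (i, j, k) combination ...
    }
    innerRes
  }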
After many experiments, I'm guessing this is how foreach() might fork the workers:
If nested foreach is used (e.g. foreach() %:% foreach() %dopar% {}): the workers (logical CPU cores which share storage) forked will be the number of cores registered beforehand multiplied by the number of foreach() calls. E.g.:
registerDoMC(cores = 10)
foreach() %:% foreach() %:% foreach() %dopar% {} # 10x3 = 30 workers will finally be forked
If a foreach() is nested in another foreach() without using %:%, the workers (logical CPU cores) forked will be the cores registered for the %:% part multiplied by the independent nested part. E.g.:
registerDoMC(cores = 10)
foreach() %:% foreach() %dopar% { foreach() } # (10+10)x10 = 200 workers will finally be forked
Corrections welcome if this is wrong.
I don't know if this is possible, but maybe you could try reducing the process priority by running it with the "nice" command (so that, even if it is using 100% CPU, it will only get CPU time when the machine is otherwise idle)?

Foreach code works for %do% but not for %dopar%

This works normally on my computer:
registerDoSNOW(makeCluster(2, type = "SOCK"))
foreach(i = 1:M, .combine = "c") %dopar% {
  sum(rnorm(M))
}
So I can say that I can run parallelized code on this computer, right?
OK. I have a piece of code that I wish to run in parallel with foreach. It runs perfectly when written with %do%, but doesn't work properly when I change it to %dopar%. (PS: I have already initialized the cluster with registerDoSNOW(makeCluster(2, type = "SOCK")) in the same way as before.)
My main interest in the code is the vector u.varpred. I get it nicely with %do%, but when I run the loop with %dopar%, the vector comes back NULL.
Here is the loop with the code that's needed to run it all properly. It uses functions in the geoR package.
# you can pretty much ignore all this, it's just preparation for the loop
N = 20
NN = 10
set.seed(111)
datap <- grf(N, cov.pars = c(20, 5), nug = 1)
grid.o <- expand.grid(seq(0, 1, l = 100), seq(0, 1, l = 100))
grid.c <- expand.grid(seq(0, 1, l = NN), seq(0, 1, l = NN))
beta1 = mean(datap$data)
emv <- likfit(datap, ini = c(10, 0.4), nug = 1)
krieging <- krige.conv(datap, loc = grid.o,
                       krige = krige.control(type.krige = "SK", trend.d = "cte",
                                             beta = beta1, cov.pars = emv$cov.pars))
names(grid.c) = names(as.data.frame(datap$coords))
list.geodatas <- list()
valores <- c(datap$data, 0)
list.dataframes <- list()
list.krigings <- list(); i = 0; u.varpred = NULL;

# here is the foreach code
t <- proc.time()
foreach(i = 1:length(grid.c[, 1]), .packages = 'geoR') %do% {
  list.dataframes[[i]] <- rbind(datap$coords, grid.c[i, ])
  list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]], valores)))
  list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc = grid.o,
                                   krige = krige.control(type.krige = "SK", trend.d = "cte",
                                                         beta = beta1, cov.pars = emv$cov.pars))
  u.varpred[i] <- mean(krieging$krige.var - list.krigings[[i]]$krige.var)
  # I don't need these objects anymore, but since they are lists
  # I don't want to assign NULL, as that would ruin their ordering
  list.dataframes[[i]] <- 0
  list.krigings[[i]] <- 0
  list.geodatas[[i]] <- 0
}
t <- proc.time() - t
t
You can check that this runs nicely (provided you have the geoR, foreach and doSNOW packages). But once I use registerDoSNOW(......) and %dopar%, u.varpred comes back NULL.
Could you please check whether I made a mistake in the foreach statement/process, or whether this code just can't be parallelized? (I thought it could, because any given iteration does not depend on any of the iterations before it.)
I am sorry that both the code and this question are so long. Thanks in advance for taking the time to read it.
My friend helped me directly. Here is a way that works:
u.varpred <- foreach(i = 1:length(grid.c[, 1]), .packages = 'geoR', .combine = "c") %dopar% {
  list.dataframes[[i]] <- rbind(datap$coords, grid.c[i, ])
  list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]], valores)))
  list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc = grid.o,
                                   krige = krige.control(type.krige = "SK", trend.d = "cte",
                                                         beta = beta1, cov.pars = emv$cov.pars))
  u.varpred <- mean(krieging$krige.var - list.krigings[[i]]$krige.var)
  list.dataframes[[i]] <- 0
  list.krigings[[i]] <- 0
  list.geodatas[[i]] <- 0
  u.varpred # this makes the results go into u.varpred
}
He gave me an example of why this works:
a <- NULL
foreach(i = 1:10) %dopar% {
  a <- 5
}
print(a)
# a is still NULL

a <- NULL
a <- foreach(i = 1:10) %dopar% {
  a <- 5
  a
}
print(a)
# now it works
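Building on that idea, the list bookkeeping isn't really needed once each iteration returns its value directly. A leaner sketch of the same loop (using the same objects defined above):
u.varpred <- foreach(i = 1:nrow(grid.c), .packages = 'geoR', .combine = "c") %dopar% {
  df.i      <- rbind(datap$coords, grid.c[i, ])
  geodata.i <- as.geodata(data.frame(cbind(df.i, valores)))
  kriging.i <- krige.conv(geodata.i, loc = grid.o,
                          krige = krige.control(type.krige = "SK", trend.d = "cte",
                                                beta = beta1, cov.pars = emv$cov.pars))
  # the last expression is the value returned for this iteration
  mean(krieging$krige.var - kriging.i$krige.var)
}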
Hope this helps someone.

nlsBoot and foreach %dopar%: scoping issues

I would like to bootstrap residuals for nls fits in a loop. I use nlsBoot, and in order to decrease computation time I would like to do this in parallel (on a Windows 7 system at the moment). Here is some code which reproduces my problem:
# function for fitting
Falge2000 <- function(GP2000, alpha, PAR) {
  (GP2000 * alpha * PAR) / (GP2000 + alpha * PAR - GP2000 / 2000 * PAR)
}

# some data
PAR <- 10:1600
GPP <- Falge2000(-450, -0.73, PAR) + rnorm(length(PAR), sd = 0.0001)
df1 <- data.frame(PAR, GPP)

# nls fit
mod <- nls(GPP ~ Falge2000(GP2000, alpha, PAR), start = list(GP2000 = -450, alpha = -0.73), data = df1, upper = c(0, 0), algorithm = "port")

# bootstrap of residuals
library(nlstools)
summary(nlsBoot(mod, niter = 5))
# works

# now do it several times, and in parallel
library(foreach)
library(doParallel)
cl <- makeCluster(1)
registerDoParallel(cl)

ttt <- foreach(1:5, .packages = 'nlstools', .export = "df1") %dopar% {
  res <- nlsBoot(mod, niter = 5)
  summary(res)
}
# Error in { :
#   task 1 failed - "Procedure aborted: the fit only converged in 1 % during bootstrapping"

stopCluster(cl)
I suspect this is an issue with environments, and after looking at the code of nlsBoot, the problem seems to arise from the use of an anonymous function in an lapply call:
l1 <- lapply(1:niter, function(i) {
  data2[, var1] <- fitted1 + sample(scale(resid1, scale = FALSE),
                                    replace = TRUE)
  nls2 <- try(update(nls, start = as.list(coef(nls)), data = data2),
              silent = TRUE)
  if (inherits(nls2, "nls"))
    return(list(coef = coef(nls2), rse = summary(nls2)$sigma))
})
if (sum(sapply(l1, is.null)) > niter/2)
  stop(paste("Procedure aborted: the fit only converged in",
             round(sum(sapply(l1, is.null))/niter), "% during bootstrapping"))
Is there a way to use nlsBoot in a parallel loop? Or do I need to modify the function? (I could try to use a for loop instead of lapply.)
By moving the creation of the mod object into the %dopar% loop, everything seems to work. This also automatically exports the df1 object, so you can remove the .export argument.
ttt <- foreach(1:5, .packages = 'nlstools') %dopar% {
  mod <- nls(GPP ~ Falge2000(GP2000, alpha, PAR), start = list(GP2000 = -450, alpha = -0.73), data = df1, upper = c(0, 0), algorithm = "port")
  res <- nlsBoot(mod, niter = 5)
  capture.output(summary(res))
}
However, you might need to work out what you want returned. Using capture.output was just a way to see if things were working, since summary(res) seemed to only return NULL.
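For instance, instead of capturing printed output, you could return the components you actually need from each nlsBoot object (a sketch; coefboot and bootCI are components of the nlsBoot return value as documented in nlstools):
ttt <- foreach(1:5, .packages = 'nlstools') %dopar% {
  mod <- nls(GPP ~ Falge2000(GP2000, alpha, PAR),
             start = list(GP2000 = -450, alpha = -0.73),
             data = df1, upper = c(0, 0), algorithm = "port")
  res <- nlsBoot(mod, niter = 5)
  # return the bootstrapped coefficients and their confidence intervals
  list(coef = res$coefboot, ci = res$bootCI)
}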
