I am using the foreach and parallel libraries to perform a parallel computation, but for some reason it only uses 1 CPU at a time while running (I checked with 'top' in a Bash terminal on Linux).
The server has 48 cores, and I've tried:
Using 24, 12 or 5 cores
Example code (such as the one below)
Running on Windows, where the worker tasks appear but do not use any CPU
list.of.packages <- c("foreach", "doParallel")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if (length(new.packages)) install.packages(new.packages)
library(foreach)
library(doParallel)
no_cores <- detectCores() / 2 # 24 cores
cl<-makeCluster(no_cores)
registerDoParallel(cl)
df.a = data.frame(str = cbind(paste('name',seq(1:60000))), int = rnorm(60000))
df.b = data.frame(str = sample(df.a[, 1]))
df.b$int = NA
foreach(row.a = 1:length(df.a$str),
        .combine = rbind,
        .verbose = T) %dopar% {
  row.b = grep(pattern = df.a$str[row.a], x = df.b$str)
  df.b$int[row.b] = df.a$int[row.a]
  df.b
}
stopCluster(cl)
I expect this code to use several CPUs (as many as defined), but it actually uses 1.
Run foreach(..., .verbose = TRUE) to understand what is going on. I have slightly changed the code to make it easier to see when the parallel part is running.
library(foreach)
library(doParallel)
no_cores <- detectCores() / 2 # 24 cores
cl<-makeCluster(no_cores)
registerDoParallel(cl)
base = 2
out <- foreach(exponent = 2:400,
.combine = sum, .verbose = TRUE) %dopar%
runif(1000000)
First segment:
# discovered package(s):
# no variables are automatically exported
# explicitly exporting package(s):
# numValues: 3999, numResults: 0, stopped: TRUE
This setup phase is not parallel: it is the master process setting up the workers. It takes a very long time with 2:40000000, which may be the point at which you are stopping, and during it you would only see one core in use.
# numValues: 79, numResults: 0, stopped: TRUE
# got results for task n
The computation that happens while you wait for this to be printed should run in parallel. On Windows I see 4 cores working on the runif calls.
# calling combine function
# evaluating call object to combine results:
# fun(accum, result.n)
This is printed once per task, with a different value for n each time. It is your combine function, and it is not parallel either.
I think your code is getting hung up on the setup piece, and you are only observing the serial part of the operation. If not, I would run it with .verbose = TRUE and look for more clues.
I don't know how your main problem is set up, but your example is not a good example of how to set up parallelization: you are creating a huge number of very small tasks, so the serial overhead per task is very high. You will see improved performance if you can send larger pieces to each worker.
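To illustrate "larger pieces", here is a minimal sketch that uses only the base parallel package rather than foreach. The names draw_sum, n_tasks and n_workers are made up for the illustration, and 2 workers is just an assumption so that it runs on a small machine. Instead of sending 400 tiny tasks one by one, each worker receives a single chunk of task indices and loops over it locally.

```r
library(parallel)

# Hypothetical small task: sum n_draws uniform random numbers.
draw_sum <- function(n_draws) sum(runif(n_draws))

n_tasks   <- 400
n_workers <- 2   # assumption: kept small so the sketch runs anywhere

cl <- makeCluster(n_workers)

# One chunk of task indices per worker instead of one message per task.
chunks <- splitIndices(n_tasks, n_workers)

# clusterApply sends one element of `chunks` per call, so only
# n_workers messages are exchanged in total.
partial_sums <- clusterApply(cl, chunks, function(idx, f, n) {
  sum(vapply(idx, function(i) f(n), numeric(1)))
}, f = draw_sum, n = 1000)

stopCluster(cl)

# The serial combine step now only touches n_workers values.
total <- sum(unlist(partial_sums))
```

The same idea carries over to foreach: iterate over chunks and let the body of the loop handle all the tasks in a chunk.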
Related
I am imputing missing values with missRanger and it takes too long, as I have 1000 variables. I tried to use parallel computing, but it does not make the process faster. Here is the code:
library(doParallel)
cores=detectCores()
cl <- makeCluster(cores[1]-1)
registerDoParallel(cl)
library(missRanger)
train[1:lengthvar] <- missRanger(train[1:lengthvar], pmm.k = 3, num.trees = 100)
stopCluster(cl)
I am not sure what to add to this code to make it work.
missRanger is based on ranger, a parallelized random forest implementation in R. Thus, the code is already running on all cores, and adding something like doParallel just makes the code clumsier.
Try to speed up the calculations by passing relevant arguments to ranger via the ... argument of missRanger, e.g.
num.trees = 20 or
max.depth = 8
instead.
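As a rough sketch of what that could look like (using iris with artificially generated missing values as stand-in data, since the real train data is not available here), the arguments are simply added to the missRanger call. The requireNamespace guard is only there so the snippet does nothing when missRanger is not installed.

```r
imputed <- NULL  # stays NULL if missRanger is not installed

if (requireNamespace("missRanger", quietly = TRUE)) {
  # Toy data: iris with 10% of the values set to NA.
  iris_na <- missRanger::generateNA(iris, p = 0.1, seed = 1)

  # Fewer and shallower trees than in the question's call (num.trees = 100),
  # so each random forest fit is cheaper.
  imputed <- missRanger::missRanger(
    iris_na,
    pmm.k     = 3,
    num.trees = 20,
    max.depth = 8,
    verbose   = 0,
    seed      = 1
  )
}
```

Whether the imputation quality is still acceptable with such small forests depends on the data, so it is worth checking on a subset first.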
Disclaimer: I am the author of missRanger.
This is a basic example of spreading the work over multiple cores. It is meant to highlight the concept rather than the timing issue: in my test runs (with a larger number of columns), the non-parallel version was faster.
library(doParallel)
library(missRanger)
library(data.table) #Needed for rbindlist at the end
cores=detectCores()
cl <- makeCluster(cores[1])
registerDoParallel(cl)
clusterEvalQ(cl, {library(missRanger)}) #Passing the package missRanger to all the cores
#Create some random columns
A=as.numeric(c(1,2,"",4,5,6,7,8,9,10,11,12,13,"",15,16,17,18,19,20))
B=as.numeric(c(120.5,128.1,126.5,122.5,127.1,129.7,124.2,123.7,"",122.3,120.9,122.4,125.7,"",128.2,129.1,121.2,128.4,127.6,125.1))
m = as.data.frame(matrix(0, ncol = 10, nrow = 20))
m[,1:5]=A
m[,6:10]=B
list_num=as.data.frame(seq(1,10,by=1)) #A sequence of column numbers for the different cores to run the function for
#Note that the optimal process would have been to take columns 1:3
#and run it on one core, 4:6 to run it on the 2nd core and so on.
#Function to run on the parallel cores
zzz=function(list_num){
  m_new=m[,list_num] #Note the function takes the column number as an argument
  m_new=missRanger(m_new[1:length(m_new)], pmm.k = 3, num.trees = 100)
}
clusterExport(cl=cl, list("m"),envir=environment()) #Export your list
zz=parLapply(cl=cl,fun=zzz,X=list_num) #Pass the function and the list of numbers here
zzzz=data.frame(rbindlist(zz)) #rbind the
stopCluster(cl)
I am running a function that uses biganalytics::bigkmeans and xgboost (through caret). Both support parallel processing if a backend is registered first with registerDoMC(cores = 4). However, to use the power of the 64-core machine I have access to without adding too much parallel overhead, I want to run the following function in 16 instances (a total of 64 processes).
example = function (x) {
  biganalytics::bigkmeans(matrix(rnorm(10*5,1000,1),ncol=500))
  mod <- train(Class ~ ., data = df ,
               method = "xgbTree", tuneLength = 50,
               trControl = trainControl(search = "random"))
}
set.seed(1)
dat1 <- twoClassSim(1000)
dat2 <- twoClassSim(1001)
dat3 <- twoClassSim(1002)
dat4 <- twoClassSim(1003)
list <- list(dat1, dat2, dat3, dat4)
mclapply(list, example, mc.cores = 16)
It is important that I stick to mclapply because I need a shared-memory parallel backend so that I don't run out of RAM in my actual use, with data sets over 50 GB.
My question is, where would I do registerDoMC in this case?
Thanks!
Nested parallelism often isn't a good idea, but it can be if the outer loop has many fewer iterations than there are cores.
You can load doMC and call registerDoMC inside the function that mclapply runs, to prepare the workers to call train. But note that it doesn't make sense to call mclapply with more workers than tasks, since some of the workers then won't have any work to do.
You could do something like this:
example <- function (dat, nw) {
  library(doMC)
  registerDoMC(nw)
  # call train function on dat...
}
# This assumes that length(datlist) is much less than ncores
ncores <- 64
m <- length(datlist)
nw <- ncores %/% m
mclapply(datlist, example, nw, mc.cores=m)
If length(datlist) is 4, then each "train" task will use 16 workers. You can certainly use fewer workers per "train" task, but you probably shouldn't use more.
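The split between outer and inner workers is plain integer arithmetic. As a small sketch (worker_budget is a hypothetical helper, not part of parallel or doMC), it can be wrapped up so that it also behaves sensibly when there are more tasks than cores:

```r
# Split a fixed core budget between the outer tasks and the
# workers that each task registers for itself.
worker_budget <- function(ncores, ntasks) {
  outer <- min(ntasks, ncores)         # never more outer workers than tasks
  inner <- max(1L, ncores %/% outer)   # cores left for each task's own backend
  list(outer = outer, inner = inner)
}

budget <- worker_budget(ncores = 64, ntasks = 4)
budget$outer  # 4
budget$inner  # 16
```

With that, the call above would read mclapply(datlist, example, budget$inner, mc.cores = budget$outer).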
In a different post here I asked for help on parallel processing a call to h2o.gbm inside a foreach loop.
Following the answers provided, I run a script similar to this example:
library(h2o)
data(iris)
data <- as.h2o(iris)
ss <- h2o.splitFrame(data)
gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = ss[[1]])
h2o.saveModel(path="some path")
h2o.shutdown(prompt = FALSE)
library(foreach)
library(doParallel)
#setup parallel backend to use 12 processors
cl <- makeCluster(12)
registerDoParallel(cl)
#loop
df4 <- foreach(i = seq(20), .combine=rbind) %dopar% {
  library(h2o)
  port <- 54321 + 3*i
  print(paste0("http://localhost:", port))
  h2o.init(nthreads = 1, max_mem_size = "10G", port = port) #my local machine runs 128GB
  df4 <- data.frame()
  gbm <- h2o.loadModel(path="some path")
  df4 <- as.data.frame(h2o.predict(gbm, ss[[2]]))[,1]
}
It runs really well on a small sample of my real data (at least 50% faster than sequential).
But when I run this on all of my data, I get the following error code after 45 minutes:
Error in { : task 2 failed - "
ERROR MESSAGE:
DistributedException from localhost/127.0.0.1:60984, caused by
java.lang.IllegalStateException: Unable to clean up RollupStats after an
exception (see cause). This could cause a key leakage, key=$05ff14000000feffffff$_b66dbd609dc068f0137cc88cb42a
"
I am not sure what causes this error code. I guess it has to do with a memory issue because this code will take up 85-95% of my RAM (128GB) and 100% of my CPU (12 threads).
Does anyone have any ideas on how to work around this?
For those who are interested, I found the reason for this error. It is actually really simple.
Using makeCluster(12), I request 12 threads on my CPU.
Later on, in the foreach call, I make an h2o.init call in which I request yet another thread.
Since my machine only has 12 threads, this request for an additional (13th) thread can't be processed correctly.
I fixed this by assigning 6 threads to the cluster. This leaves me 6 threads to make six individual calls to h2o.init (one in each foreach call).
This works great.
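The arithmetic behind that fix can be written down as a small sketch. It assumes that each foreach worker is one R process that starts one H2O instance with nthreads threads; cluster_size is a hypothetical helper, not something provided by h2o or doParallel.

```r
# Each cluster worker costs one thread for the R process itself plus
# `h2o_threads` threads for the H2O instance it starts.
cluster_size <- function(total_threads, h2o_threads = 1L) {
  max(1L, total_threads %/% (1L + h2o_threads))
}

n_workers <- cluster_size(12)   # 12 threads, nthreads = 1 per H2O instance
n_workers                       # 6
```

The result would then be passed to makeCluster(n_workers) instead of the hard-coded 12.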
I tried using the snowfall package for parallel execution in R, using all my CPU cores. This is the code I used for testing:
library(snow)
library(snowfall)
sfInit(parallel = TRUE, cpus = 16, type = "SOCK")
data <- array(1:1000000, dim=c(1000000,1))
system.time(x <- sfLapply(data, fun=function(x){return (x*x) }))
This effectively runs 16 times faster, as it uses all available CPU cores.
But when I try this:
system.time(m2 <- J48(CHURNED_F~., data = data[, -c(1)]))
It takes about 50 seconds as a test (with only about 1% of the whole data set).
The following runs correctly but takes the same time and only uses one CPU:
library(snow)
library(snowfall)
sfInit(parallel = TRUE, cpus = 16, type = "SOCK")
system.time(m2 <- sfLapply("CHURNED_F~.", J48, data[, -c(1)]))
Am I just using the wrong syntax? How can I make this run in parallel?
I'm familiar with foreach, %dopar% and the like. I am also familiar with the parallel option for cv.glmnet. But how do you set up nested parallelisation as below?
library(glmnet)
library(foreach)
library(parallel)
library(doSNOW)
Npar <- 1000
Nobs <- 200
Xdat <- matrix(rnorm(Nobs * Npar), ncol = Npar)
Xclass <- rep(1:2, each = Nobs/2)
Ydat <- rnorm(Nobs)
Parallel cross-validation:
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
system.time(mods <- foreach(x = 1:2, .packages = "glmnet") %dopar% {
  idx <- Xclass == x
  cv.glmnet(Xdat[idx,], Ydat[idx], nfolds = 4, parallel = TRUE)
})
stopCluster(cl)
Not parallel cross-validation:
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
system.time(mods <- foreach(x = 1:2, .packages = "glmnet") %dopar% {
  idx <- Xclass == x
  cv.glmnet(Xdat[idx,], Ydat[idx], nfolds = 4, parallel = FALSE)
})
stopCluster(cl)
For the two system times I am only getting a very marginal difference.
Is the parallelisation taken care of? Or do I need to use the nested operator explicitly?
Side-question: If 8 cores are available in a cluster object and the foreach loop contains two tasks, will each task be given 1 core (and the other 6 cores left idle) or will each task be given four cores (using up all 8 cores in total)? What's the way to query how many cores are being used at a given time?
In your parallel cross-validation example, cv.glmnet itself will not run in parallel because there is no foreach parallel backend registered in the cluster workers. The outer foreach loop will run in parallel, but not the foreach loop in the cv.glmnet function.
To use doSNOW for the outer and inner foreach loops, you could initialize the snow cluster workers using clusterCall:
cl <- makeCluster(2, type = "SOCK")
clusterCall(cl, function() {
  library(doSNOW)
  registerDoSNOW(makeCluster(2, type = "SOCK"))
  NULL
})
registerDoSNOW(cl)
This registers doSNOW for both the master and the workers so that each call to cv.glmnet will execute on a two-worker cluster when parallel=TRUE is specified.
The trick with nested parallelism is to avoid creating too many processes and oversubscribing the CPU (or CPUs), so you need to be careful when registering the parallel backends. My example makes sense for a CPU with four cores even though a total of six workers are created, since the "outer" workers don't do much while the inner foreach loops execute. It is common when running on a cluster to use doSNOW to start one worker per node, and then use doMC to start one worker per core on each of those nodes.
Note that your example doesn't use much compute time, so it's not really worthwhile to use two levels of parallelism. I would use a much bigger problem in order to determine the benefits of the different approaches.
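On the side question of checking what is registered: one way to verify the worker-side registration is to ask each outer worker how many foreach workers it sees. The sketch below swaps in doParallel for doSNOW (an assumption made purely for illustration) and is skipped if foreach or doParallel is not installed. Note that foreach::getDoParWorkers() reports the number of registered workers, not the number of cores busy at a given moment; for that, a system tool such as top is still needed.

```r
library(parallel)

inner_workers <- NULL  # stays NULL if foreach/doParallel are not installed

if (requireNamespace("foreach", quietly = TRUE) &&
    requireNamespace("doParallel", quietly = TRUE)) {
  outer_cl <- makeCluster(2)

  # On each outer worker: register a 2-worker inner backend,
  # then report how many workers foreach sees there.
  inner_workers <- clusterCall(outer_cl, function() {
    inner_cl <- parallel::makeCluster(2)
    doParallel::registerDoParallel(inner_cl)
    n <- foreach::getDoParWorkers()
    parallel::stopCluster(inner_cl)
    n
  })

  stopCluster(outer_cl)
}
```

If the inner backend had not been registered on the workers, getDoParWorkers() would report 1 there, which is the situation in the original parallel = TRUE example.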