Although I modeled on a 40-core computer, I only used a few of them when modeling with Keras. I would like to know how to call all CPU cores for computation. I tried to implement parallel computing in the following way but failed. Scholars, please enlighten me on other methods.
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
history <- model %>% fit(xtrain, ytrain,
epochs = 200, batch_size=100, verbose = 1)
Tensorflow/Keras takes care of parallelism in fit(), and it generally won't work if you manually try to fork the parent R process or manage a PSOCK cluster. The {parallel} package that comes with R is not compatible with Tensorflow/Keras.
If it looks like Tensorflow/Keras is not using all your CPU cores with the default settings, you can adjust the thread pool size here: https://www.tensorflow.org/api_docs/python/tf/config/threading (but, in my experience, it's more likely that you're IO-limited, or the CPU is waiting on the GPU, and probably not that the thread pool size is too small).
If you're interested in distributed computing with Tensorflow, here is a good place to get started: https://www.tensorflow.org/api_docs/python/tf/distribute
Related
I need to do a model estimation using the MCMC method on a SLURM cluster (the system is CentOS). The estimation takes a very long time to finish.
Within each MCMC interaction, there is one step taking a particularly long time. Since this step is doing a lapply-loop (around 100000 loops, 30s to finish all loops), so as far as I understand, I should be able to use parallel computing to speed up.
I tried several packages (doMC, doParallel, doSNOW) together with the foreach framework. The setup is
parallel_cores=8
#doParallel
library(doParallel)
cl<-makeCluster(parallel_cores)
registerDoParallel(cl)
#doMC
library(doMC)
registerDoMC(parallel_cores)
#doSNOW, this is also fast
library(doSNOW)
ml<-makeCluster( parallel_cores)
registerDoSNOW(cl)
#foreach framework
#data is a list
data2=foreach(
data_i=
data,
.packages=c("somePackage")
) %dopar% {
data_i=some_operation(data_i)
list(beta=data_i$beta,sigma=data_i$sigma)
}
Using a doMC, the time for this step can be reduced to about 9s. However, as doMC is using shared memory and I have a large array to store estimate results, I quickly ran out of memory (i.e. slurmstepd: error: Exceeded job memory limit).
Using doParallel and doSNOW, the time for this step even increased, to about 120s, which sounds ridiculous. The mysterious thing is that when I tested the code in both my Mac and Windows machines, doParallel and doSNOW actually gave similar speed compared to doMC.
I'm stuck and not sure how to proceed. Any suggestions will be greatly appreciated!
I am trying to train a list of R caret models on Google Cloud Compute Engine (Ubuntu LTS16.04). The xgboost (both xgblinear and xgbtree) model took forever to complete the training. In fact, the CPU utilization is always 0 from GCP status monitoring.
I used doMC library for parallel execution. It works very well for models like C5.0, glmnet and gbm. However, for xgboost (both xgblinear and xgbtree),due to some reason, the CPU seems not running because the utilization remains 0. Troubleshooting:
1. Removed the doMC and run with single core only, same problem remained.
2. Changed the parallel execution library to doParallel instead of doMC. This round the CPU utilization went up, but it took 5 mins to complete the training on GCP. The same codes finished in just 12 seconds on my local laptop. (I ran 24 CPUs on GCP, and 4 CPUs on my local laptop)
3. The doMC parallel execution works well for other algorithm. Only xgboost has this problem.
Code:
xgblinear_Grid <- expand.grid(nrounds = c(50, 100),
lambda = c(.05,.5),
alpha = c(.5),
eta = c(.3))
registerDoMC(cores = mc - 1)
set.seed(123)
xgbLinear_varimp <- train(formula2, data=train_data, method="xgbLinear", metric=metric, tuneGrid = xgblinear_Grid, trControl=fitControl, preProcess = c("center", "scale", "zv"))
print(xgbLinear_varimp)
No error message generated. It simply runs endlessly.R sessionInfo
I encountered the same problem, and it took a long time to understand the three reasons behind it:
xgbLinear requires more memory than any other machine learning algorithm available in the caret library. For every core, you can assume at least 1GB RAM even for tiny datasets of only 1000 x 20 dimension, for bigger datasets more.
xgbLinear in combination with parallel execution has a final process that recollects the data from the threads. This process is usually responsible for the 'endless' execution time. Again, the RAM is the limiting factor. You might have seen the following error message that which is often caused by to little allocation of RAM:
Error in unserialize(socklist[[n]]) : error reading from connection
xgbLinear has its own parallel processing algorithm which gets mixed up with the doParallel algorithm. Here, the effective solution is to set xgbLinear to single-thread by an additional parameter in caret::train() - nthread = 1 - and let doParallel do the parallelization
As illustration for (1), you can see here that the memory utilization nears 80 GB:
and 235GB for a training a still tiny dataset of 2500x14 dimensionality:
As illustration for (2), you can see here that this is the process that takes forever if you don't have enough memory:
Is there any way to unlimit the CPU usage so my PC puts more effort in finishing a task for rapidly? At the moment the k-means algorithm is estimated to finish in about 10 days, which is something I would like to reduce.
R is single-threaded by default, and runs only on a single thread on the CPU, which is a pity if you have a machine with 16 or 32 cores. By unlimiting the CPU usage, I have to assume you're asking if there's any way to have an R process (let's say part of the k-means algorithm) take advantage of your full CPU power by running the process in-parallel.
Many R packages and processes are not going to be helped by parallel processing though. So the technical solution to your particular problem goes down to the package implementation you're using. Popular packages like caret do support parallelization when that's possible, even though you may need to add an additional allowParallel=T parameter. They work in conjunction with a library such as doMC to allow multi-core processes. In the following sample code, I have my machine use 8 cores through the registerDoMC(8) function, and then set allowParallel=T.
library(doMC)
registerDoMC(8)
system.time({
ctrl_2 <- trainControl(method="cv", number=3, allowParallel=T)
fb_forest_2 <- train(classe ~ ., data=fb_train, method="rf", trControl = ctrl_2)
})
Again, parallel processing doesn't always help - Not all process can be parallelized! The documentation for foreach are a great read so if you can afford the time take a look at it. The specific code solution for your problem also depend on the library implementation you're using.
I have a large n (>1,000,000) dataset with a small number of features to estimate (regression) random forest and have been looking to implement Rborist (in R). I'd like to parallelize my work, but am not finding much guidance on how that would be done. I have 16 processors to use on the machine where it's running. When I use doParallel with the randomForest package, for example, the command:
rf <- foreach(ntree=rep(32, 16), .combine=combine, .packages='randomForest') %dopar% randomForest(x, y, nodesize = 25, ntree=ntree)
It launches 16 R processes, and works slowly as randomForest does, but works.
The analogous command for Rborist:
rb <- foreach(ntree=rep(32, 16), .combine=combine, .packages='Rborist') %dopar% Rborist(x, y, minNode = 25, ntree=ntree)
Throws the error:
error calling combine function:
Warning message: In mclapply(argsList, FUN, mc.preschedule =
preschedule, mc.set.seed = set.seed, : all scheduled cores
encountered errors in user code
Does anyone know how to parallelize with Rborist? It does not appear to be happening under the hood as it's only using 1 cpu when I run:
rb <- Rborist(x, y, minNode = 25, ntree = 512)
Rborist runs in parallel by itself. It uses all my threads on my machine (win 10 64bit). But then I didn't load doParallel / foreach first.
Same goes for the ranger package, but in ranger you can set the number of threads to use.
Speedy implementations of a rf are of the top of my head:
Rborist (large n, low p)
ranger (handles large p, modest n)
random forest.ddr (haven't tested)
distributed random forest in H2O. very fast, but makes use of
stopping criteria.
Rborist currently uses all available cores. Would it be useful to offer a way to tune this?
Have you tried the latest version on CRAN, 0.1-3? This contains a change to the default minimum node size for regression, improving accuracy in some cases.
We've been making some strides toward improving performance with modest (as opposed to small) predictor count. This should also be reflected by changes in the latest release.
Large running memory footprint is probably a consequence of the breadth-first splitting approach. One way to conserve memory is to carve the problem into chunks, but we haven't gotten there yet.
Large final memory size is chiefly due to caching leaf information for subsequent use by other packages or for quantile regression. Perhaps we should add a "noLeaf" option for users who are not interested in either of those options.
I'm using R's caret package to do modeling for Coursera class on machine learning.
I'm currently building Random Forest with 500 trees on a data set of 11k observations and 40 features.
It took about 3 hours for single core implementation to compute results and I'm experimenting with multi-core implementation right now (code below)
library(parallel)
library(caret)
library(doParallel)
library(foreach)
cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)
trCtrl <- trainControl(allowParallel = TRUE)
modFit2 <- train(classe~ ., data=training, trControl = trCtrl, method="parRF", prox=TRUE, ntree = 500)
Now my question is this: Is there a way to view progress on build model during run-time? Is there a package/implementation of parallelized RF that outputs for example progress on number of trees built as it run?
Obvious question is: why do I need to know? Cant I just wait this hour or two for results? It wont be faster but might be slower that way!
I have a lot of models to build for my class and I dont want to spend few hours on each model and wonder if it is running or not. I want to confirm that it is building trees, stop execution and schedule it for the night when I will run full models. I will be running different configurations of parameters for RF and also some other time intensive models so I would rather spend my day-time on writing code while leave my computer on the mercy of running computation full speed when I'm sleeping (my browser is barely working right now :P as both my RAM and CPU are almost at 100%)
You could use getModelInfo to add cat statements to the fit function. Also, there is a verboseIter option in trainControl that you are ignoring here.
Probably the problem is that you are using trainControl(allowParallel = TRUE). This is going to try to fit the resampling iterations across different cores and using method="parRF" fits each of those in parallel.
If you specify 4 cores on your machine, you have probably spawned 16 workers. You are probably better off using method = "rf" and trainControl(allowParallel = TRUE). That might also mean that you have 17 copies of the data in memory.