I have a large-n (>1,000,000) dataset with a small number of features, on which I want to estimate a (regression) random forest, and I have been looking at the Rborist package (in R). I'd like to parallelize the work, but I'm not finding much guidance on how that would be done. I have 16 processors available on the machine where it runs. When I use doParallel with the randomForest package, for example, the command:
rf <- foreach(ntree=rep(32, 16), .combine=combine, .packages='randomForest') %dopar% randomForest(x, y, nodesize = 25, ntree=ntree)
launches 16 R processes and, while slow (as randomForest tends to be), it works.
The analogous command for Rborist:
rb <- foreach(ntree=rep(32, 16), .combine=combine, .packages='Rborist') %dopar% Rborist(x, y, minNode = 25, ntree=ntree)
Throws the error:
error calling combine function:
Warning message: In mclapply(argsList, FUN, mc.preschedule =
preschedule, mc.set.seed = set.seed, : all scheduled cores
encountered errors in user code
Does anyone know how to parallelize with Rborist? Parallelization does not appear to be happening under the hood, as only 1 CPU is in use when I run:
rb <- Rborist(x, y, minNode = 25, ntree = 512)
Rborist runs in parallel by itself. It uses all the threads on my machine (Windows 10, 64-bit). But then, I didn't load doParallel / foreach first.
The same goes for the ranger package, except that in ranger you can set the number of threads to use (a quick sketch follows the list below).
Off the top of my head, speedy implementations of a random forest are:
Rborist (large n, low p)
ranger (handles large p, modest n)
randomForest.ddR (haven't tested)
distributed random forest in H2O: very fast, but it makes use of stopping criteria.
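For instance, ranger exposes the thread count directly. A minimal sketch, assuming the x and y objects from the question can be combined into a single data frame (num.trees, min.node.size and num.threads are ranger's counterparts to ntree, nodesize and the worker count):
library(ranger)
dat <- data.frame(y = y, x)   # hypothetical data frame built from the question's x and y
rf_ranger <- ranger(y ~ ., data = dat,
                    num.trees = 512,
                    min.node.size = 25,   # analogous to nodesize / minNode above
                    num.threads = 16)     # cap the number of threads explicitly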
Rborist currently uses all available cores. Would it be useful to offer a way to tune this?
Have you tried the latest version on CRAN, 0.1-3? This contains a change to the default minimum node size for regression, improving accuracy in some cases.
We've been making some strides toward improving performance with modest (as opposed to small) predictor count. This should also be reflected by changes in the latest release.
Large running memory footprint is probably a consequence of the breadth-first splitting approach. One way to conserve memory is to carve the problem into chunks, but we haven't gotten there yet.
Large final memory size is chiefly due to caching leaf information for subsequent use by other packages or for quantile regression. Perhaps we should add a "noLeaf" option for users who are not interested in either of those options.
Related
Although I am working on a 40-core computer, Keras only uses a few of the cores when I fit a model. I would like to know how to use all CPU cores for the computation. I tried to implement parallel computing in the following way, but it failed. Could someone point me to another approach?
library(keras)        # 'model' is assumed to be a compiled Keras model
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
history <- model %>% fit(xtrain, ytrain,
                         epochs = 200, batch_size = 100, verbose = 1)
Tensorflow/Keras takes care of parallelism in fit(), and it generally won't work if you manually try to fork the parent R process or manage a PSOCK cluster. The {parallel} package that comes with R is not compatible with Tensorflow/Keras.
If it looks like Tensorflow/Keras is not using all your CPU cores with the default settings, you can adjust the thread pool size here: https://www.tensorflow.org/api_docs/python/tf/config/threading (but, in my experience, it's more likely that you're IO-limited, or the CPU is waiting on the GPU, and probably not that the thread pool size is too small).
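From R, those thread-pool settings are reachable through the tensorflow package. A minimal sketch for TF 2.x (call it before building any model; the thread counts here are only illustrative):
library(tensorflow)
# Threads used inside a single op (e.g. a matrix multiply) and threads used to
# run independent ops concurrently; 0L lets TensorFlow choose automatically.
tf$config$threading$set_intra_op_parallelism_threads(40L)
tf$config$threading$set_inter_op_parallelism_threads(2L)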
If you're interested in distributed computing with Tensorflow, here is a good place to get started: https://www.tensorflow.org/api_docs/python/tf/distribute
I am trying to train a list of R caret models on Google Cloud Compute Engine (Ubuntu 16.04 LTS). The xgboost models (both xgbLinear and xgbTree) take forever to complete training. In fact, CPU utilization is always 0 according to the GCP status monitoring.
I used the doMC library for parallel execution. It works very well for models like C5.0, glmnet and gbm. However, for xgboost (both xgbLinear and xgbTree), for some reason the CPU does not seem to be running, because utilization stays at 0. Troubleshooting:
1. Removed doMC and ran with a single core only; the same problem remained.
2. Changed the parallel execution library to doParallel instead of doMC. This time the CPU utilization went up, but training took 5 minutes to complete on GCP. The same code finished in just 12 seconds on my local laptop. (I used 24 CPUs on GCP and 4 CPUs on my local laptop.)
3. The doMC parallel execution works well for the other algorithms. Only xgboost has this problem.
Code:
xgblinear_Grid <- expand.grid(nrounds = c(50, 100),
                              lambda  = c(.05, .5),
                              alpha   = c(.5),
                              eta     = c(.3))
registerDoMC(cores = mc - 1)
set.seed(123)
xgbLinear_varimp <- train(formula2, data = train_data, method = "xgbLinear",
                          metric = metric, tuneGrid = xgblinear_Grid,
                          trControl = fitControl,
                          preProcess = c("center", "scale", "zv"))
print(xgbLinear_varimp)
No error message is generated; it simply runs endlessly.
I encountered the same problem, and it took a long time to understand the three reasons behind it:
xgbLinear requires more memory than any other machine learning algorithm available in the caret library. For every core, assume at least 1 GB of RAM even for tiny datasets of only 1000 x 20 dimensions; bigger datasets need more.
xgbLinear in combination with parallel execution has a final step that collects the data back from the threads. This step is usually responsible for the 'endless' execution time. Again, RAM is the limiting factor. You might have seen the following error message, which is often caused by allocating too little RAM:
Error in unserialize(socklist[[n]]) : error reading from connection
xgbLinear has its own parallel processing algorithm, which gets mixed up with the doParallel algorithm. Here, the effective solution is to set xgbLinear to single-threaded via an additional parameter in caret::train() - nthread = 1 - and let doParallel do the parallelization (a sketch follows the illustrations below).
As an illustration of (1): in one run the memory utilization neared 80 GB, and it reached 235 GB when training a still-tiny dataset of 2500 x 14 dimensionality (screenshots not reproduced here).
As an illustration of (2): the final data-collection step is the process that takes forever if you don't have enough memory.
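A sketch of the setup described in (3), reusing the objects from the question (formula2, train_data, xgblinear_Grid, fitControl); xgboost is pinned to a single thread via nthread = 1 so that only the doParallel workers parallelize the resampling:
library(caret)
library(doParallel)
cl <- makeCluster(4)            # pick a worker count your RAM can support
registerDoParallel(cl)
set.seed(123)
xgbLinear_fit <- train(formula2, data = train_data,
                       method    = "xgbLinear",
                       tuneGrid  = xgblinear_Grid,
                       trControl = fitControl,
                       nthread   = 1)   # passed through to xgboost: one thread per worker
stopCluster(cl)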
Common sense says that a computation should get faster the more cores or threads we use; at worst, if the scaling is bad, the computation time simply will not improve with an increasing number of threads. So how come increasing the number of threads considerably increases the computation time when fitting a GAM with the R package mgcv, as shown by this example?
library(mgcv)
library(boot)  # loads the "amis" data
t1 <- Sys.time()
mod <- gam(speed ~ s(period, warning, pair, k = 12),
           data = amis, family = tw(link = log), method = "REML",
           control = list(nthreads = 1))
t2 <- Sys.time()
print("Model fitted in:")
print(t2 - t1)
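To reproduce the timings below without editing the call by hand, the same fit can be looped over different thread counts (a small sketch using the model above):
# Refit the identical model with 1, 2 and 3 threads and record the elapsed time.
timings <- sapply(1:3, function(nt) {
  system.time(
    gam(speed ~ s(period, warning, pair, k = 12),
        data = amis, family = tw(link = log), method = "REML",
        control = list(nthreads = nt))
  )["elapsed"]
})
timings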
If you increase the number of threads in this example to 2, 4, etc, the fitting procedure will take longer and longer, instead of being faster as we would expect. In my particular case:
1 thread: 32.85333 secs
2 threads: 50.63166 secs
3 threads: 1.2635 mins
Why is this? If I am doing something wrong, what can I do to obtain the desired behavior (i.e., increasing performance with increasing number of threads)?
Some notes:
1) The model, family and solving method shown here make no particular sense; this is only an example. However, I ran into this problem with real data and a reasonable model (I use this small example only for simplicity). Data, functional form of the model, family and solving method all seem to be irrelevant: after many tests I always get the same behaviour, i.e., increasing the number of threads decreases performance (increases computation time).
2) Operating system: Linux Ubuntu 18.04;
3) Architecture: Dell PowerEdge with two physical Intel Xeon X5660 CPUs, each with 6 cores @ 2800 MHz and each core able to handle 2 threads (i.e., 24 threads in total). 80 GB RAM.
4) OpenMP libraries (needed for the multi-thread capability of function gam) were installed with
sudo apt-get install libomp-dev
5) I am aware of the help page on multi-core use of gam (https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/mgcv-parallel.html). The only thing written there that points to a decrease of performance with an increasing number of threads is: "Because the computational burden in mgcv is all in the linear algebra, then parallel computation may provide reduced (...) benefit with a tuned BLAS".
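As a quick check related to that last note, sessionInfo() (R >= 3.4) reports which BLAS/LAPACK libraries R is linked against; a tuned, multi-threaded BLAS (e.g. OpenBLAS or MKL) can already keep the cores busy and shrink or even reverse the benefit of nthreads:
# Which BLAS/LAPACK is R using?
si <- sessionInfo()
si$BLAS    # path of the BLAS library R is linked against
si$LAPACK  # path of the LAPACK library R is linked against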
I'm using R's caret package to do modeling for a Coursera class on machine learning.
I'm currently building a random forest with 500 trees on a data set of 11k observations and 40 features.
The single-core implementation took about 3 hours to compute results, and I'm experimenting with a multi-core implementation right now (code below).
library(parallel)
library(caret)
library(doParallel)
library(foreach)
cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)
trCtrl <- trainControl(allowParallel = TRUE)
modFit2 <- train(classe ~ ., data = training, trControl = trCtrl,
                 method = "parRF", prox = TRUE, ntree = 500)
Now my question is this: Is there a way to view progress on the model build at run time? Is there a package or implementation of a parallelized RF that reports, for example, the number of trees built as it runs?
The obvious counter-question is: why do I need to know? Can't I just wait the hour or two for results? It won't be faster that way, and might even be slower!
I have a lot of models to build for my class, and I don't want to spend a few hours on each model wondering whether it is running or not. I want to confirm that it is building trees, stop execution, and schedule it for the night, when I will run the full models. I will be running different parameter configurations for RF and also some other time-intensive models, so I would rather spend my daytime writing code while leaving my computer at the mercy of the computation, running at full speed while I'm sleeping (my browser is barely working right now :P as both my RAM and CPU are almost at 100%).
You could use getModelInfo to add cat statements to the fit function. Also, there is a verboseIter option in trainControl that you are ignoring here.
Probably the problem is that you are using trainControl(allowParallel = TRUE). This tries to fit the resampling iterations across different cores, while method="parRF" fits each of those forests in parallel as well.
If you specify 4 cores on your machine, you have probably spawned 16 workers, which might also mean that you have 17 copies of the data in memory. You are probably better off using method = "rf" with trainControl(allowParallel = TRUE).
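A sketch of that suggestion, reusing the training data from the question: parallelize only at the resampling level (method = "rf" rather than "parRF") and turn on verboseIter for per-iteration progress (note that progress lines printed by parallel workers may not reach the master console, depending on the backend):
library(parallel)
library(caret)
library(doParallel)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
trCtrl <- trainControl(allowParallel = TRUE, verboseIter = TRUE)
modFit2 <- train(classe ~ ., data = training,
                 method = "rf", trControl = trCtrl, ntree = 500)
stopCluster(cl)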
I'm trying to speed up the prediction on a test dataset (n = 35,000) by splitting it up and letting R work on smaller chunks. The model was generated with party::cforest.
However, I can't get R to calculate even the smallest parts when trying to use foreach with %dopar%.
My prediction function takes about 7 seconds for both
predict(fit,newdata=a[1:100,]) and foreach(i=1:10) %do% {predict(fit,newdata=a[1:10,])}.
But when I try to use %dopar% instead, R seems to freeze.
Shouldn't
foreach(i=1:10, .packages=c('party')) %dopar% {predict(fit,newdata=a[1:10,])}
be way faster? Or is the parallelization itself slowing R down somehow?
Test-running with another function (repeatedly calculating sqrt(3), as suggested here) has shown significant improvement, so %dopar% is working, too.
Predictions with a randomForest model behave similarly, with the difference that here even %do% for ten 1:10 predictions takes a lot more time than just predicting 1:100.
For randomForest I don't really care, though, because predicting all 35k rows is not a problem anyway.
Btw, is it only me, or does cforest take more time and RAM for everything? I only run into trouble where randomForest works like a charm.
(running on Windows 7, x64, 8 GB RAM, 4 cores / 8 threads - using 6 nodes in a doSNOW parallelization cluster)
The primary problem with your example is that foreach is automatically exporting the entire a data frame to each of the workers. Instead, try something like:
library(itertools)
foreach(1:10, suba = isplitRows(a, chunkSize = 10), .packages = 'party') %dopar% {
  predict(fit, newdata = suba)
}
The 1:10 is for test purposes, to limit the loop to only 10 iterations, as you're doing in your example.
This still requires that fit be exported to all of the workers, and it might be quite large. But since there are many more tasks than workers, and provided predict takes enough time compared to the time needed to send the test data, it might be worthwhile to parallelize the prediction.
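For the full run over all rows of a (dropping the 1:10 limiter), the chunk predictions can be collected back into a single vector with .combine; a sketch, assuming a regression forest so that each chunk's predictions are numeric:
library(foreach)
library(itertools)
preds <- foreach(suba = isplitRows(a, chunkSize = 1000),
                 .combine = c, .packages = 'party') %dopar% {
  predict(fit, newdata = suba)
}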