xgboost superslow on Google Cloud Compute Engine

I am trying to train a list of R caret models on Google Cloud Compute Engine (Ubuntu 16.04 LTS). The xgboost models (both xgbLinear and xgbTree) take forever to complete training. In fact, GCP status monitoring always shows CPU utilization at 0.
I used the doMC library for parallel execution. It works very well for models like C5.0, glmnet and gbm. However, for xgboost (both xgbLinear and xgbTree), for some reason the CPU does not seem to be doing any work, because utilization remains at 0.
Troubleshooting:
1. Removed doMC and ran on a single core only; the same problem remained.
2. Changed the parallel execution library from doMC to doParallel. This time the CPU utilization went up, but training took 5 minutes to complete on GCP. The same code finished in just 12 seconds on my local laptop. (I ran 24 CPUs on GCP and 4 CPUs on my local laptop.)
3. doMC parallel execution works well for other algorithms. Only xgboost has this problem.
Code:
xgblinear_Grid <- expand.grid(nrounds = c(50, 100),
lambda = c(.05,.5),
alpha = c(.5),
eta = c(.3))
registerDoMC(cores = mc - 1)
set.seed(123)
xgbLinear_varimp <- train(formula2, data=train_data, method="xgbLinear", metric=metric, tuneGrid = xgblinear_Grid, trControl=fitControl, preProcess = c("center", "scale", "zv"))
print(xgbLinear_varimp)
No error message is generated. It simply runs endlessly.

I encountered the same problem, and it took a long time to understand the three reasons behind it:
1. xgbLinear requires more memory than any other machine learning algorithm available in the caret library. Assume at least 1 GB of RAM per core even for tiny datasets of only 1000 x 20, and more for bigger datasets.
2. xgbLinear in combination with parallel execution has a final process that recollects the data from the threads. This process is usually responsible for the 'endless' execution time. Again, RAM is the limiting factor. You might have seen the following error message, which is often caused by too little allocated RAM:
Error in unserialize(socklist[[n]]) : error reading from connection
3. xgbLinear has its own parallel processing algorithm, which gets mixed up with the doParallel algorithm. Here, the effective solution is to set xgbLinear to single-threaded with an additional parameter in caret::train(), nthread = 1, and let doParallel do the parallelization.
As an illustration of (1), memory utilization neared 80 GB in one run and reached 235 GB when training on a still tiny dataset of 2500 x 14.
As an illustration of (2), the recollection process is the one that takes forever if you don't have enough memory.
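The nthread = 1 suggestion can be sketched roughly as follows. This is a minimal sketch, not the original poster's setup: the toy data frame, the 2-worker cluster and the 3-fold CV control are assumptions for illustration, and it relies on caret forwarding extra arguments such as nthread to xgboost, which may depend on the installed caret and xgboost versions.

```r
library(caret)
library(doParallel)  # also attaches foreach and parallel

# Hypothetical toy data standing in for train_data
set.seed(123)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(200, sd = 0.1)

# Let doParallel handle the parallelism across resamples (2 workers assumed)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

grid <- expand.grid(nrounds = c(50, 100),
                    lambda  = c(0.05, 0.5),
                    alpha   = 0.5,
                    eta     = 0.3)
ctrl <- trainControl(method = "cv", number = 3, allowParallel = TRUE)

fit <- train(y ~ ., data = toy,
             method    = "xgbLinear",
             tuneGrid  = grid,
             trControl = ctrl,
             nthread   = 1)  # keep xgboost itself single-threaded

# Release the workers and fall back to sequential execution
stopCluster(cl)
registerDoSEQ()

print(fit$bestTune)
```

With this arrangement each worker runs one single-threaded xgboost fit at a time, so the number of busy threads should stay close to the number of registered workers. Memory per worker still applies, so fewer workers may be needed on larger data.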

Related

R language, keras, parallel computation

Although I modeled on a 40-core computer, I only used a few of them when modeling with Keras. I would like to know how to call all CPU cores for computation. I tried to implement parallel computing in the following way but failed. Scholars, please enlighten me on other methods.
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
history <- model %>% fit(xtrain, ytrain,
epochs = 200, batch_size=100, verbose = 1)

Tensorflow/Keras takes care of parallelism in fit(), and it generally won't work if you manually try to fork the parent R process or manage a PSOCK cluster. The {parallel} package that comes with R is not compatible with Tensorflow/Keras.
If it looks like Tensorflow/Keras is not using all your CPU cores with the default settings, you can adjust the thread pool size here: https://www.tensorflow.org/api_docs/python/tf/config/threading (but, in my experience, it's more likely that you're IO-limited, or the CPU is waiting on the GPU, and probably not that the thread pool size is too small).
If you're interested in distributed computing with Tensorflow, here is a good place to get started: https://www.tensorflow.org/api_docs/python/tf/distribute
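From R, the thread pool settings in tf.config.threading can be reached through the tensorflow package. The following is a minimal sketch, assuming TensorFlow 2.x is installed and reachable from R; the thread counts are arbitrary example values, and the setters have to run before TensorFlow executes its first operation.

```r
library(tensorflow)

# Example values only; tune for your own machine
n_intra <- 8L  # threads used inside a single op (e.g. a large matmul)
n_inter <- 2L  # number of independent ops that may run concurrently

# Must be called before any TensorFlow op has been executed
tf$config$threading$set_intra_op_parallelism_threads(n_intra)
tf$config$threading$set_inter_op_parallelism_threads(n_inter)

# Read the settings back to confirm they were applied
print(tf$config$threading$get_intra_op_parallelism_threads())
print(tf$config$threading$get_inter_op_parallelism_threads())
```

After this, build and fit the Keras model as usual, without registering a doParallel cluster around fit().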

Why does the fitting time of a gam increases with the number of threads used?

Common sense suggests that a computation should get faster the more cores or threads we use; if it scales badly, the computation time will at worst fail to improve as threads are added. So why does increasing the number of threads considerably increase the computation time when fitting a gam with the R package mgcv, as shown by this example?
library(boot) # loads data "amis"
library(mgcv) # provides gam() and the tw() family
t1 <- Sys.time()
mod <- gam(speed ~ s(period, warning, pair, k = 12), data = amis, family = tw(link = log), method = "REML", control = list(nthreads = 1))
t2 <- Sys.time()
print("Model fitted in:")
print(t2 - t1)
If you increase the number of threads in this example to 2, 4, etc, the fitting procedure will take longer and longer, instead of being faster as we would expect. In my particular case:
1 thread: 32.85333 secs
2 threads: 50.63166 secs
3 threads: 1.2635 mins
Why is this? If I am doing something wrong, what can I do to obtain the desired behavior (i.e., increasing performance with increasing number of threads)?
Some notes:
1) The model, family and solving method shown here make no particular sense. This is only an example. However, I've run into this problem with real data and a reasonable model (for simplicity I use this small code to exemplify the problem). Data, functional form of the model, family and solving method all seem to be irrelevant: after many tests I always get the same behaviour, i.e., increasing the number of threads decreases performance (increases computation time).
2) Operating system: Linux Ubuntu 18.04;
3) Architecture: DELL PowerEdge with two physical Intel Xeon X5660 CPUs, each with 6 cores @ 2800 MHz, and each core able to handle 2 threads (i.e., a total of 24 threads). 80 GB RAM.
4) OpenMP libraries (which are needed for the multi-thread capability of the gam function) were installed with
sudo apt-get install libomp-dev
5) I am aware of the help page for multi-core use of gam (https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/mgcv-parallel.html). The only thing written there pointing to a decrease of performance with increasing number of threads is "Because the computational burden in mgcv is all in the linear algebra, then parallel computation may provide reduced (...) benefit with a tuned BLAS".

Unlimiting the CPU usage from R

Is there any way to unlimit the CPU usage so my PC puts more effort into finishing a task more rapidly? At the moment the k-means algorithm is estimated to finish in about 10 days, which is something I would like to reduce.

R is single-threaded by default, which is a pity if you have a machine with 16 or 32 cores. By "unlimiting the CPU usage", I assume you're asking whether there's a way to have an R process (let's say part of the k-means algorithm) take advantage of your full CPU power by running in parallel.
Many R packages and processes are not going to be helped by parallel processing though. So the technical solution to your particular problem comes down to the package implementation you're using. Popular packages like caret do support parallelization when that's possible, even though you may need to add an additional allowParallel=TRUE parameter. They work in conjunction with a library such as doMC to allow multi-core processing. In the following sample code, I have my machine use 8 cores through the registerDoMC(8) function, and then set allowParallel=TRUE.
library(caret) # provides train() and trainControl()
library(doMC)
registerDoMC(8) # register 8 cores as the parallel backend
system.time({
ctrl_2 <- trainControl(method="cv", number=3, allowParallel=TRUE)
fb_forest_2 <- train(classe ~ ., data=fb_train, method="rf", trControl = ctrl_2)
})
Again, parallel processing doesn't always help: not every process can be parallelized! The documentation for foreach is a great read, so if you can afford the time, take a look at it. The specific code solution for your problem also depends on the library implementation you're using.
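For k-means specifically, one thing that can be parallelized with base R alone is the set of random restarts: each restart is independent, so they can run on separate workers and the best result is kept. The sketch below is one possible approach under assumptions (synthetic data, 2 workers, 8 restarts, k = 3); it does not speed up a single k-means run, only several restarts together.

```r
library(parallel)

# Hypothetical data: three Gaussian blobs in 2-D, 1000 points each
set.seed(42)
x <- rbind(matrix(rnorm(2000, mean = 0),  ncol = 2),
           matrix(rnorm(2000, mean = 5),  ncol = 2),
           matrix(rnorm(2000, mean = 10), ncol = 2))

n_workers <- 2  # assumed; use fewer than detectCores() in practice
cl <- makeCluster(n_workers)
clusterSetRNGStream(cl, iseed = 42)  # independent, reproducible RNG streams

# Each task runs k-means once from its own random start;
# the data is passed as an argument so it is shipped to the workers
fits <- parLapply(cl, seq_len(8), function(i, data) {
  kmeans(data, centers = 3, nstart = 1, iter.max = 50)
}, data = x)

stopCluster(cl)

# Keep the restart with the smallest total within-cluster sum of squares
scores <- vapply(fits, function(f) f$tot.withinss, numeric(1))
best <- fits[[which.min(scores)]]
print(best$size)
```

Each worker receives its own copy of the data, so memory use grows with the number of workers.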

Parallelization with Rborist

I have a large n (>1,000,000) dataset with a small number of features on which to estimate a (regression) random forest, and have been looking to implement Rborist (in R). I'd like to parallelize my work, but am not finding much guidance on how that would be done. I have 16 processors to use on the machine where it's running. When I use doParallel with the randomForest package, for example, the command:
rf <- foreach(ntree=rep(32, 16), .combine=combine, .packages='randomForest') %dopar% randomForest(x, y, nodesize = 25, ntree=ntree)
It launches 16 R processes, and works slowly as randomForest does, but works.
The analogous command for Rborist:
rb <- foreach(ntree=rep(32, 16), .combine=combine, .packages='Rborist') %dopar% Rborist(x, y, minNode = 25, ntree=ntree)
Throws the error:
error calling combine function:
Warning message: In mclapply(argsList, FUN, mc.preschedule =
preschedule, mc.set.seed = set.seed, : all scheduled cores
encountered errors in user code
Does anyone know how to parallelize with Rborist? It does not appear to be happening under the hood as it's only using 1 cpu when I run:
rb <- Rborist(x, y, minNode = 25, ntree = 512)

Rborist runs in parallel by itself. It uses all the threads on my machine (Windows 10, 64-bit). But then I didn't load doParallel / foreach first.
The same goes for the ranger package, but in ranger you can set the number of threads to use.
Off the top of my head, speedy implementations of random forests are:
Rborist (large n, low p)
ranger (handles large p, modest n)
randomForest.ddR (haven't tested)
distributed random forest in H2O (very fast, but makes use of stopping criteria)
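Setting the thread count in ranger can be sketched like this. The synthetic regression data and the value of 4 threads are assumptions for illustration; num.trees and min.node.size simply mirror the 512 trees and node size of 25 from the question.

```r
library(ranger)

# Hypothetical regression data with a small number of features
set.seed(1)
n <- 5000
df <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
df$y <- sin(2 * pi * df$x1) + df$x2^2 + rnorm(n, sd = 0.1)

# ranger multithreads internally; num.threads caps how many threads it uses
fit <- ranger(y ~ ., data = df,
              num.trees     = 512,
              min.node.size = 25,
              num.threads   = 4)

print(fit$prediction.error)  # out-of-bag mean squared error
```

Because the forest is grown in parallel inside a single call, there is no need to wrap it in foreach or to combine sub-forests afterwards.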

Rborist currently uses all available cores. Would it be useful to offer a way to tune this?
Have you tried the latest version on CRAN, 0.1-3? This contains a change to the default minimum node size for regression, improving accuracy in some cases.
We've been making some strides toward improving performance with modest (as opposed to small) predictor count. This should also be reflected by changes in the latest release.
Large running memory footprint is probably a consequence of the breadth-first splitting approach. One way to conserve memory is to carve the problem into chunks, but we haven't gotten there yet.
Large final memory size is chiefly due to caching leaf information for subsequent use by other packages or for quantile regression. Perhaps we should add a "noLeaf" option for users who are not interested in either of those options.

Is there a way to track progress during parallelized Random Forest building?

I'm using R's caret package to do modeling for a Coursera class on machine learning.
I'm currently building a Random Forest with 500 trees on a data set of 11k observations and 40 features.
It took about 3 hours for the single-core implementation to compute results, and I'm experimenting with a multi-core implementation right now (code below).
library(parallel)
library(caret)
library(doParallel)
library(foreach)
cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)
trCtrl <- trainControl(allowParallel = TRUE)
modFit2 <- train(classe~ ., data=training, trControl = trCtrl, method="parRF", prox=TRUE, ntree = 500)
Now my question is this: is there a way to view the progress of the model build at run time? Is there a package/implementation of parallelized RF that outputs, for example, the number of trees built as it runs?
The obvious question is: why do I need to know? Can't I just wait an hour or two for the results? It won't be faster, and might be slower, that way!
I have a lot of models to build for my class, and I don't want to spend a few hours on each model wondering whether it is running or not. I want to confirm that it is building trees, stop execution, and schedule it for the night, when I will run the full models. I will be running different configurations of parameters for RF and also some other time-intensive models, so I would rather spend my daytime writing code and leave my computer at the mercy of full-speed computation while I'm sleeping (my browser is barely working right now :P as both my RAM and CPU are almost at 100%).

You could use getModelInfo to add cat statements to the fit function. Also, there is a verboseIter option in trainControl that you are not using here.
The problem is probably that you are using trainControl(allowParallel = TRUE) together with method = "parRF". The former tries to fit the resampling iterations across different cores, and the latter fits each of those in parallel as well.
If you specify 4 cores on your machine, you have probably spawned 16 workers, which might also mean that you have 17 copies of the data in memory. You are probably better off using method = "rf" and trainControl(allowParallel = TRUE).
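The method = "rf" plus verboseIter suggestion could look roughly like the sketch below. It uses the built-in iris data, 2 workers, 3-fold CV and 100 trees as stand-in assumptions, and it assumes the randomForest package is installed. The outfile = "" argument asks the PSOCK workers to send their output to the master's console, which tends to work in a terminal session but may not show up in every GUI.

```r
library(caret)
library(doParallel)  # also attaches foreach and parallel

# outfile = "" forwards worker output (including progress lines) to the console
cl <- makePSOCKcluster(2, outfile = "")
registerDoParallel(cl)

ctrl <- trainControl(method = "cv", number = 3,
                     allowParallel = TRUE,  # parallelize over resamples only
                     verboseIter   = TRUE)  # log each resample/tuning step

fit <- train(Species ~ ., data = iris,
             method     = "rf",  # plain randomForest, not parRF
             trControl  = ctrl,
             tuneLength = 2,
             ntree      = 100)   # passed through to randomForest

stopCluster(cl)
registerDoSEQ()

print(fit$results)
```

The progress lines report resampling iterations and tuning values rather than individual trees, which is usually enough to confirm that the job is advancing.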
