I am running random forest in R in parallel
library(doMC)
registerDoMC()
x <- matrix(runif(500), 100)
y <- gl(2, 50)
Parallel execution (took 73 sec)
rf <- foreach(ntree=rep(25000, 6), .combine=combine, .packages='randomForest') %dopar%
randomForest(x, y, ntree=ntree)
Sequential execution (took 82 sec)
rf <- foreach(ntree=rep(25000, 6), .combine=combine) %do%
randomForest(x, y, ntree=ntree)
In parallel execution, the tree generation is pretty quick like 3-7 sec, but the rest of the time is consumed in combining the results (combine option). So, its only worth to run parallel execution is the number of trees are really high. Is there any way I can tweak "combine" option to avoid any calculation at each node which I dont need and make it more faster
PS. Above is just an example of data. In real I have some 100 thousands features for some 100 observations.
Setting .multicombine to TRUE can make a significant difference:
rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine,
.multicombine=TRUE, .packages='randomForest') %dopar% {
randomForest(x, y, ntree=ntree)
}
This causes combine to be called once rather than five times. On my desktop machine, this runs in 8 seconds rather than 19 seconds.
Are you aware that the caret package can do a lot of the hand-holding for parallel runs (as well as data prep, summaries, ...) for you?
Ultimately, of course, if there are some costly operations left in the random forest computation itself, there is little you can do as Andy spent quite a few years on improving it. I would expect few to no low-hanging fruits to be around for the picking...
H20 package can be used to solve your problem.
According to H20 documentation page H2O is "the open source
math engine for big data that computes parallel distributed
machine learning algorithms such as generalized linear models,
gradient boosting machines, random forests, and neural networks
(deep learning) within various cluster environments."
Random Forest implementation using H2O:
https://www.analyticsvidhya.com/blog/2016/05/h2o-data-table-build-models-large-data-sets/
I wonder if the parallelRandomForest code would be helpful to you?
According to the author it ran about 6 times faster on his data set with 16 times less memory consumption.
SPRINT also has a parallel implementation here.
Depending on your CPU, you could probably get 5%-30% speed-up choosing number of jobs to match your number of registered cores matching the number of system logical cores. (sometimes it is more efficient to match number of system physical cores).
If you have a generic Intel dual-core laptop with Hyper Threading(4 logical cores), then DoMC probably registered a cluster of 4 cores. Thus 2 cores will idle when iteration 5 and 6 are computed plus the extra time starting/stopping two extra jobs. It would more efficient to make only 2-4 jobs of more trees.
Related
I have written some mpi code that solves systems of equations using the conjugate gradient method. In this method matrix-vector multiplication takes up most of the time. As a parallelization strategy, I do the multiplication in blocks of rows and then I
gather the results in the root process. The remaining steps are performed by the root process which broadcasts the results whenever a matrix-vector multiplication needs to be performed.
The strong scaling curve representing the speedup is fine
But the weak scaling curve representing the efficiency is quite bad
In theory, the blue curve should be close to the red one.
Is this intrinsic to the parallelization strategy or am I doing something wrong?
Details
The measurements are in seconds. The experiments are performed on a cluster where each node has
2 Skylake processors running at 2.3 GHz, with 18 cores each,192 GB of DDR3 RAM and 800GB NVMe local drive. Amdahl's prediction is computed with the formula (0.0163 + 0.9837 / p)^-1. Gustafson's prediction is computed with the formula 0.9873+0.0163/p where p is the number of processors. The experimental values are in both cases obtained by dividing the time spent by a single computation unit by the time spent by p computation units.
For weak scaling, I start with a load per processor of W = 1768^2 matrix entries. Then the load with p processors will be M^2 = pW matrix cells. Thus, we set the matrix's side to M = 1768 \sqrt{p} for p processes. This gives: 1768, 3536, 5000, 7071 and 10000 cells for 1, 2, 4, 8, 16, 32 processors respectively. I also fix the number of iterations to 500 so that the measurements are not affected by the variability in the data.
I think your Amdahl formula is wrong. It should be:
S_p = F_s + p F_p
You have a division that should be a multiplication. See for instance https://theartofhpc.com/istc/parallel.html#Gustafson'slaw
I’m trying to calculate local regression on R using the loess() function and the computer is taking forever to process it.
How do I make it work faster?
My laptop has 8 GB RAM and a quad core processor
Multi-threading in Caret.
Step 1: Detect the number of logical cores on your computer.
library(doParallel)
detectCores(all.tests = FALSE, logical = TRUE)
Step 2: Assign more cores by changing the value inside makePSOCKcluster()
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
Step 3: Inside carets "trainControl" set "allowParallel = TRUE"
Step 4: When you are finished multi-threading
stopCluster(cl)
I am setting up a piece of code to parallel processes some computations for N groups in my data using foreach.
I have a computation that involves a call to h2o.gbm.
In my current, sequential set-up, I use up to about 70% of my RAM.
How do I correctly set-up my h2o.init() within the parallel piece of code? I am afraid that I might run out of RAM when I use multiple cores.
My Windows 10 machine has 12 cores and 128GB of RAM.
Would something like this pseudo-code work?
library(foreach)
library(doParallel)
#setup parallel backend to use 12 processors
cl<-makeCluster(12)
registerDoParallel(cl)
#loop
df4 <-foreach(i = as.numeric(seq(1,999)), .combine=rbind) %dopar% {
df4 <- data.frame()
#bunch of computations
h2o.init(nthreads=1, max_mem_size="10G")
gbm <- h2o.gbm(train_some_model)
df4 <- data.frame(someoutput)
}
fwrite(df4, append=TRUE)
stopCluster(cl)
The way your code is currently set up won't be the best option. I understand what you are trying to do -- execute a bunch of GBMs in parallel (each on a single core H2O cluster), so you can maximize the CPU usage across the 12 cores on your machine. However, what your code will do is try to run all the GBMs in your foreach loop in parallel on the same single-core H2O cluster. You can only connect to one H2O cluster at a time from a single R instance, however the foreach loop will create a new R instance.
Unlike most machine learning algos in R, the H2O algos are all multi-core enabled so the training process will already be parallelized at the algorithm level, without the need for a parallel R package like foreach.
You have a few options (#1 or #3 is probably best):
Set h2o.init(nthreads = -1) at the top of your script to use all 12 of your cores. Change the foreach() loop to a regular loop and train each GBM (on a different data partition) sequentially. Although the different GBMs are trained sequentially, each single GBM will be fully parallelized across the H2O cluster.
Set h2o.init(nthreads = -1) at the top of your script, but keep your foreach() loop. This should run all your GBMs at once, with each GBM parallelized across all cores. This could overwhelm the H2O cluster a bit (this is not really how H2O is meant to be used) and could be a bit slower than #1, but it's hard to say without knowing the size of your data and the number of partitions of you want to train on. If you are already using 70% of your RAM for a single GBM, then this might not be the best option.
You can update your code to do the following (which most closely resembles your original script). This will preserve your foreach loop, creating a new 1-core H2O cluster at a different port on your machine. See below.
Updated R code example which uses the iris dataset and returns the predicted class for iris as a data.frame:
library(foreach)
library(doParallel)
library(h2o)
h2o.shutdown(prompt = FALSE)
#setup parallel backend to use 12 processors
cl <- makeCluster(12)
registerDoParallel(cl)
#loop
df4 <- foreach(i = seq(20), .combine=rbind) %dopar% {
library(h2o)
port <- 54321 + 3*i
print(paste0("http://localhost:", port))
h2o.init(nthreads = 1, max_mem_size = "1G", port = port)
df4 <- data.frame()
data(iris)
data <- as.h2o(iris)
ss <- h2o.splitFrame(data)
gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = ss[[1]])
df4 <- as.data.frame(h2o.predict(gbm, ss[[2]]))[,1]
}
In order to judge which option is best, I would try running this on a few data partitions (maybe 10-100) to see which approach seems to scale the best. If your training data is small, it's possible that #3 will be faster than #1, but overall, I'd say #1 is probably the most scalable/stable solution.
Following Erin LeDell's answer, I just wanted to add that in many cases a decent practical solution can be something in between #1 and #3. To increase CPU utilization and still save RAM you can use multiple H2O instances in parallel, but they each can use multiple cores without much performance loss relative to running more instances with only one core.
I ran an experiment using a relatively small 40MB dataset (240K rows, 22 columns) on a 36 core server.
Case 1: Use all 36 cores (nthreads=36) to estimate 120 GBM models (with default
hyper-parameters) sequentially.
Case 2: Use foreach to run 4 H2O instances on this machine, each
using 9 cores to estimate 30 GBM default models sequentially (total = 120 estimations).
Case 3: Use foreach to run 12 H2O instances on this machine, each
using 3 cores to estimate 10 GBM default models sequentially (total = 120 estimations).
Using 36 cores estimating a single GBM model on this dataset is very inefficient. CPU utilization in Case 1 is jumping around a lot, but is on average below 50%. So there is definitely something to gain using more than one H2O instance at a time.
Runtime Case 1: 264 seconds
Runtime Case 2: 132 seconds
Runtime Case 3: 130 seconds
Given the small improvement from 4 to 12 H2O instances, I did not even run 36 H2O instances each using one core in parallel.
I have a training set that looks like
Name Day Area X Y Month Night
ATTACK Monday LA -122.41 37.78 8 0
VEHICLE Saturday CHICAGO -1.67 3.15 2 0
MOUSE Monday TAIPEI -12.5 3.1 9 1
Name is the outcome/dependent variable.
Here is what my code looks like so far in case it helps
ynn <- model.matrix(~Name , data = trainDF)
mnn <- model.matrix(~ Day+Area +X + Y + Month + Night, data = trainDF)
yCat<-make.names(trainDF$Name, unique=FALSE, allow_=TRUE)
I then setup tuning the parameters
nnTrControl=trainControl(method = "repeatedcv",number = 3,repeats=5,verboseIter = TRUE, returnData = FALSE, returnResamp = "all", classProbs = TRUE, summaryFunction = multiClassSummary,allowParallel = TRUE)
nnGrid = expand.grid(.size=c(1,4,7),.decay=c(0,0.001,0.1))
model <- train(y=yCat, x=mnn, method='nnet',linout=TRUE, trace = FALSE, trControl = nnTrControl,metric="logLoss", tuneGrid=nnGrid)
When I ran this, it was still running over 20 hours later, so I had to stop it
I read in the link below that its possible to parallelize the resampling of Caret using registerDoMC: R caret nnet package in Multicore
However, that only seems to work for cores. My machine uses 2 cores and 2 threads on each core. Is there a way to get a speedup using the threads in addition to using the 2 cores and registerDoMC(2)?
I also see in this link below that the user had to setup seeds for each resample: Fully reproducible parallel models using caret
Do I also have to do that for my code? Why was this not used in the former link? What about if I used xgboost instead of nnet?
If you want to reproduce your results you will have to set your seed on every thread that you spawn. This is required because every thread will have a different random number every time an instance is spawned. Depending on which OS you are working each thread will most likely be scheduled on a separate core on your CPU. This depends on your OS job scheduler. In regards to using xgboost versus nnet, I think that the most important aspect should be whether you are interested in the model properties. I think that if you are starting with machine learning xgboost may be a bit easier than nnet. If computational performance is your biggest concern you may try to run your problem on a smaller subset first.
One thing I would do first is run a MCA analysis, which can be find in FactoMineR. This will allow you to see the amount of variance in each of your variables. You could drop variables that have too little variance and thereby speed up the performance of your learning task.
I am trying to run an optimization using lp.transport from the package lpSolve in R, using the generic form
lp.transport (cost, "min", row.signs, row.rhs, col.signs, col.rhs)
The cost matrix is large, 6791 x 15594. The rows correspond to food producers and the columns to consumers, and obviously the sum of all values of row.rhs is equal to that of col.rhs.
The optimization has been running about 12 hours now (using about 30 Mb of memory, in a 64-bits R). Is there any way to estimate the time it will take? Any advice on how to modify the inputs to eventually reduce computational time?