parallel prediction with cforest/randomForest (with doSNOW) - r

I'm trying to speed up the prediction of a test-dataset (n=35000) by splitting it up and letting R run on smaller chunks. The model has been generated with party::cforest.
However, I can't get R to calculate even the smallest parts when trying to use foreach with %dopar%.
My prediction function takes about 7 seconds for both
predict(fit,newdata=a[1:100,]) and foreach(i=1:10) %do% {predict(fit,newdata=a[1:10,])}.
But when I try to use %dopar% instead, R seems to freeze.
Shouldn't
foreach(i=1:10, .packages=c('party')) %dopar% {predict(fit,newdata=a[1:10,])}
be way faster? Or is the parallelization itself slowing R down somehow?
Test-running with another function (repeatedly calculating sqrt(3), as suggested here) showed a significant improvement, so %dopar% itself is working.
Predictions with randomForest behave similarly, with the difference that there even %do% for 10x1:10 predictions takes a lot more time than just predicting 1:100.
For randomForest I don't really care though, because predicting all 35k rows is not a problem anyway.
Btw., is it only me, or does cforest take more time and RAM for everything? I only run into trouble where randomForest works like a charm.
(running on Windows 7, x64, 8GB RAM, 4 cores/8 threads - using 6 nodes in doSNOW parallelization cluster)

The primary problem with your example is that foreach is automatically exporting the entire a data frame to each of the workers. Instead, try something like:
library(itertools)
foreach(1:10, suba=isplitRows(a, chunkSize=10), .packages='party') %dopar% {
  predict(fit, newdata=suba)
}
The 1:10 is for test purposes, to limit the loop to only 10 iterations, as you're doing in your example.
This still requires that fit be exported to all of the workers, and it might be quite large. But as long as there are more tasks than workers and predict takes enough time relative to the time needed to send the test data, parallelizing the prediction can still pay off.
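For reference, a minimal end-to-end sketch of this pattern (cluster size and chunk size are placeholders, fit and a are the model and test data from the question, and .combine = c assumes vector-valued predictions):
library(doSNOW)
library(itertools)
cl <- makeCluster(6)                     # 6 workers, as in the question
registerDoSNOW(cl)
# Each task ships only its own slice of the test data; fit is exported once
# per worker. Use a different combiner (e.g. rbind) if predict returns a matrix.
preds <- foreach(suba = isplitRows(a, chunkSize = 1000),
                 .combine = c, .packages = 'party') %dopar% {
  predict(fit, newdata = suba)
}
stopCluster(cl)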

Related

Parallel computing with R in a SLURM cluster

I need to do a model estimation using the MCMC method on a SLURM cluster (the system is CentOS). The estimation takes a very long time to finish.
Within each MCMC iteration, there is one step that takes a particularly long time. This step is a lapply loop (around 100,000 iterations, about 30s to finish them all), so as far as I understand, I should be able to use parallel computing to speed it up.
I tried several packages (doMC, doParallel, doSNOW) together with the foreach framework. The setup is
parallel_cores <- 8
#doParallel
library(doParallel)
cl <- makeCluster(parallel_cores)
registerDoParallel(cl)
#doMC
library(doMC)
registerDoMC(parallel_cores)
#doSNOW, this is also fast
library(doSNOW)
cl <- makeCluster(parallel_cores)
registerDoSNOW(cl)
#foreach framework
#data is a list
data2 <- foreach(data_i = data, .packages = c("somePackage")) %dopar% {
  data_i <- some_operation(data_i)
  list(beta = data_i$beta, sigma = data_i$sigma)
}
Using doMC, the time for this step can be reduced to about 9s. However, as doMC uses shared memory and I have a large array to store the estimation results, I quickly ran out of memory (i.e. slurmstepd: error: Exceeded job memory limit).
Using doParallel and doSNOW, the time for this step actually increased, to about 120s, which sounds ridiculous. The mysterious thing is that when I tested the code on both my Mac and Windows machines, doParallel and doSNOW gave similar speed to doMC.
I'm stuck and not sure how to proceed. Any suggestions will be greatly appreciated!
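To make the overhead concrete (an illustration, not code from the question): with doParallel/doSNOW every foreach task is serialized and sent to a worker over a socket, so 100,000 tiny tasks pay 100,000 rounds of communication. A sketch using isplitVector from the itertools package to hand each worker one large chunk instead (the chunk count is arbitrary; some_operation and "somePackage" are the question's placeholders):
library(doSNOW)
library(itertools)
cl <- makeCluster(parallel_cores)
registerDoSNOW(cl)
# One task per chunk rather than one task per list element; each worker
# loops over its chunk locally and the per-chunk result lists are concatenated.
data2 <- foreach(chunk = isplitVector(data, chunks = parallel_cores),
                 .combine = c, .packages = c("somePackage")) %dopar% {
  lapply(chunk, function(data_i) {
    data_i <- some_operation(data_i)
    list(beta = data_i$beta, sigma = data_i$sigma)
  })
}
stopCluster(cl)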

Parallelization with Rborist

I have a large-n (>1,000,000) dataset with a small number of features, on which I want to estimate a (regression) random forest, and I have been looking at Rborist (in R). I'd like to parallelize my work, but am not finding much guidance on how that would be done. I have 16 processors to use on the machine where it's running. When I use doParallel with the randomForest package, for example, the command:
rf <- foreach(ntree=rep(32, 16), .combine=combine, .packages='randomForest') %dopar% randomForest(x, y, nodesize = 25, ntree=ntree)
launches 16 R processes and works, albeit slowly, as randomForest does.
The analogous command for Rborist:
rb <- foreach(ntree=rep(32, 16), .combine=combine, .packages='Rborist') %dopar% Rborist(x, y, minNode = 25, ntree=ntree)
Throws the error:
error calling combine function:
Warning message:
In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
  all scheduled cores encountered errors in user code
Does anyone know how to parallelize with Rborist? It does not appear to be happening under the hood, as it only uses 1 CPU when I run:
rb <- Rborist(x, y, minNode = 25, ntree = 512)
Rborist runs in parallel by itself. It uses all the threads on my machine (Win 10, 64-bit). But then, I didn't load doParallel/foreach first.
Same goes for the ranger package, but in ranger you can set the number of threads to use.
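For example (a minimal sketch, not from the thread; the data frame and response are placeholders), ranger exposes this through its num.threads argument:
library(ranger)
# ranger grows trees in parallel internally, so no foreach/doParallel wrapper
# is needed; num.threads caps how many threads it uses.
rb_alt <- ranger(y ~ ., data = dat, num.trees = 512,
                 min.node.size = 25, num.threads = 16)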
Speedy implementations of a random forest, off the top of my head:
Rborist (large n, low p)
ranger (handles large p, modest n)
randomForest.ddR (haven't tested)
distributed random forest in H2O (very fast, but makes use of stopping criteria)
Rborist currently uses all available cores. Would it be useful to offer a way to tune this?
Have you tried the latest version on CRAN, 0.1-3? This contains a change to the default minimum node size for regression, improving accuracy in some cases.
We've been making some strides toward improving performance with modest (as opposed to small) predictor count. This should also be reflected by changes in the latest release.
Large running memory footprint is probably a consequence of the breadth-first splitting approach. One way to conserve memory is to carve the problem into chunks, but we haven't gotten there yet.
Large final memory size is chiefly due to caching leaf information for subsequent use by other packages or for quantile regression. Perhaps we should add a "noLeaf" option for users who are not interested in either of those options.

R foreach error unable to allocate size 34.6 GB when I have 140GB+ memory allocated to R

I am trying to use R's doSNOW and foreach packages to parallelize ranger random forest classification predictions on a file that is a little over 15MM records and 82 attributes.
I am using an AWS EC2 instance of size m4.10xlarge. It is a Windows OS and R is 64-bit. I have definitely allocated over 100GB of RAM to R and I have cleared out any garbage memory. I only have the ranger random forest model and the 15MM-record data frame as objects in R, yet when I try to run my code (regardless of the splits/chunks I try), I get errors similar to the following:
error calling combine function:
<simpleError: cannot allocate vector of size 34.6 Gb>
For each record, I'm predicting the class at each tree so I can later calculate class probabilities. I can run the prediction easily on the entire file, but it's too slow and I have a lot more files to predict on. All I'm trying to do is speed up the process, but no matter what I do, I keep getting this error. I have spent hours online looking for info, but cannot find anyone else with this issue.
Here is my prediction code:
library(doSNOW)
library(itertools)
num_splits <- 7
cl <- makeCluster(num_splits, outfile = "cluster.txt")
registerDoSNOW(cl)
writeLines(c(""), "log.txt")
system.time(
  predictions <- foreach(d = isplitRows(df2, chunks = num_splits),
                         .combine = rbind, .packages = c("ranger")) %dopar% {
    sink("log.txt", append = TRUE)
    predict(model1, data = d, predict.all = TRUE)$predictions
  }
)
stopCluster(cl)
Does anyone know why I'm getting this error and how to fix it? Is it possible I'm allocating too much memory to R and that's why?
Or could the way I'm combining the results be causing issues? I'm trying to stack the predictions from each chunk on top of each other so that I end up with a row of tree-level predictions for each of the 15MM records.
Thanks so much!
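A rough back-of-the-envelope sketch of why this blows up (the tree count here is an assumption, since it isn't stated in the question):
# predict.all=TRUE returns one prediction per record per tree, so the result
# grows as n_records * n_trees elements before any rbind copies are made.
n_records <- 15e6   # ~15MM records, from the question
n_trees   <- 500    # assumed number of trees; not stated in the question
bytes     <- 8      # worst case: stored as 8-byte doubles
cat(round(n_records * n_trees * bytes / 1024^3, 1), "GiB for the full prediction matrix\n")
# Each rbind in .combine additionally copies the partial result, which is
# typically where "cannot allocate vector of size ..." errors surface.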

Is there a way to track progress during parallelized Random Forest building?

I'm using R's caret package to do modeling for a Coursera class on machine learning.
I'm currently building a random forest with 500 trees on a data set of 11k observations and 40 features.
It took about 3 hours for the single-core implementation to compute results, and I'm experimenting with a multi-core implementation right now (code below):
library(parallel)
library(caret)
library(doParallel)
library(foreach)
cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)
trCtrl <- trainControl(allowParallel = TRUE)
modFit2 <- train(classe ~ ., data = training, trControl = trCtrl, method = "parRF", prox = TRUE, ntree = 500)
Now my question is this: is there a way to view the progress of the model build during run-time? Is there a package/implementation of parallelized RF that outputs, for example, the number of trees built as it runs?
The obvious question is: why do I need to know? Can't I just wait the hour or two for results? It won't be faster that way, and it might even be slower!
I have a lot of models to build for my class and I don't want to spend a few hours on each model wondering whether it is running or not. I want to confirm that it is building trees, stop execution, and schedule it for the night, when I will run the full models. I will be running different configurations of parameters for RF and also some other time-intensive models, so I would rather spend my daytime writing code and leave my computer at the mercy of the computation while I'm sleeping (my browser is barely working right now :P as both my RAM and CPU are almost at 100%).
You could use getModelInfo to add cat statements to the fit function. Also, there is a verboseIter option in trainControl that you are ignoring here.
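A minimal sketch of that idea (not code from the answer; the wrapper simply prints a timestamp before delegating to caret's built-in rf fit function, and the train() call is left commented out as a usage hint):
library(caret)
# Pull caret's built-in "rf" model definition and wrap its fit function so
# each fit announces itself as it starts.
rf_mod <- getModelInfo("rf", regex = FALSE)[[1]]
rf_mod$fit <- local({
  orig_fit <- rf_mod$fit
  function(...) {
    cat(format(Sys.time()), "starting another rf fit\n")
    orig_fit(...)
  }
})
# Pass the modified list as the method. Note that cat() output from PSOCK
# workers only reaches the console if the cluster was made with an outfile.
# modFit <- train(classe ~ ., data = training, method = rf_mod, trControl = trCtrl)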
Probably the problem is that you are using trainControl(allowParallel = TRUE). This tries to fit the resampling iterations across different cores, and method="parRF" then fits each of those forests in parallel as well.
If you specify 4 cores on your machine, you have probably spawned 16 workers. You are probably better off using method = "rf" with trainControl(allowParallel = TRUE). The nested setup might also mean that you have 17 copies of the data in memory.
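A minimal sketch of that suggestion (4 workers assumed; verboseIter is the progress option mentioned above, though its output comes from the resampling loop and so may not reach the console when that loop runs on parallel workers):
library(caret)
library(doParallel)
cl <- makeCluster(4)          # one worker per physical core
registerDoParallel(cl)
# Parallelize only at the resampling level and use the plain "rf" method.
trCtrl <- trainControl(allowParallel = TRUE, verboseIter = TRUE)
modFit <- train(classe ~ ., data = training, method = "rf",
                trControl = trCtrl, ntree = 500)
stopCluster(cl)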

R: SVM performance using laplace kernel is too slow

I've created an SVM in R using the kernlab package; however, it's running incredibly slowly (20,000 predictions take ~45 seconds on the win64 R distribution). The CPU is running at 25% and RAM utilization is a mere 17%, so it's not a hardware bottleneck. Similar calculations using data mining algorithms in SQL Server Analysis Services run about 40x faster.
Through trial and error, we discovered that the laplacedot kernel gives us the best results by a wide margin. rbfdot is about 15% less accurate, but twice as fast (still too slow). The fastest is vanilladot: it runs more or less instantly, but its accuracy is far too low to use.
We'd ideally like to use the laplacedot kernel but to do so we need a massive speedup. Does anyone have any ideas on how to do this?
Here is some profiling information I generated using Rprof. It looks like most of the time is spent in low-level math calls (the rest of the profile consists of data similar to rows 16-40). This should run very quickly, but it looks like the code is just not optimized (and I don't know where to start).
http://pastebin.com/yVPC66Be
Edit: Sample code to reproduce:
library(kernlab)
dummy.length <- 20000
source.data <- as.matrix(cbind(sample(1:dummy.length)/1300, sample(1:dummy.length)/1900))
colnames(source.data) <- c("column1", "column2")
y.value <- as.matrix((sample(1:dummy.length) + 9) / 923)
model <- ksvm(source.data[,], y.value, type="eps-svr", kernel="laplacedot", C=1, kpar=list(sigma=3))
The source data has 7 numeric columns (floating point) and 20,000 rows. This takes about 2-3 minutes to train. The next call generates the predictions and consistently takes 40 seconds to run:
predictions <- predict(model, source.data)
Edit 2: The laplacedot kernel calculates the dot product of two vectors using the following formula, which corresponds rather closely with the profiling output. Strangely, it appears that the negation (the minus sign just before the round call) consumes about 50% of the runtime.
return(exp(-sigma * sqrt(-(round(2 * crossprod(x, y) - crossprod(x,x) - crossprod(y,y), 9)))))
Edit 3: Added sample code to reproduce - this gives me about the same runtimes as my actual data.
SVM itself is a slow algorithm. Training an SVM has roughly O(n^2) time complexity in the number of samples, and prediction time scales with the number of support vectors.
SMO (Sequential Minimal Optimization, http://en.wikipedia.org/wiki/Sequential_minimal_optimization) is an algorithm for efficiently solving the optimization problem that arises during the training of support vector machines.
libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and liblinear are two open-source implementations.
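For illustration only (not part of the answer above): the e1071 package wraps libsvm, so a quick comparison against the kernlab timings is easy to set up. libsvm has no Laplace kernel, so the RBF kernel and its parameters below are stand-ins:
library(e1071)
# libsvm-backed eps-regression on the dummy data from the question;
# kernel choice and gamma are illustrative, not tuned.
model.libsvm <- svm(x = source.data, y = as.numeric(y.value),
                    type = "eps-regression", kernel = "radial",
                    cost = 1, gamma = 3)
predictions.libsvm <- predict(model.libsvm, source.data)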
