How to improve randomForest performance? - r

I have a training set of size 38 MB (12 attributes with 420000 rows). I am running the below R snippet, to train the model using randomForest. This is taking hours for me.
rf.model <- randomForest(
Weekly_Sales~.,
data=newdata,
keep.forest=TRUE,
importance=TRUE,
ntree=200,
do.trace=TRUE,
na.action=na.roughfix
)
I think, due to na.roughfix, it is taking long time to execute. There are so many NA's in the training set.
Could someone let me know how can I improve the performance?
My system configuration is:
Intel(R) Core i7 CPU # 2.90 GHz
RAM - 8 GB
HDD - 500 GB
64 bit OS

(The tl;dr is you should a) increase nodesize to >> 1 and b) exclude very low-importance feature columns, maybe even exclude (say) 80% of your columns. Your issue is almost surely not na.roughfix, but if you suspect that, run na.roughfix separately as a standalone step, before calling randomForest. Get that red herring out of the way at first.)
Now, all of the following advice only applies until you blow out your memory limits, so measure your memory usage, and make sure you're not exceeding. (Start with ridiculously small parameters, then scale them up, measure the runtime, and keep checking it didn't increase disproportionately.)
The main parameters affecting the performance of randomForest are:
mtry (less is faster)
ntrees
number of features/cols in data - more is quadratically slower, or worse! See below
number of observations/rows in data
ncores (more is faster - as long as parallel option is being used)
some performance boost by setting importance=F and proximity=F (don't compute proximity matrix)
Never ever use the insane default nodesize=1, for classification! In Breiman's package, you can't directly set maxdepth, but use nodesize as a proxy for that, and also read all the good advice at: CrossValidated: "Practical questions on tuning Random Forests"
So here your data has 4.2e+5 rows, then if each node shouldn't be smaller than ~0.1%, try nodesize=42. (First try nodesize=420 (1%), see how fast it is, then rerun, adjusting nodesize down. Empirically determine a good nodesize for this dataset.)
runtime is proportional to ~ 2^D_max, i.e. polynomial to (-log1p(nodesize))
optionally you can also speedup by using sampling, see strata,sampsize arguments
Then a first-order estimate of runtime, denoting mtry=M, ntrees=T, ncores=C, nfeatures=F, nrows=R, maxdepth=D_max, is:
Runtime proportional to: T * F^2 * (R^1.something) * 2^D_max / C
(Again, all bets are off if you exceed memory. Also, try running on only one core, then 2, then 4 and verify you actually do get linear speedup. And not slowdown.)
(The effect of large R is worse than linear, maybe quadratic, since tree-partitioning has to consider all partitions of the data rows; certainly it's somewhat worse than linear. Check that by using sampling or indexing to only give it say 10% of rows).
Tip: keeping lots of crap low-importance features quadratically increases runtime, for a sublinear increase in accuracy. This is because at each node, we must consider all possible feature selection (or whatever number mtry) allows. And within each tree, we must consider all (F-choose-mtry) possible combinations of features.
So here's my methodology, doing "fast-and-dirty feature-selection for performance":
generate a tree normally (slow), although use a sane nodesize=42 or larger
look at rf$importances or randomForest::varImpPlot(). Pick only the top-K features, where you choose K; for a silly-fast example, choose K=3. Save that entire list for future reference.
now rerun the tree but only give it newdata[,importantCols]
confirm that speed is quadratically faster, and oob.error is not much worse
once you know your variable importances, you can turn off importance=F
tweak mtry and nodesize (tweak one at a time), rerun and measure speed improvement
plot your performance results on logarithmic axes
post us the results! Did you corroborate the above? Any comments on memory usage?
(Note that the above is not a statistically valid procedure for actual feature-selection, do not rely on it for that, read randomForest package for the actual proper methods for RF-based feature-selection.)

I suspect do.trace might also consume time... instead do.trace = TRUE, you can used do.trace = 5 (to show only 5 traces) just to have some feel about errors. For large dataset, do.trace would take up a lot time as well.

Another think I noticed:
the correct is ntrees, not ntree.
The default is 500 for the randomForest package.

Another option is to actually use more recent packages that are purpose-built for highly dimensional / high volume data sets. They run their code using lower-level languages (C++ and/or Java) and in certain cases use parallelization.
I'd recommend taking a look into these three:
ranger (uses C++ compiler)
randomForestSRC (uses C++ compiler)
h2o (Java compiler - needs Java version 8 or higher)
Also, some additional reading here to give you more to go off on which package to choose: https://arxiv.org/pdf/1508.04409.pdf
Page 8 shows benchmarks showing the performance improvement of ranger against randomForest against growing data size - ranger is WAY faster due to linear growth in runtime rather than non-linear for randomForest for rising tree/sample/split/feature sizes.
Good Luck!

Related

Time complexity of nlm-package in R?

I'm estimating a Non-Linear system (via seemingly unrelated regressions - SUR), using systemfit (nlsystemfit() function) package with 4 equations, 32 parameters to estimate (!) and 412 observations. But my code is taking forever (my laptop it's not a super-powerful one tho). So far, the process was on a 13 hours run. I'm not an expert in computational stuff, but someone explained me some time ago the concept of Time Complexity of the algorithms (or big-o), then depending on this concept the time to compute a certain algorithm could rely on specific functional relation on the number of observations and/or coefficients.
Hence, I'm thinking of just stopping my process, and trying to simplify the model (temporarily) and trying to run something simpler, only to check-up if the estimated parameters had sens so far. And then, run a whole model.
But all this has a sense if I can change key elements in my model, which can reduce the time of processing significantly. That's why I was looking on google about the time complexity of nlm-package (nlsystemfit() function relies on nlm) but unsuccessfully. So, this is my question: Anybody knows where I can find that info, or at least give me advice on how test non-linear systems before run a whole model?
Since you didn't provide any substantial information regarding your model or some code for the same, its hard to express a betterment for your situation.
From what you said:
Hence, I'm thinking of just stopping my process, and trying to simplify the model (temporarily) and trying to run something simpler, only to check-up if the estimated parameters had sens so far. And then, run a whole model.
It seems you require benchmarking or to obtain the measured time taken to execute, as in your case. (although it can deal with memory usage or some other performance metric as well)
There are quite a few ways to benchmark code in R, which include the use of Sys.time() or system.time() just before and right after your algorithm/function executes, or libraries such as rbenchmark (which is a simple wrapper around the system.time function), tictoc, bench and microbenchmark.
Among these the last two are preferable options, as bench::mark includes system_time(), a higher precision alternative to system.time() and microbenchmark is known to be a reliable source to accurately measure and compare the execution time of R expressions/algorithms.

Why does the fitting time of a gam increases with the number of threads used?

Common sense indicates that any computation should be faster the more cores or threads we use. If the scaling is bad, the computation time will not improve with increasing number of threads. Thus, how come increasing threads considerably reduces the computation time when fitting a gam with R package MGCV, as shown by this example? :
library(boot) # loads data "amis"
t1<-Sys.time()
mod <- gam(speed ~ s(period, warning, pair, k = 12), data = amis, family=tw (link = log),method="REML",control=list(nthreads=1)) #
t2<-Sys.time()
print("Model fitted in:")
print(t2-t1)
If you increase the number of threads in this example to 2, 4, etc, the fitting procedure will take longer and longer, instead of being faster as we would expect. In my particular case:
1 thread: 32.85333 secs
2 threads: 50.63166 secs
3 threads: 1.2635 mins
Why is this? If I am doing something wrong, what can I do to obtain the desired behavior (i.e., increasing performance with increasing number of threads)?
Some notes:
1) The model, family and solving method shown here make no particular sense. This is only an example. However, I’ve got into this problem with real data and a reasonable model (but for simplicity I use this small code to exemplify the problem). Data, functional form of model, family, solving method seem all to be irrelevant: after many tests I get always the same behaviour, i.e., increasing the number of used threads, decreases performance (i.e., increases computation time).
2) Operative System: Linux Ubuntu 18.04;
3) Architecture: DELL Power Edge with two physical CPUs Intel Xeon X5660 each of them with 6 cores #2800 Mhz and each core being able of handling 2 threads (i.e., total of 24 threads). 80Gb RAM.
4) OpenMP libraries (which are needed for the multi-threath capacity of function gam) were installed with
sudo apt-get install libomp-dev
5) I am aware of the help page for multi-core use of gam (https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/mgcv-parallel.html). The only thing written there pointing to a decrease of performance with increasing number of threads is "Because the computational burden in mgcv is all in the linear algebra, then parallel computation may provide reduced (...) benefit with a tuned BLAS".

A framework for comparing the time performance of Expectation Maximization

I have my own implementation of the Expectation Maximization (EM) algorithm based on this paper, and I would like to compare this with the performance of another implementation. For the tests, I am using k centroids with 1 Gb of txt data and I am just measuring the time it takes to compute the new centroids in 1 iteration. I tried it with an EM implementation in R, but I couldn't, since the result is plotted in a graph and gets stuck when there's a large number of txt data. I was following the examples in here.
Does anybody know of an implementation of EM that can measure its performance or know how to do it with R?
Fair benchmarking of EM is hard. Very hard.
the initialization will usually involve random, and can be very different. For all I know, the R implementation by default uses hierarchical clustering to find the initial clusters. Which comes at O(n^2) memory and most likely at O(n^3) runtime cost. In my benchmarks, R would run out of memory due to this. I assume there is a way to specify initial cluster centers/models. A random-objects initialization will of course be much faster. Probably k-means++ is a good way to choose initial centers in practise.
EM theoretically never terminates. It just at some point does not change much anymore, and thus you can set a threshold to stop. However, the exact definition of the stopping threshold varies.
There exist all kinds of model variations. A method only using fuzzy assignments such as Fuzzy-c-means will of course be much faster than an implementation using multivariate Gaussian Mixture Models with a covaraince matrix. In particular with higher dimensionality.
Covariance matrixes also need O(k * d^2) memory, and the inversion will take O(k * d^3) time, and thus is clearly not appropriate for text data.
Data may or may not be appropriate. If you run EM on a data set that actually has Gaussian clusters, it will usually work much better than on a data set that doesn't provide a good fit at all. When there is no good fit, you will see a high variance in runtime even with the same implementation.
For a starter, try running your own algorithm several times with different initialization, and check your runtime for variance. How large is the variance compared to the total runtime?
You can try benchmarking against the EM implementation in ELKI. But I doubt the implementation will work with sparse data such as text - that data just is not Gaussian, it is not proper to benchmark. Most likely it will not be able to process the data at all because of this. This is expected, and can be explained from theory. Try to find data sets that are dense and that can be expected to have multiple gaussian clusters (sorry, I can't give you many recommendations here. The classic Iris and Old Faithful data sets are too small to be useful for benchmarking.

How can I tell if R is still estimating my SVM model or has crashed?

I am using the library e1071. In particular, I'm using the svm function. My dataset has 270 fields and 800,000 rows. I've been running this program for 24+ hours now, and I have no idea if it's hung or still running properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data=traindata);
I'm using windows, and using the task manager, the status of Rgui.exe is "Not Responding". Did R crash already? Are there any other tips / tricks to better gauge to see what's happening inside R or the SVM learning process?
If it helps, here are some additional things I noticed using resource monitor (in windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
As I'm writing this thread, I also see "similar questions" and am clicking on them. It seems that SVM training is quadratic or cubic. But still, after 24+ hours, if it's reasonable to wait, I will wait, but if not, I will have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrary long" depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number training examples, so for 800k examples you probably do not want to wait.
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, will usually not achieve the desired result out of the box. Therefore, when you are thinking about how fast you need it to run, keep in mind that you will have to pay the running time every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
By default the function "svm" from e1071 uses radial basis kernel which makes svm induction computationally expensive. You might want to consider using a linear kernel (argument kernel="linear") or use a specialized library like LiblineaR built for large datasets. But your dataset is really large and if linear kernel does not do the trick then as suggested by others you can use a subset of your data to generate the model.

R: SVM performance using laplace kernel is too slow

I've created an SVM in R using the kernlab package, however it's running incredibly slow (20,000 predictions takes ~45 seconds on win64 R distribution). CPU is running at 25% and RAM utilization is a mere 17% ... it's not a hardware bottleneck. Similar calculations using data mining algorithms in SQL Server analysis services run about 40x faster.
Through trial and error, we discovered that the laplacedot kernel gives us the best results by a wide margin. Rbfdot is about 15% less accurate, but twice as fast (but still too slow). The best performance is vanilladot. It runs more or less instantly but the accuracy is way too low to use.
We'd ideally like to use the laplacedot kernel but to do so we need a massive speedup. Does anyone have any ideas on how to do this?
Here is some profiling information I generated using rprof. It looks like most of the time is spent in low level math calls (the rest of the profile consists of similar data as rows 16-40). This should run very quickly but it looks like the code is just not optimized (and I don't know where to start).
http://pastebin.com/yVPC66Be
Edit: Sample code to reproduce:
dummy.length = 20000;
source.data = as.matrix(cbind(sample(1:dummy.length)/1300, sample(1:dummy.length)/1900))
colnames(source.data) <- c("column1", "column2")
y.value = as.matrix((sample(1:dummy.length) + 9) / 923)
model <- ksvm(source.data[,], y.value, type="eps-svr", kernel="laplacedot",C=1, kpar=list(sigma=3));
The source data has 7 numeric columns (floating point) and 20,000 rows. This takes about 2-3 minutes to train. The next call generates the predictions and consistently takes 40 seconds to run:
predictions <- predict(model, source.data)
Edit 2: The Laplacedot kernel calculates the dot product of two vectors using the following formula. This corresponds rather closely with the profr output. Strangely, it appears that the negative symbol (just before the round function) consumes about 50% of the runtime.
return(exp(-sigma * sqrt(-(round(2 * crossprod(x, y) - crossprod(x,x) - crossprod(y,y), 9)))))
Edit 3: Added sample code to reproduce - this gives me about the same runtimes as my actual data.
SVM itself is a very slow algorithm. The time complexity of SVM is O(n*n).
SMO (Sequence Minimum Optimization http://en.wikipedia.org/wiki/Sequential_minimal_optimization) is an algorithm for efficiently solving the optimization problem which arises during the training of support vector machines.
libsvm ( http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and liblinear are two open source implementation.

Resources