I am using Rborist to build a random forest in R. After fitting the model on the training set, calling the predict function (predict.Rborist) crashes R with the message "R for Windows GUI front-end has stopped working".
I am using a machine with an 8-core CPU and 32 GB of RAM, and my data set has 150k records and 2k variables. Building a random forest on the whole dataset takes approximately 2 hours with parallel processing enabled.
While this might be a memory error, neither CPU nor memory usage suggests that. Please help.
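Roughly what the code looks like, as a minimal sketch (the data below are random placeholders rather than my real script, and I am assuming the usual Rborist(x, y) / predict(model, newdata) interface):

library(Rborist)

# Placeholder training data: a numeric predictor matrix and a factor response.
x_train <- matrix(rnorm(1000 * 20), nrow = 1000)
y_train <- factor(sample(c("a", "b"), 1000, replace = TRUE))

rb <- Rborist(x_train, y_train)

# New data to score; this predict call is where R crashes.
x_new <- matrix(rnorm(5000 * 20), nrow = 5000)
preds <- predict(rb, x_new)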
Indranil,
This is likely not a memory problem. The predict() method had a bug in which the number of prediction rows was implicitly assumed to be less than or equal to the original training row count. The version on GitHub repairs this problem and appears to be stable. A new CRAN version is overdue and awaits several changes.
I am fitting a CART decision tree on a training set that I tokenized with quanteda for a routine text-analysis task. The resulting document-feature matrix (DFM) was converted to a data frame, and the class attribute I am predicting was appended to it.
Like many DFMs, the table is very wide (33k columns), but only contains about 5,500 rows of documents. Calling rpart on my training set returns a stack overflow error.
If it matters, to help increase the speed of calculations, I am using the doSNOW library so I can run the model on 3 out of 4 of my cores in parallel.
I've looked at this answer but can't figure out how to do the equivalent on my Mac workstation to see whether the same solution would work for me. There is a chance that even if I increase R's max-ppsize in RStudio, I may still run into this error.
So my question is: how do I increase the max-ppsize for RStudio on a Mac, or, more generally, how can I fix this stack overflow so I can run my model?
Thanks!
In the end, I found that Macs don't have the same command-line option, since the Mac version of RStudio uses all available memory by default.
So the way I fixed this was by decreasing the complexity of the task through reducing sparsity: I cleaned the document-feature matrix by removing all tokens that did not occur in at least 5% of the corpus. This was enough to take a matrix with 33k columns down to a much more manageable 3k columns, while still giving a highly representative DFM.
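In case it helps anyone else, the trimming looked roughly like this with quanteda's dfm_trim (object names are illustrative, and I am assuming the class label is stored as a document variable called label; adjust to your own setup):

library(quanteda)

# my_dfm is the document-feature matrix produced by dfm()
# keep only features that occur in at least 5% of documents
my_dfm_trimmed <- dfm_trim(my_dfm, min_docfreq = 0.05, docfreq_type = "prop")

# convert back to a data frame and re-attach the class attribute for rpart
train_df <- convert(my_dfm_trimmed, to = "data.frame")
train_df$doc_id <- NULL                              # drop the document id column
train_df$label <- docvars(my_dfm_trimmed, "label")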
Apologies if this question is too broad.
I'm running a large data set (around 20 GB, on a 64 GB, 4-core Linux machine) through xgb.cv in R. I'm currently hitting two issues:
Trying 10-fold cv crashes R (no error from xgboost, session just terminates).
Trying 5-fold, the code runs but reserves 100 GB of virtual memory and slows to a crawl.
I'm confused as to why the code can do 5-fold but not 10-fold; I would have thought each fold would be treated separately and would simply take twice as long. What is xgboost doing across all the folds?
Given the swapping, is there any way to better manage memory to avoid the slowdown? The 5-fold CV is taking more than 10 times as long as a single run with a similar number of trees.
Are there any packages better adapted to large data sets, or do I just need more RAM?
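For reference, the call looks roughly like this (a simplified sketch rather than my exact code; the objective and tuning values are placeholders):

library(xgboost)

# X is the numeric feature matrix, y the label vector
dtrain <- xgb.DMatrix(data = X, label = y)

cv <- xgb.cv(
  params = list(objective = "binary:logistic", max_depth = 6, eta = 0.1),
  data = dtrain,
  nrounds = 200,
  nfold = 5,                     # crashes when set to 10
  early_stopping_rounds = 10,
  verbose = TRUE
)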
I am using the e1071 library, in particular its svm function. My dataset has 270 fields and 800,000 rows. The program has been running for more than 24 hours now, and I have no idea whether it has hung or is still running properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data=traindata);
I'm on Windows, and in Task Manager the status of Rgui.exe is "Not Responding". Has R already crashed? Are there any other tips or tricks to gauge what's happening inside R or the SVM learning process?
If it helps, here are some additional things I noticed using Resource Monitor (in Windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
While writing this question, I also looked at the "similar questions" suggestions. It seems that SVM training time is quadratic or cubic in the number of examples. Still, after 24+ hours, if it's reasonable to keep waiting I will wait, but if not, I will have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrarily long", depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number of training examples, so for 800k examples you probably do not want to wait. For a rough sense of scale: if training on 10k examples took one minute, then 800k examples (80 times as many) would take on the order of 80^2 = 6,400 minutes, i.e. more than four days.
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, will usually not achieve the desired result out of the box. So when thinking about how fast you need it to run, remember that you will have to pay that running time every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
By default, the svm function from e1071 uses a radial basis kernel, which makes SVM induction computationally expensive. You might want to consider using a linear kernel (argument kernel="linear") or a specialized library such as LiblineaR, which is built for large datasets. But your dataset is really large, and if a linear kernel does not do the trick then, as suggested by others, you can fit the model on a subset of your data.
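For instance, something along these lines, reusing the traindata object and V260 response from the question (the cost values and the LiblineaR type code are placeholders to tune rather than recommendations):

library(e1071)
library(LiblineaR)

# Option 1: linear kernel in e1071 (much cheaper than the default radial kernel)
svmmodel <- svm(V260 ~ ., data = traindata, kernel = "linear", cost = 1)

# Option 2: LiblineaR, which expects a numeric matrix plus a separate target vector
# (assumes the predictor columns are numeric)
x <- as.matrix(traindata[, setdiff(names(traindata), "V260")])
y <- traindata$V260
linmodel <- LiblineaR(data = x, target = y, type = 2, cost = 1)  # see ?LiblineaR for the solver type codes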
My data contains 229,907 rows and 200 columns. I am training randomForest on it. I know it will take time, but I do not know how much. While running randomForest on this data, R becomes unresponsive: "R Console (64 Bit) (Not Responding)". I just want to know what this means. Is R still working, or has it stopped working so that I should close it and start again?
It's common for RGui to be unresponsive during a long calculation. If you wait long enough, it will usually come back.
The running time won't scale linearly with your data size. With the default parameters, more data means both more observations to process and more nodes per tree. Try building some small forests with ntree=1, different values of the maxnodes parameter and different amounts of data, to get a feel for how long it should take. Have the Windows task manager or similar open at the same time so that you can monitor CPU and RAM usage.
Another thing you can try is making some small forests (small values of ntree) and then using the combine function to make a big forest.
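A rough sketch of both suggestions, assuming a data frame called train with a factor response column y (the ntree, maxnodes and subset sizes are arbitrary values to experiment with):

library(randomForest)

# Time a tiny forest on a subset of rows to get a feel for how the cost scales
system.time(
  rf_probe <- randomForest(y ~ ., data = train[1:20000, ], ntree = 1, maxnodes = 64)
)

# Grow several small forests, then merge them into one larger ensemble
parts <- lapply(1:5, function(i) randomForest(y ~ ., data = train, ntree = 20))
rf_full <- do.call(combine, parts)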
You should check your CPU and memory usage. If the R process is still showing high CPU usage, R is probably still going strong.
Consider switching to 32-bit R. For some reason it seems more stable for me, even though my system is perfectly capable of 64-bit support.
For SVM modeling in R, I have used the kernlab package (the ksvm method) on a Windows XP machine with 2 GB of RAM. But with 201,497 data rows, I am not able to provide enough memory for the modeling (I get the error that it cannot allocate a vector of size greater than 2.7 GB).
Therefore, I have tried Amazon micro and large instances for the SVM modeling, but they show the same issue as the local machine (cannot allocate a vector of size greater than 2.7 GB).
Can anyone suggest a solution to this problem of modeling with big data, or is there something wrong with what I am doing?
Without a reproducible example it is hard to say if the dataset is just too big, or if some parts of your script are suboptimal. A few general pointers:
Take a look at the High Performance Computing Task View, which lists the main R packages relevant for working with big data.
You use your entire dataset for training your model. You could instead take a subset (say 10%) and fit the model on that. Repeating this procedure a few times will show whether the model fit is sensitive to which subset of the data you use; see the sketch at the end of this answer.
Some analysis techniques, e.g. PCA, can be performed by processing the data iteratively, i.e. in chunks. This makes analyses on very big datasets (>> 100 GB) possible. I'm not sure whether this is possible with kernlab.
Check if the R version you are using is 64 bit.
This earlier question might be of interest.
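As an illustration of the subsetting idea (a minimal sketch; mydata, the label column and the rbfdot kernel are placeholders, not something specific to your data):

library(kernlab)

set.seed(1)
fits <- lapply(1:5, function(i) {
  # draw a random 10% subset of the rows
  idx <- sample(nrow(mydata), size = floor(0.1 * nrow(mydata)))
  ksvm(label ~ ., data = mydata[idx, ], kernel = "rbfdot", C = 1)
})

# compare, e.g., the training errors across the five subsets
sapply(fits, error)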