Large datasets and xgboost cv in R

Apologies if this question is too broad.
I'm running a large dataset (around 20 GB, on a 64 GB, 4-core Linux machine) through xgb.cv in R. I'm currently hitting two issues:
Trying 10-fold CV crashes R (no error from xgboost, the session just terminates).
Trying 5-fold CV, the code will run but reserves 100 GB of virtual memory and slows to a crawl.
I'm confused as to why the code can do 5-fold but not 10-fold; I would have thought each fold would be treated separately and the run would just take twice as long. What is xgboost doing across all the folds?
Given the swapping, is there any way to better manage memory to avoid the slowdown? The 5-fold CV is taking more than 10 times as long as a single run with a similar number of trees.
Are there any packages better adapted to large datasets, or do I just need more RAM?
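For concreteness, here is a minimal sketch of the kind of xgb.cv call being described, with placeholder names X (feature matrix) and y (response) rather than objects from the question; whether building a single xgb.DMatrix up front helps the memory footprint depends on how xgboost slices the folds internally.

library(xgboost)

# Placeholder objects: X is a numeric or sparse feature matrix, y the response vector.
dtrain <- xgb.DMatrix(data = X, label = y)

cv <- xgb.cv(
  params  = list(objective = "reg:squarederror", max_depth = 6, eta = 0.1, nthread = 4),
  data    = dtrain,
  nrounds = 500,
  nfold   = 5,
  early_stopping_rounds = 20,
  verbose = 1
)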

Related

Parallel cv.glmnet with large matrix in Windows

I'm trying to run parallel cv.glmnet Poisson models on a Windows machine with 64 GB of RAM. My data is a 20 million row x 200 column sparse matrix, around 10 GB in size. I'm using makeCluster and doParallel, and setting parallel = TRUE in cv.glmnet. I currently have two issues with this setup:
Distributing the data to the different processes takes hours, which reduces the speedup significantly. I know this can be solved by forking on Linux machines, but is there any way of reducing this time on Windows?
I'm running this for multiple models with different data and responses, so the object size changes each time. How can I work out in advance how many cores I can run before hitting an 'out of memory' error? I'm particularly confused about how the data gets distributed. If I run on 4 cores, the first rsession will use 30 GB of memory, while the others will be closer to 10 GB. What does that 30 GB go towards, and is there any way of reducing it?
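For reference, a minimal sketch of the setup being described, with placeholder names x_sparse (the sparse model matrix) and y (the Poisson response); on Windows, makeCluster creates a PSOCK cluster, so the data is serialized and copied to each worker rather than shared the way it would be with a fork.

library(glmnet)
library(doParallel)

# Placeholder objects: x_sparse is the 20M x 200 sparse matrix, y the Poisson response.
cl <- makeCluster(4)   # PSOCK cluster on Windows: each worker receives its own copy of the data
registerDoParallel(cl)

fit <- cv.glmnet(x_sparse, y, family = "poisson", parallel = TRUE)

stopCluster(cl)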

How to do memory management, monitoring, and testing in R?

I notice that programming in R constantly causes running-speed problems, especially when the code involves growing a list. It's just very unintuitive what is slowing the program down so dramatically during the loop.
Specifically, I'm using caret to train a gbm model. After getting the tuned hyperparameters, I need to run LOOCV to obtain the test error, which requires training the model n times (n = number of samples). All I store in the list is my prediction result, yet appending to the list gets slower and slower as the loop progresses.
Can you offer some general advice for testing the memory issues related to R programming?
First create an empty list or vector of size n, so that R does not have to reallocate it every time it adds one more value.
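A minimal sketch of the difference, using a trivial placeholder computation in place of the model prediction:

n <- 10000

# Growing the list one element at a time forces R to keep reallocating it,
# so each iteration gets slower as the list gets longer:
preds_slow <- list()
for (i in seq_len(n)) {
  preds_slow[[length(preds_slow) + 1]] <- sqrt(i)  # placeholder for a prediction
}

# Pre-allocating to the final length and assigning by index avoids the repeated reallocation:
preds <- vector("list", n)
for (i in seq_len(n)) {
  preds[[i]] <- sqrt(i)  # placeholder for a prediction
}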

Random forest (Rborist) with large dataset in R

I am using Rborist to build a random forest in R. But after building the model on the training set, calling predict (predict.Rborist) crashes R with the message "R for Windows GUI front-end has stopped working".
I am using a machine with an 8-core CPU and 32 GB RAM, and my dataset has 150k records along with 2k variables. Building a random forest using the whole dataset takes approximately 2 hours with parallel processing enabled.
While this might be a memory error, the CPU and memory usage aren't indicating that. Please help.
Indranil,
This is likely not a memory problem. The predict() method had an error in which the row count was implicitly assumed to be less than or equal to the original training row count. The version on GitHub repairs this problem and appears to be stable. A new CRAN version is overdue and awaits several changes.
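Until the fix reaches CRAN, one hedged workaround that follows from that description would be to predict in chunks no larger than the training row count; model and newdata are placeholder names here, and how the chunk results get combined depends on the object predict() returns.

# Sketch only: keep each prediction batch at or below the training row count.
chunk_size <- 50000  # choose something <= the number of training rows
groups <- split(seq_len(nrow(newdata)), ceiling(seq_len(nrow(newdata)) / chunk_size))
preds <- lapply(groups, function(idx) predict(model, newdata[idx, , drop = FALSE]))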

How can I tell if R is still estimating my SVM model or has crashed?

I am using the library e1071. In particular, I'm using the svm function. My dataset has 270 fields and 800,000 rows. I've been running this program for 24+ hours now, and I have no idea if it's hung or still running properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data=traindata);
I'm using Windows, and in the Task Manager the status of Rgui.exe is "Not Responding". Did R crash already? Are there any other tips or tricks to better gauge what's happening inside R or the SVM learning process?
If it helps, here are some additional things I noticed using Resource Monitor (in Windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
As I'm writing this question, I also see "similar questions" and am clicking through them. It seems that SVM training is quadratic or cubic in the number of rows. Still, if it's reasonable to keep waiting after 24+ hours I will wait, but if not, I will have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrarily long" depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number of training examples (ten times the rows means roughly a hundred times the training time), so for 800k examples you probably do not want to wait.
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, will usually not achieve the desired result out of the box. Therefore, when you are thinking about how fast you need it to run, keep in mind that you will have to pay the running time every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
By default the svm function from e1071 uses a radial basis kernel, which makes SVM induction computationally expensive. You might want to consider using a linear kernel (argument kernel = "linear") or a specialized library like LiblineaR, which is built for large datasets. But your dataset is really large, and if a linear kernel does not do the trick then, as suggested by others, you can use a subset of your data to generate the model.
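A hedged sketch of both suggestions, reusing the traindata/V260 names from the question; the LiblineaR type codes and defaults are worth double-checking in ?LiblineaR:

library(e1071)
# Linear kernel instead of the default radial basis kernel:
svmmodel <- svm(V260 ~ ., data = traindata, kernel = "linear")

library(LiblineaR)
# LiblineaR expects a feature matrix and a separate response vector rather than a formula.
x <- as.matrix(traindata[, setdiff(names(traindata), "V260")])
y <- traindata$V260
linmodel <- LiblineaR(data = x, target = y, type = 2, cost = 1)  # type 2: L2-regularized L2-loss SVC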

SVM modeling with BIG DATA

For SVM modeling in R, I have used the kernlab package (the ksvm method) on a Windows XP machine with 2 GB RAM. But with 201,497 data rows, I am not able to provide enough memory for the modeling (I get the error: cannot allocate vector of size greater than 2.7 GB).
Therefore, I have used Amazon micro and large instances for the SVM modeling, but they have the same issue as the local machine (cannot allocate vector of size greater than 2.7 GB).
Can anyone suggest a solution to this problem for big-data modeling, or is there something wrong with my approach?
Without a reproducible example it is hard to say if the dataset is just too big, or if some parts of your script are suboptimal. A few general pointers:
Take a look at the CRAN High-Performance Computing Task View, which lists the main R packages relevant for working with big data.
You use your entire dataset for training your model. You could try taking a subset (say 10%) and fitting your model on that. Repeating this procedure a few times will yield insight into whether the model fit is sensitive to which subset of the data you use (see the sketch after these pointers).
Some analysis techniques, e.g. PCA, can be done by processing the data iteratively, i.e. in chunks. This makes analyses on very big datasets (>> 100 GB) possible. I'm not sure if this is possible with kernlab.
Check if the R version you are using is 64 bit.
This earlier question might be of interest.
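As a sketch of the subsetting idea above, assuming a data frame dat with response column y (both placeholder names): fit the model on a few random 10% subsets and compare the results.

library(kernlab)

# Placeholder names: dat is the full data frame, y its response column.
set.seed(42)
fits <- lapply(1:5, function(i) {
  idx <- sample(nrow(dat), size = floor(0.1 * nrow(dat)))
  ksvm(y ~ ., data = dat[idx, ])
})
# If the five fits look very different, the model is sensitive to which subset it sees.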
