I am currently running some ML models on an RStudio Server with 64 GB of RAM.
My ML models run relatively quickly, roughly what one would expect given the size of the sparse matrix.
The methods I have been using are logistic regression and XGBoost.
However, I now want to profile the memory being used at the actual model-fitting stage. I have used profvis, but it does not seem to cope with my matrix of 760 variables by 228,000 rows on the RStudio Server: it never loads the profvis viewer and uses up all 64 GB of RAM!
Is there any way around this (aside from shrinking the data)?
In other words, are there packages besides profvis that let you profile code at any point and see how much memory is being used?
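One option that does not need the interactive viewer is base R's sampling profiler, which writes memory statistics to a file as the code runs. A minimal sketch, assuming the fit is an xgboost run on a sparse model matrix X with labels y (both placeholder names):

## Minimal sketch using base R's Rprof with memory profiling; "X" and "y"
## are placeholders for your own sparse model matrix and label vector.
library(xgboost)

Rprof("fit_profile.out", memory.profiling = TRUE, interval = 0.1)

dtrain <- xgb.DMatrix(X, label = y)
fit <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 100)

Rprof(NULL)                                       # stop profiling
summaryRprof("fit_profile.out", memory = "both")  # per-call time plus memory columns

The samples are written to disk as profiling runs, so inspecting them afterwards does not require a large profiling object in the session. Note that this tracks allocations made through R; memory allocated internally by xgboost's C++ code may not show up, so it is worth watching the process in top or a similar tool alongside it.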
I would like to run several machine-learning techniques (logistic regression, SVM, random forest, neural network) in R on a dataset of 224 GB, while my RAM is only 16 GB.
I suppose one solution could be to rent a virtual machine in the cloud with 256 GB of RAM, for example an EC2 instance at AWS based on an AMI from this post by Louis Aslett:
http://www.louisaslett.com/RStudio_AMI/
Alternatively, I understand there are several parallel-processing methods and packages, for example sparklyr, future, and ff. Is parallel processing a solution to my problem of limited RAM, or is parallel processing aimed at running code faster?
If parallel processing is the solution, I assume I need to modify the processes inside the machine-learning packages. For example, logistic regression is done with this line of code:
model <- glm(Y ~ ., family = binomial(link = "logit"), data = train)
However, as far as I know, I have no influence over the calculations inside the glm method.
Your problem is that you can't fit all the data in memory at once, and the standard glm() function needs that. Luckily, linear and generalized linear models can be computed using the data in batches. The issue is how to combine the computations between the batches.
Parallel algorithms need to break up datasets to send to workers, but if you only have one worker, you'd need to process them serially, so it's only the "breaking up" part that you need. The biglm package in R can do that for your class of models.
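To make that concrete, here is a minimal sketch of bigglm() reading a CSV in chunks; the file name train.csv, the response Y, the predictors x1, x2, x3, and the chunk reader itself are all assumptions to adapt.

## A minimal sketch, not a drop-in solution: the file and column names below
## are placeholders for your own data.
library(biglm)

## bigglm() accepts a function that, when called with reset = FALSE, returns
## the next chunk of rows (or NULL when the file is exhausted), so the full
## 224 GB never has to be in RAM at once.
make_chunk_reader <- function(path, chunk_rows = 100000) {
  con <- NULL
  header <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      header <<- strsplit(readLines(con, n = 1), ",")[[1]]
      return(NULL)
    }
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_rows, col.names = header),
      error = function(e) NULL   # read.csv errors once the file is exhausted
    )
    if (is.null(chunk) || nrow(chunk) == 0) NULL else chunk
  }
}

## Spelling the predictors out is safer than "Y ~ ." with a chunked source.
model <- bigglm(Y ~ x1 + x2 + x3,
                data   = make_chunk_reader("train.csv"),
                family = binomial(link = "logit"))
summary(model)

Because a GLM is fitted by iteratively reweighted least squares, bigglm makes several passes over the file, which is why the reader has to handle reset = TRUE by starting again from the beginning.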
I'd suggest h2o. It has a lot of support for fitting logistic regression, SVMs, random forests, and neural networks, among others.
Here's how to install h2o in R
I also found the bigmemory family of packages to be limited in the functionality available.
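For the logistic-regression case, a fit in h2o looks roughly like the sketch below; the file name, response column, and memory cap are assumptions, and the data are held by h2o's Java process rather than by R.

## A minimal sketch, assuming the data are in "train.csv" with a response
## column named "Y" (both placeholders).
library(h2o)
h2o.init(max_mem_size = "12G")          # cap the Java heap below physical RAM

train <- h2o.importFile("train.csv")    # parsed and stored by the h2o backend
train$Y <- as.factor(train$Y)           # treat the response as categorical
predictors <- setdiff(colnames(train), "Y")

fit <- h2o.glm(x = predictors, y = "Y",
               training_frame = train,
               family = "binomial")     # logistic regression
summary(fit)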
Apologies if this question is too broad.
I'm running a large data set (around 20 GB) through xgb.cv in R on a 64 GB, 4-core Linux machine. I'm currently hitting two issues:
Trying 10-fold cv crashes R (no error from xgboost, session just terminates).
Trying 5-fold, the code will run but reserves 100 GB of virtual memory and slows to a crawl.
I'm confused as to why the code can do 5-fold but not 10-fold; I would have thought each fold would be treated separately and would just take twice as long. What is xgboost doing across all the folds?
As for the swapping, is there any way to manage memory better and avoid the slowdown? The 5-fold CV is taking more than 10 times as long as a single run with a similar number of trees.
Are there any packages better adapted to large data sets, or do I just need more RAM?
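One explanation that fits the numbers: if xgb.cv materialises every fold's train/test split up front, k-fold cross-validation needs roughly k copies of the data, i.e. about 100 GB for 5 folds of a 20 GB set and about 200 GB for 10. A workaround is to loop over the folds yourself so only one split exists at a time. A minimal sketch, in which the sparse matrix X, the labels y, and the parameter values are assumptions:

## Run the folds one at a time so that only a single train/test split is
## materialised at once. "X" and "y" are placeholders for your own data.
library(xgboost)

params <- list(objective   = "binary:logistic",
               tree_method = "hist",   # histogram algorithm; lighter than "exact"
               max_depth   = 6,
               eta         = 0.1)

set.seed(1)
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(X)))

scores <- numeric(k)
for (i in seq_len(k)) {
  dtrain <- xgb.DMatrix(X[folds != i, ], label = y[folds != i])
  dtest  <- xgb.DMatrix(X[folds == i, ], label = y[folds == i])
  fit <- xgb.train(params, dtrain, nrounds = 200,
                   watchlist = list(test = dtest),
                   early_stopping_rounds = 10, verbose = 0)
  scores[i] <- fit$best_score
  rm(dtrain, dtest, fit); gc()          # release this fold before building the next
}
mean(scores)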
I am using Rborist to build a random forest in R. But after building the model on the training set, when I use the predict (predict.Rborist) function, R crashes with the message "R for Windows GUI front-end has stopped working".
I am using a machine with an 8-core CPU and 32 GB of RAM, and my data set has 150k records and 2k variables. Building a random forest on the whole dataset takes roughly 2 hours with parallel processing enabled.
This might be a memory error, but neither the CPU nor the memory usage suggests that. Please help.
Indranil,
This is likely not a memory problem. The predict() method had an error in which the row count was implicitly assumed to be less than or equal to the original training row count. The version on Github repairs this problem and appears to be stable. A new CRAN version is overdue, and awaits several changes.
I am using the e1071 library, in particular its svm function. My dataset has 270 fields and 800,000 rows. The program has now been running for more than 24 hours, and I have no idea whether it has hung or is still working properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data = traindata)
I'm using Windows, and in Task Manager the status of Rgui.exe is "Not Responding". Has R already crashed? Are there any other tips or tricks for gauging what is happening inside R or the SVM learning process?
If it helps, here are some additional things I noticed using Resource Monitor (in Windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
As I'm writing this, I'm also seeing the "similar questions" suggestions and clicking through them. It seems that SVM training time is quadratic or cubic in the number of examples. Still, if it is reasonable to wait after 24+ hours, I will wait; if not, I will have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrarily long", depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number of training examples, so for 800k examples you probably do not want to wait: under that scaling, if 10,000 examples take an hour, 800,000 examples would take on the order of 80² ≈ 6,400 hours.
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, will usually not achieve the desired result out of the box. Therefore, when you are thinking about how fast you need it to run, keep in mind that you will have to pay the running time every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
By default, the svm function from e1071 uses a radial basis kernel, which makes SVM induction computationally expensive. You might want to consider using a linear kernel (argument kernel="linear") or a specialized library like LiblineaR, which is built for large datasets. But your dataset really is large, and if a linear kernel does not do the trick, then, as others have suggested, you can use a subset of your data to build the model.
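To make the two linear routes concrete, a minimal sketch follows; traindata and V260 come from the question, while the cost values are assumptions that would normally be tuned.

## Two linear alternatives to the default radial-kernel svm call.
library(e1071)
library(LiblineaR)

## Option 1: e1071's svm with a linear kernel instead of the default radial one.
svmmodel_lin <- svm(V260 ~ ., data = traindata, kernel = "linear", cost = 1)

## Option 2: LiblineaR, designed for large linear problems. It expects a
## numeric matrix of predictors and a separate target vector.
x <- as.matrix(traindata[, setdiff(names(traindata), "V260")])
y <- traindata$V260
linmodel <- LiblineaR(data = x, target = y,
                      type = 2,    # L2-regularized L2-loss SVC (primal)
                      cost = 1)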
My data contains 229,907 rows and 200 columns. I am training a random forest on it. I know it will take time, but I do not know how much. While running randomForest on this data, R becomes unresponsive: "R Console (64 Bit) (Not Responding)". I just want to know what that means. Is R still working, or has it stopped, so that I should close it and start again?
It's common for RGui to be unresponsive during a long calculation. If you wait long enough, it will usually come back.
The running time won't scale linearly with your data size. With the default parameters, more data means both more observations to process and more nodes per tree. Try building some small forests with ntree=1, different values of the maxnodes parameter and different amounts of data, to get a feel for how long it should take. Have the Windows task manager or similar open at the same time so that you can monitor CPU and RAM usage.
Another thing you can try is making some small forests (small values of ntree) and then using the combine function to make a big forest.
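A minimal sketch of both ideas, growing the forest in bounded pieces and merging them; the data frame train with response Y, the tree counts, and the maxnodes value are assumptions.

## Grow the forest in small, bounded pieces so you can watch time and RAM per
## piece, then merge the pieces into one forest. "train" and "Y" are placeholders.
library(randomForest)

x <- train[, setdiff(names(train), "Y")]
y <- train$Y

rf_parts <- lapply(1:5, function(i) {
  randomForest(x, y, ntree = 100, maxnodes = 64)  # cap tree size to bound memory
})
rf_big <- do.call(combine, rf_parts)   # one forest with 5 * 100 = 500 trees
rf_big$ntree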
You should check your CPU and memory usage. If the R process is still showing high CPU usage, R is probably still going strong.
Consider switching to 32-bit R. For some reason it seems more stable for me, even when my system is perfectly capable of 64-bit support.