sklearn-porter export() throws MemoryError - out-of-memory

I'm trying to export my RandomForestClassifier model to Java using sklearn-porter, and it runs out of memory. How can I get past this issue? The Linux process grows to more than 10 GB and then fails; the machine has 24 GB available.
The training set has ~2.5M elements and is likely to grow as we target more classes to classify. The model has 20 features (which could potentially be reduced). Currently we let the trees grow as deep as possible, and I'd hate to have to restrict their depth just to get the model to export to Java.
Related post: Darius Morawiec's response in Exporting python sklearn models to production (java/c++)
Code:
from sklearn_porter import Porter

# train the forest (rfcf, x_train and y_train are defined earlier)
rfcf.fit(x_train, y_train)

# export the trained model as Java source
porter = Porter(rfcf, language='java')
output = porter.export(embed_data=True)
print(output)
BTW, I am retrying now after bumping up the machine memory to 52GB.
Update: It failed again with MemoryError after running for ~30 min.
BTW, I should also add that the same model was successfully exported in the verbose PMML format via sklearn2pmml on a machine with 24 GB RAM. Pickling the model also completed in reasonable time (~30 min), but those files are pretty big (2-6 GB).

Related

xgboost superslow on Google Cloud Compute Engine

I am trying to train a list of R caret models on Google Cloud Compute Engine (Ubuntu 16.04 LTS). The xgboost models (both xgbLinear and xgbTree) take forever to complete training. In fact, the CPU utilization is always 0 according to GCP status monitoring.
I used the doMC library for parallel execution. It works very well for models like C5.0, glmnet and gbm. However, for xgboost (both xgbLinear and xgbTree), for some reason the CPU seems idle: the utilization stays at 0. Troubleshooting:
1. Removed doMC and ran with a single core only; the same problem remained.
2. Changed the parallel execution library to doParallel instead of doMC. This time the CPU utilization went up, but the training took 5 minutes to complete on GCP. The same code finished in just 12 seconds on my local laptop. (I used 24 CPUs on GCP and 4 CPUs on my local laptop.)
3. The doMC parallel execution works well for other algorithms. Only xgboost has this problem.
Code:
library(caret)
library(doMC)
xgblinear_Grid <- expand.grid(nrounds = c(50, 100),
                              lambda = c(.05, .5),
                              alpha = c(.5),
                              eta = c(.3))
# mc, formula2, train_data, metric and fitControl are defined earlier
registerDoMC(cores = mc - 1)
set.seed(123)
xgbLinear_varimp <- train(formula2, data = train_data, method = "xgbLinear",
                          metric = metric, tuneGrid = xgblinear_Grid,
                          trControl = fitControl,
                          preProcess = c("center", "scale", "zv"))
print(xgbLinear_varimp)
No error message is generated; it simply runs endlessly. (R sessionInfo output omitted.)
I encountered the same problem, and it took a long time to understand the three reasons behind it:
xgbLinear requires more memory than any other machine learning algorithm available in the caret library. For every core, you can assume at least 1 GB of RAM even for tiny datasets of only 1000 x 20, and more for bigger datasets.
xgbLinear in combination with parallel execution has a final process that recollects the data from the threads. This process is usually responsible for the 'endless' execution time. Again, RAM is the limiting factor. You might have seen the following error message, which is often caused by allocating too little RAM:
Error in unserialize(socklist[[n]]) : error reading from connection
xgbLinear has its own parallel processing algorithm which gets mixed up with the doParallel algorithm. Here, the effective solution is to set xgbLinear to single-threaded via an additional parameter in caret::train(), nthread = 1, and let doParallel do the parallelization (see the sketch below).
As an illustration of (1), memory utilization nears 80 GB, and reaches 235 GB when training a still tiny dataset of 2500 x 14 dimensionality (screenshots not reproduced here).
As an illustration of (2), that final recollection step is the process that takes forever if you don't have enough memory (screenshot not reproduced here).
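A minimal sketch of point (3), reusing the grid and control objects from the question (xgblinear_Grid, formula2, train_data, metric and fitControl are assumed to be defined as above; the cluster size of 4 is illustrative): pass nthread = 1 through caret::train() so xgboost stays single-threaded and doParallel handles the worker processes.
library(caret)
library(doParallel)
# let doParallel manage parallelism across resamples and tuning rows
cl <- makePSOCKcluster(4)   # choose a core count that fits your RAM
registerDoParallel(cl)
set.seed(123)
xgbLinear_fit <- train(formula2, data = train_data, method = "xgbLinear",
                       metric = metric, tuneGrid = xgblinear_Grid,
                       trControl = fitControl,
                       preProcess = c("center", "scale", "zv"),
                       nthread = 1)   # keep xgboost itself single-threaded
stopCluster(cl)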

How to profile code on RStudio Server if profvis keeps crashing?

I am currently running some ML models on an RStudio Server with 64 GB of RAM.
My ML models run relatively quickly, about what one would normally expect given their sparse matrix size.
The methods I have been using are logistic regression and XGBoost.
However, I now want to profile the memory being used at the actual model-fitting stage. I have used profvis, but it does not seem to work on my matrix of 760 variables by 228,000 rows on the RStudio Server: it never loads the actual profvis viewer and uses up all 64 GB of RAM!
Is there any way around this (aside from shrinking the data)?
That is, are there other packages besides profvis that allow you to profile code at any point and see how much memory is being used?

How to export an R Random Forest model for use in Excel VBA without API calls

Problem:
I have a Random Forest model trained in R. I need to deploy this model in a standalone Excel tool that will be used by 350 people across a sales network to perform real-time predictions based on data entered into the spreadsheet by users.
How can I do this?
Constraints:
It is not an option to require users to install R on their local machines.
It is not an option to have a server (physical or cloud) providing a scoring API.
What have I done so far?
1. PMML
I can export the model in PMML (XML structure). From research I can see there are libraries for loading and executing PMML inputs in Python and Java. However I haven't found anything implemented in VBA / VB.
2. Zementis
I looked into a solution called Zementis which offers an Excel add-in to deploy PMML models. However from my understanding this requires web-service calls to a cloud server (e.g. AWS) where the actual model execution happens. My IT security department will not allow this.
3. Others
The most common recommendation seems to be to call R to load the model and run the predict function. As noted above, this is not a viable option.
Detailed Context:
The Random Forest model is trained in R, with c. 30 variables. The model is used to recommend "personalised" prices for products as part of a sales process.
The model needs to be distributed to the sales network, with about 350 users. The business's preference is to integrate the model into an existing spreadsheet tool that sales teams currently use to calculate deal profitability.
This means that I need to be able to export the model in a way that it can be implemented in Excel VBA.
Given timescales, the implementation needs to be self-contained with no IT infrastructure or additional application installs. We are working with the organisation's IT team on a server based solution, however their deployment timescales are 12 months+ which means we need a tactical solution in the short-term.
Here's one approach to get the "rules" for the trees (example using the mtcars dataset)
install.packages("randomForest")
library(randomForest)
head(mtcars)
set.seed(1)
fit <- randomForest(mpg ~ ., data=mtcars, importance=TRUE, proximity=TRUE)
print(fit)
## Look at variable importance:
importance(fit)
# Print the rules for each tree in the forest
install.packages("rattle")
library(rattle)
printRandomForests(fit)
It is probably unrealistic to use the rules for all 500 trees, but maybe you could implement 100 trees in your VBA and then take the average of the results (for a continuous response) or predict the class with the most votes across the trees (for a categorical response).
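As a rough sketch of that idea (ntree = 100 is illustrative, and small_fit/tree1 are names introduced here): retrain with fewer trees and pull each tree's split table with randomForest::getTree(), which gives the split variable, split point and terminal prediction you would need to re-implement the tree walk in VBA.
set.seed(1)
small_fit <- randomForest(mpg ~ ., data = mtcars, ntree = 100)
# split table for the first tree: left/right daughter nodes, split variable,
# split point, node status and terminal prediction
tree1 <- getTree(small_fit, k = 1, labelVar = TRUE)
head(tree1)
# For a continuous response, average the 100 per-tree predictions in VBA;
# for a categorical response, take the majority vote across trees.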
Maybe you could recreate the model on a Worksheet.
As far as I know, Excel can import XML structures (on the Development Tools ribbon).
Edit:
1) Save the PMML structure from a plain-text editor as an .xml file.
2) Open the file in Excel 2013 (other versions may work too).
3) Click through the error message and open the file anyway. The trees open as a table, a bit oddly formatted, but recognizable.
4) Create the prediction calculation (a generic function in VBA) to operate on the tree.
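For step 1, a rough sketch of producing that PMML .xml file directly from R, assuming the pmml and XML packages are available (the output file name is arbitrary):
install.packages(c("pmml", "XML"))
library(pmml)
library(XML)
# convert the fitted randomForest object to PMML and write it to disk
# so it can be opened in Excel as described above
saveXML(pmml(fit), file = "rf_model.xml")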

Random forest (Rborist) with large dataset in R

I am using Rborist to build a random forest in R. But after building the model on the training set, when I call the predict (predict.Rborist) function, R crashes with the message "R for Windows GUI front-end has stopped working".
I am using a machine with an 8-core CPU and 32 GB RAM, and my data set has 150k records with 2k variables. Building a random forest on the whole dataset takes roughly 2 hours with parallel processing enabled.
While this might be a memory error, neither the CPU nor the memory usage indicates that. Please help.
Indranil,
This is likely not a memory problem. The predict() method had an error in which the row count was implicitly assumed to be less than or equal to the original training row count. The version on GitHub repairs this problem and appears to be stable. A new CRAN version is overdue and awaits several changes.
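A hedged sketch of installing the patched development version with the remotes package; the GitHub repository path and subdirectory below are assumptions, so check the Rborist documentation for the actual location:
install.packages("remotes")
# repository path and subdirectory are assumptions -- verify against the package docs
remotes::install_github("suiji/Arborist", subdir = "Rborist")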

In R, is there any way to share a variable between different R processes on the same machine?

My problem is that I have a large model which is slow to load into memory. To test it on many samples, I need to run a C program to generate input features for the model, then run an R script to predict. Loading the model every time takes too much time.
So I am wondering:
1) Is there some way to keep the model (a variable in R) in memory?
or
2) Can I run a separate R process as a dedicated server, so that all the prediction processes of R can access the variable held by the server on the same machine?
The model never changes across all the predictions. It is a randomForest model stored in a .rdata file of ~500 MB. Loading this model is slow.
I know that I can use parallel R (snow, doPar, etc.) to perform prediction in parallel; however, this is not what I want, since it would require me to change the data flow I use.
Thanks a lot.
If you are regenerating the model every time, you can save the model as an RData file and then share it across the different machines. While it may still take time to load from disk to memory, it will save the time of regenerating.
save(myModel, file="path/to/file.Rda")
# then
load(file="path/to/file.Rda")
Edit, per @VictorK's suggestion:
As Victor points out, since you are saving only a single object, saveRDS may be a better choice.
saveRDS(myModel, file="path/to/file.Rds")
myModel <- readRDS(file="path/to/file.Rds")

Resources