How to save my trained Random Forest model and apply it to test data files one by one?

This is a long shot and more of a code-design question from a rookie like me, but I think it has real value for real-world applications.
The core questions are:
Can I save a trained ML model, such as a Random Forest (RF), in R and call/use it later without needing to reload all the data used to train it?
When, in real life, I have a massive folder with hundreds of thousands of data files to be tested, can I load that saved model in R, have it read the unknown files one by one (so I am not limited by RAM size), perform the regression/classification analysis on each file read in, and store ALL the output together in one file?
For example,
If I have 100,000 CSV files of data in a folder, and I want to use 30% of them as a training set and the rest as a test set for a Random Forest (RF) classification:
I can select the files of interest and call them "control files". Then I use fread(), randomly sample 50% of the data in those files, load the caret or randomForest library, and train my "model":
model <- train(x = x, y = y, method = "rf")
Now, can I save the model somewhere, so that I don't have to load all the control files each time I want to use it?
Then I want to apply this model to all the remaining CSV files in the folder, reading those files one by one when applying the model instead of reading them all in at once, because of RAM limits.
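Something like the sketch below is what I have in mind; the folder and output file names are just placeholders, and it assumes each test CSV has the same predictor columns as the training data:
library(caret)
library(data.table)
# save the fitted model once; only the model object is stored, not the control files
saveRDS(model, "rf_model.rds")
# later, in a fresh session, load the model without touching the control files
model <- readRDS("rf_model.rds")
# score the remaining CSV files one at a time, so only one file is in RAM at any moment
test_files <- list.files("data_folder", pattern = "\\.csv$", full.names = TRUE)
for (f in test_files) {
  dt <- fread(f)                          # read a single file
  preds <- predict(model, newdata = dt)   # apply the saved model
  fwrite(data.table(file = basename(f), prediction = preds),
         "all_predictions.csv", append = TRUE)  # collect all output in one file
}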

Related

Store regression models in dataframe

I run a large number of regression analyses using ols and cph (different models, sensitivity analyses, etc.), which takes around two hours on my computer. Therefore, I would like to save these models so that I don't have to re-run the same analyses every time I want to work with them. The models all have very structured names, so I can create a vector of those names as follows:
model.names <- ls()[grep("^im", ls())]
But how can I use this to save those models? Could they be placed into a data frame?
I think you are looking for save()
save writes an external representation of R objects to the specified file. The objects can be read back from the file at a later date by using the function load or attach (or data in some cases).
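For example, a minimal sketch assuming the fitted model objects all start with "im" as above: mget() collects them into a named list, which can then be written to disk and restored in a later session.
model.names <- ls()[grep("^im", ls())]
models <- mget(model.names)              # named list of the fitted models
# option 1: save()/load() restores the objects under their original names
save(list = model.names, file = "models.RData")
load("models.RData")
# option 2: saveRDS()/readRDS() stores the whole list as a single object
saveRDS(models, "models.rds")
models <- readRDS("models.rds")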

R - Is it possible to run two (or more) consoles using the SAME environment simultaneously?

And if so, how?
I use RStudio. I know I can fork a project in order to perform calculations over two copies of the same environment (as described here). However, that doesn't fit my needs, because the environment I'm currently using is very big and I don't have enough RAM to duplicate it.
Therefore, I am wondering whether there is some way to open two (or more) consoles that share the same environment (in particular, I am especially interested in not having to replicate the very big data frames).
Is there a way to use RStudio like this, or is there any other IDE or tool that supports it?
Thank you for your help.
EDIT:
I will explain what I'm trying to do: I'm developing some machine learning models based on a large dataset.
I load the dataset into a data frame.
Then I perform various transformations on the data to make it ML-friendly.
I perform these two steps in one R script, and I end up with an environment loaded with a heavy data frame, libraries and some other objects.
Then I'm using this dataset to feed several ML models: those models are of different classes, and within each class I'm trying several models with different parameters.
I have one R script for each class of models, and I would like to run and score each class in parallel. Each model within a class will run sequentially.
The key here is: I know I can use separate projects to do this, but that would mean loading the same environment several times, which is problematic because it would mean holding several copies of the same big data frame in RAM. Therefore I would like to know whether there is a way to run several R scripts in parallel while sharing the same environment.
Then I will use another script to rank all the models.
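One idea I am considering (just a sketch, and it assumes a Unix-like OS, since forking is not available on Windows) is to keep the heavy environment in a single session and fork the per-class runs from it with parallel::mclapply, so the big data frame is shared copy-on-write rather than duplicated; the fit_class_* functions below are hypothetical stand-ins for my per-class scripts:
library(parallel)
big_df <- readRDS("prepared_dataset.rds")                        # the heavy, ML-ready data frame
fit_class_a <- function(df) lm(y ~ x1 + x2, data = df)           # stand-in for model class A
fit_class_b <- function(df) lm(y ~ poly(x1, 2) + x2, data = df)  # stand-in for model class B
# mclapply forks this session: the children read big_df without copying it
results <- mclapply(list(A = fit_class_a, B = fit_class_b),
                    function(fit) fit(big_df),
                    mc.cores = 2)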

data mining with unstructured data how to implement?

I have unstructured data (screenshots of an app) and semi-structured data (screen dump files), which I chose to store in HBase. My goal is to find defects or issues in the app (meaningful data). Now I'd like to apply data mining to these, so is that a kind of text mining? And how can I apply data mining techniques to this data?
To begin with, you can use a rule-based approach where you define a set of rules that detect the defect scenarios.
Then you can prepare a training data set which has many instances of defect and non-defect scenarios. In this step, for each screenshot or screen dump file you collect, you would manually tag it as defect or non-defect.
Then you can train a classifier using this training data. The classifier would try to generalize from the training samples to predict the output label for samples it has not seen before.
Since your input is non-standard, you might need some preprocessing to convert it to a standard form. For example, to process screenshots you might need image processing, OCR, or computer vision libraries.
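As a minimal sketch of the tag-then-train step (the feature names and labels below are hypothetical, and assume each screenshot or dump has already been preprocessed into one row of features):
library(randomForest)
# one row per screenshot/screen dump, with a manually assigned label
training <- data.frame(
  error_keyword_count = c(3, 0, 5, 1, 0, 4),
  blank_screen        = c(1, 0, 1, 0, 0, 1),
  response_time_ms    = c(900, 120, 1500, 300, 150, 1100),
  label = factor(c("defect", "ok", "defect", "ok", "ok", "defect"))
)
clf <- randomForest(label ~ ., data = training, ntree = 200)   # train on the tagged examples
# predict the label for a new, unseen case
new_case <- data.frame(error_keyword_count = 2, blank_screen = 1, response_time_ms = 800)
predict(clf, new_case)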

How to export an R Random Forest model for use in Excel VBA without API calls

Problem:
I have a Random Forest model trained in R. I need to deploy this model in a standalone Excel tool that will be used by 350 people across a sales network to perform real-time predictions based on data entered into the spreadsheet by users.
How can I do this?
Constraints:
It is not an option to require users to install R on their local machines.
It is not an option to have a server (physical or cloud) providing a scoring API.
What have I done so far?
1. PMML
I can export the model in PMML (XML structure). From research I can see there are libraries for loading and executing PMML inputs in Python and Java. However I haven't found anything implemented in VBA / VB.
2. Zementis
I looked into a solution called Zementis, which offers an Excel add-in to deploy PMML models. However, from my understanding, this requires web-service calls to a cloud server (e.g. AWS) where the actual model execution happens. My IT security department will not allow this.
3. Others
The most common recommendation seems to be to call R to load the model and run the predict function. As noted above, this is not a viable option.
Detailed Context:
The Random Forest model is trained in R, with c. 30 variables. The model is used to recommend "personalised" prices for products as part of a sales process.
The model needs to be distributed to the sales network, with about 350 users. The business's preference is to integrate the model into an existing spreadsheet tool that sales teams currently use to calculate deal profitability.
This means that I need to be able to export the model in a way that it can be implemented in Excel VBA.
Given timescales, the implementation needs to be self-contained with no IT infrastructure or additional application installs. We are working with the organisation's IT team on a server based solution, however their deployment timescales are 12 months+ which means we need a tactical solution in the short-term.
Here's one approach to get the "rules" for the trees (example using the mtcars dataset)
install.packages("randomForest")
library(randomForest)
head(mtcars)
set.seed(1)
fit <- randomForest(mpg ~ ., data=mtcars, importance=TRUE, proximity=TRUE)
print(fit)
## Look at variable importance:
importance(fit)
# Print the rules for each tree in the forest
install.packages("rattle")
library(rattle)
printRandomForests(fit)
It is probably unrealistic to use the rules for 500 trees, but maybe you could implement 100 trees in your VBA and then take an average of the results (for a continuous response) or predict the class with the most votes across the trees (for a categorical response).
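If a flat table is easier to consume from VBA than the printed rules, you could also dump each tree's split table with getTree() and write it to CSV, then walk the splits in VBA (a sketch; the file naming is just an example):
# export each tree's structure: split variable, split point, child nodes, prediction
for (k in seq_len(fit$ntree)) {
  tree_tbl <- getTree(fit, k = k, labelVar = TRUE)
  write.csv(tree_tbl, sprintf("tree_%03d.csv", k), row.names = TRUE)
}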
Maybe you could recreate the model on a Worksheet.
As far as I know, Excel can import XML structures (via the Developer tab).
Edit:
1) Save the PMML structure from a plain-text editor as a .xml file.
2) Open the file in Excel 2013 (other versions may also work).
3) Click through the error message and open the file anyway. The trees open as a table, a bit funny-looking, but recognizable.
4) Create the prediction calculation (a generic function in VBA) to operate on the tree.
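For step 1, one way to produce that .xml file directly from R (a sketch, assuming the pmml package supports your randomForest object):
# install.packages(c("pmml", "XML"))
library(pmml)
library(XML)
pmml_model <- pmml(fit)                       # convert the fitted forest to PMML
saveXML(pmml_model, file = "rf_model.xml")    # write the XML to disk for Excel to import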

In R, is there any way to share a variable between different R processes on the same machine?

My problem is that I have a large model which is slow to load into memory. To test it on many samples, I need to run a C program to generate input features for the model, then run an R script to predict. It takes too much time to load the model every time.
So I am wondering
1) whether there is some way to keep the model (a variable in R) in memory,
or
2) whether I can run a separate R process as a dedicated server, so that all the prediction R processes can access the variable held by that server on the same machine.
The model never changes during prediction. It is a randomForest model stored in a .rdata file of about 500 MB. Loading this model is slow.
I know that I can use parallel R (snow, doPar, etc.) to perform prediction in parallel; however, this is not what I want, since it would require me to change the data flow I use.
Thanks a lot.
If you are regenerating the model every time, you can save the model as an RData file and then share it across the different machines. While it may still take time to load from disk to memory, it will save the time of regenerating.
save(myModel, file="path/to/file.Rda")
# then
load(file="path/to/file.Rda")
Edit, per @VictorK's suggestion:
As Victor points out, since you are saving only a single object, saveRDS may be a better choice.
saveRDS(myModel, file="path/to/file.Rds")
myModel <- readRDS(file="path/to/file.Rds")
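For your second idea (a dedicated server process), one possibility is Rserve: load the model once in a serving session and let the prediction processes ask that session for results (a sketch, assuming the Rserve and RSclient packages and a Unix-like OS; file and object names are placeholders):
# --- serving session: load the 500MB model once, then serve it ---
library(Rserve)
load("model.rdata")   # restores the randomForest object (assumed here to be named `model`)
run.Rserve()          # serves from this session, so `model` stays resident in its environment
# --- each prediction process: send features, get predictions back ---
library(RSclient)
conn <- RS.connect()
new_data <- read.csv("features_from_c_program.csv")
RS.assign(conn, "new_data", new_data)
preds <- RS.eval(conn, predict(model, new_data))
RS.close(conn)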

Resources