Options for deploying R models in production

There don't seem to be many options for deploying predictive models in production, which is surprising given the explosion in Big Data.
I understand that the open-source PMML standard can be used to export models as an XML specification, which can then be used for in-database scoring/prediction. However, it seems that to make this work you need the PMML plugin by Zementis, which means the solution is not truly open source. Is there an easier, open way to map PMML to SQL for scoring?
Another option would be to use JSON instead of XML to output model predictions. But in that case, where would the R model sit? I'm assuming it would always need to be mapped to SQL... unless the R model could sit on the same server as the data and run against the incoming data via an R script?
Any other options out there?

The following is a list of the alternatives that I have found so far for deploying an R model in production. Note that the workflow varies significantly from one product to the next, but they are all in some way oriented toward exposing a trained R model as a service:
openCPU
AzureML
DeployR
yhat (already mentioned by @Ramnath)
Domino
Sense.io

The answer really depends on what your production environment is.
If your "big data" are on Hadoop, you can try this relatively new open source PMML "scoring engine" called Pattern.
Otherwise you have no choice (short of writing custom model-specific code) but to run R on your server. You would use save to store your fitted models in .RData files, then load them and run the corresponding predict on the server. (That is bound to be slow, but you can always try to throw more hardware at it.)
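A minimal sketch of that save/load/predict workflow (the file path, model, and data are only illustrative):

    # On the modelling machine: fit a model and persist it
    fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
    save(fit, file = "model.RData")    # writes the fitted object to an .RData file

    # On the server: restore the model and score incoming data
    load("model.RData")                # brings the object `fit` back into the session
    incoming <- mtcars[1:5, ]          # stand-in for new rows arriving on the server
    scores <- predict(fit, newdata = incoming, type = "response")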
How you do that really depends on your platform. Usually there is a way to add "custom" functions written in R; the term to look for is UDF (user-defined function). In Hadoop you can add such functions to Pig (e.g. https://github.com/cd-wood/pigaddons), or you can use RHadoop to write simple map-reduce code that loads the model and calls predict in R. If your data are in Hive, you can use Hive TRANSFORM to call an external R script.
There are also vendor-specific ways to add functions written in R to various SQL databases. Again, look for UDF in the documentation. For instance, PostgreSQL has PL/R.

You can create RESTful APIs for your R scripts using plumber (https://github.com/trestletech/plumber).
I wrote a blog post about it (http://www.knowru.com/blog/how-create-restful-api-for-machine-learning-credit-model-in-r/) using the deployment of credit models as an example.
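As a rough sketch of what a plumber endpoint can look like (the model file, field names, and port are only placeholders):

    # api.R -- plumber annotations turn this function into a REST endpoint
    model <- readRDS("model.rds")      # a previously trained model; path is illustrative

    #* Score incoming records posted as JSON
    #* @post /predict
    function(req) {
      new_data <- as.data.frame(jsonlite::fromJSON(req$postBody))
      predict(model, newdata = new_data, type = "response")
    }

    # Then, from another script or the console:
    # plumber::plumb("api.R")$run(port = 8000)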
In general, I do not recommend PMML because the packages you used might not support translation to PMML.

A common practice is scoring a new/updated dataset in R and moving only the results (IDs, scores, probabilities, other necessary fields) into the production environment/data warehouse.
I know this has its limitations (infrequent refreshes, reliance upon IT, data set size and computing power restrictions) and may not be the cutting-edge answer many (of your bosses) are looking for; but for many use cases this works well (and is cost friendly!).
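A minimal sketch of that pattern, assuming the warehouse is reachable through DBI (the driver, table name, and score values are all placeholders):

    library(DBI)

    # Only the scored output leaves R, not the model or the raw features
    scores <- data.frame(
      id        = c(101, 102, 103),    # record identifiers (illustrative)
      score     = c(0.12, 0.87, 0.45), # model probabilities (illustrative)
      scored_at = Sys.time()
    )

    con <- dbConnect(RPostgres::Postgres(), dbname = "warehouse")  # assumed driver/connection
    dbWriteTable(con, "model_scores", scores, append = TRUE)
    dbDisconnect(con)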

It’s been a few years since the question was originally asked.
For rapid prototyping I would argue the easiest approach currently is to use the Jupyter Kernel Gateway. This allows you to add REST endpoints to any cell in your Jupyter notebook. This works for both R and Python, depending on the kernel you’re using.
This means you can easily call any R or Python code through a web interface. When used in conjunction with Docker it lends itself to a microservices approach to deploying and scaling your application.
Here’s an article that takes you from start to finish to quickly set up your Jupyter Notebook with the Jupyter Kernel Gateway.
Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users
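As a rough illustration, assuming the gateway's notebook-http mode (where a comment annotation marks a cell as an endpoint and the request is injected as a JSON string named REQUEST; details can differ between versions), an R notebook cell might look like this:

    # GET /score
    req <- jsonlite::fromJSON(REQUEST)              # REQUEST is supplied by the kernel gateway
    x   <- as.numeric(req$args$x)                   # query parameter, e.g. /score?x=1.5
    cat(jsonlite::toJSON(list(score = x * 2)))      # whatever the cell prints becomes the response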
For moving solutions to production the leading approach in 2019 is to use Kubeflow. Kubeflow was created and is maintained by Google, and makes "scaling machine learning (ML) models and deploying them to production as simple as possible."
From their website:
You adapt the configuration to choose the platforms and services that you want to use for each stage of the ML workflow: data preparation, model training, prediction serving, and service management.
You can choose to deploy your workloads locally or to a cloud environment.

Elise from Yhat here.
As @Ramnath and @leo9r mentioned, our software allows you to put any R (or Python, for that matter) model directly into production via REST API endpoints.
We handle real-time or batch, as well as all of the model testing and versioning + systems management associated with the process.
This case study we co-authored with VIA SMS might be useful if you're thinking about how to get R models into production (their data science team was recoding models into PHP prior to using Yhat).
Cheers!

Related

Azure GUIX automated testing

Can one implement automated UI testing for any UI designed using Azure GUIX? I came across an earlier post asking the same question.
Azure GUIX automated testing specifically for languages
I searched for Azure Test Harness but could not find much on it. Any suggestions?
We have added a string-fit test in GUIX Studio for the 6.1.9.1 patch release, under Edit -> Run String Fit Test. This release has been pushed to the App Store and will be rolling out over the next couple of days. The test ensures that every string, in every language, fits within its assigned widgets. Of course you can assign and re-assign strings at runtime, which Studio cannot test for; only the known assignments of widgets and fonts can be statically tested by GUIX Studio.
For our own internal regression testing, we build our test apps as Win32 apps and write Python scripts that generate events into those apps to drive them. We do md5sum calculations of the canvas memory and compare the computed values with "golden values" to ensure nothing has been broken. We haven't yet instrumented anything similar to support on-target regression testing, but we have this feature in our backlog; I will see if we can get it onto the priority list for the next release.
Best Regards

Is there a way to protect my R code that runs on a AWS account owned by a client?

I just joined a company that needs to build an ETL pipeline inside an AWS account owned by a client.
There's one part of the ETL pipeline that runs code written in R. The problem is, this R code is a very important part of our business and our intellectual property. Our clients can't see this code.
Is there any way to run this in their AWS environment without them having access to our code? R is not compiled, so we can't just deploy an executable file there. And we HAVE to run this in their environment. I suggested creating an API so the code would run in our own AWS environment, but this is not an option.
In my experience, these are the options I've come across in situations like this, in increasing order of difficulty:
Take the computation off-premises. This sounds like not an option for you.
Generate an API (e.g., shiny, opencpu, plumber) that is callable from their premises. This might require some finessing on their end, as I'm inferring (since they want it all done within their environment) that they might prefer a locked-down computation (perhaps disabling network access).
Rewrite the sensitive portions in Rcpp. While this has the possible benefit of speed improvements, it makes it slightly harder for them to "discover" the underlying intellectual property. Realize that R and Rcpp are both GPL, which means that anything linked to by R must also be GPL, meaning source-code available. (It is feasible that, since you are not making it public, you could argue your case here, but I am not a lawyer and would not want to be the first consultant found on the wrong side of GPL law. Again, IANAL.)
Rewrite the sensitive portions as a non-R executable (note that I don't say "as a non-R library and link to it via R calls", since the linking action taints the library with R's GPL). This executable can be called by your otherwise releasable R package (via system or processx::run); a short sketch follows at the end of this answer.
(For the record, one might infer C or C++ here, but other higher-level languages also allow compiled executables and are not GPL. Python has some such modes. Be sure to obfuscate your variables :-)
I think your "safest" options are #2 and #4.

Can you have multiple plans using R package drake?

I know it is not best practice to use the R package called drake within a notebook tool, but I'm doing it anyway as a workaround for the limitations to the collaboration infrastructure we have on my team at work. Since my code is broken up into chunks that are distributed throughout sections of the notebook, it would be useful to have multiple analysis plans, which I would execute in the appropriate section, and other plans may be written and executed in subsequent sections of the notebook. Is it possible to write multiple plans in drake?
Sorry I am late to this thread. I am the maintainer of the drake R package, and I usually expect to receive questions on the issue tracker. A drake-r-package StackOverflow tag would really help me keep up, but I do not have that privilege.
Anyway, interesting use case. I do see some workarounds:
Separate caches for separate plans (see the sketch below). drake uses storr to cache its targets, and you could create different caches for different sections of your report. Essentially, your report would manage a bunch of separate drake projects. See this chapter in the manual for more on the caching system. Use the cache argument to make() to supply a manual or non-default storr cache.
Separate plans and a single cache. Here, you would need to ensure that each plan has a completely unique set of targets. If there is overlap, then some targets will always rebuild whenever you run the report.
A single cumulative plan. Essentially, when it comes time to build additional targets as you move through the report, you can add new rows to an existing plan. In fact, this is the recommended approach for large complex projects (related example here). For even more control, use the targets argument to make() to only build a select few targets and their out-of-date dependencies.
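A rough sketch of the first workaround (separate caches for separate plans); the target names, commands, and cache locations are only illustrative:

    library(drake)

    # Section 1 of the notebook: its own plan and its own cache
    plan_1  <- drake_plan(raw = read.csv(file_in("data.csv")), summary_1 = summary(raw))
    cache_1 <- storr::storr_rds(".drake_section1")
    make(plan_1, cache = cache_1)

    # Section 2 of the notebook: an independent plan backed by a second cache
    plan_2  <- drake_plan(model = lm(mpg ~ wt, data = mtcars))
    cache_2 <- storr::storr_rds(".drake_section2")
    make(plan_2, cache = cache_2)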
I'm not sure I understand the question. We often use drake in Jupyter notebooks, and we do try to support that use case (via the Python bindings).
By “plans”, do you mean mathematical programs? Or inverse kinematics calls? Both should be ok in a notebook framework. Or are you actually calling them in parallel?
Not sure how R fits into it?

Using R in Apache Spark

There are some options to access R libraries in Spark:
directly using sparkr
using language bindings like rpy2 or rscala
using standalone service like opencpu
It looks like SparkR is quite limited, OpenCPU requires keeping an additional service running, and the bindings can have stability issues. Is there something else specific to the Spark architecture that makes none of these solutions easy to use?
Do you have any experience with integrating R and Spark you can share?
The main language for the project seems like an important factor.
If pyspark is a good way for you to use Spark (meaning that you are accessing Spark from Python), accessing R through rpy2 should not make much difference from using any other Python library with a C extension.
There are reports of users doing so (although with occasional questions such as "How can I partition pyspark RDDs holding R functions" or "Can I connect an external (R) process to each pyspark worker during setup").
If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations would be the way to go.
If your main language is Scala, rscala should be your first try.
While the combo pyspark + rpy2 would seem the most "established" (as in "uses the oldest and probably most-tried codebase"), this does not necessarily mean that it is the best solution (and young packages can evolve quickly). I'd first assess what the preferred language for the project is and try options from there.
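For reference, a minimal sketch of the SparkR route mentioned in the question (assuming Spark 2.x or later with SPARK_HOME configured; the data is just a built-in example set):

    library(SparkR)
    sparkR.session(appName = "sparkr-example")

    # Distribute a local data frame and run a simple aggregation on the cluster
    df <- as.DataFrame(faithful)
    head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))

    # Fit one of the MLlib models exposed through SparkR
    model <- spark.glm(df, eruptions ~ waiting, family = "gaussian")
    summary(model)

    sparkR.session.stop()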

Google-analytics framework for predictive analysis

I'm trying to use the Google Analytics framework to create predictive analysis tools. For example, I would like to cluster my webpage visitors, etc.
In general, is there a list of machine learning algorithms implemented by this framework? For example: regression, clustering, classification, feature selection, etc.
Thank you for any help
Depending upon your language of choice, you might want to export your Google Analytics metrics to flat files or a database and then start experimenting with ML models. Two popular languages with stable ML implementations are Python and R. R's caret package includes tools for building a predictive model pipeline. Python's scikit-learn also contains implementations of all major classes of ML algorithms.
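As a small sketch of that workflow in R, here with base R's kmeans (since the question mentions clustering visitors); the file name and column names are placeholders for whatever metrics you export:

    # Exported Google Analytics data, e.g. one row per visitor or session
    ga <- read.csv("ga_metrics.csv")

    # Cluster visitors on a few behavioural metrics
    features <- scale(ga[, c("pageviews", "session_duration", "bounce_rate")])
    set.seed(42)
    clusters <- kmeans(features, centers = 4)
    table(clusters$cluster)    # how many visitors fall into each cluster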
When you say GA framework I'll assume you're referring to the set of Google Analytics APIs listed here. The framework by itself doesn't provide machine learning capabilities. It merely provides access to the processed and aggregated GA data stored in Google's servers. You can use the API and feed the data to a machine learning application/system/program that does all of the stuff you mentioned.
