There are some options to access R libraries in Spark:
directly, using SparkR
using language bindings such as rpy2 or rscala
using a standalone service such as OpenCPU
It looks like SparkR is quite limited, OpenCPU requires running an additional service, and bindings can have stability issues. Is there anything else specific to the Spark architecture that makes any of these solutions hard to use?
Do you have any experience with integrating R and Spark you can share?
The main language for the project seems like an important factor.
If pyspark is a good way to use Spark for you (meaning that you are accessing Spark from Python), accessing R through rpy2 should not differ much from using any other Python library with a C extension.
There are reports of users doing so (although with occasional questions such as "How can I partition pyspark RDDs holding R functions" or "Can I connect an external (R) process to each pyspark worker during setup").
If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations would be the way to go.
If your main language is Scala, rscala should be your first try.
While the combo pyspark + rpy2 would seem the most "established" (as in "uses the oldest and probably most-tried codebase"), this does not necessarily mean that it is the best solution (and young packages can evolve quickly). I'd assess first what is the preferred language for the project and try options from there.
I'm currently working on university research software that uses statistical models to run calculations related to Item Response Theory. The entire source code is written in Go, and it communicates with an Rscript server to run scripts written in R and return the generated results. As expected, the software has some dependencies it needs to work properly (one of them, as mentioned, is having R/Rscript installed along with some of its packages).
Because I'm new to software development, I can't find a proper way to manage all these dependencies on Windows or Linux (I'm prioritizing Windows right now). What I was thinking of is a kind of script that checks whether, for example, R is properly installed and, if so, whether each required package is also installed. If everything checks out, the software can be installed without further problems.
My question is what the best way to do something like that is, and whether it's possible to do the same for other dependencies, such as Python, Go, and some of their libraries. I'm also open to suggestions if installing programming languages locally on the machine isn't the proper way to manage software dependencies, or if there's a more convenient way to do it than writing a script.
Sorry if any needed information is missing; if so, please let me know.
Thanks in advance
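The check described in the question could be sketched roughly as follows. This is only an illustration in Python (the same idea carries over to Go or a shell script): it looks for Rscript on the PATH and then asks R which of the required packages are missing. The package list and the function names are hypothetical, not part of the original project.

```python
import shutil
import subprocess

# Hypothetical list of R packages the application needs (an assumption for illustration).
REQUIRED_R_PACKAGES = ["mirt", "jsonlite"]


def missing_packages_expr(packages):
    """Build an R expression that prints every required package that is not installed."""
    quoted = ", ".join('"{}"'.format(name) for name in packages)
    return (
        "pkgs <- c({}); "
        "cat(setdiff(pkgs, rownames(installed.packages())), sep = '\\n')"
    ).format(quoted)


def check_r_dependencies(packages):
    """Return (r_found, missing_packages) for the current machine."""
    rscript = shutil.which("Rscript")
    if rscript is None:
        # R itself is missing, so every package counts as missing too.
        return False, list(packages)
    result = subprocess.run(
        [rscript, "-e", missing_packages_expr(packages)],
        capture_output=True,
        text=True,
        check=True,
    )
    missing = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return True, missing


if __name__ == "__main__":
    found, missing = check_r_dependencies(REQUIRED_R_PACKAGES)
    if not found:
        print("Rscript was not found on PATH; install R first.")
    elif missing:
        print("Missing R packages: " + ", ".join(missing))
    else:
        print("All R dependencies are installed.")
```

An installer could run a check like this first and stop with a clear message instead of failing later at runtime.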
I have written a piece of R code that performs a numerical computation. Now I want to wrap it in a nice GUI. I know that there are some R packages that allow creating GUIs from within R (e.g. gWidgets, RGtk2, ...). However, they seem to be rather limited in their capabilities and complicated to build. So I thought about going the other way round and writing a windowed program that incorporates my R code.
Is it possible to write a nice GUI (for example in Visual Basic .NET or Java) that gathers some user inputs, calls the R computations, and displays the results?
I ask about Visual Basic because there is this new R-Open that comes along with Visual Studio, which makes me think the two must offer natural ways of working together. I also hope that I would be able to compile an exe with it in the end.
Thank you very much for your help!
Bernd
You can embed R in C++ code. There are examples in the R source code and documentation.
Very briefly, you'll need to build a shared DLL version of R (i.e. with the --enable-R-shlib option) from the source code, using the Windows tools. This is how GUIs like RStudio function.
The R administration manual has detailed instructions. The RInside package might make this a bit easier.
With the shared DLL you could probably embed R in other languages as well (this works for embedding R in Python, for example).
There don't seem to be many options for deploying predictive models in production, which is surprising given the explosion in Big Data.
I understand that the open-source PMML can be used to export models as an XML specification. This can then be used for in-database scoring/prediction. However it seems that to make this work you need to use the PMML plugin by Zementis which means the solution is not truly open source. Is there an easier open way to map PMML to SQL for scoring?
Another option would be to use JSON instead of XML to output model predictions. But in this case, where would the R model sit? I'm assuming it would always need to be mapped to SQL...unless the R model could sit on the same server as the data and then run against that incoming data using an R script?
Any other options out there?
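For simple models, the PMML-to-SQL mapping asked about here can be done by hand. The sketch below is a minimal illustration in Python, not a general PMML engine: it handles only a single RegressionTable with NumericPredictor terms, ignores the optional exponent attribute, and works on a trimmed sample document (a complete PMML file also carries a Header, DataDictionary, and MiningSchema). The sample model and function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical PMML fragment for a linear regression model.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.02"/>
      <NumericPredictor name="income" coefficient="-0.3"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def regression_pmml_to_sql(pmml_text):
    """Turn the first RegressionTable into a SQL arithmetic expression."""
    root = ET.fromstring(pmml_text)
    table = next(el for el in root.iter() if local_name(el.tag) == "RegressionTable")
    terms = [table.get("intercept", "0")]
    for child in table:
        if local_name(child.tag) == "NumericPredictor":
            # Quote the column name so it survives as a SQL identifier.
            terms.append('({} * "{}")'.format(child.get("coefficient"), child.get("name")))
    return " + ".join(terms)


if __name__ == "__main__":
    expression = regression_pmml_to_sql(SAMPLE_PMML)
    # "customers" is a hypothetical table name.
    print("SELECT id, {} AS score FROM customers".format(expression))
```

Tree ensembles and models with categorical predictors need considerably more translation logic than this, which is where dedicated scoring engines come in.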
The following is a list of the alternatives that I have found so far to deploy an R model in production. Please note that the workflow to use these products varies significantly between each other, but they are all somehow oriented to facilitate the process of exposing a trained R model as a service:
openCPU
AzureML
DeployR
yhat (already mentioned by @Ramnath)
Domino
Sense.io
The answer really depends on what your production environment is.
If your "big data" are on Hadoop, you can try this relatively new open source PMML "scoring engine" called Pattern.
Otherwise you have no choice (short of writing custom model-specific code) but to run R on your server. You would use save to save your fitted models in .RData files and then load and run corresponding predict on the server. (That is bound to be slow but you can always try and throw more hardware at it.)
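As a rough sketch of that save/load/predict workflow, a server-side job could shell out to Rscript as shown below. The wrapper is written in Python for illustration only. It assumes the .RData file contains a fitted object named `model` and that the input CSV has the columns the model expects; both are assumptions, as are the function names.

```python
import shutil
import subprocess

# R code executed on the server. ASSUMPTION: the .RData file holds an object called `model`.
R_SCORING_SNIPPET = (
    "args <- commandArgs(trailingOnly = TRUE); "
    "load(args[1]); "
    "newdata <- read.csv(args[2]); "
    "scores <- data.frame(score = predict(model, newdata)); "
    "write.csv(scores, args[3], row.names = FALSE)"
)


def build_scoring_command(model_path, input_csv, output_csv, rscript="Rscript"):
    """Build the Rscript command line; the three paths arrive in R via commandArgs()."""
    return [rscript, "-e", R_SCORING_SNIPPET, model_path, input_csv, output_csv]


def score(model_path, input_csv, output_csv):
    """Run the scoring job and raise if R is missing or the script fails."""
    rscript = shutil.which("Rscript")
    if rscript is None:
        raise RuntimeError("Rscript was not found on PATH")
    command = build_scoring_command(model_path, input_csv, output_csv, rscript=rscript)
    subprocess.run(command, check=True)
```

Starting a fresh R process per call is part of why this approach is slow; batching many rows into one call amortizes the startup cost.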
How you do that really depends on your platform. Usually there is a way to add "custom" functions written in R. The term is UDF (user-defined function). In Hadoop you can add such functions to Pig (e.g. https://github.com/cd-wood/pigaddons) or you can use RHadoop to write simple map-reduce code that loads the model and calls predict in R. If your data are in Hive, you can use Hive TRANSFORM to call an external R script.
There are also vendor-specific ways to add functions written in R to various SQL databases. Again look for UDF in the documentation. For instance, PostgreSQL has PL/R.
You can create RESTful APIs for your R scripts using plumber (https://github.com/trestletech/plumber).
I wrote a blog post about it (http://www.knowru.com/blog/how-create-restful-api-for-machine-learning-credit-model-in-r/), using the deployment of credit models as an example.
In general, I do not recommend PMML because the packages you used might not support translation to PMML.
A common practice is scoring a new/updated dataset in R and moving only the results (IDs, scores, probabilities, other necessary fields) into the production environment/data warehouse.
I know this has its limitations (infrequent refreshes, reliance upon IT, data set size/computing power restrictions) and may not be the cutting edge answer many (of your bosses) are looking for; but for many use-cases this works well (and is cost friendly!).
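To make the "move only the results" step concrete, here is a minimal sketch in Python. It assumes R has already written a CSV with `id` and `score` columns, and it uses SQLite as a stand-in for the production database; the table name, column names, and sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for the file R would write with write.csv(); the values are made up.
SCORES_CSV = """id,score
101,0.82
102,0.17
"""


def load_scores(conn, csv_file):
    """Upsert (id, score) rows into a results table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS model_scores ("
        "id INTEGER PRIMARY KEY, score REAL NOT NULL)"
    )
    rows = [(int(r["id"]), float(r["score"])) for r in csv.DictReader(csv_file)]
    # INSERT OR REPLACE keeps the table current when a refresh rescored an ID.
    conn.executemany(
        "INSERT OR REPLACE INTO model_scores (id, score) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    loaded = load_scores(connection, io.StringIO(SCORES_CSV))
    print("Loaded {} scores".format(loaded))
```

Because only IDs and scores cross the boundary, the production side never needs R installed.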
It’s been a few years since the question was originally asked.
For rapid prototyping I would argue the easiest approach currently is to use the Jupyter Kernel Gateway. This allows you to add REST endpoints to any cell in your Jupyter notebook. This works for both R and Python, depending on the kernel you’re using.
This means you can easily call any R or Python code through a web interface. When used in conjunction with Docker it lends itself to a microservices approach to deploying and scaling your application.
Here’s an article that takes you from start to finish to quickly set up your Jupyter Notebook with the Jupyter Kernel Gateway.
Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users
For moving solutions to production the leading approach in 2019 is to use Kubeflow. Kubeflow was created and is maintained by Google, and makes "scaling machine learning (ML) models and deploying them to production as simple as possible."
From their website:
You adapt the configuration to choose the platforms and services that you want to use for each stage of the ML workflow: data preparation, model training, prediction serving, and service management.
You can choose to deploy your workloads locally or to a cloud environment.
Elise from Yhat here.
Like @Ramnath and @leo9r mentioned, our software allows you to put any R (or Python, for that matter) model directly into production via REST API endpoints.
We handle real-time or batch, as well as all of the model testing and versioning + systems management associated with the process.
This case study we co-authored with VIA SMS might be useful if you're thinking about how to get R models into production (their data sci team was recoding into PHP prior to using Yhat).
Cheers!
I tried RInside's Qt example qdensity and really liked it. It was easy to set up, and I was surprised how easy it was to understand and modify given that I have virtually no Qt experience. Now I wonder whether it is possible to use RInside with R running on a remote machine.
It seems that I cannot use RInside for this purpose. I wonder whether there is another way of creating a Qt desktop app that communicates with R on some server. I got RStudio Server running and I am really happy with it, but it's for the R people. To promote my R work within our institute among non-R people as well, I would like to offer a simple, very limited GUI that can do basic things like showing a graph or starting an R CMD BATCH job. I also know Shiny (and Shiny Server) and have been actively testing it recently, but I am looking for a simple desktop client to connect to my server-side R.
Is there a basis to start out with Rserve and Qt?
Any suggestions (where to start, examples, generally a bad idea)?
What are R's capabilities for handling something like IPC or D-Bus?
Use Qt with C++, and just process the files that you create with R on your server.
So for example: create the graphic and save it in a format that you can load (BMP, PNG, etc.), then load it into your GUI.
I also suggest Qt Creator for GUI design. It's fast and simple. This idea only fits if you don't need to stay in the R environment.
When I have created programs that process data and produce things like probabilities and charts, I have usually used HTML with PHP for the interface and left the rest of the processing (for example R scripts) to the server.
For any recent visitor: take a look at OpenCPU. It publishes R functions as RESTful services and handles all the marshalling of R data types to and from JSON.
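A desktop client only needs an HTTP library to use such a service. The sketch below (in Python, for illustration) builds a POST request against OpenCPU's `/ocpu/library/<package>/R/<function>/json` endpoint, with the function arguments sent as a JSON body. The base URL is an assumption about where the server runs, and the helper names are made up.

```python
import json
import urllib.request


def build_opencpu_request(base_url, package, function, args):
    """Build a POST request that calls an R function and asks for JSON output."""
    url = "{}/ocpu/library/{}/R/{}/json".format(base_url.rstrip("/"), package, function)
    body = json.dumps(args).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_r_function(base_url, package, function, args, timeout=30):
    """Send the request and decode the JSON that OpenCPU returns."""
    request = build_opencpu_request(base_url, package, function, args)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # ASSUMPTION: an OpenCPU server is reachable at this address.
    print(call_r_function("http://localhost:5656", "stats", "rnorm", {"n": 3}))
```

The same request can be issued from Qt with QNetworkAccessManager, so the GUI never has to link against R.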
Here's the scenario:
I have JBoss serving a web service, with JBossWS providing me with a WSDL. I have connected to it and used it from both .NET and Java so far (and it has been quite easy once I figured it out). I am now trying to do the same with R.
Is there anything out there considered to be "the right way" for doing this? I am not that familiar with R, and my searches have not turned up much, so I figured I'd ask and maybe spare my head and the wall a bit of damage.
I have had good luck using rJava to recreate in R something that works in Java. I use this method for connecting to Amazon's AWS Java SDK for their API with R. This allows me, for example, to transfer files to/from S3 from R without having to recreate the whole connection/handshake/boogieWoogie from R.
If you wanted to go more "pure R" I think you'll have to use some combination of RCurl and the XML package to grab and parse the WSDL.
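To show what "grab and parse the WSDL" amounts to, here is a small sketch of the parsing half. It is written in Python rather than R, purely as an illustration of the steps: the XML package would walk the same elements. The sample WSDL is a made-up, trimmed document, and fetching it over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Namespace of WSDL 1.1 documents.
WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

# Hypothetical, heavily trimmed WSDL for illustration.
SAMPLE_WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="EchoService">
  <portType name="EchoPort">
    <operation name="echo"/>
    <operation name="ping"/>
  </portType>
</definitions>"""


def list_operations(wsdl_text):
    """Map each portType name to the names of the operations it declares."""
    root = ET.fromstring(wsdl_text)
    operations = {}
    for port_type in root.findall("wsdl:portType", WSDL_NS):
        operations[port_type.get("name")] = [
            op.get("name") for op in port_type.findall("wsdl:operation", WSDL_NS)
        ]
    return operations


if __name__ == "__main__":
    print(list_operations(SAMPLE_WSDL))
```

Knowing the operation names is only the first step; building the SOAP envelopes for each call is the part that rJava or an existing SOAP client saves you from writing.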
There are a number of ways:
You could retain your Java approach and use the rJava package around it
You could use RCurl which is used to power a few higher-level packages (accessing Google APIs, say)
I believe there is an older SSOAP package on Omegahat which may help too.