Is there a way to run complex R workloads in a serverless manner in the cloud?

We are currently manually running complex R workloads on a monster VM in the Azure cloud. Some workloads consume all of the VM's resources and create bottlenecks; a typical workload takes 30 minutes to 3 hours.
Is there a way to improve performance by running R workloads in a serverless, isolated manner, perhaps using containers or cloud functions?
We are also interested in investing in a tool that we could use to manage/administer/orchestrate workloads in a seamless end-to-end fashion.
Something like Azure Data Factory but for stitching together stuff in R.
Any helpful suggestions would be appreciated. Thank you.

There are several options for running R in Azure. I would think HDInsight, Databricks, or Azure Machine Learning (AML) would be good for you to review.

Related

Microservices orchestration choices

I am exploring possible solutions for orchestrating my flows across multiple services via some infrastructure. Searching shows me a few options such as Conductor, Camunda, Airflow, etc.
I am wondering what would fit my use case better:
One of my services is in Java, the other is in Python.
I need to pass info to the Java service, then take its output and pass it to the Python service.
The final output is then published to another queue.
It feels like Conductor is a good choice, but would love to hear your inputs!
All options can fulfill the requirement stated. Think about further / future requirements. Is it only a data pipe? Is it about orchestrating a larger end-to-end business process? Do you need support for long-running processes? Is end-to-end transparency in a graphical form a benefit? Is graphical process modelling in the BPMN 2 standard going to be a benefit? Are there going to be audit or reporting requirements? Or is it going to be a simple, isolated, technical solution?
This article gives a great overview of tools in the market and what their primary use cases are: https://blog.bernd-ruecker.com/understanding-the-process-automation-landscape-9406fe019d93
All listed tools might technically be able to execute your workflow (I have no experience working with Conductor & Camunda). A few characteristics on which a decision is usually made are:
open vs closed source
how do you define workflows? (e.g. Python code in Airflow. Others use e.g. JSON/XML/something custom)
does it come with a UI?
can it scale out in case my workloads start growing?
is it agnostic to any technology or limited to running certain technologies? (e.g. Oozie is built for scheduling jobs on Hadoop)
other requirements could be e.g. security, logging, monitoring, etc.
There are many orchestration tool comparisons on the internet, e.g. 1 or 2.
Introduction to Container Orchestration
Container orchestration is the practice of automating the administration of container-based microservice applications across different clusters. The idea is gaining popularity within corporations, and a variety of container orchestration tools have become indispensable for deploying microservice-based applications.
Modern software development is no longer monolithic. Instead, it produces component-based apps that run across many containers. These adaptable and scalable containers work together to accomplish a specific purpose or microservice.
Depending on the complexity of the application and other requirements like load balancing, they may span many clusters.
Containers encapsulate application code along with its dependencies; to function efficiently, they receive the resources they require from physical or virtual hosts. When complicated systems are built as containers, clustering them for deployment requires adequate management and prioritization.
How to Choose a Container Orchestration Tool?
We've looked at a number of orchestration tools that you can examine when selecting which is ideal for your business. To choose well, make sure you understand your company's requirements and operations; then you'll be able to weigh the benefits and drawbacks of each option more readily.
Kubernetes
Kubernetes has a lot of features and is ideally suited for container and cluster management at the enterprise level. Managed Kubernetes is offered by a number of platforms, including Google, AWS, Azure, Pivotal, and Docker, so as your containerized workload grows you have a lot of options.
The biggest disadvantage is that it does not work with Docker Swarm and Compose CLI manifests. It can also be difficult to understand and set up. Despite these flaws, it is one of the most widely used systems for cluster deployment and management.
Docker Swarm
For individuals who are already familiar with Docker Compose, Docker Swarm is a better option. It's easy to use and doesn't require any additional software. Unlike Kubernetes and Amazon ECS, however, Docker Swarm lacks sophisticated features such as built-in logging and monitoring. As a result, it is better suited to small-scale businesses that are just getting started with containers.
Amazon ECS
If you're already familiar with Amazon Web Services, Amazon ECS is a great way to deploy and configure clusters. It's a quick and easy way to get started, it scales to match demand, and it integrates with a number of other AWS services. It's also a good fit for small teams with limited resources for container maintenance.
One of its disadvantages is that it is incompatible with nonstandard deployments. It also relies on ECS-specific configuration files, which complicates debugging.

RStudio Connect Server vs Power BI

Has anyone done a comparison between RStudio Connect and Power BI? We are trying to work out the benefits of RStudio Connect over Power BI in order to convince our super strict IT management to go with RStudio Connect.
Thank you.
You should figure out which decision variables are important to you. This RStudio thread goes into detail about the benefits; broadly, if you are going lightweight, RStudio Connect is the better fit. Most likely your users are more technical and want more ability to build powerful tools themselves.
Power BI seems to be better for the Excel "power" users. It does not handle large datasets well, and it is most likely aimed at a non-technical crowd.
Consider the end users before all else, then work backward from there.

Options for deploying R models in production

There don't seem to be many options for deploying predictive models in production, which is surprising given the explosion in Big Data.
I understand that the open PMML standard can be used to export models as an XML specification. This can then be used for in-database scoring/prediction. However, it seems that to make this work you need to use the PMML plugin by Zementis, which means the solution is not truly open source. Is there an easier, open way to map PMML to SQL for scoring?
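To make the PMML route concrete, the export step in R would look roughly like the sketch below (using the pmml package; the model and file name are just placeholders):
    # a minimal sketch: fit a model and export it as a PMML (XML) document
    library(pmml)                             # pmml() supports lm, rpart, randomForest, ...
    fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder model
    doc <- pmml(fit)                          # convert the fitted model to PMML
    XML::saveXML(doc, file = "model.pmml")    # write out the XML specification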
Another option would be to use JSON instead of XML to output model predictions. But in this case, where would the R model sit? I'm assuming it would always need to be mapped to SQL...unless the R model could sit on the same server as the data and then run against that incoming data using an R script?
Any other options out there?
The following is a list of the alternatives that I have found so far to deploy an R model in production. Please note that the workflow to use these products varies significantly between each other, but they are all somehow oriented to facilitate the process of exposing a trained R model as a service:
openCPU
AzureML
DeployR
yhat (already mentioned by @Ramnath)
Domino
Sense.io
The answer really depends on what your production environment is.
If your "big data" is on Hadoop, you can try the relatively new open-source PMML "scoring engine" called Pattern.
Otherwise you have no choice (short of writing custom model-specific code) but to run R on your server. You would use save to store your fitted models in .RData files and then load them and run the corresponding predict on the server. (That is bound to be slow, but you can always try to throw more hardware at it.)
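A minimal sketch of that save / load / predict pattern (object, file, and data-frame names are placeholders):
    # on the modelling machine: fit the model and persist it
    fit <- glm(y ~ x1 + x2, data = training_data, family = binomial)
    save(fit, file = "fit.RData")

    # on the server: restore the fitted model and score incoming data
    load("fit.RData")    # restores the object `fit`
    scores <- predict(fit, newdata = incoming_data, type = "response")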
How you do that really depends on your platform. Usually there is a way to add "custom" functions written in R; the term is UDF (user-defined function). In Hadoop you can add such functions to Pig (e.g. https://github.com/cd-wood/pigaddons), or you can use RHadoop to write simple map-reduce code that loads the model and calls predict in R. If your data are in Hive, you can use Hive TRANSFORM to call an external R script, as sketched below.
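As a rough illustration of the Hive TRANSFORM route, the external script is just an R program that reads tab-separated rows from stdin and writes scores to stdout (the model file and column layout here are placeholders):
    #!/usr/bin/env Rscript
    # score.R - stream rows in from Hive, emit "id <TAB> score" for each one
    model <- readRDS("model.rds")    # placeholder for your persisted model

    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
      row <- data.frame(x1 = as.numeric(fields[2]),
                        x2 = as.numeric(fields[3]))
      cat(fields[1], predict(model, row), sep = "\t")
      cat("\n")
    }
    close(con)
On the Hive side you would then reference it with something along the lines of SELECT TRANSFORM(id, x1, x2) USING 'Rscript score.R' AS (id, score) FROM my_table (table and column names are illustrative).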
There are also vendor-specific ways to add functions written in R to various SQL databases. Again look for UDF in the documentation. For instance, PostgreSQL has PL/R.
You can create RESTful APIs for your R scripts using plumber (https://github.com/trestletech/plumber).
I wrote a blog post about it (http://www.knowru.com/blog/how-create-restful-api-for-machine-learning-credit-model-in-r/) using the deployment of credit models as an example.
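For reference, a minimal plumber API might look like the sketch below (the model file, endpoint path, and parameters are placeholders of mine, not taken from the post):
    # plumber.R - expose a fitted model behind a REST endpoint
    library(plumber)

    model <- readRDS("model.rds")    # placeholder for your fitted model

    #* Score one observation passed as query parameters
    #* @param x1 numeric predictor
    #* @param x2 numeric predictor
    #* @get /predict
    function(x1, x2) {
      newdata <- data.frame(x1 = as.numeric(x1), x2 = as.numeric(x2))
      list(prediction = unname(predict(model, newdata)))
    }
You then serve it with plumber::plumb("plumber.R")$run(port = 8000) and call it with, e.g., GET /predict?x1=1.5&x2=2.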
In general, I do not recommend PMML because the packages you used might not support translation to PMML.
A common practice is scoring a new/updated dataset in R and moving only the results (IDs, scores, probabilities, other necessary fields) into the production environment/data warehouse.
I know this has its limitations (infrequent refreshes, reliance upon IT, dataset size/computing power restrictions) and may not be the cutting-edge answer many (of your bosses) are looking for, but for many use cases this works well (and is cost friendly!).
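A hedged sketch of that batch-scoring pattern, assuming a DBI-compatible connection to the warehouse (the DSN, table, model, and data frame are placeholders):
    library(DBI)

    # score the refreshed dataset in R, keeping only the fields to ship
    scores <- data.frame(id    = new_data$id,
                         score = predict(model, newdata = new_data, type = "response"))

    # move only the results into the warehouse
    con <- dbConnect(odbc::odbc(), dsn = "warehouse")    # placeholder DSN
    dbWriteTable(con, "model_scores", scores, append = TRUE)
    dbDisconnect(con)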
It’s been a few years since the question was originally asked.
For rapid prototyping I would argue the easiest approach currently is to use the Jupyter Kernel Gateway. This allows you to add REST endpoints to any cell in your Jupyter notebook. This works for both R and Python, depending on the kernel you’re using.
This means you can easily call any R or Python code through a web interface. When used in conjunction with Docker it lends itself to a microservices approach to deploying and scaling your application.
Here’s an article that takes you from start to finish to quickly set up your Jupyter Notebook with the Jupyter Kernel Gateway.
Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users
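As a rough sketch of what that looks like with an R kernel (the endpoint, field names, and launch command details are assumptions on my part, not taken from the article), a notebook cell annotated for the Kernel Gateway's notebook-http mode might contain:
    # GET /predict
    # REQUEST is a JSON string that the Kernel Gateway injects into the kernel
    req <- jsonlite::fromJSON(REQUEST)
    x1  <- as.numeric(req$args$x1)
    print(jsonlite::toJSON(list(prediction = 2 * x1), auto_unbox = TRUE))
The gateway would then be started with something along the lines of jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.notebook_http --KernelGatewayApp.seed_uri=my_notebook.ipynb; the article linked above walks through the details.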
For moving solutions to production, the leading approach in 2019 is to use Kubeflow. Kubeflow was created and is maintained by Google, and it aims to make "scaling machine learning (ML) models and deploying them to production as simple as possible."
From their website:
You adapt the configuration to choose the platforms and services that you want to use for each stage of the ML workflow: data preparation, model training, prediction serving, and service management.
You can choose to deploy your workloads locally or to a cloud environment.
Elise from Yhat here.
Like @Ramnath and @leo9r mentioned, our software allows you to put any R (or Python, for that matter) model directly into production via REST API endpoints.
We handle real-time or batch, as well as all of the model testing and versioning + systems management associated with the process.
This case study we co-authored with VIA SMS might be useful if you're thinking about how to get R models into production (their data sci team was recoding into PHP prior to using Yhat).
Cheers!

distributed scheduling system for R scripts

I would like to schedule and distribute the execution of R scripts (using Rserve, for instance) across several machines - Windows or Ubuntu - with one task running on only one machine.
I don't want to reinvent the wheel and would like to use a system that already exists to distribute these tasks in an optimal manner and ideally have a GUI to control the proper execution of the scripts.
1/ Is there an R package or a library that can be used for that?
2/ One framework that seems to be quite widely used is MapReduce with Apache Hadoop.
I have no experience with this framework. What installation/plugin/setup would you advise for my purpose?
Edit: Here are more details about my setup:
I do indeed have an office full of machines (small servers or workstations) that are sometimes also used for other purposes. I want to use the computing power of all these machines and distribute my R scripts across them.
I also need a scheduler, e.g. a tool to run the scripts at a fixed time or on a regular basis.
I am using both Windows and Ubuntu, but a good solution on either system would be sufficient for now.
Finally, I don't need the server to get back the results of the scripts. The scripts do things like accessing a database, saving files, etc., but do not return anything. I just would like to get back the errors/warnings if there are any.
If what you want to do is distribute jobs for parallel execution on machines you have physical access to, I HIGHLY recommend the doRedis backend for foreach. You can read the vignette PDF to get more details; the gist is as follows (a minimal usage sketch follows the quoted excerpt):
Why write a doRedis package? After all, the foreach package already
has available many parallel back end packages, including doMC, doSNOW
and doMPI. The doRedis package allows for dynamic pools of workers.
New workers may be added at any time, even in the middle of running
computations. This feature is relevant, for example, to modern cloud
computing environments. Users can make an economic decision to "turn on" more computing resources at any time in order to accelerate running computations. Similarly, modern cluster resource allocation systems can dynamically schedule R workers as cluster resources become available.
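A minimal usage sketch, assuming a reachable Redis server (the host name and queue name are placeholders):
    # on the machine submitting work
    library(foreach)
    library(doRedis)
    registerDoRedis(queue = "jobs", host = "redis-server")

    results <- foreach(i = 1:100, .combine = c) %dopar% {
      sqrt(i)    # each task runs on whichever worker picks it up
    }

    removeQueue("jobs")

    # on each worker machine (Windows or Ubuntu), attach to the same queue:
    # library(doRedis)
    # redisWorker(queue = "jobs", host = "redis-server")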
Hadoop works best if the machines running Hadoop are dedicated to the cluster, and not borrowed. There's also considerable overhead to setting up Hadoop which can be worth the effort if you need the map/reduce algo and distributed storage provided by Hadoop.
So what, exactly is your configuration? Do you have an office full of machines you're wanting to distribute R jobs on? Do you have a dedicated cluster? Is this going to be EC2 or other "cloud" based?
The devil is in the details, so you can get better answers if the details are explicit.
If you want the workers to do jobs and have the results of the jobs collected back on one master node, you'll be much better off using a dedicated R solution and not a system like TakTuk or dsh, which are more general parallelization tools.
Look into TakTuk and dsh as starting points. You could perhaps roll your own mechanism with pssh or clusterssh, though these may be more effort.

What's the best way to simulate a complex production web development environment?

I want to create a modestly scalable development environment for an in-development web service.
Ideally, there would be an nginx web server with haproxy and a few database servers, websockets, the works.
I'd be going with Amazon cloud services for all of this hosting... but I'd rather not pay for CPU cycles when I'm just developing... much less develop on a remote, cloud environment.
What's the best way to go about modeling a somewhat complex development environment locally that could - hopefully, at the press of a button - sync with a similarly architected Amazon cloud environment?
All I have is my MacBook Pro. I also have a fully built 1 GHz tower computer in the closet that I could leverage if needed, and I wouldn't be opposed to buying more. But ultimately, I'd like to have the ability to sync to production with minimal steps and reconfiguration.
Thanks!
Check out Vagrant and VirtualBox. That will get you local environments running nicely on your MacBook. Syncing to EC2 is going to be tougher. At the system level you'll want to use something like Puppet or Chef (which are both nicely supported by Vagrant). Add to that a solid automated application deployment mechanism and you should be close. Be prepared to put some time into this; it's not likely to be a trivial undertaking.
