RStudio - connect to HDInsight Spark Cluster with SparkR

I am using RStudio on my local laptop and trying to connect to an existing, remote HDInsight Spark Cluster.
A couple of questions:
1) Do I need to have RStudio installed on the HDInsight Spark Cluster?
2) How do I connect local RStudio to a remote Spark Cluster? I've been looking at the SparkR docs here, but they don't seem to give a connection example (i.e., URL, credentials, etc.).

HDInsight offers an R Server option that can be integrated into your HDInsight cluster. This option allows R scripts to use Spark and MapReduce to run distributed computations.
For more details, refer to “Get started using R Server on HDInsight”.
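If you want to connect from your local RStudio rather than running R on the cluster itself, one commonly used route is sparklyr against the cluster's Livy endpoint (sparklyr rather than SparkR directly). A minimal sketch, where the cluster URL and credentials are placeholders, not real values:

```r
# Sketch: connect local RStudio to a remote HDInsight Spark cluster via Livy.
library(sparklyr)

config <- livy_config(
  username = "admin",                      # cluster login (assumption)
  password = rstudioapi::askForPassword()  # prompt instead of hard-coding
)

sc <- spark_connect(
  master = "https://mycluster.azurehdinsight.net/livy",  # placeholder endpoint
  method = "livy",
  config = config
)

sdf_len(sc, 10)      # quick sanity check: a 10-row Spark DataFrame
spark_disconnect(sc)
```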

Related

Possible options to deploy an R application in an AWS environment

I am looking for all the possible options to deploy an application written using R and the Shiny package on AWS. I see one of the options is to use EC2, but I would like to know if there is any serverless option available to deploy R applications on AWS.
Does SageMaker RStudio have a deployment option? I looked into the AWS documentation, but unfortunately I could not find one.
Any leads would be much appreciated.
Using RStudio on SageMaker (or simply R kernels in SageMaker Studio / Notebooks) enables several deployment strategies for RESTful application endpoints.
In this example, you can see a trained R model deployed to a SageMaker serverless endpoint:
https://github.com/aws-samples/rstudio-on-sagemaker-workshop/blob/main/03_SageMakerPipelinesAndDeploy/01_Pipeline.Rmd#L124-L149
In this example, you will deploy a trained R model to a 24/7 RESTful endpoint using SageMaker real-time endpoints:
https://github.com/aws/amazon-sagemaker-examples/tree/main/r_examples/r_serving_with_plumber
To build the serving containers for those models, you can check the above examples, or you may want to use the recently launched support for the native R vetiver package:
https://github.com/juliasilge/vetiverdemo/blob/main/biv-svm/train_sagemaker.Rmd
If you're looking to deploy Shiny applications rather than models, you can use EC2 or a container service like ECS/Fargate to host the Shiny or RStudio Connect server. These blogs may be a bit overkill for what you're looking to do, but they should have good examples of achieving hosted applications:
https://aws.amazon.com/blogs/machine-learning/host-rstudio-connect-and-package-manager-for-ml-development-in-rstudio-on-amazon-sagemaker/
https://aws.amazon.com/blogs/architecture/scaling-rstudio-shiny-using-serverless-architecture-and-aws-fargate/
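As a concrete illustration of the model-serving route, the R containers in the linked examples expose REST routes with plumber. A minimal sketch, assuming SageMaker-style /ping and /invocations routes and a placeholder model path (not copied from the repos):

```r
# plumber.R -- minimal serving sketch in the spirit of the linked examples.
library(plumber)

model <- readRDS("/opt/ml/model/model.rds")  # model artifact path is an assumption

#* Health check polled by the hosting platform
#* @get /ping
function() "OK"

#* Score a batch of records posted as JSON
#* @post /invocations
function(req) {
  newdata <- jsonlite::fromJSON(req$postBody)
  predict(model, newdata = as.data.frame(newdata))
}
```

You would then serve this file from the container's entrypoint with plumber::plumb("plumber.R")$run(port = 8080).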

Create pipeline in Azure that runs R script without generalizing the VM

I have an R script that I need to run once every week, and it needs to be done using Azure. One option is to use Azure Data Factory and set up a pipeline that will run this R script on a Windows VM.
The problem I'm facing is that every now and then I will have to update both the R script and the R packages it uses.
When setting up this pipeline I will have to generalize the VM (correct me if I'm wrong), and by doing so I can no longer log into it. And if I can't log into the VM, I cannot update the R packages.
What options do I have here?
There is an alternative solution where you can use Azure Batch Service and Azure Data Factory to execute R scripts in your pipeline.
For more information, you can refer to this blog: How to execute R Scripts using Azure Batch Services and Azure Data Factory? | by Aditya Kaushal | Medium
Alternatively, to run R scripts on your VM, you can use one of the options below:
Custom Script Extension
Run Command
Hybrid Runbook Worker
Serial console
Reference: Run scripts in an Azure Windows VM - Azure Virtual Machines | Microsoft Docs
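Whichever trigger you choose, you can sidestep the update problem by letting the script refresh its own packages at the start of each run, so no interactive login to the VM is needed. A minimal sketch (the package list is illustrative):

```r
# Run at the top of the weekly script so package updates happen during the
# pipeline run itself rather than via a manual login.
needed  <- c("dplyr", "readr")  # packages your script uses (placeholders)
missing <- setdiff(needed, rownames(installed.packages()))
if (length(missing) > 0) {
  install.packages(missing, repos = "https://cloud.r-project.org")
}
update.packages(ask = FALSE, repos = "https://cloud.r-project.org")
```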

Connecting to Cassandra Data using Sparklyr

I am using RStudio. I installed a local version of Spark, ran a few things, and am quite happy. Now I am trying to read my actual data from a cluster, using RStudio Server and a standalone version of Spark. The data is in Cassandra, and I do not know how to connect to it. Can anyone point me to a good primer on how to connect and read that data?
There is a package called crassy, which is listed under Extensions on the official sparklyr site. You can search for crassy on GitHub.
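If you prefer to stay with plain sparklyr, you can also read Cassandra tables through the Spark-Cassandra connector via the Spark data source API. A minimal sketch, where the connector version, host, keyspace, and table names are all assumptions:

```r
library(sparklyr)

config <- spark_config()
# Pull the DataStax connector onto the Spark classpath (version is an assumption)
config$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1"
config$spark.cassandra.connection.host <- "10.0.0.5"  # Cassandra contact point (placeholder)

sc <- spark_connect(master = "spark://master:7077", config = config)

# Read a Cassandra table through the Spark data source API
events <- spark_read_source(
  sc,
  name    = "events",
  source  = "org.apache.spark.sql.cassandra",
  options = list(keyspace = "mykeyspace", table = "events")
)

head(events)
```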

Is it possible to access Hive data in a Hadoop HDInsight cluster using R?

Is it possible to access Hive data in a Hadoop HDInsight cluster using R? Say we don't have R Server; all I am interested in is using R as a client tool to access Hive data.
Yes, it's possible to access Hive without R Server. There are several ways to do this, listed below.
RHive, an R extension facilitating distributed computing via Apache Hive. There is a slide deck you can refer to, but it seems to be quite old and does not support YARN.
RJDBC, a package implementing DBI in R on the basis of JDBC. There is a blog post which introduces its usage for R with Hive.
The R package hive; there is documentation for this package that you can refer to in order to learn how to use it.
It seems that the R package hive is a good choice, because it supports Apache Hadoop >= 2.6.0, which HDInsight is based on.
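For example, the RJDBC route looks roughly like this (a sketch; the driver JAR location, host, and credentials are placeholders for your setup):

```r
library(RJDBC)

# Load the Hive JDBC driver; the JAR directory is an assumption
drv <- JDBC(
  driverClass = "org.apache.hive.jdbc.HiveDriver",
  classPath   = list.files("/path/to/hive-jdbc-jars", pattern = "\\.jar$", full.names = TRUE)
)

conn <- dbConnect(
  drv,
  "jdbc:hive2://mycluster.azurehdinsight.net:443/default",  # placeholder URL
  "username", "password"                                    # placeholder credentials
)

head(dbGetQuery(conn, "SELECT * FROM mytable LIMIT 10"))
dbDisconnect(conn)
```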
Hope it helps.

Publishing Tableau Dashboard with R scripts to Server

I have built a dashboard that has multiple calculations using R scripts, and now I want to publish it to our internal server. I get the following error message:
"This worksheet contains R scripts, which cannot be viewed on the target platform until the administrator configures an Rserve connection."
From what I understand, the administrator has to configure Rserve, but what about the rest of the packages I have in use? Should the administrator install those too, and should I inform him every time I use a new package so he can install it?
You need to install the packages your scripts use on the server. Then make sure Rserve is started there, and connect your Tableau workbook to the server where Rserve is running (using the Rserve connection in Tableau).
Tableau described the process pretty well:
http://kb.tableau.com/articles/knowledgebase/r-implementation-notes
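On the Rserve host, the setup boils down to installing the packages your workbook calls and starting Rserve. A minimal sketch (the package names are placeholders):

```r
# Run once on the server that Tableau will connect to.
install.packages(c("forecast", "zoo"))  # packages your worksheets call (assumption)

library(Rserve)
Rserve(args = "--no-save")  # listens on the default port 6311
```

So yes: whenever your workbook starts using a new package, it also needs to be installed on the Rserve host.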
