Running an R script on Hadoop and MapReduce

I have an R script that processes a set of tweets, and I would like to run the same script on the same data stored in a Hadoop file system (HDFS). According to this Hortonworks tutorial I could use R code with data from my HDFS, but it is not quite clear how.
Can I use the very same R script, taking advantage of the MapReduce paradigm, by using Revolution R? Should I change my code, or is there a way to execute the same functions optimized for a Hadoop architecture?
Ideally I would write my code in a standard R IDE such as RStudio and then use it, or most of it, on my cloud services (such as Microsoft Azure) with MapReduce underneath.

Yes, you can run an R script across different data platforms, from Hadoop to Spark to Teradata and SQL Server, by using an environment-specific compute context.
The following two links should help you get started with Revolution R / Microsoft R Server on Hadoop:
https://msdn.microsoft.com/en-us/microsoft-r/scaler-hadoop-getting-started
https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/MicrosoftR/Samples/NYCTaxi/NYC2013_MRS_LinearBinary.Rmd

Related

Is it possible to access Hive data in a Hadoop HDInsight cluster using R?

Is it possible to access Hive data in a Hadoop HDInsight cluster using R? Say we don't have R Server; all I am interested in is using R as a client tool to access Hive data.
Yes, it is possible to access Hive without R Server. There are several options:
RHive, an R extension facilitating distributed computing via Apache Hive. There is a slide deck you can refer to, but it appears to be old and may not support YARN.
RJDBC, a package implementing DBI in R on the basis of JDBC. There is a blog post that introduces using R with Hive.
The R package hive. Its documentation explains how to use it.
The R package hive seems to be a good choice, because it supports Apache Hadoop >= 2.6.0, the version HDInsight is based on.
Hope it helps.
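As a rough illustration of the RJDBC option, the sketch below opens a HiveServer2 connection from a plain R client and runs one query. Everything specific here is an assumption rather than something stated in the answer: the host, port, credentials, and jar directory are placeholders, and `hive_jdbc_url` and `query_hive` are hypothetical helper names. The driver class `org.apache.hive.jdbc.HiveDriver` is the one shipped in Hive's JDBC jars. A managed cluster may need extra options appended to the URL, which this sketch does not cover.

```r
# Build a HiveServer2 JDBC URL. Host, port and database are placeholders.
hive_jdbc_url <- function(host, port = 10000, database = "default") {
  sprintf("jdbc:hive2://%s:%d/%s", host, as.integer(port), database)
}

# Run one query against Hive through RJDBC and return a data.frame.
# `jar_dir` must contain the Hive JDBC driver jar and its dependencies.
query_hive <- function(sql, host, user, password, jar_dir,
                       port = 10000, database = "default") {
  if (!requireNamespace("RJDBC", quietly = TRUE)) {
    stop("Package 'RJDBC' is required: install.packages('RJDBC')")
  }
  # Join all jars into one class path string using the platform separator.
  jars <- list.files(jar_dir, pattern = "\\.jar$", full.names = TRUE)
  class_path <- paste(jars, collapse = .Platform$path.sep)

  drv  <- RJDBC::JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
                      classPath   = class_path)
  conn <- DBI::dbConnect(drv, hive_jdbc_url(host, port, database),
                         user, password)
  on.exit(DBI::dbDisconnect(conn), add = TRUE)  # close the connection on exit
  DBI::dbGetQuery(conn, sql)
}
```

A call would then look like `query_hive("SHOW TABLES", host = "my-hive-host", user = "me", password = "secret", jar_dir = "/path/to/hive/jars")`, with every argument value replaced by your own.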

How do I set up and run SparkR projects and scripts (like a jar file)?

We have successfully gone through all the SparkR tutorials about setting it up and running basic programs in RStudio on an EC2 instance.
What we can't figure out now is how to then create a project with SparkR as a dependency, compile/jar it, and run any of the various R programs within it.
We're coming from Scala and Java, so we may be coming at this with the wrong mindset. Is this even possible in R, or is it done differently than with Java's build files and jars? Or do you just have to run each R script individually without a packaged jar?
do you just have to run each R script individually without a packaged jar?
More or less. You can create one or more R packages to store reusable parts of your code (see for example devtools::create or R packages) and optionally distribute them over the cluster (since the current public API is limited to high-level interactions with the JVM backend, this shouldn't be required). However, what you pass to spark-submit is simply a single R script which:
creates a SparkContext - SparkR::sparkR.init
creates a SQLContext / HiveContext - SparkR::sparkRSQL.init / SparkR::sparkRHive.init
executes the rest of your code
stops SparkContext - SparkR::sparkR.stop
Assuming that external dependencies are present on the workers, missing packages can be installed at runtime using the "if not require" pattern, for example:
if(!require("some_package")) install.packages("some_package")
or
if(!require("some_package")) devtools::install_github("some_user/some_package")
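Put together, the steps above might look like the following single script. This is only a sketch under several assumptions: it uses the Spark 1.x SparkR functions named in the answer (later Spark releases replaced `sparkR.init` with `sparkR.session`), the application name and the use of the built-in `faithful` dataset are placeholders, and `ensure_package` and `run_job` are hypothetical helper names.

```r
# job.R: a single script passed to spark-submit (Spark 1.x SparkR API,
# matching the function names used in the answer above).

# The "if not require" pattern as a small helper (hypothetical name).
# Returns TRUE if the package was already available, FALSE if it was installed.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (requireNamespace(pkg, quietly = TRUE)) return(TRUE)
  install.packages(pkg, repos = repos)
  FALSE
}

run_job <- function() {
  sc <- SparkR::sparkR.init(appName = "example-job")  # create the SparkContext
  on.exit(SparkR::sparkR.stop(), add = TRUE)          # stop it when the job ends
  sqlContext <- SparkR::sparkRSQL.init(sc)            # create the SQLContext

  # "The rest of your code": count rows per waiting time in a built-in dataset.
  df <- SparkR::createDataFrame(sqlContext, faithful)
  counts <- SparkR::count(SparkR::groupBy(df, "waiting"))
  print(head(SparkR::collect(counts)))                # collect() returns a data.frame
}

# Only start the job when SparkR is actually on the library path.
if (requireNamespace("SparkR", quietly = TRUE)) {
  run_job()
} else {
  message("SparkR not found; launch this script with spark-submit.")
}
```

The script would be launched with something like `spark-submit job.R`; there is no separate compile or packaging step.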

Execute R script from SSIS Package

I want to execute R code from an SSIS package. How can I add a data control step that executes R code? SSIS script tasks support only VB.NET and C#.
SSIS has many data transformations available, but R is very friendly when it comes to data manipulation.
I want to run R code from SSIS scripts or in some other way. Basically, I'm trying to integrate R into an ETL process.
I want to extract (E) data from a CSV file, transform (T) it in R, and load (L) it into a Microsoft database.
Is it possible to get this workflow done in an SSIS package by executing an R script using SSIS data control items? Thanks!
Here are a couple of ways you could integrate R into your ETL process.
Crude, fast and dirty - Execute Process Task in the Control Flow. This would be similar to calling Rscript from the command line. You would likely make your transformation, save it to a file on disk, and get that filename from your Execute Process Task so you can feed it into a Data Flow task. The upside is that you keep your R clean and separate from your C#/VB.
Integrated via RDotNet - You could use the RDotNet library (I believe; I haven't tried to integrate it). You would need to register the DLLs in the GAC, and then you can either work with .NET objects in your SSIS scripts or call R scripts directly.
Integrated in SQL Server 2016 - Microsoft has added R support via a stored procedure (sp_execute_external_script). You call the R script via the stored procedure, use a SQL query for the input data, and can store the output. See more detail here. This would mean using an Execute SQL task in SSIS.
I hope this helps you or someone else. Since you want data processing, you might export your dataset to a CSV file (through a Data Flow task) and run your script with "Rscript " (it can be run as a command from the Execute Process Task). Inside the script, load the dataset into a data frame (for example with read.csv()), do all the calculations you need, and write the results to a CSV file that SSIS then reads back in.
It is not an elegant solution, but it works :) At least until Microsoft integrates R as a control/data flow process.
CYA
PS. Here is how to execute files from the command line: Run R script from command line
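The CSV round trip described above could be sketched as a small script that Rscript runs between two SSIS steps. The script name, the argument order, and the example transformation (trimming whitespace and dropping incomplete rows) are assumptions made for illustration; `transform_data` and `main` are hypothetical names.

```r
# transform.R: called as  Rscript transform.R <input.csv> <output.csv>
# The file names and the example transformation are placeholders.

# Example transformation: trim whitespace in character columns and drop
# rows that contain any missing value.
transform_data <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], trimws)
  df[complete.cases(df), , drop = FALSE]
}

# Read the input CSV, transform it, and write the result as a new CSV.
main <- function(args) {
  if (length(args) != 2) {
    stop("usage: Rscript transform.R <input.csv> <output.csv>")
  }
  input <- read.csv(args[1], stringsAsFactors = FALSE)
  write.csv(transform_data(input), args[2], row.names = FALSE)
}

# Run only when arguments were supplied on the command line.
cli_args <- commandArgs(trailingOnly = TRUE)
if (length(cli_args) > 0) main(cli_args)
```

In SSIS, the Execute Process Task would point at Rscript with the script path and the two file paths as arguments, and a later Data Flow task would load the output CSV into the database.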

Reading LabVIEW TDMS files with R

As part of a transition from MATLAB to R, I am trying to figure out how to read TDMS files created with National Instruments LabVIEW using R. TDMS is a fairly complex binary file format (http://www.ni.com/white-paper/5696/en/).
Add-ons exist for Excel and OpenOffice (http://www.ni.com/white-paper/3727/en/), and I could make something in LabVIEW to do the conversion, but I am looking for a solution that would let me read the TDMS files directly into R. This would allow us to test out the use of R for certain data processing requirements without changing what we do earlier in the data acquisition process. Having a simple process would also reduce the barriers to others trying out R for this purpose.
Does anyone have any experience with reading TDMS files directly into R that they could share?
This is far from supporting the full TDMS specification, but I started a port of the Python npTDMS package to R here https://github.com/msuefishlab/tdmsreader and it has been tested out in the context of a shiny app here
You don't say if you need to automate the reading of these files using R, or just convert the data manually. I'm assuming you or your colleagues don't have any access to LabVIEW yourselves otherwise you could just create a LabVIEW tool to do the conversion (and build it as a standalone application or DLL, if you have the professional development system or app builder - you could run the built app from your R code by passing parameters on a command line).
The document on your first link refers to (a) add-ins for OpenOffice Calc and for Excel, which should work for a manual conversion and which you might be able to automate using those programs' respective macro languages, and (b) a C DLL for reading TDMS - would it be possible for you to use one of those?

Amazon EMR: Using R code in Amazon EMR

I have a very beginner question. I've just been reading through some of the documentation regarding Amazon's EMR. Before I sign up etc. I just wanted to ask about using R in it.
I have one R module that calls several other modules, and then, just before it finishes running, saves several variables as .txt files.
My rather basic question is, can I do this in Amazon's EMR? And will I be able to access the .txt output files? Finally, my R script reads in some data from Excel spreadsheets. Will it still be able to do this from the EMR if I upload the Excel files into the system?
Thanks
Mike
@Mike, answers to your 3 questions below.
Running R on EMR: Yes you can.
You can run R programs on EMR once you have installed R on the EMR instance. I assume that you would write MapReduce modules if you plan to use a multi-instance cluster. If your program is just a "plain" R program, then you may have to use just one sizable instance. I would rather use an EC2 instance with an R AMI (look for Louis Aslett).
Moving output files:
Yes you can. It is possible to transfer your program output from EMR to an S3 storage bucket of your choice. You will have to add a step calling the S3DistCp command to move the files. An example from my project:
--jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,hdfs:///contents,--dest,s3://<bucket-name>/'
Reading spreadsheets: AFAIK, if you are able to do this on a local installation of R, then you should also be able to do it on EMR. You have to ensure that the necessary packages/libraries are installed during the bootstrap process.
I am able to install squeezy-cran and rmr2 on an EMR instance with all their dependencies (Rcpp, reshape2, digest, RJSONIO, functional etc.). I am still unable to call the R program as a step. I have to use an SSH session and run R CMD commands at the shell prompt. Being on Windows, putty.exe works for me.
