I have a very beginner question. I've just been reading through some of the documentation for Amazon's EMR. Before I sign up, I wanted to ask about using R with it.
I have one R module that calls several other modules, and then, just before it finishes running, saves several variables as .txt files.
My rather basic question is: can I do this in Amazon's EMR? And will I be able to access the .txt output files? Finally, my R script reads in some data from Excel spreadsheets. Will it still be able to do this on EMR if I upload the Excel files into the system?
Thanks
Mike
@Mike, answers to your 3 questions are below.
Running R on EMR: Yes, you can.
You can run R programs on EMR once you have installed R on the EMR instances. I assume that you would write MapReduce modules if you plan to use a multi-instance cluster. If your program is just a "plain" R program, then you may only need one sizable instance; in that case I would rather use an EC2 instance with an R AMI (look for Louis Aslett's).
Moving output files:
Yes you can. It is possible to transfer your program output from EMR to an S3 bucket of your choice. You will have to add a step calling the S3DistCp command to move the files. An example from my project:
--jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,hdfs:///contents,--dest,s3://<bucket-name>/'
Reading spreadsheets: AFAIK, if you are able to do this on a local installation of R, then you should also be able to do it on EMR. You have to ensure that the necessary packages/libraries are installed during the bootstrap process.
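For example, a minimal sketch of reading a spreadsheet and writing a .txt output on the instance (the readxl package and the file paths here are assumptions, not something prescribed by EMR; readxl would need to be installed at bootstrap):

library(readxl)
# read the uploaded spreadsheet (path is a placeholder)
dat <- read_excel("/home/hadoop/data/input.xlsx", sheet = 1)
# ... your processing modules run here ...
# write the results as a tab-separated .txt file
write.table(dat, "/home/hadoop/output/result.txt", sep = "\t", row.names = FALSE)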
I am able to install squeezy-cran and rmr2 on an EMR instance with all their dependencies (Rcpp, reshape2, digest, RJSONIO, functional, etc.). I am still unable to call the R program as a step; I am having to use an SSH session and run R CMD commands at the shell prompt. Being on Windows, putty.exe works for me.
I can create the Batch service resources using PowerShell as described here: https://learn.microsoft.com/en-us/azure/batch/batch-powershell-cmdlets-get-started
I want to run an R script on the nodes, and I need R installed on the nodes, as none of the available VMs (Windows or Linux) come with R installed. I have currently installed R by manually logging into the VM, but I want to create the Batch resources and then install R on the nodes, preferably through a script, before I run the R code. How can I go about this?
There are 4 main ways to load the necessary software onto VMs:
Create a start task, along with potentially resource files, to prep the compute node per your requirements (a sketch follows this list).
Create a custom image that already contains all of your software preconfigured.
Use containers instead of directly loading software on the compute node.
Utilize application packages.
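For option 1, a start task command line on an Ubuntu-based Linux pool could be as simple as the line below. This assumes the start task is configured to run with elevated (admin) privileges; adjust the package manager for other distros:

/bin/bash -c "apt-get update && apt-get install -y r-base"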
I have an R script that I run every day that scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (that changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if somehow I could have this script run automatically at a predefined time, and a buddy of mine said AWS is capable of doing this?
Is this true? If so, what is the specific feature / aspect of AWS that is able to do this, this way I can look more into it?
Thanks!
Two options come to mind thinking about this:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly.
One easy way to get started: Use this AMI.
To execute the script, R offers a command-line interface, Rscript. See e.g. here for how to set this up.
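For example, a crontab entry that runs the script every day at 06:00 could look like the line below (the script and log paths are placeholders, and Rscript is assumed to be on the PATH):

0 6 * * * Rscript /home/ubuntu/scraper.R >> /home/ubuntu/scraper.log 2>&1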
Go serverless: AWS Lambda is a hosted microservice. Currently R is not natively supported, but on the official AWS blog here they offer a step-by-step guide on how to run R. Basically, you execute R from Python using the rpy2 package.
Once you have this set up, schedule the function via CloudWatch Events (roughly a hosted cron job). Here you can find a step-by-step guide on how to do that.
One more thing: you say that your function outputs CSV files. To save them properly you will need to put them into a file storage service like AWS S3. You can do this in R via the aws.s3 package. Another option would be to use the AWS SDK for Python, which is preinstalled in the Lambda function: you could e.g. write a CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's S3 upload_file function.
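A sketch of that upload step with the aws.s3 package (the bucket name and data frame are placeholders; credentials are assumed to come from environment variables or an attached IAM role):

library(aws.s3)
# 'scraped_data' stands in for whatever data frame your script produces
write.csv(scraped_data, "/tmp/scraped.csv", row.names = FALSE)
# copy the file from /tmp/ to the S3 bucket
put_object(file = "/tmp/scraped.csv", object = "scraped.csv", bucket = "my-scraper-bucket")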
IMHO the first option is easier to set up, but the second one is more robust.
It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't support R, so you'd either have to have a Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution, as there are potentially quite a few moving parts.
@stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set a frequency for the trigger or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
If you are running Windows, one of the easier solutions is to write a .BAT script to run your R script and then use Windows Task Scheduler to run it as desired.
To call your R script from your batch file, use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify that the path to the Rscript executable and to your R code is correct.
Dockerize your script (write a Dockerfile, build an image; a minimal sketch follows this list)
Push the image to AWS ECR
Create an AWS ECS cluster and an AWS ECS task definition within the cluster that will run the image from AWS ECR every time it is spun up
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition
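A minimal Dockerfile for the first step might look something like this (the rocker/r-base image, the package list, and the script name are assumptions to adapt to your project):

FROM rocker/r-base:latest
# install whatever packages the script needs
RUN R -e "install.packages(c('rvest', 'dplyr'), repos = 'https://cloud.r-project.org')"
COPY scraper.R /scraper.R
CMD ["Rscript", "/scraper.R"]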
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud
I have an R script that does stuff with a bunch of tweets, and I would like to use the same script on the same data but saved in a Hadoop file system. According to this Hortonworks tutorial I could use R code with data from my HDFS, but it is not quite clear.
Can I use the very same R script, taking advantage of the MapReduce paradigm, by using this Revolution R? Should I change my code, or is there a way to execute the same functions optimized for a Hadoop architecture?
My wish would be to write my code in a standard R IDE like RStudio and then use it, or most of it, on my cloud services (such as Microsoft Azure) with MapReduce underneath.
Yes, you can run any R script across different data platforms, from Hadoop to Spark to Teradata and SQL Server, by using an environment-specific compute context.
The following two links should help you get started with Revolution R / Microsoft R Server on Hadoop:
https://msdn.microsoft.com/en-us/microsoft-r/scaler-hadoop-getting-started
https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/MicrosoftR/Samples/NYCTaxi/NYC2013_MRS_LinearBinary.Rmd
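As a rough illustration of what an environment-specific compute context looks like in ScaleR (the connection values below are placeholders and depend entirely on your cluster setup):

library(RevoScaleR)
# point ScaleR's rx* functions at the Hadoop cluster
cc <- RxHadoopMR(sshUsername = "hdfsuser",
                 sshHostname = "cluster-headnode",
                 hdfsShareDir = "/user/RevoShare/hdfsuser",
                 shareDir = "/var/RevoShare/hdfsuser")
rxSetComputeContext(cc)
# ... rx* calls (rxSummary, rxLinMod, etc.) now execute on the cluster ...
rxSetComputeContext("local")   # switch back to local execution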
I have tried a number of things to put files into my EC2 RStudio instance, particularly uploading via PuTTY, adding Dropbox (via Louis Aslett's AMI, but only very small files sync), using FileZilla and also WinSCP. Unfortunately, after several months, I can get them in, but they are corrupted on arrival.
There are some older questions related to this
To access S3 bucket from R
but they all quote packages that are no longer maintained or have no accepted answer.
This seems possible in Python using boto, and there are a lot more answered questions on this for Python.
I think maybe AWS might prefer using S3, so I am trying that. Is there any R package that is current and works for getting files from S3 into EC2? I have set up EC2 and S3. Does anyone have the R code lines to use, and is there anything behind the scenes that needs to be done? I was told by a colleague it was possible to do this in R with only 4 lines of code once the packages were installed, but she does not know how to actually do it.
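For what it's worth, a minimal sketch of pulling a file from S3 into R on the EC2 instance using the aws.s3 package (the bucket and object names are placeholders; credentials are assumed to come from environment variables or an IAM role attached to the instance):

library(aws.s3)
bucketlist()                                   # quick check that the connection works
save_object(object = "data/myfile.csv",        # download the object to local disk
            bucket = "my-bucket",
            file   = "myfile.csv")
dat <- read.csv("myfile.csv")                  # then read it as usual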
We have successfully gone through all the SparkR tutorials about setting it up and running basic programs in RStudio on an EC2 instance.
What we can't figure out now is how to then create a project with SparkR as a dependency, compile/jar it, and run any of the various R programs within it.
We're coming from Scala and Java, so we may be coming at this with the wrong mindset. Is this even possible in R? Is it done differently than Java's build files and JARs, or do you just have to run each R script individually without a packaged JAR?
do you just have to run each R script individually without a packaged jar?
More or less. While you can create an R package (or packages) to store reusable parts of your code (see for example devtools::create or R packages) and optionally distribute it over the cluster (since the current public API is limited to high-level interactions with the JVM backend, it shouldn't be required), what you pass to spark-submit is simply a single R script which (see the sketch after this list):
creates a SparkContext - SparkR::sparkR.init
creates a SQLContext / HiveContext - SparkR::sparkRSQL.init / SparkR::sparkRHive.init
executes the rest of your code
stops SparkContext - SparkR::sparkR.stop
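Putting those pieces together, a minimal driver script to pass to spark-submit might look like the sketch below (written against the pre-2.0 SparkR API that the functions above come from; the app name is just a placeholder):

library(SparkR)
sc <- sparkR.init(appName = "my-sparkr-job")   # create the SparkContext
sqlContext <- sparkRSQL.init(sc)               # create the SQLContext
# ... the rest of your code, e.g.:
df <- createDataFrame(sqlContext, faithful)
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
sparkR.stop()                                  # stop the SparkContext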
Assuming that external dependencies are present on the workers, missing packages can be installed at runtime using the "if not require" pattern, for example:
if(!require("some_package")) install.packages("some_package")
or
if(!require("some_package")) devtools::install_github("some_user/some_package")