Amazon Web Services - how to run an R script daily

I have an R script that I run every day; it scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (which changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if there were some way to have this script run automatically at a pre-defined time, and a buddy of mine said AWS is capable of doing this.
Is this true? If so, what is the specific feature/aspect of AWS that can do this, so that I can look into it further?
Thanks!

Two options come to mind:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly (a sketch of a crontab entry is below).
One easy way to get started: use this AMI.
To execute the script from the command line, R offers the Rscript CLI. See e.g. here for how to set this up.
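For illustration (the path, time, and script name are placeholders, not taken from the question), a crontab entry for a daily 06:30 run could look like this:
# run the scraper every day at 06:30 and append its output to a log
30 6 * * * /usr/bin/Rscript /home/ubuntu/scrape.R >> /home/ubuntu/scrape.log 2>&1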
Go serverless: AWS Lambda is a hosted, serverless compute service. Currently R is not natively supported, but on the official AWS Blog here they offer a step-by-step guide on how to run R. Basically, you execute R from Python using the rpy2 package (a minimal handler sketch is below).
Once you have this set up, schedule the function via CloudWatch Events (roughly a hosted cron job). Here you can find a step-by-step guide on how to do that.
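A minimal handler sketch, assuming the R runtime and rpy2 are bundled with the deployment package as the blog post describes (the script name scrape.R is a placeholder):
import rpy2.robjects as robjects

def lambda_handler(event, context):
    # source the R script; it writes its CSV output to /tmp/
    robjects.r['source']('/var/task/scrape.R')
    return {'status': 'ok'}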
One more thing: you say that your script outputs CSV files. Lambda's local filesystem is ephemeral, so to keep them you will need to write them to persistent storage such as AWS S3. You can do this in R via the aws.s3 package. Another option is to use the AWS SDK for Python (boto3), which is preinstalled in the Lambda runtime. You could, for example, write a CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's upload_file function (a sketch is below).
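For the upload step, a sketch using boto3 (the bucket name and key are placeholders):
import boto3

s3 = boto3.client('s3')
# move the CSV written by the R script from Lambda's ephemeral /tmp/ space to S3
s3.upload_file('/tmp/output.csv', 'my-scraper-bucket', 'output/output.csv')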
IMHO the first option is easier to set up, but the second one is more robust.

It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run things periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't natively support R, so you'd either have to have a Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution, as there are potentially quite a few moving parts.

@stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set a fixed frequency for the trigger or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs

If you are running Windows, one of the easier solutions is to write a .BAT script that runs your R script and then use Windows Task Scheduler to run it as desired.
To call your R script from your batch file, use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify that the path to the Rscript executable and to your R script are correct.
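To schedule the batch file, you can use the Task Scheduler GUI or register the task from a command prompt; as a sketch (task name, time, and batch file path are placeholders):
schtasks /create /tn "DailyRScript" /tr "C:\rscripts\run_hello.bat" /sc daily /st 09:00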

Dockerize your script (write a Dockerfile, build an image; a sketch of a minimal Dockerfile is shown after this list)
Push the image to AWS ECR
Create an AWS ECS cluster and an AWS ECS task definition within the cluster that will run the image from AWS ECR every time it's spun up
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition
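As a sketch of the first step (the base image, package list, and script name are assumptions, not taken from the talk), a minimal Dockerfile could look like this:
FROM rocker/r-ver:4.2.0
RUN R -e "install.packages(c('rvest', 'readr'), repos = 'https://cloud.r-project.org')"
COPY scrape.R /app/scrape.R
CMD ["Rscript", "/app/scrape.R"]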
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud

Related

Can you specify the number of threads for certain tasks in a DAG?

I'm very new to Airflow, and while I have read the docs and some answers about Airflow's configuration regarding parallelism, it seems I have not yet found an answer about specifying the threads used within a task.
My current case is that I have 5 tasks (each in the form of a Python script) that only make API calls (but to different API services) and transform the data. For each task I can make 1000+ calls, so I try to use multithreading in the script. Unfortunately, when I run the multithreaded script in Airflow, it doesn't use the multithreading mechanism in the script. I suspect this is because of an Airflow configuration that overrides the child script, or am I wrong? Any help or answer is appreciated, thank you.
Run your script with a KubernetesPodOperator.
You can use a Python base image and run your script as-is. This should closely mimic how you execute the script locally, except now it's done in a Kubernetes pod (a minimal DAG sketch is below).
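A minimal DAG sketch (the import path varies by Airflow/provider version, and the image and script path are placeholders):
from datetime import datetime
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG("api_calls", start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    call_api = KubernetesPodOperator(
        task_id="call_api",
        name="call-api",
        namespace="default",
        image="my-registry/api-caller:latest",   # image with your script and its dependencies baked in
        cmds=["python", "/scripts/call_api.py"],  # the script's own multithreading works just as it does locally
        get_logs=True,
    )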

Schedule R scripts in AWS

There are several R scripts that need to be run periodically. Currently, I have an EC2 instance where these R scripts run via cron jobs. However, this is not cost-efficient, as the scripts do not run all the time.
I am looking for a service that lets me deploy the R scripts and schedule them, paying only per use, similar to what AWS Lambda offers.
Note: rewriting these scripts is not an option for now, since there are many of them and I do not have the resources for it.
Any ideas or suggestions about it?
You can containerize your scripts and run them on ECS with a cron schedule.
A quick search will give you plenty of examples of dockerizing R scripts, like this one.
You can push the resulting images to AWS ECR, which is a Docker registry, and use the images to define ECS task definitions: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
After that you can run your tasks on a schedule (a sketch of the scheduling step is below).
This way the scripts will only consume compute power while they are working. It still requires some refactoring in the form of containerization, but once you have done it for one script it should scale to all the others.
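As a sketch of the scheduling step (the rule name and schedule are placeholders), you could create the cron rule with the AWS CLI and then attach your ECS task definition as its target:
aws events put-rule --name daily-r-scripts --schedule-expression "cron(30 6 * * ? *)"
aws events put-targets --rule daily-r-scripts --targets file://ecs-target.json
where ecs-target.json would hold the target definition (cluster, task definition ARN, and role) in the EcsParameters format.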
If containerization is still too much work, you can combine the EC2 Instance Scheduler with Scheduled Reserved Instances for savings, but be aware that reserved instances have a lot of limitations if you plan to rely on them for savings.

Creating a one off cron job using firebase and app engine

I'm creating an app where a user can create a piece of data that could be presented in the UI at a later date. I'm trying to find a way to create cron entries dynamically using either Java code (for Android devices) or Node.js code (a Firebase Cloud Function generates a cron job). I haven't been able to find a way to do it dynamically, and based on what I've read it may not be possible. Does anyone out there know a way?
Presently the only way to create GAE cron jobs is via deployment of a cron configuration file (by itself or, in some cases, together with the app code), which today can only be done via CLI tools (from the GAE or Cloud SDKs).
I'm unsure if you'd consider programmatically invoking such CLI tools as qualifying for 'create cron entries dynamically'. If you do, generating the cron config file and scripting the desired CLI-based deployment would be one approach (a sketch is below).
But creating jobs via an API is still a feature request; see How to schedule repeated jobs or tasks from user parameters in Google App Engine?. That question also contains another potentially acceptable approach (along the same lines as the one mentioned in @ceejayoz's comment).
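For illustration (the URL and schedule are placeholders), a minimal cron.yaml looks like this:
cron:
- description: "daily data refresh"
  url: /tasks/refresh
  schedule: every 24 hours
which you would then deploy with gcloud app deploy cron.yaml (or appcfg.py update_cron with the older SDK).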

Unix scripting for services to check

I am struggling to write a script that checks a particular service running on my server and then sends me an email.
So should this script be part of my bash profile so that it's always running?
regards
rick
The .profile, .bashrc and friends are run on login, so they are of no use for background monitoring. Two solutions come to mind:
Either use cron to run your script at predefined intervals,
Or make it loop and use your system's init environment (SysV, Upstart, systemd, ...) to control it.
My recommendation is to stick with cron: it even makes mailing the results dead easy, since cron emails any output your job produces (see the example below).
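For example (the address, interval, and script path are placeholders), a crontab like the following checks every 10 minutes and mails you whatever the script prints:
MAILTO=rick@example.com
*/10 * * * * /home/rick/check_service.sh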

Amazon EMR: Using R code in Amazon EMR

I have a very basic beginner question. I've just been reading through some of the documentation regarding Amazon's EMR. Before I sign up, etc., I just wanted to ask about using R in it.
I have one R module that calls several other modules, and then, just before it finishes running, saves several variables as .txt files.
My rather basic question is, can I do this in Amazon's EMR? And will I be able to access the .txt output files? Finally, my R script reads in some data from Excel spreadsheets. Will it still be able to do this from the EMR if I upload the Excel files into the system?
Thanks
Mike
@Mike, answers to your 3 questions are below.
Running R on EMR: yes, you can.
You can run R programs on EMR once you have installed R on the EMR instance. I assume that you would write MapReduce modules if you plan to use a multi-instance cluster. If your program is just a "plain" R program, then you may as well use one sizable instance. I would rather use an EC2 instance with an R AMI (look for Louis Aslett's).
Moving output files: yes, you can.
It is possible to transfer your program output from EMR to the S3 bucket of your choice. You will have to add a step calling the S3DistCp command to move the files. An example from my project:
--jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,hdfs:///contents,--dest,s3://<bucket-name>/'
Reading spreadsheets: AFAIK, if you are able to do this on a local installation of R, then you should also be able to do it on EMR. You have to ensure that the necessary packages/libraries are installed during the bootstrap process.
I am able to install squeezy-cran and rmr2 on an EMR instance with all their dependencies (Rcpp, reshape2, digest, RJSONIO, functional, etc.). I am still unable to call the R program as a step; I am having to use an SSH session and run R CMD commands at the shell prompt. Being on Windows, putty.exe works for me.
