Schedule R scripts in AWS

There are several R scripts that need to be run periodically. Currently, I have an EC2 instance where these R scripts run through cron jobs. However, this is not cost-efficient, as the scripts do not run all the time.
I am looking for a service that lets me deploy the R scripts and schedule them, paying only per use, something like what AWS Lambda offers.
Note: rewriting these scripts is not an option for now, since there are many and I do not have the resources for it.
Any ideas or suggestions?

You can containerize your scripts and try to run them on ECS with a cron schedule.
A quick search will give you plenty of examples of dockerizing R scripts.
You can push the resulting images to AWS ECR, which is a Docker registry, and use the images to define ECS task definitions: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
After that, you can run your tasks on a schedule.
This way, the scripts only consume compute while they are actually running. It still requires some refactoring in the form of containerization, but once you have done it for one script, it should carry over to all the others.
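As a rough sketch of the scheduling step, assuming the image has already been pushed to ECR and a task definition exists, the schedule could be created with boto3 roughly like this (the rule name, ARNs, subnet, and cron expression are placeholders, not values from this thread):

import boto3

events = boto3.client("events")

# Hypothetical names and ARNs, for illustration only.
RULE_NAME = "run-r-scripts-nightly"
CLUSTER_ARN = "arn:aws:ecs:eu-west-1:123456789012:cluster/r-scripts"
TASK_DEF_ARN = "arn:aws:ecs:eu-west-1:123456789012:task-definition/r-etl:1"
EVENTS_ROLE_ARN = "arn:aws:iam::123456789012:role/ecsEventsRole"

# Cron-style rule: run every day at 02:00 UTC.
events.put_rule(Name=RULE_NAME, ScheduleExpression="cron(0 2 * * ? *)", State="ENABLED")

# Point the rule at the ECS task so a container is launched on each schedule tick.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{
        "Id": "r-etl-task",
        "Arn": CLUSTER_ARN,
        "RoleArn": EVENTS_ROLE_ARN,
        "EcsParameters": {
            "TaskDefinitionArn": TASK_DEF_ARN,
            "TaskCount": 1,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {"Subnets": ["subnet-0abc1234"], "AssignPublicIp": "ENABLED"}
            },
        },
    }],
)

The same schedule can also be set up through the console or infrastructure-as-code tooling; boto3 is just one way to script it.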
If containerization is still too much work, you can combine the EC2 instance scheduler with scheduled reserved instances for savings, but be aware that scheduled reserved instances come with a lot of limitations.

Related

Can you specify the number of threads for certain tasks in a DAG?

I'm very new to Airflow, and while I have read the docs and some answers about Airflow's configuration regarding parallelism, I have not yet found an answer on how to specify the threads used within a task.
My current case is that I have 5 tasks (each in the form of a Python script) that only make API calls (but to different API services) and transform the data. For each task I can make up to 1000+ calls, so I try to utilize multithreading in the script. Unfortunately, when I run the multithreaded script in Airflow, it doesn't use the multithreading mechanism in the script. I suspect this is because of an Airflow configuration that overrides the child script, or am I wrong? Any help or answer is appreciated, thank you.
Run your script with a KubernetesPodOperator.
You can use a Python base image and run your script as is. This should closely mimic how you execute the script locally, except that now it runs in a Kubernetes pod.
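A minimal sketch of what that could look like, assuming a plain Python base image and a script baked into it; the DAG id, image, script path, and the provider import path are placeholders and vary by Airflow/provider version:

from datetime import datetime

from airflow import DAG
# Import path differs between Airflow/provider versions; this one is from the cncf.kubernetes provider.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="api_calls_in_pod",        # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch_api_data = KubernetesPodOperator(
        task_id="fetch_api_data",
        name="fetch-api-data",
        namespace="default",
        image="python:3.10-slim",              # plain Python base image
        cmds=["python", "/scripts/fetch.py"],  # your existing multithreaded script, unchanged
        get_logs=True,
    )

Because the script runs in its own pod, its internal threading is not constrained by the executor's parallelism settings.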

Cloud Composer/Airflow Task Runner Storage

I'm used to running pipelines via AWS Data Pipeline but am getting familiar with Airflow (Cloud Composer).
In data pipelines we would:
Spawn a task runner,
Bootstrap it,
Do work,
Kill the task runner.
I just realized that my airflow runners are not ephemeral. I touched a file in /tmp, did it again in a separate DagRun, then listed /tmp and found two files. I expected only the one I most recently touched.
This seems to mean I need to watch out for how much "stuff" is stored locally on the runner.
I know GCS mounts the /data folder with FUSE so I'm defaulting to storing a lot of my working files there, and moving files from there to final buckets elsewhere, but how do you approach this? What would be "best practice"?
Thanks for the advice.
Cloud Composer currently uses CeleryExecutor, which configures persistent worker processes that handle the execution of task instances. As you have discovered, you can make changes to the filesystems of the Airflow workers (which are Kubernetes pods), and they will indeed persist until the pod is restarted/replaced.
Best-practice wise, you should consider the local filesystem to be ephemeral to the task instance's lifetime, but you shouldn't expect it to clean up for you. If you have tasks that perform heavy I/O, perform it outside of /home/airflow/gcs, because that path is network mounted (GCSFUSE); if there is final data you want to persist, write it to the data folder (mounted at /home/airflow/gcs/data).
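As a small illustration of that split (the paths follow the answer above; the task body itself is a placeholder):

import os
import shutil
import tempfile

def heavy_io_task():
    # Do heavy intermediate I/O on the worker's local disk, not on the GCSFUSE mount.
    workdir = tempfile.mkdtemp(dir="/tmp")
    scratch = os.path.join(workdir, "intermediate.csv")
    with open(scratch, "w") as f:
        f.write("a,b\n1,2\n")  # placeholder for the real work

    # Persist only the final artifact to the GCS-backed data folder.
    shutil.copy(scratch, "/home/airflow/gcs/data/final.csv")

    # Clean up explicitly: the worker filesystem persists across task runs.
    shutil.rmtree(workdir)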

Best way to distribute code to airflow webserver / scheduler + workers and workflow

What do people find is the best way to distribute code (DAGs) to the Airflow webserver / scheduler + workers? I am trying to run Celery on a cluster of workers so large that manual updates are impractical.
I am deploying Airflow on Docker and using s3fs right now, and it is constantly crashing on me and creating weird core.### files. I am exploring other solutions (i.e. StorageMadeEasy, Dropbox, EFS, a cron job to update from git...) but would love a little feedback as I explore solutions.
Also, how do people typically make updates to DAGs and distribute that code? If one uses a shared volume like s3fs, do you restart the scheduler every time you update a DAG? Is editing the code in place on something like Dropbox asking for trouble? Any best practices on how to update DAGs and distribute the code would be much appreciated.
I can't really tell you what the "best" way of doing it is but I can tell you what I've done when I needed to distribute the workload onto another machine.
I simply set up an NFS share on the Airflow master for both the DAGS and PLUGINS folders and mounted this share onto the worker. I had an issue once or twice where the NFS mount point would break for some reason, but after re-mounting, the jobs continued.
To distribute the DAG and PLUGIN code to the Airflow cluster, I just deploy it to the master (I do this with a bash script on my local machine which SCPs the folders up from my local git branch) and NFS handles the replication to the worker. I always restart everything after a deploy, and I don't deploy while a job is running.
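For what it's worth, a deploy helper in that spirit could be as simple as the following sketch (the host, user, and remote paths are placeholders, and it assumes rsync and SSH access to the master):

import subprocess

# Hypothetical host and paths; NFS then propagates the files to the workers.
AIRFLOW_MASTER = "airflow-master.example.com"
TARGETS = {
    "dags/": "/opt/airflow/dags/",
    "plugins/": "/opt/airflow/plugins/",
}

def deploy():
    for local, remote in TARGETS.items():
        subprocess.run(
            ["rsync", "-az", "--delete", local, f"deploy@{AIRFLOW_MASTER}:{remote}"],
            check=True,
        )

if __name__ == "__main__":
    deploy()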
A better way to deploy would be to have Git on the Airflow master server check out a branch from a repository (test or master, depending on the Airflow server) and then replace the dags and plugins with the ones in the repository. I'm experimenting with doing deployments like this at the moment with Ansible.

Amazon Web Services - how to run a script daily

I have an R script that I run every day that scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (that changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if I could somehow have this script run automatically at a pre-defined time, and a buddy of mine said AWS is capable of doing this?
Is this true? If so, what is the specific feature / aspect of AWS that is able to do this, this way I can look more into it?
Thanks!
Two options come to mind:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly.
One easy way to get started: Use this AMI.
To execute the script, R offers a CLI, Rscript. See e.g. here on how to set this up.
Go serverless: AWS Lambda is a hosted microservice. Currently R is not natively supported, but on the official AWS blog here they offer a step-by-step guide on how to run R. Basically, you execute R from Python using the rpy2 package.
Once you have this set up, schedule the function via CloudWatch Events (roughly a hosted cron job). Here you can find a step-by-step guide on how to do that.
One more thing: you say that your script outputs CSV files. To save them properly you will need to put them in a file store like AWS S3. You can do this in R via the aws.s3 package. Another option is to use the AWS SDK for Python, which is preinstalled in the Lambda environment: write a CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's upload_file function (a sketch of this combination follows below).
IMHO the first option is easier to set up, but the second one is more robust.
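A rough sketch of what the second option's handler could look like; the bucket, key, and script names are placeholders, and it assumes R plus rpy2 have been bundled with the function as described in the linked blog post:

import boto3
import rpy2.robjects as robjects

s3 = boto3.client("s3")

def handler(event, context):
    # Run the unmodified R script; it is assumed to write its result to /tmp/output.csv.
    robjects.r["source"]("scrape.R")

    # Lambda's /tmp is not durable, so move the CSV to S3 before returning.
    s3.upload_file("/tmp/output.csv", "my-results-bucket", "daily/output.csv")
    return {"status": "ok"}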
It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't support R natively, so you'd either have to have the Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution, as there are potentially quite a few moving parts.
@stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set a fixed frequency for the trigger or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
If you are running Windows, one of the easier solutions is to write a .BAT script that runs your R script and then use Windows Task Scheduler to run it as desired.
To call your R script from your batch file, use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify that the path to the Rscript executable and to your R code is correct.
Dockerize your script (write a Dockerfile, build an image)
Push the image to AWS ECR
Create an AWS ECS cluster and an AWS ECS task definition within the cluster that will run the image from AWS ECR every time it's spun up (see the sketch after this list)
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition
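As a sketch of the task definition step, registration could be scripted with boto3 along these lines (the account id, region, role, names, and Fargate sizing are assumptions, not values from this answer):

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="r-script",                               # hypothetical family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[{
        "name": "r-script",
        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/r-script:latest",  # image pushed to ECR
        "essential": True,
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "/ecs/r-script",
                "awslogs-region": "eu-west-1",
                "awslogs-stream-prefix": "ecs",
            },
        },
    }],
)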
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud

How to stop hive/pig install in Amazon Data Pipeline?

I don't need Hive or Pig, and Amazon Data Pipeline by default installs them on any EMR cluster it spins up. This makes testing take longer than it should. Any ideas on how to disable the install?
This is not possible as of today.
The only workaround would be to launch a small EMR cluster that you use for testing (e.g. with a single m1.small master node). Then use it with 'workerGroup' rather than 'runsOn'.
Depending on the type of activities you want to use, the workerGroup field might or might not be supported. But you can always wrap everything in a script (Python, shell, etc.) and use it with ShellCommandActivity.
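For illustration, adding such an activity to a pipeline definition via boto3 might look roughly like the following sketch; the pipeline id, object ids, worker group name, and command are placeholders, and the schedule and default objects of a full definition are omitted:

import boto3

dp = boto3.client("datapipeline")

dp.put_pipeline_definition(
    pipelineId="df-0123456789ABCDEF",   # hypothetical pipeline id
    pipelineObjects=[{
        "id": "MyShellActivity",
        "name": "MyShellActivity",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            # Runs on your long-lived test resource instead of an EMR cluster launched via runsOn.
            {"key": "workerGroup", "stringValue": "my-test-workergroup"},
            {"key": "command", "stringValue": "bash /home/ec2-user/run_tests.sh"},
        ],
    }],
)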
Update (as correctly pointed out by ChristopherB):
From AMI version 3.x, Hive and Pig are bundled in the AMI itself, so the steps do not pull any new packages from S3 but only activate the daemons on the master node. Unless you are worried about them consuming instance resources (CPU, memory, etc.), it should be okay; they do not take noticeable time to run.
