TensorFlow Transform Python using AWS S3 as data source - apache-beam-io

I am trying to run TensorFlow Transform in Python, with Apache Flink as the Beam runner. I noticed that the Beam Python SDK does not have an AWS S3 I/O connector, and I would like to know about any workaround for this.
Here is the list of supported I/O connectors, but Python + S3 is not even on the roadmap.
I can think of two workarounds:
Mount the S3 bucket as a local drive on an EC2 instance.
Write my own Python S3 connector using their guide.
I want to know if there are other creative (easy) ways out.
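For reference, here is a rough, untested sketch of what I have in mind for workaround 1, assuming the bucket is mounted at a hypothetical path such as /mnt/s3 (e.g. via s3fs-fuse) so that Beam only ever sees local files:

# Hypothetical sketch of workaround 1: treat the S3 bucket mounted at /mnt/s3
# as local storage, so the pipeline never needs an S3 connector.
# Paths and the Flink master address are placeholders and untested.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",  # placeholder Flink master address
])

with beam.Pipeline(options=options) as p:
    (p
     | "ReadMountedS3" >> beam.io.ReadFromText("/mnt/s3/input/*.csv")
     | "WriteMountedS3" >> beam.io.WriteToText("/mnt/s3/output/result"))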
Thanks!

Related

Is there an established means of using AzureStor and arrow together in R?

In the arrow R guide there is info about using S3 buckets, but nothing about using Azure cloud storage. There is an unrelated package, AzureStor, which connects to Azure Storage but uses different syntax, so they don't (seemingly) work together.
Is there an existing adaptation, or an easy way to adapt the AzureStor syntax to a FileSystem class that arrow can use?
I am not sure, but would Azure File Storage help?
Something like what is mentioned in the link below:
https://www.educba.com/azure-file-storage/

Remote execution on a single node with multiple GPUs

I am looking into documentation for running Hydra on a single node remotely. I am looking for a way to take code that lives on my local machine and run it on a GCP instance.
Any pointers?
It sounds like you are looking for a Hydra Launcher that supports GCP.
For now, Hydra does not support this. We do have a Ray Launcher that launches to AWS and could be further extended to launch on GCP. Feel free to subscribe to this issue.

How to download the latest files from an S3 bucket to a local machine using Airflow

Is there a way to download the latest files from an S3 bucket to my local system using Airflow?
Since I am new to Airflow, I don't have much of an idea of how to proceed. Please assist.
Short answer: You could use S3KeySensor to detect when a certain key appears in an S3 bucket and then use S3Hook.read_key() to get the content of the key.
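As a rough sketch of how those pieces fit together (bucket name, key, and connection id are placeholders, and the exact import paths vary a bit between versions of the amazon provider package):

# Minimal sketch: wait for a key to appear, then read it and save a local copy.
# Bucket, key, and connection id are placeholders; adjust for your setup.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


def download_key():
    hook = S3Hook(aws_conn_id="aws_default")
    content = hook.read_key(key="incoming/latest.csv", bucket_name="my-bucket")
    with open("/tmp/latest.csv", "w") as f:
        f.write(content)


with DAG("s3_download_example", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_key="incoming/latest.csv",
        bucket_name="my-bucket",
        aws_conn_id="aws_default",
    )
    fetch_file = PythonOperator(task_id="fetch_file", python_callable=download_key)

    wait_for_file >> fetch_file

The sensor blocks until the key exists, and the PythonOperator then pulls the content down via the hook.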
Assuming you are completely new to Airflow, I would suggest:
Start with the tutorial
Read up on Connections, Hooks, and Sensors
Use this example as a starting point for your own DAG
As a followup:
Browse the amazon provider package docs to see what else there is for working with AWS services
Look through other examples

Can R work with AWS S3 within an AWS Lambda function?

I'm looking for an approach to run a simulation algorithm developed in R in AWS. The input for the R model will come from S3 and the output will need to be written back to S3. The data science group at my organization are R experts, and my organization has identified AWS as its enterprise cloud platform. Given this, I need to find a way to run the R simulation model in AWS.
I saw this blog (https://aws.amazon.com/blogs/compute/analyzing-genomics-data-at-scale-using-r-aws-lambda-and-amazon-api-gateway/) which talks about using Lambda to run R code within Python using the rpy2 package. I plan to follow the same approach. I was planning to implement the Lambda function as below:
1) Read input files from S3 and write them to local Lambda storage (/tmp). This will be done using the Python boto3 SDK.
2) Invoke the R algorithm using rpy2. I plan to save the R algorithm as an .RDS file in S3, load it using rpy2, and then run it. The R algorithm would write the output back to local Lambda storage.
3) Write the output from Lambda storage to S3. Again, this will be done using the Python boto3 SDK.
As you can see, Python is used to interact with S3 and bring files to local Lambda storage. R reads from local Lambda storage and runs the simulation algorithm. All R code will be wrapped within rpy2 in the Lambda function. I planned it this way because I was not sure whether R can work with S3 directly.
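Roughly, the handler I have in mind for steps 1-3 looks like this (untested; bucket and key names are placeholders, and it assumes an R runtime plus rpy2 are packaged with the function):

# Rough sketch of the three-step handler described above.
# Assumes an R runtime plus rpy2 are available in the Lambda environment
# (e.g. via a custom layer); bucket and key names are placeholders.
import boto3
import rpy2.robjects as robjects

s3 = boto3.client("s3")

def handler(event, context):
    # 1) Pull the input and the serialized R model into /tmp
    s3.download_file("my-bucket", "input/data.csv", "/tmp/data.csv")
    s3.download_file("my-bucket", "models/model.rds", "/tmp/model.rds")

    # 2) Run the R algorithm via rpy2; it writes its output to /tmp
    robjects.r("""
        model <- readRDS("/tmp/model.rds")
        input <- read.csv("/tmp/data.csv")
        result <- predict(model, input)   # placeholder for the simulation step
        write.csv(result, "/tmp/output.csv", row.names = FALSE)
    """)

    # 3) Push the result back to S3
    s3.upload_file("/tmp/output.csv", "my-bucket", "output/output.csv")
    return {"status": "ok"}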
I now realize that Lambda local storage is limited to 500 MB, and I doubt the input and output files will stay within this limit. I'm now trying to see if R can work directly with S3 within Lambda. If this is possible, I won't have to bring the files to local Lambda storage and hence will not run out of space. Again, the R-to-S3 interaction will need to be wrapped inside rpy2. Is there a way to achieve this? Can the R cloudyr library work in this scenario? I see examples of cloudyr interacting with S3, but I don't see any example of this usage within Lambda using rpy2.
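What I imagine (again untested, and I am not sure aws.s3 and its system dependencies behave inside Lambda) is something along these lines:

# Hypothetical sketch: let R talk to S3 directly via cloudyr's aws.s3 package,
# called from the Lambda handler through rpy2, so nothing large touches /tmp.
import rpy2.robjects as robjects

def handler(event, context):
    robjects.r("""
        library(aws.s3)
        # Read the input straight from S3 (bucket/object names are placeholders)
        input <- s3read_using(read.csv, object = "input/data.csv", bucket = "my-bucket")
        model <- s3read_using(readRDS, object = "models/model.rds", bucket = "my-bucket")
        result <- predict(model, input)   # placeholder for the simulation step
        # Write the result straight back to S3
        s3write_using(result, write.csv, object = "output/result.csv", bucket = "my-bucket")
    """)
    return {"status": "ok"}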
Any thoughts please?

Amazon Web Services - how to run a script daily

I have an R script that I run every day that scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (that changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if somehow I could have this script run automatically at a pre-defined specific time, and a buddy of mine said AWS is capable of doing this?
Is this true? If so, what is the specific feature/aspect of AWS that is able to do this, so that I can look into it more?
Thanks!
Two options come to mind:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly.
One easy way to get started: use this AMI.
To execute the script, R offers a command-line interface, Rscript. See e.g. here for how to set this up.
Go serverless: AWS Lambda is a hosted, serverless compute service. Currently R is not natively supported, but on the official AWS Blog here they offer a step-by-step guide on how to run R. Basically, you execute R from Python using the rpy2 package.
Once you have this set up, schedule the function via CloudWatch Events (roughly a hosted cron job). Here you can find a step-by-step guide on how to do that.
One more thing: you say that your script outputs CSV files. To persist them, you will need to write them to a file store like AWS S3. You can do this in R via the aws.s3 package. Another option would be to use the AWS SDK for Python (boto3), which is preinstalled in the Lambda runtime: you could, for example, write a CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's upload_file function.
IMHO the first option is easier to set up, but the second one is more robust.
It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't support R natively, so you'd either have to have a Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution, as there are, potentially, quite a few moving parts.
@stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set the trigger to a fixed frequency or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
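For illustration, the scheduling part can also be wired up programmatically with boto3. A rough sketch, where the function name r-scraper and the rule name are placeholders and the same setup can be clicked together in the CloudWatch console:

# Hedged sketch: create a daily CloudWatch Events rule and attach the Lambda
# function as its target. Names are placeholders.
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Fire once a day at 14:00 UTC (adjust the cron expression to the time you need).
rule = events.put_rule(
    Name="daily-r-scraper",
    ScheduleExpression="cron(0 14 * * ? *)",
)

# Allow CloudWatch Events to invoke the function ...
lambda_client.add_permission(
    FunctionName="r-scraper",
    StatementId="daily-r-scraper-trigger",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# ... and register the function as the rule's target.
function_arn = lambda_client.get_function(
    FunctionName="r-scraper")["Configuration"]["FunctionArn"]
events.put_targets(
    Rule="daily-r-scraper",
    Targets=[{"Id": "r-scraper-target", "Arn": function_arn}],
)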
If you are running Windows, one of the easier solutions is to write a .BAT script to run your R script and then use Windows Task Scheduler to run it as desired.
To call your R script from your batch file, use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify that the path to the Rscript executable and the path to your R script are correct.
Dockerize your script (write a Dockerfile, build an image)
Push the image to AWS ECR
Create an AWS ECS cluster and an AWS ECS task definition within the cluster that will run the image from AWS ECR every time it is spun up
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud
