How to download the latest files from an S3 bucket to a local machine using Airflow

Is there a way to download the latest files from an S3 bucket to my local system using Airflow?
Since I am new to Airflow, I don't have much idea of how to proceed. Please assist.

Short answer: You could use S3KeySensor to detect when a certain key appears in an S3 bucket and then use S3Hook.read_key() to get the content of the key.
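A minimal sketch of such a DAG might look like the following (bucket name, key, and connection id are placeholders, and the import paths assume a recent apache-airflow-providers-amazon release):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

def download_from_s3(**context):
    # Read the object's content and write it to a local file on the worker.
    hook = S3Hook(aws_conn_id="aws_default")
    content = hook.read_key(key="incoming/latest.csv", bucket_name="my-bucket")
    with open("/tmp/latest.csv", "w") as f:
        f.write(content)

with DAG(
    dag_id="s3_download_latest",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # use schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_key="incoming/latest.csv",
        bucket_name="my-bucket",
        aws_conn_id="aws_default",
        poke_interval=60,
    )
    download = PythonOperator(task_id="download", python_callable=download_from_s3)
    wait_for_file >> download

Note that read_key() returns the object content as a string and the file lands on whatever machine the Airflow worker runs on; for large or binary files, S3Hook.download_file() may be a better fit.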
Assuming you are completely new to Airflow, I would suggest:
Start with the tutorial
Read up on Connections, Hooks, and Sensors
Use this example as a starting point for your own DAG
As a followup:
Browse the amazon provider package docs to see what else there is for working with AWS services
Look through other examples

Related

What is the best way to export / import from source to target?

We are planning to clone (make a copy of) our existing Artifactory. Our current setup runs Artifactory on an EC2 instance with a Derby DB and the files/artifacts stored in an S3 bucket. In the copy we would like to have Artifactory running on a new EC2 instance, the DB running on MySQL, and the files stored in a different S3 bucket.
We have built a base setup for the target and it is operational.
What is the best way to export/import from source to target? I see options for repository and system export. Should I do a repository export/import?
Thanks
A system export -> import is usually the recommended way.
You can see the process detailed in JFrog's knowledge-base article on migrating Artifactory.
There is also another entry with a video: How to migrate Artifactory from one environment to another?
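If you want to script the export rather than run it from the UI, a rough sketch using Artifactory's system export REST endpoint could look like the following (URL, credentials, and export path are placeholders, and the payload fields should be verified against the REST API documentation for your Artifactory version):

import requests

# Placeholder values - replace with your source instance details.
ARTIFACTORY_URL = "http://source-artifactory:8081/artifactory"
AUTH = ("admin", "password")

# Ask the source instance to write a full system export to a directory
# on its own filesystem; check the payload fields for your version.
payload = {
    "exportPath": "/var/opt/jfrog/export",
    "includeMetadata": True,
    "createArchive": False,
}

response = requests.post(f"{ARTIFACTORY_URL}/api/export/system", json=payload, auth=AUTH, timeout=60)
response.raise_for_status()
print(response.text)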
Whichever way you choose to go, make sure to test well and back up first!

Move a single collection in Firebase Cloud Firestore from one project to another

I have two Firebase projects, one for dev and another for production. I create a bunch of collections and, after they successfully pass the tests, I need to move them into the production database. How can I do this without using Cloud Shell, or are there any alternative database suggestions?
Thank you!
You can export the collection using Cloud Shell.
Manage export and import
If you want to move collections between your dev and production databases without using Cloud Shell, there is an alternative you can follow.
To achieve that, you will need to follow the steps below.
Create a Cloud Storage bucket to hold the data from your source project.
Export the data from your source project to the bucket.
Give your destination project permission to read from the bucket.
Import the data from the bucket into your destination project.
With these steps, you should be able to migrate data between your projects and end up with the structure you want: a development database and a production database between which you can easily transfer data. I would recommend checking the official documentation, Move Data Between Projects, for the full tutorial on the steps above in case you have doubts about any of them.
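A rough sketch of steps 2 and 4 using the Firestore Admin client (project ids, bucket path, and collection ids are placeholders; the credentials used must have access to both projects and the bucket) could look like this:

from google.cloud import firestore_admin_v1

client = firestore_admin_v1.FirestoreAdminClient()

# Step 2: export the collections from the source (dev) project to the bucket.
export_op = client.export_documents(
    request={
        "name": "projects/my-dev-project/databases/(default)",
        "collection_ids": ["users", "orders"],
        "output_uri_prefix": "gs://my-transfer-bucket/firestore-export",
    }
)
export_result = export_op.result()  # block until the long-running export finishes

# Step 4: import the exported data into the destination (production) project.
import_op = client.import_documents(
    request={
        "name": "projects/my-prod-project/databases/(default)",
        "collection_ids": ["users", "orders"],
        "input_uri_prefix": export_result.output_uri_prefix,
    }
)
import_op.result()  # block until the import finishes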
Let me know if the information helped you!

Amazon Web Services - how to run a script daily

I have an R script that I run every day that scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (that changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if somehow I could have this script run automatically at a pre-defined time, and a buddy of mine said AWS is capable of doing this?
Is this true? If so, what specific feature/aspect of AWS is able to do this, so that I can look into it more?
Thanks!
Two options come to mind thinking about this:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly.
One easy way to get started: Use this AMI.
To execute the script, R offers a CLI, Rscript. See e.g. here for how to set this up.
Go serverless: AWS Lambda is a serverless compute service. Currently R is not natively supported, but on the official AWS Blog here they offer a step-by-step guide on how to run R. Basically you execute R from Python using the rpy2 package.
Once you have this set up, schedule the function via CloudWatch Events (roughly a hosted cron job). Here you can find a step-by-step guide on how to do that.
One more thing: you say that your script outputs CSV files. To persist them you will need to put them in a file store like AWS S3. You can do this in R via the aws.s3 package. Another option is to use the AWS SDK for Python (boto3), which is preinstalled in the Lambda environment. You could, for example, write a CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's upload_file function.
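For that last step, a minimal sketch (bucket name and object key are placeholders) would be:

import boto3

# Upload the CSV the R script wrote to /tmp/ into an S3 bucket.
s3 = boto3.client("s3")
s3.upload_file("/tmp/scraped_data.csv", "my-results-bucket", "daily/scraped_data.csv")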
IMHO the first option is easier to set up, but the second one is more robust.
It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run things periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't support R, so you'd either have to have a Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution, as there are potentially quite a few moving parts.
@stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set a frequency for the trigger or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
If you are running Windows, one of the easier solutions is to write a .BAT script to run your R script and then use Windows Task Scheduler to run it as desired.
To call your R script from your batch file, use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify that the path to the Rscript executable and to your R code is correct.
Dockerize your script (write a Dockerfile, build an image)
Push the image to AWS ECR
Create an AWS ECS cluster and an ECS task definition within the cluster that will run the image from AWS ECR every time it's spun up
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition
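A rough boto3 sketch of that last step might look like the following (the rule name, ARNs, and subnet id are placeholders, and the role passed as RoleArn must allow events.amazonaws.com to run the ECS task):

import boto3

events = boto3.client("events")

# Create (or update) a rule that fires once a day at 13:00 UTC.
events.put_rule(
    Name="daily-r-scrape",
    ScheduleExpression="cron(0 13 * * ? *)",
    State="ENABLED",
)

# Point the rule at the ECS task definition so each firing runs one task.
events.put_targets(
    Rule="daily-r-scrape",
    Targets=[
        {
            "Id": "run-scraper-task",
            "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/scraper-cluster",
            "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
            "EcsParameters": {
                "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/scraper:1",
                "TaskCount": 1,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
        }
    ],
)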
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud

Writing an appspec.yml File for Deployment from S3 (and/or Bitbucket) to AWS CodeDeploy

I'd like to make it so that a commit to our Bitbucket repo (or S3 bucket) automatically deploys code (using CodeDeploy) to our EC2 instances. I'm not clear on what to use for the 'source' and 'destination' entries under the 'files' section in the appspec.yml file, and I'm also not clear on what to put in BeforeInstall and AfterInstall under the 'hooks' section. I've found some examples on Google and in the AWS documentation, but I am confused about what to put in these fields. The more I explore, the more confused I get.
Consider that I am new to AWS CodeDeploy.
It would also be very helpful if someone could provide a step-by-step link on how to configure and automate CodeDeploy.
I was wondering if someone could help me out?
Thanks in advance for your help!
Thanks for using CodeDeploy. For new users, I'd like to recommend the following things to do:
Try running the First Run Wizard on the console; it will show you the general process of how a deployment goes. It also provides a default deployment bundle with an appspec file included.
Once you want to try a deployment yourself, the Get Started doc is a great place to help you with some prerequisite settings like the IAM role.
Then probably try some tutorials for a sample app too, which will give you some idea about deployment groups, deployment configurations, revisions, and so on.
The next step should be to create a bundle for your own use case; the AppSpec file doc is a great reference. As for your concerns about BeforeInstall and AfterInstall: if your application doesn't need to do anything, the lifecycle events can be left empty. BeforeInstall can be used for pre-install tasks, such as decrypting files and creating a backup of the current version, while AfterInstall can be used for tasks such as configuring your application or changing file permissions.
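As a rough illustration, a minimal appspec.yml for an EC2/on-premises deployment might look like the following (the destination path and the hook scripts are placeholders for your own application; source: / simply means "copy everything in the revision bundle"):

version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    - location: scripts/before_install.sh
      timeout: 300
      runas: root
  AfterInstall:
    - location: scripts/after_install.sh
      timeout: 300
      runas: root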
Now comes the fun part! This blog post talks about the details of how to integrate with GitHub (similar for Bitbucket). It's a little long, but really useful, and it also covers how to deploy automatically whenever there is a newly pushed commit. Currently Jenkins and CodePipeline are really popular for auto-triggered deployments, but there are plenty of other ways to achieve the same purpose, like Lambda and so on.

Installing Google Cloud Datastore gcd

I'm trying to wrap my head around working with Google Cloud Datastore but am having trouble getting started. I've downloaded the zip of the gcd tool (v1beta2) as described here, which, when unpacked, is comprised of three files: gcd.sh, gcd.cmd, and CloudDatastore.jar. Unfortunately, there are no further instructions on what to do next - where to install it, what path variables or permissions to set, etc. Can someone fill me in?
TIA - Joe
Typical usage looks something like:
# create a dataset
gcd.sh create my-project
# start the local datastore
gcd.sh start my-project
Then, if you're using the Java or Python protocol buffers library, you set a couple of environment variables to instruct the client to use the local datastore:
export DATASTORE_HOST=http://localhost:8080
export DATASTORE_DATASET=my-project
You can find more details about the gcd tool (including instructions for managing indexes) here.
