Airflow logs in S3 bucket

I would like to write the Airflow logs to S3. The following are the parameters we need to set according to the docs:
remote_logging = True
remote_base_log_folder =
remote_log_conn_id =
If Airflow is running in AWS, why do I have to pass the AWS keys? Shouldn't the boto3 API be able to write/read to S3 if the correct permissions are set on the IAM role attached to the instance?

Fair point, but I think it allows for more flexibility if Airflow is not running on AWS, or if you want to use a specific set of credentials rather than give the entire instance access. It might also have been the easier implementation, because the underlying code for writing logs to S3 uses the S3Hook (https://github.com/apache/airflow/blob/1.10.9/airflow/utils/log/s3_task_handler.py#L47), which requires a connection ID.
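For what it's worth, here is a minimal sketch of what that can look like (the bucket name and connection ID below are made up). If the connection carries no credentials at all, the S3Hook falls back to boto3's default credential chain, which includes the instance's IAM role:
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/logs
remote_log_conn_id = aws_s3_logs
The connection itself can then be declared without keys, for example through an environment variable:
export AIRFLOW_CONN_AWS_S3_LOGS=aws://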

Related

Provide aws credentials to Airflow GreatExpectationsOperator

I would like to use GreatExpectationsOperator to perform data quality validations.
The validation results data should be stored in S3.
I don't see an option to pass an Airflow connection name to the GE operator, and the AWS credentials in my organization are stored in an Airflow connection.
How can Great Expectations retrieve the S3 credentials from an Airflow connection, rather than from the default AWS credentials in the ~/.aws directory?
Thanks!
We ended up creating a new operator that inherits from the GE operator; the operator gets the connection as part of its execute method.
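For anyone looking for the shape of that, here is a rough sketch of the approach. The class name and connection ID are made up, and the exact import path of the GE operator may differ depending on the provider version you use:
import os

from airflow.hooks.base import BaseHook  # airflow.hooks.base_hook.BaseHook on Airflow 1.10
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator


class GreatExpectationsWithAwsConnOperator(GreatExpectationsOperator):
    """Resolve AWS credentials from an Airflow connection at execute time."""

    def __init__(self, aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        conn = BaseHook.get_connection(self.aws_conn_id)
        # Export the credentials so boto3 (used by GE's S3 stores) can find them.
        os.environ["AWS_ACCESS_KEY_ID"] = conn.login
        os.environ["AWS_SECRET_ACCESS_KEY"] = conn.password
        region = conn.extra_dejson.get("region_name")
        if region:
            os.environ["AWS_DEFAULT_REGION"] = region
        return super().execute(context)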

VPC creation problem in aws via terraform

I have been trying to create VPC infrastructure in AWS through Terraform, but I am unable to run the "terraform apply" command. Has anyone had a similar problem while using a free trial account?
Error: Error creating VPC: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: 4HZVo3-eWCS-YLhRy55P_0T13F_fPtA29TYrJrSe5_dyPxIcqRbh7_wCcrCZr2cpmb-B5--_fxVaOngBfHD_7yfnPH7NLf1rrqpb7ge1mvQrK8P0Ltfpgpm37nZXezZUoYf1t4peB25aCxnbfeboHpgJjcFnHvqvf5so5G2PufnGZSB4FUZMfdaqppnJ-sNT7b36TonHUDNbLhBVUl5Fwd8d02R-6ZraRYvDx-o4lDfP9xSWs6PMUFXNr1qzruYaeMYMxIe-9kGOQptgBLYZXsxr966ajor-p6aLJAKlIwPGN7Iz7v893oGpGgz_8wxTv4oEb5GnfYOuPOqSyEMLKI69b2JUvVU1m4tCcjKBaHJARP5sIiFSGhh4lb_E0_cKkmmFfKzyET2h8YkSD8U9Lm4rRtGbAEJvIoDZYDkNxlW7W2XvsccmLnQFeSxpLolVhguExkP7DT9uXffJzFEjQn-VkhqKnWlwv0vxIcOcoLP04Li5WAqRRr3l7yK2bYznfg
│ status code: 403, request id: 5c297a4d-7bcf-4bb4-b311-37480e1f26b8
Make sure you have properly set up your AWS credentials and permissions.
Check these two files:
~/.aws/credentials
~/.aws/config
These docs can help you:
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
Did you configure your access keys?
provider "aws" {
region = "us-west-2"
access_key = "my-access-key"
secret_key = "my-secret-key"
}
There are multiple ways to do it (described here).
My example above can be a good start, but you don't want to commit those keys, so I recommend configuring them in ~/.aws/credentials (just as you need them for the AWS CLI). The AWS provider will pick them up automatically, so you don't need to define them anywhere in your Terraform code.
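To illustrate (the key values below are placeholders): with ~/.aws/credentials populated like this,
[default]
aws_access_key_id = my-access-key
aws_secret_access_key = my-secret-key
the provider block only needs a region and no keys at all:
provider "aws" {
  region = "us-west-2"
}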

Using R and paws: How to set credentials using profile in config file?

I use SSO and a profile as defined in ~/.aws/config (MacOS) to access AWS services, for instance:
aws s3 ls --profile myprofilename
I would like to access AWS services from within R, using the paws package. In order to do this, I need to set my credentials in the R code. I want to do this by referencing the profile in the ~/.aws/config file (as opposed to listing access keys in the code), but I haven't been able to figure out how to do this.
I looked at the extensive documentation here, but it doesn't seem to cover my use case.
The best I've been able to come up with is:
x = s3(config = list(credentials = list(profile = "myprofilename")))
x$list_objects()
... which throws an error: "Error in f(): No credentials provided", suggesting that the first line of code above does not connect to my profile as stored in ~/.aws/config.
An alternative is to generate a user/key with programmatic access to your S3 data. Then, assuming that ~/.aws/env contains the values of the generated key:
AWS_ACCESS_KEY_ID=abc
AWS_SECRET_ACCESS_KEY=123
AWS_REGION=us-east-1
insert the following line at the beginning of your file:
readRenviron("~/.aws/env")
This AWS blog provides details about how to get the temporary credentials for programmatic access. If you can get the credentials and set the appropriate environment variables, then the code should work fine without the profile name.
Alternatively, you can try the following if you can get temporary credentials using the AWS CLI.
Check if you can generate temporary credentials:
aws sts assume-role --role-arn <value> --role-session-name <some-meaningful-session-name> --profile myprofilename
If you can execute the above successfully, then you can use this method to automate the process of generating credentials before your code runs.
Put the above command in a bash script get-temp-credentials.sh and have it emit a JSON document containing the temporary credentials, as per the documentation (a sketch is shown after these steps).
Add a new profile, programmatic-access, in ~/.aws/config:
[profile programmatic-access]
credential_process = "/path/to/get-temp-credentials.sh"
Finally, update the R code to use programmatic-access as the profile name.
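In case it helps, here is a rough sketch of what get-temp-credentials.sh could look like (the role ARN is a placeholder, and jq is assumed to be installed). credential_process expects a JSON object with Version, AccessKeyId, SecretAccessKey, SessionToken and Expiration:
#!/usr/bin/env bash
# Assume the role (the ARN below is a placeholder) and reshape the CLI output
# into the JSON document that credential_process expects.
aws sts assume-role \
  --role-arn "arn:aws:iam::123456789012:role/my-role" \
  --role-session-name paws-session \
  --profile myprofilename \
  | jq '{Version: 1,
         AccessKeyId: .Credentials.AccessKeyId,
         SecretAccessKey: .Credentials.SecretAccessKey,
         SessionToken: .Credentials.SessionToken,
         Expiration: .Credentials.Expiration}'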
If you have AWS CLI credentials set up under a profile, e.g. in ~/.aws/config:
[profile myprof]
region=eu-west-2
output=json
.. and credentials, e.g. in ~/.aws/credentials:
[myprof]
aws_access_key_id = XXX
aws_secret_access_key = xxx
.. paws will use these if you add a line to ~/.Renviron:
AWS_PROFILE=myprof
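As a quick sanity check, something like this should then work without passing any credentials in the R code (it only lists your buckets, so it needs nothing but valid credentials):
library(paws)
svc <- s3()
svc$list_buckets()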

Airflow Metadata DB = airflow_db?

I have a project requirement to back up the Airflow metadata DB to a data warehouse (but not using an Airflow DAG). At the same time, the requirement mentions a connection called airflow_db.
I am quite new to Airflow, so I googled a bit on the topic. I am a bit confused about this part. Our Airflow metadata DB is PostgreSQL (this is built from docker-compose, so I am tinkering on a local install), but when I look at Connections in the Airflow web UI, it says airflow_db is MySQL.
I initially assumed that they are the same, but by the looks of it, they aren't? Can someone explain the difference and what they are for?
Airflow creates the airflow_db Conn Id with the MySQL connection type by default (see the source code).
Default connections are not really useful in a production system; it's just a long list of entries that you are probably not going to use.
Airflow 1.10.10 introduced the ability not to create the default list by setting:
load_default_connections = False in airflow.cfg (See PR)
To give more background: the connection list is where hooks find the information needed to connect to a service. It's not related to the backend database. That said, the backend is a database like any other, and if you wish to allow hooks to interact with it, you can define it in the list like any other connection (which is probably why it appears as an option among the defaults).
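For example, you could expose the metadata DB to hooks as an ordinary connection via an environment variable (the host, user and password below are illustrative, matching a typical docker-compose setup):
AIRFLOW_CONN_AIRFLOW_DB=postgres://airflow:airflow@postgres:5432/airflow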

Running AWS commands from commandline on a ShellCommandActivity

My original problem was that I want to increase my DynamoDB write throughput before I run the pipeline, and then decrease it when I'm done uploading (doing it max once a day, so I'm fine with the decreasing limitations).
The only way I found to do it is through a shell script that will issue the API commands to alter the throughput. How does this work with my IAM access_key and secret_key when it's a resource that the pipeline creates for me? (I can't log in to set up the ~/.aws/config file, and I don't really want to create an AMI just for this.)
Should I write the script in Bash? Can I use the Ruby/Python AWS SDK packages, for example? (I prefer the latter.)
How do I pass my credentials to the script? Do I have runtime variables (like #startedDate) that I can pass as arguments to the activity with my key and secret? Is there any other way to authenticate with either the command-line tools or the SDK packages?
If there is another way to solve my original problem, please let me know. I only arrived at the ShellCommandActivity solution because I couldn't find anything else in the documentation/forums.
Thanks!
OK, found it: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-roles.html
The resourceRole in the default object in your pipeline will be the one assigned to resources (Ec2Resource) that are created as a part of the pipeline activation.
The default one is configured to have all your permissions, and the AWS command-line tools and SDK packages automatically look for those credentials, so there is no need to update ~/.aws/config or pass credentials manually.
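So a ShellCommandActivity script can simply call the CLI (or an SDK) directly. For example, something along these lines should work without any explicit keys (the table name and capacity values are made up):
# Runs on the Ec2Resource created by the pipeline; credentials come from the
# instance's resourceRole, so nothing is passed explicitly.
aws dynamodb update-table \
  --table-name MyTable \
  --provisioned-throughput ReadCapacityUnits=10,WriteCapacityUnits=500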
