Provide AWS credentials to Airflow GreatExpectationsOperator

I would like to use GreatExpectationsOperator to perform data quality validations.
The validation results data should be stored in S3.
I don't see an option to pass an Airflow connection name to the GE operator, and the AWS credentials in my organization are stored in an Airflow connection.
How can Great Expectations retrieve the S3 credentials from an Airflow connection rather than from the default AWS credentials in the .aws directory?
Thanks!

We ended up creating a new operator that inherits from the GE operator; it retrieves the connection as part of its execute method.
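A minimal sketch of one way such an operator could hand the connection's credentials to Great Expectations, assuming the validation store reaches S3 through boto3 and that the Airflow connection keeps the access key in `login` and the secret key in `password`. The helper name `aws_env_from_connection` and the stand-in connection object are hypothetical; only the standard library is used so the idea can be tried without Airflow installed.

```python
import os
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def aws_env_from_connection(conn):
    """Temporarily export a connection's login/password as AWS env vars."""
    new_values = {
        "AWS_ACCESS_KEY_ID": conn.login,
        "AWS_SECRET_ACCESS_KEY": conn.password,
    }
    # Remember the previous values so they can be restored afterwards.
    saved = {name: os.environ.get(name) for name in new_values}
    os.environ.update(new_values)
    try:
        yield
    finally:
        for name, old_value in saved.items():
            if old_value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old_value


# Stand-in for the object an Airflow connection lookup would return (assumption).
fake_conn = SimpleNamespace(login="AKIAEXAMPLE", password="example-secret")

with aws_env_from_connection(fake_conn):
    key_seen_inside = os.environ["AWS_ACCESS_KEY_ID"]
```

In a subclass of the GE operator, execute could look up the connection (for example with BaseHook.get_connection and a hypothetical aws_conn_id argument) and wrap the call to super().execute(context) in this context manager.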

Related

Create a dynamic database connection in Airflow DAG

I am using Apache Airflow 2.2.3, and I know we can create connections via admin/connections. But I am looking for a way to create a connection using dynamic DB server details.
My DB host, user, and password details come through the DAGRun input config, and I need to read and write data to the DB.
You can read connection details from the DAGRun config:
# Say we gave input {"username": "foo", "password": "bar"}
from airflow.models.connection import Connection

def mytask(**context):
    username = context["dag_run"].conf["username"]
    password = context["dag_run"].conf["password"]
    connection = Connection(login=username, password=password)
However, all operators (that require a connection) in Airflow take an argument conn_id that takes a string identifying the connection in the metastore/env var/secrets backend. At the moment it is not possible to provide a Connection object.
Therefore, if you implement your own Python functions (and use the PythonOperator or @task decorator) or implement your own operators, you should be able to create a Connection object and perform whatever logic you need with it. Existing Airflow operators, however, cannot be used this way, because they only accept a conn_id.
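As a rough illustration of the first half of that approach, the sketch below validates the DAGRun conf and maps it onto keyword arguments that Airflow's Connection constructor accepts (conn_type, host, login, password, port). The function name, the "host" and "port" conf keys, and the Postgres default are assumptions made for the example, not part of the original question.

```python
def connection_kwargs_from_conf(conf, conn_type="postgres"):
    """Map DAGRun conf keys onto keyword arguments for airflow's Connection."""
    required = ("host", "username", "password")
    missing = [key for key in required if key not in conf]
    if missing:
        # Fail early with a clear message instead of a half-built connection.
        raise KeyError(f"DAGRun conf is missing: {', '.join(missing)}")
    return {
        "conn_type": conn_type,       # assumption: a Postgres database
        "host": conf["host"],
        "login": conf["username"],
        "password": conf["password"],
        "port": conf.get("port"),     # optional, None when not supplied
    }


kwargs = connection_kwargs_from_conf(
    {"host": "db.example.com", "username": "foo", "password": "bar"}
)
```

Inside the task, Connection(**connection_kwargs_from_conf(context["dag_run"].conf)) would then build the object that your own database code consumes.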

VPC creation problem in AWS via Terraform

I have been trying to create VPC infrastructure in AWS through Terraform, but the "terraform apply" command fails. Has anyone had a similar problem while using a free trial account?
Error: Error creating VPC: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: 4HZVo3-eWCS-YLhRy55P_0T13F_fPtA29TYrJrSe5_dyPxIcqRbh7_wCcrCZr2cpmb-B5--_fxVaOngBfHD_7yfnPH7NLf1rrqpb7ge1mvQrK8P0Ltfpgpm37nZXezZUoYf1t4peB25aCxnbfeboHpgJjcFnHvqvf5so5G2PufnGZSB4FUZMfdaqppnJ-sNT7b36TonHUDNbLhBVUl5Fwd8d02R-6ZraRYvDx-o4lDfP9xSWs6PMUFXNr1qzruYaeMYMxIe-9kGOQptgBLYZXsxr966ajor-p6aLJAKlIwPGN7Iz7v893oGpGgz_8wxTv4oEb5GnfYOuPOqSyEMLKI69b2JUvVU1m4tCcjKBaHJARP5sIiFSGhh4lb_E0_cKkmmFfKzyET2h8YkSD8U9Lm4rRtGbAEJvIoDZYDkNxlW7W2XvsccmLnQFeSxpLolVhguExkP7DT9uXffJzFEjQn-VkhqKnWlwv0vxIcOcoLP04Li5WAqRRr3l7yK2bYznfg
│ status code: 403, request id: 5c297a4d-7bcf-4bb4-b311-37480e1f26b8
Make sure you have properly set up your AWS credentials and permissions.
Check these two files:
~/.aws/credentials
~/.aws/config
These docs can help:
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
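A small sketch for checking the first of those files, assuming the usual INI layout with one section per profile. The function name `profiles_with_keys` is made up for this example; it only reports which profiles define both key settings and says nothing about whether the keys carry the permissions needed to create a VPC.

```python
import configparser
from pathlib import Path


def profiles_with_keys(credentials_path=Path.home() / ".aws" / "credentials"):
    """Return the profile names that define both an access key and a secret key."""
    # interpolation=None keeps '%' characters in secrets from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(credentials_path)  # a missing file simply yields no sections
    required = ("aws_access_key_id", "aws_secret_access_key")
    return [
        section
        for section in parser.sections()
        if all(parser.has_option(section, option) for option in required)
    ]
```

An empty list from profiles_with_keys() suggests the credentials file is missing or incomplete; a 403 with a populated file points at IAM permissions instead.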
Did you configure your access keys?
provider "aws" {
  region     = "us-west-2"
  access_key = "my-access-key"
  secret_key = "my-secret-key"
}
There are multiple ways to do it (described here).
The example above can be a good start, but you don't want to commit those keys, so I recommend configuring them in ~/.aws/credentials (as you would for the AWS CLI). The aws provider will pick them up automatically, so you don't need to define them anywhere in your Terraform code.

Airflow logs in s3 bucket

I would like to write the Airflow logs to S3. According to the docs, the following parameters need to be set:
remote_logging = True
remote_base_log_folder =
remote_log_conn_id =
If Airflow is running in AWS, why do I have to pass the AWS keys? Shouldn't the boto3 API be able to read from and write to S3 if the correct permissions are set on the IAM role attached to the instance?
Fair point, but I think it allows for more flexibility if Airflow is not running on AWS, or if you want to use a specific set of credentials rather than give the entire instance access. It might also have been an easier implementation, because the underlying code for writing logs to S3 uses the S3Hook (https://github.com/apache/airflow/blob/1.10.9/airflow/utils/log/s3_task_handler.py#L47), which requires a connection id.

How to mask the credentials in the Airflow logs?

I want to make sure that some of the encrypted variables do not appear in the Airflow log.
I am passing AWS keys to an Exasol EXPORT SQL statement, and they are being printed in the Airflow log.
Currently, this is not possible out of the box. You can, however, configure your own Python logger and use that class by changing the logging_config_class property in the airflow.cfg file.
Example here: Mask out sensitive information in python log
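A minimal sketch of the kind of filter such a custom logging configuration could install, using only Python's logging module. The class name `RedactingFilter`, the demo logger, and the access-key pattern are illustrative assumptions; wiring the filter into Airflow through logging_config_class is not shown here.

```python
import io
import logging
import re


class RedactingFilter(logging.Filter):
    """Replace anything matching the given patterns with '***'."""

    def __init__(self, patterns):
        super().__init__()
        self._patterns = [re.compile(pattern) for pattern in patterns]

    def filter(self, record):
        message = record.getMessage()  # merge msg and args before redacting
        for pattern in self._patterns:
            message = pattern.sub("***", message)
        record.msg, record.args = message, ()
        return True  # keep the record; only its text was changed


# Demo: attach the filter to a handler so every record it emits is redacted.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter([r"AKIA[0-9A-Z]{16}"]))  # illustrative pattern

demo_logger = logging.getLogger("redaction_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)
demo_logger.info("EXPORT using access key %s", "AKIAABCDEFGHIJKLMNOP")
```

Attaching the filter to the handler rather than to a single logger matters, because logger-level filters are not applied to records propagated from child loggers.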
Are the AWS keys sent as part of the data for SQL export or are they sent for the connection?
If they are sent for the connection, then hiding these credentials is possible. You would simply have to create a connection and export the data using that connection.

What's the best method for passing AWS credentials as user data to an EC2 instance?

I have a job processing architecture based on AWS that requires EC2 instances to query S3 and SQS. In order for running instances to have access to the API, the credentials are sent as user data (-f) in the form of a base64-encoded shell script. For example:
$ cat ec2.sh
...
export AWS_ACCOUNT_NUMBER='1111-1111-1111'
export AWS_ACCESS_KEY_ID='0x0x0x0x0x0x0x0x0x0'
...
$ zip -P 'secret-password' ec2.zip ec2.sh
$ openssl enc -base64 -in ec2.zip
Many instances are launched...
$ ec2run ami-a83fabc0 -n 20 -f ec2.zip
Each instance decodes and decrypts ec2.zip using the 'secret-password', which is hard-coded into an init script. Although it does work, I have two issues with my approach:
1. 'zip -P' is not very secure
2. The password is hard-coded in the instance (it's always 'secret-password')
The method is very similar to the one described here
Is there a more elegant or accepted approach? Using gpg to encrypt the credentials and storing the private key on the instance to decrypt it is an approach I'm considering now but I'm unaware of any caveats. Can I use the AWS keypairs directly? Am I missing some super obvious part of the API?
You can store the credentials on the machine (or transfer, use, then remove them.)
You can transfer the credentials over a secure channel (e.g. using scp with non-interactive authentication e.g. key pair) so that you would not need to perform any custom encryption (only make sure that permissions are properly set to 0400 on the key file at all times, e.g. set the permissions on the master files and use scp -p)
If the above does not answer your question, please provide more specific details re. what your setup is and what you are trying to achieve. Are EC2 actions to be initiated on multiple nodes from a central location? Is SSH available between the multiple nodes and the central location? Etc.
EDIT
Have you considered parameterizing your AMI, requiring those who instantiate your AMI to first populate the user data (ec2-run-instances -f user-data-file) with their AWS keys? Your AMI can then dynamically retrieve these per-instance parameters from http://169.254.169.254/1.0/user-data.
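A minimal sketch of how an instance could read those parameters, assuming the metadata endpoint quoted above answers plain HTTP GET requests. The `opener` parameter is a hypothetical hook added only so the function can be exercised off EC2; instances that enforce session tokens for metadata requests (IMDSv2) would need an extra token step that is not shown.

```python
import urllib.request

# URL quoted in the answer; it is only reachable from inside an EC2 instance.
USER_DATA_URL = "http://169.254.169.254/1.0/user-data"


def fetch_user_data(url=USER_DATA_URL, timeout=2.0, opener=urllib.request.urlopen):
    """Return the raw user-data bytes supplied when the instance was launched."""
    with opener(url, timeout=timeout) as response:
        return response.read()
```

The short timeout keeps the call from hanging when the script is run somewhere the link-local address does not exist.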
UPDATE
OK, here goes a security-minded comparison of the various approaches discussed so far:
1. Security of data when stored in the AMI user-data unencrypted: low
   - clear-text data is accessible to any user who manages to log onto the AMI and has access to telnet, curl, wget, etc. (can access clear-text http://169.254.169.254/1.0/user-data)
   - you are vulnerable to proxy request attacks (e.g. attacker asks the Apache that may or may not be running on the AMI to get and forward the clear-text http://169.254.169.254/1.0/user-data)
2. Security of data when stored in the AMI user-data and encrypted (or decryptable) with an easily obtainable key: low
   - an easily obtainable key (password) may include:
     - a key hard-coded in a script inside an AMI (where the AMI can be obtained by an attacker)
     - a key hard-coded in a script on the AMI itself, where the script is readable by any user who manages to log onto the AMI
     - any other easily obtainable information such as public keys, etc.
     - any private key (its public key may be readily obtainable)
   - given an easily obtainable key (password), the same problems identified in point 1 apply, namely:
     - the decrypted data is accessible to any user who manages to log onto the AMI and has access to telnet, curl, wget, etc. (can access clear-text http://169.254.169.254/1.0/user-data)
     - you are vulnerable to proxy request attacks (e.g. attacker asks the Apache that may or may not be running on the AMI to get and forward the encrypted http://169.254.169.254/1.0/user-data, subsequently decrypted with the easily obtainable key)
3. Security of data when stored in the AMI user-data and encrypted with a key that is not easily obtainable: average
   - the encrypted data is accessible to any user who manages to log onto the AMI and has access to telnet, curl, wget, etc. (can access encrypted http://169.254.169.254/1.0/user-data)
   - an attempt to decrypt the encrypted data can then be made using brute-force attacks
4. Security of data when stored on the AMI, in a secured location (no added value for it to be encrypted): higher
   - the data is only accessible to one user, the user who requires the data in order to operate
     - e.g. file owned by user:user with mask 0600 or 0400
   - an attacker must be able to impersonate the particular user in order to gain access to the data
   - additional security layers, such as denying the user direct log-on (having to pass through root for interactive impersonation), improve security
So any method involving the AMI user-data is not the most secure, because gaining access to any user on the machine (weakest point) compromises the data.
This could be mitigated if the S3 credentials were only required for a limited period of time (i.e. during the deployment process only), if AWS allowed you to overwrite or remove the contents of user-data when done with it (but this does not appear to be the case.) An alternative would be the creation of temporary S3 credentials for the duration of the deployment process, if possible (compromising these credentials, from user-data, after the deployment process is completed and the credentials have been invalidated with AWS, no longer poses a security threat.)
If the above is not applicable (e.g. S3 credentials needed by deployed nodes indefinitely) or not possible (e.g. cannot issue temporary S3 credentials for deployment only) then the best method remains to bite the bullet and scp the credentials to the various nodes, possibly in parallel, with the correct ownership and permissions.
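For the "secured location" option, a short sketch of writing a credentials file that only its owner can access and then verifying the mode, assuming a POSIX system. The helper names `write_private_file` and `is_owner_only` are hypothetical and exist only for this illustration.

```python
import os
import stat


def write_private_file(path, data, mode=0o600):
    """Create path with owner-only permissions and write data (bytes) to it."""
    # O_EXCL refuses to reuse an existing file that may have looser permissions.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    fd = os.open(path, flags, mode)
    with os.fdopen(fd, "wb") as handle:
        os.fchmod(fd, mode)  # set the mode explicitly, independent of the umask
        handle.write(data)


def is_owner_only(path):
    """True when neither group nor others have any permission bits set."""
    return stat.S_IMODE(os.stat(path).st_mode) & 0o077 == 0
```

Passing mode=0o400 instead gives the read-only variant mentioned above; a check like is_owner_only can run at start-up so a process refuses to use a credentials file that has become readable by other users.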
I wrote an article examining various methods of passing secrets to an EC2 instance securely and the pros & cons of each.
http://www.shlomoswidler.com/2009/08/how-to-keep-your-aws-credentials-on-ec2/
The best way is to use instance profiles. The basic idea is:
1. Create an instance profile
2. Create a new IAM role
3. Assign a policy to the previously created role, for example:
   {
     "Statement": [
       {
         "Sid": "Stmt1369049349504",
         "Action": "sqs:*",
         "Effect": "Allow",
         "Resource": "*"
       }
     ]
   }
4. Associate the role and instance profile together.
5. When you start a new EC2 instance, make sure you provide the instance profile name.
If all works well, and the library you use to connect to AWS services from within your EC2 instance supports retrieving the credentials from the instance meta-data, your code will be able to use the AWS services.
A complete example taken from the boto-user mailing list:
First, you have to create a JSON policy document that represents what services and resources the IAM role should have access to. For example, this policy grants all S3 actions for the bucket "my_bucket". You can use whatever policy is appropriate for your application.
BUCKET_POLICY = """{
"Statement":[{
"Effect":"Allow",
"Action":["s3:*"],
"Resource":["arn:aws:s3:::my_bucket"]}]}"""
Next, you need to create an Instance Profile in IAM.
import boto
c = boto.connect_iam()
instance_profile = c.create_instance_profile('myinstanceprofile')
Once you have the instance profile, you need to create the role, add the role to the instance profile and associate the policy with the role.
role = c.create_role('myrole')
c.add_role_to_instance_profile('myinstanceprofile', 'myrole')
c.put_role_policy('myrole', 'mypolicy', BUCKET_POLICY)
Now, you can use that instance profile when you launch an instance:
ec2 = boto.connect_ec2()
ec2.run_instances('ami-xxxxxxx', ..., instance_profile_name='myinstanceprofile')
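As a side note on the policy string used in this example, the sketch below builds a similar document with the json module instead of a hand-written string. It is an assumption-laden variant rather than the mailing-list code: the function name is made up, a "Version" element is added, and the object-level ARN ("my_bucket/*") is listed next to the bucket ARN, since object actions such as s3:GetObject are matched against object ARNs rather than the bucket ARN.

```python
import json


def s3_bucket_policy(bucket, actions=("s3:*",)):
    """Return an IAM policy document (JSON string) for one S3 bucket."""
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # bucket-level actions (e.g. listing)
                    f"arn:aws:s3:::{bucket}/*",  # object-level actions (get/put)
                ],
            }
        ],
    }
    return json.dumps(document, indent=2)


BUCKET_POLICY = s3_bucket_policy("my_bucket")
```

Narrowing the actions argument, for example to ("s3:GetObject",), keeps the role closer to least privilege than the blanket "s3:*".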
I'd like to point out that it is no longer necessary to supply any credentials to your EC2 instance. Using IAM, you can create a role for your EC2 instances. In these roles, you can set fine-grained policies that allow your EC2 instance to, for example, get a specific object from a specific S3 bucket and no more. You can read more about IAM Roles in the AWS docs:
http://docs.aws.amazon.com/IAM/latest/UserGuide/WorkingWithRoles.html
As others have already pointed out here, you don't really need to store AWS credentials on an EC2 instance if you use IAM Roles:
https://aws.amazon.com/blogs/security/a-safer-way-to-distribute-aws-credentials-to-ec2/.
I will add that you can use the same method to securely store non-AWS credentials for your EC2 instance, for example database credentials you want to keep secure. You save the non-AWS credentials in an S3 bucket and use an IAM role to access that bucket.
You can find more detailed information on that here: https://aws.amazon.com/blogs/security/using-iam-roles-to-distribute-non-aws-credentials-to-your-ec2-instances/
