I'm trying to understand how to use secrets in Airflow.
I've configured HashiCorp Vault as a secrets backend in Airflow:
AIRFLOW__SECRETS__BACKEND: "airflow.contrib.secrets.hashicorp_vault.VaultBackend"
AIRFLOW__SECRETS__BACKEND_KWARGS: '{"url":"http://vault:8200","token":"My_TOKEN_TO_VAULT","variables_path":"variables","mount_point":"airflow","connections_path":"connections"}'
and put a variable there:
docker exec -it VAULT_DOCKER_ID sh
vault login My_TOKEN_TO_VAULT
vault secrets enable -path=airflow -version=2 kv
vault kv put airflow/variables/slack_token value=SOMETHING
Now I'm trying to use this variable in my DAG.
A simple Variable.get('slack_token') does work, but when I try to use it to connect to Slack I get an error.
While debugging I noticed that this variable is printed as "***", so I suppose it is encrypted.
How do I get access to its value?
Thanks :)
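For reference, a minimal sketch (not the original DAG; names and the Slack part are placeholders) of how the variable would typically be read inside a task. The "***" you see is usually Airflow masking sensitive-looking values in the UI/logs rather than encryption, so Variable.get() should still return the real token:

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator  # 1.10.x path, matching the contrib backend above


def notify_slack():
    # Resolved through the Vault secrets backend at call time; even if the UI
    # shows "***", the returned value is the actual secret.
    slack_token = Variable.get("slack_token")
    print("token length:", len(slack_token))  # sanity check without leaking it
    # ... pass slack_token to your Slack client here ...


with DAG("vault_variable_demo", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id="notify_slack", python_callable=notify_slack)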
I want to integrate Vault with a Java application. I'm following this blog post to do it.
The question is: when I have the wrapping token, I want to unwrap it (step 9 in the linked diagram) with an HTTP request and get the secret_id. I see the API documentation here, but it requires an X-Vault-Token header, which I cannot store in my Java application. Without it, the API responds with permission denied.
But when I use the Vault CLI: VAULT_TOKEN=xxxxxxxxxx vault unwrap -field=secret_id, it returns the secret I want (without my logging in to Vault).
If anyone has experience with this, please help. Thank you.
In your linked diagram, step 10 (the next step) is literally "Login with Role ID and Secret ID". If you want to wrap a different secret, then you can change the pattern completely, but the blog post you're referencing wants you to use the Secret ID from the wrapped token response to then log into Vault with that role and get your final secrets.
So: take the output of SECRET_ID=$(VAULT_TOKEN=xxxxxxxxxx vault unwrap -field=secret_id), then log in with resp=$(vault write -format=json auth/approle/login role_id="${ROLE_ID}" secret_id="${SECRET_ID}") and VAULT_TOKEN=$(echo "${resp}" | jq -r .auth.client_token), export VAULT_TOKEN, and finally call Vault for the secret you really want (vault kv get secret/path/to/foobar) and do something with it. Putting that together:
#!/usr/bin/env bash
wrap_token=$(cat ./wrapped_token.txt)
role_id=$(cat ./approle_role_id.txt)
# Unwrap using the wrapping token itself as the Vault token (the CLI sends it
# as the X-Vault-Token header on /v1/sys/wrapping/unwrap)
secret_id=$(VAULT_TOKEN="${wrap_token}" vault unwrap -field=secret_id)
# Log in via AppRole to get a client token for subsequent reads
resp=$(vault write -format=json auth/approle/login role_id="${role_id}" secret_id="${secret_id}")
VAULT_TOKEN=$(echo "${resp}" | jq -r '.auth.client_token')
export VAULT_TOKEN
# Put a secret in a file
# Best to ensure that the fs permissions are suitably restricted
umask 0077
vault kv get -format=json path/to/secret > ./secret_sink.json
# Put a secret in an environment variable
SECRET=$(vault kv get -format=json path/to/secret)
export SECRET
In case you want to reduce the security of your pattern, you can read below...
If you want to avoid logging into Vault and simply hand the app a secret, you can skip many of these steps by having your trusted CI solution request the secret directly, e.g. vault kv get -wrap-ttl=24h secret/path/to/secret. The unwrapping step you're doing will then yield the secret you actually want to use, instead of the intermediary Secret ID that lets you log into Vault and establish an application identity. However, this is not recommended: it requires your CI solution to access more secrets, which is far from least privilege, and it makes it very difficult to audit from the Vault side where secrets are actually being consumed, which is one of the primary benefits of implementing a central secrets management solution like Vault.
I would like to use GreatExpectationsOperator to perform data quality validations.
The validation results data should be stored in S3.
I don't see an option to pass an Airflow connection name to the GE operator, and the AWS credentials in my organization are stored in an Airflow connection.
How can Great Expectations retrieve the S3 credentials from an Airflow connection rather than from the default AWS credentials in the .aws directory?
Thanks!
We ended up creating a new operator that inherits from the GE operator; the operator gets the connection as part of its execute method.
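Not the actual operator mentioned above, but a rough sketch of that approach, assuming an Airflow 2 environment with the Amazon provider and the Great Expectations provider package installed (class name and conn id are illustrative):

import os

from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
# Import path assumed from the Great Expectations Airflow provider package
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator


class GreatExpectationsS3Operator(GreatExpectationsOperator):
    def __init__(self, *, aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        # Resolve AWS credentials from the Airflow connection and expose them
        # the way boto3 (used by GE's S3 stores) expects, instead of ~/.aws.
        creds = AwsBaseHook(aws_conn_id=self.aws_conn_id, client_type="s3").get_credentials()
        os.environ["AWS_ACCESS_KEY_ID"] = creds.access_key
        os.environ["AWS_SECRET_ACCESS_KEY"] = creds.secret_key
        if creds.token:
            os.environ["AWS_SESSION_TOKEN"] = creds.token
        return super().execute(context)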
I'm deploying Airflow 2 on GKE Autopilot using the Helm chart and have provisioned a Cloud SQL instance (MySQL) to be used as the Airflow database.
I have created (using kubectl) a Kubernetes secret with this connection string as its value and want to expose it as an env var to all Airflow pods. So I tried to provide it in the
env: []
section of the chart (line no. 239), but the valueFrom attribute can't be used there; it only accepts value. So I want to know what ways there are to refer to a secret in this Helm chart and provide it as an env var value to all the containers the chart deploys.
Answering my own question so others can find the correct solution:
1. Create the secret with a connection key whose value is the database URI.
2. Disable the postgres deployment in the chart's values.yaml.
3. Change data.metadataSecretName to the secret created in #1. Airflow will pick it up and inject it as the connection URI.
The answer by Harsh Manvar is still valid and correct, but it is better suited to injecting arbitrary secrets as env vars. For changing the database and providing a custom URI, the approach I took is the recommended one: https://airflow.apache.org/docs/helm-chart/stable/production-guide.html#database
You can check out line no. 244, which injects the secret into all pods.
I think it will do the same thing, since this way we can inject the secret as an env variable.
# Secrets for all airflow containers
secret: []
# - envName: ""
# secretName: ""
# secretKey: ""
values.yaml : https://github.com/apache/airflow/blob/main/chart/values.yaml#L243
Documentation details : https://github.com/apache/airflow/blob/main/docs/helm-chart/adding-connections-and-variables.rst#connections-and-sensitive-environment-variables
I would like to use dbt in an MWAA Airflow environment. To achieve this I need to install dbt in the managed environment and then run the dbt commands via Airflow operators or the CLI (BashOperator).
My problem with this solution is that I need to store the dbt profile file(s), which contain the target/source database credentials, in S3. Otherwise the file is not deployed to the Airflow worker nodes and hence cannot be used by dbt.
Is there any other option? I feel this is a big security risk, and it also undermines the point of using Airflow (because I would like to use its built-in password manager).
My ideas:
1. Create the profile file on the fly in the Airflow DAG as a task and write it out locally. I do not think this is a feasible workaround, because there is no guarantee that the dbt task will run on the same worker node where my code created the file.
2. Move the profile file to S3 manually (excluding it from CI/CD). Again, I see a security risk, as I am storing credentials in S3.
3. Create a custom operator which builds the profile file on the same machine the command will run on. A maintenance nightmare.
4. Use MWAA environment variables (https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-env-variables.html) and combine them with dbt's env_var function (https://docs.getdbt.com/reference/dbt-jinja-functions/env_var). Storing credentials in system-wide environment variables feels awkward, though.
Any good ideas or best practices?
@PeterRing, in our case we use dbt Cloud. Once the connection is set up in the Airflow UI, you call dbt Cloud job IDs to trigger the job (then use a sensor to monitor it until it completes).
If you can't use dbt Cloud, perhaps you can use AWS Secrets Manager to store your db profile/creds: Configuring an Apache Airflow connection using a Secrets Manager secret
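To flesh out idea 4 from the question a bit, here is a hedged sketch (connection id, paths and env var names are made up): keep profiles.yml free of credentials by using dbt's env_var(), and have the BashOperator inject the values from an Airflow connection, which MWAA can in turn back with Secrets Manager:

from datetime import datetime

from airflow import DAG
from airflow.hooks.base import BaseHook
from airflow.operators.bash import BashOperator


def _dbt_env():
    # Illustrative connection id; in profiles.yml use e.g.
    #   user: "{{ env_var('DBT_USER') }}"
    #   password: "{{ env_var('DBT_PASSWORD') }}"
    conn = BaseHook.get_connection("dwh_default")
    return {
        "DBT_HOST": conn.host or "",
        "DBT_USER": conn.login or "",
        "DBT_PASSWORD": conn.password or "",
        "DBT_SCHEMA": conn.schema or "",
    }


with DAG("dbt_run", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    BashOperator(
        task_id="dbt_run",
        bash_command="cd /usr/local/airflow/dags/dbt && dbt run --profiles-dir .",  # illustrative paths
        env=_dbt_env(),  # note: resolved at DAG parse time in this simple sketch
    )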
I would like to write the Airflow logs to S3. The following are the parameters we need to set according to the docs:
remote_logging = True
remote_base_log_folder =
remote_log_conn_id =
If Airflow is running in AWS, why do I have to pass the AWS keys? Shouldn't the boto3 API be able to write/read to S3 if the correct permissions are set on the IAM role attached to the instance?
Fair point, but I think it allows for more flexibility if Airflow is not running on AWS, or if you want to use a specific set of credentials rather than giving the entire instance access. It may also simply have been the easier implementation, because the underlying code for writing logs to S3 uses the S3Hook (https://github.com/apache/airflow/blob/1.10.9/airflow/utils/log/s3_task_handler.py#L47), which requires a connection id.
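For what it's worth, the connection id doesn't force you to store keys: if the aws connection is left without credentials, the hook's boto3 session generally falls back to the default credential chain, e.g. the instance's IAM role. A rough illustration of what the remote handler boils down to, using the 1.10-era import path from the link above (bucket and key are made up):

from airflow.hooks.S3_hook import S3Hook  # 1.10.x path, matching the linked s3_task_handler

# remote_log_conn_id points at an "aws" connection; it can be empty of
# credentials so that boto3 uses the instance profile instead of stored keys.
hook = S3Hook(aws_conn_id="aws_default")
hook.load_string(
    "task log contents...",
    key="dag_id/task_id/2020-01-01T00:00:00+00:00/1.log",
    bucket_name="my-airflow-logs",
    replace=True,
)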