Is it possible to pre-define variables, connections etc. in a file so that they are loaded when Airflow starts up? Setting them through the UI is not great from a deployment perspective.
Cheers
Terry
I'm glad that someone asked this question. Since Airflow completely exposes its underlying SQLAlchemy models to the end user, programmatic manipulation (creation, update and deletion) of all Airflow models is possible, particularly those used to supply configs, like Connection and Variable.
It may not be very obvious, but the open-source nature of Airflow means there are no secrets: you just need to look a little harder. For these use-cases in particular, I've always found cli.py to be a very useful reference point.
So here's the snippet I use to create all MySQL connections while setting up Airflow. The input file is JSON with the structure shown after the code.
# all imports
import json
from typing import List, Dict, Any, Optional
from airflow.models import Connection
from airflow.settings import Session
from airflow.utils.db import provide_session
from sqlalchemy.orm import exc
# trigger method
def create_mysql_conns(file_path: str) -> None:
    """
    Reads MySQL connection settings from a given JSON file and
    persists them in Airflow's meta-db. If a connection for the same
    db already exists, it is overwritten
    :param file_path: Path to JSON file containing MySQL connection settings
    :type file_path: str
    :return: None
    :type: None
    """
    with open(file_path) as json_file:
        json_data: List[Dict[str, Any]] = json.load(json_file)
    for settings_dict in json_data:
        db_name: str = settings_dict["db"]
        conn_id: str = "mysql.{db_name}".format(db_name=db_name)
        mysql_conn: Connection = Connection(conn_id=conn_id,
                                            conn_type="mysql",
                                            host=settings_dict["host"],
                                            login=settings_dict["user"],
                                            password=settings_dict["password"],
                                            schema=db_name,
                                            # assumption: fall back to MySQL's standard port 3306 when the file gives none
                                            port=settings_dict.get("port", 3306))
        create_and_overwrite_conn(conn=mysql_conn)
# utility delete method
@provide_session
def delete_conn_if_exists(conn_id: str, session: Optional[Session] = None) -> bool:
    # Code snippet borrowed from airflow.bin.cli.connections(..)
    try:
        to_delete: Connection = (session
                                 .query(Connection)
                                 .filter(Connection.conn_id == conn_id)
                                 .one())
    except exc.NoResultFound:
        return False
    except exc.MultipleResultsFound:
        return False
    else:
        session.delete(to_delete)
        session.commit()
        return True
# utility overwrite method
@provide_session
def create_and_overwrite_conn(conn: Connection, session: Optional[Session] = None) -> None:
    delete_conn_if_exists(conn_id=conn.conn_id)
    session.add(conn)
    session.commit()
input JSON file structure
[
    {
        "db": "db_1",
        "host": "db_1.hostname.com",
        "user": "db_1_user",
        "password": "db_1_passwd"
    },
    {
        "db": "db_2",
        "host": "db_2.hostname.com",
        "user": "db_2_user",
        "password": "db_2_passwd"
    }
]
Reference links
With code, how do you update an airflow variable?
Problem updating the connections in Airflow programatically
How to create, update and delete airflow variables without using the GUI?
Programmatically clear the state of airflow task instances
airflow/api/common/experimental/pool.py
Related
Airflow version: 2.0.2
I am trying to create an EMR cluster using data retrieved from AWS Secrets Manager.
I am writing an Airflow DAG, and my task is to get data from this get_secret function and use it in SPARK_STEPS.
def get_secret():
    secret_name = Variable.get("secret_name")
    region_name = Variable.get("region_name")
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(service_name='secretsmanager', region_name=region_name)
    account_id = boto3.client('sts').get_caller_identity().get('Account')
    try:
        get_secret_value_response = client.get_secret_value(SecretId=secret_name)
        if 'SecretString' in get_secret_value_response:
            secret_str = get_secret_value_response['SecretString']
            secret = json.loads(secret_str)
            airflow_path = secret["airflow_path"]
            return airflow_path
    ...
I need to use the "airflow_path" return value in the SPARK_STEPS below:
SPARK_STEPS = [
    {
        'Name': 'Spark-Submit Command',
        "ActionOnFailure": "CONTINUE",
        'HadoopJarStep': {
            "Jar": "command-runner.jar",
            "Args": [
                'spark-submit',
                '--py-files',
                's3://'+airflow_path+'-pyspark/pitchbook/config.zip,s3://'+airflow_path+'-pyspark/pitchbook/jobs.zip,s3://'+airflow_path+'-pyspark/pitchbook/DDL.zip',
                's3://'+airflow_path+'-pyspark/pitchbook/main.py'
            ],
        },
    },
]
I read online that I need to use XCom. Is this right? Do I need to run this function in a PythonOperator first and then pull the value? Please provide an example, as I am a newbie.
Thanks for your help.
Xi
Yes, if you would like to pass dynamic values, leveraging XCom push/pull might be easier.
Use a PythonOperator to push the data into XCom.
See the reference implementation:
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py#L108
https://www.startdataengineering.com/post/how-to-submit-spark-jobs-to-emr-cluster-from-airflow/
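As a rough illustration of the XCom approach, here is a minimal sketch that builds the step list around a Jinja expression instead of a Python variable. It assumes the task that runs get_secret has task_id "get_secret" (a name chosen here, not taken from the question) and that the steps argument of EmrAddStepsOperator is rendered as a template, which is how the Amazon provider declares it as far as I know. The helper build_spark_steps is hypothetical.

```python
from typing import Any, Dict, List

# Assumption: the upstream PythonOperator running get_secret() has this task_id.
SECRET_TASK_ID = "get_secret"

# Jinja expression that Airflow would render at run time into the value
# returned (and therefore pushed to XCom) by that task.
AIRFLOW_PATH_TEMPLATE = "{{ ti.xcom_pull(task_ids='%s') }}" % SECRET_TASK_ID


def build_spark_steps(airflow_path: str) -> List[Dict[str, Any]]:
    """Build the EMR step list for a bucket prefix, or for a template of it."""
    base = "s3://" + airflow_path + "-pyspark/pitchbook/"
    py_files = ",".join(base + name for name in ("config.zip", "jobs.zip", "DDL.zip"))
    return [
        {
            "Name": "Spark-Submit Command",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--py-files", py_files, base + "main.py"],
            },
        }
    ]


# The strings still contain the Jinja expression; nothing is resolved here.
SPARK_STEPS = build_spark_steps(AIRFLOW_PATH_TEMPLATE)
```

In the DAG, a PythonOperator(task_id="get_secret", python_callable=get_secret) would run first (a callable's return value is pushed to XCom automatically), and SPARK_STEPS would then be passed as steps= to EmrAddStepsOperator downstream of it.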
I'm using Airflow in Google Composer. By default all the tasks in a DAG use the default connection to communicate with Storage, BigQuery, etc. Obviously we can specify another connection configured in Airflow, e.g.:
task_custom = bigquery.BigQueryInsertJobOperator(
    task_id='task_custom_connection',
    gcp_conn_id='my_gcp_connection',
    configuration={
        "query": {
            "query": 'SELECT 1',
            "useLegacySql": False
        }
    }
)
Is it possible to use a specific connection as the default for all tasks in the entire DAG?
Thanks in advance.
UPDATE:
Specifying gcp_conn_id via default_args in the DAG (as Javier Lopez Tomas recommended) doesn't work completely. The operators that expect gcp_conn_id as a parameter work fine, but in my case, unfortunately, most interactions with GCP components happen via clients or hooks within PythonOperators.
For example: if I call DataflowHook (inside a function called by a PythonOperator) without specifying the connection, it internally uses "google_cloud_default" and not "gcp_conn_id" :(
def _dummy_func(**context):
    df_hook = DataflowHook()
default_args = {
    'gcp_conn_id': 'my_gcp_connection'
}
with DAG(default_args=default_args) as dag:
    dummy = PythonOperator(python_callable=_dummy_func)
You can use default args:
https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#default-arguments
In your case it would be:
default_args = {
    "gcp_conn_id": "my_gcp_connection"
}
with DAG(blabla)...
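For hooks created inside a PythonOperator callable, default_args are not applied automatically, as the update in the question notes. One possible workaround, sketched below, is to read the value back from the DAG object in the task context and pass it to the hook explicitly. The helper resolve_gcp_conn_id is hypothetical, and the sketch assumes the context exposes the DAG under the "dag" key with its default_args dict.

```python
from types import SimpleNamespace
from typing import Any, Mapping

# Connection id the Google hooks fall back to when none is given
# (the behaviour described in the question).
FALLBACK_CONN_ID = "google_cloud_default"


def resolve_gcp_conn_id(context: Mapping[str, Any], fallback: str = FALLBACK_CONN_ID) -> str:
    """Return gcp_conn_id from the running DAG's default_args, else the fallback.

    Hypothetical helper: assumes context["dag"].default_args is a dict or None.
    """
    dag = context.get("dag")
    default_args = getattr(dag, "default_args", None) or {}
    return default_args.get("gcp_conn_id", fallback)


# Stand-in for a real task context, only to show the lookup.
fake_context = {"dag": SimpleNamespace(default_args={"gcp_conn_id": "my_gcp_connection"})}
print(resolve_gcp_conn_id(fake_context))  # my_gcp_connection
```

Inside the callable this would be used as DataflowHook(gcp_conn_id=resolve_gcp_conn_id(context)), so the connection name still lives in one place, the DAG's default_args.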
I'm learning Airflow and am planning to set some variables to use across different tasks. These are in my dags folder, saved as configs.json, like so:
{
    "vars": {
        "task1_args": {
            "something": "This is task 1"
        },
        "task2_args": {
            "something": "this is task 2"
        }
    }
}
I get that we can enter Admin-->Variables--> upload the file. But I have 2 questions:
What if I want to adjust some of the variables while airflow is running? I can adjust my code easily and it updates in realtime but it doesn't seem like this works for variables.
Is there a way to just auto-import this specific file on startup? I don't want to have to add it every time I'm testing my project.
I don't see this mentioned in the docs but it seems like a pretty trivial thing to want.
What you are looking for is "With code, how do you update an airflow variable?"
Here's an untested snippet that should help
from airflow.models import Variable
Variable.set(key="my_key", value="my_value")
So basically you can write a bootstrap Python script to do this setup for you.
In our team, we use such scripts to set up all Connections and Pools too.
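A bootstrap script along these lines could look like the sketch below. It is untested against a live Airflow and makes two assumptions: the file has the {"vars": {...}} layout from the question, and the setter passed in accepts key=, value= and serialize_json= keyword arguments, which is the shape of Variable.set. The function name load_variables is made up for this example.

```python
import json
from typing import Any, Callable, Dict


def load_variables(config_path: str, setter: Callable[..., Any]) -> Dict[str, Any]:
    """Read {"vars": {...}} from a JSON file and hand each entry to `setter`.

    `setter` is called with key=, value= and serialize_json= keyword
    arguments, matching airflow.models.Variable.set.
    """
    with open(config_path) as config_file:
        variables: Dict[str, Any] = json.load(config_file)["vars"]
    for key, value in variables.items():
        # Store dicts and lists as JSON so they can be read back as structures.
        setter(key=key, value=value, serialize_json=isinstance(value, (dict, list)))
    return variables
```

With Airflow installed and its metadata database initialised, the call would be load_variables("configs.json", Variable.set) after from airflow.models import Variable. Running it from a startup script or entrypoint re-imports the file on every start, and re-running it after editing the file updates the stored values.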
In case you are wondering, here's the set(..) method from source
@classmethod
@provide_session
def set(
    cls,
    key: str,
    value: Any,
    serialize_json: bool = False,
    session: Session = None
):
    """
    Sets a value for an Airflow Variable with a given Key
    :param key: Variable Key
    :param value: Value to set for the Variable
    :param serialize_json: Serialize the value to a JSON string
    :param session: SQL Alchemy Sessions
    """
    if serialize_json:
        stored_value = json.dumps(value, indent=2)
    else:
        stored_value = str(value)
    Variable.delete(key, session=session)
    session.add(Variable(key=key, val=stored_value))
    session.flush()
I'm new to Airflow. Can someone please help me with this? I'm unable to access 'db_conn' inside my custom operator, even though this argument is defined in default_args.
**Dag details:**
default_args = {
    'owner': 'airflow',
    'email': ['example@hotmail.com'],
    'db_conn': 'database_connection'
}
dag = DAG(dag_id='my_custom_operator_dag', description='Another tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2020, 8, 6),
          catchup=False,
          default_args=default_args)
operator_task_start = MyOperator(
    task_id='my_custom_operator', dag=dag
)
**Operator details:**
class MyOperator(BaseOperator):
    @apply_defaults
    def __init__(self,
                 *args,
                 **kwargs):
        super(MyOperator, self).__init__(*args, **kwargs)
    def execute(self, context):
        log.info('owner: %s', self.owner)
        log.info('email: %s', self.email)
        log.info('db_conn: %s', self.db_conn)
        # Error here, AttributeError: 'MyOperator' object has no attribute 'db_conn'
You seem to have misunderstood default_args. default_args is just a shorthand (for code cleanup, refactoring and brevity) for passing common args (those that have the same value for all operators of a DAG, like owner) to all your operators, by setting them up as defaults and passing them to the DAG itself. Quoting the docstring from the DAG params:
:param default_args: A dictionary of default parameters to be used
    as constructor keyword parameters when initialising operators.
    Note that operators have the same hook, and precede those defined
    here, meaning that if your dict contains `'depends_on_past': True`
    here and `'depends_on_past': False` in the operator's call
    `default_args`, the actual value will be `False`.
:type default_args: dict
So clearly, for default_args to work, any key that you pass there must be an argument of your operator class.
Not just that: do note that passing invalid (non-existent) arguments to operator constructors is deprecated, and support for it will be dropped in Airflow 2.0 (so better not to pass any):
'Invalid arguments were passed to {c} (task_id: {t}). '
'Support for passing such arguments will be dropped in '
'future. ..
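To see why a key such as db_conn never reaches the operator, here is a deliberately simplified illustration of the idea. It is not Airflow's actual implementation, and the function and class names are made up: a default is applied only when the constructor names that parameter and the caller did not pass it explicitly.

```python
import inspect
from typing import Any, Dict


def fill_from_default_args(operator_cls: type, default_args: Dict[str, Any],
                           **explicit: Any) -> Dict[str, Any]:
    """Return the kwargs a constructor would receive after defaults are applied.

    Simplified illustration, not Airflow's code: a default is used only when
    the constructor names that parameter and the caller did not pass it.
    """
    named = inspect.signature(operator_cls.__init__).parameters
    kwargs = dict(explicit)
    for name, value in default_args.items():
        if name in named and name not in kwargs:
            kwargs[name] = value
    return kwargs


class WithoutDbConn:
    # Mirrors the operator in the question: no db_conn parameter.
    def __init__(self, *args, **kwargs):
        pass


class WithDbConn:
    # Names db_conn explicitly, so the default can be matched to it.
    def __init__(self, db_conn: str, *args, **kwargs):
        self.db_conn = db_conn


defaults = {"db_conn": "database_connection"}
print(fill_from_default_args(WithoutDbConn, defaults, task_id="t"))
# {'task_id': 't'}
print(fill_from_default_args(WithDbConn, defaults, task_id="t"))
# {'task_id': 't', 'db_conn': 'database_connection'}
```

In the real library, owner and email do get through because BaseOperator itself names them in its constructor; this sketch ignores inheritance to keep the matching rule visible.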
Hopefully it is clear by now that, to make this work, you must add a db_conn parameter to the constructor of your MyOperator class.
**Operator details:**
class MyOperator(BaseOperator):
    @apply_defaults
    def __init__(self,
                 db_conn: str,
                 *args,
                 **kwargs):
        super(MyOperator, self).__init__(*args, **kwargs)
        self.db_conn: str = db_conn
And while we are at it, may I offer a suggestion: for something like a connection, preferably use the Connection feature offered by Airflow, which eases your interaction with external services. It:
- makes them manageable (view / edit via UI)
- keeps them secure (they are stored encrypted in db)
- supports load balancing (define multiple connections with the same conn_id and Airflow will randomly distribute calls to one of those)
    When there is more than one connection with the same conn_id, the
    get_connection() method on BaseHook will choose one connection
    randomly. This can be used to provide basic load balancing and
    fault tolerance, when used in conjunction with retries.
- integrates nicely with built-in hooks
    They also use the airflow.models.connection.Connection model to
    retrieve hostnames and authentication information. Hooks keep
    authentication code and information out of pipelines, centralized in
    the metadata database.
I have an Airflow DAG that starts an AWS EMR cluster to run steps. In the DAG we pass some values that are stored as Airflow Variables. Some of these variables are encrypted in Airflow, but when they are passed to EMR we can see them in clear text in the EMR console. Is there any way to hide this?
Here is how we are defining the step. The Airflow variable db_pass must be encrypted or hidden somehow:
SAMPLE_STEP_DEFINITION = [
    {
        "Name": "EMR JOB",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "{{var.value.job_script}}",
                "--database_user={{var.value.db_user}}",
                "--database_pass={{var.value.db_pass}}"
            ]
        }
    }
]
This SAMPLE_STEP_DEFINITION is then passed to the EmrAddStepsOperator:
...
sample_task = EmrAddStepsOperator(
    steps=SAMPLE_STEP_DEFINITION,
    ...
There are several ways to do this. First, I would suggest encrypting the passwords with KMS. Here is code that does it:
def encryptString(plainText: String, keyArn: String): String = {
  val req = new EncryptRequest().withKeyId(keyArn).withPlaintext(ByteBuffer.wrap(plainText.getBytes))
  Base64.getEncoder.encodeToString(kmsClient.encrypt(req).getCiphertextBlob.array())
}
def decryptString(encryptedText: String, keyArn: String): String = {
  val req = new DecryptRequest().withCiphertextBlob(ByteBuffer.wrap(Base64.getDecoder.decode(encryptedText)))
  new String(kmsClient.decrypt(req).getPlaintext.array())
}
You just need to attach the decrypt permission to EMR_EC2_DefaultRole.
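If the job is written in Python rather than Scala, the same idea might look like the sketch below. It assumes a boto3 KMS client is passed in (for example boto3.client("kms")) and a symmetric key, for which KMS can work out the key from the ciphertext on decrypt. The function names are chosen for this example.

```python
import base64
from typing import Any


def encrypt_string(kms_client: Any, plain_text: str, key_arn: str) -> str:
    """Encrypt text with a KMS key and return the ciphertext as Base64 text."""
    response = kms_client.encrypt(KeyId=key_arn, Plaintext=plain_text.encode("utf-8"))
    return base64.b64encode(response["CiphertextBlob"]).decode("ascii")


def decrypt_string(kms_client: Any, encrypted_text: str) -> str:
    """Decode Base64 ciphertext and ask KMS to decrypt it."""
    response = kms_client.decrypt(CiphertextBlob=base64.b64decode(encrypted_text))
    return response["Plaintext"].decode("utf-8")
```

The encrypted string would be what is stored in the Airflow Variable and passed as the step argument, so the EMR console only shows ciphertext. The job calls decrypt_string at startup.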
Another way is to pass a config file that contains the password and is stored on S3.