I have a DAG that is triggered externally with some additional parameters say 'name'.
Sample code:
with airflow.DAG(
'my_dag_name',
default_args=default_args,
# Not scheduled, trigger only
schedule_interval=None) as dag:
start = bash_operator.BashOperator(
task_id='start',
bash_command='echo Hello.')
some_operation = MyOperator(
task_id='my_task',
name='{{ dag_run.conf["name"] }}')
goodbye = bash_operator.BashOperator(
task_id='end',
bash_command='echo Goodbye.')
start >> some_operation >> goodbye
Now if I use {{ dag_run.conf["name"] }} directly with the echo for a BashOperator, it works. Another way to read the parameter is to use a PythonOperator where I can read conf by kwargs['dag_run'].conf['name'].
However, what I really want is to have the name beforehand so that I can pass it while construction of the MyOperator.
Related
Does anyone know how to get the way a DAG got started (whether it was on a scheduler or manually)? I'm using Airflow 2.1.
I have a DAG that runs on an hourly basis, but there are times that I run it manually to test something. I want to capture how the DAG got started and pass that value to a column in a table where I'm saving some data. This will allow me to filter based on scheduled or manual starts and filter test information.
Thanks!
From an execution context, such as a python_callable provided to a PythonOperator you can access to the DagRun object related to the current execution:
def _print_dag_run(**kwargs):
dag_run: DagRun = kwargs["dag_run"]
print(f"Run type: {dag_run.run_type}")
print(f"Externally triggered ?: {dag_run.external_trigger}")
Logs output:
[2021-09-08 18:53:52,188] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_dagRun_info
AIRFLOW_CTX_TASK_ID=python_task
AIRFLOW_CTX_EXECUTION_DATE=2021-09-07T00:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=backfill__2021-09-07T00:00:00+00:00
Run type: backfill
Externally triggered ?: False
dag_run.run_type would be: "manual", "scheduled" or "backfill". (not sure if there are others)
external_trigger docs:
external_trigger (bool) -- whether this dag run is externally triggered
Also you could use jinja to access default vairables in templated fields, there is a variable representing the dag_run object:
bash_task = BashOperator(
task_id="bash_task",
bash_command="echo dag_run type is: {{ dag_run.run_type }}",
)
Full DAG:
from airflow import DAG
from airflow.models.dagrun import DagRun
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
default_args = {
"owner": "airflow",
}
def _print_dag_run(**kwargs):
dag_run: DagRun = kwargs["dag_run"]
print(f"Run type: {dag_run.run_type}")
print(f"Externally triggered ?: {dag_run.external_trigger}")
dag = DAG(
dag_id="example_dagRun_info",
default_args=default_args,
start_date=days_ago(1),
schedule_interval="#once",
tags=["example_dags", "params"],
catchup=False,
)
with dag:
python_task = PythonOperator(
task_id="python_task",
python_callable=_print_dag_run,
)
bash_task = BashOperator(
task_id="bash_task",
bash_command="echo dag_run type is: {{ dag_run.run_type }}",
)
I have a DAG with many sub-tasks in it. In the middle of the DAG, there is a validation task and based on the result/return code from the task, i want to take two different paths. If success, one route(a sequence of tasks) will be followed and in case of failure, we would like to execute a different set of tasks. There are two problems with the current approach, one is that, validation tasks execute many times(as per the retries configured) if the exit code is 1. Second there is no way possible to take different branches of execution
To solve problem number 1, we can use the retry number is available from the task instance, which is available via the macro {{ task_instance }} . Appreciate if someone could point us to a cleaner approach, and the problem number 2 of taking different paths remains unsolved.
You can have retries at the task level.
run_this = BashOperator(
task_id='run_after_loop',
bash_command='echo 1',
retries=3,
dag=dag,
)
run_this_last = DummyOperator(
task_id='run_this_last',
retries=1,
dag=dag,
)
Regarding your 2nd problem, there is a concept of Branching.
The BranchPythonOperator is much like the PythonOperator except that it expects a python_callable that returns a task_id (or list of task_ids). The task_id returned is followed, and all of the other paths are skipped. The task_id returned by the Python function has to be referencing a task directly downstream from the BranchPythonOperator task.
Example DAG:
import random
import airflow
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2),
}
dag = DAG(
dag_id='example_branch_operator',
default_args=args,
schedule_interval="#daily",
)
run_this_first = DummyOperator(
task_id='run_this_first',
dag=dag,
)
options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']
branching = BranchPythonOperator(
task_id='branching',
python_callable=lambda: random.choice(options),
dag=dag,
)
run_this_first >> branching
join = DummyOperator(
task_id='join',
trigger_rule='one_success',
dag=dag,
)
for option in options:
t = DummyOperator(
task_id=option,
dag=dag,
)
dummy_follow = DummyOperator(
task_id='follow_' + option,
dag=dag,
)
branching >> t >> dummy_follow >> join
Regarding your first problem, you set task/Operator specific retry options quite easily. Reference: baseoperator.py#L77.
Problem two, you can branch within a DAG easily with BranchPythonOperator (Example Usage: example_branch_operator.py). You will want to nest your validation task/logic within the BranchPythonOperator (You can define and execute operators within operators).
I'm new to Airflow and I'm trying to run an external DAG (developed and owned by another team), as part of my DAG flow.
I was looking at SubDagOperator, but it seems that for some reason it enforces the name of the subdag to be . which I cannot do as the child dag is owned by a different team.
here is my code sample:
parent_dag = DAG(
dag_id='parent_dag', default_args=args,
schedule_interval=None)
external_dag = SubDagOperator(
subdag=another_teams_dag,
task_id='external_dag',
dag=parent_dag,
trigger_rule=TriggerRule.ALL_DONE
)
and the other team's dag is defined like this:
another_teams_dag = DAG(
dag_id='another_teams_dag', default_args=args,
schedule_interval=None)
but I'm getting this error:
The subdag's dag_id should have the form
'{parent_dag_id}.{this_task_id}'. Expected 'parent_dag.external_dag';
received 'another_teams_dag'.
Any ideas?
What am I missing?
Use TriggerDagRunOperator
More info: https://airflow.apache.org/code.html#airflow.operators.dagrun_operator.TriggerDagRunOperator
Example:
Dag that triggers: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_controller_dag.py
Dag that is triggered: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py
For your case, you can use something like:
trigger = TriggerDagRunOperator(task_id='external_dag',
trigger_dag_id="another_teams_dag",
dag=dag)
I am trying to run a airflow DAG and need to pass some parameters for the tasks.
How do I read the JSON string passed as the --conf parameter in the command line trigger_dag command, in the python DAG file.
ex: airflow trigger_dag 'dag_name' -r 'run_id' --conf '{"key":"value"}'
Two ways. From inside a template field or file:
{{ dag_run.conf['key'] }}
Or when context is available, e.g. within a python callable of the PythonOperator:
context['dag_run'].conf['key']
In the example provided here https://github.com/apache/airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py#L62 while trying to parse 'conf' passed in an airflow REST API call, use provide_context=True in pythonOperator.
Also, the key-value pair passed in json format in the REST API call, can be accessed in bashOperator and sparkOperator as '\'{{ dag_run.conf["key"] if dag_run else "" }}\''
dag = DAG(
dag_id="example_dag",
default_args={"start_date": days_ago(2), "owner": "airflow"},
schedule_interval=None
)
def run_this_func(**context):
"""
Print the payload "message" passed to the DagRun conf attribute.
:param context: The execution context
:type context: dict
"""
print("context", context)
print("Remotely received value of {} for key=message".format(context["dag_run"].conf["key"]))
#PythonOperator usage
run_this = PythonOperator(task_id="run_this", python_callable=run_this_func, dag=dag, provide_context=True)
#BashOperator usage
bash_task = BashOperator(
task_id="bash_task",
bash_command='echo "Here is the message: \'{{ dag_run.conf["key"] if dag_run else "" }}\'"',
dag=dag
)
#SparkSubmitOperator usage
spark_task = SparkSubmitOperator(
task_id="task_id",
conn_id=spark_conn_id,
name="task_name",
application="example.py",
application_args=[
'--key', '\'{{ dag_run.conf["key"] if dag_run else "" }}\''
],
num_executors=10,
executor_cores=5,
executor_memory='30G',
#driver_memory='2G',
conf={'spark.yarn.maxAppAttempts': 1},
dag=dag)
You can use the param variable in DAG initialization to send data in DAG tasks.
I'm looking for a method that will allow the content of the emails sent by a given EmailOperator task to be set dynamically. Ideally I would like to make the email contents dependent on the results of an xcom call, preferably through the html_content argument.
alert = EmailOperator(
task_id=alertTaskID,
to='please#dontreply.com',
subject='Airflow processing report',
html_content='raw content #2',
dag=dag
)
I notice that the Airflow docs say that xcom calls can be embedded in templates. Perhaps there is a way to formulate an xcom pull using a template on a specified task ID then pass the result in as html_content? Thanks
Use PythonOperator + send_email instead:
from airflow.operators import PythonOperator
from airflow.utils.email import send_email
def email_callback(**kwargs):
with open('/path/to.html') as f:
content = f.read()
send_email(
to=[
# emails
],
subject='subject',
html_content=content,
)
email_task = PythonOperator(
task_id='task_id',
python_callable=email_callback,
provide_context=True,
dag=dag,
)
For those looking for an exact example of using jinja template with EmailOperator, here is one
from airflow.operators.email_operator import EmailOperator
from datetime import timedelta, datetime
email_task = EmailOperator(
to='some#email.com',
task_id='email_task',
subject='Templated Subject: start_date {{ ds }}',
params={'content1': 'random'},
html_content="Templated Content: content1 - {{ params.content1 }} task_key - {{ task_instance_key_str }} test_mode - {{ test_mode }} task_owner - {{ task.owner}} hostname - {{ ti.hostname }}",
dag=dag)
You can test run the above code snippet using
airflow test dag_name email_task 2017-05-10
might as well answer this myself. Turns out it's fairly straight forward using the template+xcom route. This code snippet works in the context of an already defined dag. It uses the BashOperator instead of EmailOperator because it's easier to test.
def pushparam(param, ds, **kwargs):
kwargs['ti'].xcom_push(key='specificKey', value=param)
return
loadxcom = PythonOperator(
task_id='loadxcom',
python_callable=pushparam,
provide_context=True,
op_args=['your_message_here'],
dag=dag)
template2 = """
echo "{{ params.my_param }}"
echo "{{ task_instance.xcom_pull(task_ids='loadxcom', key='specificKey') }}"
"""
t5 = BashOperator(
task_id='tt2',
bash_command=template2,
params={'my_param': 'PARAMETER1'},
dag=dag)
can be tested on commandline using something like this:
airflow test dag_name loadxcom 2015-12-31
airflow test dag_name tt2 2015-12-31
I will eventually test with EmailOperator and add something here if it doesn't work...