Include / get log output of parent in child airflow task? - airflow

I have an EmailOperator that is set to trigger if any of its parents fail.
some_task = BashOperator(
    task_id='some_task',
    bash_command='bash /some/path/to/script.sh',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
email_on_dest_fail = EmailOperator(
    task_id='email_alert',
    to=['me@dev.org'],
    subject='airflow error',
    html_content='Error detected from parent, see run_id={{run_id}} in airflow for more information',
    cc=[],
    trigger_rule=TriggerRule.ONE_FAILED,
    dag=dag)
Is there a way to include the (possibly truncated) log output of the first failing parent task in the body of the sent email?

A better design pattern for this would be to use the on_failure_callback for the operator. Or set a default one for the whole DAG. This will allow any failing operator to go through a failure callback and execute your desired behaviour.
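A minimal sketch of such a callback, assuming the standard failure-callback context keys (task_instance, run_id, exception) and Airflow's send_email helper. The function names and the recipient address are placeholders, and the sketch links to the failing task's log rather than embedding the log text:

```python
from types import SimpleNamespace


def build_failure_email(context):
    """Build (subject, html_body) from the context passed to a failure callback."""
    ti = context["task_instance"]
    subject = f"airflow error: {ti.dag_id}.{ti.task_id}"
    body = (
        f"Task {ti.task_id} failed in run_id={context.get('run_id')}.<br>"
        f"Exception: {context.get('exception')}<br>"
        f"Log: {ti.log_url}"
    )
    return subject, body


def notify_failure(context):
    """Failure callback that mails the message built above (placeholder recipient)."""
    # Imported lazily so build_failure_email can be tried without Airflow installed.
    from airflow.utils.email import send_email

    subject, body = build_failure_email(context)
    send_email(to=["me@dev.org"], subject=subject, html_content=body)


# Wiring (not executed here):
#   BashOperator(..., on_failure_callback=notify_failure)
#   or default_args={"on_failure_callback": notify_failure} for the whole DAG.

# Dry run with a stand-in task instance instead of a real one.
fake_ti = SimpleNamespace(dag_id="my_dag", task_id="some_task",
                          log_url="http://localhost:8080/log")
subject, body = build_failure_email(
    {"task_instance": fake_ti, "run_id": "manual__1", "exception": "boom"}
)
print(subject)
print(body)
```

Because the callback runs for the task that actually failed, no separate EmailOperator with trigger_rule=ONE_FAILED is needed.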

Related

run next tasks in dag if another dag is complete

dag1:
start >> clean >> end
I have a DAG in which I run a few tasks, but I want to modify it so that the clean step only runs if another DAG, "dag2", is not running at the moment.
Is there any way I can import information about "dag2", check its status, and proceed to the clean step only if it is in the success state?
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve the wait_for_dag2 part?
There are several answers, depending on what you want to do:
If you have two DAGs with the same schedule interval and you want each run of the second DAG to wait for the matching run of the first, you can use an ExternalTaskSensor on the last task of the first DAG.
If you want to run dag2 after each run of dag1, even when dag1 is triggered manually, update dag1 to add a TriggerDagRunOperator and set the schedule interval of the second DAG to None.
Regarding "I want to modify it such that the clean steps only runs if another dag "dag2" is not running at the moment": if you have two DAGs that must not run at the same time, to avoid a conflict on an external server or service, you can use one of the first two options, or give the tasks of the first DAG a higher priority and put the conflicting tasks in the same pool (with 1 slot). You will lose parallelism on those tasks.
Hossein's approach is the way people usually go. However, if you want to get information about any DAG run, you can use Airflow's own functionality to get it. The following approach is good when you do not want to (or are not allowed to) modify another DAG:
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState
dag_runs = DagRun.find(dag_id='the_dag_id_you_want_to_check')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
    print('the dag run was successful!')
else:
    print('the dag state is -->: ', last_run.state)
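Building on the same DagRun lookup, the wait_for_dag2 step could be sketched as a sensor callable that returns True only when no run of dag2 is queued or running. The helper names below are hypothetical, and the wiring at the end assumes an Airflow 2 style PythonSensor:

```python
from types import SimpleNamespace

ACTIVE_STATES = ("queued", "running")


def dag_is_idle(dag_runs):
    """Return True when none of the given DAG runs is queued or running."""
    return not any(run.state in ACTIVE_STATES for run in dag_runs)


def dag2_is_idle():
    """Sensor callable; 'dag2' is the DAG id used in the question."""
    # Imported lazily so dag_is_idle can be tried without Airflow installed.
    from airflow.models.dagrun import DagRun

    return dag_is_idle(DagRun.find(dag_id="dag2"))


# Wiring (not executed here):
#   wait_for_dag2 = PythonSensor(task_id="wait_for_dag2",
#                                python_callable=dag2_is_idle,
#                                mode="reschedule", dag=dag)
#   start >> wait_for_dag2 >> clean >> end

# Dry run with stand-in run objects.
finished = [SimpleNamespace(state="success"), SimpleNamespace(state="failed")]
busy = finished + [SimpleNamespace(state="running")]
print(dag_is_idle(finished), dag_is_idle(busy))
```

Note that this only checks the state at poke time; dag2 can still start right after the sensor succeeds, so a shared pool is the safer choice when the two must never overlap.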

How to execute the same query for several dynamically decided dates in a single DAG run?

The main idea is to run BigQueryOperator for several specific dynamically determined dates (so the date must be passed to an external query) and that the results are written to corresponding partitions (using the $ suffix of the destination table) all in a single DAG run. Dates are dependent on the execution_date.
I understand that user_defined_macros only exist on (sub)DAG level, so I will have to spawn subDAGs dynamically. But execution_date is only available within operators, such as PythonOperator, and it does not seem possible to spawn a subDAG (or any operator for that matter) from inside another operator.
So either I need a way to access execution_date not from an operator but from within a DAG itself, or an alternative way of passing my custom date to the external query without user_defined_macros on (sub)DAG level.
Is there a standard way (or any way) of dealing with similar situations?
OK, apparently it is possible to pass Jinja-readable parameters to the BigQueryOperator, to be used both in the query and in the partition suffix, like so:
def bq_to_bq_fn(day_offset=None):
    return BigQueryOperator(
        task_id='bq_to_bq' if day_offset is None else f'bq_to_bq_{day_offset}',
        use_legacy_sql=False,
        write_disposition=write_mode_dag,
        allow_large_results=True,
        params={
            'day_offset': 0 if day_offset is None else day_offset
        },
        sql='''
        SELECT * FROM TEST.test1
        WHERE date = '{{ (execution_date - macros.timedelta(days=params.day_offset)).strftime("%Y-%m-%d") }}'
        ''',
        time_partitioning={"type": "DAY", "field": "date"},
        destination_dataset_table=project_id + '.TEST.test2$' + '{{ (execution_date - macros.timedelta(days=params.day_offset)).strftime("%Y%m%d") }}',
        dag=dag
    )
The SQL part can also be in an external file.
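To see what the two templated expressions evaluate to, the plain-Python sketch below mirrors them for a few offsets. The function name, the sample execution date and the offsets are made up for illustration:

```python
from datetime import datetime, timedelta


def rendered_dates(execution_date, day_offset):
    """Mirror the two Jinja expressions: (WHERE-clause date, partition suffix)."""
    day = execution_date - timedelta(days=day_offset)
    return day.strftime("%Y-%m-%d"), day.strftime("%Y%m%d")


# Hypothetical execution date and offsets.
for offset in (0, 1, 7):
    print(offset, rendered_dates(datetime(2021, 3, 1), offset))

# Creating one task per offset with the factory above (not executed here):
#   tasks = [bq_to_bq_fn(offset) for offset in (1, 7)]
```

Each offset yields its own task_id, so several operators built this way can coexist in one DAG and write to different partitions in the same run.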

Airflow trigger_rule all_done not working as expected

I have the following DAG in Airflow 1.10.9 where the clean_folder task should run once all the previous tasks either succeeded, failed or were skipped. To ensure this, I put the trigger_rule parameter of the clean_folder operator to "all_done":
t_clean_folders = BashOperator(
    bash_command=f"python {os.path.join(custom_resources_path, 'cleaning.py')} {args['n_branches']}",
    trigger_rule='all_done',
    task_id="clean_folder",
)
This logic works properly when all tasks in the branches executed before are skipped:
[Graph view][1]
However when a branch was successfully executed, the clean_folder task is skipped:
[Graph view][2]
The branches are defined dynamically as follows:
for b in range(args['n_branches']):
    t_file_sensing = FileSensor(
        filepath=f"{input_path}/input_{b}",
        task_id=f"file_sensing_{b}",
        poke_interval=60,
        timeout=60*60*5,
        soft_fail=True,
        retries=3,
    )
    t_data_staging = BashOperator(
        bash_command=f"python {os.path.join(custom_resources_path, 'staging.py')} {b}",
        task_id=f"data_staging_{b}",
    )
    ...
The documentation provides the following definition of "all_done": all parents are done with their execution. Is this normal behavior of the trigger_rule? What can I change to ensure clean_folder will run in any case (and last)? Thanks!
[1]: https://i.stack.imgur.com/AGqqD.png
[2]: https://i.stack.imgur.com/7CG3r.png
If possible, you should consider upgrading your Airflow version to at least 1.10.15 in order to benefit from more recent bug fixes.
It really surprises me that clean_folder and dag_complete both get executed when all parent tasks are skipped. The behaviour when a task is skipped is to directly skip its child tasks without first checking their trigger_rules.
According to Airflow 1.10.9 Documentation on trigger_rules,
Skipped tasks will cascade through trigger rules all_success and all_failed but not all_done [...]
For your use case, you could split the workflow into two DAGs:
1 DAG to do everything you want except the t_clean_folder
1 DAG to execute the t_clean_folder task, preceded by an ExternalTaskSensor
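The second DAG's sensor could be configured roughly as follows. This is only a sketch: the helper name and the DAG id are hypothetical, both DAGs are assumed to share the same schedule, and allowing the "failed" state is an assumption made so that the cleanup also runs after a failed main run:

```python
def cleanup_sensor_kwargs(main_dag_id):
    """Arguments for an ExternalTaskSensor that waits for a whole DAG run to finish."""
    return {
        "task_id": f"wait_for_{main_dag_id}",
        "external_dag_id": main_dag_id,
        # None means: wait for the DAG run itself, not for a single task.
        "external_task_id": None,
        # Clean up whether the main run succeeded or failed (assumption).
        "allowed_states": ["success", "failed"],
        # Free the worker slot between pokes.
        "mode": "reschedule",
    }


kwargs = cleanup_sensor_kwargs("main_dag")
print(kwargs)

# Wiring in the cleanup DAG (not executed here):
#   wait = ExternalTaskSensor(dag=cleanup_dag, **kwargs)
#   wait >> t_clean_folders
```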

Reference filename via xcom in Airflow

I'm trying to understand how to pass values via airflow xcom functionality. The specific usecase I am trying to build is to write a file, then move it, then run another command. The idea is that I pass the file name from one operator to the next.
Here is what I have:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime as dt
DAG = DAG(
    dag_id='xcom_test_dag',
    start_date=dt.datetime.now(),
    schedule_interval='@once'
)
def push_function(**context):
    file_name = 'test_file_{date}'.format(date=dt.datetime.now())
    return context['task_instance'].xcom_push(key='filename', value=file_name)
def pull_function(**context):
    dir(context['task_instance'].xcom_pull())
push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_function,
    provide_context=True,
    dag=DAG)
pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    provide_context=True,
    dag=DAG)
push_task >> pull_task
If I want to reference the file name in the pull_task so that I can read the file, how should I call that? Trying to access context['task_instance'] does not contain a value. Further, is it best practice to reference a file name like this from task to task or operator to operator?
When pulling data from XCOM, you want to provide the task ID of the task where you push the data. In your example, the task_id of your push task is push_task, so you'd want to do something like:
value = context['task_instance'].xcom_pull(task_ids='push_task')
However, from the airflow documentation, note that:
By default, xcom_pull() filters for the keys that are automatically given to XComs when they are pushed by being returned from execute functions (as opposed to XComs that are pushed manually).
If you're pushing data to XCOM manually with specific keys, you may need to include that key when calling xcom_pull. In your example, you push a key called filename in your push task, so you'd likely need to do something like this in your pull task:
value = context['task_instance'].xcom_pull(task_ids='push_task', key='filename')
This information is outlined in further detail in the Airflow documentation: https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#concepts-xcom
As for your question regarding "best practices" - for communicating between Airflow Tasks/Operators, XCOM is the best way to go. However, if you're wanting to read a file from disk across multiple operators, you would need to ensure that all your workers have access to where the file is stored. If that isn't possible, an alternative could be to have the push task store that file remotely (e.g. in AWS S3) and push the S3 URL to XCOM. The pull task could then read the S3 URL from XCOM, and download the file from S3.
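The push and pull callables can be exercised outside Airflow with a small in-memory stand-in for the task instance. FakeTaskInstance below is purely illustrative and only imitates the xcom_push and xcom_pull calls used here; the fixed file name is also made up:

```python
class FakeTaskInstance:
    """In-memory stand-in for Airflow's TaskInstance, for demonstration only."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self.store = store  # shared dict: (task_id, key) -> value

    def xcom_push(self, key, value):
        self.store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids=None, key="return_value"):
        # Like the real method, the default key is the one used for return values.
        return self.store.get((task_ids, key))


def push_function(**context):
    file_name = "test_file_2020-01-01"  # hypothetical fixed name for the demo
    context["task_instance"].xcom_push(key="filename", value=file_name)


def pull_function(**context):
    # Name both the pushing task and the custom key.
    return context["task_instance"].xcom_pull(task_ids="push_task", key="filename")


store = {}
push_function(task_instance=FakeTaskInstance("push_task", store))
pulled = pull_function(task_instance=FakeTaskInstance("pull_task", store))
print(pulled)
```

Pulling with only task_ids='push_task' finds nothing in this setup, because the value was stored under the custom key filename rather than the default return-value key.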

Airflow depends_on_past for whole DAG

Is there a way in airflow of using the depends_on_past for an entire DagRun, not just applied to a Task?
I have a daily DAG, and the Friday DagRun errored on the 4th task however the Saturday and Sunday DagRuns still ran as scheduled. Using depends_on_past = True would have paused the DagRun on the same 4th task, however the first 3 tasks would still have run.
I can see in the DagRun DB table there is a state column that contains failed for the Friday DagRun. What I want is a way to configure a DagRun so that it does not start at all if the previous DagRun failed, rather than starting and running until it reaches a Task that previously failed.
Does anyone know if this is possible?
On your first task, set depends_on_past=True and wait_for_downstream=True. With this combination, the current DAG run starts only if the previous run succeeded, because the first task of the current run waits for its own previous instance (depends_on_past) and for the tasks downstream of that instance (wait_for_downstream) to succeed.
This question is a bit old, but it comes up as the first Google search result, and the highest-rated answer is clearly misleading (it made me struggle a bit), so it deserves a proper answer. Although the second-rated answer should work, there is a cleaner way to do this, and I personally find using xcom ugly.
Airflow has a special operator class designed for monitoring the status of tasks from other DAG runs or of other DAGs as a whole. So what we need to do is add a task that precedes all the tasks in our DAG and checks whether the previous run has succeeded.
from airflow.sensors.external_task_sensor import ExternalTaskSensor
previous_dag_run_sensor = ExternalTaskSensor(
    task_id='previous_dag_run_sensor',
    dag=our_dag,
    external_dag_id=our_dag.dag_id,
    # execution_delta expects a timedelta, so this assumes schedule_interval is a timedelta
    execution_delta=our_dag.schedule_interval
)
previous_dag_run_sensor.set_downstream(vertices_of_indegree_zero_from_our_dag)
One possible solution would be to use xcom:
Add 2 PythonOperators start_task and end_task to the DAG.
Make all other tasks depend on start_task
Make end_task depend on all other tasks (set_upstream).
end_task will always push a variable last_success = context['execution_date'] to xcom (xcom_push). (Requires provide_context = True in the PythonOperators).
And start_task will always check xcom (xcom_pull) to see whether there exists a last_success variable with value equal to the previous DagRun's execution_date or to the DAG's start_date (to let the process start).
Example use of xcom:
https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_xcom.py
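The check performed by start_task in the steps above can be sketched as a plain function. The function name is hypothetical, and treating the very first run (execution_date equal to start_date) as always allowed is one reading of "to let the process start":

```python
from datetime import datetime


def previous_run_succeeded(last_success, prev_execution_date, execution_date, start_date):
    """Decide whether start_task should let the current DAG run proceed."""
    if execution_date == start_date:
        return True  # first run ever: there is no previous run to check
    return last_success is not None and last_success == prev_execution_date


# Inside the start_task callable (not executed here), last_success could come from:
#   ti.xcom_pull(task_ids="end_task", key="last_success", include_prior_dates=True)
# and the callable would raise an exception when this function returns False.

d1, d2, d3 = datetime(2021, 1, 1), datetime(2021, 1, 2), datetime(2021, 1, 3)
print(previous_run_succeeded(None, None, d1, d1))  # first run
print(previous_run_succeeded(d1, d1, d2, d1))      # previous run pushed last_success
print(previous_run_succeeded(d1, d2, d3, d1))      # run for d2 never reached end_task
```

The include_prior_dates flag matters here because, by default, xcom_pull only looks at values pushed for the current execution date.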
Here is a solution that addresses Marc Lamberti's concern, namely that wait_for_downstream is not "recursive".
The solution entails "embedding" your original DAG in between two dummy tasks, a start_task and an end_task.
Such that:
The start_task precedes all your original initial tasks (ie, no other task in your DAG can start until start_task is completed).
An end_task follows all your original ending tasks (i.e., all branches in your DAG converge on that dummy end_task).
start_task also directly precedes the end_task.
These conditions are provided by the following code:
start_task >> [_all_your_initial_tasks_here_]
[_all_your_ending_tasks_here] >> end_task
start_task >> end_task
Additionally, start_task must be created with depends_on_past=True and wait_for_downstream=True.
