I need this info in the log as a print statement.
Assuming you want to get the duration of a DAG from within a task of that same DAG, you need to make it the last task, and keep in mind there will be a small difference (because the duration task is itself part of the DAG).
Here is an example of a simple DAG in which the last task calculates the duration and puts it in XCom.
There is also a slight difference between the XCom value and the Airflow UI because of number rounding.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.decorators import task
from airflow.operators.python import get_current_context
from airflow.sensors.time_delta import TimeDeltaSensor
from airflow.utils import timezone

with DAG(
    dag_id="test_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    render_template_as_native_obj=True,
    tags=["test"],
) as dag:

    @task
    def task1():
        print("task1")

    sleep_task = TimeDeltaSensor(
        task_id="sleep",
        delta=timedelta(seconds=3),
        mode='reschedule'
    )

    @task(multiple_outputs=True)
    def duration_task():
        context = get_current_context()
        dag_run = context["dag_run"]
        execution_date = dag_run.execution_date
        now = timezone.make_aware(datetime.utcnow())
        duration = now - execution_date
        return {
            "duration": str(duration),
            "start_time": str(dag_run.execution_date),
            "end_time": str(now)
        }

    task1() >> sleep_task >> duration_task()
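If the duration is needed elsewhere, it can be pulled back out of XCom by a downstream task. A minimal sketch, reusing the imports from the DAG above (the log_duration task is hypothetical and not part of the original DAG); because of multiple_outputs=True, each key of the returned dict is pushed as its own XCom and can be pulled by key:

@task
def log_duration():
    # pull the "duration" key pushed by the duration_task above
    ti = get_current_context()["ti"]
    duration = ti.xcom_pull(task_ids="duration_task", key="duration")
    print(f"DAG duration so far: {duration}")

# e.g., inside the DAG: task1() >> sleep_task >> duration_task() >> log_duration()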
Let's take an example DAG.
Here is the code for it.
import logging
from airflow import DAG
from datetime import datetime, timedelta
from airflow.models import TaskInstance
from airflow.operators.python import PythonOperator
from airflow.operators.dummy import DummyOperator


def task_failure_notification_alert(context):
    logging.info("Task context details: %s", str(context))


def dag_failure_notification_alert(context):
    logging.info("DAG context details: %s", str(context))


def red_exception_task(ti: TaskInstance, **kwargs):
    raise Exception('red')


default_args = {
    "owner": "analytics",
    "start_date": datetime(2021, 12, 12),
    'retries': 0,
    'retry_delay': timedelta(),
    "schedule_interval": "@daily"
}

dag = DAG('logger_dag',
          default_args=default_args,
          catchup=False,
          on_failure_callback=dag_failure_notification_alert
          )

start_task = DummyOperator(task_id="start_task", dag=dag, on_failure_callback=task_failure_notification_alert)

red_task = PythonOperator(
    dag=dag,
    task_id='red_task',
    python_callable=red_exception_task,
    provide_context=True,
    on_failure_callback=task_failure_notification_alert
)

end_task = DummyOperator(task_id="end_task", dag=dag, on_failure_callback=task_failure_notification_alert)

start_task >> red_task >> end_task
We can see that two functions, task_failure_notification_alert and dag_failure_notification_alert, are called in case of failures.
When a task fails, we can see its logs in the UI, including the output of the task-level on_failure_callback.
But I am unable to find the logs for the on_failure_callback of the DAG anywhere in the UI. Where can we see them?
Under airflow/logs, find the "scheduler" folder. Under it, look for the specific date you ran the DAG (for example 2022-12-03), and there you will see a log file named after the DAG file (dag_file.log).
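For illustration only, here is a rough way to grep those scheduler logs for the callback output; the directory layout and the file-name pattern are assumptions based on the default file-based logging and the logger_dag example above:

from pathlib import Path

# Assumed default layout: the DAG-level on_failure_callback output lands in the
# scheduler log written for the file that defines the DAG, e.g.
# $AIRFLOW_HOME/logs/scheduler/<date>/<dag_file>.log
log_dir = Path("airflow/logs/scheduler/2022-12-03")
for log_file in log_dir.glob("logger_dag*.log*"):
    for line in log_file.read_text().splitlines():
        if "DAG context details" in line:  # message logged by dag_failure_notification_alert
            print(f"{log_file}: {line}")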
I have a very simple DAG which includes a very simple task (a PythonOperator) that gets some trivial JSON data from the SWAPI API and returns an int. In this case the value of the int is 202.
I'm fairly sure that this int is being correctly pushed as an XCom value, because when I run the DAG, select that task, and look at the logs and the XCom view in the UI, I can see the value there.
Furthermore when I add the line:
ti.xcom_push(key='height', value=height)
Into the python function which is getting the value from the API, I'm then able to see from the XCOM view for that task that a key of 'height' is being added and the value is indeed 202.
The problem is that for love nor money I can't pull that value out again and use it in another task. For example, the task which needs to use that value is a PythonOperator whose function looks like:
def check_height(ti):
    height = ti.xcom_pull(key='height', task_ids=['get_data_darth_vader'])
    print(f"Height is: {height}")
I've also tried it with no key, and with the key set to 'return_value', but nothing works; the value is None:
[2021-09-30 21:00:35,044] {logging_mixin.py:109} INFO - Height is: [None]
[2021-09-30 21:00:35,047] {python.py:151} INFO - Done. Returned value was: None
I must be doing something wrong, but I cannot see what. I've watched a number of tutorials and read several blog posts on the subject, and can't see where what I am doing differs from the working examples.
Help!
UPDATE: Here is the whole dag
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime
import json
import requests


def get_darth_vader_height(ti):
    """
    Get Darth Vader info from SWAPI
    """
    response = requests.get('https://swapi.dev/api/people/4')
    data = json.loads(response.text)
    height = data['height']
    print(f"DEBUG: {height}")
    ti.xcom_push(key="height", value=height)
    return height


def check_height(ti):
    height = ti.xcom_pull(task_ids='task_one', key="height")
    print(f"Height is: {height}")
    print(str(height))


with DAG(
    'my_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    get_darth_vader_height = PythonOperator(
        task_id='task_one',
        python_callable=get_darth_vader_height
    )

    check_darth_vader_height = PythonOperator(
        task_id='task_two',
        python_callable=check_height
    )

    is_tall = BashOperator(
        task_id='task_three',
        bash_command="echo 'is tall!'"
    )

    is_short = BashOperator(
        task_id='task_four',
        bash_command="echo 'is short!'"
    )
UPDATE WITH WORKING VERSION:
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime
import json
import requests


def get_darth_vader_height(ti):
    """
    Get Darth Vader info from SWAPI
    """
    response = requests.get('https://swapi.dev/api/people/4')
    data = json.loads(response.text)
    height = data['height']
    print(f"DEBUG: {height}")
    ti.xcom_push(key="height", value=height)
    return height


def check_height(ti):
    height = ti.xcom_pull(task_ids='task_one', key="height")
    print(f"Height is: {height}")
    if int(height) > 200:
        print('height is greater than 200')
        return 'is_tall'
    print('height is less than 200')
    return 'is_short'


with DAG(
    'my_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    get_darth_vader_height = PythonOperator(
        task_id='task_one',
        python_callable=get_darth_vader_height
    )

    check_darth_vader_height = BranchPythonOperator(
        task_id='task_two',
        python_callable=check_height
    )

    is_tall = BashOperator(
        task_id='task_three',
        bash_command="echo 'is tall!'"
    )

    is_short = BashOperator(
        task_id='task_four',
        bash_command="echo 'is short!'"
    )

    get_darth_vader_height >> check_darth_vader_height
Adding the chain between the tasks fixed this issue.
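Two details are worth noting from the thread above. First, without the explicit dependency, task_two can run before task_one has pushed anything, which is why the pull returned None. Second, the [None] in the log shows that xcom_pull was given task_ids as a list, which makes it return a list of values; a plain string returns the value itself. A small sketch of the two call forms, assuming the pushing task is task_one as in the DAG above:

def check_height(ti):
    # task_ids as a string: returns the single pushed value, e.g. '202'
    height = ti.xcom_pull(task_ids='task_one', key='height')

    # task_ids as a list: returns a list with one value per task id, e.g. ['202']
    heights = ti.xcom_pull(task_ids=['task_one'], key='height')

    print(f"string form: {height!r}, list form: {heights!r}")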
I am trying to trigger multiple external DAG dataflow jobs via a master DAG.
I plan to use TriggerDagRunOperator and ExternalTaskSensor. I have around 10 dataflow jobs - some are to be executed in sequence and some in parallel.
For example: I want to execute DAG dataflow jobs A, B, C etc. from the master DAG, and before execution moves on to the next task I want to ensure the previous DAG run has completed. But I am having issues with importing the ExternalTaskSensor module.
Is there any alternative path to achieve this?
Note: Each DAG, e.g. A/B/C, has 6-7 tasks. Can ExternalTaskSensor check whether the last task of DAG A has completed before DAG B or C starts?
I used the below sample code, which uses ExternalTaskSensor, and I was able to successfully import the ExternalTaskSensor module.
import time
from datetime import datetime, timedelta
from pprint import pprint

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor
from airflow.utils.state import State

sensors_dag = DAG(
    "test_launch_sensors",
    schedule_interval=None,
    start_date=datetime(2020, 2, 14, 0, 0, 0),
    dagrun_timeout=timedelta(minutes=150),
    tags=["DEMO"],
)

dummy_dag = DAG(
    "test_dummy_dag",
    schedule_interval=None,
    start_date=datetime(2020, 2, 14, 0, 0, 0),
    dagrun_timeout=timedelta(minutes=150),
    tags=["DEMO"],
)


def print_context(ds, **context):
    pprint(context['conf'])


with dummy_dag:
    starts = DummyOperator(task_id="starts", dag=dummy_dag)
    empty = PythonOperator(
        task_id="empty",
        provide_context=True,
        python_callable=print_context,
        dag=dummy_dag,
    )
    ends = DummyOperator(task_id="ends", dag=dummy_dag)

    starts >> empty >> ends

with sensors_dag:
    trigger = TriggerDagRunOperator(
        task_id=f"trigger_{dummy_dag.dag_id}",
        trigger_dag_id=dummy_dag.dag_id,
        conf={"key": "value"},
        execution_date="{{ execution_date }}",
    )
    sensor = ExternalTaskSensor(
        task_id="wait_for_dag",
        external_dag_id=dummy_dag.dag_id,
        external_task_id="ends",
        poke_interval=5,
        timeout=120,
    )
    trigger >> sensor
In the above sample code, sensors_dag triggers a run of dummy_dag using the TriggerDagRunOperator. The sensors_dag will then wait until the completion of the specified external_task in dummy_dag.
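To run several external DAGs one after another, as the question asks, one option is to repeat this trigger-plus-sensor pair once per DAG inside the master DAG and chain the pairs. A rough sketch reusing the imports above; the DAG ids and the name of their last task ("ends") are assumptions for illustration:

with sensors_dag:
    previous = None
    for external_dag_id in ["dag_a", "dag_b", "dag_c"]:  # assumed DAG ids
        trigger = TriggerDagRunOperator(
            task_id=f"trigger_{external_dag_id}",
            trigger_dag_id=external_dag_id,
            execution_date="{{ execution_date }}",
        )
        sensor = ExternalTaskSensor(
            task_id=f"wait_for_{external_dag_id}",
            external_dag_id=external_dag_id,
            external_task_id="ends",  # wait for the last task of that DAG
            poke_interval=30,
            timeout=600,
        )
        trigger >> sensor
        if previous is not None:
            previous >> trigger  # only trigger the next DAG once the previous one is done
        previous = sensor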
I have a list of tasks which call different DAGs from a master DAG. I'm using the TriggerDagRunOperator to accomplish this, but I am facing a few issues.
TriggerDagRunOperator doesn't wait for completion of the external DAG; it triggers the next task immediately. I want it to wait until completion, so the next task triggers based on the status. I came across ExternalTaskSensor, but it makes the process complicated, and the issue with ExternalTaskSensor is that it only works with a schedule; when I use it, it fails once the timeout has been reached.
Is there a way to trigger the different DAGs from a master DAG sequentially, not in parallel, so that it triggers another DAG only after the previously triggered DAG has successfully completed all of its tasks?
Run this task after triggering your external DAG:
import time

from airflow.models import DagRun
from airflow import AirflowException
from airflow.operators.python_operator import PythonOperator


def get_external_dag_status(dag_id, **kwargs):
    dag_id = dag_id
    dag_runs = DagRun.find(dag_id=dag_id)
    for dag_run in dag_runs:
        # print("state = " + dag_run.state)
        res1 = dag_run.state
        # print(dag_run)
    return res1


def check_status(dag_id, **kwargs):
    st = get_external_dag_status(dag_id)
    while st != 'success':
        if st == 'failed':
            print(st)
            break
        time.sleep(300)  # optional if need to check for every 5 minutes
        st = get_external_dag_status(dag_id)
    if st == 'success':
        return st
    elif st == 'failed':
        raise ValueError('Dag Failed')


status_check = PythonOperator(task_id="dag_check",
                              python_callable=check_status,
                              op_kwargs={'dag_id': 'your external dag id'},
                              dag=spark_dag
                              )
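As a usage note, this status-check task simply goes right after the corresponding TriggerDagRunOperator in the master DAG, for example (the task id and import style here are assumptions; spark_dag is the master DAG object referenced in the snippet above):

from airflow.operators.dagrun_operator import TriggerDagRunOperator  # Airflow 1.x-style import, as above

trigger_external = TriggerDagRunOperator(
    task_id="trigger_external_dag",          # assumed task id
    trigger_dag_id="your external dag id",   # same placeholder as in op_kwargs above
    dag=spark_dag,
)

# move on (e.g. trigger the next external DAG) only after the check succeeds
trigger_external >> status_check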
After triggering a DAG using the TriggerDagRunOperator, you can consider calling a DagSensor that will wait for the DAG's completion, and only then trigger other DAGs. Here is how we implemented our version (not perfect, but it did the job):
import logging

from airflow.plugins_manager import AirflowPlugin
from airflow.models import DagRun
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.db import provide_session
from airflow.utils.decorators import apply_defaults
from airflow.utils.state import State

logger = logging.getLogger('airflow.dag_sensor')


class DagSensor(BaseSensorOperator):
    """
    Sensor that checks whether a DAG is currently running.
    It proceeds only if the DAG is not running.
    """

    template_fields = ['external_dag_id']
    ui_color = '#FFFFCC'

    @apply_defaults
    def __init__(self,
                 external_dag_id,
                 *args,
                 **kwargs):
        super(DagSensor, self).__init__(*args, **kwargs)
        self.external_dag_id = external_dag_id

    @provide_session
    def poke(self, context, session=None):
        dag_run = DagRun
        count = session.query(dag_run).filter(
            dag_run.dag_id == self.external_dag_id,
            dag_run._state.in_([State.RUNNING])
        ).count()
        session.commit()
        session.close()

        logger.info(f'Dag {self.external_dag_id} in running status: {count}')

        if count > 0:
            return False
        else:
            return True


class DagSensorPlugin(AirflowPlugin):
    name = 'dag_sensor_plugin'
    operators = [DagSensor]
Here is how you can call it:
from airflow.operators import DagSensor

check_my_dag_completion = DagSensor(
    dag=dag,
    task_id='check_my_dag_completion',
    external_dag_id='my_dag',
    poke_interval=30,
    timeout=3600
)
This means that you can have something like this in your workflow:
call_dag_a >> check_dag_a >> call_dag_b >> check_dag_b
I have the following DAG with 3 tasks:
start --> special_task --> end
The task in the middle can succeed or fail, but end must always be executed (imagine this is a task for cleanly closing resources). For that, I used the trigger rule ALL_DONE:
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE
Using that, end is properly executed if special_task fails. However, since end is the last task and succeeds, the DAG is always marked as SUCCESS.
How can I configure my DAG so that if one of the tasks failed, the whole DAG is marked as FAILED?
Example to reproduce
import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils import trigger_rule

dag = DAG(
    dag_id='my_dag',
    start_date=datetime.datetime.today(),
    schedule_interval=None
)

start = BashOperator(
    task_id='start',
    bash_command='echo start',
    dag=dag
)

special_task = BashOperator(
    task_id='special_task',
    bash_command='exit 1',  # force failure
    dag=dag
)

end = BashOperator(
    task_id='end',
    bash_command='echo end',
    dag=dag
)
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE

start.set_downstream(special_task)
special_task.set_downstream(end)
This post seems to be related, but the answer does not suit my needs, since the downstream task end must be executed (hence the mandatory trigger_rule).
I thought it was an interesting question and spent some time figuring out how to achieve it without an extra dummy task. It became a bit of a superfluous task, but here's the end result:
This is the full DAG:
import airflow
from airflow import AirflowException
from airflow.models import DAG, TaskInstance, BaseOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule

default_args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(3)}

dag = DAG(
    dag_id="finally_task_set_end_state",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    description="Answer for question https://stackoverflow.com/questions/51728441",
)

start = BashOperator(task_id="start", bash_command="echo start", dag=dag)
failing_task = BashOperator(task_id="failing_task", bash_command="exit 1", dag=dag)


@provide_session
def _finally(task, execution_date, dag, session=None, **_):
    upstream_task_instances = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag.dag_id,
            TaskInstance.execution_date == execution_date,
            TaskInstance.task_id.in_(task.upstream_task_ids),
        )
        .all()
    )
    upstream_states = [ti.state for ti in upstream_task_instances]
    fail_this_task = State.FAILED in upstream_states

    print("Do logic here...")

    if fail_this_task:
        raise AirflowException("Failing task because one or more upstream tasks failed.")


finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,
    dag=dag,
)

succesful_task = DummyOperator(task_id="succesful_task", dag=dag)

start >> [failing_task, succesful_task] >> finally_
Look at the _finally function, which is called by the PythonOperator. There are a few key points here:
Annotate with @provide_session and add argument session=None, so you can query the Airflow DB with session.
Query all upstream task instances for the current task:
upstream_task_instances = (
    session.query(TaskInstance)
    .filter(
        TaskInstance.dag_id == dag.dag_id,
        TaskInstance.execution_date == execution_date,
        TaskInstance.task_id.in_(task.upstream_task_ids),
    )
    .all()
)
From the returned task instances, get the states and check if State.FAILED is in there:
upstream_states = [ti.state for ti in upstream_task_instances]
fail_this_task = State.FAILED in upstream_states
Perform your own logic:
print("Do logic here...")
And finally, fail the task if fail_this_task=True:
if fail_this_task:
    raise AirflowException("Failing task because one or more upstream tasks failed.")
The end result: the finally task fails whenever an upstream task failed, and the DAG run is therefore marked as failed.
As @JustinasMarozas explained in a comment, a solution is to create a dummy task like:
dummy = DummyOperator(
    task_id='test',
    dag=dag
)
and bind it downstream of special_task:
special_task.set_downstream(dummy)
Thus, the DAG is marked as failed, and the dummy task is marked as upstream_failed.
I hope there is an out-of-the-box solution, but until there is one, this solution does the job.
To expand on Bas Harenslak's answer, a simpler _finally function, which checks the state of all tasks (not only the upstream ones), can be:
from airflow.utils.state import State


def _finally(**kwargs):
    for task_instance in kwargs['dag_run'].get_task_instances():
        if task_instance.current_state() != State.SUCCESS and \
                task_instance.task_id != kwargs['task_instance'].task_id:
            raise Exception("Task {} failed. Failing this DAG run".format(task_instance.task_id))
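As with the earlier answer, this callable is wired up as a PythonOperator with the ALL_DONE trigger rule and placed downstream of everything else; a sketch reusing the task names from the full DAG above:

finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,  # makes dag_run and task_instance available in kwargs on Airflow 1.10.x
    dag=dag,
)

start >> [failing_task, succesful_task] >> finally_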