Airflow DAG - Failed Task Doesn't Show Fail Status as It Should

I just started with Airflow DAGs and encountered a strange issue with the tool. I am using Airflow version 2.3.3 with the SequentialExecutor.
The script I used:
import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

dag_args = {
    'owner': 'hao',
    'retries': 2,
    'retry_delay': datetime.timedelta(minutes=1)
}

with DAG(
    dag_id='dependency_experiment',
    default_args=dag_args,
    description='experiment with the dag task dependency expression',
    start_date=datetime.datetime.now(),
    schedule_interval='@daily',
    dagrun_timeout=datetime.timedelta(seconds=10),
) as dag:
    pyOp = PythonOperator(
        task_id='pyOp',
        python_callable=lambda x: haha * x,  # 'haha' is intentionally undefined, so the task raises NameError
        op_kwargs={'x': 10}
    )

    pyOp
The log snippet of this task:
NameError: name 'haha' is not defined
[2022-07-27, 18:19:34 EDT] {taskinstance.py:1415} INFO - Marking task as UP_FOR_RETRY. dag_id=dependency_experiment, task_id=pyOp, execution_date=20220728T021932, start_date=20220728T021934, end_date=20220728T021934
[2022-07-27, 18:19:34 EDT] {standard_task_runner.py:92} ERROR - Failed to execute job 44 for task pyOp (name 'haha' is not defined; 19405)
[2022-07-27, 18:19:34 EDT] {local_task_job.py:156} INFO - Task exited with return code 1
[2022-07-27, 18:19:34 EDT] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
Problem:
I purposely defined a PythonOperator that would fail. When I put the script in the DAG folder, the task raised an exception as expected; however, the status for this task always shows as skipped. I cannot figure out why the task doesn't show a failed status as expected. Any suggestions will be much appreciated.

It's because you are defining 'retries' and 'retry_delay' in your dag_args dictionary.
From the Docs:
default_args (Optional[Dict]) – A dictionary of default parameters to be used as constructor keyword parameters when initialising operators. Note that operators have the same hook, and precede those defined here, meaning that if your dict contains ‘depends_on_past’: True here and ‘depends_on_past’: False in the operator’s call default_args, the actual value will be False.
When you set 'retries' to a value, Airflow expects the task to be retried at a later time, so it shows it in the UI as skipped.
If you delete 'retries' and 'retry_delay' from the dag_args, you'll see that task set to failed when you try to initiate the DAG.
When I ran your code, the logs showed:
INFO - Marking task as UP_FOR_RETRY. dag_id=dependency_experiment, task_id=pyOp, execution_date=20220729T060953, start_date=20220729T060953, end_date=20220729T060953
After deleting 'retries' and 'retry_delay', the same log line becomes:
INFO - Marking task as FAILED. dag_id=dependency_experiment, task_id=pyOp, execution_date=20220729T061031, start_date=20220729T061031, end_date=20220729T061031
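In other words, if you want the first exception to surface immediately as a failure, a minimal sketch of the adjusted dictionary from the question would be:

dag_args = {
    'owner': 'hao',
    # no 'retries'/'retry_delay': the first exception now marks the task FAILED
    # (equivalently, you could keep the keys and set 'retries': 0)
}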

Related

Accessing response from SimpleHTTPOperator in another task

Relating to this earlier question, suppose that we have an Apache Airflow DAG that comprises two tasks, first an HTTP request (i.e., SimpleHTTPOperator) and then a PythonOperator that does something with the response of the first task.
Conveniently, using the Dog CEO API as an example, consider the following DAG:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['someone@email.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1),
}

with DAG(
    'dog_api',
    default_args=default_args,
    description='Get nice dog pics',
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=['dog'],
) as dag:
    get_dog = SimpleHttpOperator(
        task_id='get_dog',
        http_conn_id='dog_api',  # NOTE: set up an HTTP connection called 'dog_api' with host 'https://dog.ceo/api'
        endpoint='/breeds/image/random',
        method="GET",
        # xcom_push=True  # NOTE: no such argument in 2.2.0 but sometimes suggested by older guides online
    )

    def xcom_check(ds, **kwargs):
        val = kwargs['ti'].xcom_pull(key='return_value', task_ids='get_dog')
        return f"xcom_check has: {kwargs['ti']} and it says: {val}"

    inspect_dog = PythonOperator(
        task_id='inspect_dog',
        python_callable=xcom_check,
        provide_context=True
    )
We'd like to access the return value of get_dog inside xcom_check. Inspecting the logs shows that get_dog populates the XCom storage nicely. However, the value is not currently reaching the second task, as the logs also show (among other things):
*redacted* Returned value was: xcom_check has: <TaskInstance: dog_api.inspect_dog manual__2021-10-30T16:27:23.081539+00:00 [running]> and it says: None
So obviously, the task instance is "dog_api.inspect_dog" but we'd want it to be "dog_api.get_dog". How is this done? At the time of writing, the same question is asked in the comments of the previous question, upvoted, but unanswered. I also tried adapting this answer but can't figure out what I'm still doing differently.
Your problem is that you did not set a dependency between the tasks, so inspect_dog may run before or in parallel with get_dog. When this happens, inspect_dog sees no XCom value because get_dog hasn't pushed it yet.
You just need to set dependency as:
get_dog >> inspect_dog
Log:
[2021-10-31, 07:07:21 UTC] {python.py:174} INFO - Done. Returned value was: xcom_check has: <TaskInstance: dog_api.inspect_dog manual__2021-10-31T07:05:27.721051+00:00 [running]> and it says: {"message":"https:\/\/images.dog.ceo\/breeds\/pointer-germanlonghair\/hans1.jpg","status":"success"}
As for your comment in the code about xcom_push:
The xcom_push parameter was used in older Airflow versions. It was replaced by do_xcom_push (see source code). Notice that the default value of this parameter is True.
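For example, if you ever wanted to stop the response from being pushed to XCom, a minimal sketch (assuming a provider version where the operator accepts the standard do_xcom_push argument) would be:

get_dog = SimpleHttpOperator(
    task_id='get_dog',
    http_conn_id='dog_api',
    endpoint='/breeds/image/random',
    method='GET',
    do_xcom_push=False,  # defaults to True; False skips pushing the HTTP response to XCom
)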

Airflow how to set default values for dag_run.conf

I'm trying to set up an Airflow DAG that provides default values available from dag_run.conf. This works great when running the DAG from the web UI, using the "Run w/ Config" option. However, when running on the schedule, the dag_run.conf dict is not present, and the task will fail, e.g.
jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'key1'
Below is an example job.
Is it possible to make it so that dag_run.conf always contains the dict defined by params here?
from airflow import DAG
from airflow.utils.dates import hours_ago
from airflow.operators.bash import BashOperator
from datetime import timedelta

def do_something(val1: str, val2: str) -> str:
    return f'echo vars are: "{val1}, {val2}"'

params = {
    'key1': 'def1',
    'key2': 'def2',
}

default_args = {
    'retries': 0,
}

with DAG(
    'template_test',
    default_args=default_args,
    schedule_interval=timedelta(minutes=1),
    start_date=hours_ago(1),
    params=params,
) as dag:
    hello_t = BashOperator(
        task_id='example-command',
        bash_command=do_something('{{ dag_run.conf["key1"] }}', '{{ dag_run.conf["key2"] }}'),
        config=params,
    )
The closest I've seen is in For Apache Airflow, How can I pass the parameters when manually trigger DAG via CLI?; however, there they leverage Jinja and if/else, which would require defining these default parameters twice. I'd like to define them only once.
You could use DAG params to achieve what you are looking for:
params (dict) – a dictionary of DAG level parameters that are made accessible in templates, namespaced under params. These params can be overridden at the task level.
You can define params at DAG or Task levels and also add or modify them from the UI in the Trigger DAG w/ config section.
Example DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    "owner": "airflow",
}

dag = DAG(
    dag_id="example_dag_params",
    default_args=default_args,
    schedule_interval="*/5 * * * *",
    start_date=days_ago(1),
    params={"param1": "first_param"},
    catchup=False,
)

with dag:
    bash_task = BashOperator(
        task_id="bash_task", bash_command="echo bash_task: {{ params.param1 }}"
    )
Output log:
[2021-10-02 20:23:25,808] {logging_mixin.py:104} INFO - Running <TaskInstance: example_dag_params.bash_task 2021-10-02T23:15:00+00:00 [running]> on host worker_01
[2021-10-02 20:23:25,867] {taskinstance.py:1302} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=example_dag_params
AIRFLOW_CTX_TASK_ID=bash_task
AIRFLOW_CTX_EXECUTION_DATE=2021-10-02T23:15:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2021-10-02T23:15:00+00:00
[2021-10-02 20:23:25,870] {subprocess.py:52} INFO - Tmp dir root location:
/tmp
[2021-10-02 20:23:25,871] {subprocess.py:63} INFO - Running command: ['bash', '-c', 'echo bash_task: first_param']
[2021-10-02 20:23:25,884] {subprocess.py:74} INFO - Output:
[2021-10-02 20:23:25,886] {subprocess.py:78} INFO - bash_task: first_param
[2021-10-02 20:23:25,887] {subprocess.py:82} INFO - Command exited with return code 0
From the logs, notice that the dag_run is scheduled and the params are still there.
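Applied to the DAG in the question, this would mean templating against params instead of dag_run.conf. A minimal sketch, reusing the question's do_something helper and params dict:

hello_t = BashOperator(
    task_id='example-command',
    # params.key1/params.key2 fall back to the defaults in `params` on scheduled runs
    # and can still be overridden from the "Trigger DAG w/ config" UI
    bash_command=do_something('{{ params.key1 }}', '{{ params.key2 }}'),
)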
You can find a more extensive example on using parameters in this answer.
Hope that works for you!

Airflow triggering the "on_failure_callback" when the "dagrun_timeout" is exceeded

Currently working on setting up alerts for long-running tasks in Airflow. To cancel/fail the Airflow DAG I've put "dagrun_timeout" in the default_args, and it does what I need: it fails/errors the DAG when it has been running for too long (usually stuck). The only problem is that the function in "on_failure_callback" doesn't get called when the dagrun_timeout is exceeded, because the "on_failure_callback" is on the task level (I think) while the dagrun_timeout is on the DAG level.
How can I execute the "on_failure_callback" when the dagrun_timeout is exceeded, or how can I specify a function to be called when a dag fails? Or should I re-think my approach?
Try setting on_failure_callback during DAG declaration:
with DAG(
    dag_id="failure_callback_example",
    on_failure_callback=_on_dag_run_fail,
    ...
) as dag:
    ...
The explanation is that on_failure_callback defined in default_args will get passed only to the Tasks being created and not to the DAG object.
Here is an example to try this behaviour:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import TaskInstance
from airflow.operators.bash import BashOperator

def _on_dag_run_fail(context):
    print("***DAG failed!! do something***")
    print(f"The DAG failed because: {context['reason']}")
    print(context)

def _alarm(context):
    print("** Alarm Alarm!! **")
    task_instance: TaskInstance = context.get("task_instance")
    print(f"Task Instance: {task_instance} failed!")

default_args = {
    "owner": "mi_empresa",
    "email_on_failure": False,
    "on_failure_callback": _alarm,
}

with DAG(
    dag_id="failure_callback_example",
    start_date=datetime(2021, 9, 7),
    schedule_interval=None,
    default_args=default_args,
    catchup=False,
    on_failure_callback=_on_dag_run_fail,
    dagrun_timeout=timedelta(seconds=45),
) as dag:
    delayed = BashOperator(
        task_id="delayed",
        bash_command='echo "waiting..";sleep 60; echo "Done!!"',
    )
    will_fail = BashOperator(
        task_id="will_fail",
        bash_command="exit 1",
        # on_failure_callback=_alarm,
    )

    delayed >> will_fail
You can find the logs of the callback execution in the scheduler logs, under AIRFLOW_HOME/logs/scheduler/date/failure_callback_example:
[2021-09-24 13:12:34,285] {logging_mixin.py:104} INFO - [2021-09-24 13:12:34,285] {dag.py:862} INFO - Executing dag callback function: <function _on_dag_run_fail at 0x7f83102e8670>
[2021-09-24 13:12:34,336] {logging_mixin.py:104} INFO - ***DAG failed!! do something***
[2021-09-24 13:12:34,345] {logging_mixin.py:104} INFO - The DAG failed because: timed_out
Edit:
Within the context dict, the key reason is passed in order to specify the cause of the DAG run failure. Some values are 'reason': 'timed_out' or 'reason': 'task_failure'. This can be used to perform specific behaviour in the callback based on the reason for the DAG run failure.
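For example, a minimal sketch of a callback that branches on that key (reworking the _on_dag_run_fail above; the print statements are placeholders for whatever alerting you actually use):

def _on_dag_run_fail(context):
    reason = context.get("reason")
    if reason == "timed_out":
        # the DAG run exceeded dagrun_timeout
        print("DAG run timed out; escalate the alert (placeholder)")
    elif reason == "task_failure":
        # a task inside the DAG run failed
        print("A task failed; send a regular notification (placeholder)")
    else:
        print(f"DAG run failed for another reason: {reason}")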

Airflow Exception not being thrown when encountering None/Falsy values

I am trying to pass data from one PythonOperator, _etl_lasic, to another PythonOperator, _download_s3_data, which works fine, but I want to throw an exception when the passed value is None, which should mark the task as a failure.
import airflow
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.exceptions import AirflowFailException

def _etl_lasic(**context):
    path_s3 = None
    context["task_instance"].xcom_push(
        key="path_s3",
        value=path_s3,
    )

def _download_s3_data(templates_dict, **context):
    path_s3 = templates_dict["path_s3"]
    if not path_s3:
        raise AirflowFailException("Path to S3 was not passed!")
    else:
        print(f"Path to S3: {path_s3}")

with DAG(
    dag_id="02_lasic_retraining_without_etl",
    start_date=airflow.utils.dates.days_ago(3),
    schedule_interval="@once",
) as dag:
    etl_lasic = PythonOperator(
        task_id="etl_lasic",
        python_callable=_etl_lasic,
    )
    download_s3_data = PythonOperator(
        task_id="download_s3_data",
        python_callable=_download_s3_data,
        templates_dict={
            "path_s3": "{{task_instance.xcom_pull(task_ids='etl_lasic',key='path_s3')}}"
        },
    )

    etl_lasic >> download_s3_data
Logs:
[2021-08-17 04:04:41,128] {logging_mixin.py:103} INFO - Path to S3: None
[2021-08-17 04:04:41,128] {python.py:118} INFO - Done. Returned value was: None
[2021-08-17 04:04:41,143] {taskinstance.py:1135} INFO - Marking task as SUCCESS. dag_id=02_lasic_retraining_without_etl, task_id=download_s3_data, execution_date=20210817T040439, start_date=20210817T040440, end_date=20210817T040441
[2021-08-17 04:04:41,189] {taskinstance.py:1195} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-08-17 04:04:41,212] {local_task_job.py:118} INFO - Task exited with return code 0
Jinja-templated values are rendered as strings by default. In your case, even though you push an XCom value of None, when the value is pulled via "{{task_instance.xcom_pull(task_ids='etl_lasic',key='path_s3')}}" the value is actually rendered as "None" which doesn't throw an exception based on the current logic.
There are two options that will solve this:
1. Instead of setting path_s3 to None in the _etl_lasic function, set it to an empty string.
2. If you are using Airflow 2.1+, there is a parameter, render_template_as_native_obj, that can be set at the DAG level which will render Jinja-templated values as native Python types (list, dict, etc.). Setting that parameter to True will do the trick without changing how path_s3 is set in the function (see the sketch below). A conceptual example is documented here.
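A minimal sketch of option 2, assuming Airflow 2.1+ (only the DAG constructor changes; everything else in the question's code stays the same):

with DAG(
    dag_id="02_lasic_retraining_without_etl",
    start_date=airflow.utils.dates.days_ago(3),
    schedule_interval="@once",
    # render Jinja results as native Python objects: a pushed None is pulled
    # back as None rather than the string "None", so the falsy check fires
    render_template_as_native_obj=True,
) as dag:
    ...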

airflow reschedule error: dependency 'Task Instance State' PASSED: False

I have a customized sensor that looks like the one below. The idea is that one DAG can have different tasks that start at different times, taking advantage of Airflow's built-in reschedule mechanism.
class MySensor(BaseSensorOperator):
    def __init__(self, *, start_time, tz, **kwargs):
        super().__init__(**kwargs)
        self._start_time = start_time
        self._tz = tz

    @provide_session
    def execute(self, context, session: Session = None):
        # combine the schedule's date with this task's own start time
        dt_start = datetime.combine(context['next_execution_date'].date(), self._start_time)
        dt_start = dt_start.replace(tzinfo=self._tz)
        if datetime.now().timestamp() < dt_start.timestamp():
            # not yet time to start: ask Airflow to reschedule at dt_start
            dt_reschedule = datetime.utcnow().replace(tzinfo=UTC)
            dt_reschedule += timedelta(seconds=dt_start.timestamp() - datetime.now().timestamp())
            raise AirflowRescheduleException(dt_reschedule)
        return super().execute(context)
In the DAG, I have something like the below. However, I notice that when the mode is 'poke', which is the default, the sensor does not work properly.
with DAG(schedule_interval='0 10 * * 1-5', ...) as dag:
    task1 = MySensor(start_time=time(14, 0), mode='poke')
    task2 = MySensor(start_time=time(16, 0), mode='reschedule')
    ...
From the log, I can see the following:
{taskinstance.py:1141} INFO - Rescheduling task, mark task as UP_FOR_RESCHEDULE
[5s later]
{local_task_job.py:102} INFO - Task exited with return code 0
[14s later]
{taskinstance.py:687} DEBUG - <TaskInstance: mydag.mytask execution_date [failed]> dependency 'Task Instance State' PASSED: False, Task in in the 'failed' state which is not a valid state for execution. The task must be cleared in order to be run.
{taskinstance.py:664} INFO - Dependencies not met for <TaskInstance ... [failed]> ...
Why does rescheduling not work with mode='poke'? And when did the scheduler(?) flip the state of the task instance from "up_for_reschedule" to "failed"? Is there a better way to start each task/sensor at a different time? The sensor is an improved version of FileSensor and checks a bunch of files or files with patterns. My current option is to force every task to use mode='reschedule'.
Airflow version 1.10.12
