Airflow DAG run fails when task is up for retry - airflow

In my default args for a DAG I have set the retry and retry_delay parameters. When I monitor the UI, upon a task failure, it briefly changes state to "retry" but immediately following, the DAG state is set to "FAILED" and so the task (that should be up for retry) gets stuck in the queued state. In this situation, shouldn't the dagrun remain in the "running" state since the failed task is up for retry?
I've spent some time searching documentation and code for how the dagrun changes state, but have been unable to get any clarity.

This has since been fixed in the upcoming Airflow 1.8 release. Refer to the following JIRA tickets:
Retry not honored
Retry delay not honored

Related

Airflow task improperly has an `upstream_failed` status after previous task succeeded after 1 retry

I have two tasks A and B. Task A failed once but the retry succeeded and is marked as a success (green). I would expect Task B to perform normally since Task A retry succeeded but it is marked as upstream_failed and was not triggered. Is this a way to fix this behavior?
The Task B has an ALL_SUCCESS trigger rule.
I am using Airflow 2.0.2 on AWS (MWAA).
Trying to restart the scheduler.
upstream_failed happened from scheduler flow or when depends are seting to failed state, you can check states from Task Instances
in Retry Mode:
Task A will be in up_for_retry state until exceed retries number.
If trigger_rule set with all_success(it's default trigger rule), Task B will not trigger untill Task A finished, If every thing running correctly.
Could you add the DAG implementation?

Datadog airflow alert when a dag is paused

We have encountered a scenario recently where someone mistakenly turned off a production dag, and we want to get alert whenever a dag is paused using datadog.
I have checked https://docs.datadoghq.com/integrations/airflow/?tab=host
But have not got any metric for dag to check if it is paused or not.
I can run a custom script in datadog as well.
One of the method is that I exec into postgres pod and get the list of active dags:
select * from dag where is_paused=true;
Or is there any other way I can get the unpaused dag list and also when new dag is added what is the best way to handle it.
I want the alert whenever a unpaused dag is paused.
If you are on Airflow 2 you can use the REST API to query for state of the DAG.
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/get_dag
There is "is_paused" field.
And of you are not Airflow 2, you should be. Airflow 1.10 is end-of-life and will not receive any fixes (including critical security fixes) so you should upgrade as soon as you can.

New task in DAG blocks further DAG executions

We have an ETL DAG which is executed daily. DAG and tasks have the following parameters:
catchup=False
max_active_runs=1
depends_on_past=True
When we add a new task, due to depends_on_past property, no new DAG runs get scheduled, as all previous states for new task are missing.
We would like to avoid having to run manual backfill or manually marking previous runs from UI as it can be easily forgotten, and we also have some dynamic DAGs where tasks get added automatically and halt future DAG executions.
Is there a way to automatically set past executions for new tasks as skipped by default, or some other approach that will allow future DAG runs to execute without human intervention?
We also considered creating a maintenance DAG that would insert missing task executions with skipped state, but would rather not go this route.
Are we missing something as the flow looks like a common thing to do?
Defined in Airflow documentation on BaseOperator:
depends_on_past (bool) – when set to true, task instances will run
sequentially and only if the previous instance has succeeded or has
been skipped. The task instance for the start_date is allowed to run.
As long as there exists a previous instance of the task, if that previous instance is not in the success state, the current instance of the task cannot run.
When adding a task to a DAG with existing dagrun, Airflow will create the missing task instances in the None state for all dagruns. Unfortunately, it is not possible to set the default state of task instances.
I do not believe there is a way to allow future task instances of a DAG with existing dagruns to run without human intervention. Personally, for depends_on_past enabled tasks, I will mark the previous task instance as success either through the CLI or the Airflow UI.
Looks like there is an Github Issue describing exactly what you are experiencing! Feel free to bump this PR or take a stab at it if you would like.
A hacky solution is to set depends_on_past to False as max_active_runs=1 will implicitly guarantee the same behavior. As of the current Airflow version, the scheduler orders both dag runs and task instances by execution date before running them (checked 1.10.x but also 2.0)
Another difference is that next execution will be scheduled even if previous fails. We solved this by retrying unlimited times (setting a ver large number), and alert if retry number is larger than some value.

How can an Airflow DAG fail if none of the tasks have failed?

We have a long dag (~60 tasks), and quite frequently we see a dagrun for this dag in a state of failed. When looking at the tasks in the DAG they are all in a state of either success or null (i.e. not even queued yet). It appears that the dag has got into a state of failed prematurely.
Under what circumstances can this happen, and what should people do to protect against it?
If it's helpful for context we're running Airflow using the Celery executor and currently running on version 1.9.0. If we set the state of the dag in question back to running then all the tasks (and the dag as a whole) complete successfully.
The only way that a DAG can fail without a task failing is through something not connected to any of the tasks. Besides manual intervention (check that nobody on the team is manually failing the dags!) the only thing that fails DAGs outside of considering task states is the timeout checker.
This runs inside the scheduler, while considering whether it needs to schedule a new dag_run. If it finds another active run, which has been running longer than the dagrun_timeout argument of the DAG, then it will get killed. As far as I can see this isn't logged anywhere, so the best way to diagnose this is to look at the time that the DAG started and the time that the last task finished to see if it's roughly the length of dagrun_timeout.
You can see the code in action here: https://github.com/apache/incubator-airflow/blob/e9f3fdc52cb53f3ac3e9721e5128d17d1c5c418c/airflow/jobs.py#L800

Task with no status leads to DAG failure

I have a DAG that fetches data from Elasticsearch and ingests into the data lake. The first task, BeginIngestion, opens in several tasks (one for each resource), and these tasks open in more tasks (one for each shard). After the shards are fetched, the data is uploaded to S3 and then closed into a task EndIngestion, followed by a task AuditIngestion.
It was executing correctly, but now all tasks are executed successfully, but the "closing task" EndIngestion remains with no status. When I refresh the webserver's page, the DAG is marked as Failed.
This image shows successful upstream tasks, with the task end_ingestion with no status and the DAG marked as Failed.
I also dug into the task instance details and found
Dagrun Running: Task instance's dagrun was not in the 'running' state but in the state 'failed'.
Trigger Rule: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'failed': 0, 'upstream_failed': 0, 'skipped': 0, 'done': 49, 'successes': 49}, upstream_task_ids=['s3_finish_upload_ingestion_raichucrud_complain', 's3_finish_upload_ingestion_raichucrud_interaction', 's3_finish_upload_ingestion_raichucrud_company', 's3_finish_upload_ingestion_raichucrud_user', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_location', 's3_finish_upload_ingestion_raichucrud_companytoken', 's3_finish_upload_ingestion_raichucrud_indexevolution', 's3_finish_upload_ingestion_raichucrud_companyindex', 's3_finish_upload_ingestion_raichucrud_producttype', 's3_finish_upload_ingestion_raichucrud_categorycomplainsto', 's3_finish_upload_ingestion_raichucrud_companyresponsible', 's3_finish_upload_ingestion_raichucrud_category', 's3_finish_upload_ingestion_raichucrud_additionalfieldoption', 's3_finish_upload_ingestion_raichucrud_privatecontactconfiguration', 's3_finish_upload_ingestion_raichucrud_phone', 's3_finish_upload_ingestion_raichucrud_presence', 's3_finish_upload_ingestion_raichucrud_responsible', 's3_finish_upload_ingestion_raichucrud_store', 's3_finish_upload_ingestion_raichucrud_socialprofile', 's3_finish_upload_ingestion_raichucrud_product', 's3_finish_upload_ingestion_raichucrud_macrorankingpresenceto', 's3_finish_upload_ingestion_raichucrud_macroinfoto', 's3_finish_upload_ingestion_raichucrud_raphoneproblem', 's3_finish_upload_ingestion_raichucrud_macrocomplainsto', 's3_finish_upload_ingestion_raichucrud_testimony', 's3_finish_upload_ingestion_raichucrud_additionalfield', 's3_finish_upload_ingestion_raichucrud_companypageblockitem', 's3_finish_upload_ingestion_raichucrud_rachatconfiguration', 's3_finish_upload_ingestion_raichucrud_macrorankingitemto', 's3_finish_upload_ingestion_raichucrud_purchaseproduct', 's3_finish_upload_ingestion_raichucrud_rachatproblem', 's3_finish_upload_ingestion_raichucrud_role', 's3_finish_upload_ingestion_raichucrud_requestmoderation', 's3_finish_upload_ingestion_raichucrud_categoryproblemto', 's3_finish_upload_ingestion_raichucrud_companypageblock', 's3_finish_upload_ingestion_raichucrud_problemtype', 's3_finish_upload_ingestion_raichucrud_key', 's3_finish_upload_ingestion_raichucrud_macro', 's3_finish_upload_ingestion_raichucrud_url', 's3_finish_upload_ingestion_raichucrud_document', 's3_finish_upload_ingestion_raichucrud_transactionkey', 's3_finish_upload_ingestion_raichucrud_catprobitemcompany', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_categoryinfoto', 's3_finish_upload_ingestion_raichucrud_marketplace', 's3_finish_upload_ingestion_raichucrud_macroproblemto', 's3_finish_upload_ingestion_raichucrud_categoryrankingto', 's3_finish_upload_ingestion_raichucrud_macrorankingto', 's3_finish_upload_ingestion_raichucrud_categorypageto']
As you see, the "Trigger Rule" field says that one of the tasks is in a "non-successful state", but at the same time the stats shows that all upstreams are marked as successful.
If I reset the database, it doesn't happen, but I can't reset it for every execution (hourly). I also don't want to reset it.
Does anyone have any light?
PS: I am running in an EC2 instance (c4.xlarge) with LocalExecutor.
[EDIT]
I found in the scheduler log that the DAG is in deadlock:
[2017-08-25 19:25:25,821] {models.py:4076} DagFileProcessor157 INFO - Deadlock; marking run failed
I guess this may be due to some exception treatment.
I have had this exact issue before, for me my code was generating duplicate task ids. And it looks like in your case there is also a duplicate id:
s3_finish_upload_ingestion_raichucrud_privatecontactinteraction
This is probably a year late for you, but hopefully this will save others, lots of debugging time :)

Resources