Airflow DAG randomly fails with timeout

I have a DAG on Airflow (version 1.10.15) that fails seemingly at random, even though the code does not change between runs. It is a training script that normally should not take more than 2 hours, but occasionally it never completes and eventually hits the maximum configured timeout.
When checking the logs, I see the following lines repeated continuously:
    [2022-05-31 12:51:13,219] {databricks_operator.py:89} INFO - <SCRIPT_NAME> in run state: {'life_cycle_state': 'RUNNING', 'result_state': None, 'state_message': 'In run'}
    [2022-05-31 12:51:13,219] {databricks_operator.py:90} INFO - View run status, Spark UI, and logs at <LOG_URL>
    [2022-05-31 12:51:13,219] {databricks_operator.py:91} INFO - Sleeping for 30 seconds.
So there is seemingly no indication of failure in the logs. I then checked the actual notebook and saw that the execution of the DAG stalls at the training step: one or two epochs complete, but then no more training seems to occur until the timeout.
I'm pretty new to Airflow, so at this point I'm not even sure where to begin debugging this. If this were an error in the training script, I presume it would fail consistently. But in most cases the DAG runs correctly; it only occasionally stalls in this endless polling loop and times out.
What might be going wrong?
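For reference, here is a minimal sketch of the kind of task being described (assuming Airflow 1.10-style contrib imports; the notebook path, cluster id, and timeout value are placeholders, not taken from the original setup):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

    with DAG(
        dag_id="train_model",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        train = DatabricksSubmitRunOperator(
            task_id="train",
            databricks_conn_id="databricks_default",
            existing_cluster_id="<CLUSTER_ID>",
            notebook_task={"notebook_path": "/Repos/ml/train"},  # placeholder path
            polling_period_seconds=30,  # produces the "Sleeping for 30 seconds" log lines
            execution_timeout=timedelta(hours=4),  # the "maximum set timeout" the run eventually hits
        )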

Related

Finding out whether a DAG execution is a catchup or a regularly scheduled one

I have an Airflow pipeline that starts with a FileSensor that may perform a number of retries (which makes sense because the producing process sometimes takes longer, and sometimes simply fails).
However, when I restart the pipeline, since it runs in catchup mode, the retries in the FileSensor become spurious: if the file isn't there for a previous day, it won't materialize anymore.
Therefore my question: is it possible to make the behavior of a DAG run contingent on whether it is a catch-up run or a regularly scheduled one?
My apologies if this is a duplicate question: it seems a rather basic problem, but I couldn't find previous questions or documentation about it.
The solution is rather simple.
1. Set a LatestOnlyOperator upstream from the FileSensor.
2. Set an operator of any type you need downstream from the FileSensor, with its trigger rule set to TriggerRule.ALL_DONE.
Both skipped and success states count as "done" states, while an error state doesn't. Hence, in a non-catch-up run the FileSensor has to succeed to give way to the downstream task, while in a catch-up run the downstream task starts right away after the FileSensor is skipped.
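Roughly, the wiring looks like this (a sketch assuming Airflow 1.10-style import paths, which differ on Airflow 2.x, and a hypothetical file path):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.contrib.sensors.file_sensor import FileSensor
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.latest_only_operator import LatestOnlyOperator
    from airflow.utils.trigger_rule import TriggerRule

    with DAG(
        dag_id="file_pipeline",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=True,
    ) as dag:
        # Skips everything downstream of it for runs that are not the latest one.
        latest_only = LatestOnlyOperator(task_id="latest_only")

        # Only actually pokes for the file in the latest (non-catch-up) run.
        wait_for_file = FileSensor(
            task_id="wait_for_file",
            filepath="/data/input.csv",  # hypothetical path
            poke_interval=300,
            retries=3,
            retry_delay=timedelta(minutes=10),
        )

        # Runs whether the sensor succeeded or was skipped: both count as "done".
        process = DummyOperator(
            task_id="process_file",
            trigger_rule=TriggerRule.ALL_DONE,
        )

        latest_only >> wait_for_file >> process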

Airflow tasks are getting marked as success even though they have not yet started

I am using Airflow 1.10.3. Currently, I'm seeing a lot of tasks marked as "success", but the start date, end date, and duration are empty for those tasks. There are no logs for those tasks.
This doesn't trigger any pager or alert at the moment, and we are seeing more of these silent failures day by day.
Any idea what is going wrong?

Airflow Dependencies Blocking Task From Getting Scheduled

I have an Airflow instance that had been running with no problems for 2 months until Sunday. There was a blackout in a system on which my Airflow tasks depend, and some tasks were queued for 2 days. After that we decided it was better to mark all the tasks for that day as failed and just lose that data.
Nevertheless, now all the new tasks get triggered at the proper time but are never set to any state (neither queued nor running). I checked the logs and I see this output:
Dependencies Blocking Task From Getting Scheduled
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)
I get the impression the third point is the reason it is not working.
The scheduler and the webserver were running; I restarted the scheduler anyway and still get the same outcome. I also deleted the data in the MySQL database for one job, and it is still not running.
I also saw a couple of posts saying that a task won't run if depends_on_past is set to true and the previous runs failed, because then the next one will never be executed. I checked that as well and it is not my case.
Any input would be really appreciated.
While debugging a similar issue I found this setting: AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE (see http://airflow.apache.org/docs/apache-airflow/2.0.1/configurations-ref.html#max-dagruns-per-loop-to-schedule). Looking at the Airflow code, the scheduler queries for DAG runs to examine (i.e. to consider running task instances for), and that query is limited to this number of rows (20 by default). So if you have more than 20 DAG runs that are blocked in some way (in our case because task instances were in up_for_retry), the scheduler won't consider the other DAG runs even though they could run fine.
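For reference, a sketch of how that limit could be raised (this option lives under the [scheduler] section from Airflow 2.0 onwards; the value 100 is just an example):

    [scheduler]
    # airflow.cfg: let each scheduler loop examine up to 100 DAG runs instead of the default 20
    max_dagruns_per_loop_to_schedule = 100

The same setting can also be supplied through the AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE environment variable mentioned above.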

Airflow tasks get stuck at “Scheduled” status and never start running during backfill

I'm trying to do some backfills. All the DAG runs start up fine, but for some reason they can't get past a specific task; instead they get stuck in a "Scheduled" state. I'm not sure what "Scheduled" means or why they don't move to "Running". It works fine in the daily run, but the backfill gets stuck for some reason.
This is super annoying since it means I have to start all the tasks for the backfill manually, which works.
Any idea why a task might be stuck in a "Scheduled" state?
Tasks stuck in a "queued" state usually mean one of two things: no queue to execute on, or no pool to execute in.
Which executor are you using? Local, Sequential, or Celery?
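If it's a missing pool or queue, pinning the task explicitly can help narrow it down. A small sketch (assuming Airflow 1.10, the CeleryExecutor, a pool named "backfill_pool" created beforehand under Admin -> Pools, and a worker listening on the "default" queue; all names here are hypothetical):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def do_work(**context):
        print("running for", context["ds"])


    with DAG(
        dag_id="backfill_example",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
    ) as dag:
        task = PythonOperator(
            task_id="my_task",
            python_callable=do_work,
            provide_context=True,   # Airflow 1.10: pass the execution context as kwargs
            pool="backfill_pool",   # the task only runs when this pool has a free slot
            queue="default",        # a Celery worker must be listening on this queue
        )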

How can an Airflow DAG fail if none of the tasks have failed?

We have a long DAG (~60 tasks), and quite frequently we see a run for this DAG in a failed state. When looking at the tasks in the DAG, they are all in a state of either success or null (i.e. not even queued yet). It appears that the DAG run has been marked failed prematurely.
Under what circumstances can this happen, and what should people do to protect against it?
If it's helpful for context, we're running Airflow with the Celery executor on version 1.9.0. If we set the state of the DAG run in question back to running, then all the tasks (and the DAG as a whole) complete successfully.
The only way a DAG can fail without a task failing is through something not connected to any of the tasks. Besides manual intervention (check that nobody on the team is manually failing the DAGs!), the only thing that fails DAGs without considering task states is the timeout checker.
This runs inside the scheduler while it considers whether it needs to schedule a new dag_run. If it finds another active run that has been running longer than the dagrun_timeout argument of the DAG, that run gets killed. As far as I can see this isn't logged anywhere, so the best way to diagnose it is to compare the time the DAG run started with the time the last task finished and see whether the gap is roughly the length of dagrun_timeout.
You can see the code in action here: https://github.com/apache/incubator-airflow/blob/e9f3fdc52cb53f3ac3e9721e5128d17d1c5c418c/airflow/jobs.py#L800
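To illustrate, a minimal sketch of where dagrun_timeout is set (the 2-hour value is just an example; any scheduled run still active beyond it is eligible to be failed by the check described above):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    with DAG(
        dag_id="long_dag",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        # Runs still active after 2 hours can be timed out by the scheduler.
        dagrun_timeout=timedelta(hours=2),
    ) as dag:
        DummyOperator(task_id="noop")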
