Apache Airflow tasks are stuck in a 'up_for_retry' state - airflow

I've been setting up an airflow cluster on our system and previously it has been working. I'm not sure what I may have done to change this.
I have a DAG which I want to run on a schedule. To make sure it's working I'd also like to trigger it manually. Neither of these seem to be working at the moment and no logs seem to be being written for the task instances. The only logs available are the airflow scheduler logs which generally look healthy.
I am just constantly met with this message:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-12-12T11:34:46.978355+00:00 and task will be retried at 2018-12-12T11:35:08.093313+00:00.
However, if I wait a little the exact same message is presented again except the times have moved forward a little. Therefore, it seems the task is never actually being retried.
I am using a LocalExecutor and the task is an SSHOperator. Simplified code below. All it does is ssh onto another machine and start a bunch of application with a pre-determined directory structure.:
DB_INFO_FILE = 'info.json'
START_SCRIPT = '/bin/start.sh'
TIME_IN_PAST = timezone.convert_to_utc(datetime.today() -
timedelta(days=1))
DEFAULT_ARGS = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': TIME_IN_PAST,
'email': ['some_email#blah.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
def _extract_instance_id(instance_string):
return re.findall(r'\d+', instance_string)[0]
def _read_file_as_json(file_name):
with open(file_name) as open_file:
return json.load(open_file)
DB_INFO = _read_file_as_json(DB_INFO_FILE)
CONFIG_CLIENT = ConfigDbClient(**DB_INFO)
APP_DIRS = CONFIG_CLIENT.get_values('%my-app-info%')
INSTANCE_START_SCRIPT_PATHS = {
_extract_instance_id(instance_string): directory+START_SCRIPT
for instance_string, directory in APP_DIRS.items()
}
# Create an ssh hook which refers to pre-existing connection information
# setup and stored by airflow
SSH_HOOK = SSHHook(ssh_conn_id='my-conn-id')
# Create a DAG object to add tasks to
DAG = DAG('my-dag-id',
default_args=DEFAULT_ARGS)
# Create a task for each app instance.
for instance_id, start_script in INSTANCE_START_SCRIPT_PATHS.items():
task = SSHOperator(
task_id='run-script-{0}'.format(instance_id),
command='bash {0}'.format(start_script),
ssh_hook=SSH_HOOK,
dag=DAG)
It works when I run the tasks individually, via the command line but not via the UI. It seems I can run tasks but I simply cannot trigger a DAG to run. I've tried many combinations of start_date s and interval schedules just to sanity check also.
Any ideas?
And yes, I am aware this question has been asked before and I have looked at all of them but not of the solutions have helped me.

Oh. Your start_date is changing at the same rate or faster than the schedule interval period.
Here's what the scheduler is seeing every couple of seconds:
start_date: 2018-12-11T12:12:12.000Z # E.G. IFF now is 2018-12-12T12:12:12.000Z, a day ago
schedule_interval: timedelta(days=1) # The default
Here's what the scheduler needs for a DAG to run: The last time a run occurred was more than one schedule interval ago. If no scheduled run has occurred, the first scheduled run could start now if one full schedule interval has passed since the start_date as that is the earliest allowable date for execution_date. In which case the dag_run with the execution_date set to the start of that interval period should be created. Then task_instances can be created for any tasks in the DAG whose dependencies are met as long as the task_instance execution_date is after the start_date of the DAG (this is not stored on the dag_run object but recomputed by loading the DAG file just while inspecting the dag's state).
So it's not getting scheduled automatically for the reason that the start date keeps changing just as the interval is satisfied. However if it were -2d at least one run would get scheduled and then any further runs would have to wait until it's 1d after that to be scheduled. It's easier though if you just set a fixed datetime on your start_date.
But what about those odd retries on your manual runs…
You did start a manual run or two. These runs took the current time as the execution_date unless you specified something else. This should be after the start_date, at least until tomorrow, which should clear them to run. But then it seems in your logs you're seeing that they're failing and being marked for retry and also not decrementing your retries. I'm not sure why that would be but could it be that something isn't right with the SSHOperator.
Did you install airflow with the [ssh] extra so that SSHOperator's dependencies are met (specifically paramiko and sshtunnel) on both the webserver and scheduler? One of them is working because I assume it's parsing and showing up in the UI based on being added to the DB.
What do you get if you execute:
airflow test my-dag-id run-script-an_instance_id 2018-12-12T12:12:12
You know that the scheduler and webserver are looping over refilling their DAG bag and so rerunning this DAG file a few 1000 times a day, reloading that json (it's local access, so similar to importing a module), and recreating that SSHHook with a DB lookup. I don't see anything fancy done setting up this hook, why not just remove ssh_hook from the SSHOperator and replace it with ssh_conn_id='my-conn-id' so it can be created once at execution time?
I doubt that's the issue that's causing the retries that just roll forward though.

I had a task stuck in up_for_retry for almost 24 hours before I noticed it, and it had nothing to do with the start_date, end_date, or any other classic beginner's problem.
I resorted to reading the source code, and found that Airflow treats up_for_retry tasks differently if they are part of a backfill DAG run.
So I connected to the metadata DB and changed backfill_ to scheduled__ in the dag_run row corresponding to the stuck task. Airflow started running the stuck task immediately.

Related

Can catchup be used to backfill failed dag runs in Airflow?

I have read that Airflows catchup feature applies to task instances that do not yet have a state - i.e. the scheduler will pick up any execution dates where the DAG has not yet ran (starting from the given start_date) - is this correct, and if so, does this mean catchup does not apply to failed DAG runs?
I am looking for a way to backfill any execution dates that failed, rather than not having ran at all.
Take a look at the backfill command options. You could use rerun-failed-tasks:
if set, the backfill will auto-rerun all the failed tasks for the backfill date range instead of throwing exceptions
Default: False
or reset-dagruns:
if set, the backfill will delete existing backfill-related DAG runs and start anew with fresh, running DAG runs
Default: False
Also, keep in mind this:
If reset_dag_run option is used, backfill will first prompt users whether airflow should clear all the previous dag_run and task_instances within the backfill date range. If rerun_failed_tasks is used, backfill will auto re-run the previous failed task instances within the backfill date range.
Before doing anything like the above, my suggestion is that you try it first with some dummy DAG or similiar.

New task in DAG blocks further DAG executions

We have an ETL DAG which is executed daily. DAG and tasks have the following parameters:
catchup=False
max_active_runs=1
depends_on_past=True
When we add a new task, due to depends_on_past property, no new DAG runs get scheduled, as all previous states for new task are missing.
We would like to avoid having to run manual backfill or manually marking previous runs from UI as it can be easily forgotten, and we also have some dynamic DAGs where tasks get added automatically and halt future DAG executions.
Is there a way to automatically set past executions for new tasks as skipped by default, or some other approach that will allow future DAG runs to execute without human intervention?
We also considered creating a maintenance DAG that would insert missing task executions with skipped state, but would rather not go this route.
Are we missing something as the flow looks like a common thing to do?
Defined in Airflow documentation on BaseOperator:
depends_on_past (bool) – when set to true, task instances will run
sequentially and only if the previous instance has succeeded or has
been skipped. The task instance for the start_date is allowed to run.
As long as there exists a previous instance of the task, if that previous instance is not in the success state, the current instance of the task cannot run.
When adding a task to a DAG with existing dagrun, Airflow will create the missing task instances in the None state for all dagruns. Unfortunately, it is not possible to set the default state of task instances.
I do not believe there is a way to allow future task instances of a DAG with existing dagruns to run without human intervention. Personally, for depends_on_past enabled tasks, I will mark the previous task instance as success either through the CLI or the Airflow UI.
Looks like there is an Github Issue describing exactly what you are experiencing! Feel free to bump this PR or take a stab at it if you would like.
A hacky solution is to set depends_on_past to False as max_active_runs=1 will implicitly guarantee the same behavior. As of the current Airflow version, the scheduler orders both dag runs and task instances by execution date before running them (checked 1.10.x but also 2.0)
Another difference is that next execution will be scheduled even if previous fails. We solved this by retrying unlimited times (setting a ver large number), and alert if retry number is larger than some value.

My airflow DAG and operator are not doing what I expected

edit: I figured out my problem. I didn't understand the different between triggering a run and it running immediately and keeping it on and letting it do its job. The code is fine.
I wrote this simple program to figure out airflow. On the hour it is supposed to print to a file "hello world", but it's doing it immediately. Does someone see where I am going wrong?
def print_hello():
f = open('helloword.txt','a')
f.write( 'Hello World!')
f.close()
dag = DAG('hello_world', description='Simple tutorial DAG', schedule_interval='#hourly',
start_date=datetime(2018, 5, 31), catchup=False)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
The start date is 2018-05-31 and the schedule interval is #hourly, so the execution date for the first run would normally be 2018-05-31T00:00:00 with a start date >= ~2018-05-31T01:00:00.
In this case, you have set catchup to false, so instead only the most recent DAG run will be created. I would expect that DAG run created to be 2018-05-31T21:00:00 right now.
The current UTC time is 2018-05-31T22:00:00 right now. Since the start date timestamp 2018-05-31T00:00:00 is in the past, the Airflow scheduler will schedule and start the task immediately.
You can delete the DAG runs and task instances and then change the start date to 2018-06-01 if you want it to start fresh tomorrow. It would not run immediately in this case if you choose a start date in the future.
You can find a bit more info about how the scheduler works here:
Airflow Wiki > Scheduler Basics
Airflow Docs > Scheduling & Triggers > To Keep in Mind
Your code looks fine to me. Are you seeing some lines appended to the file if you put your DAG off?
I think what you're seeing is the backfill executions running. You put your start date today, implicitly at midnight. Airflow will therefore catch up and fire up these DAG runs first before eventually running your task every hour.

How can an Airflow DAG fail if none of the tasks have failed?

We have a long dag (~60 tasks), and quite frequently we see a dagrun for this dag in a state of failed. When looking at the tasks in the DAG they are all in a state of either success or null (i.e. not even queued yet). It appears that the dag has got into a state of failed prematurely.
Under what circumstances can this happen, and what should people do to protect against it?
If it's helpful for context we're running Airflow using the Celery executor and currently running on version 1.9.0. If we set the state of the dag in question back to running then all the tasks (and the dag as a whole) complete successfully.
The only way that a DAG can fail without a task failing is through something not connected to any of the tasks. Besides manual intervention (check that nobody on the team is manually failing the dags!) the only thing that fails DAGs outside of considering task states is the timeout checker.
This runs inside the scheduler, while considering whether it needs to schedule a new dag_run. If it finds another active run, which has been running longer than the dagrun_timeout argument of the DAG, then it will get killed. As far as I can see this isn't logged anywhere, so the best way to diagnose this is to look at the time that the DAG started and the time that the last task finished to see if it's roughly the length of dagrun_timeout.
You can see the code in action here: https://github.com/apache/incubator-airflow/blob/e9f3fdc52cb53f3ac3e9721e5128d17d1c5c418c/airflow/jobs.py#L800

Airflow 1.9.0 is queuing but not launching tasks

Airflow is randomly not running queued tasks some tasks dont even get queued status. I keep seeing below in the scheduler logs
[2018-02-28 02:24:58,780] {jobs.py:1077} INFO - No tasks to consider for execution.
I do see tasks in database that either have no status or queued status but they never get started.
The airflow setup is running https://github.com/puckel/docker-airflow on ECS with Redis. There are 4 scheduler threads and 4 Celery worker tasks. For the tasks that are not running are showing in queued state (grey icon) when hovering over the task icon operator is null and task details says:
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:- The scheduler is down or under heavy load
Metrics on scheduler do not show heavy load. The dag is very simple with 2 independent tasks only dependent on last run. There are also tasks in the same dag that are stuck with no status (white icon).
Interesting thing to notice is when I restart the scheduler tasks change to running state.
Airflow can be a bit tricky to setup.
Do you have the airflow scheduler running?
Do you have the airflow webserver running?
Have you checked that all DAGs you want to run are set to On in the web ui?
Do all the DAGs you want to run have a start date which is in the past?
Do all the DAGs you want to run have a proper schedule which is shown in the web ui?
If nothing else works, you can use the web ui to click on the dag, then on Graph View. Now select the first task and click on Task Instance. In the paragraph Task Instance Details you will see why a DAG is waiting or not running.
I've had for instance a DAG which was wrongly set to depends_on_past: True which forbid the current instance to start correctly.
Also a great resource directly in the docs, which has a few more hints: Why isn't my task getting scheduled?.
I'm running a fork of the puckel/docker-airflow repo as well, mostly on Airflow 1.8 for about a year with 10M+ task instances. I think the issue persists in 1.9, but I'm not positive.
For whatever reason, there seems to be a long-standing issue with the Airflow scheduler where performance degrades over time. I've reviewed the scheduler code, but I'm still unclear on what exactly happens differently on a fresh start to kick it back into scheduling normally. One major difference is that scheduled and queued task states are rebuilt.
Scheduler Basics in the Airflow wiki provides a concise reference on how the scheduler works and its various states.
Most people solve the scheduler diminishing throughput problem by restarting the scheduler regularly. I've found success at a 1-hour interval personally, but have seen as frequently as every 5-10 minutes used too. Your task volume, task duration, and parallelism settings are worth considering when experimenting with a restart interval.
For more info see:
Airflow: Tips, Tricks, and Pitfalls (section "The scheduler should be restarted frequently")
Bug 1286825 - Airflow scheduler stopped working silently
Airflow at WePay (section "Restart everything when deploying DAG changes.")
This used to be addressed by restarting every X runs using the SCHEDULER_RUNS config setting, although that setting was recently removed from the default systemd scripts.
You might also consider posting to the Airflow dev mailing list. I know this has been discussed there a few times and one of the core contributors may be able to provide additional context.
Related Questions
Airflow tasks get stuck at "queued" status and never gets running (especially see Bolke's answer here)
Jobs not executing via Airflow that runs celery with RabbitMQ
Make sure you don't have datetime.now() as your start_date
It's intuitive to think that if you tell your DAG to start "now" that it'll execute "now." BUT, that doesn't take into account how Airflow itself actually reads datetime.now().
For a DAG to be executed, the start_date must be a time in the past, otherwise Airflow will assume that it's not yet ready to execute. When Airflow evaluates your DAG file, it interprets datetime.now() as the current timestamp (i.e. NOT a time in the past) and decides that it's not ready to run. Since this will happen every time Airflow heartbeats (evaluates your DAG) every 5-10 seconds, it'll never run.
To properly trigger your DAG to run, make sure to insert a fixed time in the past (e.g. datetime(2019,1,1)) and set catchup=False (unless you're looking to run a backfill).
By design, an Airflow DAG will execute at the completion of its schedule_interval
That means one schedule_interval AFTER the start date. An hourly DAG, for example, will execute its 2pm run when the clock strikes 3pm. The reasoning here is that Airflow can't ensure that all data corresponding to the 2pm interval is present until the end of that hourly interval.
This is a peculiar aspect to Airflow, but an important one to remember - especially if you're using default variables and macros.
Time in Airflow is in UTC by default
This shouldn't come as a surprise given that the rest of your databases and APIs most likely also adhere to this format, but it's worth clarifying.
Full article and source here
I also had a similar issue, but it is mostly related to SubDagOperator with more than 3000 task instances in total (30 tasks * 44 subdag tasks).
What I found out is that airflow scheduler mainly responsible for putting your scheduled tasks in to "Queued Slots" (pool), while airflow celery workers is the one who pick up your queued task and put it into the "Used Slots" (pool) and run it.
Based on your description, your scheduler should work fine. I suggest you check your "celery workers" log to see whether there is any error, or restart it to see whether it helps or not. I experienced some issues that celery workers normally go on strike for a few minutes then start working again (especially on SubDagOperator)
One of the very silly reasons could be that the DAG is "paused" which is the default state for the first time. I lost around 2 hrs fighting it. If you are using Airflow Web interface, then this shows up as a toggle next to your DAG in the list
I am facing the issue today and found that bullet point 4 from tobi6 answer below worked out and resolved the issue
*'Do all the DAGs you want to run have a start date which is in the past?'*
I am using airflow version v1.10.3
My problem was one step further, in addition to my tasks being queued, I couldn't see any of my celery workers on the Flower UI. The solution was that, since I was running my celery worker as root I had to make changes in my ~/.bashrc file.
The following steps made it work:
Add export C_FORCE_ROOT=true to your ~/.bashrc file
source ~/.bashrc
Run worker : nohup airflow worker $* >> ~/airflow/logs/worker.logs &
Check your Flower UI at http://{HOST}:5555
I think it's worth mentioning that there's an open issue that can cause tasks to fail to run with no obvious reason: https://issues.apache.org/jira/browse/AIRFLOW-5506
The problem seems to occur when using LocalScheduler connected to a PostgreSQL airflow db, and results in the scheduler logging a number of "Killing PID xxxx" lines. Check the scheduler logs after the DAGs have been stalled without starting any new tasks for a while.
You can try to stop the webserver and the scheduler:
ps -ef | grep airflow #show the process id
kill 1234 #kill the webserver
kill 5678 #kill the scheduler
Remove the files from the airflow folder if they exist (they will be created again):
airflow-scheduler.err
airflow-scheduler.pid
airflow-webserver.err
airflow-webserver.pid
Start the webserver and the scheduler again.
airflow webserver -D
airflow scheduler -D
-D will make the services run in the background.
I had a similar issue of a triggered DAG "running" indefinitely because its first task stuck in "queued" state.
I realized this was because of a "ghost" DAG that actually changed name. It seems that since the DAG has run in the past (had data in the postgresDG) and was referenced as child-DAG in other DAGs, the trigger of the parent DAGs referencing the old name would "resurrect" the old DAG name, but with the new code. Indeed the old DAG name and new DAG code did not match, thus producing an "infinite queued execution" bug.
Solution:
Delete the all the previous DAG runs of the previous DAG-runs with the old name
Restart everything (webserver, worker, executor,...) OR Delete relevant DAGs (with the "delete DAG" button in the UI).
The interpretation of the bug can vary but this fix worked in my case.
One more thing to check is whether "the concurrency parameter of your DAG reached?".
I'd experienced the same situation when some task was shown as NO STATUS.
It turned out that my File_Sensor tasks were run with timeout set up to 1 week, while DAG time out was only 5 hours. That leaded to the case when the Files were missing, many sensors tasked were running at the same time. Which results the concurrency overloaded!
The depending tasks couldn't be started before the sensor task succeed, when the dag timeout, they got NO STATUS.
My solution:
Carefully set tasks and DAG timeout
Increase dag_concurrency in airflow.cfg file in AIRFLOW_HOME folder.
Please refer to the docs.
https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
I believe this is an issue with celery version 4.2.1 and redis 3.0.1 as described here:
https://github.com/celery/celery/issues/3808
we resolved the issue by downgrading our redis version 2.10.6:
redis==2.10.6
In my case, tasks were not being launched because I had for all operators a pool configured and hadn't created it, hence, tasks were not even scheduled. An operator looks like:
foo = DummyOperator(
task_id='foo',
dag=dag,
pool='capser'
)
To create a pool go to Admin > Pools > Create and set slots, for example, 128, which runs successfully for me. You can also configure by using the CLI.
counter intuitive UI message!
I have spent days on this. So want to elaborate on my specific issue (s).
Each dag has a state. By default the state could be 'pause' or 'not pause'.
The first confusion arises from - what is the default state on startup? The UI message attached seems to indicate that the state is 'not pause' and on clicking the toggle, it pauses.
In reality, the default state is 'pause'. This state can be controlled by settings, environment variables, parameters and UI. I have detailed them below.
The second confusion arises because of the UI again. When we manually trigger a dag which is in the pause state. The UI shows the dag as running (green circle)! But the dag is actually in the 'pause' state. The tasks will not execute unless it is 'un-paused'.
If we read the task instance details. The message would be
Task is in the 'None' state which is not a valid state for execution. The task must be cleared in order to be run.
What is the 'None' state!? And clear which task?!
The actual problem is that the dag is in the pause state. On toggling the dag state the tasks would start to execute.
The pause state of the dag can be changed by
clicking the button on the UI.
set your particular dag to run, by adding the below parameter to your dag
DAG(dag_id='your-dag', is_paused_upon_creation=True)
setting the config variable in airflow.cfg file. (caution: this will start all your dags including the example ones)
dags_are_paused_at_creation = FALSE
configuring an environment variable before starting up the scheduler/webserver.(caution: this will start all your dags including the example ones)
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
Make sure that your task is assigned to the same queue, that your workers is listening to. This means that in your DAG file you have to set 'queue': 'queue_name' and in your worker configuration you have to set either default_queue = 'queue_name' in the airflow.cfg or AIRFLOW__OPERATORS__DEFAULT_QUEUE: 'queue_name' in the docker-compose.yaml (in case you're using Docker).

Resources