Airflow Clear automatically triggering tasks? - airflow

I am running airflow clear -t task_regex -s 2019-02-23 -e 2019-02-24 dag_id to clear and then re-run a task. However, Airflow is automatically queuing up this task to run, instead of waiting for me to run airflow backfill. Is this a setting I can change? Also, it looks like it queued only some of them to run, and not others. In other words, immediately after clearing the task, half the dates had a green box and the other half had a white box.

I believe the catchup_by_default setting is meant for this very purpose; set it to False to achieve what you want. Quoting the configuration docs:
Command Line Backfills still work, but the scheduler will not do
scheduler catchup if this is False,
however it can be set on a per DAG basis in the DAG definition
(catchup)
So you need not configure it globally: it can also be changed on a per-DAG basis with the catchup parameter.

Related

Prevent Airflow from triggering on scheduler restart

My Airflow scheduler went down for some reason, and when I restarted it, all the DAGs triggered simultaneously, as if it were catching up on the missed jobs. It also seems that when I modify a DAG, the workflow triggers. These unexpected triggers corrupt my data and make me lose trust in the system.
Is there a way to prevent a DAG running unexpectedly unless it is the exact time (no catch-up) or unless it is manually triggered?
The airflow scheduler will, at a minimum, attempt to run the current schedule interval when it is online to do so. This means that if the scheduler process is offline for a period of time, when it comes back online it will reconcile which jobs should have run and attempt to start those jobs.
There is some control using catchup, which tells the scheduler that only the latest job should be run and schedule intervals other than the latest that were missed do not need to be run.
Some info on catchup here: https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#catchup
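To make the effect of catchup concrete, here is a minimal sketch in plain Python. It is a simplified model, not Airflow's scheduler code: the function name runs_to_schedule is hypothetical, the rule that a run becomes due once its whole interval has elapsed is an assumption for illustration, and runs that already exist in the metadata database are ignored.

```python
from datetime import datetime, timedelta


def runs_to_schedule(start_date, interval, now, catchup):
    """Return the execution dates a scheduler would create when it comes online.

    Simplified model (an assumption, not Airflow's implementation): the run
    for execution date d is due once the interval [d, d + interval) is over.
    """
    due = []
    d = start_date
    while d + interval <= now:
        due.append(d)
        d += interval
    # With catchup disabled, only the most recent completed interval is kept.
    return due if catchup else due[-1:]


start = datetime(2019, 2, 20)
now = datetime(2019, 2, 24, 6, 0)  # scheduler comes back online here
daily = timedelta(days=1)

all_runs = runs_to_schedule(start, daily, now, catchup=True)
latest_only = runs_to_schedule(start, daily, now, catchup=False)
```

In this model, catchup=True yields every missed daily interval since the start date, while catchup=False yields only the interval that ended most recently.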
As for the second question (whether a DAG can be prevented from running unless it is the exact time or it is manually triggered): there is no way to tell Airflow to attempt a job only at the exact time it is supposed to run and never again after the fact. You can, however, set the schedule interval to None so that the job is never scheduled. In that case you can trigger the job manually through the UI or via the Airflow API.
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#cron-presets
preset | meaning
-------+----------------------------------------------------------------
None | Don’t schedule, use for exclusively “externally triggered” DAGs

A DAG is preventing other smaller DAGs tasks to start

I have a big DAG with around 400 tasks that starts at 8:00 and runs for about 2.5 hours.
There are some smaller DAGs that need to start at 9:00. They are scheduled but cannot start until the first DAG finishes.
I reduced the big DAG's concurrency to 6. It now runs only 6 tasks in parallel, but this does not solve the issue: tasks in the other DAGs still don't start.
There is no other global configuration to limit the number of running tasks, other smaller dags usually run in parallel.
What can be the issue here?
Airflow version: 2.1 with LocalExecutor and a Postgres backend, running on a 20-core server.
Tasks of active DAGs not starting
I don't think it's related to concurrency. This could be related to Airflow using the mini-scheduler.
When a task finishes, the task supervisor process runs a "mini scheduler" that attempts to schedule more tasks of the same DAG. This lets the DAG finish sooner, because downstream tasks are set to the scheduled state directly. One of its side effects, however, is that it can starve other DAGs in some circumstances. A case like yours, where one very big DAG takes a long time to complete and starts before the smaller DAGs, may be exactly where starvation happens.
Try to set schedule_after_task_execution = False in airflow.cfg and it should solve your issue.
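As a sketch of where this setting lives, assuming it belongs in the [scheduler] section of airflow.cfg (as in the Airflow 2.x configuration reference), the snippet below parses an illustrative config fragment with Python's standard configparser and checks that the value reads as a boolean False. Airflow itself is not imported here.

```python
import configparser

# Illustrative fragment of airflow.cfg; the section name is an assumption
# based on the Airflow 2.x configuration reference.
AIRFLOW_CFG_FRAGMENT = """
[scheduler]
schedule_after_task_execution = False
"""

parser = configparser.ConfigParser()
parser.read_string(AIRFLOW_CFG_FRAGMENT)

# getboolean accepts values such as "False", "false", "0" and "no".
mini_scheduler_enabled = parser.getboolean(
    "scheduler", "schedule_after_task_execution"
)
```

The scheduler reads its configuration at startup, so it most likely needs a restart after the file is changed.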
Why don't you use the option to invoke the task after the previous one is finished?
In the first DAG, insert the call to the next one as follows:
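from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_new_dag = TriggerDagRunOperator(
    task_id="trigger_next_dag",    # placeholder: name for this task
    trigger_dag_id="next_dag_id",  # placeholder: dag_id of the DAG to trigger
    dag=dag,                       # the DAG object of the first DAG
)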
This operator will start a new DAG after the previous one is executed.
Documentation: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html

Airflow DAG Multiple Runs

I have a DAG that I want to run multiple times, each run starting after the previous one completes successfully. For example, I want to run it 10 times and then stop. Is there a way to accomplish this? I looked into scheduling with cron, but it doesn't seem clean, and triggering the DAG via the UI multiple times doesn't work either (the runs execute in parallel).
I found a solution to my use case. It combines depends_on_past=True (mentioned by @Hitesh Gupta) with the following setting in airflow.cfg:
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 1
This allowed us to have only one active DAG run at a time, and to not continue with the next DAG run if there was a failure in the previous run. I tested this on Airflow 1.10.1.
You can, in addition to supplying a start_date, provide your DAG an end_date
Quoting the docstring
:param start_date: The timestamp from which the scheduler will attempt
to backfill
:type start_date: datetime.datetime
:param end_date: A date beyond which your DAG won't run, leave to None for open ended scheduling
:type end_date: datetime.datetime
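For a fixed number of runs, the end_date can be derived from the start_date and the schedule interval. The sketch below is plain Python, not Airflow code; the helper name end_date_for_n_runs is hypothetical, and it assumes that execution dates up to and including end_date are scheduled.

```python
from datetime import datetime, timedelta


def end_date_for_n_runs(start_date, interval, n_runs):
    """Return the last execution date so that exactly n_runs dates fall in
    [start_date, end_date] (assumes end_date itself is still scheduled)."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    return start_date + (n_runs - 1) * interval


start = datetime(2019, 1, 1)
daily = timedelta(days=1)

end = end_date_for_n_runs(start, daily, 10)
execution_dates = [start + i * daily for i in range(10)]
```

The resulting start and end values would then be passed as start_date and end_date when constructing the DAG.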
While unrelated, also have a look at following scheduler settings in airflow.cfg as mentioned in this article
run_duration
num_runs
UPDATE-1
In his article Use apache airflow to run task exactly once, @Andreas P describes a clever technique that I believe can be adapted to your use case. While even that won't be a very tidy solution, it would at least let you specify the number of runs (an integer) for the DAG beforehand, instead of an end_date.
Alternatively (assuming you implement the above approach), rather than baking this skip-the-DAG-after-max-runs functionality into each DAG, you can create a separate orchestrator DAG that disables a given DAG after its max runs have passed.
You have to set the depends_on_past property. It is set in the DAG's default arguments, and it makes each task instance depend on the success of the previous instance of the same task. This should fix your problem.

How do I queue up backfills in airflow?

I have a DAG where max_active_runs is set to 2, but now I want to run backfills for 20ish runs. I expected Airflow to schedule all the backfill runs but start only 2 at a time, and that doesn't seem to happen. When I run the backfill command it starts two runs, but the command doesn't return because it didn't manage to start them all; instead, it keeps trying until it succeeds.
So what I expected was this:
I ran the backfill command
All the runs are marked as running
Command returns since now everything should be scheduled
Two of the runs start
What I experienced:
I ran the backfill command
Two runs are marked as running and start
Command doesn't return since it can't start the rest
The experienced behavior makes it hard to just start a backfill and then shut down your computer. So am I doing something wrong?
Update
Using trigger_dag instead of backfill did what I wanted. With backfill, it seems the command needs to keep running for the backfill to continue, which feels weird. The difference with trigger_dag is that it triggers the DAG and then lets Airflow deal with it. Maybe it has something to do with how the backfill command is executed when using gcloud composer environments run <env> --location=<location> backfill -- ...?
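Triggering 20 runs by hand is tedious, so the commands can be generated. This is a minimal sketch that only builds command strings and does not call Airflow: the helper name trigger_commands and the dag id my_dag are hypothetical, and the -e (execution date) flag is assumed from the Airflow 1.x trigger_dag CLI.

```python
from datetime import date, timedelta


def trigger_commands(dag_id, first_day, n_days):
    """Build one `airflow trigger_dag` command per daily execution date.

    Assumes the Airflow 1.x CLI form: airflow trigger_dag -e EXEC_DATE dag_id
    """
    return [
        f"airflow trigger_dag -e {first_day + timedelta(days=i)} {dag_id}"
        for i in range(n_days)
    ]


commands = trigger_commands("my_dag", date(2019, 2, 1), 20)
for command in commands:
    print(command)
```

Each printed line can be run as is, or wrapped in whatever launcher your environment uses. Once the runs are created, max_active_runs should limit how many of them execute at once.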

Airflow 1.9.0 is queuing but not launching tasks

Airflow is randomly not running queued tasks, and some tasks don't even reach the queued status. I keep seeing the line below in the scheduler logs:
[2018-02-28 02:24:58,780] {jobs.py:1077} INFO - No tasks to consider for execution.
I do see tasks in the database that have either no status or the queued status, but they never get started.
The Airflow setup runs https://github.com/puckel/docker-airflow on ECS with Redis. There are 4 scheduler threads and 4 Celery worker tasks. The tasks that are not running show up in the queued state (grey icon); when hovering over the task icon, the operator is null, and the task details say:
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:- The scheduler is down or under heavy load
Metrics on scheduler do not show heavy load. The dag is very simple with 2 independent tasks only dependent on last run. There are also tasks in the same dag that are stuck with no status (white icon).
Interesting thing to notice is when I restart the scheduler tasks change to running state.
Airflow can be a bit tricky to set up.
Do you have the airflow scheduler running?
Do you have the airflow webserver running?
Have you checked that all DAGs you want to run are set to On in the web ui?
Do all the DAGs you want to run have a start date which is in the past?
Do all the DAGs you want to run have a proper schedule which is shown in the web ui?
If nothing else works, you can use the web UI to click on the DAG, then on Graph View. Now select the first task and click on Task Instance. In the paragraph Task Instance Details you will see why a DAG is waiting or not running.
For instance, I once had a DAG that was wrongly set to depends_on_past: True, which prevented the current instance from starting correctly.
Also a great resource directly in the docs, which has a few more hints: Why isn't my task getting scheduled?.
I'm running a fork of the puckel/docker-airflow repo as well, mostly on Airflow 1.8 for about a year with 10M+ task instances. I think the issue persists in 1.9, but I'm not positive.
For whatever reason, there seems to be a long-standing issue with the Airflow scheduler where performance degrades over time. I've reviewed the scheduler code, but I'm still unclear on what exactly happens differently on a fresh start to kick it back into scheduling normally. One major difference is that scheduled and queued task states are rebuilt.
Scheduler Basics in the Airflow wiki provides a concise reference on how the scheduler works and its various states.
Most people solve the scheduler diminishing throughput problem by restarting the scheduler regularly. I've found success at a 1-hour interval personally, but have seen as frequently as every 5-10 minutes used too. Your task volume, task duration, and parallelism settings are worth considering when experimenting with a restart interval.
For more info see:
Airflow: Tips, Tricks, and Pitfalls (section "The scheduler should be restarted frequently")
Bug 1286825 - Airflow scheduler stopped working silently
Airflow at WePay (section "Restart everything when deploying DAG changes.")
This used to be addressed by restarting every X runs using the SCHEDULER_RUNS config setting, although that setting was recently removed from the default systemd scripts.
You might also consider posting to the Airflow dev mailing list. I know this has been discussed there a few times and one of the core contributors may be able to provide additional context.
Related Questions
Airflow tasks get stuck at "queued" status and never gets running (especially see Bolke's answer here)
Jobs not executing via Airflow that runs celery with RabbitMQ
Make sure you don't have datetime.now() as your start_date
It's intuitive to think that if you tell your DAG to start "now" that it'll execute "now." BUT, that doesn't take into account how Airflow itself actually reads datetime.now().
For a DAG to be executed, the start_date must be a time in the past, otherwise Airflow will assume that it's not yet ready to execute. When Airflow evaluates your DAG file, it interprets datetime.now() as the current timestamp (i.e. NOT a time in the past) and decides that it's not ready to run. Since this will happen every time Airflow heartbeats (evaluates your DAG) every 5-10 seconds, it'll never run.
To properly trigger your DAG to run, make sure to insert a fixed time in the past (e.g. datetime(2019,1,1)) and set catchup=False (unless you're looking to run a backfill).
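The following plain-Python sketch models why a start_date of datetime.now() never becomes due. It is a simplified model, not Airflow's scheduler: the function first_run_is_due and its rule (the first run is due once start_date plus one interval lies in the past) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def first_run_is_due(start_date, interval, now):
    """Simplified rule: the first run is due once its interval has elapsed."""
    return start_date + interval <= now


interval = timedelta(hours=1)

# Moments at which the DAG file is re-evaluated, ten minutes apart.
parse_times = [
    datetime(2019, 1, 1, 12, 0) + timedelta(minutes=10 * i) for i in range(12)
]

# start_date=datetime.now(): the start date moves forward with every parse.
moving_start = [first_run_is_due(now, interval, now) for now in parse_times]

# Fixed start date in the past: the same check succeeds.
fixed_start = [
    first_run_is_due(datetime(2019, 1, 1), interval, now) for now in parse_times
]
```

With the moving start date the check is False at every evaluation, however long you wait; with the fixed start date it is True at every evaluation.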
By design, an Airflow DAG will execute at the completion of its schedule_interval
That means one schedule_interval AFTER the start date. An hourly DAG, for example, will execute its 2pm run when the clock strikes 3pm. The reasoning here is that Airflow can't ensure that all data corresponding to the 2pm interval is present until the end of that hourly interval.
This is a peculiar aspect to Airflow, but an important one to remember - especially if you're using default variables and macros.
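A small sketch of that timing rule in plain Python. The function name run_starts_at is hypothetical; the ds value mirrors Airflow's {{ ds }} macro, which holds the execution date formatted as YYYY-MM-DD.

```python
from datetime import datetime, timedelta


def run_starts_at(execution_date, interval):
    """A run stamped with execution_date starts when its interval ends."""
    return execution_date + interval


# Hourly DAG: the 2pm run starts at 3pm.
hourly_start = run_starts_at(datetime(2019, 1, 1, 14, 0), timedelta(hours=1))

# Daily DAG: the run stamped 2019-01-01 starts on 2019-01-02.
execution_date = datetime(2019, 1, 1)
daily_start = run_starts_at(execution_date, timedelta(days=1))

# The macro value follows the execution date, not the wall-clock start.
ds = execution_date.strftime("%Y-%m-%d")
wall_clock_day = daily_start.strftime("%Y-%m-%d")
```

Here ds is one day behind wall_clock_day, which is why queries templated with the execution date select the interval that just ended rather than the day the task actually runs.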
Time in Airflow is in UTC by default
This shouldn't come as a surprise given that the rest of your databases and APIs most likely also adhere to this format, but it's worth clarifying.
Full article and source here
I also had a similar issue, but it is mostly related to SubDagOperator with more than 3000 task instances in total (30 tasks * 44 subdag tasks).
What I found is that the Airflow scheduler is mainly responsible for putting your scheduled tasks into the "Queued Slots" (pool), while the Airflow Celery workers are the ones that pick up your queued tasks, move them into the "Used Slots" (pool), and run them.
Based on your description, your scheduler should work fine. I suggest you check your Celery workers' logs for errors, or restart the workers to see whether that helps. I have seen Celery workers go on strike for a few minutes and then start working again (especially with SubDagOperator).
One very silly reason could be that the DAG is "paused", which is the default state the first time it is loaded. I lost around 2 hours fighting it. If you are using the Airflow web interface, this shows up as a toggle next to your DAG in the list.
I faced this issue today and found that bullet point 4 from tobi6's answer resolved it:
*'Do all the DAGs you want to run have a start date which is in the past?'*
I am using airflow version v1.10.3
My problem was one step further, in addition to my tasks being queued, I couldn't see any of my celery workers on the Flower UI. The solution was that, since I was running my celery worker as root I had to make changes in my ~/.bashrc file.
The following steps made it work:
Add export C_FORCE_ROOT=true to your ~/.bashrc file
source ~/.bashrc
Run worker : nohup airflow worker $* >> ~/airflow/logs/worker.logs &
Check your Flower UI at http://{HOST}:5555
I think it's worth mentioning that there's an open issue that can cause tasks to fail to run with no obvious reason: https://issues.apache.org/jira/browse/AIRFLOW-5506
The problem seems to occur when using LocalScheduler connected to a PostgreSQL airflow db, and results in the scheduler logging a number of "Killing PID xxxx" lines. Check the scheduler logs after the DAGs have been stalled without starting any new tasks for a while.
You can try to stop the webserver and the scheduler:
ps -ef | grep airflow #show the process id
kill 1234 #kill the webserver
kill 5678 #kill the scheduler
Remove the files from the airflow folder if they exist (they will be created again):
airflow-scheduler.err
airflow-scheduler.pid
airflow-webserver.err
airflow-webserver.pid
Start the webserver and the scheduler again.
airflow webserver -D
airflow scheduler -D
-D will make the services run in the background.
I had a similar issue of a triggered DAG "running" indefinitely because its first task stuck in "queued" state.
I realized this was because of a "ghost" DAG whose name had changed. Since the DAG had run in the past (it had data in the Postgres DB) and was referenced as a child DAG in other DAGs, triggering the parent DAGs that still referenced the old name would "resurrect" the old DAG name, but with the new code. The old DAG name and the new DAG code did not match, which produced an "infinite queued execution" bug.
Solution:
Delete all previous DAG runs recorded under the old DAG name.
Restart everything (webserver, worker, executor, ...) OR delete the relevant DAGs (with the "delete DAG" button in the UI).
The interpretation of the bug can vary but this fix worked in my case.
One more thing to check is whether the concurrency limit of your DAG has been reached.
I experienced the same situation, with some tasks shown as NO STATUS.
It turned out that my File_Sensor tasks ran with a timeout of 1 week, while the DAG timeout was only 5 hours. As a result, when the files were missing, many sensor tasks were running at the same time, which exhausted the available concurrency.
The dependent tasks could not start before the sensor tasks succeeded, and when the DAG timed out they were left with NO STATUS.
My solution:
Carefully set tasks and DAG timeout
Increase dag_concurrency in airflow.cfg file in AIRFLOW_HOME folder.
Please refer to the docs.
https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
I believe this is an issue with celery version 4.2.1 and redis 3.0.1, as described here:
https://github.com/celery/celery/issues/3808
We resolved the issue by downgrading our redis version to 2.10.6:
redis==2.10.6
In my case, tasks were not being launched because I had for all operators a pool configured and hadn't created it, hence, tasks were not even scheduled. An operator looks like:
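from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.10 import path

foo = DummyOperator(
    task_id='foo',
    dag=dag,
    pool='capser',  # this pool must already exist under Admin > Pools
)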
To create a pool go to Admin > Pools > Create and set slots, for example, 128, which runs successfully for me. You can also configure by using the CLI.
counter intuitive UI message!
I spent days on this, so I want to elaborate on my specific issue(s).
Each DAG has a state, which is either 'pause' or 'not pause'.
The first confusion is: what is the default state on startup? The UI message attached seems to indicate that the state is 'not pause' and that clicking the toggle pauses it.
In reality, the default state is 'pause'. This state can be controlled by settings, environment variables, parameters and the UI. I have detailed them below.
The second confusion also comes from the UI. When we manually trigger a DAG that is in the pause state, the UI shows the DAG as running (green circle)! But the DAG is actually in the 'pause' state, and its tasks will not execute until it is un-paused.
If we read the task instance details, the message is:
Task is in the 'None' state which is not a valid state for execution. The task must be cleared in order to be run.
What is the 'None' state!? And clear which task?!
The actual problem is that the dag is in the pause state. On toggling the dag state the tasks would start to execute.
The pause state of the dag can be changed by
clicking the button on the UI.
setting your particular DAG to start unpaused, by adding the parameter below to its definition.
DAG(dag_id='your-dag', is_paused_upon_creation=False)
setting the config variable in the airflow.cfg file (caution: this will start all your DAGs, including the example ones).
dags_are_paused_at_creation = False
configuring an environment variable before starting up the scheduler/webserver (caution: this will start all your DAGs, including the example ones).
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
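The environment variable above follows Airflow's naming pattern for configuration overrides: AIRFLOW__ followed by the section name, a double underscore, and the key, all upper-case. A minimal sketch that builds such names (the helper name airflow_env_var is hypothetical):

```python
def airflow_env_var(section, key):
    """Name of the environment variable that overrides `key` in `[section]`
    of airflow.cfg, following the AIRFLOW__{SECTION}__{KEY} pattern."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


paused_var = airflow_env_var("core", "dags_are_paused_at_creation")
print(paused_var)
```

The same helper works for any other section and key in airflow.cfg.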
Make sure that your task is assigned to the same queue that your workers are listening to. This means that in your DAG file you have to set 'queue': 'queue_name', and in your worker configuration you have to set either default_queue = 'queue_name' in airflow.cfg or AIRFLOW__OPERATORS__DEFAULT_QUEUE: 'queue_name' in docker-compose.yaml (in case you're using Docker).
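A quick way to reason about this mismatch is to compare the queue of each task with the queues your workers listen to. The sketch below is plain Python with made-up task and queue names; it does not query Airflow or Celery.

```python
def unreachable_tasks(task_queues, worker_queues):
    """Return the ids of tasks whose queue no worker is listening to."""
    return sorted(
        task_id
        for task_id, queue in task_queues.items()
        if queue not in worker_queues
    )


# Hypothetical mapping of task id to the queue set in the DAG file.
task_queues = {"extract": "default", "transform": "heavy", "load": "default"}

# Hypothetical set of queues the running workers consume from.
worker_queues = {"default"}

stuck = unreachable_tasks(task_queues, worker_queues)
```

In this example the transform task would stay queued until a worker subscribes to the heavy queue or the task is moved to default.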
