I know there are already a couple of questions open on this topic but I can't find a satisfying answer to my situation.
Why is Airflow not correctly backfilling for all the dates?
I think that I've correctly set the default argument:
import datetime

default_args = {
    'depends_on_past': False,
    'start_date': datetime.datetime(2022, 1, 1),
    'catchup': True,
    'schedule_interval': "0 0 * * *",
    'retries': 0
}
I've run some manual DAG triggers, and some parts have been backfilled but not others, and I don't understand why...
My understanding is that since I've set catchup to True and defined a start_date, the DAG should be executed for all the 'missing' dates between the start date and the current date...
#edit1
In addition, I can see that when I'm clearing a previously executed task, the backfilling works. Why is the backfilling not working for a date when the DAG has not been executed...
When you set catchup to True, the Airflow scheduler checks after each run whether the next run is missing and creates it, but it doesn't check the previous runs and the history, because that would hurt performance when you have a lot of runs in the metastore.
If you want to backfill the previous runs you have two solutions:
clear a previous run (the last task of the run) of a missing period
use the CLI and execute the backfill command, which checks whether there are missing runs between two dates and runs them (the terminal session should stay alive until the end of the backfill, so try to use screen); see the example below
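For example, a sketch of the backfill command (Airflow 2 syntax; the DAG id and the date range are placeholders):
airflow dags backfill my_dag --start-date 2022-01-01 --end-date 2022-01-31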
Related
It seems there has been previous discussion about this.
How do i stop airflow running a task the first time when i unpause it?
https://groups.google.com/g/cloud-composer-discuss/c/JGtmAd7xcsM?pli=1
When I deploy a dag to run at a specific time (say, once a day at 9AM), Airflow immediately runs the dag at deployment.
dag = DAG(
    'My Dag',
    default_args=default_args,
    schedule_interval='00 09 * * *',
    start_date=datetime(2021, 1, 1),
    catchup=False  # don't run previous and backfill; run only latest
)
That's because with catchup=False, scheduler "creates a DAG run only for the latest interval", as indicated in the doc.
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html
What I want to achieve is that not even a DAG run for the latest interval starts. I want nothing to happen until the next time the clock strikes 9AM.
It seems like out of the box, Airflow does not have any native solution to this problem.
What are some workarounds that people have been using? Perhaps something like checking that the current time is close to next_execution_date?
When you update your DAG you can set start_date to the next day.
However, it won't work if you pause/unpause the DAG.
Note that it's recommended to be a static value (avoid using datetime.now() or similar dynamic values), so for every deployment you need to specify a new value like datetime(2021, 10, 15), datetime(2021, 10, 16), ..., which might make deployment more difficult.
with the DAG paused: create a DAG run at http://.../dagrun/add with Execution Date set to the one that needs to be skipped. This makes its task instances accessible in the UI
mark those task instances as success in the UI
unpause the DAG
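The same workaround can be sketched with the Airflow 2 CLI; my_dag and my_task are hypothetical names, and --mark-success marks the task as succeeded without actually running it:
airflow dags pause my_dag
airflow dags trigger my_dag --exec-date 2021-10-15T09:00:00
airflow tasks run my_dag my_task 2021-10-15T09:00:00 --mark-success
airflow dags unpause my_dag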
I have an Airflow DAG that previously did not have a schedule. It is turned on in the Airflow interface and I was running it manually. Just now, I updated the schedule as follows:
[in the default arguments object that is fed into the DAG]
'catchup': True
'start_date': datetime.datetime(2020, 8, 1)
[in my DAG object instantiation]
schedule_interval='0 0 17 * *'
Right now it is August 18, 2020 in UTC. I expected that this would cause the DAG to immediately run once the code changes were added, but so far it isn't running.
I have said that the schedule starts on August 1 2020, and the schedule interval means "every month on the 17th," so by that definition it has missed a run at midnight on August 17th. Why isn't it catching up that past run, since catchup is set to True? When will it first run?
I know that there is some controversy surrounding the confusing behavior of schedule_interval, because the first run happens after the first interval following start_date. However, even those discussions, which I have read, deal with the case when schedule_interval is an actual interval like @daily, and when somebody has placed the first run in the future. I cannot find any documentation for what should happen with catchup when the new schedule starts in the past and/or when schedule_interval is a cron expression.
You want to set your start_date to be one "interval" of your schedule_interval behind the current date/time; where "interval" is the amount of time between subsequent executions.
The easiest way to do this is to just set start_date to the date it would have run prior to the date you want it to run, if it were already installed and running. In this instance it would be datetime.datetime(2020, 7, 17).
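A minimal sketch of that fix, assuming the schedule from the question (the dag id is a placeholder):
import datetime
from airflow import DAG

dag = DAG(
    'my_monthly_dag',  # placeholder id
    schedule_interval='0 0 17 * *',  # midnight on the 17th of every month
    start_date=datetime.datetime(2020, 7, 17),  # one interval behind the desired first run
    catchup=True,
)
With this start_date, the July 17 - August 17 interval closes at midnight on August 17, so the scheduler creates (and catches up) that run.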
New to Airflow, coming from cron, and trying to understand how the execution_date macro gets applied by the scheduling system and when manually triggered. I've read the FAQ and set up a schedule that I expected would execute with the correct execution_date macro filled in.
I would like to run my DAG weekly, on Thursday at 10am UTC. Occasionally I would run it manually. My understanding was that the DAG's start date should be one period behind the actual date I want the DAG to start. So, in order to execute the DAG today, on 4/9/2020, with a 4/9/2020 execution_date, I set up the following defaults:
default_args = {
    'owner': 'airflow',
    'start_date': dt.datetime(2020, 4, 2),
    'concurrency': 4,
    'retries': 0
}
And the dag is defined as:
with DAG('my_dag',
         catchup=False,
         default_args=default_args,
         schedule_interval='0 10 * * 4',
         max_active_runs=1,
         concurrency=4,
         ) as dag:
    opr_exc = BashOperator(
        task_id='execute_dag',
        bash_command='/path/to/script.sh --dt {{ ds_nodash }}')
While the DAG executed on time today, 4/9, it executed with a ds_nodash of 20200402 instead of 20200409. I guess I'm still confused, since catchup was turned off and the start date was one week prior, so I was expecting 20200409.
Now, I found another answer here that basically explains that execution_date is at the start of the period, and always one period behind. So going forward, should I be using next_ds_nodash? Wouldn't this create a problem for manually triggered DAGs, since execution_date works as expected when run on-demand? Or does next_ds_nodash translate to ds_nodash when manually triggered?
Question: Is there a happy medium that allows me to correctly get the execution_date macro passed over to my weekly run dag when running scheduled AND when manually triggered? What's best practice here?
After a bit more research and testing, it does indeed appear that next_ds_nodash becomes equivalent to ds_nodash when manually triggering the DAG.
Thus, if you are in a similar situation, do the following to correctly schedule your weekly job (with optional manual triggers):
Set the start_date one week prior to the date you actually want to start
Configure the schedule_interval accordingly for when you want to run the job
Use the next-execution-date macros (next_ds_nodash here) wherever you expect to get the current execution date when the job runs; see the sketch below.
This works for me, but I don't have to deal with any catchup/backfill options, so YMMV.
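Applied to the BashOperator from the question, a minimal sketch of that last point:
from airflow.operators.bash_operator import BashOperator

# {{ next_ds_nodash }} renders as the end of the schedule interval (the date
# the job actually runs on), and per the note above it becomes equivalent to
# ds_nodash on manual triggers.
opr_exc = BashOperator(
    task_id='execute_dag',
    bash_command='/path/to/script.sh --dt {{ next_ds_nodash }}')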
I'm new to Airflow and I'm trying to understand how to use the scheduler correctly. Basically I want to schedule tasks the same way as I use cron: there's a task that needs to run every 5 minutes, and I want the first DAG run to start at the next even 5-minute slot after I add the DAG file to the dags directory or after I have made some changes to the DAG file.
I know that the DAG is run at the end of the schedule_interval. If I add a new DAG and use start_date=days_ago(0), then I get unnecessary runs starting from the beginning of the day. It also feels clumsy to hardcode some specific start date in the DAG file, i.e. start_date=datetime(2019, 9, 4, 10, 1, 0, 818988). Is my approach wrong, or is there some specific reason why the start_date needs to be set?
I think I found an answer to my own question from the official documentation: https://airflow.apache.org/scheduler.html#backfill-and-catchup
By turning off catchup, a DAG run is created only for the most recent interval. So I can set the start_date to anything in the past and define the DAG like this:
dag = DAG('good-dag', catchup=False, default_args=default_args, schedule_interval='*/5 * * * *')
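A slightly fuller, self-contained sketch of the same idea (the default_args content is an assumption, since it isn't shown above):
from datetime import datetime
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 1, 1),  # any static date in the past
}

# catchup=False means only the most recent */5 interval gets a run,
# so the old start_date doesn't trigger a flood of backfill runs.
dag = DAG('good-dag', catchup=False, default_args=default_args,
          schedule_interval='*/5 * * * *')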
When I schedule DAGs to run at a specific time everyday, the DAG execution does not take place at all.
However, when I restart Airflow webserver and scheduler, the DAGs execute once on the scheduled time for that particular day and do not execute from the next day onwards.
I am using Airflow version v1.7.1.3 with python 2.7.6.
Here goes the DAG code:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
import time
n = time.strftime("%Y,%m,%d")
v = datetime.strptime(n, "%Y,%m,%d")

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': v,
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=10),
}
dag = DAG('dag_user_answer_attempts', default_args=default_args, schedule_interval='03 02 * * *')
# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='user_answer_attempts',
    bash_command='python /home/ubuntu/bigcrons/appengine-flask-skeleton-master/useranswerattemptsgen.py',
    dag=dag)
Am I doing something wrong?
Your issue is the start_date being set to the current time. Airflow runs jobs at the end of an interval, not the beginning. This means that the first run of your job is going to be after the first interval.
Example:
You make a dag and put it live in Airflow at midnight. Today (20XX-01-01 00:00:00) is also the start_date, but it is hard-coded ("start_date":datetime(20XX,1,1)). The schedule interval is daily, like yours (3 2 * * *).
The first time this dag will be queued for execution is 20XX-01-02 02:03:00, because that is when the interval period ends. If you look at your dag being run at that time, it should have a started datetime roughly one day after its execution date.
You can solve this by hard-coding your start_date to a fixed date, or by making sure that the dynamic date is further in the past than the interval between executions (in your case, 2 days would be plenty). Airflow recommends static start_dates in case you need to re-run jobs or backfill (or end a dag).
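A minimal illustration of both options, assuming the daily 02:03 schedule from the question:
from datetime import datetime, timedelta

# Recommended: hard-coded start_date; the first interval then ends at a
# predictable time and the first run is queued when it does.
default_args = {'start_date': datetime(2016, 11, 1)}

# Dynamic alternative: keep it further in the past than one interval,
# e.g. two days for a daily schedule.
# default_args = {'start_date': datetime.now() - timedelta(days=2)}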
For more information on backfilling (the opposite side of this common stackoverflow question), check the docs or this question:
Airflow not scheduling Correctly Python
Check the following:
start_date is a fixed time in the past (don't use datetime.now())
if you don't want to run the historical data, use catchup=False
to set a specific time for the DAG to run (e.g. hourly, monthly, daily, or at a specific time), try using
https://crontab.guru/#40_21_*_*_* to write what you need.
If you think you have steps 1, 2, 3 all correct but the DAG is still not running, or the DAG can run every xx minutes but fails to trigger even once on a daily interval, try creating a new Python file, copy your DAG code there, and rename it so that the file is unique, then test again. It could be that the Airflow scheduler got confused by an inconsistency between previous DAG runs' metadata and the current schedule.
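Putting points 1-3 together, a minimal sketch (the dag id is a placeholder, and the 21:40 cron matches the crontab.guru link above):
from datetime import datetime
from airflow import DAG

dag = DAG('example_dag',                    # placeholder id
          start_date=datetime(2021, 1, 1),  # fixed time in the past
          schedule_interval='40 21 * * *',  # every day at 21:40
          catchup=False)                    # skip historical runs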
Hope this helped!
From the schedule, your DAG should run every day at 02:03 AM. My suspicion is that the start_date might be impacting it. Can you hardcode it to something like 'start_date': datetime.datetime(2016, 11, 1) and try?
Great answer, apathyman.
It helped me a lot to understand. I was using days_ago(0), and once I changed it to days_ago(1), the scheduler started triggering.