When I schedule DAGs to run at a specific time every day, the DAG execution does not take place at all.
However, when I restart the Airflow webserver and scheduler, the DAGs execute once at the scheduled time for that particular day and then do not execute from the next day onwards.
I am using Airflow v1.7.1.3 with Python 2.7.6.
Here is the DAG code:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
import time

n = time.strftime("%Y,%m,%d")
v = datetime.strptime(n, "%Y,%m,%d")

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': v,
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=10),
}

dag = DAG('dag_user_answer_attempts', default_args=default_args, schedule_interval='03 02 * * *')

# t1 is an example of a task created by instantiating an operator
t1 = BashOperator(
    task_id='user_answer_attempts',
    bash_command='python /home/ubuntu/bigcrons/appengine-flask-skeleton-master/useranswerattemptsgen.py',
    dag=dag)
Am I doing something wrong?
Your issue is the start_date being set to the current time. Airflow runs jobs at the end of an interval, not the beginning, so the first run of your job will come one full interval after your start_date.
Example:
You make a dag and put it live in Airflow at midnight. Today (20XX-01-01 00:00:00) is also the start_date, but it is hard-coded ('start_date': datetime(20XX, 1, 1)). The schedule interval is daily, like yours (3 2 * * *).
The first time this dag will be queued for execution is 20XX-01-02 02:03:00, because that is when the interval period ends. If you look at your dag being run at that time, it should have a start datetime of roughly one day after the schedule date.
You can solve this by hard-coding your start_date, or by making sure that the dynamic date is further in the past than the interval between executions (in your case, 2 days would be plenty). Airflow recommends static start_dates in case you need to re-run jobs or backfill (or end a dag).
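For example, replacing the dynamic date with a static one would look like this (the date below is illustrative; any date at least one interval in the past works):

from datetime import datetime

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2016, 11, 1),  # static date safely in the past
    # ... remaining args unchanged
}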
For more information on backfilling (the flip side of this common Stack Overflow question), check the docs or this question:
Airflow not scheduling Correctly Python
Check the following:
1. start_date is a fixed time in the past (don't use datetime.now()).
2. If you don't want to run the historical data, use catchup=False.
3. To set a specific time for the DAG to run (e.g. hourly, daily, monthly, or at a specific time), use https://crontab.guru/#40_21_*_*_* to write the cron expression you need.
If you think you have steps 1-3 all correct but the DAG is still not running, or the DAG can run every xx minutes but fails to trigger even once on a daily interval, try creating a new Python file, copying your DAG code there, and renaming it so that the file name is unique; then test again. It could be that the Airflow scheduler got confused by an inconsistency between previous DAG runs' metadata and the current schedule.
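A minimal sketch combining points 1-3 (dag id, date, and cron expression are illustrative):

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

dag = DAG(
    'my_daily_dag',
    start_date=datetime(2021, 1, 1),  # 1: fixed date in the past, not datetime.now()
    schedule_interval='40 21 * * *',  # 3: 21:40 every day, per crontab.guru
    catchup=False,                    # 2: skip the historical runs
)

t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)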
Hope this helped!
From the schedule, your DAG should run every day at 02:03 AM. My suspicion is that the start_date might be impacting it. Can you hardcode it to something like 'start_date': datetime(2016, 11, 1) and try?
Great answer, apathyman. It helped me a lot to understand. I was using days_ago(0), and once I changed it to days_ago(1), the scheduler started triggering.
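For reference, that change in code (days_ago ships with Airflow in airflow.utils.dates):

from airflow.utils.dates import days_ago

default_args = {
    'start_date': days_ago(1),  # one full interval in the past; days_ago(0) left nothing to schedule
}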
Related
It seems there has been previous discussion about this.
How do i stop airflow running a task the first time when i unpause it?
https://groups.google.com/g/cloud-composer-discuss/c/JGtmAd7xcsM?pli=1
When I deploy a dag to run at a specific time (say, once a day at 9AM), Airflow immediately runs the dag at deployment.
dag = DAG(
    'My Dag',
    default_args=default_args,
    schedule_interval='00 09 * * *',
    start_date=datetime(2021, 1, 1),
    catchup=False  # don't run previous and backfill; run only the latest
)
That's because with catchup=False, the scheduler "creates a DAG run only for the latest interval", as indicated in the docs:
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html
What I want to achieve is that I don't want even a DAG run for the latest interval to start. I want nothing to happen until the next time the clock strikes 9AM.
It seems that, out of the box, Airflow does not have any native solution to this problem.
What are some workarounds that people have been using? Perhaps something like checking that the current time is close to next_execution_date?
When you update your dag, you can set start_date to the next day.
However, it won't work if you pause/unpause the dag.
Note that it's recommended to be a static value (avoid datetime.now() or similar dynamic values), so for every deployment you need to specify a new value like datetime(2021, 10, 15), datetime(2021, 10, 16), ..., which might make deployment more difficult.
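In code, that means bumping the constant on every deployment, e.g. (dates illustrative):

from datetime import datetime

default_args = {
    'start_date': datetime(2021, 10, 15),  # day after this deployment; bump again on the next deploy
}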
1. With the dag paused, create a dag run via http://.../dagrun/add with Execution Date set to the one that needs to be skipped. This makes the task instances accessible in the UI.
2. Mark those task instances as success in the UI.
3. Unpause the dag.
Unfortunately, even after reading the many questions here and the FAQ page of the Airflow website, I still don't understand how Airflow schedules tasks. I have a very simple example task here:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    "depends_on_past": False,
    "start_date": datetime(2020, 5, 29),
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "example_dag_one",
    schedule_interval="30 8 * * *",
    catchup=False,
    default_args=default_args,
)

with dag:
    t1 = BashOperator(task_id="print_hello", bash_command="echo hello", dag=dag)
    t1
My naive view would be that this task would run on May 29th at 08:30. But as time passes, Airflow has not scheduled the task. If I change the cron expression to something like '* 8 * * *', it will schedule a task every minute (of the 8 o'clock hour).
When I instead use the same DAG with a start date of yesterday (May 28th in this case), the task is scheduled at 08:30, yet its execution date is the 28th (even though it ran on May 29th) and the start date in the web UI is May 29th. This is VERY confusing.
What I want from Airflow in the end is simple: "Here is Python code, run it at this time each day." So how can I achieve that? Again, let's say I want to schedule a task at 08:30 every day starting tomorrow.
The answer can be found in the official Airflow documentation:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's Repeat That: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
So, applying this to your case: if you set the start date to the 29th of May with the original cron, it will run every day at 08:30 starting tomorrow, the 30th of May.
Anyway, if you don't need the dag to run at a specific point in the day, you can just set the schedule interval to '@daily', and it will be triggered at the beginning (00:00) of each day. If there are a lot of dags with @daily, don't worry: the scheduler and the workers will know how to handle executing all of them. If you have dags that depend on other dags, there are mechanisms to chain them so that you still don't have to worry about specifying hours.
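For example, the @daily variant of the DAG above would look roughly like this (a sketch; dag id illustrative):

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

dag = DAG(
    "example_dag_daily",
    schedule_interval="@daily",  # triggered at 00:00, covering the previous day
    start_date=datetime(2020, 5, 28),
    catchup=False,
)

with dag:
    t1 = BashOperator(task_id="print_hello", bash_command="echo hello")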
Actually, Airflow waits for the entire schedule interval (1 day) to complete; only then does the execution start.
So if you want your task to be executed today, 2020-05-29, you should set the start time so that a full schedule interval has already finished, i.e. set the start time to datetime(2020, 5, 28).
If the schedule interval is 1 week, the task would be launched 1 week after the start time, and so on.
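A sketch of that fix applied to the default_args from the question, with the resulting timeline spelled out:

from datetime import datetime

default_args = {
    "depends_on_past": False,
    "start_date": datetime(2020, 5, 28),  # one interval before the first desired run
}
# First covered interval: 2020-05-28 08:30 -> 2020-05-29 08:30, so the first
# run starts on 2020-05-29 at 08:30 with execution_date 2020-05-28.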
New to Airflow, coming from cron, and trying to understand how the execution_date macro gets applied by the scheduling system and when manually triggered. I've read the FAQ and set up a schedule that I expected would execute with the correct execution_date macro filled in.
I would like to run my dag weekly, on Thursday at 10am UTC. Occasionally I would run it manually. My understanding was that the dag's start date should be one period behind the actual date I want the dag to start. So, in order to execute the dag today, on 4/9/2020, with a 4/9/2020 execution_date, I set up the following defaults:
default_args = {
    'owner': 'airflow',
    'start_date': dt.datetime(2020, 4, 2),
    'concurrency': 4,
    'retries': 0
}
And the dag is defined as:
with DAG('my_dag',
         catchup=False,
         default_args=default_args,
         schedule_interval='0 10 * * 4',
         max_active_runs=1,
         concurrency=4,
         ) as dag:
    opr_exc = BashOperator(task_id='execute_dag',
                           bash_command='/path/to/script.sh --dt {{ ds_nodash }}')
While the dag executed on time today, 4/9, it executed with a ds_nodash of 20200402 instead of 20200409. I guess I'm still confused, since catchup was turned off and the start date was one week prior, so I was expecting 20200409.
Now, I found another answer here that basically explains that execution_date is the start of the period, and thus always one period behind. So going forward, should I be using next_ds_nodash? Wouldn't this create a problem for manually triggered dags, since execution_date works as expected when run on demand? Or does next_ds_nodash translate to ds_nodash when manually triggered?
Question: is there a happy medium that allows me to correctly get the execution_date macro passed over to my weekly dag when running scheduled AND when manually triggered? What's best practice here?
After a bit more research and testing, it does indeed appear that next_ds_nodash becomes equivalent to ds_nodash when manually triggering the dag.
Thus, if you are in a similar situation, do the following to correctly schedule your weekly job (with optional manual triggers):
1. Set the start_date one week prior to the date you actually want to start.
2. Configure the schedule_interval accordingly for when you want to run the job.
3. Use the next-execution-date macros wherever you expect to get the current execution date for the run, as sketched below.
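Applied to the BashOperator from the question, only the macro changes (same with-DAG block as above):

from airflow.operators.bash_operator import BashOperator

opr_exc = BashOperator(
    task_id='execute_dag',
    bash_command='/path/to/script.sh --dt {{ next_ds_nodash }}')  # next_ds_nodash == ds_nodash on manual triggers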
This works for me, but I don't have to deal with any catchup/backfill options, so YMMV.
I want one dag to start after the completion of another dag. One solution is using an external task sensor; below you can find my solution. The problem I encounter is that the dependent dag is stuck at poking; I checked this answer and made sure that both of the dags run on the same schedule. My simplified code is as follows:
Any help would be appreciated.
leader dag:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

schedule = '* * * * *'

dag = DAG('leader_dag', default_args=default_args, catchup=False,
          schedule_interval=schedule)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
the dependent dag:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.operators.sensors import ExternalTaskSensor

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 10, 8),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

schedule = '* * * * *'

dag = DAG('dependent_dag', default_args=default_args, catchup=False,
          schedule_interval=schedule)

wait_for_task = ExternalTaskSensor(task_id='wait_for_task',
                                   external_dag_id='leader_dag',
                                   external_task_id='t1',
                                   dag=dag)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t1.set_upstream(wait_for_task)
the log for leader_dag:
the log for dependent dag:
First, the task in leader_dag is named print_date, but you set up your dependent_dag with a wait_for_task sensor that waits on a leader_dag task named t1. There is no task named t1. The Python variable you assigned the operator to in the .py file is not relevant; it is not used in the Airflow db, nor, transversely, by the sensor. The sensor should wait on the task name print_date.
Second, your logs do not line up: the leader_dag run you show is not the one the dependent_dag is waiting for.
Finally, I can't recommend using Airflow to schedule tasks every minute, and certainly not two dependent dags together.
Consider writing streaming jobs in a different system like Spark, or rolling your own Celery or Dask environment for this.
You could also avoid the ExternalTaskSensor entirely by adding a TriggerDagRunOperator to the end of your leader_dag to trigger the dependent_dag, and removing the schedule from the latter by setting its schedule_interval to None.
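A sketch of that alternative, assuming an Airflow version (1.10+) where TriggerDagRunOperator no longer requires a python_callable (in Airflow 2 the import path is airflow.operators.trigger_dagrun):

from airflow.operators.dagrun_operator import TriggerDagRunOperator

trigger = TriggerDagRunOperator(
    task_id='trigger_dependent_dag',
    trigger_dag_id='dependent_dag',  # must match the dependent DAG's dag_id
    dag=dag)

t1 >> trigger  # fire dependent_dag once print_date finishes

# ...and in dependent_dag: set schedule_interval=None and drop the ExternalTaskSensor.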
What I see in your logs is a log for the leader from 2018-10-13T19:08:11. This, at best, would be the dagrun for execution_date 2018-10-13 19:07:00, because the minute period starting at 19:07 ends at 19:08, which is the earliest it can be scheduled. If that's the case, I see a delay of about 11 seconds between scheduling and execution. However, there can be multiple minutes of scheduling lag in Airflow.
I also see a log from the dependent_dag which runs from 19:14:04 to 19:14:34 and is looking for the completion of the corresponding 19:13:00 dagrun. There's no indication that your scheduler is lag-free enough to have started the 19:13:00 dagrun of leader_dag by 19:14:34. You could have better convinced me if you had shown it poking for 5 minutes or so. Of course, it's never going to sense leader_dag.t1, because that isn't what you named the tasks shown.
So, Airflow has scheduling delay. If you had a few thousand dags in the system, it might be higher than 1 minute, such that with catchup=False you're going to get some runs following each other (e.g. 19:08, 19:09) and some runs that skip a minute (or 6), like 19:10 followed by 19:16. Since the delay is a bit random on a dag-by-dag basis, you might get unaligned runs with the sensor waiting forever, EVEN if you have the correct task id to wait for:
wait_for_task = ExternalTaskSensor(
    task_id='wait_for_task',
    external_dag_id='leader_dag',
-   external_task_id='t1',
+   external_task_id='print_date',
    dag=dag)
While using ExternalTaskSensor, you have to give both DAGs the same start date. If that does not work for your use case, then you need to use execution_delta or execution_date_fn in your ExternalTaskSensor.
A simple solution to this is to do the following:
1- Make sure all the variables you set in the ExternalTaskSensor are set right, specifically the task_id and dag_id of the dag you want your master_dag to keep an eye on.
2- Make both the master_dag and the slave_dag (the dag you want to wait for) have the same start_date; otherwise it won't work. If your slave starts at 22:00 and your master starts at 22:30, you have a 30-minute difference that you should specify with execution_delta, as sketched below.
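For the 30-minute case in point 2, the sensor would look roughly like this (ids illustrative; execution_delta is the offset between the two dags' execution dates):

from datetime import timedelta
from airflow.operators.sensors import ExternalTaskSensor

wait_for_task = ExternalTaskSensor(
    task_id='wait_for_task',
    external_dag_id='slave_dag',
    external_task_id='final_task',
    execution_delta=timedelta(minutes=30),  # master runs 30 minutes after the slave
    dag=dag)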
If your error can't be solved by the above, the problem is either more basic or your dag is simply programmed wrong.
I use Airflow 1.8.0 and I have a DAG like this one:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['technical@me.com'],
    'start_date': datetime.datetime(2018, 5, 21),
    'email_on_retry': False,
    'retries': 0
}

dag = DAG('my_dag',
          schedule_interval='40 20 * * *',
          catchup=True,
          default_args=default_args)
Every day the dag is correctly scheduled, but it runs a day late.
Given today's date is 2018-07-02, the web interface shows an execution date one day earlier instead of 2018-07-01.
But if I do a manual trigger, the current date is correctly passed.
Is there a way to force the scheduler to run with the current date?
This is correct and is part of the design of Airflow. If you look here, you'll see the explanation:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's Repeat That: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Also remember that the schedule_interval is in CRON format: minute, hour, day (of month), month, day (of week). Make sure you haven't transposed the first two fields: '20 40 * * *' would be invalid, since the scheduler cannot run at the 40th hour. To run at the 40th minute of the 20th hour every day, it must be schedule_interval='40 20 * * *', as in your DAG above.
Also, set catchup=False if you only want it to run for the most recent day. With both of these fixes, it should work. Refer to this website for more CRON help.
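Putting both fixes together, a sketch based on the DAG from the question:

dag = DAG('my_dag',
          schedule_interval='40 20 * * *',  # minute 40, hour 20 -> 20:40 every day
          catchup=False,                    # only schedule the most recent interval
          default_args=default_args)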