I'm trying to figure out how to configure/schedule an Airflow DAG to run twice a day at the exact times, instead of running both at the same time once the criteria are met.
I want to run the same task at midnight and 9pm.
To do so I've set schedule_interval to a cron expression, 0 0,21 * * *, so it runs every day at midnight and 9pm. But today's (27th of April) run started at 00:00:00 for yesterday (26th of April), and both the 00:00:00 and 21:00:00 runs ran at the same time.
The expected behaviour would be to run today (27th of April) at 00:00:00 and, 21 hours later, run again at 21:00:00.
Any ideas?
In the end, the question is: how can I run a DAG twice a day?
Thank you.
Everything you have done is correct except the start date. Keep it running for a day. Once it has backfilled the previous days (from the start date until today), your DAG should start scheduling at the correct times.
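For reference, here is a minimal sketch of such a DAG definition, assuming Airflow 2.x; the dag_id, task and start_date are placeholders, and the cron expression is the one from the question:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="twice_daily_example",          # placeholder name
    schedule_interval="0 0,21 * * *",      # midnight and 21:00 every day
    start_date=datetime(2021, 4, 25),      # a fixed date in the past, not a moving "now"
    catchup=True,                          # backfill start_date..today first, then run on schedule
) as dag:
    run_task = BashOperator(
        task_id="run_task",
        bash_command="echo 'run for {{ ds }}'",
    )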
I'm just learning Apache Airflow. I understand that the execution date is not the same time as the actual time a DAG run is triggered.
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Yeah, for a daily job, cron jobs run at the start of the day; Airflow jobs run at the end of the day.
I humbly ask: is there any way to set the execution date to be the same as the trigger time?
You generally structure your tasks such that you'll provide a date to the job via kwargs (for idempotency, etc).
Airflow provides macros (https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html) that expose both the data_interval_start and the data_interval_end.
I believe you're looking for data_interval_end, which lines up with the date the job actually runs.
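As an illustration (not part of the original answer), here is a sketch of passing data_interval_end into a task through templating, assuming Airflow 2.2+ where that variable exists; the DAG and task names are made up:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def process(logical_end: str) -> None:
    # Receives the rendered template string, which is close to the wall-clock trigger time.
    print(f"processing data up to {logical_end}")

with DAG(
    dag_id="data_interval_example",        # placeholder name
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(
        task_id="process",
        python_callable=process,
        op_kwargs={"logical_end": "{{ data_interval_end }}"},  # op_kwargs is a templated field
    )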
I set the schedule to '* 1,5,10,18 * * *' in Airflow.
But yesterday's 18:00 run wasn't executed, so I checked the logs.
Then I found that the job scheduled for 10:00 executed at 18:00.
I want to know why, and how I can fix it.
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
~Airflow scheduler docs
So, as you can see, the run will be scheduled after the 10:00->18:00 period has closed, i.e. after 18:00. Check the task before it; it should have run just after 10:00.
You don't understand how the Airflow scheduler works.
Airflow always stamps a DAG Run (data_interval_start) with the date of the previous interval, so the task that ran on your system at 18:00 has a DAG Run of 10:00. Likewise, the next task, triggered at 10:00, will have a DAG Run stamped 18:00 of the previous day.
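To see this behaviour for yourself, here is a small sketch (assuming Airflow 2.2+; the dag_id is made up) that prints the data interval next to the wall-clock time at which the run is actually triggered:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def show_interval(data_interval_start=None, data_interval_end=None, **_):
    # The run stamped with this interval only starts once the interval has ended,
    # so the trigger time is roughly data_interval_end, not data_interval_start.
    print(f"interval: {data_interval_start} -> {data_interval_end}")
    print(f"actual trigger time: {datetime.utcnow()}")

with DAG(
    dag_id="interval_demo",                  # placeholder name
    schedule_interval="* 1,5,10,18 * * *",   # the schedule from the question
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="show_interval", python_callable=show_interval)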
I am trying to define periods of time during which the scheduler will not mark tasks as ready.
For example, I have a DAG with a lengthy backfill period. The backfill will run throughout the day until the DAG is caught up. However, I do not want any executions of the DAG to run between midnight and 2 am.
Is it possible to achieve this through configuration?
After pausing a DAG for 2-3 days, resuming it with catchup=False will immediately run the latest execution.
For example, a DAG that sends data to an external system is scheduled to run every day at 19:00.
Stopping the DAG for 4 days and re-enabling it at 11:00 would run the DAG immediately with yesterday's execution, and then again at 19:00 that day.
In this case the DAG runs twice on the day it's resumed.
Is it possible to resume the DAG so that the first run actually happens at 19:00?
With default operators, we cannot achieve exactly what you are expecting. The closest thing Airflow has is the LatestOnlyOperator. It is one of the simplest operators and needs only the following configuration:
from airflow.operators.latest_only import LatestOnlyOperator

latest_only = LatestOnlyOperator(task_id='latest_only')
This would let the downstream tasks run only if the current time falls between the current execution date and the next execution date. So, in your case, it would skip the three days of missed runs, but yesterday's run would still trigger the jobs.
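As a sketch of how the operator is usually wired in (assuming Airflow 2.x; the DAG id and the downstream task are placeholders, and the 19:00 schedule comes from the question):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.latest_only import LatestOnlyOperator

with DAG(
    dag_id="send_data_example",          # placeholder name
    schedule_interval="0 19 * * *",      # every day at 19:00, as in the question
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    latest_only = LatestOnlyOperator(task_id="latest_only")

    # Placeholder for the task that sends data to the external system.
    send_data = BashOperator(task_id="send_data", bash_command="echo send")

    # send_data is skipped unless this run is the latest scheduled run.
    latest_only >> send_data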
I am trying to run an Airflow DAG at 04:30 UTC. I want to run it on all days except Tuesday. I have set the schedule_interval in the following way:
schedule_interval='30 04 * * 0,1,3,4,5,6'
However, I see that the DAG is not running at the anticipated times. If I run it manually, it runs fine, so there is no error anywhere else. Could you please point out the mistake in the schedule? Is this not the right approach to exclude a particular day from the schedule?
Thanks for the help.
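For context, a minimal sketch of how such a schedule is typically attached to a DAG (assuming Airflow 2.x; the dag_id, task and start_date are placeholders, and the cron expression is the one from the question):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_except_tuesday",       # placeholder name
    # 04:30 on every day-of-week except 2 (Tuesday)
    schedule_interval="30 04 * * 0,1,3,4,5,6",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    BashOperator(task_id="run", bash_command="echo run")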