I am trying to add SLAs for DAGs/individual tasks that have a schedule frequency of @once. Essentially, I have a template DAG (schedule frequency @once) which gets triggered by another DAG on a weekly schedule. Adding 'sla': timedelta(seconds=12) to these template DAGs has no effect. I can add it to all the other DAGs and it works okay. Is there a way to add SLAs to DAGs with a schedule frequency of @once?
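For reference, here is a minimal sketch of the setup described above (Airflow 1.x-style; the DAG and task names are placeholders, not from the original post):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# 'sla' is set in default_args so it applies to every task in the template DAG.
default_args = {
    'start_date': datetime(2020, 1, 1),
    'sla': timedelta(seconds=12),
}

# schedule_interval='@once': the DAG only runs when triggered externally,
# which is the case where the SLA appears to have no effect.
with DAG('template_dag', schedule_interval='@once', default_args=default_args) as dag:
    DummyOperator(task_id='some_task')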
I am trying to set an airflow dag that runs on second day of each month. Based on my research, the schedule interval should be set to:
schedule_interval = '0 0 2 * *'
Now what worries me is the following note in the Airflow documentation:
Note that if you run a DAG on a schedule_interval of one day, the run
stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In
other words, the job instance is started once the period it covers has
ended.
Let’s Repeat That The scheduler runs your job one schedule_interval
AFTER the start date, at the END of the period.
Does this mean that for monthly DAGs, everything will be delayed by one month? So, for example, the 2020-11-02 job will run on 2020-12-01 23:59? If yes, how can I make sure that it runs exactly when it's supposed to?
Your interpretation is right: the DAG Run with execution_date=2020-11-02 will be triggered at approximately 2020-12-01 23:59. However, I don't think you need to worry that the DAG is delayed or anything; it still has a monthly schedule and will run every month. You rather need to take this logic into account when writing an operator.
You can also simply work with other variables if you don't want to adapt the logic, for whatever reason:
{{ next_execution_date }} - the next execution date.
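For illustration, a minimal sketch of templating next_execution_date into an operator on the monthly schedule from the question (run_report.sh and the DAG/task names are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id='monthly_report',
    start_date=datetime(2020, 1, 1),
    schedule_interval='0 0 2 * *',  # second day of each month
) as dag:
    # bash_command is templated, so {{ next_execution_date }} resolves to the
    # start of the following interval, i.e. roughly the date the run actually fires.
    run_report = BashOperator(
        task_id='run_report',
        bash_command='run_report.sh --date {{ next_execution_date.strftime("%Y-%m-%d") }}',
    )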
Is there any way to implement the "current plan" concept in Airflow?
Example: in Control-M, jobs are scheduled per date, e.g. ODATE: 20200801.
But in Airflow we don't see an option to schedule jobs per date/calendar.
Is there a good way to implement this?
I just ran an Airflow DAG. When I look at the Airflow Last Run date, it displays the run before the last one. What catches my attention is that when I hover over the "i" icon, it shows the correct date. Is there any way to solve this? It sounds like nonsense, but I end up using it for QA of my data.
This is probably because your airflow job has catchup=True enabled and a start_date in the past, so it is back-filling.
The Start Date is the real-time date of the last run, whereas the Last Run is the execution date of the airflow job. For example, if I am back-filling a time partitioned table with data from 2016-01-01 to present, the Start Date will be the current date but the Last Run date will be 2016-01-01.
Please include your DAG file/code in the future.
Edit: If you don't have catchup=True enabled, and the discrepancy is approximately one day (as in the picture you sent), then that is just due to the behaviour of the scheduler. From the docs: "The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period."
if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
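As an illustration of the back-filling behaviour (names and dates are placeholders), a DAG defined roughly like this will create one run per day from the start date until now, and the UI's Last Run will show the execution date of the interval most recently completed:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# With catchup=True and a start_date in the past, the scheduler creates one
# DAG Run per schedule_interval between start_date and the current date.
with DAG(
    dag_id='example_backfill',
    start_date=datetime(2016, 1, 1),
    schedule_interval='@daily',
    catchup=True,
) as dag:
    DummyOperator(task_id='load_partition')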
I have a DAG that needs to be triggered every Tuesday and Friday (for context, the purpose of the DAG is basically to ETL data that is published only twice a week, on Tuesday and Friday).
This DAG needs to catch up the past.
I use {{ execution_date }} in many operator parameters (as an API call parameter, in the storage name for keeping a copy of the raw data, ...).
The catchup works well; my issue is with the present.
Because of the schedule interval, every Friday it will ETL the data of the previous Tuesday (using execution_date as the API call parameter), and every Tuesday it will ETL the data of the previous Friday.
What I need is for the Tuesday run to get this Tuesday's data and not the previous Friday's.
I thought about using start_date instead of execution_date for the API call, but in that case the catchup will not work as expected.
I can't find a clean solution where catchup works well and present data is processed without the schedule-interval delay...
Any idea?
EDIT: Based on andscoop's answer:
The best solution is to use next_execution_date instead of execution_date.
Catchup will not prevent the most current DAG from running. It only determines whether or not previous un-run DAGs will run to "catchup".
There is no delay per se; what you are seeing is that the reported execution date only shows the last completed schedule interval.
You will want to look into Airflow macros to template the exact timestamp you need.
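A rough sketch of that approach, assuming a '0 0 * * 2,5' cron expression for Tuesday/Friday and a placeholder call_api.sh script (neither is from the original post):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id='twice_weekly_etl',
    start_date=datetime(2020, 1, 1),
    schedule_interval='0 0 * * 2,5',  # midnight every Tuesday (2) and Friday (5)
    catchup=True,                     # back-fill past Tuesdays/Fridays
) as dag:
    # next_execution_date is the date the run actually fires on, so the Tuesday
    # run fetches Tuesday's data instead of the previous Friday's.
    extract = BashOperator(
        task_id='extract',
        bash_command='call_api.sh --date {{ next_execution_date.strftime("%Y-%m-%d") }}',
    )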
I'm new to Airflow and I'm trying to understand what execution_date means in the Airflow context. I've read the tutorial page from Airflow's documentation, which states that
The date specified in this context is an execution_date, which simulates the scheduler running your task or dag at a specific date + time:
I tried to run a task from the tutorial using the following command:
airflow test tutorial print_date 2015-06-01
I expected it to print the execution_date, but the task prints the actual date on my local system, like this:
[2018-05-26 20:36:13,880] {bash_operator.py:101} INFO - Sat May 26 20:36:13 IST 2018
I thought the scheduler would be simulated at the given time, so I'm confused about the execution_date parameter. Can anyone help me understand this? Thanks.
It's printing the current time in your log because it was actually executed at this time.
The execution date is a DAG run parameter. Tasks can use it to have a date reference different from when the task will actually be executed.
Example: say you're interested in storing currency rates once per day. You want to get rates since 2010. You'll have a task in your DAG to call an API which will return the currency rate for a day. You can create a DAG with a start date of 2010-01-01 with a schedule of once per day. Even if you create it now, in 2018, it will run for every day since the start date and thanks to the execution date you'll have the correct data.
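A minimal sketch of that currency-rates idea (the URL and file paths are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# One run per day since 2010-01-01; {{ ds }} renders each run's execution date,
# so back-filled runs request the historical rate for their own day.
with DAG(
    dag_id='currency_rates',
    start_date=datetime(2010, 1, 1),
    schedule_interval='@daily',
) as dag:
    fetch_rates = BashOperator(
        task_id='fetch_rates',
        bash_command='curl -s "https://example.com/rates?date={{ ds }}" -o /tmp/rates_{{ ds }}.json',
    )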