What do execution_date and backfill mean in Airflow?

I'm new to Airflow and I'm trying to understand what execution_date means in the Airflow context. I've read the tutorial page from Airflow's documentation, which states:
The date specified in this context is an execution_date, which simulates the scheduler running your task or dag at a specific date + time:
I tried to run a task from the tutorial using the following command:
airflow test tutorial print_date 2015-06-01
I expected it to print the execution_date, but the task printed the actual date on my local system, like this:
[2018-05-26 20:36:13,880] {bash_operator.py:101} INFO - Sat May 26 20:36:13 IST 2018
I thought the scheduler would be simulated at the given time, so I'm confused about the execution_date parameter. Can anyone help me understand this? Thanks.

It's printing the current time in your log because the task was actually executed at that time.
The execution date is a DAG run parameter. Tasks can use it to get a date reference different from the time at which the task actually runs.
Example: say you're interested in storing currency rates once per day, and you want rates going back to 2010. You'll have a task in your DAG that calls an API returning the currency rate for a given day. You can create a DAG with a start date of 2010-01-01 and a schedule of once per day. Even if you create it now, in 2018, it will run for every day since the start date, and thanks to the execution date you'll have the correct data for each day.
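The backfill behaviour described above can be sketched in plain Python. This is only an illustration of the semantics, not Airflow's actual scheduler code: each run is stamped with the start of the period it covers, and a period is only runnable once it has ended.

```python
from datetime import date, timedelta

def backfill_execution_dates(start_date, today, interval=timedelta(days=1)):
    """Enumerate the execution_date of every run a daily backfill would create.

    Each run is stamped with the START of the period it covers, so the run
    stamped `start_date` processes data for that day, even if the run is
    actually triggered years later.
    """
    current = start_date
    dates = []
    # A period is only complete (and therefore runnable) once it has ended.
    while current + interval <= today:
        dates.append(current)
        current += interval
    return dates

# Creating the DAG "now" (say 2018-05-26) with start_date=2010-01-01
# still yields one run per day since 2010, each with the correct date.
runs = backfill_execution_dates(date(2010, 1, 1), date(2018, 5, 26))
print(runs[0])   # 2010-01-01
print(runs[-1])  # 2018-05-25
```

Inside each of those runs, templates like `{{ ds }}` render that run's execution date, which is how the currency-rate task above gets the right day's data.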

Related

Are Airflow monthly DAGs delayed by a day or a month?

I am trying to set up an Airflow DAG that runs on the second day of each month. Based on my research, the schedule interval should be set to:
schedule_interval = '0 0 2 * *'
What worries me is what's discussed in the Airflow documentation:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Does this mean that for monthly DAGs everything will be delayed by one month? So, for example, will the 2020-11-02 job run on 2020-12-01 at 23:59? If yes, how can I make sure that it runs exactly when it's supposed to?
Your interpretation is right: the DAG run with execution_date=2020-11-02 will be triggered at approximately 2020-12-01 23:59. However, I don't think you need to worry that the DAG is delayed or anything; it still has a monthly schedule and will run every month. You just need to take this logic into account when writing an operator.
You can also simply work with other variables if you don't want to adapt the logic, for whatever reason:
{{ next_execution_date }} - the next execution date.
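To make the one-interval delay concrete, here is a plain-Python sketch of the scheduling rule (an illustration, not Airflow code): for cron `0 0 2 * *`, the run stamped with a given execution_date is only triggered once its monthly period has ended, i.e. at the next schedule point, which is also what `{{ next_execution_date }}` resolves to inside that run.

```python
from datetime import datetime

def next_monthly_schedule_point(execution_date):
    """For schedule '0 0 2 * *' (midnight on the 2nd of each month),
    return the schedule point one interval after execution_date."""
    year, month = execution_date.year, execution_date.month
    month += 1
    if month > 12:
        year, month = year + 1, 1
    return datetime(year, month, 2)

run_stamp = datetime(2020, 11, 2)
# When the run actually starts: the end of the period it covers,
# i.e. "soon after 2020-12-01T23:59" in the docs' wording.
trigger_time = next_monthly_schedule_point(run_stamp)
print(trigger_time)  # 2020-12-02 00:00:00
# In that run, {{ next_execution_date }} renders this same point,
# which matches the calendar date on which the job actually fires.
```

So if an operator needs the calendar date the job fires on, templating `{{ next_execution_date }}` instead of `{{ execution_date }}` avoids the apparent one-month lag.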

How to implement a current-plan concept on a calendar-date basis in Airflow

Is there any way to implement the current-plan concept in Airflow?
Example: in Control-M, jobs are scheduled per date, e.g. ODATE: 20200801.
But in Airflow we don't see an option to schedule jobs per date/calendar.
Is there a good way to implement this?
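The thread gives no accepted answer, but one common approach (an assumption on my part, not from the thread) is to run the DAG daily and derive a Control-M-style ODATE from each run's execution date; in an Airflow template, the `{{ ds_nodash }}` macro renders exactly this YYYYMMDD form.

```python
from datetime import date

def odate(execution_date):
    """Format an execution date the way Control-M's ODATE looks: YYYYMMDD.
    In an Airflow template this is what {{ ds_nodash }} renders."""
    return execution_date.strftime("%Y%m%d")

print(odate(date(2020, 8, 1)))  # 20200801
```

Each daily run then carries its own calendar date, which backfill and catchup will populate for past dates as well.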

Airflow - Incorrect Last Run

I just ran an Airflow DAG. When I look at the Airflow "Last Run" date, it displays the run before the last one, yet when I hover over the "i" icon it shows the correct date, which caught my attention. Is there any way to solve this? It sounds like nonsense, but I end up using it for QA of my data.
This is probably because your Airflow job has catchup=True enabled and a start_date in the past, so it is backfilling.
The Start Date is the real-time date of the last run, whereas the Last Run is the execution date of the Airflow job. For example, if I am backfilling a time-partitioned table with data from 2016-01-01 to the present, the Start Date will be the current date but the Last Run date will be 2016-01-01.
Please include your DAG file/code in the future.
Edit: If you don't have catchup=True enabled, and the discrepancy is approximately one day (like in the picture you sent), then that is just due to the behaviour of the scheduler. From the docs: "The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period."
if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
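A plain-Python sketch of why "Last Run" trails the wall clock (this illustrates the rule from the docs, not Airflow internals): the most recent run is stamped with the start of the last completed interval, so for a daily DAG the UI always shows the previous day.

```python
from datetime import date, timedelta

def last_run_execution_date(today, interval=timedelta(days=1)):
    """The 'Last Run' shown in the UI is the execution_date of the most
    recent run, i.e. the start of the last interval that has already ended."""
    return today - interval

# On 2016-01-02, the most recent completed daily period started on 2016-01-01.
print(last_run_execution_date(date(2016, 1, 2)))  # 2016-01-01
```

This is the same one-interval offset discussed in the monthly-schedule question above, just observed through the UI instead of the trigger time.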

ETL present data without the schedule interval delay while not breaking the Catchup

I have a DAG that needs to be triggered every Tuesday and Friday (for context, the purpose of the DAG is basically to ETL data that is published only twice a week, on Tuesday and Friday).
This DAG needs to catch up on the past.
I use {{ execution_date }} in many operator parameters (as an API call parameter, in storage names for keeping a copy of the raw data, ...).
The catchup works well; my issue is with the present.
Because of the schedule interval, every Friday it will ETL the previous Tuesday's data (execution_date is used as the API call parameter), and every Tuesday it will ETL the previous Friday's data.
What I need is for the Tuesday run to get this Tuesday's data, not the previous Friday's.
I thought about using start_date instead of execution_date for the API call, but in that case the catchup would not work as expected.
I can't find any clean solution where catchup works well and present data is processed without the schedule-interval delay...
Any idea?
EDIT, based on andscoop's answer:
The best solution is to use next_execution_date instead of execution_date.
Catchup will not prevent the most recent DAG run from running; it only determines whether previously un-run DAG runs will be executed to "catch up".
There is no delay per se; what you are seeing is that the reported execution date only shows the last completed schedule interval.
You will want to look into Airflow macros to template the exact timestamp you need.
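Over concrete dates, the Tuesday/Friday situation can be sketched in plain Python (illustration only, not Airflow code): for each run, `execution_date` is the previous schedule point, while `next_execution_date` equals the point at which the run is actually triggered, which is the date whose data the asker wants.

```python
from datetime import date

# Consecutive Tue/Fri schedule points: Fri 2021-01-01, Tue 2021-01-05, ...
points = [date(2021, 1, 1), date(2021, 1, 5), date(2021, 1, 8), date(2021, 1, 12)]

# The run triggered at points[i+1] is stamped execution_date=points[i].
for exec_date, trigger in zip(points, points[1:]):
    # Templating {{ execution_date }} on the Tuesday run fetches Friday's data;
    # {{ next_execution_date }} fetches data for the day the run actually fires.
    print(f"run triggered {trigger}: execution_date={exec_date}, "
          f"next_execution_date={trigger}")
```

So switching the API call parameter from `{{ execution_date }}` to `{{ next_execution_date }}` keeps catchup intact while making each run process the data published on the day it fires.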

Tivoli to AutoSys

I am looking for help in converting an IBM Tivoli script to AutoSys, since we have migrated from Tivoli to AutoSys.
Below is the script in Tivoli:
SCHEDULE SX On RUNCYCLE YEARLY VALIDFROM 02/01/2015 "FREQ=YEARLY;INTERVAL=1;"
UNTIL 0550 +5 DAYS
CARRYFORWARD
...
...
I need to convert this same script so that it can work in AutoSys. I'm not sure what the Tivoli script means, what it does, or how I can convert it to a run_calendar in an AutoSys script.
What about the UNTIL and CARRYFORWARD options?
Thanks.
This is a TWS schedule (a set of jobs) that is set to run once a year, starting 02/01/2015. It will start on that day at new plan generation and will run until 05:50 AM five days later. CARRYFORWARD means that this schedule will not disappear at the new plan generation that happens every day.
This is not a script per se; it is a container of jobs. The AutoSys calendar must look like a yearly job.