Are Airflow monthly DAGs delayed by a day or a month?

I am trying to set up an Airflow DAG that runs on the second day of each month. Based on my research, the schedule interval should be set to:
schedule_interval = '0 0 2 * *'
Now what worries me is the behaviour discussed in the Airflow documentation. Based on what's discussed there:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's Repeat That: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Does this mean that for monthly DAGs everything will be delayed by one month? So, for example, will the 2020-11-02 job run on 2020-12-01 at 23:59? If yes, how can I make sure that it runs exactly when it's supposed to?

Your interpretation is right: the DAG run with execution_date=2020-11-02 will be triggered approximately on 2020-12-01 23:59. However, I don't think you need to worry that the DAG is delayed or anything; it still has a monthly schedule and will run every month. You just need to take this logic into account when writing an operator.
You can also simply work with other variables if you don't want to adapt the logic, for whatever reason:
{{ next_execution_date }} - the next execution date.
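For illustration, a minimal sketch (assuming Airflow 2.x; the dag_id and task are placeholders) of a monthly DAG that templates next_execution_date so each run can also refer to the date on which it is actually triggered:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Runs at 00:00 on the 2nd of every month; the run stamped with
# execution_date 2020-11-02 is actually triggered around 2020-12-02.
with DAG(
    dag_id="monthly_example",
    start_date=datetime(2020, 11, 1),
    schedule_interval="0 0 2 * *",
    catchup=False,
) as dag:
    # {{ next_execution_date }} is the start of the next interval, i.e. the
    # date on which this run actually fires, while {{ ds }} is the date of
    # the interval the run covers.
    echo_dates = BashOperator(
        task_id="echo_dates",
        bash_command="echo 'execution_date={{ ds }} next={{ next_execution_date }}'",
    )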

Related

Run an Airflow DAG on the 2nd and 5th working day of every month

I want to run a DAG on the 2nd and 5th working day of every month.
E.g. 1: Suppose the 1st day of the month falls on a Friday. In that case the 2nd working day of the month falls on the 4th of that month (i.e. Monday) and the 5th working day falls on the 7th.
E.g. 2: Suppose the 1st day of the month falls on a Wednesday. In that case the 2nd working day falls on the 2nd of that month, but the 5th working day falls on the 7th (i.e. Tuesday).
E.g. 3: Suppose the 1st day of the month falls on a Sunday. In that case the 2nd working day falls on the 3rd of that month and the 5th working day falls on the 6th (i.e. Friday).
So, how can I schedule the DAG in Airflow for such scenarios?
I am looking for scheduling logic or code
Could you provide the code of your DAG?
It depends on what operators you are using or willing to use.
Also, you might want to keep bank holidays in mind. Are you sure it is okay to run your Airflow job on the 2nd day even if it is a bank holiday?
You can schedule your DAG daily and use a PythonOperator to validate whether the current date fits your restrictions. I would push this value to XCom and then read and process it in your DAG definition.
Another option would be to use a BashOperator at the beginning of the flow and fail it in case the current date violates your logic. This will prevent the rest of the dependent tasks from executing.
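A minimal sketch of that daily-schedule-plus-check pattern (assuming Airflow 2.x; the working-day calculation here is a plain weekday count for illustration and ignores bank holidays):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import ShortCircuitOperator


def is_target_working_day(ds=None, **_):
    """Return True only on the 2nd or 5th working day (Mon-Fri) of the month."""
    run_date = datetime.strptime(ds, "%Y-%m-%d").date()
    workdays = [
        d for d in range(1, run_date.day + 1)
        if datetime(run_date.year, run_date.month, d).weekday() < 5
    ]
    # The run date itself must be a working day and be the 2nd or 5th one.
    return run_date.day in workdays and len(workdays) in (2, 5)


with DAG(
    dag_id="second_and_fifth_workday",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Skips all downstream tasks unless the check returns True.
    check = ShortCircuitOperator(
        task_id="check_working_day",
        python_callable=is_target_working_day,
    )
    etl = BashOperator(task_id="run_etl", bash_command="echo running ETL")
    check >> etl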
Airflow uses cron expressions for its schedule times, so you must define this logic inside your code: cron can only trigger tasks on a fixed schedule and cannot do any date calculations.
You can use custom timetables: https://airflow.apache.org/docs/apache-airflow/stable/howto/timetable.html
For a similar use case we implemented a branching operator with logic to run the selected tasks only when it is a specific workday of the month (I think we were using the workday package to identify specific holidays), and this DAG ran daily. The DAG had to complete some other tasks in all cases.
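A rough illustration of that branching layout (assuming Airflow 2.3+ import paths; the task names are invented and the workday check is again a simple weekday count rather than the holiday-aware logic mentioned above):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.trigger_rule import TriggerRule


def choose_branch(ds=None, **_):
    """Pick the monthly branch only on the 2nd working day of the month."""
    run_date = datetime.strptime(ds, "%Y-%m-%d").date()
    workdays = [
        d for d in range(1, run_date.day + 1)
        if datetime(run_date.year, run_date.month, d).weekday() < 5
    ]
    if run_date.day in workdays and len(workdays) == 2:
        return "monthly_tasks"
    return "skip_monthly"


with DAG(
    dag_id="daily_with_monthly_branch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_branch)
    monthly_tasks = EmptyOperator(task_id="monthly_tasks")
    skip_monthly = EmptyOperator(task_id="skip_monthly")
    # Tasks that must run on every daily run, regardless of the branch taken.
    always_run = EmptyOperator(
        task_id="always_run",
        trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS,
    )
    branch >> [monthly_tasks, skip_monthly] >> always_run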

Airflow - Incorrect Last Run

I just ran an Airflow DAG. When I look at the DAG's Last Run date, it displays the run before the last one. What catches my attention is that when I hover over the "i" icon it shows the correct date. Is there any way to solve this? It sounds like nonsense, but I end up using it for QA of my data.
This is probably because your Airflow job has catchup=True enabled and a start_date in the past, so it is backfilling.
The Start Date is the real (wall-clock) date of the last run, whereas Last Run is the execution date of the Airflow job. For example, if I am backfilling a time-partitioned table with data from 2016-01-01 to the present, the Start Date will be the current date but the Last Run date will be 2016-01-01.
Please include your DAG file/code in the future.
Edit: If you don't have catchup=True enabled, and the discrepancy is approximately one day (as in the picture you sent), then that is just due to the behaviour of the scheduler. From the docs: "The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period."
if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
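If the backfilling itself is not wanted, it can be turned off when the DAG is defined; a minimal sketch (the dag_id and dates are placeholders):

from datetime import datetime

from airflow import DAG

# With catchup=False the scheduler only creates the most recent run
# instead of backfilling every interval since start_date.
dag = DAG(
    dag_id="no_backfill_example",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)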

ETL present data without the schedule interval delay while not breaking the Catchup

I have a DAG that needs to be triggered every Tuesday and Friday (for context, the purpose of the DAG is to ETL data that is published only twice a week, on Tuesday and Friday).
This DAG needs to catch up the past.
I use {{ execution_date }} in many operator parameters (as an API call parameter, in the storage name for keeping a copy of the raw data, ...).
The catchup works well; my issue is with the present.
Because of the schedule interval, every Friday it will ETL the data of the previous Tuesday (it uses execution_date as the API call parameter) and every Tuesday it will ETL the data of the previous Friday.
What I need is for the Tuesday run to get the data of this Tuesday and not the previous Friday.
I thought about using start_date instead of execution_date for the API call, but in that case the catchup will not work as expected.
I can't find any clean solution where the catchup works well and present data is processed without the schedule interval delay...
Any idea?
EDIT, based on andscoop's answer: the best solution is to use next_execution_date instead of execution_date.
Catchup will not prevent the most current DAG run from running. It only determines whether or not previously un-run DAG runs will be created to "catch up".
There is no delay per se; what you are seeing is that the reported execution date only shows the last completed schedule interval.
You will want to look into Airflow macros to template the exact timestamp you need.
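As a sketch of what that templating can look like (assuming Airflow 2.x; the API endpoint and file paths are invented for illustration), using next_execution_date so the Tuesday run fetches Tuesday's data:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="twice_weekly_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 0 * * 2,5",  # Tuesdays and Fridays at midnight
    catchup=True,
) as dag:
    # next_ds / next_execution_date point at the start of the *next* interval,
    # which is the calendar date on which this run is actually triggered, so
    # the Tuesday run asks the API for Tuesday's data rather than the
    # previous Friday's.
    fetch = BashOperator(
        task_id="fetch_data",
        bash_command=(
            "curl -s 'https://example.com/api/data?date={{ next_ds }}' "
            "-o /tmp/raw_{{ next_ds }}.json"
        ),
    )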

What do execution_date and backfill mean in Airflow?

I'm new to Airflow and I'm trying to understand what execution_date means in the Airflow context. I've read the tutorial page from Airflow's documentation, which states that
The date specified in this context is an execution_date, which simulates the scheduler running your task or dag at a specific date + time:
I tried to run a task from the tutorial using the following command.
airflow test tutorial print_date 2015-06-01
I expected it to print the execution_date, but the task prints the actual date on my local system, like this.
[2018-05-26 20:36:13,880] {bash_operator.py:101} INFO - Sat May 26 20:36:13 IST 2018
I thought the scheduler would be simulated at the given time, so I'm confused about the execution_date parameter. Can anyone help me understand this? Thanks.
It's printing the current time in your log because it was actually executed at this time.
The execution date is a DAG run parameter. Tasks can use it to have a date reference different from when the task will actually be executed.
Example: say you're interested in storing currency rates once per day, and you want rates going back to 2010. You'll have a task in your DAG that calls an API which returns the currency rate for a given day. You can create a DAG with a start date of 2010-01-01 and a schedule of once per day. Even if you create it now, in 2018, it will run for every day since the start date, and thanks to the execution date you'll have the correct data.
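A small sketch of that currency-rate example (assuming Airflow 2.x; the API URL is invented):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="currency_rates",
    start_date=datetime(2010, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # create one run per day since 2010-01-01
) as dag:
    # {{ ds }} is the execution date (YYYY-MM-DD) of each run, so a backfilled
    # run for 2010-03-15 asks the API for that day's rate, not today's.
    fetch_rate = BashOperator(
        task_id="fetch_rate",
        bash_command=(
            "curl -s 'https://example.com/rates?date={{ ds }}' "
            "-o /tmp/rates_{{ ds }}.json"
        ),
    )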

I want to execute my previous day's unexecuted job first, rather than today's, in Control-M

I am ordering my jobs into the plan until the jobs succeed, but today's ordered jobs should run only after the previous day's jobs for that particular Control-M table have run. I have added "Keep Active" for 10 days for each job.
Thanks in advance.
You can try adding an IN-CONDITION of the same job with a PREV date, so that today's job will run only when yesterday's job has completed its run successfully.
If you want to run today's job even when yesterday's job fails, you can add a DO-STEP on failure to create a condition which can be used by today's job to start running.
