How to set the execution date same as the trigger time? - airflow

I'm just learning Apache Airflow. I understand that the execution date is not the same as the actual time a DAG run is triggered.
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
So for a daily job, cron jobs run at the start of the day, while Airflow jobs run at the end of the day.
Is there any way to set the execution date to be the same as the trigger time?

You generally structure your tasks such that you'll provide a date to the job via kwargs (for idempotency, etc).
Airflow provides macros (https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html) that expose both the data_interval_start and the data_interval_end.
I believe you're looking for data_interval_end, which aligns with the time the run is actually triggered; the logical date itself corresponds to data_interval_start.
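To make the relationship concrete, here is a minimal sketch in plain Python (no Airflow import) of how these values line up for a fixed schedule interval. The helper name describe_run is hypothetical, and the model assumes a simple fixed interval where the run becomes eligible as soon as its data interval ends.

```python
from datetime import datetime, timedelta


def describe_run(logical_date: datetime, interval: timedelta) -> dict:
    """Simplified model of one scheduled run for a fixed schedule interval.

    Assumption: the logical date marks the start of the data interval and
    the run is triggered once that interval has ended.
    """
    return {
        "logical_date": logical_date,
        "data_interval_start": logical_date,
        "data_interval_end": logical_date + interval,
        "earliest_trigger": logical_date + interval,
    }


# The daily run stamped 2016-01-01 from the quoted documentation.
run = describe_run(datetime(2016, 1, 1), timedelta(days=1))
print(run["data_interval_end"])  # 2016-01-02 00:00:00
```

Inside a templated field, the corresponding value would be referenced as {{ data_interval_end }}, as listed on the macros page linked above.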

Related

airflow schedule issue - diff between schedule time and run time

I set the schedule to '* 1,5,10,18 * * *' in Airflow.
But yesterday's 18:00 run wasn't executed, so I checked the logs.
There I found that the job scheduled for 10:00 was executed at 18:00.
I want to know why this happens and how I can fix it.
Note that if you run a DAG on a schedule_interval of one day, the run
stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In
other words, the job instance is started once the period it covers has
ended.
~Airflow scheduler docs
So, as you can see, a run is scheduled after its period closes: the 10:00 -> 18:00 period closes at 18:00, so that run starts after 18:00. Check the previous run; it should have started just after 10:00.
You don't understand how the airflow scheduler works.
Airflow stamps each DAG run with the time of the previous scheduled execution (data_interval_start). So the run that executed at 18:00 in your case is stamped 10:00. In the same way, the run that executes at 01:00 is stamped 18:00 of the previous day.
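The mapping between the time a run executes and the time it is stamped with can be sketched in plain Python. This is only an illustration: the function name run_stamp is hypothetical, the dates are arbitrary, and it assumes the schedule fires at minute 0 of hours 1, 5, 10 and 18 (the expression in the question, with '*' in the minute field, would fire every minute of those hours).

```python
from datetime import datetime, timedelta

# Assumption: the schedule fires once at minute 0 of each of these hours.
SCHEDULED_HOURS = (1, 5, 10, 18)


def run_stamp(trigger_time: datetime, hours=SCHEDULED_HOURS) -> datetime:
    """Return the scheduled time immediately before trigger_time.

    In this simplified model, that earlier time is what the run triggered
    at trigger_time is stamped with (its data_interval_start).
    """
    day = trigger_time.replace(hour=0, minute=0, second=0, microsecond=0)
    candidates = [
        (day + timedelta(days=offset)).replace(hour=h)
        for offset in (-1, 0)  # look at yesterday and today
        for h in hours
    ]
    return max(t for t in candidates if t < trigger_time)


print(run_stamp(datetime(2023, 5, 2, 18, 0)))  # 2023-05-02 10:00:00
print(run_stamp(datetime(2023, 5, 2, 1, 0)))   # 2023-05-01 18:00:00
```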

airflow DAG triggering for time consuming runs

I am completely new to Airflow and am trying to grasp the concepts of scheduling and default args.
I have a scenario where I would like to schedule my DAG hourly to do a data transfer task between a source and a database. What I am trying to understand is this: let's say one of my DAG runs is triggered at 00:00. If it takes more than an hour for this run to complete all of its tasks (say 1 hour 30 min), does it mean that the next DAG run, which was supposed to be triggered at 01:00, will NOT get triggered, but the DAG run from 02:00 will?
Yes.
To avoid this, set catchup=True on the DAG object.
Reference: https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html (search for Catchup)
The Airflow scheduler monitors all DAGs and tasks in Airflow. Default arguments can be used to create tasks with default parameters in a DAG.
The first DAG run is based on start_date, and subsequent runs follow schedule_interval sequentially. The scheduler doesn't trigger a run until its period has ended. For your requirement, you can set catchup to True on the DAG so that a run is created for each completed interval, and the scheduler will execute them sequentially. With catchup enabled, the scheduler creates a DAG run for every data interval since the last run that has not yet been started.
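As a rough illustration of the catchup behaviour described above, the following plain-Python sketch lists which intervals would get a run. It is a simplified model, not Airflow's scheduler code: the function name runs_to_create is hypothetical, and real scheduling also depends on settings such as max_active_runs.

```python
from datetime import datetime, timedelta


def runs_to_create(last_run_start, now, interval, catchup):
    """List the logical dates of intervals that have fully ended but not run.

    Simplified model: with catchup=True every completed interval gets a run;
    with catchup=False only the most recent completed interval does.
    """
    pending = []
    start = last_run_start + interval
    while start + interval <= now:  # the interval must have ended
        pending.append(start)
        start += interval
    return pending if catchup else pending[-1:]


hour = timedelta(hours=1)
last = datetime(2023, 5, 2, 0, 0)   # logical date of the last run (arbitrary)
now = datetime(2023, 5, 2, 3, 30)   # current time

print(runs_to_create(last, now, hour, catchup=True))
print(runs_to_create(last, now, hour, catchup=False))
```

With these example values, the first call returns the 01:00 and 02:00 intervals, while the second returns only the 02:00 interval.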

Define maintenance windows in airflow

I am trying to define periods of time where the scheduler will not mark tasks as ready.
For example, I have a DAG with a lengthy backfill period. The backfill will run throughout the day until the DAG is caught up. However, I do not want any executions of the DAG to run between midnight and 2 am.
Is it possible to achieve this through configuration?

resuming a dag runs immediately with the last scheduled execution

After a DAG has been paused for 2-3 days, resuming it with catchup=False runs it immediately with the last scheduled execution.
For example, a DAG that sends data to an external system is scheduled to run every day at 19:00.
Pausing the DAG for 4 days and re-enabling it at 11:00 runs the DAG immediately with yesterday's execution, and then again at 19:00 that day.
In this case the DAG runs twice on the day it is resumed.
Is it possible to resume the DAG so that the first run actually happens at 19:00?
With the default operators, you cannot achieve exactly what you are expecting. The closest thing Airflow has is the LatestOnlyOperator. It is one of the simplest operators and needs only the following configuration:
latest_only = LatestOnlyOperator(task_id='latest_only')
This lets the downstream tasks run only if the current time falls between the run's next execution date and the one after it, that is, only for the most recent scheduled run. So, in your case, it would skip the executions for the three earlier days, but yesterday's run would still trigger the jobs.
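The effect on the resume scenario can be sketched in plain Python. This is a simplified model of the time-window check, not the operator's actual source: the function name is_latest is hypothetical, the dates are arbitrary, and it assumes the window opens when the run's data interval ends and stays open for one schedule period.

```python
from datetime import datetime, timedelta


def is_latest(interval_end: datetime, period: timedelta, now: datetime) -> bool:
    """Return True if `now` falls in the window right after the run's interval.

    Assumption: a run counts as the latest when the current time lies after
    the end of its data interval and no later than one period beyond it.
    """
    return interval_end < now <= interval_end + period


period = timedelta(days=1)
now = datetime(2023, 5, 10, 11, 0)  # DAG resumed at 11:00

# Data intervals of the missed daily 19:00 runs end on these dates.
interval_ends = [datetime(2023, 5, day, 19, 0) for day in (6, 7, 8, 9)]

for end in interval_ends:
    decision = "run downstream" if is_latest(end, period, now) else "skip downstream"
    print(end, decision)
```

In this model only the run whose interval ended yesterday at 19:00 passes the check, which matches the behaviour described above: older runs are skipped, but the most recent one still executes on resume.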

Is it possible to have airflow backfill and scheduling at the same time?

I am in a situation where data has started arriving daily at a certain time, and I have to create an ETL for that data.
Meanwhile, I am still creating the DAGs for scheduling the tasks in Airflow, and the data keeps arriving daily. So when I start running my DAGs from today, I want to schedule them daily and also backfill all the data from the past days that I missed while I was creating the DAGs.
I know that if I put start_date as the date from which the data started arriving, Airflow will start backfilling from that date, but in that case, wouldn't my DAGs always be behind the current day? How can I achieve backfilling and scheduling at the same time? Do I need to create separate DAGs/tasks for backfill and scheduling?
There are several things you need to consider.
1. Is your daily data independent, or does each run depend on the previous run?
If the data depends on previous state, you can run a backfill in Airflow.
How does backfilling work in Airflow?
Airflow lets you run a DAG for past dates. This process is called backfill. A backfill lets Airflow set the status of all DAG runs since the DAG's inception.
I know that if I put start_date as the date from which the data
started arriving, Airflow will start backfilling from that date, but
in that case, wouldn't my DAGs always be behind the current day?
Yes, setting a past start_date is the correct way of backfilling in Airflow.
No. If you use the Celery executor, the jobs will run in parallel and will eventually catch up to the current day, depending on your execution duration.
How can I achieve backfilling and scheduling at the same time? Do I
need to create separate DAGs/tasks for backfill and scheduling?
You do not need to do anything extra to achieve scheduling and backfilling at the same time; Airflow will take care of both depending on your start_date.
Finally, if this is going to be a one-time task, I recommend processing your data manually, outside Airflow; this will give you more control over the execution.
Then either mark the backfilled tasks as succeeded, or do the following:
Run an airflow backfill command like this: airflow backfill -m -s "2016-12-10 12:00" -e "2016-12-10 14:00" users_etl
This command will create task instances for every scheduled slot from 12:00 PM to 02:00 PM and mark them as successful without executing the tasks at all. Ensure that you set depends_on_past to False, as this makes the process a lot faster. When you're done, set it back to True.
Or
Even simpler: set the start_date to the current date.
