Retrive scheduled time for catchup=true in airflow - airflow

I have a dag that is scheduled to call a script daily passing the current date so i pass {{ ds }} to get the execution date.
On days when my dag doesn't run i have catchup = True.
so the dag needs to pass the scheduled date, not the execution date for the task to get done, so that the activity of the day on which the dag was unable to run is still completed.
How can i do this?

As far as I understand your scenario, the execution_date is exactly what you need.
The name might be a bit misleading, but it is in fact filled with the scheduled timestamp.

Related

How to set the execution date same as the trigger time?

I'm just learning Apache Airflow. I understand that the execution date is not the same time as the actual time a dag run is triggered.
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be trigger soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let’s Repeat That The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Yeah, For a daily job, cron jobs run at the start of the day; Airflow jobs run at the end of the day.
I humbly ask: Anyway to set the execution date same as the trigger time?
You generally structure your tasks such that you'll provide a date to the job via kwargs (for idempotency, etc).
Airflow provides macros (https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html) that expose both the data_interval_start and the data_interval_end.
I believe you're looking for the data_interval_end which aligns with the logical date that the job is running.

airflow DAG triggering for time consuming runs

I am completely new to Airflow and am trying to grasp the concepts of scheduling and default args.
I have a scenario where I would like to schedule my DAG hourly to do some data transfer task between a source and a database. What I am trying to understand is, lets say one of my DAG runs has triggered at 00:00 AM. Now if it takes more than an hour for this run to successfully complete all of its tasks (say 1 hour 30 min), does it mean that the next DAG run that was supposed to be triggered at 01:00 AM will NOT get triggered but the DAG run from 02:00 AM will get triggered?
Yes.
In order to avoid, you need catchup=True for the DAG object.
Reference : https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html
(Search for Catchup)
Airflow Scheduler used to monitor all DAGs and tasks in Airflow.Default Arguments can be used to create tasks with default parameters in DAG.
The first DAG runs based on start_date and runs based on scheduled_interval sequentially. Scheduler doesn’t trigger tasks until the period has ended.For your requirement you can set dag.catchup to true as to run the DAG for each completed interval and scheduler will execute them sequentially.Catchup is used to start the DAG run since the last data interval which has not started for any data interval.

Can catchup be used to backfill failed dag runs in Airflow?

I have read that Airflows catchup feature applies to task instances that do not yet have a state - i.e. the scheduler will pick up any execution dates where the DAG has not yet ran (starting from the given start_date) - is this correct, and if so, does this mean catchup does not apply to failed DAG runs?
I am looking for a way to backfill any execution dates that failed, rather than not having ran at all.
Take a look at the backfill command options. You could use rerun-failed-tasks:
if set, the backfill will auto-rerun all the failed tasks for the backfill date range instead of throwing exceptions
Default: False
or reset-dagruns:
if set, the backfill will delete existing backfill-related DAG runs and start anew with fresh, running DAG runs
Default: False
Also, keep in mind this:
If reset_dag_run option is used, backfill will first prompt users whether airflow should clear all the previous dag_run and task_instances within the backfill date range. If rerun_failed_tasks is used, backfill will auto re-run the previous failed task instances within the backfill date range.
Before doing anything like the above, my suggestion is that you try it first with some dummy DAG or similiar.

Airflow: Mark execution date range task istances as success

I have a DAG that has run tasks for over a decade of execution dates. Now I needed to add another year to the beginning. I googled a little bit and the recommendation was to do this under a new dag_id. Because the old DAG has run already for that named execution date range, I want to mark those in the new DAG as a success. How can I archive this in a convenient way?
Thanks in advance. Have a nice start to this week.
Airflow's backfill feature is designed to do exactly what you're trying to do.
That said, not everyone likes using the feature. For example, if your dag is ordinarily an hourly job, backfilling several years of data in hourly batches might be really inefficient.
So for various reasons, creating a temporary "backfill" dag is not a bad way to go.
And to be clear, with "backfill" dag I refer to a dag you are using for purpose of backfill, while not using airflow backfill feature.
For your "backfill" dag, use the DAG parameters start_date and end_date to control the range of execution_date a dag will create dag runs for.
Then after your "backfill" dag is done with all its runs, you can delete it. Airflow won't know the old task instances are now backfilled, but you may not care about that. If you do, you can update the dag_id manually in the metastore database. And otherwise, your "old" dag has correct metadata for more recent periods.

how to get airflow DAG execution date using command line?

In order for me to get the dag_state, I run the following LCI command:
airflow dag_state example_bash_operator '12-12T16:04:46.960661+00:00'
The trouble is - I have to explicitly pass the exact date-time (i.e. execution_date) to this command.
When I run airflow list_dags I only get a listing of DAG's but not their execution dates.
Is there a way to obtain the exact date time (i.e. -> '12-12T16:04:46.960661+00:00')
for a given dag, using command line CLI?
There's a conceptual issue here. Dags are objects that have schedules, not execution dates. When the schedule is due, DagRuns are created for that Dag with the appropriate execution_date.
So you can ask for the state of a DagRun using the CLI and providing the execution_date, because execution dates (almost uniquely) map to a specific DagRun. Almost uniquely because in practice you can trigger two DagRuns with the same execution_date, but that's an unusual scenario.
But if you ask for the execution_date of a Dag, what do you really want to know? The execution_date of the last recently created DagRun? The list of execution_dates for the currently running DagRuns?
You can check list_dag_runsdag_id CLI command and see if yon can filter it to your needs.

Resources