I have a DAG that has run tasks for over a decade of execution dates. Now I need to add another year to the beginning. I googled a little and the recommendation was to do this under a new dag_id. Because the old DAG has already run for that execution date range, I want to mark those runs in the new DAG as successful. How can I achieve this in a convenient way?
Thanks in advance. Have a nice start to this week.
Airflow's backfill feature is designed to do exactly what you're trying to do.
That said, not everyone likes using the feature. For example, if your dag is ordinarily an hourly job, backfilling several years of data in hourly batches might be really inefficient.
So for various reasons, creating a temporary "backfill" dag is not a bad way to go.
To be clear, by "backfill" dag I mean a dag you use for the purpose of backfilling, without using Airflow's backfill feature.
For your "backfill" dag, use the DAG parameters start_date and end_date to control the range of execution_date values the dag will create dag runs for.
Then after your "backfill" dag is done with all its runs, you can delete it. Airflow won't know the old task instances are now backfilled, but you may not care about that. If you do, you can update the dag_id manually in the metastore database. And otherwise, your "old" dag has correct metadata for more recent periods.
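To make the start_date / end_date idea concrete, here is a minimal sketch in plain Python (no Airflow import) of how the window for a temporary "backfill" dag could be chosen so that it stops just before the old dag's first run. The dates and the helper names backfill_window and execution_dates are hypothetical, and the sketch assumes a fixed daily interval with both bounds inclusive.

```python
from datetime import datetime, timedelta


def backfill_window(extra_start, original_start, interval):
    """Return (start_date, end_date) for a temporary backfill dag.

    The window ends one interval before the old dag's first execution
    date, so the two dags never cover the same execution_date.
    """
    if extra_start >= original_start:
        raise ValueError("extra_start must be earlier than original_start")
    return extra_start, original_start - interval


def execution_dates(start_date, end_date, interval):
    """List the execution dates in [start_date, end_date] (assumed inclusive)."""
    dates = []
    current = start_date
    while current <= end_date:
        dates.append(current)
        current += interval
    return dates


# Hypothetical example: the old dag started on 2011-01-01 and one more
# year is needed at the beginning.
daily = timedelta(days=1)
original_start = datetime(2011, 1, 1)
extra_start = datetime(2010, 1, 1)

start_date, end_date = backfill_window(extra_start, original_start, daily)
runs = execution_dates(start_date, end_date, daily)
print(start_date.date(), end_date.date(), len(runs))
```

The resulting start_date and end_date are the values you would pass to the temporary dag; the old dag keeps its own start_date unchanged.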
I created a new daily dag and set it up yesterday (Jan. 25th). Once Airflow loaded it, I could see that it ran once (scheduled_2021-01-24T00:00:00+00:00). I then triggered it manually once to check that it works, and it did (manual_2021-01-25).
The time is now 08:24 UTC on Jan 26th, but I do not see any run for 01-25. I used airflow dags next-execution and found that Airflow is planning to execute the dag for 01-26 directly, possibly at 01-27 00:00 UTC. So it will skip 01-25 entirely.
Why does it behave this way? Is there a reason behind it?
This is THE most difficult concept to grasp in Airflow. After you get this, the rest of the system is fairly straightforward. But this one design spec is brutal; I have seen it bring seasoned engineers to their knees, sobbing in fits of rage.
As the other poster quoted from the Airflow docs, Airflow runs your job at the end of the period. This is easiest for me to visualize for a DAG that has a daily schedule. The DAG run dated 01/01/2021, with a start time of 00:01, will not execute until 01/02/2021 at 00:01.
The confusing part of this is WHY!? When you stop to think about why Airflow was written it begins to make sense. This execution pattern ensures that the data for the run date 01/01/2021 is complete and ready when your orchestration pipeline runs to act on this data. Think about it as a business process. If you are a business analyst and come into work on 01/02/2021 you will be looking at data from the day before, not data from today. The data from today has not yet been collected.
The same pattern is true for weekly or monthly intervals as well. The data for that week or month is not going to be ready to act on until the end of the period.
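The end-of-period rule described above can be written as one line of date arithmetic. This is a plain-Python sketch, not Airflow code; the function name earliest_trigger_time is made up for illustration and it assumes a fixed-length schedule interval.

```python
from datetime import datetime, timedelta


def earliest_trigger_time(execution_date, schedule_interval):
    """A run labelled with execution_date covers the period starting at
    that timestamp, so it cannot start before the period has ended."""
    return execution_date + schedule_interval


daily = timedelta(days=1)
weekly = timedelta(weeks=1)

# Daily example from the answer above.
daily_trigger = earliest_trigger_time(datetime(2021, 1, 1, 0, 1), daily)
print(daily_trigger)  # 2021-01-02 00:01:00

# The run for 01-25 from the question becomes due at the start of 01-26.
question_trigger = earliest_trigger_time(datetime(2021, 1, 25), daily)
print(question_trigger)  # 2021-01-26 00:00:00

# Same rule for a weekly schedule: the run fires one week later.
weekly_trigger = earliest_trigger_time(datetime(2021, 1, 4), weekly)
print(weekly_trigger)  # 2021-01-11 00:00:00
```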
This also makes more sense when you start using the macros and jinja templating.
Hopefully this is now clear as Mud.
This is actually a bug in Airflow 2.0.0 release which was fixed in 2.0.1: https://github.com/apache/airflow/issues/13434
This is a feature of Airflow that confused me too, in the beginning. From the Airflow docs:
If you run a DAG on a schedule_interval of one day, the run with execution_date 2019-11-21 triggers soon after 2019-11-21T23:59.
Let’s Repeat That, the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
I have a DAG 'abc' scheduled to run every day at 7 AM CST. For some reason, I do not want to run tomorrow's instance. How can I skip that particular instance. Is there any way to do that using command line ? Appreciate any help on this.
I believe you can preemptively create a DAG Run for the future date in the UI under Browse -> DAG Runs -> Create, initializing it in the success (or failed) state, which should prevent the scheduler from creating a new run when the time comes. I think you can do this on the CLI with trigger_dag as well, but you'll need to update its state separately because it defaults to running.
I think you can set the start_date to the day after tomorrow, or to whatever date you want your dag to run, as long as it is in the future. The schedule interval stays the same, every day at 7 AM. You can set start_date in default_args.
I have an airflow dag specified as shown in the picture above.
The git_pull_datagenerator_batch_2 is supposed to be delayed by the TimeDeltaSensor wait_an_hour.
However, the task git_pull_datagenerator seems to be delayed as well although it does not have a dependency on wait_an_hour. (The whole dag is scheduled at 2019-12-10T20:00:00, but git_pull_datagenerator started one hour later than that)
I have checked all documents of airflow but could not find any clues.
I'm assuming your schedule interval is hourly? A DAG run with an execution date of 2019-12-10T20:00:00 on an @hourly schedule interval is expected to run at or shortly after 2019-12-10T21:00:00, when hour 20 has "completed". I don't think it has anything to do with the sensor.
This is a common Airflow pitfall:
Airflow was developed as a solution for ETL needs. In the ETL world,
you typically summarize data. So, if I want to summarize data for
2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be
right after all data for 2016-02-19 becomes available.
If this is what is happening, wait_an_hour started at 2019-12-10T21:00:00 and git_pull_datagenerator_batch_2 at 2019-12-10T22:00:00.
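The timeline in this answer can be checked with a few lines of date arithmetic. This is a plain-Python sketch rather than Airflow code: it assumes an hourly schedule (as the answer does), assumes wait_an_hour was configured with a one-hour delta (suggested by its name only), and models the sensor as waiting until the end of the schedule period plus that delta. The function names are made up for illustration.

```python
from datetime import datetime, timedelta


def run_start(execution_date, schedule_interval):
    """Tasks of a scheduled run start once the period has ended."""
    return execution_date + schedule_interval


def time_delta_sensor_target(execution_date, schedule_interval, delta):
    """Model of a TimeDeltaSensor: it succeeds `delta` after the END of
    the period, not `delta` after the execution_date itself."""
    return run_start(execution_date, schedule_interval) + delta


execution_date = datetime(2019, 12, 10, 20, 0)
hourly = timedelta(hours=1)      # assumed schedule interval
wait = timedelta(hours=1)        # assumed delta of wait_an_hour

# git_pull_datagenerator and wait_an_hour can both start here.
first_tasks_start = run_start(execution_date, hourly)
print(first_tasks_start)  # 2019-12-10 21:00:00

# git_pull_datagenerator_batch_2 can start once the sensor succeeds.
batch_2_start = time_delta_sensor_target(execution_date, hourly, wait)
print(batch_2_start)  # 2019-12-10 22:00:00
```

Under these assumptions, git_pull_datagenerator starting at 21:00 is the normal end-of-period behaviour and is not caused by the sensor.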
It turns out that the default executor is a SequentialExecutor, which causes all of the tasks to run in a linear order.
I am in a situation where I have started getting some data scheduled daily at a certain time and I have to create ETL for that data.
Meanwhile, while I am still creating the DAGs to schedule the tasks in Airflow, the data keeps arriving daily. So when I start running my DAGs from today, I want to schedule them daily and also backfill all the data from the past days that I missed while I was creating the DAGs.
I know that if I set start_date to the date from which the data started arriving, Airflow will start backfilling from that date. But in that case, won't my DAGs always be behind the current day? How can I achieve backfilling and scheduling at the same time? Do I need to create separate DAGs/tasks for backfill and scheduling?
There are several things you need to consider.
1. Is your daily data independent, or does each run depend on the previous run?
If the data is dependent on previous state you can run backfill in Airflow.
How does backfilling work in Airflow?
Airflow gives you the facility to run past DAGs. The process of running past DAGs is called backfill. Backfilling lets Airflow set the status of all DAG runs since the DAG's inception.
I know that if I put start_date as the date from which the data
started arriving airflow will start backfilling from that date, but
wouldn't in that case, my DAGs will always be behind of current day?
Yes, setting a past start_date is the correct way of backfilling in Airflow.
No. If you use the Celery executor, the jobs will run in parallel and will eventually catch up to the current day, depending on how long each run takes.
How can I achieve backfilling and scheduling at the same time? Do I
need to create separate DAGs/tasks for backfill and scheduling?
You do not need to do anything extra to achieve scheduling and backfilling at the same time; Airflow takes care of both based on your start_date.
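To see why a past start_date does not leave the DAG permanently behind, here is a small plain-Python sketch (not Airflow code) that lists the execution dates that are already due at a given moment. It assumes a fixed daily interval and that catch-up is enabled; the dates and the function name due_execution_dates are hypothetical.

```python
from datetime import datetime, timedelta


def due_execution_dates(start_date, now, interval):
    """Execution dates whose period has fully ended by `now`.

    With catch-up enabled, all of these are eligible to run, oldest first.
    """
    due = []
    current = start_date
    while current + interval <= now:
        due.append(current)
        current += interval
    return due


# Hypothetical example: data started arriving on 2021-01-01 and the DAG
# is switched on in the morning of 2021-01-10.
start_date = datetime(2021, 1, 1)
now = datetime(2021, 1, 10, 8, 24)

due = due_execution_dates(start_date, now, timedelta(days=1))
print(len(due))   # 9
print(due[0])     # 2021-01-01 00:00:00
print(due[-1])    # 2021-01-09 00:00:00
```

Once the backlog has been worked through, the newest run is always the one for the most recently completed period, which is the same position a DAG without any backfill would be in.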
Finally, if this is going to be a one-time task, I recommend processing your data manually outside Airflow; this gives you more control over the execution.
Then either mark the backfilled tasks as succeeded, or do the following:
Run an airflow backfill command like this: airflow backfill -m -s "2016-12-10 12:00" -e "2016-12-10 14:00" users_etl.
This command will create task instances for every scheduled run from 12:00 PM to 02:00 PM and mark them as success without executing the tasks at all. Make sure depends_on_past is set to False; it makes this process a lot faster. When you're done, set it back to True.
Or
Even simpler, set the start_date to the current date.
To get the dag_state, I run the following CLI command:
airflow dag_state example_bash_operator '12-12T16:04:46.960661+00:00'
The trouble is - I have to explicitly pass the exact date-time (i.e. execution_date) to this command.
When I run airflow list_dags I only get a listing of DAG's but not their execution dates.
Is there a way to obtain the exact date time (i.e. -> '12-12T16:04:46.960661+00:00')
for a given dag, using command line CLI?
There's a conceptual issue here. Dags are objects that have schedules, not execution dates. When the schedule is due, DagRuns are created for that Dag with the appropriate execution_date.
So you can ask for the state of a DagRun using the CLI and providing the execution_date, because execution dates (almost uniquely) map to a specific DagRun. Almost uniquely because in practice you can trigger two DagRuns with the same execution_date, but that's an unusual scenario.
But if you ask for the execution_date of a Dag, what do you really want to know? The execution_date of the most recently created DagRun? The list of execution_dates for the currently running DagRuns?
You can check the list_dag_runs dag_id CLI command and see if you can filter it to your needs.