I have a DAG 'abc' scheduled to run every day at 7 AM CST. For some reason, I do not want to run tomorrow's instance. How can I skip that particular instance? Is there any way to do that from the command line? I'd appreciate any help on this.
I believe you can preemptively create a DAG Run for the future date in the UI under Browse -> DAG Run -> Create, initializing it in the success (or failed) state, which should prevent the scheduler from creating a new run when the time comes. I think you can do this on the CLI with trigger_dag as well, but you'll need to update its state separately because it defaults to running.
I think you can set the start_date to the day after tomorrow, or to whatever date you want your DAG to run from, as long as it is in the future. The schedule interval will stay the same: every day at 7 AM. You can set start_date in default_args.
I'm just learning Apache Airflow. I understand that the execution date is not the same time as the actual time a dag run is triggered.
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's repeat that: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Yeah. For a daily job, cron jobs run at the start of the day; Airflow jobs run at the end of the day.
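To make the timing concrete, here is a minimal sketch in plain Python (no Airflow import) of the rule quoted above, assuming a fixed schedule_interval. The helper name earliest_trigger_time is made up for illustration and is not part of Airflow.

```python
from datetime import datetime, timedelta

def earliest_trigger_time(execution_date, schedule_interval=timedelta(days=1)):
    """Return the earliest moment the scheduler would start the run stamped
    execution_date: the end of the period that run covers.
    Hypothetical helper, not part of Airflow."""
    return execution_date + schedule_interval

run_stamp = datetime(2016, 1, 1)
print(earliest_trigger_time(run_stamp))  # 2016-01-02 00:00:00
```

So the run stamped 2016-01-01 only becomes eligible once 2016-01-01 is over, which is the "end of the period" behaviour described above.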
I humbly ask: is there any way to set the execution date to the same time as the trigger time?
You generally structure your tasks such that you'll provide a date to the job via kwargs (for idempotency, etc).
Airflow provides macros (https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html) that expose both the data_interval_start and the data_interval_end.
I believe you're looking for data_interval_end, which aligns with the date on which the job actually runs (the logical date itself corresponds to data_interval_start).
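As a rough illustration of passing the dates in explicitly, the sketch below is plain Python with no Airflow dependency. The function name, the events table, and the ts column are assumptions made up for this example; in a real DAG the two arguments would be filled from the data_interval_start and data_interval_end template variables.

```python
from datetime import datetime

def build_extract_query(data_interval_start, data_interval_end):
    """Build a query covering exactly one data interval.

    The table name 'events' and the column 'ts' are hypothetical.
    Because the window comes from the arguments and not from the
    wall clock, re-running the same interval gives the same query.
    """
    return (
        "SELECT * FROM events "
        f"WHERE ts >= '{data_interval_start.isoformat()}' "
        f"AND ts < '{data_interval_end.isoformat()}'"
    )

query = build_extract_query(datetime(2016, 1, 1), datetime(2016, 1, 2))
print(query)
```

Since nothing in the function reads the current time, it does not matter when the run is actually triggered, which is usually what people want when they ask for the execution date to match the trigger time.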
I have a DAG that has run tasks for over a decade of execution dates. Now I need to add another year to the beginning. I googled a little, and the recommendation was to do this under a new dag_id. Because the old DAG has already run for that execution date range, I want to mark those dates as a success in the new DAG. How can I achieve this in a convenient way?
Thanks in advance. Have a nice start to this week.
Airflow's backfill feature is designed to do exactly what you're trying to do.
That said, not everyone likes using the feature. For example, if your dag is ordinarily an hourly job, backfilling several years of data in hourly batches might be really inefficient.
So for various reasons, creating a temporary "backfill" dag is not a bad way to go.
And to be clear, by "backfill" dag I mean a dag you use for the purpose of backfilling, without using Airflow's backfill feature.
For your "backfill" dag, use the DAG parameters start_date and end_date to control the range of execution_date a dag will create dag runs for.
Then after your "backfill" dag is done with all its runs, you can delete it. Airflow won't know the old task instances are now backfilled, but you may not care about that. If you do, you can update the dag_id manually in the metastore database. And otherwise, your "old" dag has correct metadata for more recent periods.
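To see which execution dates such a temporary dag would cover, here is a minimal sketch in plain Python. It assumes a fixed daily interval and that start_date and end_date are both inclusive; the helper name and the 2009 date range are made up for illustration, and this is not Airflow's scheduler code.

```python
from datetime import datetime, timedelta

def execution_dates(start_date, end_date, schedule_interval=timedelta(days=1)):
    """List the execution dates a dag bounded by start_date and end_date
    would cover. Assumes a fixed interval and inclusive bounds; this is
    an illustration, not Airflow's own logic."""
    dates = []
    current = start_date
    while current <= end_date:
        dates.append(current)
        current += schedule_interval
    return dates

# Hypothetical range: one extra year in front of the old dag's history.
extra_year = execution_dates(datetime(2009, 1, 1), datetime(2009, 12, 31))
print(len(extra_year), extra_year[0].date(), extra_year[-1].date())
```

Checking the first and last date this way before enabling the dag is a cheap guard against overlapping with the range the old dag already covers.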
I am in a situation where I have started receiving data daily at a certain time, and I have to create an ETL for that data.
Meanwhile, I am still creating the DAGs that will schedule the tasks in Airflow, and the data keeps arriving daily. So when I start running my DAGs today, I want to schedule them daily and also backfill all the data from the past days that I missed while I was creating the DAGs.
I know that if I set start_date to the date from which the data started arriving, Airflow will start backfilling from that date, but in that case, won't my DAGs always be behind the current day? How can I achieve backfilling and scheduling at the same time? Do I need to create separate DAGs/tasks for backfill and scheduling?
There are several things you need to consider.
1. Is each day's data independent, or does the next run depend on the previous run?
If the data depends on the previous state, you can run a backfill in Airflow.
How does backfilling work in Airflow?
Airflow gives you the facility to run past DAG runs. The process of running past DAG runs is called backfill. Backfilling lets Airflow set the status of all DAG runs since the DAG's inception.
I know that if I put start_date as the date from which the data
started arriving airflow will start backfilling from that date, but
wouldn't in that case, my DAGs will always be behind of current day?
Yes, setting a past start_date is the correct way of backfilling in Airflow.
No. If you use the Celery executor, the jobs will run in parallel and will eventually catch up to the current day, depending on how long each run takes.
How can I achieve backfilling and scheduling at the same time? Do I
need to create separate DAGs/tasks for backfill and scheduling?
You do not need to do anything extra to achieve scheduling and backfilling at the same time; Airflow will take care of both depending on your start_date.
Finally, if this is going to be a one-time task, I recommend processing your data manually outside Airflow, which will give you more control over the execution,
and then either marking the backfilled tasks as succeeded or doing one of the following:
Run an airflow backfill command like this: airflow backfill -m -s "2016-12-10 12:00" -e "2016-12-10 14:00" users_etl.
This command will create task instances for every scheduled run from 12:00 PM to 02:00 PM and mark them as success without executing the tasks at all. Ensure that you set depends_on_past to False, as it will make this process a lot faster. When you're done, set it back to True.
Or
Even simpler, set the start_date to the current date.
When clearing a task of a DAG for January and February 2019, I noticed that all tasks of this DAG that did not exist at the time were triggered.
I'm wondering why this happens. I suppose the scheduler is kind of "forced" to look at the DAG runs of January and February, and because the tasks that did not exist at the time never ran for these execution dates, they get triggered. But I'd like to put concrete words on this vague understanding of the situation.
Can I avoid this? This creates unexpected behavior and has me doubting before launching a big replay of a month that is long past :)
We have also encountered this problem, and I think it makes sense. As the Airflow documentation states:
Once you clear a DAG, it will be cleared as if it never ran.
So in my understanding, it will check every DAG run and task instance all over again and run all the tasks until it reaches the current schedule time.
Can I avoid this? I'm no Airflow expert, but I think that as of now, we can't. What we normally do is duplicate the DAG we want to rerun and set start_date and end_date, so it does not interfere with the current DAG that is running normally.
edit: I figured out my problem. I didn't understand the difference between triggering a run, which makes it run immediately, and keeping the DAG on and letting it do its job. The code is fine.
I wrote this simple program to figure out airflow. On the hour it is supposed to print to a file "hello world", but it's doing it immediately. Does someone see where I am going wrong?
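from datetime import datetime
from airflow import DAG
# Airflow 1.x import path; in Airflow 2 the module is airflow.operators.python
from airflow.operators.python_operator import PythonOperator

def print_hello():
    # Append a greeting to the file each time the task runs.
    with open('helloword.txt', 'a') as f:
        f.write('Hello World!')

# Run once per hour; with catchup=False only the most recent interval is scheduled.
dag = DAG('hello_world', description='Simple tutorial DAG', schedule_interval='@hourly',
          start_date=datetime(2018, 5, 31), catchup=False)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)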
The start date is 2018-05-31 and the schedule interval is @hourly, so the execution date for the first run would normally be 2018-05-31T00:00:00, with the run actually starting at or shortly after 2018-05-31T01:00:00.
In this case, you have set catchup to false, so instead only the most recent DAG run will be created. I would expect that DAG run created to be 2018-05-31T21:00:00 right now.
The current UTC time is 2018-05-31T22:00:00 right now. Since the start date timestamp 2018-05-31T00:00:00 is in the past, the Airflow scheduler will schedule and start the task immediately.
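The 21:00 execution date mentioned above can be reproduced with a small sketch in plain Python. It assumes an @hourly schedule aligned to the top of the hour and catchup disabled, so only the most recent completed interval gets a run; the helper name is hypothetical and not part of Airflow.

```python
from datetime import datetime, timedelta

def latest_hourly_execution_date(now):
    """Execution date of the most recent completed hourly interval.
    Hypothetical helper; assumes an @hourly schedule aligned to the
    top of the hour."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour - timedelta(hours=1)

print(latest_hourly_execution_date(datetime(2018, 5, 31, 22, 0)))  # 2018-05-31 21:00:00
```

Because that interval (21:00 to 22:00) has already ended, the run is eligible right away, which is why the task appears to fire immediately.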
You can delete the DAG runs and task instances and then change the start date to 2018-06-01 if you want it to start fresh tomorrow. It would not run immediately in this case if you choose a start date in the future.
You can find a bit more info about how the scheduler works here:
Airflow Wiki > Scheduler Basics
Airflow Docs > Scheduling & Triggers > To Keep in Mind
Your code looks fine to me. Do you see lines appended to the file even if you turn your DAG off?
I think what you're seeing is the backfill executions running. You put your start date today, implicitly at midnight. Airflow will therefore catch up and fire up these DAG runs first before eventually running your task every hour.