Scheduling in Airflow can be tricky and is certainly different from plain cron scheduling. I want to schedule a DAG on two specific days per month, which is possible with the following cron expression: 0 0 12,25 * *
I tried setting the schedule interval to this cron expression, i.e. schedule_interval='0 0 12,25 * *', but that doesn't appear to work.
I am wondering if this kind of schedule is supported by Airflow within a single DAG? What approach could I take to schedule a DAG on two specific days each month?
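For reference, a minimal sketch of a DAG using exactly that cron string (the dag_id, task, and dates are illustrative assumptions, with Airflow 1.10-style imports):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# "0 0 12,25 * *" fires at midnight on the 12th and the 25th of every month.
dag = DAG(
    dag_id="twice_monthly_example",    # made-up name
    start_date=datetime(2021, 1, 1),   # a fixed date in the past
    schedule_interval="0 0 12,25 * *",
    catchup=False,
)

report = DummyOperator(task_id="report", dag=dag)
```

If a plain cron string in schedule_interval still does not take effect, it is worth checking that the DAG is unpaused and that the scheduler has re-parsed the file.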
Let's say you have 2 scripts: Daily_summary.py and Weekly_summary.py.
You could create 2 separate DAGs with daily and weekly schedules, but is it possible to solve this with 1 DAG?
I've tried a daily schedule, and simply putting this at the bottom (simplified):
if datetime.today().strftime('%A') == 'Sunday':
    SSHOperator(run weekly_summary.py)
But the problem is that if the task is still running on Sunday at midnight, Airflow will terminate it, since the operator no longer exists once the DAG file is re-parsed on Monday.
If I could somehow get the execution date's day of the week, that would solve it. But the Jinja template '{{ ds }}' is not actually a 'yyyy-mm-dd' string at DAG-definition time, so I cannot convert it to a date with the datetime package; it is only rendered into a real date after the Airflow script is executed.
You should dynamically generate two DAGs, but you can reuse the same code for both. This is the power of Airflow: a DAG file is Python code, so you can easily use the same code to generate the same DAG "structure" with two different dag_ids and two different schedules.
See this nice article from Astronomer with some examples: https://www.astronomer.io/guides/dynamically-generating-dags
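A rough sketch of that pattern applied to the daily/weekly example above (the builder function, schedules, and bash command are assumptions, not the article's exact code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def build_summary_dag(dag_id, schedule, script):
    """Build one DAG; the same code serves both schedules."""
    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2021, 1, 1),
        schedule_interval=schedule,
        catchup=False,
    )
    BashOperator(
        task_id="run_summary",
        bash_command="python {}".format(script),
        dag=dag,
    )
    return dag

# Assign to module-level names so the scheduler discovers both DAGs.
daily_summary = build_summary_dag("daily_summary", "@daily", "Daily_summary.py")
weekly_summary = build_summary_dag("weekly_summary", "0 0 * * 0", "Weekly_summary.py")
```

Because the weekly DAG has its own schedule, there is no parse-time day-of-week check, and a run started on Sunday keeps its operators however long it takes.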
I am using Airflow 1.10.10 and wanted to know how to change an Airflow DAG's schedule. I checked online, and in most of the comments it is suggested that to change the schedule of a DAG, you should create a new DAG with a new dag_id, or change the dag_id of the existing DAG, and give it the new schedule_interval. An attempt to change the schedule of an existing DAG will supposedly not work in a straightforward manner and will throw an error or might create scheduling errors.
However, I tried to reproduce the scenario where changing my DAG's schedule leads to erroneous cases, by changing only the schedule_interval in the DAG file. I tried the following schedule changes in my DAG, and everything worked as expected: the schedule was changed properly and no erroneous case was found.
Started with @daily
Changed to 10 min
Changed to 17 min
Changed to 15 min
Changed to 5 min
Can someone please clarify what kind of problem may arise if we change the schedule_interval in a DAG without changing its ID?
I do see this recommendation on the old Airflow Confluence page on Common Pitfalls.
When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention: my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc...
Changing schedule interval always requires changing the dag_id, because previously run TaskInstances will not align with the new schedule interval
Changing start_date without changing schedule_interval is safe, but changing to an earlier start_date will not create any new DagRuns for the time between the new start_date and the old one, so tasks will not automatically backfill to the new dates. If you manually create DagRuns, tasks will be scheduled, as long as the DagRun date is after both the task start_date and the dag start_date.
I don't know the author's intent, but I imagine changing the schedule_interval can cause confusion for users: when they revisit these tasks, they will wonder why the current schedule_interval does not match past task executions, because that information is not stored at the task level.
Changing the schedule_interval does not impact past dagruns or tasks. The change will affect when new dagruns are created, which impacts the tasks within those dagruns.
I personally do not modify the dag_id when I update a DAG's schedule_interval, for two reasons.
If I keep the previous DAG, I am unnecessarily putting more stress on the scheduler by making it process a DAG that will never be turned on.
If I do not keep the previous DAG, I essentially lose all the history of the DagRuns from when it had a different schedule_interval.
Edit: Looks like there is a GitHub issue created to move the Common Pitfalls page, but it is stale.
I have 2 questions about scheduling a task:
I would like to schedule a task to run at 12pm on the first business day of every month. Is that possible?
If the first run doesn't produce a file, a second task should be scheduled to run at the end of that day.
DAG A has schedule '0 6 * * *'.
DAG B has schedule '*/5 * * * *'.
However, DAG B should only start running for that day once DAG A has completed for that day.
I've played around with SubDags and ExternalTaskSensor but haven't yet found a satisfactory solution and I'm sure I'm missing something good. Recommendations?
Edit: say DAG A is my ETL. DAG B has some tasks that query my database and require that data to be up-to-date. DAG B gets run throughout the day, but only once the ETL is completed.
I can see using ShortCircuitOperator, for example, and having the condition be "DAG A that ran today is completed." But how could I write this condition?
This question is not an exact duplicate but is similar to another which already has 3 good answers: Scheduling dag runs in Airflow.
I recommend reading through all of them, but to summarize the info in the answers over there, there are several viable options for the use case of a DAG dependent upon another DAG:
TriggerDagRunOperator
BranchPythonOperator
ShortCircuitOperator
SubDagOperator / SubDAGs
With any of these options, you may also want to experiment with trigger rules
External triggers (possibly less relevant for your use case)
If you can add more detail around the use case you're trying to accomplish, I could give more specific guidance as well.
Use TriggerDagRunOperator to have one DAG trigger another after it completes. Refer to this question. I'm afraid I cannot provide a satisfactory example, as I have not used it myself yet.
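As a hedged, untested sketch only (not from the answer above): in Airflow 1.10, the last task of the upstream DAG might hand off to the downstream one roughly like this, where "dag_a" and "dag_b" are placeholder dag_ids and DAG B would set schedule_interval=None so it only runs when triggered:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator

dag_a = DAG(
    dag_id="dag_a",                  # placeholder id of the upstream DAG
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",
    catchup=False,
)

# Last task in DAG A: create a run of DAG B once A's own work is done.
trigger_b = TriggerDagRunOperator(
    task_id="trigger_dag_b",
    trigger_dag_id="dag_b",          # placeholder id of the downstream DAG
    dag=dag_a,
)
```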
I've read Airflow's FAQ about "What's the deal with start_date?", but it still isn't clear to me why using a dynamic start_date is discouraged.
To my understanding, a DAG's execution_date is determined by the minimum start_date across all of the DAG's tasks, and subsequent DAG Runs are run at the latest execution_date + schedule_interval.
If I set my DAG's default_args start_date to be for, say, yesterday at 20:00:00, with a schedule_interval of 1 day, how would that break or confuse the scheduler, if at all? If I understand correctly, the scheduler would trigger the DAG with an execution_date of yesterday at 20:00:00, and the next DAG Run would be scheduled for today at 20:00:00.
Is there some concept that I'm missing?
The first run would be at start_date + schedule_interval. Airflow does not run the DAG at start_date; it always runs at start_date + schedule_interval.
As the documentation mentions, if you give a dynamic start_date, e.g. datetime.now(), together with some schedule_interval (say 1 hour), that run will never execute: now() moves along with the clock, so datetime.now() + 1 hour is never reached.
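To make the arithmetic concrete, a tiny stdlib-only sketch (the dates are invented for illustration):

```python
from datetime import datetime, timedelta

# With a fixed start_date, the scheduler fires the first run at
# start_date + schedule_interval, and that run carries
# execution_date == start_date.
start_date = datetime(2021, 1, 1, 20, 0, 0)  # "yesterday at 20:00"
schedule_interval = timedelta(days=1)

first_run_fires_at = start_date + schedule_interval
print(first_run_fires_at)  # 2021-01-02 20:00:00

# With a dynamic start_date like datetime.now(), the target
# now() + schedule_interval is always in the future, so the first run
# never fires.
```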
The scheduler expects to see a constant start_date and schedule_interval. If you change them, the scheduler might not notice until it reloads the DagBag, and if the new start date doesn't line up with the old schedule, it can break depends_on_past behavior.
If you don't need depends_on_past, the simplest approach might be to stop using the scheduler: set the start_date to some arbitrary old date and externally trigger the DAG however you like, using a crontab or similar.
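A sketch of that externally-triggered setup (the names are illustrative assumptions):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="externally_triggered_example",  # made-up name
    start_date=datetime(2015, 1, 1),        # arbitrary old, fixed date
    schedule_interval=None,                 # the scheduler never creates runs
)

work = DummyOperator(task_id="work", dag=dag)
```

A crontab entry can then create runs on whatever cadence you like, e.g. via the 1.10 CLI: `airflow trigger_dag externally_triggered_example`.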