For example, I created a new DAG at 11:30. The DAG is supposed to run at the first minute of every hour (cron: 0 * * * *). I expected the DAG to first run at 12:00; however, it first ran at 13:00. The second run started at 14:00, while "Next run" in the UI showed 13:00.
I know that; it's mentioned in the docs:
**Let’s Repeat That**, the scheduler runs your job one schedule AFTER the start date, at the END of the interval.
But I don't understand the point of doing that. Why not run at the start of the interval, like a cron job does?
Does anyone have documentation explaining this behavior?
thanks
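The behavior in the question can be sketched in plain Python (no Airflow needed; the dates are the ones from the example above, with an illustrative calendar day):

```python
from datetime import datetime, timedelta

# Sketch of Airflow's interval semantics: a run is stamped with the
# *start* of its data interval but only fires once the interval has ended.
schedule = timedelta(hours=1)                  # cron "0 * * * *"
dag_created = datetime(2021, 1, 1, 11, 30)     # DAG created at 11:30

# The first whole interval after creation is [12:00, 13:00).
first_run_stamp = datetime(2021, 1, 1, 12, 0)  # execution_date shown in UI
first_run_fires = first_run_stamp + schedule   # 13:00, the observed start

second_run_stamp = first_run_stamp + schedule  # 13:00 ("Next run" in UI)
second_run_fires = second_run_stamp + schedule # 14:00, the observed start
```

This matches the observation: the run labeled 12:00 fires at 13:00, and "Next run: 13:00" actually starts at 14:00.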
Related
I have an Airflow DAG that runs every 10 minutes. My crontab expression looks like this: */10 1-8,10-23 * * *. This currently pauses the DAG for 1 hour (the 9th hour UTC). I want to change this to pause the DAG for 45 minutes, from 12:30 to 1:15 AM UTC.
Is there any way I can achieve this using Airflow? I read about the timetable concept in Airflow 2.2, but I am not sure how exactly this can be achieved.
My options are (not sure which of these are possible):
Use a cron tab expression that can handle it
Create a separate DAG to pause jobs for 45 minutes (not sure how to do this for 45 minutes; I know how to pause using a PythonOperator, but am confused about setting up the pause time)
Has anybody done something similar, or have any pointers on how to achieve this?
Any pointers are much appreciated!
I am using AWS managed airflow.
The DAGs are set to a 1-minute interval, but we need to set them up with a 30-second interval. Is there any way to achieve this in Airflow?
If you check the schedule_interval documentation, you have two choices:
preset: The minimum period is hourly
cron expression: The standard cron expression doesn't handle seconds
So achieving this with the Airflow schedule interval is not possible, but you can handle it in your code. For example, you can create two DAGs with the same schedule interval, and have one of them wait until second 30 before running its code.
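One way to implement the "wait until second 30" idea in the second DAG is a small delay helper called at the top of its first task. This is a sketch: the function name is hypothetical, and exact sub-second timing is not guaranteed.

```python
import time
from datetime import datetime

def delay_until_second(target_second, now=None):
    """Seconds to sleep so the clock reaches `target_second` of the minute."""
    now = now or datetime.now()
    return (target_second - now.second) % 60

# In the second DAG's first task (e.g. inside a PythonOperator callable):
#     time.sleep(delay_until_second(30))
#     ... real work ...
```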
I don't understand what the use is for specifying a DAG start_date in the past. I've read about catchup and backfill but I still don't get it. In what context would I want to specify a start_date in the past?
For a scheduled run, the Airflow scheduler waits for the completion of the interval time period before running your DAG.
For instance, say you want to run your DAG on a monthly basis and schedule it as 0 3 11 * *, which means running your DAG at 3 AM on the 11th day of the month.
Now, say you deployed your DAG on the 10th day of January 2021. You would expect it to run the next day, but in reality Airflow won't trigger your DAG until next month, i.e. 11th Feb 2021. So Airflow will wait about one month before actually triggering the DAG that was supposed to run on 11th Jan 2021.
In this scenario, when you deploy your DAG you can set start_date to 10th Dec 2020, so that when the actual day (11th Jan 2021) comes, the scheduler will consider the interval time period complete and trigger your DAG.
For more reference, you can read: https://www.astronomer.io/guides/scheduling-tasks
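The back-dating trick can be sketched in plain Python (no Airflow needed; the hard-coded tick list stands in for the cron schedule 0 3 11 * *):

```python
from datetime import datetime

def first_run(start_date, ticks):
    """Return (execution_date, fire_time) of the first scheduled run.

    A run stamped with one schedule tick only fires at the *next* tick,
    i.e. once its interval has completed.
    """
    future = [t for t in ticks if t >= start_date]
    return future[0], future[1]

# 3 AM on the 11th of each month (cron: 0 3 11 * *)
ticks = [datetime(2020, 12, 11, 3), datetime(2021, 1, 11, 3), datetime(2021, 2, 11, 3)]

# Deployed on 10 Jan 2021 with start_date = deployment day:
# the run stamped 11 Jan fires only on 11 Feb.
# Back-dating start_date to 10 Dec 2020: the run stamped 11 Dec has
# already completed its interval, so a run fires on 11 Jan.
```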
I don't understand why we need a start_date for the operators (task instances). Shouldn't the one we pass to the DAG suffice?
Also, if the current time is 7th Feb 2018, 8:30 AM UTC, and I now set the start_date of the DAG to 7th Feb 2018, 0:00 AM, with my cron expression for schedule_interval being 30 9 * * * (daily at 9:30 AM, i.e. expecting it to run within the next hour), will my DAG run today at 9:30 AM or tomorrow (8th Feb at 9:30 AM)?
Regarding start_date on a task instance: personally I have never used this; I always have just a single DAG start_date.
However, from what I can see, this would allow you to have certain tasks start at a different time from the main DAG. It appears to be a legacy feature, and from reading the FAQ, they recommend using time sensors for that type of thing instead, and having one start_date for all tasks, passed through the DAG.
Your second question:
The execution date for a run is always the previous period based on your schedule.
From the docs (Airflow Docs)
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
To clarify:
If set to a daily schedule, the run triggered on the 8th will have an execution date of the 7th.
If set to a weekly schedule running on Sundays, the run triggered this Sunday will have last Sunday as its execution date.
Some complex requirements may need specific timings at the task level. For example, I may want my DAG to run each day for a full week before some aggregation logging task starts running, so to achieve this I could set different start dates at the task level.
A bit more useful info: looking through the Airflow DAG class source, it appears that setting start_date at the DAG level simply means it is passed through to any task that has no start_date of its own, whether from the default_args dict passed to the DAG or from a per-task setting. So for any case where you want all tasks in a DAG to kick off at the same time (dependencies aside), setting start_date at the DAG level is sufficient.
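That precedence can be sketched as a tiny helper (a simplified model for illustration, not Airflow's actual code):

```python
def resolve_start_date(task_start=None, default_args=None, dag_start=None):
    """Task-level start_date wins, then default_args, then the DAG's."""
    if task_start is not None:
        return task_start
    if default_args and default_args.get("start_date") is not None:
        return default_args["start_date"]
    return dag_start
```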
Just to add to what is already here: a task that depends on other task(s) must have a start date >= the start date of its dependencies.
For example:
if task_a depends on task_b
you cannot have
task_a start_date = 1/1/2019
task_b start_date = 1/2/2019
Otherwise, task_a will not be runnable for 1/1/2019, since task_b will not run for that date and you cannot mark it as complete either.
Why would you want this?
I would have liked this logic for a task that was an external task sensor waiting for the completion of another DAG. But the other DAG had a start date after the current DAG's, so I didn't want the dependency in place for days when the other DAG didn't exist.
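The start-date constraint described above can be sketched as a small validation helper (the helper name is hypothetical):

```python
from datetime import date

def start_dates_consistent(task_start, dependency_starts):
    """A task must not start earlier than any task it depends on."""
    return all(task_start >= dep for dep in dependency_starts)

# task_a depends on task_b: task_a starting 1 Jan while task_b starts
# 2 Jan is invalid, because task_b has no run covering 1 Jan.
```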
The likely cause is not setting the dag parameter of your tasks, as explained in:
https://stackoverflow.com/a/61749549/1743724
I am looking to run a console application, triggered from Autosys, every X minutes.
The following attributes do not seem to provide this capability:
start_times: Exact time each day a job will run (cannot be used with start_mins)
start_mins: Minutes after each hour a job will execute (cannot be used with start_times)
The solution that I can see at the moment is to set start_mins : 0,5,10,15,20,25,30,35,40,45,50,55
This is OK if the time interval is 5 minutes, but it becomes a little cumbersome if the interval is 1 or 2 minutes.
Is there any way to configure Autosys to easily repeat a job every x minutes ?
There is only one way for Autosys to start a job every minute:
start_mins: 0,1,2..59
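As a sketch, that full enumeration might sit in a JIL job definition like this (job name, command, and machine are placeholders; check your Autosys version's JIL reference for exact syntax):

```
/* Hypothetical JIL definition: command job that runs every minute. */
insert_job: every_minute_job
job_type: c
command: /path/to/console_app
machine: some_machine
start_mins: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59
```

Every minute must be listed explicitly, which illustrates how cumbersome the start_mins approach becomes at small intervals.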