My Airflow webserver is up and running, and my other jobs run as scheduled.
I added a new DAG that should execute every 5 minutes.
Once added, I ran it manually the first time and it completed. However, after that it is not running again every 5 minutes.
The DAG code is below:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

current_date = datetime.now()

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2019, 6, 11, current_date.hour, current_date.minute),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

dag = DAG("Incremental", default_args=default_args, schedule_interval='*/5 * * * *')
Any suggestions, please?
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended. (Based on the Airflow docs.)
In your case, with a start date of 2019-01-01 00:00:00 and a 5-minute interval, you might expect Airflow to run at 2019-01-01 00:05:00, but the run stamped 00:05:00 will only trigger after 2019-01-01 00:10:00, because Airflow waits for the 5-minute interval the run covers to finish (this is how I imagine it). Hope this helps. :)
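For illustration, here is a minimal sketch of the DAG above with a fixed start_date (the Airflow FAQ recommends a static date rather than datetime.now(), which makes the interval arithmetic a moving target; the date used here is illustrative):
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2019, 6, 10),  # fixed date in the past, not datetime.now()
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

# each run is stamped with the start of its 5-minute window
# and fires once that window has ended
dag = DAG("Incremental", default_args=default_args, schedule_interval="*/5 * * * *")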
I'm using Airflow 2.3.3 (through GCP Composer).
I pass this YAML configuration when deploying my DAG:
dag_args:
  dag_id: FTP_DAILY
  default_args:
    owner: 'Dev team'
    start_date: "00:00:00"
  max_active_runs: 1
  retries: 2
  schedule_interval: "0 7 * * *"
  ftp_conn_id: 'ftp_dev'
I want this DAG to run at 7am UTC every morning, but it's not running. In the UI it says next run: 2022-11-22, 07:00:00 (as of Nov 22nd) and it never runs. How should I configure my start_date and schedule_interval so that the DAG runs at 7am UTC every day, starting from the nearest 7am after the deployment?
You can pass default args directly in the Python DAG code and calculate yesterday's date, for example:
from datetime import timedelta

from airflow.utils.dates import days_ago

dag_default_args = {
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
    'start_date': days_ago(1)
}
Then in the DAG:
import airflow

with airflow.DAG(
        "dag_name",
        default_args=dag_default_args,
        schedule_interval="0 7 * * *") as dag:
    ...
In this case the cron schedule_interval will work correctly: Airflow bases the cron schedule on the start date.
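If you'd rather avoid days_ago (it is deprecated in recent Airflow releases), a fixed start_date in the past plus catchup=False gives the same "start from the next 7am" behaviour. A minimal sketch, assuming Airflow 2.x with pendulum available (the dag_id and date are illustrative):
import pendulum
from airflow import DAG

with DAG(
        "dag_name",
        start_date=pendulum.datetime(2022, 11, 20, tz="UTC"),  # any fixed date in the past
        schedule_interval="0 7 * * *",
        catchup=False,  # skip the backlog; only the most recent interval runs
) as dag:
    ...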
The main concept of Airflow is that the execution of a DAG starts after the required interval has passed. If you schedule a DAG with the above setup, Airflow will parse
interval_start_date as 2022-11-22 07:00:00
and interval_end_date as 2022-11-23 07:00:00.
As you are asking Airflow to fetch data for this interval, it will wait for the interval to pass, and thus start execution on 23rd November at 7am.
If you want it to trigger immediately after you deploy the DAG, you need to move the start date back by one day. You may also need to set the catchup flag to True:
import pendulum

from airflow import DAG

with DAG(
    dag_id='new_workflow4',
    schedule_interval="0 7 * * *",
    start_date=pendulum.datetime(2022, 11, 21, hour=0, tz="UTC"),
    catchup=True
) as dag:
    ...
I created a DAG that will run on a weekly basis. Below is what I tried and it's working as expected.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

SCHEDULE_INTERVAL = timedelta(weeks=1, seconds=00, minutes=00, hours=00)

default_args = {
    'depends_on_past': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=2),
    'wait_for_downstream': True,
    'provide_context': True,
    'start_date': datetime(2020, 12, 20, hour=00, minute=00, second=00)
}

with DAG("DAG", default_args=default_args, schedule_interval=SCHEDULE_INTERVAL, catchup=True) as dag:
    t1 = BashOperator(
        task_id='dag_schedule',
        bash_command='echo DAG',
        dag=dag)
As per the schedule, it ran on the 27th (i.e., the 20th in the script). As the requirement changed, I then updated the start date to the 30th (i.e., the 23rd in the script) instead of the 27th (my idea being to start the schedule from the 30th and run weekly from there onwards). But when I changed the start date from the 27th to the 30th, the DAG did not pick up the latest start date, and I am not sure why. When I deleted the DAG (as it is a test DAG I deleted it; in prod I can't delete it) and created a new DAG with the same name and the latest start date, i.e. the 30th, it ran as per the schedule.
As per the Airflow docs:
When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention: my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc...
Changing schedule interval always requires changing the dag_id, because previously run TaskInstances will not align with the new schedule interval
Changing start_date without changing schedule_interval is safe, but changing to an earlier start_date will not create any new DagRuns for the time between the new start_date and the old one, so tasks will not automatically backfill to the new dates. If you manually create DagRuns, tasks will be scheduled, as long as the DagRun date is after both the task start_date and the dag start_date.
So if we change the start date, we need to change the DAG name or delete the existing DAG, so that it is recreated with the same name (the metadata related to the previous DAG is then deleted from the metadata database).
Source
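A minimal sketch of that convention (the dag_id and dates are illustrative, not from the question): bump the dag_id whenever the start_date or schedule changes, so previously run task instances stay aligned with the old schedule:
from datetime import datetime, timedelta

from airflow import DAG

# my_dag_v1 ran weekly from 2020-12-20; v2 restarts the schedule from 2020-12-23
with DAG("my_dag_v2",
         start_date=datetime(2020, 12, 23),
         schedule_interval=timedelta(weeks=1),
         catchup=True) as dag:
    ...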
Your DAG, as you defined it, will be triggered on 6-Jan-2021.
Airflow schedules tasks at the END of the interval (see the doc reference).
So per your settings:
SCHEDULE_INTERVAL = timedelta(weeks=1, seconds=00, minutes=00, hours=00)
and
'start_date': datetime(2020, 12 , 30, hour=00, minute=00, second=00)
This means the first run will be on 6-Jan-2021, because 30-Dec-2020 + 1 week = 6-Jan-2021. Note that the execution_date of this run will be 2020-12-30.
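The arithmetic, spelled out as a quick sketch in plain Python (no Airflow needed):
from datetime import datetime, timedelta

start = datetime(2020, 12, 30)
interval = timedelta(weeks=1)

first_execution_date = start           # the run is stamped 2020-12-30
first_trigger_time = start + interval  # but it only fires on 2021-01-06
print(first_execution_date, first_trigger_time)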
I am testing a simple DAG that should run on a schedule, at 6am UTC every Friday and Saturday ('0 6 * * 5,6').
But the DAG did not trigger at 6am on Friday.
I know that Friday's instance will run on Saturday, and Saturday's on the following Friday.
What can I do so that Friday's instance runs on Friday itself? Or is there any workaround?
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def create_txt():
    with open("/home/abc/test1.txt", "w+") as f:
        for i in range(10):
            f.write("This is line %d\r\n" % (i + 1))

default_args = {
    'owner': 'abc',
    'depends_on_past': False,
    'start_date': datetime(2020, 6, 24),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'catchup': False
}

with DAG('python_test',
         default_args=default_args,
         schedule_interval='0 6 * * 5,6'
         ) as dag:
    create_txt = PythonOperator(task_id='python_test',
                                python_callable=create_txt)
Check the scheduler health.
Try increasing the scheduler timeout.
Make sure there are no other DAGs importing the same modules at around the same time.
Consider importing the os and sys modules.
If the DAG is going to be executed for the first time, consider giving it a manual first run.
I have a backfill DAG, which is scheduled to run yearly from 01-01-2012 to 01-01-2018, but it runs only from 01-01-2012 until 01-01-2017. Why is it not running until 01-01-2018, and how can I make it run until 2018?
Here is the code that I have used in the DAG:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2012, 1, 1),
    'end_date': datetime(2018, 1, 1),
    'email': ['sef12@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = DAG(
    dag_id='SAMPLE_LOAD',
    schedule_interval='@yearly',
    default_args=default_args,
    catchup=True,
    max_active_runs=1,
    concurrency=1)
This is due to how Airflow handles scheduling. From the docs:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's repeat that: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Your run for 2018 will start once 2018 is over, since that's the end of the interval.
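To see when each yearly run actually fires, here is a quick sketch in plain Python (dateutil is assumed to be installed; it ships as an Airflow dependency):
from datetime import datetime
from dateutil.relativedelta import relativedelta

start, end = datetime(2012, 1, 1), datetime(2018, 1, 1)
stamp = start
while stamp <= end:
    # a run is stamped with the start of the year it covers
    # and fires only once that year is over
    print("run stamped %s fires at %s" % (stamp.date(), (stamp + relativedelta(years=1)).date()))
    stamp += relativedelta(years=1)
The last line printed shows that the run stamped 2018-01-01 fires only at 2019-01-01.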
I want to try to use Airflow instead of cron, but schedule_interval doesn't work as I expected.
I wrote the Python code below.
In my understanding, Airflow should have run at "2016/03/30 8:15:00", but it didn't work at that time.
If I change it to 'schedule_interval': timedelta(minutes=5), it works correctly, I think.
The notice_slack.sh script just calls the Slack API for my channels.
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals

import os
from datetime import datetime, timedelta

from airflow.operators import BashOperator
from airflow.models import DAG

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 3, 29, 8, 15),
}

dag = DAG(
    dag_id='notice_slack',
    default_args=args,
    schedule_interval="@daily",
    dagrun_timeout=timedelta(minutes=1))

# cmd file name
CMD = '/tmp/notice_slack.sh'

run_this = BashOperator(
    task_id='run_transport', bash_command=CMD, dag=dag)
I want to run some of my scripts at a specific time every day, like this cron setting:
15 08 * * * bash /tmp/notice_slack.sh
I have read the Scheduling & Triggers document, and I know it's a little different from cron.
So I attempted to arrange the "start_date" and "schedule_interval" settings.
Does anyone know what I should do?
airflow version
INFO - Using executor LocalExecutor
v1.7.0
amazon-linux-ami/2015.09-release-notes
Try this:
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals

import os
from datetime import datetime, timedelta

from airflow.operators import BashOperator
from airflow.models import DAG

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 3, 29),
}

dag = DAG(
    dag_id='notice_slack',
    default_args=args,
    schedule_interval="15 08 * * *",
    dagrun_timeout=timedelta(minutes=1))

# cmd file name
CMD = 'bash /tmp/notice_slack.sh'

run_this = BashOperator(
    task_id='run_transport', bash_command=CMD, dag=dag)
start_date (datetime) – The start_date for the task, determines the execution_date for the first task instance. The best practice is to have the start_date rounded to your DAG’s schedule_interval.
schedule_interval (datetime.timedelta or dateutil.relativedelta.relativedelta or str that acts as a cron expression) – Defines how often that DAG runs, this timedelta object gets added to your latest task instance’s execution_date to figure out the next schedule.
Simply configure the schedule_interval and bash_command to match your cron setting.
Airflow will start your DAG once 2016/03/30 8:15:00 + the schedule interval (daily) has passed, so your DAG will run at 2016/03/31 8:15:00.
You can check the Airflow FAQ for details.
First, your start date should be in the past. Instead of
'start_date': datetime(2016, 3, 29, 8, 15)
try
'start_date': datetime(2016, 2, 29, 8, 15)
and apply 'catchup': False to prevent backfills, unless backfilling is something you wanted.
From Airflow documentation -
The Airflow scheduler triggers the task soon after the start_date + schedule_interval is passed.
The schedule interval can be supplied as a cron expression:
If you want to run it every day at 8:15 AM, the expression would be '15 8 * * *'
If you want to run it only on Oct 31st at 8:15 AM, the expression would be '15 8 31 10 *'
To supply this, set 'schedule_interval': '15 8 * * *' in your DAG properties.
You can figure this out more with https://crontab.guru/
Alternatively, there are Airflow presets (@once, @hourly, @daily, @weekly, @monthly, @yearly).
If any of these meet your requirements, it would simply be 'schedule_interval': '@hourly'.
Lastly, you can also supply the schedule as a Python timedelta object, e.g. to run every 12 hours:
'schedule_interval': timedelta(hours=12)
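Putting these suggestions together, a minimal sketch (imports follow the question's Airflow 1.x style; the dates are illustrative):
from datetime import datetime

from airflow.models import DAG

dag = DAG(
    dag_id='notice_slack',
    default_args={
        'owner': 'airflow',
        'start_date': datetime(2016, 2, 29),  # fixed date in the past
    },
    schedule_interval='15 8 * * *',  # every day at 08:15
    # catchup=False,  # on newer Airflow versions this skips the backfill runs
)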
With the example you've given, @daily will run your job right after midnight. You might try changing it to timedelta(days=1), which is relative to your fixed start_date that includes 08:15.
Or you could use a cron spec, schedule_interval='15 08 * * *', in which case any start date prior to 8:15 on the day BEFORE the day you wanted the first run would work.
Note that depends_on_past: False is already the default; you may have confused its behavior with catchup=False in the DAG parameters, which avoids creating past runs for the time between the start date and now where the DAG's schedule interval would have run.
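A sketch of the timedelta option above, assuming you keep the fixed start_date that already encodes 08:15 (imports follow the question's Airflow 1.x style):
from datetime import datetime, timedelta

from airflow.models import DAG

dag = DAG(
    dag_id='notice_slack',
    default_args={
        'owner': 'airflow',
        'start_date': datetime(2016, 3, 29, 8, 15),  # includes 08:15
    },
    # each run fires one interval after the previous execution_date,
    # so the 08:15 time of day is preserved
    schedule_interval=timedelta(days=1),
)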