How to work correctly airflow schedule_interval - airflow

I want to try to use Airflow instead of Cron.
But schedule_interval doesn't work as I expected.
I wrote the python code like below.
And in my understanding, Airflow should have ran on "2016/03/30 8:15:00" but it didn't work at that time.
If I changed it like this "'schedule_interval': timedelta(minutes = 5)", it worked correctly, I think.
The "notice_slack.sh" is just to call slack api to my channels.
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import os
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2016, 3, 29, 8, 15),
}
dag = DAG(
dag_id='notice_slack',
default_args=args,
schedule_interval="#daily",
dagrun_timeout=timedelta(minutes=1))
# cmd file name
CMD = '/tmp/notice_slack.sh'
run_this = BashOperator(
task_id='run_transport', bash_command=CMD, dag=dag)
I want to run some of my scripts at specific time every day like this cron setting.
15 08 * * * bash /tmp/notice_slack.sh
I have read the document Scheduling & Triggers, and I know it's a little bit different cron.
So I attempt to arrange at "start_date" and "schedule_interval" settings.
Does anyone know what should I do ?
airflow version
INFO - Using executor LocalExecutor
v1.7.0
amazon-linux-ami/2015.09-release-notes

Try this:
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import os
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2016, 3, 29),
}
dag = DAG(
dag_id='notice_slack',
default_args=args,
schedule_interval="15 08 * * *",
dagrun_timeout=timedelta(minutes=1))
# cmd file name
CMD = 'bash /tmp/notice_slack.sh'
run_this = BashOperator(
task_id='run_transport', bash_command=CMD, dag=dag)
start_date (datetime) – The start_date for the task, determines the execution_date for the first task instance. The best practice is to have the start_date rounded to your DAG’s schedule_interval.
schedule_interval (datetime.timedelta or dateutil.relativedelta.relativedelta or str that acts as a cron expression) – Defines how often that DAG runs, this timedelta object gets added to your latest task instance’s execution_date to figure out the next schedule.
Simply configuring the schedule_interval and bash_command as the same in your cron setting is okay.

Airflow will start your DAG when the 2016/03/30 8:15:00 + schedule interval (daily) is passed. So your DAG will run on 2016/03/31 8:15:00.
You can check the Airflow FAQ

First, your start date should be in the past -
Instead of 'start_date': datetime(2016, 3, 29, 8, 15)
Would you try 'start_date': datetime(2016, 2, 29, 8, 15)
and apply 'catchup':False to prevent backfills - unless this was something you wanted to do.
From Airflow documentation -
The Airflow scheduler triggers the task soon after the start_date + schedule_interval is passed.
The schedule interval can be supplied as a cron -
If you want to run it everyday at 8:15 AM, the expression would be - *'15 8 * * '
If you want to run it only on Oct 31st at 8:15 AM, the expression would be - *'15 8 31 10 '
To supply this, 'schedule_inteval':'15 8 * * *' in your Dag property
You can figure this out more from https://crontab.guru/
Alternatively, there are Airflow presets -
If any of these meet your requirements, it would be simply, 'schedule_interval':'#hourly'
Lastly, you can also apply the schedule as python timedelta object e.g. for 12 PM
'schedule_interval': timedelta(hours=12)

With the example you've given #daily will run your job after it passes midnight. You might try changing it either to timedelta(days=1) which is relative to your fixed start_date that includes 08:15.
Or you could use a cron spec for the schedule_interval='15 08 * * *' in which case any start date prior to 8:15 on the day BEFORE the day you wanted the first run would work.
Note that depends_on_past: False is already the default, and you may have confused its behavior with catchup=false in the DAG parameters, which would avoid making past runs for time between the start date and now where the DAG schedule interval would have run.

Related

When do I use a task vs a DAG?

I'm struggling to understand the difference between a task and a DAG and when to use one over the other. I know a task is more granular and called within a DAG, but so much of Airflow documentation mentions creating DAGs on the go or calling other DAGs instead of tasks. Is there any significant difference between using either of these two options?
A DAG is a collection of tasks with schedule information. Each task can perform different work based on our requirement. Let us consider below DAG code as an example. In the below code we are printing current time and then sending an e-mail notification after that.
#importing operators and modules
from airflow import DAG
from airflow.operators.python_operator import PythonOperator ##to call a python object
from airflow.operators.email_operator import EmailOperator ##to send email
from datetime import datetime,timedelta,timezone ##to play with date and time
import dateutil
#setting default arguments
default_args = {
'owner': 'test dag',
'depends_on_past': False,
'start_date': datetime(2021, 1, 1),
'email': ['myemailid#example.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 0
}
def print_time(**context):
now_utc = datetime.now(timezone.utc)
print("current time",now_utc)
with DAG('example_dag', schedule_interval='* 12 * * *', max_active_runs=1, catchup=False,default_args=default_args) as dag: ##dag name is 'example_dag'
current_time = PythonOperator(task_id='current_time', python_callable=print_time,
provide_context=True,
dag=dag) ##task to call print_time definition
send_email = EmailOperator(task_id='send_email', to='myemailid#example.com',
subject='DAG completed successfully',
html_content="<p>Hi,<br><br>example DAG completed successfully<br>", dag=dag) ## task to send email
current_time >> send_email ##defining tasks dependency
Here current_time and send_email are 2 different tasks performing different work. Now we have a dependency here like we have to send an email once the current time is printed so we have established that task dependency at the end. Also we have given a scheduled_interval to run the DAG everyday at 12 PM. This together forms a DAG.

Airflow schedule not updating

I created a DAG that will run on a weekly basis. Below is what I tried and it's working as expected.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
SCHEDULE_INTERVAL = timedelta(weeks=1, seconds=00, minutes=00, hours=00)
default_args = {
'depends_on_past': False,
'retries': 0,
'retry_delay': timedelta(minutes=2),
'wait_for_downstream': True,
'provide_context': True,
'start_date': datetime(2020, 12, 20, hour=00, minute=00, second=00)
}
with DAG("DAG", default_args=default_args, schedule_interval=SCHEDULE_INTERVAL, catchup=True) as dag:
t1 = BashOperator(
task_id='dag_schedule',
bash_command='echo DAG',
dag=dag)
As per the schedule, it ran on the 27(i.e. 20 in the script). As there is a change in requirement, Now I updated the start date to 30th(i.e 23 in the script) instead of 27(My idea is to start the schedule from 30 and from there onwards every week). When I change the schedule of the DAG i.e. start date from 27 to 30th. DAG is not picking as per the latest start date, not sure why? When I deleted the DAG(as it is test DAG I deleted it, in prod I can't delete it) and created the new DAG with the same name with the latest start date i.e. 30th, it's running as per the schedule.
As per the Airflow DOC's
When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention : my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc...
Changing schedule interval always requires changing the dag_id, because previously run TaskInstances will not align with the new schedule interval
Changing start_date without changing schedule_interval is safe, but changing to an earlier start_date will not create any new DagRuns for the time between the new start_date and the old one, so tasks will not automatically backfill to the new dates. If you manually create DagRuns, tasks will be scheduled, as long as the DagRun date is after both the task start_date and the dag start_date.
So if we change start date we need to change the DAG name or delete the existing DAG so that it will be recreated with the same name again(metadata related to previous DAG will be deleted from metadata)
Source
Your DAG as you defined it will be triggered on 6-Jan-2021
Airflow schedule tasks at the END of the interval (See doc reference)
So per your settings:
SCHEDULE_INTERVAL = timedelta(weeks=1, seconds=00, minutes=00, hours=00)
and
'start_date': datetime(2020, 12 , 30, hour=00, minute=00, second=00)
This means the first run will be on 6-Jan-2021 because 30-Dec-2020 + 1 week = 6-Jan-2021 Note that the execution_date of this run will be 2020-12-30

Airflow Dag not running on scheduled interval

I am testing a simple dag to run on scheduled interval which is at 6 UTC on every Friday and Saturday('0 6 * * 5,6').
But the dag did not triggered on 6 am on Friday.
I know that Friday's instance will run on Saturday and Saturday's on Friday.
What can I do so that it will run Friday's instance on Friday only? or any work around?
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
def create_txt():
f=open("/home/abc/test1.txt","w+")
for i in range(10):
f.write("This is line %d\r\n" % (i+1))
f.close()
default_args = {
'owner': 'abc',
'depends_on_past': False,
'start_date': datetime(2020, 6, 24),
'retries': 1,
'retry_delay': timedelta(minutes=5),
'catchup': False
}
with DAG('python_test',
default_args=default_args,schedule_interval='0 6 * * 5,6'
) as dag:
create_txt = PythonOperator(task_id='python_test',
python_callable=create_txt)
Check the scheduler health,
Try to increase the scheduler timeout
Make sure there are no other dags importing the same modules in and around the same time.
Suggesting you to import os,sys modules.
If the dag is going to be executed for the first time, then consider giving a manual execution for the first time.

Getting *** Task instance did not exist in the DB as error when running gcs_to_bq in composer

While executing the following python script using cloud-composer, I get *** Task instance did not exist in the DB under the gcs2bq task Log in Airflow
Code:
import datetime
import os
import csv
import pandas as pd
import pip
from airflow import models
#from airflow.contrib.operators import dataproc_operator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils import trigger_rule
from airflow.contrib.operators import gcs_to_bq
from airflow.contrib.operators import bigquery_operator
print('''/-------/--------/------/
-------/--------/------/''')
yesterday = datetime.datetime.combine(
datetime.datetime.today() - datetime.timedelta(1),
datetime.datetime.min.time())
default_dag_args = {
# Setting start date as yesterday starts the DAG immediately when it is
# detected in the Cloud Storage bucket.
'start_date': yesterday,
# To email on failure or retry set 'email' arg to your email and enable
# emailing here.
'email_on_failure': False,
'email_on_retry': False,
# If a task fails, retry it once after waiting at least 5 minutes
'retries': 1,
'retry_delay': datetime.timedelta(minutes=5),
'project_id': 'data-rubrics'
#models.Variable.get('gcp_project')
}
try:
# [START composer_quickstart_schedule]
with models.DAG(
'composer_agg_quickstart',
# Continue to run DAG once per day
schedule_interval=datetime.timedelta(days=1),
default_args=default_dag_args) as dag:
# [END composer_quickstart_schedule]
op_start = BashOperator(task_id='Initializing', bash_command='echo Initialized')
#op_readwrite = PythonOperator(task_id = 'ReadAggWriteFile', python_callable=read_data)
op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator( \
task_id='gcs2bq',\
bucket='dr-mockup-data',\
source_objects=['sample.csv'],\
destination_project_dataset_table='data-rubrics.sample_bqtable',\
schema_fields = [{'name':'a', 'type':'STRING', 'mode':'NULLABLE'},{'name':'b', 'type':'FLOAT', 'mode':'NULLABLE'}],\
write_disposition='WRITE_TRUNCATE',\
dag=dag)
#op_write = PythonOperator(task_id = 'AggregateAndWriteFile', python_callable=write_data)
op_start >> op_load
UPDATE:
Can you remove dag=dag from gcs2bq task as you are already using with models.DAG and run your dag again?
It might be because you have a dynamic start date. Your start_date should never be dynamic. Read this FAQ: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an #hourly DAG would never get to an hour after now as now() moves along.
Make your start_date static or use Airflow utils/macros:
import airflow
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2),
}
Okay, this was a stupid question on my part and apologies for everyone who wasted time here. I had a Dag running due to which the one I was shooting off was always in the que. Also, I did not write the correct value in destination_project_dataset_table. Thanks and apologies to all who spent time.

Run only the latest Airflow DAG

Let's say I would like to run a pretty simple ETL DAG with Airflow:
it checks the last insert time in DB2, and it loads newer rows from DB1 to DB2 if any.
There are some understandable requirements:
It scheduled hourly, the first few runs will last more than 1 hour
eg. the first run should process a month data, and it lasts for 72 hours,
so the second run should process the last 72 hour, it last 7.2 hours,
the third processes 7.2 hours and it finishes within an hour,
and from then on it runs hourly.
While the DAG is running, don't start the next one, skip it instead.
If the time passed the trigger event, and the DAG didn't start, don't start it subsequently.
There are other DAGs as well, the DAGs should be executed independently.
I've found these parameters and operator a little confusing, what is the distinctions between them?
depends_on_past
catchup
backfill
LatestOnlyOperator
Which one should I use, and which LocalExecutor?
Ps. there's already a very similar thread, but it isn't exhausting.
DAG max_active_runs = 1 combined with catchup = False would solve this.
This one satisfies my requirements. The DAG runs in every minute, and my "main" task lasts for 90 seconds, so it should skip every second run.
I've used a ShortCircuitOperator to check whether the current run is the only one at the moment (query in the dag_run table of airflow db), and catchup=False to disable backfilling.
However I cannot utilize properly the LatestOnlyOperator which should do something similar.
DAG file
import os
import sys
from datetime import datetime
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator, ShortCircuitOperator
import foo
import util
default_args = {
'owner': 'airflow',
'depends_on_past': True,
'start_date': datetime(2018, 2, 13), # or any date in the past
'email': ['services#mydomain.com'],
'email_on_failure': True}
dag = DAG(
'test90_dag',
default_args=default_args,
schedule_interval='* * * * *',
catchup=False)
condition_task = ShortCircuitOperator(
task_id='skip_check',
python_callable=util.is_latest_active_dagrun,
provide_context=True,
dag=dag)
py_task = PythonOperator(
task_id="test90_task",
python_callable=foo.bar,
provide_context=True,
dag=dag)
airflow.utils.helpers.chain(condition_task, py_task)
util.py
import logging
from datetime import datetime
from airflow.hooks.postgres_hook import PostgresHook
def get_num_active_dagruns(dag_id, conn_id='airflow_db'):
# for this you have to set this value in the airflow db
airflow_db = PostgresHook(postgres_conn_id=conn_id)
conn = airflow_db.get_conn()
cursor = conn.cursor()
sql = "select count(*) from public.dag_run where dag_id = '{dag_id}' and state in ('running', 'queued', 'up_for_retry')".format(dag_id=dag_id)
cursor.execute(sql)
num_active_dagruns = cursor.fetchone()[0]
return num_active_dagruns
def is_latest_active_dagrun(**kwargs):
num_active_dagruns = get_num_active_dagruns(dag_id=kwargs['dag'].dag_id)
return (num_active_dagruns == 1)
foo.py
import datetime
import time
def bar(*args, **kwargs):
t = datetime.datetime.now()
execution_date = str(kwargs['execution_date'])
with open("/home/airflow/test.log", "a") as myfile:
myfile.write(execution_date + ' - ' + str(t) + '\n')
time.sleep(90)
with open("/home/airflow/test.log", "a") as myfile:
myfile.write(execution_date + ' - ' + str(t) + ' +90\n')
return 'bar: ok'
Acknowledgement: this answer is based on this blog post.
DAG max_active_runs = 1 combined with catchup = False and add a DUMMY task right at the beginning( sort of START task) with wait_for_downstream=True.
As of LatestOnlyOperator - it will help to avoid reruning a Task if previous execution is not yet finished.
Or create the "START" task as LatestOnlyOperator and make sure all Taks part of 1st processing layer are connecting to it. But pay attention - as per the Docs "Note that downstream tasks are never skipped if the given DAG_Run is marked as externally triggered."

Resources