How to stop Airflow from backfilling when using trigger_dag?

I want to create a DAG that will only run upon an external trigger (i.e., using the 'airflow trigger_dag ' command). However, when I do this, I see multiple 'scheduled_xxx' DagRuns in addition to the 'manual_xxx' that I want. I am assuming the scheduled_xxx DagRuns are created to backfill?
Is there a way to only have the 'manual_xxx' DagRun created and no 'scheduled_xxx' DagRuns?
I tried different values for start_date (past, datetime.now(), and future) but got the same result. Here's my toy DAG ...
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'start_date': datetime.now(),  # also tried past and future dates
    'schedule_interval': None,
    'depends_on_past': False,
}

dag = DAG('my_test_dag', default_args=default_args)

date_task = BashOperator(
    task_id='date',
    bash_command='date',
    dag=dag)
This is how I am issuing the trigger_dag command ...
airflow trigger_dag my_test_dag

It appears "schedule_interval" is not a task-level parameter, so it has no effect inside default_args. If I move it to the DAG object I get the expected result; that is, I only get the manually triggered DagRun when I add the 'schedule_interval' argument to the DAG constructor ...
default_args = {
    'start_date': datetime.now(),  # also tried past and future dates
    'depends_on_past': False,
}

dag = DAG('my_test_dag', default_args=default_args, schedule_interval=None)
This results in only a single DagRun getting created for each external/manual trigger (trigger_dag).
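For reference, a complete version of the corrected toy DAG might look like the sketch below (the fixed start_date is illustrative; with schedule_interval=None on the DAG constructor, only manual runs are ever created):

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

default_args = {
    'start_date': datetime(2017, 1, 1),  # a fixed past date, purely illustrative
    'depends_on_past': False,
}

# schedule_interval goes on the DAG itself, not into default_args
dag = DAG('my_test_dag', default_args=default_args, schedule_interval=None)

date_task = BashOperator(
    task_id='date',
    bash_command='date',
    dag=dag)

Triggering it with 'airflow trigger_dag my_test_dag' should then create a single manual_xxx DagRun and nothing else.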

Related

When do I use a task vs a DAG?

I'm struggling to understand the difference between a task and a DAG and when to use one over the other. I know a task is more granular and called within a DAG, but so much of Airflow documentation mentions creating DAGs on the go or calling other DAGs instead of tasks. Is there any significant difference between using either of these two options?
A DAG is a collection of tasks plus schedule information. Each task can perform different work depending on your requirements. Consider the DAG code below as an example: it prints the current time and then sends an e-mail notification.
# importing operators and modules
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # to call a python callable
from airflow.operators.email_operator import EmailOperator  # to send email
from datetime import datetime, timedelta, timezone  # to work with dates and times
import dateutil

# setting default arguments
default_args = {
    'owner': 'test dag',
    'depends_on_past': False,
    'start_date': datetime(2021, 1, 1),
    'email': ['myemailid@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0
}

def print_time(**context):
    now_utc = datetime.now(timezone.utc)
    print("current time", now_utc)

with DAG('example_dag', schedule_interval='0 12 * * *', max_active_runs=1,
         catchup=False, default_args=default_args) as dag:  # dag name is 'example_dag'
    current_time = PythonOperator(task_id='current_time',
                                  python_callable=print_time,
                                  provide_context=True)  # task to call the print_time callable
    send_email = EmailOperator(task_id='send_email', to='myemailid@example.com',
                               subject='DAG completed successfully',
                               html_content="<p>Hi,<br><br>example DAG completed successfully<br></p>")  # task to send email
    current_time >> send_email  # defining the task dependency
Here current_time and send_email are two different tasks performing different work. We need to send the e-mail only after the current time has been printed, so that task dependency is declared at the end. We have also given a schedule_interval so the DAG runs every day at 12 PM. All of this together forms the DAG.
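If you prefer method calls over the bit-shift syntax, the same dependency could equally be written with set_downstream (a standard BaseOperator method); either way, the tasks are the units of work and the DAG is the container that groups and schedules them:

# equivalent to: current_time >> send_email
current_time.set_downstream(send_email)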

Airflow first DAG not showing in UI or dags list

I am new to Airflow and not sure how to create my first DAG. I used PyCharm to create my first file, DAG_try.py, like below:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'myname',
    'email': 'myemail',
    'start_date': datetime(2021, 4, 29)
}

dag = DAG(dag_id='DAG-try', default_args=default_args, schedule_interval='@once')

Fidelity = BashOperator(task_id='Fidelity_data',
                        bash_command='python Fidelity.py',
                        dag=dag)
Gemini = BashOperator(task_id='Gemini_data',
                      bash_command='python Gemini.py',
                      dag=dag)
Get_results = BashOperator(task_id='get_results',
                           bash_command='python get_results.py',
                           dag=dag)

[Fidelity, Gemini] >> Get_results
And put it under the directory C:\Users\name\AppData\Local\Packages\CanonicalGroupLimited.UbuntuonWindows_79rhkp1fndgsc\LocalState\rootfs\home\username\airflow\dags\
But my DAG is not shown in Airflow UI or in the dags list.
Under the ...\airflow\airflow.cfg file, dags_folder = ~/airflow/dags
I am not sure whether the directory is set correctly or not. Could anyone help me with this?
Thanks a lot in advance!
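A quick sanity check (assuming the Airflow 1.x CLI used in the other examples here) is to run the file directly to surface import errors, and then list the DAGs Airflow actually parsed from the configured dags_folder:

python DAG_try.py   # should produce no output; a traceback means the DAG can never load
airflow list_dags   # 'DAG-try' should appear here once the file is picked up from dags_folder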

airflow trigger_rule using ONE_FAILED causes dag failure

What I want to achieve is a task that will send a notification if any one of the tasks in the DAG fails. I am applying a trigger rule to that task like this:
batch11 = BashOperator(
    task_id='Error_Buzz',
    trigger_rule=TriggerRule.ONE_FAILED,
    bash_command='python /home/admin/pythonwork/home/codes/notifications/dagLevel_Notification.py',
    dag=dag)
    # note: catchup is a DAG-level argument, not an operator argument, so it is not passed here

batch >> batch11
batch1 >> batch11
The problem is that when no other task fails, the batch11 task will not execute due to its trigger_rule, which is what I want, but it results in the DAG run failing since the default trigger_rule is ALL_SUCCESS. Is there a way to close this loophole so the DAG run finishes successfully?
We do something similar in our Airflow deployment. The idea is to notify Slack when a task in a DAG fails. You can set the DAG-level configuration on_failure_callback, as documented at https://airflow.apache.org/code.html#airflow.models.BaseOperator:
on_failure_callback (callable) – a function to be called when a task
instance of this task fails. a context dictionary is passed as a
single parameter to this function. Context contains references to
related objects to the task instance and is documented under the
macros section of the API.
Here is an example of how I use it: if any of the tasks fails or succeeds, Airflow calls the notify function and I can receive the notification wherever I want.
import sys
import os
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from airflow.utils.dates import days_ago
from util.airflow_utils import AirflowUtils

schedule = timedelta(minutes=5)

args = {
    'owner': 'user',
    'start_date': days_ago(1),
    'depends_on_past': False,
    'on_failure_callback': AirflowUtils.notify_job_failure,
    'on_success_callback': AirflowUtils.notify_job_success
}

dag = DAG(
    dag_id='demo_dag',
    schedule_interval=schedule, default_args=args)

def task1():
    return 'Whatever you return gets printed in the logs!'

def task2():
    return 'cont'

task1 = PythonOperator(task_id='task1',
                       python_callable=task1,
                       dag=dag)
task2 = PythonOperator(task_id='task2',
                       python_callable=task2,
                       dag=dag)

task1 >> task2
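AirflowUtils here is a project-specific helper that is not shown; a minimal hypothetical stand-in for the failure callback could look like the sketch below, using the context dictionary described in the BaseOperator documentation quoted above:

def notify_job_failure(context):
    # context carries references to the failed task instance and related objects
    ti = context['task_instance']
    message = 'Task {} in DAG {} failed for {}'.format(
        ti.task_id, ti.dag_id, context['execution_date'])
    print(message)  # swap in a Slack webhook, email, or other notifier here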

Is it possible to have a pipeline in Airflow that does not tie to any schedule?

I need to have a pipeline that will be executed either manually or programmatically. Is that possible with Airflow? It looks like right now each workflow MUST be tied to a schedule.
Just set the schedule_interval to None when you create the DAG:
dag = DAG('workflow_name',
          template_searchpath='path',
          schedule_interval=None,
          default_args=default_args)
From the Airflow Manual:
Each DAG may or may not have a schedule, which informs how DAG Runs
are created. schedule_interval is defined as a DAG argument, and
receives preferably a cron expression as a str, or a
datetime.timedelta object.
The manual then goes on to list some cron 'presets' one of which is None.
Yes, this can be achieved by passing schedule_interval=None when constructing the DAG (as in the first question above, it has to go on the DAG itself, not on the tasks).
Check the documentation on DAG runs.
For example:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('my_dag', default_args=default_args,
          schedule_interval=None)  # Check this line
In Airflow, every DAG is required to have a start date and schedule interval*, for example hourly:
import datetime
from airflow import DAG

dag = DAG(
    dag_id='my_dag',
    schedule_interval=datetime.timedelta(hours=1),
    start_date=datetime.datetime(2018, 5, 23),
)
(Without a schedule how would it know when to run?)
As an alternative to a cron schedule, you can set the schedule to @once to run only once.
*One exception: You can omit the schedule for externally triggered DAGs because Airflow will not schedule them itself.
That said, if you omit the schedule, then you need to trigger the DAG externally somehow. If you want to call a DAG programmatically, for instance as a result of a separate condition occurring in another DAG, you can do that with the TriggerDagRunOperator. You might also hear this idea called externally triggered DAGs.
Here's a usage example from the Airflow Example DAGs:
File 1 - example_trigger_controller_dag.py:
"""This example illustrates the use of the TriggerDagRunOperator. There are 2
entities at work in this scenario:
1. The Controller DAG - the DAG that conditionally executes the trigger
2. The Target DAG - DAG being triggered (in example_trigger_target_dag.py)
This example illustrates the following features :
1. A TriggerDagRunOperator that takes:
a. A python callable that decides whether or not to trigger the Target DAG
b. An optional params dict passed to the python callable to help in
evaluating whether or not to trigger the Target DAG
c. The id (name) of the Target DAG
d. The python callable can add contextual info to the DagRun created by
way of adding a Pickleable payload (e.g. dictionary of primitives). This
state is then made available to the TargetDag
2. A Target DAG : c.f. example_trigger_target_dag.py
"""
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from datetime import datetime
import pprint
pp = pprint.PrettyPrinter(indent=4)
def conditionally_trigger(context, dag_run_obj):
    """This function decides whether or not to Trigger the remote DAG"""
    c_p = context['params']['condition_param']
    print("Controller DAG : conditionally_trigger = {}".format(c_p))
    if context['params']['condition_param']:
        dag_run_obj.payload = {'message': context['params']['message']}
        pp.pprint(dag_run_obj.payload)
        return dag_run_obj
# Define the DAG
dag = DAG(dag_id='example_trigger_controller_dag',
          default_args={"owner": "airflow",
                        "start_date": datetime.utcnow()},
          schedule_interval='@once')

# Define the single task in this controller example DAG
trigger = TriggerDagRunOperator(task_id='test_trigger_dagrun',
                                trigger_dag_id="example_trigger_target_dag",
                                python_callable=conditionally_trigger,
                                params={'condition_param': True,
                                        'message': 'Hello World'},
                                dag=dag)
File 2 - example_trigger_target_dag.py:
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime
import pprint
pp = pprint.PrettyPrinter(indent=4)
# This example illustrates the use of the TriggerDagRunOperator. There are 2
# entities at work in this scenario:
# 1. The Controller DAG - the DAG that conditionally executes the trigger
# (in example_trigger_controller.py)
# 2. The Target DAG - DAG being triggered
#
# This example illustrates the following features :
# 1. A TriggerDagRunOperator that takes:
# a. A python callable that decides whether or not to trigger the Target DAG
# b. An optional params dict passed to the python callable to help in
# evaluating whether or not to trigger the Target DAG
# c. The id (name) of the Target DAG
# d. The python callable can add contextual info to the DagRun created by
# way of adding a Pickleable payload (e.g. dictionary of primitives). This
# state is then made available to the TargetDag
# 2. A Target DAG : c.f. example_trigger_target_dag.py
args = {
    'start_date': datetime.utcnow(),
    'owner': 'airflow',
}

dag = DAG(
    dag_id='example_trigger_target_dag',
    default_args=args,
    schedule_interval=None)
def run_this_func(ds, **kwargs):
    print("Remotely received value of {} for key=message".
          format(kwargs['dag_run'].conf['message']))
run_this = PythonOperator(
    task_id='run_this',
    provide_context=True,
    python_callable=run_this_func,
    dag=dag)

# You can also access the DagRun object in templates
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: '
                 '{{ dag_run.conf["message"] if dag_run else "" }}" ',
    dag=dag)
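With the target DAG deployed, the same payload can also be supplied from the command line instead of from the controller DAG (Airflow 1.x CLI syntax); the JSON passed with -c/--conf ends up in dag_run.conf, which run_this_func and bash_task read above:

airflow trigger_dag example_trigger_target_dag -c '{"message": "Hello World"}'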

Airflow DAG not getting scheduled

I am new to Airflow and have created my first DAG. Here is my DAG code. I want the DAG to start now and thereafter run once a day.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['aaaa@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'alamode', default_args=default_args, schedule_interval=timedelta(1))

create_command = "/home/ubuntu/scripts/makedir.sh "
# t1 is the task which will invoke the directory creation shell script
t1 = BashOperator(
    task_id='create_directory',
    bash_command=create_command,
    dag=dag)

run_spiders = "/home/ubuntu/scripts/crawl_spiders.sh "
# t2 is the task which will invoke the spiders
t2 = BashOperator(
    task_id='web_scrawl',
    bash_command=run_spiders,
    dag=dag)

# To set dependency between tasks: 't1' should run before 't2'
t2.set_upstream(t1)
The DAG is not getting picked up by Airflow. I checked the log and here is what it says:
[2017-09-12 18:08:20,220] {jobs.py:343} DagFileProcessor398 INFO - Started process (PID=7001) to work on /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,223] {jobs.py:1521} DagFileProcessor398 INFO - Processing file /home/ubuntu/airflow/dags/alamode.py for tasks to queue
[2017-09-12 18:08:20,223] {models.py:167} DagFileProcessor398 INFO - Filling up the DagBag from /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,262] {jobs.py:1535} DagFileProcessor398 INFO - DAG(s) ['alamode'] retrieved from /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,291] {jobs.py:1169} DagFileProcessor398 INFO - Processing alamode
/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/default_comparator.py:161: SAWarning: The IN-predicate on "dag_run.dag_id" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate. Consider alternative strategies for improved performance.
'strategies for improved performance.' % expr)
[2017-09-12 18:08:20,317] {models.py:322} DagFileProcessor398 INFO - Finding 'running' jobs without a recent heartbeat
[2017-09-12 18:08:20,318] {models.py:328} DagFileProcessor398 INFO - Failing jobs without heartbeat after 2017-09-12 18:03:20.318105
[2017-09-12 18:08:20,320] {jobs.py:351} DagFileProcessor398 INFO - Processing /home/ubuntu/airflow/dags/alamode.py took 0.100 seconds
What exactly am I doing wrong? I have tried changing the schedule_interval to schedule_interval=timedelta(minutes=1) to see if it starts immediately, but it is still no use. I can see the tasks under the DAG as expected in the Airflow UI, but with the schedule status as 'no status'. Please help me here.
This issue was resolved by following the steps below (the shell commands from these steps are gathered into one snippet after the list):
1) I used a much older date for start_date and schedule_interval=timedelta(minutes=10). Also, I used a real date instead of datetime.now().
2) Added catchup = True in the DAG arguments.
3) Set up the environment variable: export AIRFLOW_HOME=`pwd`/airflow_home.
4) Deleted airflow.db.
5) Moved the new code to the DAGs folder.
6) Ran the command 'airflow initdb' to create the DB again.
7) Turned on the 'ON' switch of my DAG through the UI.
8) Ran the command 'airflow scheduler'.
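Roughly, the command-line part of those steps looks like this (paths are illustrative; Airflow 1.x CLI):

export AIRFLOW_HOME=`pwd`/airflow_home   # step 3
rm $AIRFLOW_HOME/airflow.db              # step 4: remove the old metadata DB
cp alamode.py $AIRFLOW_HOME/dags/        # step 5: file name illustrative
airflow initdb                           # step 6: recreate the metadata DB
airflow scheduler                        # step 8: start the scheduler (after switching the DAG on in the UI)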
Here is the code which works now:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 9, 12),
    'email': ['anjana@gapro.tech'],
    'retries': 0,
    'retry_delay': timedelta(minutes=15)
}

dag = DAG(
    'alamode', catchup=False, default_args=default_args, schedule_interval="@daily")

# t1 is the task which will invoke the directory creation shell script
t1 = BashOperator(
    task_id='create_directory',
    bash_command='/home/ubuntu/scripts/makedir.sh ',
    dag=dag)

# t2 is the task which will invoke the spiders
t2 = BashOperator(
    task_id='web_crawl',
    bash_command='/home/ubuntu/scripts/crawl_spiders.sh ',
    dag=dag)

# To set dependency between tasks: 't1' should run before 't2'
t2.set_upstream(t1)
