Programmatically clear the state of Airflow task instances

I want to clear the tasks in DAG B when DAG A completes execution. Both A and B are scheduled DAGs.
Is there any operator/way to clear the state of tasks and re-run DAG B programmatically?
I'm aware of the CLI option and Web UI option to clear the tasks.

I would recommend staying away from the CLI here!
Airflow's DAG and task functionality is much better exposed when you reference the objects directly, compared to going through the BashOperator and/or the CLI module.
Add a Python operation to DAG A named "clear_dag_b" that imports dag_b from the dags folder (module) and does this:
from dags.dag_b import dag as dag_b

def clear_dag_b(**context):
    exec_date = context['execution_date']
    dag_b.clear(start_date=exec_date, end_date=exec_date)
Important! If for some reason the start_date/end_date you pass does not match or overlap dag_b's scheduled times, the clear() operation will miss those DAG executions. This example assumes DAG A and DAG B are scheduled identically, and that you only want to clear day X from B when A executes day X.
It might make sense to include a check for whether dag_b has already run before clearing:
dag_b_run = dag_b.get_dagrun(exec_date)  # returns None or a DagRun object
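Putting the pieces together, a minimal sketch of what the final task in DAG A might look like (the task name is illustrative, and this assumes an Airflow 1.x-style PythonOperator; on 2.x you can drop provide_context):
from airflow.operators.python_operator import PythonOperator
from dags.dag_b import dag as dag_b

def clear_dag_b(**context):
    exec_date = context['execution_date']
    # Only clear if DAG B has already run for this execution date.
    if dag_b.get_dagrun(exec_date) is not None:
        dag_b.clear(start_date=exec_date, end_date=exec_date)

clear_b = PythonOperator(
    task_id='clear_dag_b',
    python_callable=clear_dag_b,
    provide_context=True,
    dag=dag)  # `dag` here is DAG A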

cli.py is an incredibly useful place to peep into the SQLAlchemy magic of Airflow.
The clear command is implemented here:
@cli_utils.action_logging
def clear(args):
    logging.basicConfig(
        level=settings.LOGGING_LEVEL,
        format=settings.SIMPLE_LOG_FORMAT)
    dags = get_dags(args)

    if args.task_regex:
        for idx, dag in enumerate(dags):
            dags[idx] = dag.sub_dag(
                task_regex=args.task_regex,
                include_downstream=args.downstream,
                include_upstream=args.upstream)

    DAG.clear_dags(
        dags,
        start_date=args.start_date,
        end_date=args.end_date,
        only_failed=args.only_failed,
        only_running=args.only_running,
        confirm_prompt=not args.no_confirm,
        include_subdags=not args.exclude_subdags,
        include_parentdag=not args.exclude_parentdag,
    )
Looking at the source, you can either:
replicate it (assuming you also want to modify the functionality a bit), or
just do from airflow.bin import cli and invoke the required functions directly.
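For example, a minimal sketch of replicating the relevant part directly (assuming the Airflow 1.x module layout; the dag_id and dates are placeholders):
from datetime import datetime
from airflow.models import DagBag, DAG

dagbag = DagBag()  # parses your dags folder
target = dagbag.get_dag('dag_b')  # placeholder dag_id

DAG.clear_dags(
    [target],
    start_date=datetime(2021, 1, 1),
    end_date=datetime(2021, 1, 1),
    only_failed=False,
    only_running=False,
    confirm_prompt=False,  # skip the interactive prompt when calling from code
    include_subdags=True,
)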

Since my objective was to re-run DAG B whenever DAG A completes execution, I ended up clearing DAG B with a BashOperator:
# Clear the tasks in another dag
last_task = BashOperator(
    task_id='last_task',
    bash_command='airflow clear example_target_dag -c',
    dag=dag)

first_task >> last_task
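If you only want to clear the run matching the current execution, a minimal sketch using Jinja templating in the bash_command (the -s/-e/-c flags are from the Airflow 1.x clear CLI):
last_task = BashOperator(
    task_id='last_task',
    bash_command=('airflow clear example_target_dag '
                  '-s {{ ds }} -e {{ ds }} -c'),  # -c skips the confirm prompt
    dag=dag)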

It is possible, but I would be careful about getting into an endless loop of retries if the task never succeeds. You can call a bash command within the on_retry_callback, where you specify which tasks/DAG runs you want to clear.
This works in 2.0, as the clear commands have changed:
https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#clear
In this example, I am clearing from t2 and its downstream tasks when t3 eventually fails:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


def clear_upstream_task(context):
    execution_date = context.get("execution_date")
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t2 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)


# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

with DAG('clear_upstream_task',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval=timedelta(minutes=5),
         default_args=default_args,
         catchup=False
         ) as dag:
    t0 = DummyOperator(
        task_id='t0'
    )
    t1 = DummyOperator(
        task_id='t1'
    )
    t2 = DummyOperator(
        task_id='t2'
    )
    t3 = BashOperator(
        task_id='t3',
        bash_command='exit 123',
        # retries=1,
        on_failure_callback=clear_upstream_task
    )
    t0 >> t1 >> t2 >> t3

Related

Is there a way to configure different 'retries' for tasks in the same DAG

I have a DAG with many sub-tasks in it. In the middle of the DAG there is a validation task, and based on the result/return code from it, I want to take two different paths. On success, one route (a sequence of tasks) should be followed, and on failure we would like to execute a different set of tasks. There are two problems with the current approach: first, the validation task executes many times (as per the configured retries) if the exit code is 1; second, there is no way to take different branches of execution.
To solve problem number 1, we can use the retry number that is available from the task instance via the macro {{ task_instance }}. I would appreciate it if someone could point us to a cleaner approach; problem number 2, taking different paths, remains unsolved.
You can have retries at the task level.
run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='echo 1',
    retries=3,
    dag=dag,
)
run_this_last = DummyOperator(
    task_id='run_this_last',
    retries=1,
    dag=dag,
)
Regarding your 2nd problem, there is a concept of Branching.
The BranchPythonOperator is much like the PythonOperator except that it expects a python_callable that returns a task_id (or list of task_ids). The task_id returned is followed, and all of the other paths are skipped. The task_id returned by the Python function has to be referencing a task directly downstream from the BranchPythonOperator task.
Example DAG:
import random

import airflow
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

dag = DAG(
    dag_id='example_branch_operator',
    default_args=args,
    schedule_interval="@daily",
)

run_this_first = DummyOperator(
    task_id='run_this_first',
    dag=dag,
)

options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=lambda: random.choice(options),
    dag=dag,
)
run_this_first >> branching

join = DummyOperator(
    task_id='join',
    trigger_rule='one_success',
    dag=dag,
)

for option in options:
    t = DummyOperator(
        task_id=option,
        dag=dag,
    )
    dummy_follow = DummyOperator(
        task_id='follow_' + option,
        dag=dag,
    )
    branching >> t >> dummy_follow >> join
Regarding your first problem, you can set task/operator-specific retry options quite easily. Reference: baseoperator.py#L77.
For problem two, you can branch within a DAG easily with the BranchPythonOperator (example usage: example_branch_operator.py). You will want to nest your validation task/logic within the BranchPythonOperator (you can define and execute operators within operators).
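A minimal sketch of nesting the validation inside the branch callable (run_validation and the task_ids are hypothetical placeholders):
from airflow.operators.python_operator import BranchPythonOperator

def validate_and_branch(**context):
    exit_code = run_validation()  # hypothetical validation logic
    # Return the task_id of the downstream path to follow; the rest are skipped.
    return 'success_path' if exit_code == 0 else 'failure_path'

branching = BranchPythonOperator(
    task_id='validate_and_branch',
    python_callable=validate_and_branch,
    provide_context=True,
    dag=dag,
)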

Is it possible to have a pipeline in Airflow that does not tie to any schedule?

I need to have a pipeline that will be executed either manually or programmatically. Is that possible with Airflow? It looks like right now each workflow MUST be tied to a schedule.
Just set the schedule_interval to None when you create the DAG:
dag = DAG('workflow_name',
          template_searchpath='path',
          schedule_interval=None,
          default_args=default_args)
From the Airflow Manual:
Each DAG may or may not have a schedule, which informs how DAG Runs are created. schedule_interval is defined as a DAG argument, and receives preferably a cron expression as a str, or a datetime.timedelta object.
The manual then goes on to list some cron 'presets' one of which is None.
Yes, this can be achieved by passing None to schedule_interval in default_args.
Check this documentation on DAG runs.
For example:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': None,  # Check this line
}
In Airflow, every DAG is required to have a start date and schedule interval*, for example hourly:
import datetime

dag = DAG(
    dag_id='my_dag',
    schedule_interval=datetime.timedelta(hours=1),
    start_date=datetime.datetime(2018, 5, 23),
)
(Without a schedule how would it know when to run?)
Alternatively to a cron schedule, you can set the schedule to @once to only run once.
*One exception: You can omit the schedule for externally triggered DAGs because Airflow will not schedule them itself.
That said, if you omit the schedule, you need to trigger the DAG externally somehow. If you want to be able to call a DAG programmatically, for instance as a result of a separate condition occurring in another DAG, you can do that with the TriggerDagRunOperator. You might also hear this idea called externally triggered DAGs.
Here's a usage example from the Airflow Example DAGs:
File 1 - example_trigger_controller_dag.py:
"""This example illustrates the use of the TriggerDagRunOperator. There are 2
entities at work in this scenario:
1. The Controller DAG - the DAG that conditionally executes the trigger
2. The Target DAG - DAG being triggered (in example_trigger_target_dag.py)
This example illustrates the following features :
1. A TriggerDagRunOperator that takes:
a. A python callable that decides whether or not to trigger the Target DAG
b. An optional params dict passed to the python callable to help in
evaluating whether or not to trigger the Target DAG
c. The id (name) of the Target DAG
d. The python callable can add contextual info to the DagRun created by
way of adding a Pickleable payload (e.g. dictionary of primitives). This
state is then made available to the TargetDag
2. A Target DAG : c.f. example_trigger_target_dag.py
"""
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from datetime import datetime

import pprint

pp = pprint.PrettyPrinter(indent=4)


def conditionally_trigger(context, dag_run_obj):
    """This function decides whether or not to Trigger the remote DAG"""
    c_p = context['params']['condition_param']
    print("Controller DAG : conditionally_trigger = {}".format(c_p))
    if context['params']['condition_param']:
        dag_run_obj.payload = {'message': context['params']['message']}
        pp.pprint(dag_run_obj.payload)
        return dag_run_obj


# Define the DAG
dag = DAG(dag_id='example_trigger_controller_dag',
          default_args={"owner": "airflow",
                        "start_date": datetime.utcnow()},
          schedule_interval='@once')

# Define the single task in this controller example DAG
trigger = TriggerDagRunOperator(task_id='test_trigger_dagrun',
                                trigger_dag_id="example_trigger_target_dag",
                                python_callable=conditionally_trigger,
                                params={'condition_param': True,
                                        'message': 'Hello World'},
                                dag=dag)
File 2 - example_trigger_target_dag.py:
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime

import pprint

pp = pprint.PrettyPrinter(indent=4)

# This example illustrates the use of the TriggerDagRunOperator. There are 2
# entities at work in this scenario:
# 1. The Controller DAG - the DAG that conditionally executes the trigger
#    (in example_trigger_controller.py)
# 2. The Target DAG - DAG being triggered
#
# This example illustrates the following features :
# 1. A TriggerDagRunOperator that takes:
#    a. A python callable that decides whether or not to trigger the Target DAG
#    b. An optional params dict passed to the python callable to help in
#       evaluating whether or not to trigger the Target DAG
#    c. The id (name) of the Target DAG
#    d. The python callable can add contextual info to the DagRun created by
#       way of adding a Pickleable payload (e.g. dictionary of primitives). This
#       state is then made available to the TargetDag
# 2. A Target DAG : c.f. example_trigger_target_dag.py

args = {
    'start_date': datetime.utcnow(),
    'owner': 'airflow',
}

dag = DAG(
    dag_id='example_trigger_target_dag',
    default_args=args,
    schedule_interval=None)


def run_this_func(ds, **kwargs):
    print("Remotely received value of {} for key=message".
          format(kwargs['dag_run'].conf['message']))


run_this = PythonOperator(
    task_id='run_this',
    provide_context=True,
    python_callable=run_this_func,
    dag=dag)

# You can also access the DagRun object in templates
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: '
                 '{{ dag_run.conf["message"] if dag_run else "" }}" ',
    dag=dag)
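Besides the operator, you can also create a DagRun from arbitrary Python code. A minimal sketch using the experimental API (the import path is for Airflow 1.10; it moved under airflow.api.common in Airflow 2):
from airflow.api.common.experimental.trigger_dag import trigger_dag

# Creates a new DagRun of the target DAG and passes conf through to it.
trigger_dag(
    dag_id='example_trigger_target_dag',
    conf={'message': 'Hello World'},
)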

How to add a task at run time if task 1 fails

I want to execute task 2 if task 1 succeeds; if task 1 fails, I want to run task 3 and assign another flow if required.
Basically, I want to run conditional tasks in Airflow without SSH operators.
from airflow import DAG
from airflow.operators import PythonOperator, BranchPythonOperator
from airflow.operators import BashOperator
from datetime import datetime, timedelta
from airflow.models import Variable


def t2_error_task(context):
    instance = context['task_instance']
    if instance.task_id == "performExtract":
        print("Please implement something over this")
        task_3 = PythonOperator(
            task_id='performJoin1',
            python_callable=performJoin1,  # maybe main?
            dag=dag
        )
        dag.add_task(task_3)


with DAG(
    'manageWorkFlow',
    catchup=False,
    default_args={
        'owner': 'Mannu',
        'start_date': datetime(2018, 4, 13),
        'schedule_interval': None,
        'depends_on_past': False,
    },
) as dag:
    task_1 = PythonOperator(
        task_id='performExtract',
        python_callable=performExtract,
        on_failure_callback=t2_error_task,
        depends_on_past=True
    )
    task_2 = PythonOperator(
        task_id='printSchemas',
        depends_on_past=True,
        python_callable=printSchemaAll,  # maybe main?
    )
    task_2.set_upstream(task_1)
Adding tasks dynamically based on execution-time statuses is not something Airflow supports. To get the desired behaviour, you should add task_3 to your DAG up front but change its trigger_rule to all_failed. In this case the task will be marked as skipped when task_1 succeeds, but it will be executed when task_1 fails, as sketched below.
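A minimal sketch of that arrangement, using the asker's names (performJoin1 is their callable):
from airflow.operators import PythonOperator  # as in the question's imports

task_3 = PythonOperator(
    task_id='performJoin1',
    python_callable=performJoin1,
    trigger_rule='all_failed',  # runs only when every upstream task failed
)
task_3.set_upstream(task_1)  # declared inside the same `with DAG(...)` block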

Run only the latest Airflow DAG

Let's say I would like to run a pretty simple ETL DAG with Airflow:
it checks the last insert time in DB2, and it loads newer rows from DB1 to DB2 if any.
There are some understandable requirements:
It is scheduled hourly, but the first few runs will last more than 1 hour;
e.g. the first run should process a month's worth of data, and it lasts 72 hours,
so the second run should process the last 72 hours, which takes 7.2 hours,
the third processes 7.2 hours and finishes within an hour,
and from then on it runs hourly.
While the DAG is running, don't start the next one, skip it instead.
If the time passed the trigger event, and the DAG didn't start, don't start it subsequently.
There are other DAGs as well, the DAGs should be executed independently.
I've found these parameters and this operator a little confusing; what are the distinctions between them?
depends_on_past
catchup
backfill
LatestOnlyOperator
Which one should I use, and which works with the LocalExecutor?
P.S. there's already a very similar thread, but it isn't exhaustive.
DAG max_active_runs = 1 combined with catchup = False would solve this.
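For instance, a minimal sketch (the dag_id, schedule, and start date are placeholders):
from datetime import datetime
from airflow import DAG

dag = DAG(
    'hourly_etl',                  # placeholder dag_id
    schedule_interval='@hourly',
    start_date=datetime(2018, 2, 13),
    max_active_runs=1,  # never start a new run while one is still in flight
    catchup=False,      # don't backfill the intervals that were skipped
)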
This one satisfies my requirements. The DAG runs every minute, and my "main" task lasts for 90 seconds, so it should skip every second run.
I've used a ShortCircuitOperator to check whether the current run is the only active one at the moment (a query against the dag_run table of the Airflow DB), and catchup=False to disable backfilling.
However, I could not properly utilize the LatestOnlyOperator, which should do something similar.
DAG file
import os
import sys
from datetime import datetime

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator, ShortCircuitOperator

import foo
import util

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2018, 2, 13),  # or any date in the past
    'email': ['services@mydomain.com'],
    'email_on_failure': True}

dag = DAG(
    'test90_dag',
    default_args=default_args,
    schedule_interval='* * * * *',
    catchup=False)

condition_task = ShortCircuitOperator(
    task_id='skip_check',
    python_callable=util.is_latest_active_dagrun,
    provide_context=True,
    dag=dag)

py_task = PythonOperator(
    task_id="test90_task",
    python_callable=foo.bar,
    provide_context=True,
    dag=dag)

airflow.utils.helpers.chain(condition_task, py_task)
util.py
import logging
from datetime import datetime
from airflow.hooks.postgres_hook import PostgresHook


def get_num_active_dagruns(dag_id, conn_id='airflow_db'):
    # for this you have to set this connection in the airflow db
    airflow_db = PostgresHook(postgres_conn_id=conn_id)
    conn = airflow_db.get_conn()
    cursor = conn.cursor()
    sql = "select count(*) from public.dag_run where dag_id = '{dag_id}' and state in ('running', 'queued', 'up_for_retry')".format(dag_id=dag_id)
    cursor.execute(sql)
    num_active_dagruns = cursor.fetchone()[0]
    return num_active_dagruns


def is_latest_active_dagrun(**kwargs):
    num_active_dagruns = get_num_active_dagruns(dag_id=kwargs['dag'].dag_id)
    return num_active_dagruns == 1
foo.py
import datetime
import time


def bar(*args, **kwargs):
    t = datetime.datetime.now()
    execution_date = str(kwargs['execution_date'])
    with open("/home/airflow/test.log", "a") as myfile:
        myfile.write(execution_date + ' - ' + str(t) + '\n')
    time.sleep(90)
    with open("/home/airflow/test.log", "a") as myfile:
        myfile.write(execution_date + ' - ' + str(t) + ' +90\n')
    return 'bar: ok'
Acknowledgement: this answer is based on this blog post.
Set DAG max_active_runs = 1 combined with catchup = False, and add a dummy task right at the beginning (a sort of START task) with wait_for_downstream=True.
As for the LatestOnlyOperator: it will help to avoid re-running a task if the previous execution is not yet finished.
Alternatively, create the "START" task as a LatestOnlyOperator and make sure all tasks in the first processing layer connect to it. But pay attention: as per the docs, "Note that downstream tasks are never skipped if the given DAG_Run is marked as externally triggered."
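A minimal sketch of that LatestOnlyOperator-as-START pattern (the import path is Airflow 1.10; the task names and dag are hypothetical):
from airflow.operators.latest_only_operator import LatestOnlyOperator

start = LatestOnlyOperator(task_id='start', dag=dag)
# Downstream tasks are skipped whenever this run is not the most recent one
# (unless the run was triggered externally - see the docs note above).
start >> [first_layer_task_a, first_layer_task_b]  # hypothetical first layer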

Airflow dynamic DAG and Task Ids

I mostly see Airflow being used for ETL/big data related jobs. I'm trying to use it for business workflows wherein a user action triggers a set of dependent tasks in the future. Some of these tasks may need to be cleared (deleted) based on certain other user actions.
I thought the best way to handle this would be via dynamic task ids. I read that Airflow supports dynamic DAG ids, so I created a simple Python script that takes a DAG id and task id as command-line parameters. However, I'm running into problems making it work: it gives a dag_id not found error. Has anyone tried this? Here's the code for the script (call it tmp.py), which I execute on the command line as python tmp.py 820 2016-08-24T22:50:00:
from __future__ import print_function
import os
import sys
import shutil
from datetime import date, datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

execution = '2016-08-24T22:20:00'

if len(sys.argv) > 2:
    dagid = sys.argv[1]
    taskid = 'Activate' + sys.argv[1]
    execution = sys.argv[2]
else:
    dagid = 'DAGObjectId'
    taskid = 'Activate'

default_args = {'owner': 'airflow', 'depends_on_past': False, 'start_date': date.today(), 'email': ['fake@fake.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1}

dag = DAG(dag_id=dagid,
          default_args=default_args,
          schedule_interval='@once',
          )
globals()[dagid] = dag

task1 = BashOperator(
    task_id=taskid,
    bash_command='ls -l',
    dag=dag)

fakeTask = BashOperator(
    task_id='fakeTask',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

task1.set_upstream(fakeTask)

airflowcmd = "airflow run " + dagid + " " + taskid + " " + execution
print("airflowcmd = " + airflowcmd)
os.system(airflowcmd)
After numerous trials and errors, I was able to figure this out. Hopefully it will help someone. Here's how it works: you need an iterator or an external source (file/database table) to generate DAGs/tasks dynamically through a template. You can keep the DAG and task names static and just assign them ids dynamically in order to differentiate one DAG from the other. You put this Python script in the dags folder. When you start the Airflow scheduler, it runs through this script on every heartbeat and writes the DAGs to the dag table in the database. If a DAG (unique dag id) has already been written, it simply skips it. The scheduler also looks at the schedule of individual DAGs to determine which one is ready for execution. If a DAG is ready for execution, it executes it and updates its status.
Here's a sample code:
from airflow.operators import PythonOperator
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
import sys
import time

dagid = 'DA' + str(int(time.time()))
taskid = 'TA' + str(int(time.time()))

input_file = '/home/directory/airflow/textfile_for_dagids_and_schedule'


def my_sleeping_function(random_base):
    '''This is a function that will run within the DAG execution'''
    time.sleep(random_base)


def_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email_on_failure': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=2)
}

with open(input_file, 'r') as f:
    for line in f:
        args = line.strip().split(',')
        if len(args) < 7:  # need all 7 fields: id plus y,m,d,h,min,s
            continue
        dagid = 'DAA' + args[0]
        taskid = 'TAA' + args[0]
        yyyy = int(args[1])
        mm = int(args[2])
        dd = int(args[3])
        hh = int(args[4])
        mins = int(args[5])
        ss = int(args[6])
        dag = DAG(
            dag_id=dagid, default_args=def_args,
            schedule_interval='@once', start_date=datetime(yyyy, mm, dd, hh, mins, ss)
        )
        myBashTask = BashOperator(
            task_id=taskid,
            bash_command='python /home/directory/airflow/sendemail.py',
            dag=dag)
        task2id = taskid + '-X'
        task_sleep = PythonOperator(
            task_id=task2id,
            python_callable=my_sleeping_function,
            op_kwargs={'random_base': 10},
            dag=dag)
        task_sleep.set_upstream(myBashTask)
From How can I create DAGs dynamically?:
Airflow looks in you [sic] DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds the objects it finds in the DagBag. Knowing this all we need is a way to dynamically assign variable in the global namespace, which is easily done in python using the globals() function for the standard library which behaves like a simple dictionary.
for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = DAG(dag_id)
    # or better, call a function that returns a DAG object!
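A minimal sketch of the "function that returns a DAG object" variant hinted at in that comment (the schedule, date, and task are illustrative):
from datetime import datetime
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

def make_dag(dag_id):
    dag = DAG(dag_id, schedule_interval='@daily', start_date=datetime(2021, 1, 1))
    DummyOperator(task_id='start', dag=dag)
    return dag

for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = make_dag(dag_id)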
Copying my answer from this question; this is only for v2.3 and above:
This feature is achieved using Dynamic Task Mapping, available only in Airflow versions 2.3 and higher.
More documentation and example here:
Official Dynamic Task Mapping documentation
Tutorial from Astronomer
Example:
@task
def make_list():
    # This can also be from an API call, checking a database -- almost anything you like, as long as the
    # resulting list/dictionary can be stored in the current XCom backend.
    return [1, 2, {"a": "b"}, "str"]


@task
def consumer(arg):
    print(list(arg))


with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())
Example 2:
from airflow import XComArg

task = MyOperator(task_id="source")
downstream = MyOperator2.partial(task_id="consumer").expand(input=XComArg(task))
The graph view and tree view are also updated to display the mapped tasks.
Relevant issues here:
https://github.com/apache/airflow/projects/12
