I have three base nodes (PythonOperators) that need to be called sequentially. Let us consider a scenario where my system receives a shape in the form of a jpeg file. I have three functions that I run over the file:
The first function will change the color of the shape, then, on success:
The shape gets passed on to the second function, which will rotate the shape 90deg, then, on success:
The shape gets passed onto the third function, which will transform the shape in some predetermined way, then, on success, the program is done.
So, I have three things that need to happen in order: ChangeColor-> Rotate -> Transform.
I wish to add a BranchingOperator that will, on success (determined by a "SUCCESS" return value from each of these three nodes), move on to the next node in the list, or terminate once the last node finishes. On failure (determined by a "FAILURE" return value from each node), the program progresses to a failure node (another PythonOperator, I imagine) that will do some things (save the error in some DB, extract some details and reasons, whatever - doesn't really matter).
But I am having some trouble figuring out how to do this in airflow. The docs haven't been too helpful when it comes to airflow's branching operator, and I haven't been able to find many examples online. Can anyone help?
Thanks!
Edit:
Okay, to make things a bit more concrete, here is what I currently have and think I need:
def branch_func_from_color(**context):
    if context['mode'] == 'FAILURE':
        return 'failure_node'
    return 'rotate_shapes'

def branch_func_from_rotate(**context):
    if context['mode'] == 'FAILURE':
        return 'failure_node'
    return 'transform_shapes'

def failure_node(**context):
    print("FAILURE")
with dag:
    color_task = PythonOperator(
        task_id="color_shapes",
        dag=dag,
        python_callable=color_shape,
        provide_context=True
    )

    rotate_task = PythonOperator(
        task_id="rotate_shapes",
        dag=dag,
        python_callable=rotate_shape,
        provide_context=True
    )

    transform_task = PythonOperator(
        task_id="transform_shapes",
        dag=dag,
        python_callable=transform_shape,
        provide_context=True
    )

    failure_task = PythonOperator(
        task_id="failure_node",
        dag=dag,
        provide_context=True,
        python_callable=failure_node
    )

    branching_task_color = BranchPythonOperator(
        task_id='branch_task_color',
        provide_context=True,
        python_callable=branch_func_from_color
    )

    branching_task_rotate = BranchPythonOperator(
        task_id='branch_task_rotate',
        provide_context=True,
        python_callable=branch_func_from_rotate
    )
Don't worry about the callable functions... I don't 100% know if I'm using context correctly in the callables I did provide, but nonetheless, I just want to focus on the graph. I want a graph that, in airflow, looks like this:
color_shapes ---> branch_task_color ---> rotate_shapes ---> branch_task_rotate ---> transform_shapes ---> branch_task_transform ---> END_PROGRAM
                        |                                           |                                              |
                        +-------------------------------------------+----------------------------------------------+---> failure_node
Okay, I don't have a branch_task_transform in the code, but you get the idea. How the hell do I do this in airflow? To reiterate, on any failures, I need to go to the failure_node; on successes, I just continue forward. Failure is determined by a return value from the python callables.
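Here is roughly what I think the wiring might need to look like, though I am not sure about it; branch_task_transform, branch_func_from_transform and end_program do not exist in my code yet, and the trigger-rule note is just my guess at how the skips would propagate:

with dag:
    branching_task_transform = BranchPythonOperator(
        task_id='branch_task_transform',
        provide_context=True,
        python_callable=branch_func_from_transform  # would mirror the other two branch callables
    )

    # terminal marker task (DummyOperator from airflow.operators.dummy_operator)
    end_program = DummyOperator(task_id='end_program')

    # failure_node sits downstream of every branch operator, so it probably needs
    # trigger_rule='one_success' when it is created, otherwise a skip from an earlier
    # branch that did not choose it would skip it as well

    color_task >> branching_task_color >> [rotate_task, failure_task]
    rotate_task >> branching_task_rotate >> [transform_task, failure_task]
    transform_task >> branching_task_transform >> [end_program, failure_task]

Each branch callable would then just return the task_id of the path to follow ('rotate_shapes', 'transform_shapes', 'end_program' or 'failure_node').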
I am trying to create a graph with a chain of dynamic tasks.
First of all, I started with the expand function. But the problem is that the program waits until all the Add tasks have finished and only then starts the Mul tasks. I need the next Mul to run immediately after each Add. I then got to the point where the graph could be built with this code:
with DAG(dag_id="simple_maping", schedule='* * * * *', start_date=datetime(2022, 12, 22)) as dag:
#task
def read_conf():
conf = Variable.get('tables', deserialize_json=True)
return conf
#task
def add_one(x: str):
sleep(5)
return x + '1'
#task
def mul_two(x: str):
return x * 2
for i in read_conf():
mul_two(add_one(i))
But now there is an error: 'xcomarg' object is not iterable. I can fix it by simply removing the task decorator from the read_conf method, but I am not sure that is the best decision, because in my case the list of configuration names could contain >1000 elements. Without the decorator, the method has to read the configuration every time the scheduler parses the graph.
Maybe the load without the decorator will not be critical? Or is there a way to make the object iterable? How do I do this right?
EDIT: This solution has a bug in 2.5.0 which was solved for 2.5.1 (not released yet).
Yes, when you are chaining dynamically mapped tasks, the latter one (mul_two) will by default wait until all mapped instances of the first task (add_one) are done, because the default trigger rule is all_success. While you can change the trigger rule, for example to one_done, this will not solve your issue, because the second task decides only once, when it first starts running, how many mapped task instances it creates (with one_done it only creates one mapped task instance, so not helpful for your use case).
The issue with the for-loop (and why Airflow won't allow you to iterate over an XComArg) is that for-loops are evaluated when the DAG code is parsed, which happens outside of runtime, when Airflow does not yet know how many results read_conf() will return. If the number of configurations changes only rarely, then having a for-loop like that iterate over a list kept in a separate file is an option (see the sketch just below), but yes, at scale this can cause performance issues.
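For illustration, a minimal sketch of that static-list variant; the tables_config module and the TABLES name are invented here:

# tables_config.py -- a plain module next to the DAG file:
#
#     TABLES = ["table_a", "table_b", "table_c"]
#
# In the DAG file the loop then iterates over an ordinary Python list at parse
# time, so no XComArg is involved:
from tables_config import TABLES

for table_name in TABLES:
    mul_two(add_one(table_name))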
The best solution in my opinion is to use dynamic task group mapping which was added in Airflow 2.5.0:
All mapped task groups will run in parallel, one for every input from read_conf(), so for every add_one its mul_two will run immediately. I put the code for this below.
One note: you will not be able to see the mapped task groups in the Airflow UI or access their logs just yet; the feature is still quite new and the UI extension should come in 2.5.1. That is why I added a task downstream of the mapped task groups that prints out the list of results of the mul_two tasks, so you can check whether it is working.
from airflow import DAG
from airflow.decorators import task, task_group
from datetime import datetime
from time import sleep

with DAG(
    dag_id="simple_mapping",
    schedule=None,
    start_date=datetime(2022, 12, 22),
    catchup=False
) as dag:

    @task
    def read_conf():
        return [10, 20, 30]

    @task_group
    def calculations(x):

        @task
        def add_one(x: int):
            sleep(x)
            return x + 1

        @task()
        def mul_two(x: int):
            return x * 2

        mul_two(add_one(x))

    @task
    def pull_xcom(**context):
        pulled_xcom = context["ti"].xcom_pull(
            task_ids=['calculations.mul_two'],
            key="return_value"
        )
        print(pulled_xcom)

    calculations.expand(x=read_conf()) >> pull_xcom()
Hope this helps! :)
PS: you might want to set catchup=False unless you want to backfill a few weeks of tasks.
By transactional-like I mean: either all tasks in a set are successful, or none of them are and they should be retried from the first one.
Consider two operators, A and B, A downloads a file, B reads it and performs some actions.
A successfully executes, but before B comes into play a blackout/corruption occurs, and B cannot process the file, so it fails.
To pass, it needs A to re-download the file, but since A is in success status, there is no direct way to do that automatically; a human must go and clear A's status.
Now I wonder if there is a known way to clear the statuses of task instances up to a certain task whenever some task fails?
I know I can use a hook, as in "clear an upstream task in airflow within the dag", but that looks a bit ugly.
This feature is not supported by airflow, but you can use on_failure_callback to achieve that:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator

def clear_task_A(context):
    execution_date = context["execution_date"]
    clear_task_A = BashOperator(
        task_id='clear_task_A',
        bash_command=f'airflow tasks clear -s {execution_date} -t A -d -y <dag_id>'
    )  # -s: start date, -t: task_id regex, -d: include downstream, -y: yes
    return clear_task_A.execute(context=context)

with DAG(...) as dag:
    A = DummyOperator(
        task_id='A'
    )
    B = BashOperator(
        task_id='B',
        bash_command='exit 1',
        on_failure_callback=clear_task_A
    )
    A >> B
I am reading list of elements from an external file and looping over elements to create a series of tasks.
For example, if there are 2 elements in the file - [A, B]. There will be 2 series of tasks:
A1 -> A2 ..
B1 -> B2 ...
This element-reading logic is not part of any task but lives in the DAG itself, so the Scheduler calls it many times a day while parsing the DAG file. I want it to be called only at DAG runtime.
Wondering if there is already a pattern for such kind of use cases?
Depending on your requirements, if what you are looking for is to avoid reading a file many times, but you don't mind reading from the metadata database as many times instead, then you could change your approach to use Variables as the source of iteration to dynamically create tasks.
A basic example could be performing the file reading inside a PythonOperator and setting, in that same callable, the Variable you will iterate over later on:
sample_file.json:
{
    "cities": [ "London", "Paris", "BA", "NY" ]
}
Task definition:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.utils.task_group import TaskGroup
import json

def _read_file():
    with open('dags/sample_file.json') as f:
        data = json.load(f)
    Variable.set(key='list_of_cities',
                 value=data['cities'], serialize_json=True)
    print('Loading Variable from file...')

def _say_hello(city_name):
    print('hello from ' + city_name)

with DAG('dynamic_tasks_from_var', schedule_interval='@once',
         start_date=days_ago(2),
         catchup=False) as dag:

    read_file = PythonOperator(
        task_id='read_file',
        python_callable=_read_file
    )
Then you could read from that variable and create the dynamic tasks. (It's important to set a default_var). The TaskGroup is optional.
# Top-level code
updated_list = Variable.get('list_of_cities',
                            default_var=['default_city'],
                            deserialize_json=True)
print(f'Updated LIST: {updated_list}')

with TaskGroup('dynamic_tasks_group',
               prefix_group_id=False,
               ) as dynamic_tasks_group:
    for index, city in enumerate(updated_list):
        say_hello = PythonOperator(
            task_id=f'say_hello_from_{city}',
            python_callable=_say_hello,
            op_kwargs={'city_name': city}
        )

# DAG level dependencies
read_file >> dynamic_tasks_group
In the Scheduler logs, you will only find:
INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']
With this approach, the only top-level code, and hence the only thing the Scheduler reads continuously, is the call to the Variable.get() method. If you need to read from many variables, it's important to remember that it's recommended to store them in one single JSON value to avoid constantly creating connections to the metadata database (example in this article).
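As a small sketch of that single-JSON-Variable recommendation (the dag_config key and its contents are made up here):

from airflow.models import Variable

# One Variable.get() per parse instead of one call per key; 'dag_config' is just
# an example key holding everything the DAG needs.
config = Variable.get('dag_config',
                      default_var={'cities': ['default_city'], 'retries': 1},
                      deserialize_json=True)
updated_list = config['cities']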
Update:
As of 11-2021 this approach is considered a "quick and dirty" kind of solution.
Does it work? Yes, totally. Is it production quality code? No.
What's wrong with it? The DB is accessed every time the Scheduler parses the file, by default every 30 seconds, and has nothing to do with your DAG execution. Full details on Airflow Best practices, top-level code.
How can this be improved? Consider if any of the recommended ways about dynamic DAG generation applies to your needs.
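For example, on Airflow 2.3+ one option that fits this kind of use case is dynamic task mapping (the same expand mechanism discussed earlier in this thread), which keeps the file read inside a task instead of in top-level code. A rough sketch, with made-up names:

from airflow.decorators import dag, task
from datetime import datetime
import json

@dag(schedule_interval=None, start_date=datetime(2023, 1, 1), catchup=False)
def dynamic_cities():

    @task
    def read_cities():
        # runs at execution time inside a task, not while the Scheduler parses the file
        with open('dags/sample_file.json') as f:
            return json.load(f)['cities']

    @task
    def say_hello(city_name: str):
        print('hello from ' + city_name)

    # one mapped task instance per city, decided at runtime
    say_hello.expand(city_name=read_cities())

dynamic_cities()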
My idea is to have a task foo which generates a list of inputs (users, reports, log files, etc), and a task is launched for every element in the input list. The goal is to make use of Airflow's retrying and other logic, instead of reimplementing it.
So, ideally, my DAG should look something like this: foo, then one task per element of the generated list, then the rest of the pipeline.
The only variable here is the number of tasks generated. I want to do some more tasks after all of these are completed, so spinning up a new DAG for every task does not seem appropriate.
This is my code:
import json
import random
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1)
}

dag = DAG('dynamic_dag_generator', schedule_interval=None, default_args=default_args)

foo_operator = BashOperator(
    task_id='foo',
    bash_command="echo '%s'" % json.dumps(list(range(0, random.randint(40, 60)))),
    xcom_push=True,
    dag=dag)

def gen_nodes(**kwargs):
    ti = kwargs['ti']
    workers = json.loads(ti.xcom_pull(task_ids='foo'))

    for wid in workers:
        print("Iterating worker %s" % wid)
        op = PythonOperator(
            task_id='test_op_%s' % wid,
            python_callable=lambda: print("Dynamic task!"),
            dag=dag
        )
        op.set_downstream(bar_operator)
        op.set_upstream(dummy_op)

gen_subdag_node_op = PythonOperator(
    task_id='gen_subdag_nodes',
    python_callable=gen_nodes,
    provide_context=True,
    dag=dag
)
gen_subdag_node_op.set_upstream(foo_operator)

dummy_op = DummyOperator(
    task_id='dummy',
    dag=dag
)
dummy_op.set_upstream(gen_subdag_node_op)

bar_operator = DummyOperator(
    task_id='bar',
    dag=dag)
bar_operator.set_upstream(dummy_op)
In the logs, I can see that gen_nodes is executed correctly (i.e. Iterating worker 5, etc). However, the new tasks are not scheduled and there is no evidence that they were executed.
I found related code samples online, such as this, but could not make it work. Am I missing something?
Alternatively, is there a more appropriate approach to this problem (isolating units of work)?
At this point in time, airflow does not support adding/removing a task while the dag is running.
The workflow order will be whatever is evaluated at the start of the dag run.
See the second paragraph here.
This means you cannot add/remove tasks based on something that happens in the run. You can add X tasks in a for loop based on something not related to the run, but after the run has begun there is no changing the workflow shape/order.
Many times you can instead use a BranchPythonOperator to make a decision during a dag run (and these decisions can be based on your xcom values), but it must be a decision to go down a branch that already exists in the workflow; there is a sketch of this below.
Dag runs and dag definitions are separated in airflow in ways that aren't entirely intuitive, but more or less anything that is created/generated inside a dag run (xcom, dag_run.conf, etc.) is not usable for defining the dag itself.
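For illustration, a rough sketch of that BranchPythonOperator idea against the DAG above; the branch target task_ids big_batch_path and small_batch_path are made up and would have to already exist as tasks in the DAG:

import json
from airflow.operators.python_operator import BranchPythonOperator

def choose_path(**kwargs):
    # The decision can use XCom pushed earlier in the run, but both targets must
    # already be defined in the DAG; only the choice is dynamic, not the tasks.
    workers = json.loads(kwargs['ti'].xcom_pull(task_ids='foo'))
    return 'big_batch_path' if len(workers) > 50 else 'small_batch_path'

branch_op = BranchPythonOperator(
    task_id='choose_path',
    python_callable=choose_path,
    provide_context=True,  # needed on older airflow versions; implicit on 2.x
    dag=dag
)

foo_operator >> branch_op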
I need the status of a task (whether it is running, up_for_retry, or failed) within the same dag. So I tried to get it using the code below, though I got no output...
Auto = PythonOperator(
    task_id='test_sleep',
    python_callable=execute_on_emr,
    op_kwargs={'cmd': 'python /home/hadoop/test/testsleep.py'},
    dag=dag)

logger.info(Auto)
The intention is to kill certain running tasks once a particular task on airflow completes.
The question is: how do I get the state of a task, i.e. whether it is running, failed, or success?
I am doing something similar. I need to check for one task if the previous 10 runs of another task were successful.
taky2 sent me on the right path. It is actually fairly easy:
from airflow.models import TaskInstance
ti = TaskInstance(*your_task*, execution_date)
state = ti.current_state()
As I want to check that within the dag, it is not necessary to specify the dag.
I simply created a function to loop through the past n_days and check the status.
def check_status(**kwargs):
    last_n_days = 10
    for n in range(0, last_n_days):
        date = kwargs['execution_date'] - timedelta(n)
        # my_task is the task object you defined within the DAG rather than the
        # task_id (as in the example below: check_success_task rather than
        # 'check_success_days_before')
        ti = TaskInstance(*my_task*, date)
        state = ti.current_state()
        if state != 'success':
            raise ValueError('Not all previous tasks successfully completed.')
When you call the function make sure to set provide_context.
check_success_task = PythonOperator(
    task_id='check_success_days_before',
    python_callable=check_status,
    provide_context=True,
    dag=dag
)
UPDATE:
When you want to check a task from another dag, you need to fetch it like this:
from airflow import configuration as conf
from airflow.models import DagBag, TaskInstance
dag_folder = conf.get('core','DAGS_FOLDER')
dagbag = DagBag(dag_folder)
check_dag = dagbag.dags[*my_dag_id*]
my_task = check_dag.get_task(*my_task_id*)
ti = TaskInstance(my_task, date)
Apparently there is also an api-call by now doing the same thing:
from airflow.api.common.experimental.get_task_instance import get_task_instance
ti = get_task_instance(*my_dag_id*, *my_task_id*, date)
Take a look at the code responsible for the command line interface operation suggested by Priyank.
https://github.com/apache/incubator-airflow/blob/2318cea74d4f71fba353eaca9bb3c4fd3cdb06c0/airflow/bin/cli.py#L581
def task_state(args):
    dag = get_dag(args)
    task = dag.get_task(task_id=args.task_id)
    ti = TaskInstance(task, args.execution_date)
    print(ti.current_state())
Hence, it seems you should easily be able to accomplish this within your DAG codebase using similar code.
Alternatively you could execute these CLI operations from within your code using python's subprocess library.
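For completeness, a quick sketch of that subprocess route, assuming the old task_state CLI syntax shown further down in this thread and made-up dag/task ids:

import subprocess

# Shells out to the (pre-2.0) CLI and captures its stdout; replace the dag id,
# task id and execution date with real values.
result = subprocess.run(
    ["airflow", "task_state", "my_dag", "my_task", "2018-01-01T00:00:00"],
    capture_output=True, text=True, check=True
)
print(result.stdout.strip())  # e.g. "success", "running" or "failed"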
Okay, I think I know what you're doing and I don't really agree with it, but I'll start with an answer.
A straightforward, but hackish, way would be to query the task_instance table. I'm in postgres, but the structure should be the same. Start by grabbing the task_ids and state of the task you're interested in with a db call.
SELECT task_id, state
FROM task_instance
WHERE dag_id = '<dag_id_attrib>'
AND execution_date = '<execution_date_attrib>'
AND task_id = '<task_to_check>'
That should give you the state (and name, for reference) of the task you're trying to monitor. State is stored as a simple lowercase string.
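If you would rather run that query from Python than from psql, a sketch using PostgresHook could look like this; the connection id airflow_db and the placeholder ids are assumptions:

from airflow.hooks.postgres_hook import PostgresHook

# Assumes an Airflow connection that points at the metadata database.
hook = PostgresHook(postgres_conn_id='airflow_db')
rows = hook.get_records(
    """
    SELECT task_id, state
    FROM task_instance
    WHERE dag_id = %s
      AND execution_date = %s
      AND task_id = %s
    """,
    parameters=('<dag_id_attrib>', '<execution_date_attrib>', '<task_to_check>')
)
print(rows)  # e.g. [('<task_to_check>', 'success')]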
You can use the command line Interface for this:
airflow task_state [-h] [-sd SUBDIR] dag_id task_id execution_date
For more on this you can refer official airflow documentation:
http://airflow.incubator.apache.org/cli.html