Is there a way to get the "params" into a task group, the same way they are available in a plain task?
@dag(dag_id='start-matrix',
     params={"a": 1, "b": 2},
     schedule_interval=None,
     start_date=datetime(2021, 4, 5, 15, 0))
def startMatrix():

    @task()
    def mytask(params=None):
        # This works
        a = params["a"]
        pass

    @task_group()
    def mygroup(params=None):
        # This throws an error that params is not defined.
        a = params["a"]
        pass
    pass
When I load this DAG file, I get an error that params is not defined in mygroup. I am running Airflow 2.5.
Is there another way to access the params dict?
The Airflow context is only accessible from tasks at runtime, and a TaskGroup is not a task; it is just a collection of tasks used to group them in the UI.
But params is accessible from the TaskGroup tasks:
@task_group()
def mygroup():

    @task
    def task1(params=None):
        # params goes on the task, not on the group; it is injected when the task runs
        return params["a"]

    task1()
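Putting that together with the DAG from the question, a minimal sketch (hedged, not verified against a running deployment) could look like this:
from datetime import datetime
from airflow.decorators import dag, task, task_group

@dag(dag_id='start-matrix',
     params={"a": 1, "b": 2},
     schedule_interval=None,
     start_date=datetime(2021, 4, 5, 15, 0))
def startMatrix():

    @task_group()
    def mygroup():

        @task
        def task1(params=None):
            # the task receives params at runtime
            return params["a"]

        task1()

    mygroup()

startMatrix()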
Addressing your comment to Hussein's answer: maybe dynamically mapping over task groups could be the answer?
As of Airflow 2.5.0 you can do the following:
@task_group(
    group_id="group1"
)
def tg1(my_num):

    @task
    def add_one(num):
        return num + 1

    add_one(my_num)

tg1.expand(my_num=[0, 1, 2, 3, 4])
This would create 5 parallel task groups (one for each element in the list provided). See also these docs. There are some caveats in that you can't see these task groups yet in the Airflow UI; this is soon to be added (#28392, #28208). Also, there is some unexpected behavior if an upstream task provides the list over which you map; this too should be fixed soon (#28592).
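For reference, mapping over a list produced by an upstream task (the case with the caveat above) would look roughly like the following hedged sketch, reusing tg1 from the answer; get_nums is a hypothetical task added purely for illustration:
@task
def get_nums():
    # upstream task producing the list to map over
    return [0, 1, 2, 3, 4]

# replaces the literal list used in the answer above
tg1.expand(my_num=get_nums())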
Disclaimer: I work at Astronomer (who wrote the docs at the first link). :)
Related
My DAG is triggered with a configuration JSON:
{"foo" : "bar"}
I have a Python operator which uses this value:
my_task = PythonOperator(
    task_id="my_task",
    op_kwargs={"foo": "{{ dag_run.conf['foo'] }}"},
    python_callable=lambda foo: print(foo))
I’d like to replace it with a TaskFlow task…
@task
def my_task():
    # how to get foo??
How can I get a reference to context, dag_run, or otherwise get to the configuration JSON from here?
There are several ways to do this using the TaskFlow API:
import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context

@dag(start_date=datetime.datetime(2023, 1, 1), schedule=None)
def so_75303816():

    @task
    def example_1(**context):
        foo = context["dag_run"].conf["foo"]
        print(foo)

    @task
    def example_2(dag_run=None):
        foo = dag_run.conf["foo"]
        print(foo)

    @task
    def example_3():
        context = get_current_context()
        foo = context["dag_run"].conf["foo"]
        print(foo)

    @task
    def example_4(params=None):
        foo = params["foo"]
        print(foo)

    example_1()
    example_2()
    example_3()
    example_4()

so_75303816()
Depending on your needs/preference, you can use one of the following examples:
example_1: You get all task instance context variables and have to extract "foo".
example_2: You explicitly state, via the function arguments, that you only want dag_run from the task instance context. Note that such arguments must default to None.
example_3: You can also fetch the task instance context variables from inside a task using airflow.operators.python.get_current_context().
example_4: DAG run context is also available via a variable named "params".
For more information, see https://airflow.apache.org/docs/apache-airflow/stable/tutorial/taskflow.html#accessing-context-variables-in-decorated-tasks and https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables.
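For completeness, one hedged way to exercise these examples locally with a run configuration is the dag.test() helper introduced in Airflow 2.5; the run_conf argument name is an assumption here, so check the signature in your version:
if __name__ == "__main__":
    # build a DAG instance and run it in-process with a fake "trigger with config"
    so_75303816().test(run_conf={"foo": "bar"})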
I have a use case where I want to run dynamic tasks.
The expectation is:
Task1 (output = list of dicts) -> Task2(a) -> Task3(a)
                               -> Task2(b) -> Task3(b)
Task2 and Task3 need to run for every object in the list, and for each object they need to run sequentially.
You can connect multiple dynamically mapped tasks. For example:
import datetime
from airflow import DAG
from airflow.decorators import task
with DAG(dag_id="so_74848271", schedule_interval=None, start_date=datetime.datetime(2022, 1, 1)):

    @task
    def start():
        return [{"donald": "duck"}, {"bugs": "bunny"}, {"mickey": "mouse"}]

    @task
    def create_name(cartoon):
        first_name = list(cartoon.keys())[0]
        last_name = list(cartoon.values())[0]
        return f"{first_name} {last_name}"

    @task
    def print_name(full_name):
        print(f"Hello {full_name}")

    print_name.expand(full_name=create_name.expand(cartoon=start()))
The task create_name will generate one task for each dict in the list returned by start. And the print_name task will generate one task for each result of create_name.
In the graph view of this DAG, start is followed by the mapped create_name task and then the mapped print_name task, each expanded into three mapped instances.
I am new to Apache Airflow and I am trying to figure out how to unit/integration test my DAGs/tasks.
Here is my directory structure:
/airflow
/dags
/tests/dags
I created a simple DAG which has a task that reads data from a Postgres table:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

def read_files(ti):
    sql = "select id from files where status='NEW'"
    pg_hook = PostgresHook(postgres_conn_id="metadata")
    connection = pg_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(sql)
    files = cursor.fetchall()
    ti.xcom_push(key="files_to_process", value=files)

with DAG(dag_id="check_for_new_files", schedule_interval=timedelta(minutes=30),
         start_date=datetime(2022, 9, 1), catchup=False) as dag:
    check_files = PythonOperator(task_id="read_files",
                                 python_callable=read_files)
Is it possible to test this by mocking the Airflow/Postgres connection, etc.?
Yes, it is possible to test DAGs. Here is an example of some basic checks you can do:
import unittest
from airflow.models import DagBag

class TestCheckForNewFilesDAG(unittest.TestCase):
    """Checks for the check_for_new_files DAG"""

    def setUp(self):
        self.dagbag = DagBag()

    def test_task_count(self):
        """Check the task count of the dag"""
        dag_id = 'check_for_new_files'
        dag = self.dagbag.get_dag(dag_id)
        self.assertEqual(len(dag.tasks), 1)

    def test_contain_tasks(self):
        """Check the tasks contained in the check_for_new_files dag"""
        dag_id = 'check_for_new_files'
        dag = self.dagbag.get_dag(dag_id)
        tasks = dag.tasks
        task_ids = list(map(lambda task: task.task_id, tasks))
        self.assertListEqual(task_ids, ['read_files'])

    def test_dependencies_of_read_files_task(self):
        """Check the task dependencies of the read_files task"""
        dag_id = 'check_for_new_files'
        dag = self.dagbag.get_dag(dag_id)
        read_files_task = dag.get_task('read_files')

        # to be used in case you have upstream tasks
        upstream_task_ids = list(map(lambda task: task.task_id,
                                     read_files_task.upstream_list))
        self.assertListEqual(upstream_task_ids, [])
        downstream_task_ids = list(map(lambda task: task.task_id,
                                       read_files_task.downstream_list))
        self.assertListEqual(downstream_task_ids, [])

suite = unittest.TestLoader().loadTestsFromTestCase(TestCheckForNewFilesDAG)
unittest.TextTestRunner(verbosity=2).run(suite)
For verifying that the data from the files is manipulated and moved correctly, the documentation suggests self-checks:
https://airflow.apache.org/docs/apache-airflow/2.0.1/best-practices.html#self-checks
Self-Checks
You can also implement checks in a DAG to make sure the tasks are producing the results as expected. As an example, if you have a task that pushes data to S3, you can implement a check in the next task. For example, the check could make sure that the partition is created in S3 and perform some simple checks to determine if the data is correct.
I think this is an excellent and straightforward way to verify a specific task.
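As a hedged sketch of what such a self-check could look like for the DAG in the question (check_files_found is a hypothetical task added here for illustration; the operator belongs inside the with DAG(...) block):
def check_files_found(ti):
    # pull the list pushed by read_files and fail the task if it is empty
    files = ti.xcom_pull(task_ids="read_files", key="files_to_process")
    if not files:
        raise ValueError("No new files found in the metadata table")

check_new_files = PythonOperator(
    task_id="check_files_found",
    python_callable=check_files_found,
)
check_files >> check_new_files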
Here are some other useful links you can use:
https://www.youtube.com/watch?v=ANJnYbLwLjE
The following ones talk about mocking:
https://www.astronomer.io/guides/testing-airflow/
https://medium.com/@montadhar/apache-airflow-testing-guide-7956a3f4bbf5
https://godatadriven.com/blog/testing-and-debugging-apache-airflow/
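For the mocking part of the question specifically, a minimal sketch could look like the following (the module path dags.check_for_new_files is a placeholder for wherever read_files actually lives):
import unittest
from unittest import mock

from dags.check_for_new_files import read_files  # hypothetical import path

class TestReadFiles(unittest.TestCase):

    @mock.patch("dags.check_for_new_files.PostgresHook")
    def test_read_files_pushes_xcom(self, mock_hook):
        # make the mocked hook return a canned result set
        mock_cursor = mock_hook.return_value.get_conn.return_value.cursor.return_value
        mock_cursor.fetchall.return_value = [(1,), (2,)]

        ti = mock.MagicMock()  # stand-in for the task instance
        read_files(ti)

        ti.xcom_push.assert_called_once_with(key="files_to_process", value=[(1,), (2,)])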
I am trying to write unit tests for some of the tasks built with the Airflow TaskFlow API. I tried multiple approaches, for example creating a dagrun or only running the task function, but nothing is helping.
Here is a task where I download a file from S3; there is more stuff going on, but I removed that for this example.
@task()
def updates_process(files):
    context = get_current_context()
    try:
        updates_file_path = utils.download_file_from_s3_bucket(files.get("updates_file"))
    except FileNotFoundError as e:
        log.error(e)
        return
    # Do something else
Now I am trying to write a test case that exercises this except clause. The following is one of the examples I started with:
class TestAccountLinkUpdatesProcess(TestCase):

    @mock.patch("dags.delta_load.updates.log")
    @mock.patch("dags.delta_load.updates.get_current_context")
    @mock.patch("dags.delta_load.updates.utils.download_file_from_s3_bucket")
    def test_file_not_found_error(self, download_file_from_s3_bucket, get_current_context, log):
        download_file_from_s3_bucket.side_effect = FileNotFoundError
        task = updates_process({"updates_file": "path/to/file.csv"})
        get_current_context.assert_called_once()
        log.error.assert_called_once()
I also tried creating a dagrun, as shown in the example here in the docs, and fetching the task from the dagrun, but that also didn't help.
I was struggling to do this myself, but I found that the decorated tasks have a .function attribute: https://github.dev/apache/airflow/blob/be7cb1e837b875f44fcf7903329755245dd02dc3/airflow/decorators/base.py#L522
You can then use .function to call the actual function. Using your example:
class TestAccountLinkUpdatesProcess(TestCase):

    @mock.patch("dags.delta_load.updates.log")
    @mock.patch("dags.delta_load.updates.get_current_context")
    @mock.patch("dags.delta_load.updates.utils.download_file_from_s3_bucket")
    def test_file_not_found_error(self, download_file_from_s3_bucket, get_current_context, log):
        download_file_from_s3_bucket.side_effect = FileNotFoundError
        task = dags.delta_load.updates.updates_process
        # Call the function for testing
        task.function({"updates_file": "path/to/file.csv"})
        get_current_context.assert_called_once()
        log.error.assert_called_once()
This saves you from having to set up any of the DAG infrastructure and lets you just run the Python function as intended!
This is what I could figure out. Not sure if this is the right thing but it works.
class TestAccountLinkUpdatesProcess(TestCase):
    TASK_ID = "updates_process"

    @classmethod
    def setUpClass(cls) -> None:
        cls.dag = dag_delta_load()

    @mock.patch("dags.delta_load.updates.log")
    @mock.patch("dags.delta_load.updates.get_current_context")
    @mock.patch("dags.delta_load.updates.utils.download_file_from_s3_bucket")
    def test_file_not_found_error(self, download_file_from_s3_bucket, get_current_context, log):
        download_file_from_s3_bucket.side_effect = FileNotFoundError
        task = self.dag.get_task(task_id=self.TASK_ID)
        task.op_args = [{"updates_file": "file.csv"}]
        task.execute(context={})
        log.error.assert_called_once()
UPDATE: Based on the answer of @AetherUnbound I did some investigation and found that we can use task.__wrapped__() to call the actual python function.
class TestAccountLinkUpdatesProcess(TestCase):

    @mock.patch("dags.delta_load.updates.log")
    @mock.patch("dags.delta_load.updates.get_current_context")
    @mock.patch("dags.delta_load.updates.utils.download_file_from_s3_bucket")
    def test_file_not_found_error(self, download_file_from_s3_bucket, get_current_context, log):
        download_file_from_s3_bucket.side_effect = FileNotFoundError
        updates_process.__wrapped__({"updates_file": "file.csv"})
        log.error.assert_called_once()
I do not understand how callables (functions called as specified by PythonOperator) in Airflow should have their parameter list set. I have seen them with no parameters, with named params, or with **kwargs. It seems I can always add "ti" or **allargs as parameters, where ti is used for task instance info and ds for the execution date. But apparently my callables do not NEED params: they can simply be "def function():". If I wrote a regular Python function func() instead of func(**kwargs), it would fail at runtime when called unless no params were passed. Airflow always seems to pass ti, so how can the callable function signature not require it? The example below is from a training site where the _process_data func gets the ti, but _extract_bitcoin_price() does not. I was thinking that is because of the xcom push, but ti is ALWAYS available it seems, so how can "def somefunc()" ever work? I tried looking at the PythonOperator source code, but I am unclear how this works or what the best practices are for including parameters in a callable. Thanks!!
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
import json
from typing import Dict
import requests
import logging

API = "https://api.coingecko.com/api/v3/simple/price?ids=bitcoin&vs_currencies=usd&include_market_cap=true&include_24hr_vol=true&include_24hr_change=true&include_last_updated_at=true"
def _extract_bitcoin_price():
    return requests.get(API).json()['bitcoin']

def _process_data(ti):
    response = ti.xcom_pull(task_ids='extract_bitcoin_price')
    logging.info(response)
    processed_data = {'usd': response['usd'], 'change': response['usd_24h_change']}
    ti.xcom_push(key='processed_data', value=processed_data)

def _store_data(ti):
    data = ti.xcom_pull(task_ids='process_data', key='processed_data')
    logging.info(f"Store: {data['usd']} with change {data['change']}")
with DAG('classic_dag', schedule_interval='@daily', start_date=datetime(2021, 12, 1), catchup=False) as dag:

    extract_bitcoin_price = PythonOperator(
        task_id='extract_bitcoin_price',
        python_callable=_extract_bitcoin_price
    )

    process_data = PythonOperator(
        task_id='process_data',
        python_callable=_process_data
    )

    store_data = PythonOperator(
        task_id='store_data',
        python_callable=_store_data
    )

    extract_bitcoin_price >> process_data >> store_data
I tried callables with no params, somefunc(), expecting to get an error saying too many params were passed, but it succeeded. Adding somefunc(ti) also works! How can both work?
I think what you are missing is that Airflow allows passing the context of the task to the python callable (as you can see, one of them is ti). These are additional useful parameters that Airflow provides, and you can use them in your task.
In older Airflow versions the user had to set provide_context=True for that to work:
process_data = PythonOperator(
    ...,
    provide_context=True
)
Since Airflow >= 2.0 there is no need to use provide_context; Airflow handles it under the hood.
When you see Python callable signatures like:
def func(ti, **kwargs):
    ...
This means that ti is "unpacked" from the kwargs. You can also do:
def func(**kwargs):
    ti = kwargs['ti']
EDIT:
I think what you are missing is that while you write:
def func():
    ...

store_data = PythonOperator(
    task_id='task',
    python_callable=func
)
Airflow does more than just call func. The code being executed is the execute() function of PythonOperator, and this function calls the python callable you provided with args and kwargs.
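Conceptually (this is a simplified sketch, not Airflow's actual implementation), the operator inspects the callable's signature and only passes the context keys the callable can accept, which is why both somefunc() and somefunc(ti) work:
import inspect

def call_like_airflow(python_callable, context):
    # simplified illustration: pass each callable only the context keys it accepts
    sig = inspect.signature(python_callable)
    if any(p.kind == inspect.Parameter.VAR_KEYWORD for p in sig.parameters.values()):
        return python_callable(**context)  # def func(**kwargs): receives everything
    accepted = {k: v for k, v in context.items() if k in sig.parameters}
    return python_callable(**accepted)     # def func(): gets nothing, def func(ti): gets ti

call_like_airflow(lambda: print("no context needed"), {"ti": "fake_ti", "ds": "2021-12-01"})
call_like_airflow(lambda ti: print(f"got {ti}"), {"ti": "fake_ti", "ds": "2021-12-01"})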