Unable to pass XCom in custom operators in Airflow

I have a simple, linear DAG (created using Airflow 2.0) with two tasks. Each task has a custom operator that extends BaseOperator. Here is the code for the DAG and the operators:
class Operator1(BaseOperator):

    @apply_defaults
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def execute(self, context):
        ...
        logging.info('First task')
        context['task_instance'].xcom_push(key="payload", value=data)
        return data


class Operator2(BaseOperator):

    @apply_defaults
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def execute(self, context):
        ...
        logging.info("context is ", context)
        parameters = context['task_instance'].xcom_pull(key="payload", value=data)


with DAG('dag_1', default_args=DEFAULT_ARGS, schedule_interval=None) as dag:
    TASK_1 = Operator1(
        task_id='task_1',
        do_xcom_push=True)
    TASK_2 = Operator2(
        task_id='task_2',
        do_xcom_push=True)

    TASK_1 >> TASK_2
When I run the DAG, I find that the context used for getting the XCom values is empty. I have searched through many answers on Stack Overflow and tried the approaches mentioned in them, but they didn't work.
I would really appreciate a hint on the issue: how do I push and pull XCom values in custom operators?

I took your code and ran it; the first problem was that start_date wasn't defined, so it ended in an exception:
Exception has occurred: AirflowException (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Task is missing the start_date parameter
Also, in the Operator1 class, the data variable is not defined. I guess you may have missed these details when you put the code example together.
Other than that, the code worked, but I think you should consider passing the task_ids parameter when doing the xcom_pull operation. From the TaskInstance.xcom_pull method description:

    :param task_ids: Only XComs from tasks with matching ids will be
        pulled. Can pass None to remove the filter.

Here is the code of a working example; note that I use two equivalent methods to perform the XCom operations:
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.utils.decorators import apply_defaults
from airflow.models import BaseOperator


class Operator1(BaseOperator):

    @apply_defaults
    def __init__(self, *args, **kwargs) -> None:
        super(Operator1, self).__init__(*args, **kwargs)

    def execute(self, context):
        print('First task')
        data = "valuable_data"
        more_data = "more_valueable_data"
        context['task_instance'].xcom_push(key="payload", value=data)
        self.xcom_push(context, "more_data", more_data)
        return data


class Operator2(BaseOperator):

    @apply_defaults
    def __init__(self, *args, **kwargs) -> None:
        super(Operator2, self).__init__(*args, **kwargs)

    def execute(self, context):
        # print(f"context is {context}")
        data = context['task_instance'].xcom_pull(
            "task_1",
            key="payload")
        more_data = self.xcom_pull(context, "task_1", key="more_data")
        print(f"Obtained data: {data}")
        print(f"Obtained more_data: {more_data}")


with DAG('dag_1',
         default_args={'owner': 'airflow'},
         start_date=days_ago(1),
         catchup=False,
         schedule_interval=None) as dag:

    TASK_1 = Operator1(
        task_id='task_1'
    )
    TASK_2 = Operator2(
        task_id='task_2'
    )

    TASK_1 >> TASK_2
Log from Task_2:
[2021-06-15 12:55:01,206] {taskinstance.py:1255} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=dag_1
AIRFLOW_CTX_TASK_ID=task_2
AIRFLOW_CTX_EXECUTION_DATE=2021-06-14T00:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=backfill__2021-06-14T00:00:00+00:00
Obtained data: valuable_data
Obtained more_data: more_valueable_data
[2021-06-15 12:55:01,227] {taskinstance.py:1159} INFO - Marking task as SUCCESS. dag_id=dag_1, task_id=task_2, execution_date=20210614T000000, start_date=20210615T120402, end_date=20210615T125501
Side notes: I changed the __init__ method to accept *args as well. I'm using print, but it could be done using the Airflow logger as self.log.info('msg').
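As a small illustration of that side note, the pull in Operator2.execute could go through the operator's built-in logger instead of print (just a sketch of the same call as above):

from airflow.models import BaseOperator


class Operator2(BaseOperator):

    def execute(self, context):
        data = context["task_instance"].xcom_pull("task_1", key="payload")
        # self.log is the logger BaseOperator already provides, so this line
        # shows up in the task log alongside the INFO lines above
        self.log.info("Obtained data: %s", data)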
Let me know if that worked for you!

Related

Get dag_run context in Airflow TaskFlow task

My DAG is started with the configuration JSON:
{"foo" : "bar"}
I have a PythonOperator which uses this value:
my_task = PythonOperator(
    task_id="my_task",
    op_kwargs={"foo": "{{ dag_run.conf['foo'] }}"},
    python_callable=lambda foo: print(foo))
I’d like to replace it with a TaskFlow task…
@task
def my_task():
    # how to get foo??
How can I get a reference to context, dag_run, or otherwise get to the configuration JSON from here?
There are several ways to do this using the TaskFlow API:
import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context


@dag(start_date=datetime.datetime(2023, 1, 1), schedule=None)
def so_75303816():

    @task
    def example_1(**context):
        foo = context["dag_run"].conf["foo"]
        print(foo)

    @task
    def example_2(dag_run=None):
        foo = dag_run.conf["foo"]
        print(foo)

    @task
    def example_3():
        context = get_current_context()
        foo = context["dag_run"].conf["foo"]
        print(foo)

    @task
    def example_4(params=None):
        foo = params["foo"]
        print(foo)

    example_1()
    example_2()
    example_3()
    example_4()


so_75303816()
Depending on your needs/preference, you can use one of the following examples:
example_1: You get all task instance context variables and have to extract "foo".
example_2: You explicitly state via the arguments that you want only dag_run from the task instance context variables. Note that such arguments must default to None.
example_3: You can also fetch the task instance context variables from inside a task using airflow.operators.python.get_current_context().
example_4: DAG run context is also available via a variable named "params".
For more information, see https://airflow.apache.org/docs/apache-airflow/stable/tutorial/taskflow.html#accessing-context-variables-in-decorated-tasks and https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables.
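If the DAG may also be run without any conf payload, one option (a sketch, assuming the default dag_run_conf_overrides_params setting, which lets dag_run.conf override declared params) is to declare a default value in params on the @dag decorator so example_4 still resolves:

import datetime

from airflow.decorators import dag, task


# hypothetical variant of the DAG above with a declared default for "foo"
@dag(start_date=datetime.datetime(2023, 1, 1), schedule=None,
     params={"foo": "a-default-value"})
def so_75303816_with_default():

    @task
    def example_4(params=None):
        # prints the conf value when one is supplied, the default otherwise
        print(params["foo"])

    example_4()


so_75303816_with_default()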

Using airflow dag_run.conf inside custom operator

We created a custom Airflow operator based on EMRContainerOperator, and we need to make a decision based on a config passed using the Airflow UI.
My custom operator:
from airflow.providers.amazon.aws.operators.emr_containers import EMRContainerOperator
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Sequence
from uuid import uuid4
from airflow.utils.decorators import apply_defaults


class EmrBatchProcessorOperator(EMRContainerOperator):
    template_fields: Sequence[str] = (
        "name",
        "virtual_cluster_id",
        "execution_role_arn",
        "release_label",
        "job_driver",
        "operation_type"
    )

    @apply_defaults
    def __init__(
            self,
            operation_type,
            *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.operation_type = operation_type

        if self.operation_type == 'full':
            number_of_pods = 10
        else:
            number_of_pods = 5

        BASE_CONSUMER_DRIVER_ARG = {
            "sparkSubmitJobDriver": {
                "entryPoint": "s3://bucket/batch_processor_engine/batch-processor-engine_2.12-3.0.1_0.28.jar",
                "entryPointArguments": ["group_name=courier_api_group01"],
                "sparkSubmitParameters": f"--conf spark.executor.instances={ number_of_pods } --conf spark.executor.memory=32G --conf spark.executor.cores=5 --conf spark.driver.cores=1 --conf spark.driver.memory=12G --conf spark.sql.broadcastTimeout=2000 --class TableProcessorWrapper"
            }
        }

        self.job_driver = BASE_CONSUMER_DRIVER_ARG
This is the way that I call my operator:
with DAG(
        dag_id="batch_processor_model_dag",
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False
) as dag:
    start = DummyOperator(task_id='start', dag=dag)
    end = DummyOperator(task_id='end', dag=dag, trigger_rule='none_failed')

    base_consumer = EmrBatchProcessorOperator(
        task_id="base_consumer",
        virtual_cluster_id=VIRTUAL_CLUSTER_ID,
        execution_role_arn=JOB_ROLE_ARN,
        configuration_overrides=CONFIGURATION_OVERRIDES_ARG,
        release_label="emr-6.5.0-latest",
        job_driver={},
        name="pi.py",
        operation_type='{{ dag_run.conf["operation_type"] }}'
    )

    start >> base_consumer >> end
But this code doesn't work; I can't use the dag_run.conf value.
Could you help me?
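One thing worth noting here: fields listed in template_fields are only rendered shortly before execute() runs, so inside __init__ the operation_type attribute still holds the raw Jinja string rather than the value from dag_run.conf. A sketch (untested, with a placeholder jar path) that defers the conf-dependent branching to execute():

from typing import Sequence

from airflow.providers.amazon.aws.operators.emr_containers import EMRContainerOperator


class EmrBatchProcessorOperator(EMRContainerOperator):
    template_fields: Sequence[str] = (
        "name",
        "virtual_cluster_id",
        "execution_role_arn",
        "release_label",
        "job_driver",
        "operation_type",
    )

    def __init__(self, operation_type, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.operation_type = operation_type

    def execute(self, context):
        # By the time execute() runs, self.operation_type has been rendered
        # from '{{ dag_run.conf["operation_type"] }}' to the actual value.
        number_of_pods = 10 if self.operation_type == "full" else 5
        self.job_driver = {
            "sparkSubmitJobDriver": {
                "entryPoint": "s3://bucket/app.jar",  # placeholder path
                "sparkSubmitParameters": f"--conf spark.executor.instances={number_of_pods}",
            }
        }
        return super().execute(context)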

Using dag_run variables in airflow Dag

I am trying to use Airflow variables to determine whether to execute a task or not. I have tried this and it's not working:
if '{{ params.year }}' == '{{ params.message }}':
    run_this = DummyOperator(
        task_id='dummy_dag'
    )
I was hoping to get some help making it work. Also, is there a better way of doing something like this in Airflow?
I think a good way to solve this is with a BranchPythonOperator that branches dynamically based on the provided DAG parameters. Consider this example:
Use params to provide the parameters to the DAG (this could also be done from the UI), in this example {"enabled": True}:
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
from airflow.operators.python import get_current_context, BranchPythonOperator


@dag(
    default_args=default_args,
    schedule_interval=None,
    start_date=days_ago(1),
    catchup=False,
    tags=["example"],
    params={"enabled": True},
)
def branch_from_dag_params():

    def _print_enabled():
        context = get_current_context()
        enabled = context["params"].get("enabled", False)
        print(f"Task id: {context['ti'].task_id}")
        print(f"Enabled is: {enabled}")

    @task
    def task_a():
        _print_enabled()

    @task
    def task_b():
        _print_enabled()
Define a callable for the BranchPythonOperator in which you perform your conditionals and return the next task to be executed. You can access the execution context variables from **kwargs. Also keep in mind that the callable should return a single task_id or a list of task_ids to follow; those resultant tasks should always be directly downstream of the branching task.
def _get_task_run(ti, **kwargs):
    custom_param = kwargs["params"].get("enabled", False)

    if custom_param:
        return "task_a"
    else:
        return "task_b"


branch_task = BranchPythonOperator(
    task_id="branch_task",
    python_callable=_get_task_run,
)

task_a_exec = task_a()
task_b_exec = task_b()

branch_task >> [task_a_exec, task_b_exec]
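If both snippets end up inside the branch_from_dag_params() function body, remember that the decorated function still has to be called at module level so the DAG gets registered (a one-line sketch):

# instantiate the DAG built by the @dag-decorated function
branch_from_dag_params()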
The result is that task_a gets executed and task_b is skipped:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=branch_from_dag_params
AIRFLOW_CTX_TASK_ID=task_a
Task id: task_a
Enabled is: True
Let me know if that worked for you.

Assign airflow task to several DAGs

I am trying to reuse an existing Airflow task by assigning it to different DAGs.
def create_new_task_for_dag(task: BaseOperator,
                            dag: models.DAG) -> BaseOperator:
    """Create a deep copy of given task and associate it with given dag
    """
    new_task = copy.deepcopy(task)
    new_task.dag = dag
    return new_task


print_datetime_task = python_operator.PythonOperator(
    task_id='print_datetime', python_callable=_print_datetime)

# define a new dag ...

# add to the new dag
create_new_task_for_dag(print_datetime_task, new_dag)
Then it gives the error Task is missing the start_date parameter.
If I define the dag when creating the operator, print_datetime_task = PythonOperator(task_id='print_datetime', python_callable=_print_datetime, dag=new_dag), then it is OK.
I have searched around, and this seems to be the root cause: https://github.com/apache/airflow/pull/5598, but the PR has been marked as stale.
I wonder if there is any other approach to reuse an existing Airflow task and assign it to a different DAG.
I am using apache-airflow[docker,kubernetes]==1.10.10
While I don't know the solution to your problem with the current design (code layout), it can be made to work by tweaking the design slightly (note that the following code snippets have NOT been tested).
Instead of copying a task from a DAG,
def create_new_task_for_dag(task: BaseOperator,
                            dag: models.DAG) -> BaseOperator:
    """Create a deep copy of given task and associate it with given dag
    """
    new_task = copy.deepcopy(task)
    new_task.dag = dag
    return new_task
you can move the instantiation of the task (as well as its assignment to the DAG) to a separate utility function.
from datetime import datetime
from typing import Dict, Any

from airflow.models.dag import DAG
from airflow.operators.python_operator import PythonOperator


def add_new_print_datetime_task(my_dag: DAG,
                                kwargs: Dict[str, Any]) -> PythonOperator:
    """
    Creates and adds a new 'print_datetime' (PythonOperator) task in 'my_dag'
    and returns its reference

    :param my_dag: reference to DAG object in which to add the task
    :type my_dag: DAG
    :param kwargs: dictionary of args for PythonOperator / BaseOperator
                   'task_id' is mandatory
    :type kwargs: Dict[str, Any]
    :return: PythonOperator
    """

    def my_callable() -> None:
        print(datetime.now())

    return PythonOperator(dag=my_dag, python_callable=my_callable, **kwargs)
Thereafter you can call that function every time you want to instantiate that same task (and assign it to any DAG):
with DAG(dag_id="my_dag_id",
         start_date=datetime(year=2020, month=8, day=22, hour=16, minute=30)) as my_dag:

    print_datetime_task_kwargs: Dict[str, Any] = {
        "task_id": "my_task_id",
        "depends_on_past": True
    }
    print_datetime_task: PythonOperator = add_new_print_datetime_task(
        my_dag=my_dag, kwargs=print_datetime_task_kwargs)

    # ... other tasks and their wiring
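The same factory can then give another DAG its own independent copy of the task, for example (a sketch with a hypothetical dag_id):

with DAG(dag_id="my_other_dag_id",
         start_date=datetime(year=2020, month=8, day=22, hour=16, minute=30)) as my_other_dag:

    # a fresh PythonOperator bound to this DAG, created by the same factory
    add_new_print_datetime_task(my_dag=my_other_dag,
                                kwargs={"task_id": "my_task_id"})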
References / good reads
Astronomer.io: Dynamically Generating DAGs in Airflow
Apache Airflow | With Statement and DAG

airflow.exceptions.AirflowException: Use keyword arguments when initializing operators

I am using Airflow version 1.9.2 with Python 2.7 on Ubuntu. I inherit from a ParentOperator class, which works fine by itself, to create a class called ChildOperator. But when I create a ChildOperator instance, I think some keyword arguments are missing or mixed up, and I get this error:
airflow.exceptions.AirflowException: Use keyword arguments when
initializing operators
Here is a simplified example:
class ParentOperator(BaseOperator, SkipMixin):

    @apply_defaults
    def __init__(self,
                 conn_id,
                 object,
                 args={},
                 s3_conn_id=None,
                 s3_key=None,
                 s3_bucket=None,
                 fields=None,
                 *args,
                 **kwargs
                 ):
        super(ParentOperator, self).__init__(*args, **kwargs)
        ...


class ChildOperator(ParentOperator):

    @apply_defaults
    def __init__(self,
                 conn_id,
                 object,
                 args={},
                 s3_conn_id=None,
                 s3_key=None,
                 s3_bucket=None,
                 fields=None,
                 *args,
                 **kwargs
                 ):
        args=...
        super(ChildOperator, self).__init__(
            conn_id,
            object,
            args=args,
            s3_conn_id=s3_conn_id,
            s3_key=s3_key,
            s3_bucket=s3_bucket,
            fields=fields,
            *args,
            **kwargs
        )
        ...


myobjc = ChildOperator(
    conn_id="my_default",
    object=table,
    args={},
    s3_conn_id='s3_postgres_dump',
    s3_key=s3_key,
    s3_bucket=s3_bucket,
    dag=dag,
    task_id="task1"
)
Any idea what is causing this error? Is this more of a Python-specific issue?
The __init__ of ChildOperator needs to pass every parameter to super() as a keyword argument, like the following (note the first two parameters, conn_id and object, which were previously passed positionally; the apply_defaults decorator that wraps ParentOperator.__init__ raises exactly this exception whenever it receives positional arguments):
super(ChildOperator, self).__init__(
    conn_id=conn_id,
    object=object,
    args=args,
    s3_conn_id=s3_conn_id,
    s3_key=s3_key,
    s3_bucket=s3_bucket,
    fields=fields,
    *args,
    **kwargs
)
