I want to pass the execution date, which is in the variable {{ ds }}. However, when I pass it through a function, the function does not get the execution date.
def get_spark_step_2(date):
    # logic in here
    return step

exec_date = '{{ ds }}'

step_adder2 = EmrAddStepsOperator(
    task_id='create_parquets',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=get_spark_step_2(exec_date),
    dag=dag
)
Do you know how I can use the variable in the context above?
Create a class that extends EmrAddStepsOperator, and make steps a templated field.
Something like this:
class MyEmrAddStepsOperator(EmrAddStepsOperator):
    template_fields = ['job_flow_id', 'steps']
EmrAddStepsOperator itself only has job_flow_id as a templated field:
class EmrAddStepsOperator(BaseOperator):
    """
    An operator that adds steps to an existing EMR job_flow.

    :param job_flow_id: id of the JobFlow to add steps to
    :type job_flow_id: str
    :param aws_conn_id: aws connection to use
    :type aws_conn_id: str
    :param steps: boto3 style steps to be added to the jobflow
    :type steps: list
    """
    template_fields = ['job_flow_id']
You can only use macros (like ds) in fields that are templated.
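A minimal sketch of how the subclass could then be used. The step definition here is illustrative (only get_spark_step_2 and the operator arguments come from the question); because steps is now templated, Jinja is rendered recursively through the list at runtime:

def get_spark_step_2(date):
    # a hypothetical boto3-style step; the real logic lives in the question
    return [{
        'Name': 'create_parquets',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--date', date],
        },
    }]

step_adder2 = MyEmrAddStepsOperator(
    task_id='create_parquets',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=get_spark_step_2('{{ ds }}'),  # '{{ ds }}' is rendered because steps is templated
    dag=dag
)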
I have a task that returns a tuple. Passing one element of that tuple to another task is not working; I can pass the entire tuple, but not an element from the return value:
from airflow.decorators import dag, task
from pendulum import datetime

@task
def create():
    return 1, 2

@task
def consume(one):
    print('arg is', one)

@dag(
    schedule_interval='@once',
    start_date=datetime(2022, 4, 10),
)
def test_dag():
    out = create()
    consume(out[0])  # does not work: the task gets None as argument
    consume(out)  # this works

dag = test_dag()
Within TaskFlow the object returned from a TaskFlow function is actually an XComArg. These XComArgs are abstractions over the classic task_instance.xcom_pull(...) retrieval of XComs. Additionally XComArg objects implement __getitem__ for specifying an XCom key other than "return_value" (which is the default).
So what's going on in the case of consume(out[0]) is that Airflow is leveraging an XComArg object to retrieve an XCom with a key of 0, rather than retrieving the output of create() and then taking its first item. What's going on behind the scenes is task_instance.xcom_pull(task_ids="create", key=0).
Yes, this is unexpected in a way and not quite in line with the classic xcom_pull() approach. This issue has been opened to try and achieve feature parity.
In the meantime, you can of course access the whole XComArg like you show by just using consume(out) or you can update the TaskFlow function to return a dictionary and use multiple_outputs to have each key/value pair serialized as their own XComs.
For example:
from pendulum import datetime
from airflow.decorators import dag, task

@task(multiple_outputs=True)
def create():
    return {"one": 1, "two": 2}

@task
def consume(arg):
    print('arg is', arg)

@dag(
    schedule_interval='@once',
    start_date=datetime(2022, 4, 10),
)
def test_dag():
    out = create()
    consume(out["one"])

dag = test_dag()
The create task now pushes a separate XCom for each key ("one" and "two"), and the consume task log shows the single value it pulled.
Side note: multiple_outputs can also be inferred if the TaskFlow function has a dictionary return type annotation. This will set multiple_outputs=True based on the return annotation:

from typing import Dict

@task
def create() -> Dict[str, int]:
    return {"one": 1, "two": 2}
I have a requirement to compute a value in a PythonOperator and use it in other operators, as shown below. But I'm getting a "dag_var does not exist" error for the Spark submit and email operators.
I'm declaring dag_var as a global variable in the Python callable, but I'm not able to access it in the other operators.
def get_dag_var(ds, **kwargs):
    global dag_var
    dag_var = kwargs['dag_run'].run_id

with DAG(
    dag_id='sample',
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    default_args=default_args,
    catchup=False
) as dag:

    get_dag_var = PythonOperator(
        task_id='get_dag_id',
        provide_context=True,
        python_callable=get_dag_var)

    spark_submit = SparkSubmitOperator(application="abc".....
        ..
        application_args=[dag_var])

    failure_notification = EmailOperator(
        task_id="failure_notification",
        to='abc@gmail.com',
        subject='Workflow Fails',
        trigger_rule="one_failed",
        html_content=f""" <h3>Failure Mail - {dag_var}</h3> """
    )

    get_dag_var >> spark_submit >> failure_notification
Any help is appreciated. Thank you.
You can share data between operators using XComs. In your get_dag_var function, return the value instead of assigning a global; any value returned from the callable is automatically stored as an XCom record in Airflow. You can inspect the values under Admin -> XComs.
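For instance, the callable from the question could simply return the run id (a minimal sketch):

def get_dag_var(ds, **kwargs):
    # the returned value is pushed as an XCom with the default key "return_value"
    return kwargs['dag_run'].run_id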
To use an XCom value in a following task, you can apply templating:
spark_submit = SparkSubmitOperator(
    application="ABC",
    ...,
    application_args=["{{ ti.xcom_pull(task_ids='get_dag_id') }}"],
)
The {{ }} define a templated string that is evaluated at runtime. ti.xcom_pull will "pull" the XCom value from the get_dag_id task at runtime.
One thing to note when using templating: not all operator arguments are template-able. Non-template-able arguments do not evaluate {{ }} at runtime. SparkSubmitOperator.application_args and EmailOperator.html_content are template-able, meaning a templated string is evaluated at runtime and you'll be able to provide an XCom value. Inspect the template_fields property of your operator to know which fields are template-able and which are not.
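For example, a quick check in a Python shell (illustrative; the exact tuple contents depend on your Airflow and provider versions, and the operator imports are assumed):

print(SparkSubmitOperator.template_fields)
# e.g. ('_application', '_conf', ..., '_application_args', ...)
print(EmailOperator.template_fields)
# e.g. ('to', 'subject', 'html_content', 'files')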
And one thing to note using XComs: be aware the XCom value is stored in the Airflow metastore, so be careful not to return huge variables which might not fit in a database record. To store XCom values in a different system than the Airflow metastore, check out custom XCom backends.
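As a sketch of that last point (assuming the BaseXCom hooks from airflow.models.xcom; the storage calls are placeholders), a custom XCom backend overrides serialization and is enabled via the xcom_backend setting in the [core] config section:

from airflow.models.xcom import BaseXCom

class MyXComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value):
        # placeholder: upload large values to external storage here
        # and return a small reference instead of the full payload
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(result):
        # placeholder: resolve the stored reference back to the real value
        return BaseXCom.deserialize_value(result)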
I have two Airflow tasks that are pushing xcoms with the same key srcDbName, but with different values. These two tasks are followed by a task that reads the xcoms with key srcDbName and prints their values. See the code below:
def _fill_facebook_task(ti):
    ti.xcom_push(key='srcDbName', value='SRC_PL_Facebook')

def _fill_trip_advisor_task(ti):
    ti.xcom_push(key='srcDbName', value='SRC_PL_TripAdvisor')

def _pm_task(ti):
    values = ti.xcom_pull(key='srcDbName')
    print(', '.join(values))

facebook = PythonOperator(
    task_id="fill-facebook",
    python_callable=_fill_facebook_task,
    dag=dag
)

tripAdvisor = PythonOperator(
    task_id="fill-trip-advisor",
    python_callable=_fill_trip_advisor_task,
    dag=dag
)

pm = PythonOperator(
    task_id="premises-matching",
    python_callable=_pm_task,
    dag=dag
)

facebook >> pm
tripAdvisor >> pm
I expect the pm task should print
SRC_PL_Facebook, SRC_PL_TripAdvisor
(or in a different order) because the documentation for xcom_pull states:
:param task_ids: Only XComs from tasks with matching ids will be
pulled. Can pass None to remove the filter.
Actually, it prints
S, R, C, _, P, L, _, F, a, c, e, b, o, o, k
Is it possible to read all xcoms with a given key from all upstream tasks?
To read all of the XComs you'd need to pass all of the upstream task ids as an argument to xcom_pull. The documentation for the xcom_pull method definitely doesn't make that clear; it only says:
If a single task_id string is provided, the result is the value of the most
recent matching XCom from that task_id. If multiple task_ids are provided, a
tuple of matching values is returned. None is returned whenever no matches
are found.
but it should also mention that if you don't pass any task_ids then xcom_pull will return only the first (if any) matching value that it finds. You can verify that behavior in the code for airflow.models.taskinstance.
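Applied to the question's code, the pull would look something like this (a sketch; the task ids are taken from the question, and the returned values follow the order of task_ids):

def _pm_task(ti):
    values = ti.xcom_pull(key='srcDbName',
                          task_ids=['fill-facebook', 'fill-trip-advisor'])
    print(', '.join(values))  # expected: SRC_PL_Facebook, SRC_PL_TripAdvisor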
In the Airflow 2 TaskFlow API I can, using the following code examples, easily push and pull XCom values between tasks:
#task(task_id="task_one")
def get_height() -> int:
response = requests.get("https://swapi.dev/api/people/4")
data = json.loads(response.text)
height = int(data["height"])
return height
#task(task_id="task_two")
def check_height(val):
# Show val:
print(f"Value passed in is: {val}")
check_height(get_height())
I can see that the val passed into check_height is 202 and is wrapped in the xcom default key 'return_value' and that's fine for some of the time, but I generally prefer to use specific keys.
My question is how can I push the XCom with a named key? This was really easy previously with ti.xcom_push where you could just supply the key name you wanted the value to be stuffed into, but I can't quite put my finger on how to achieve this in the taskflow api workflow.
Would appreciate any pointers or (simple, please!) examples on how to do this.
You can have Airflow inject ti into the decorated function by declaring it as a keyword argument:

@task(task_id="task_one")
def get_height(ti=None) -> int:
    response = requests.get("https://swapi.dev/api/people/4")
    data = json.loads(response.text)
    height = int(data["height"])
    # Handle named XCom
    ti.xcom_push("my_key", height)
    return height
For cases where you need context in a deeply nested function you can also use get_current_context. I'll use it in the example below just to show it, but it's not really required in your case.
Here is a working example:
import json
from datetime import datetime

import requests

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context

DEFAULT_ARGS = {"owner": "airflow"}

@dag(dag_id="stackoverflow_dag", default_args=DEFAULT_ARGS, schedule_interval=None, start_date=datetime(2020, 2, 2))
def my_dag():

    @task(task_id="task_one")
    def get_height() -> int:
        response = requests.get("https://swapi.dev/api/people/4")
        data = json.loads(response.text)
        height = int(data["height"])
        # Handle named XCom
        context = get_current_context()
        ti = context["ti"]
        ti.xcom_push("my_key", height)
        return height

    @task(task_id="task_two")
    def check_height(val):
        # Show val:
        print(f"Value passed in is: {val}")
        # Read from the named XCom
        context = get_current_context()
        ti = context["ti"]
        pulled = ti.xcom_pull(task_ids="task_one", key="my_key")
        print(f"Value pulled from XCom my_key is: {pulled}")

    check_height(get_height())

my_dag = my_dag()
Two XComs are pushed from task_one (one for the returned value and one under the key we chose), and the downstream task_two log prints both values.
I have a custom DAG (meant to be subclassed), let's name it MyDAG. In the __enter__ method I want to add (or not) an operator based on the subclassing DAG. I'm not interested in using the BranchPythonOperator.
class MyDAG(DAG):

    def __enter__(self):
        start = DummyOperator(task_id='start')
        end = DummyOperator(task_id='end')
        op = self.get_additional_operator()
        if op:
            start >> op
        else:
            start >> end
        return self

    def get_additional_operator(self):
        # None if the subclass doesn't add any operator,
        # a reference to another operator otherwise
        return None
if get_additional_operator is returning a reference, I'm obtaining this shape (two branches):
* start --> op
* end
otherwise, if it's returning None, I'm obtaining this (one branch):
* start --> end
What I want is to not have end at all in the subclass inheriting from MyDAG when get_additional_operator doesn't return None, something like this:
* start --> op
Instead of the two branches I'm obtaining above.
Airflow is somehow parsing every operator declared in the __enter__ method of a subclass of MyDAG. Given that assumption, in order not to have the operator it suffices to declare it in the right place. Code below:
class MyDAG(DAG):

    def __enter__(self):
        start = DummyOperator(task_id='start')
        op = self.get_additional_operator()
        if op:
            start >> op
        else:
            end = DummyOperator(task_id='end')
            start >> end
        return self

    def get_additional_operator(self):
        # None if the subclass doesn't add any operator,
        # a reference to another operator otherwise
        return None
The declaration of the end operator is made in the else branch, so it is only instantiated when get_additional_operator returns None.
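A hypothetical subclass illustrating the pattern (the class and task names here are made up for the example):

class MySubDAG(MyDAG):
    def get_additional_operator(self):
        # returning an operator here means `end` is never created
        return DummyOperator(task_id='op')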