I have two Airflow tasks that are pushing xcoms with the same key srcDbName, but with different values. These two tasks are followed by a task that reads the xcoms with key srcDbName and prints their values. See the code below:
def _fill_facebook_task(ti):
    ti.xcom_push(key='srcDbName', value='SRC_PL_Facebook')

def _fill_trip_advisor_task(ti):
    ti.xcom_push(key='srcDbName', value='SRC_PL_TripAdvisor')

def _pm_task(ti):
    values = ti.xcom_pull(key='srcDbName')
    print(', '.join(values))
facebook = PythonOperator(
    task_id="fill-facebook",
    python_callable=_fill_facebook_task,
    dag=dag
)

tripAdvisor = PythonOperator(
    task_id="fill-trip-advisor",
    python_callable=_fill_trip_advisor_task,
    dag=dag
)

pm = PythonOperator(
    task_id="premises-matching",
    python_callable=_pm_task,
    dag=dag
)
facebook >> pm
tripAdvisor >> pm
I expect the pm task to print
SRC_PL_Facebook, SRC_PL_TripAdvisor
(or in a different order) because the documentation for xcom_pull states:
:param task_ids: Only XComs from tasks with matching ids will be
pulled. Can pass None to remove the filter.
Actually, it prints
S, R, C, _, P, L, _, F, a, c, e, b, o, o, k
Is it possible to read all xcoms with a given key from all upstream tasks?
To read all of the XComs you'd need to pass all of the upstream task ids as the task_ids argument to xcom_pull. The documentation for the xcom_pull method definitely doesn't make that clear -- it only says:
If a single task_id string is provided, the result is the value of the most
recent matching XCom from that task_id. If multiple task_ids are provided, a
tuple of matching values is returned. None is returned whenever no matches
are found.
but it should also mention that if you don't pass any task_ids then xcom_pull will return only the first (if any) matching value that it finds. You can verify that behavior in the code for airflow.models.taskinstance.
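For example, pulling by the upstream task ids from your DAG should return both values (a sketch based on the task ids above):

def _pm_task(ti):
    # Pass the upstream task ids explicitly so every matching XCom is returned.
    values = ti.xcom_pull(
        key='srcDbName',
        task_ids=['fill-facebook', 'fill-trip-advisor'],
    )
    print(', '.join(values))  # SRC_PL_Facebook, SRC_PL_TripAdvisor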
Related
Typically I send an asynchronous task with the .apply_async method of the task defined, and then I use the task id with the AsyncResult method of the same object to get the task status and, eventually, the result.
But this requires me to know the exact task when more than one task is defined in the same deployment. Is there any way to circumvent this, so that I can get the task status and result (if available) without knowing the exact task?
For example, take this example celery master node code.
#!/usr/bin/env python3
# encoding:utf-8
"""Define the tasks in this file."""
from celery import Celery

redis_host: str = 'redis://localhost:6379/0'

celery = Celery(main='test', broker=redis_host,
                backend=redis_host)
celery.conf.CELERY_TASK_SERIALIZER = 'pickle'
celery.conf.CELERY_RESULT_SERIALIZER = 'pickle'
celery.conf.CELERY_ACCEPT_CONTENT = {'json', 'pickle'}


# pylint: disable=unused-argument
@celery.task(bind=True)
def add(self, x: float, y: float) -> float:
    """Add two numbers."""
    return x + y


@celery.task(bind=True)
def multiply(self, x: float, y: float) -> float:
    """Multiply two numbers."""
    return x * y
When I call something like this in a different module
task1 = add.apply_async(args=[2, 3]).id
task2 = multiply.apply_async(args=[2, 3]).id
I get two uuids for the tasks. But when checking back the task status, I need to know which method (add or multiply) is associated with that task id, since I have to call the method on the corresponding object, like this.
status: str = add.AsyncResult(task_id=task1).state
My question is: how can I fetch the state and result armed only with the task id, without knowing whether the task belongs to add, multiply, or any other task defined?
id and state are just properties of AsyncResult objects. If you look at the documentation for the AsyncResult class, you will find the name property, which is exactly what you are asking for.
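For instance, you can build an AsyncResult from the task id alone, without referencing add or multiply (a sketch based on the code above; whether name is populated can depend on your Celery version and on the result_extended setting):

from celery.result import AsyncResult

# Construct the result object from the id only; the app knows the backend.
result = AsyncResult(task1, app=celery)   # equivalently: celery.AsyncResult(task1)

print(result.state)   # e.g. 'PENDING', 'STARTED', 'SUCCESS', ...
print(result.result)  # the return value once the task has finished
print(result.name)    # the registered task name, e.g. 'tasks.add' (depends on the module)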
I have a requirement to compute a value in a PythonOperator and use it in other operators, as shown below. But I'm getting "dag_var does not exist" for the spark submit and email operators.
I'm declaring dag_var as a global variable in the Python callable, but I'm not able to access it in the other operators.
def get_dag_var(ds, **kwargs):
    global dag_var
    dag_var = kwargs['dag_run'].run_id


with DAG(
    dag_id='sample',
    schedule_interval=None,  # executes at 6 AM UTC every day
    start_date=datetime(2021, 1, 1),
    default_args=default_args,
    catchup=False
) as dag:

    get_dag_var = PythonOperator(
        task_id='get_dag_id',
        provide_context=True,
        python_callable=get_dag_var)

    spark_submit = SparkSubmitOperator(application="abc".....
        ..
        application_args=[dag_var])

    failure_notification = EmailOperator(
        task_id="failure_notification",
        to='abc@gmail.com',
        subject='Workflow Failed',
        trigger_rule="one_failed",
        html_content=f""" <h3>Failure Mail - {dag_var}</h3> """
    )

    get_dag_var >> spark_submit >> failure_notification
Any help is appreciated. Thank you.
You can share data between operators using XComs. Any value returned by your get_dag_var callable is automatically stored as an XCom record in Airflow. You can inspect the values under Admin -> XComs.
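For example, instead of assigning to a global, the callable could simply return the run_id (a minimal sketch based on your code; the operator variable name is just for illustration):

def get_dag_var(**kwargs):
    # The returned value is stored as an XCom under the default key 'return_value'.
    return kwargs['dag_run'].run_id

get_dag_var_task = PythonOperator(
    task_id='get_dag_id',
    python_callable=get_dag_var,
)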
To use an XCom value in a following task, you can apply templating:
spark_submit = SparkSubmitOperator(
    application="ABC",
    ...,
    application_args=["{{ ti.xcom_pull(task_ids='get_dag_id') }}"],
)
The {{ }} define a templated string that is evaluated at runtime. ti.xcom_pull will "pull" the XCom value from the get_dag_id task at runtime.
One thing to note when using templating: not all operator arguments are template-able. Non-template-able arguments do not evaluate {{ }} at runtime. SparkSubmitOperator.application_args and EmailOperator.html_content are template-able, meaning a templated string is evaluated at runtime and you'll be able to provide an XCom value. Inspect the template_fields property of your operator to know which fields are template-able and which are not.
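For instance, the EmailOperator from your DAG could pull the XCom via templating too (a sketch; note the f-string is replaced by a plain Jinja string so it is rendered at runtime rather than at DAG parse time):

failure_notification = EmailOperator(
    task_id="failure_notification",
    to="abc@gmail.com",
    subject="Workflow Failed",
    trigger_rule="one_failed",
    # html_content is listed in EmailOperator.template_fields, so this is rendered at runtime.
    html_content="<h3>Failure Mail - {{ ti.xcom_pull(task_ids='get_dag_id') }}</h3>",
)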
And one thing to note when using XComs: be aware that the XCom value is stored in the Airflow metastore, so be careful not to return huge values which might not fit in a database record. To store XCom values in a different system than the Airflow metastore, check out custom XCom backends.
In the Airflow 2 taskflow API I can, using the following code examples, easily push and pull XCom values between tasks:
#task(task_id="task_one")
def get_height() -> int:
response = requests.get("https://swapi.dev/api/people/4")
data = json.loads(response.text)
height = int(data["height"])
return height
#task(task_id="task_two")
def check_height(val):
# Show val:
print(f"Value passed in is: {val}")
check_height(get_height())
I can see that the val passed into check_height is 202 and is wrapped in the xcom default key 'return_value' and that's fine for some of the time, but I generally prefer to use specific keys.
My question is how can I push the XCom with a named key? This was really easy previously with ti.xcom_push where you could just supply the key name you wanted the value to be stuffed into, but I can't quite put my finger on how to achieve this in the taskflow api workflow.
Would appreciate any pointers or (simple, please!) examples on how to do this.
You can just ask for ti in the decorated function's signature and Airflow will pass the task instance in at runtime:

@task(task_id="task_one")
def get_height(ti=None) -> int:
    response = requests.get("https://swapi.dev/api/people/4")
    data = json.loads(response.text)
    height = int(data["height"])
    # Handle named XCom
    ti.xcom_push("my_key", height)
    return height
For cases where you need the context in a deeply nested function you can also use get_current_context. I'll use it in my example below just to show it, but it's not really required in your case.
Here is a working example:
import json
from datetime import datetime

import requests

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context

DEFAULT_ARGS = {"owner": "airflow"}


@dag(dag_id="stackoverflow_dag", default_args=DEFAULT_ARGS, schedule_interval=None, start_date=datetime(2020, 2, 2))
def my_dag():
    @task(task_id="task_one")
    def get_height() -> int:
        response = requests.get("https://swapi.dev/api/people/4")
        data = json.loads(response.text)
        height = int(data["height"])
        # Handle named XCom
        context = get_current_context()
        ti = context["ti"]
        ti.xcom_push("my_key", height)
        return height

    @task(task_id="task_two")
    def check_height(val):
        # Show val:
        print(f"Value passed in is: {val}")
        # Read from named XCom
        context = get_current_context()
        ti = context["ti"]
        val_from_xcom = ti.xcom_pull(task_ids="task_one", key="my_key")
        print(f"Value passed from xcom my_key is: {val_from_xcom}")

    check_height(get_height())


my_dag = my_dag()
Two XComs are pushed: one for the returned value and one with the key we chose.
The downstream task_two then prints both XCom values.
I am trying to create an ExternalTaskSensor (in DAG B) on a task in a different DAG (let's call it DAG A), which runs at the following intervals: 'schedule_interval': '0 4,6,8,10,12,14,16,18,20,22 * * *'.
DAG B is scheduled to run at 2 AM daily. I want to create a sensor task in DAG B that checks whether the 4 AM run of the external task in DAG A has succeeded.
I cannot reschedule my DAG B to run at 4 since there are other tasks in DAG B which need to run at 2. I have tried changing the window_size and window_offset parameters but it does not work.
The ExternalTaskSensor methods have been overridden as follows:
from airflow.models import TaskInstance, DagRun


def return_start_end_time(self, context):
    execution_date = context.get('next_execution_date')
    return (execution_date - self.window_offset - self.window_size,
            execution_date - self.window_offset)


def poke(self, context):
    start_date, end_date = self.return_start_end_time(context)
    expected_executions = date_range(start_date, end_date,
                                     delta=self.dep_dag_schedule)
    TI = TaskInstance
    DR = DagRun
    executions = (
        session.query(TI.dag_id, TI.task_id, TI.execution_date,
                      TI.state)
        .join(DR, and_(DR.dag_id == TI.dag_id,
                       DR.execution_date == TI.execution_date))
        .filter(TI.dag_id == self.external_dag_id,
                TI.task_id == self.external_task_id,
                TI.execution_date.in_(expected_executions),
                DR.run_id.startswith('scheduled__'))
        .order_by(TI.execution_date.desc()).all()
    )
The code for Task Sensor is as follows:
wait_task = CustomTaskSensor(
    task_id=wait_task,
    poke_interval=60,
    dag=dag,
    external_dag_id=DAGA,
    external_task_id=TaskA,
    window_size=timedelta(days=0, hours=5),
    window_offset=timedelta(days=0, hours=-5),
    execution_timeout=timedelta(hours=5),
    success_fn=MOST_RECENT_SUCCESS
)
You can use the execution_date_fn param of ExternalTaskSensor to achieve this:
:param execution_date_fn: function that receives the current execution date
and returns the desired execution dates to query. Either execution_delta
or execution_date_fn can be passed to ExternalTaskSensor, but not both.
:type execution_date_fn: callable
This function will be supplied the current execution_date as an argument and is supposed to return a single execution_date, or a list of them, whose execution should be 'sensed':
..
elif self.execution_date_fn:
    dttm = self.execution_date_fn(context['execution_date'])
..
dttm_filter = dttm if isinstance(dttm, list) else [dttm]
serialized_dttm_filter = ','.join(
    [datetime.isoformat() for datetime in dttm_filter])
You can find a usage example in the test cases.
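As a rough sketch of how that could look for your case (the offset logic is an assumption; adjust it to how your 2 AM run should map onto DAG A's 4 AM execution date, and note the import path differs between Airflow versions):

from airflow.sensors.external_task_sensor import ExternalTaskSensor  # airflow.sensors.external_task in Airflow 2.x


def _dag_a_4am_run(execution_date):
    # Hypothetical mapping: sense the DAG A run whose execution_date is 4 AM
    # of the same day as DAG B's execution_date; change the offset as needed.
    return execution_date.replace(hour=4, minute=0, second=0, microsecond=0)


wait_task = ExternalTaskSensor(
    task_id='wait_for_dag_a',
    external_dag_id='DAG_A',
    external_task_id='task_a',
    execution_date_fn=_dag_a_4am_run,
    poke_interval=60,
    dag=dag,
)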
We had our own custom ExternalTaskSensor class which did not have the execution_date_fn and execution_delta params; instead we had parameters like window_size and window_offset. The way the methods were implemented, the execution date of the Task Sensor was the same as the actual execution date (which is not the default behaviour). That means if the DAG containing the Task Sensor triggered at 9/17 2 AM, the execution date of the sensor was set to 9/17 2 AM. However, the execution date of the external task was set to the previous execution date (which is the default Airflow behaviour), i.e. if the external task runs at 9/17 4 AM then its execution date is set to 9/16 10 PM (the previous execution date). I had to define my window_size and window_offset parameters such that the execution date of the external task falls in the window calculated by the return_start_end_time function (where the execution date refers to the execution date of the Task Sensor).
I am importing a variables file (e.g., variables.json) into Airflow, which has one depth-1 variable that is a list, like this:
{...
"var1": ["value1", "value2"],
...
}
I tried 3 methods:
1). in command line: airflow variables -i variables.json
2). in airflow UI, admin -> Variables -> Choose file -> Import Variables
3). in airflow UI, admin -> Variables -> Create -> input key (i.e., var1) and value (i.e., ["value1", "value2"]) respectively.
Methods 1 and 2 failed, but 3 succeeded.
Method 1 returns info like "15 of 27 variables successfully updated.", which means some variables were not successfully updated.
Method 2 shows this error:
InterfaceError: (sqlite3.InterfaceError) Error binding parameter 1 - probably unsupported type. [SQL: u'INSERT INTO variable ("key", val, is_encrypted) VALUES (?, ?, ?)'] [parameters: (u'var1', [u'value1', u'value2'], 0)] (Background on this error at: http://sqlalche.me/e/rvf5)
I searched and found this thread: InterfaceError: (sqlite3.InterfaceError) Error binding parameter 0.
It seems that SQLite does not support the list type.
I also tested a case with a nested variable (here, for example, var2_1) that is a list, like this:
{...
"var2": {"var2_1": ["A","B"]},
...
}
All of the above 3 methods work.
So my questions are:
(1) Why did methods 1 and 2 fail, but 3 succeed, for a depth-1 variable that is a list?
(2) Why can a nested (depth-2, 3, ...) variable be a list without any issue?
If you're running Airflow 1.10.3, the import_helper used in the CLI only serializes dict values to JSON.
def import_helper(filepath):
    # ...
    try:
        n = 0
        for k, v in d.items():
            if isinstance(v, dict):
                Variable.set(k, v, serialize_json=True)
            else:
                Variable.set(k, v)
            n += 1
    except Exception:
        pass
    finally:
        print("{} of {} variables successfully updated.".format(n, len(d)))
https://github.com/apache/airflow/blob/1.10.3/airflow/bin/cli.py#L376
The WebUI importer also does the same thing with dict values.
models.Variable.set(k, v, serialize_json=isinstance(v, dict))
https://github.com/apache/airflow/blob/1.10.3/airflow/www/views.py#L2073
However, the current revision (1.10.4rc1) shows that non-string values will be serialized as JSON in future releases, in the CLI import_helper
Variable.set(k, v, serialize_json=not isinstance(v, six.string_types))
https://github.com/apache/airflow/blob/1.10.4rc1/airflow/bin/cli.py
...and WebUI importer.
models.Variable.set(k, v, serialize_json=not isinstance(v, six.string_types))
https://github.com/apache/airflow/blob/1.10.4rc1/airflow/www/views.py#L2118
For now, it will serve you to serialize non-string values yourself in your import process when you use the CLI or the WebUI importer,
...and when you retrieve the value of such a variable, pass the option to deserialize it, e.g.
Variable.get('some-key', deserialize_json=True)
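A minimal sketch of that round trip (the key name is taken from your example):

import json

from airflow.models import Variable

# Store the list pre-serialized as a JSON string...
Variable.set("var1", json.dumps(["value1", "value2"]))

# ...and deserialize it again when reading.
values = Variable.get("var1", deserialize_json=True)
print(values)  # ['value1', 'value2']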
In your variables.json, ["value1", "value2"] is an array, whereas Airflow expects a value/string or a JSON object.
It would work if you encoded that array as a string in your JSON.
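For example (a sketch of the variables.json entry, with the array stored as a JSON-encoded string; read it back with deserialize_json=True):

{
    "var1": "[\"value1\", \"value2\"]"
}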