How can a variable be reused from one task to another task in a DAG? - airflow

So, I have a couple of tasks in a DAG. Let's say I calculate a value and assign it to a variable in the first task. I want to be able to use that variable in the subsequent tasks.
How can I do this?
In a Python program, I can just elevate a variable inside a function to global, so that I can use it in other functions.
How can I achieve something similar with Airflow, i.e. use a variable from the first task in the subsequent tasks?
I know I can use XComs. Is there any other way?
I tried elevating the variable to global in the function called by the first task and then using it in the subsequent tasks. It did not work.

As you can see in the answer below, taken from a similar question, it is impossible to pass variables between tasks unless you use XCom or a persistent file.
https://stackoverflow.com/a/60495564/19969296

You will need to either write the variable to a file that persists (a local file, a relational database, or a file in blob storage), use XCom (see this guide for concrete examples), or use an Airflow Variable. Is there a reason you don't want to use XCom?
Here is an XCom example, which is a more common pattern than the Airflow Variable example further below:
from airflow.decorators import dag, task
from pendulum import datetime

@dag(
    start_date=datetime(2022, 12, 10),
    schedule=None,
    catchup=False,
)
def write_var():
    @task
    def set_var():
        return "bar"

    @task
    def retrieve_var(my_variable):
        print(my_variable)

    retrieve_var(set_var())

write_var()
The alternative to XCom would be to save the value as an Airflow Variable, as shown in this DAG:
from airflow.decorators import dag, task
from pendulum import datetime
from airflow.models import Variable

@dag(
    start_date=datetime(2022, 12, 10),
    schedule=None,
    catchup=False,
)
def write_var():
    @task
    def set_var():
        Variable.set("foo", "bar")

    @task
    def retrieve_var():
        var = Variable.get("foo")
        print(var)

    set_var() >> retrieve_var()

write_var()
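If you are not using the TaskFlow API, the same XCom pattern is available through the task instance with classic operators. Below is a minimal sketch assuming a recent Airflow 2 release with PythonOperator; the DAG id and function names are illustrative, not part of the original answer:
from airflow import DAG
from airflow.operators.python import PythonOperator
from pendulum import datetime

def set_var():
    # The return value is pushed to XCom under the default key "return_value".
    return "bar"

def retrieve_var(ti):
    # Pull the value pushed by the upstream task.
    my_variable = ti.xcom_pull(task_ids="set_var")
    print(my_variable)

with DAG(
    dag_id="write_var_classic",  # illustrative name
    start_date=datetime(2022, 12, 10),
    schedule=None,
    catchup=False,
) as dag:
    set_task = PythonOperator(task_id="set_var", python_callable=set_var)
    retrieve_task = PythonOperator(task_id="retrieve_var", python_callable=retrieve_var)
    set_task >> retrieve_task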

Related

XCOM is a tuple, how to pass the right value to two different downstream tasks

I have an upstream extract task that extracts files into two different S3 paths. This operator returns a tuple of the two separate S3 paths as an XCom. How do I pass the appropriate XCom value to the appropriate downstream task?
extract_task >> load_task_0
load_task_1
Probably a little late to the party, but will answer anyways.
With the TaskFlow API in Airflow 2.0 you can do something like this using decorators:
@task(multiple_outputs=True)
def extract_task():
    return {
        "path_0": "s3://path0",
        "path_1": "s3://path1",
    }
Then in your DAG:
@dag()
def my_dag():
    output = extract_task()
    load_task_0(output["path_0"])
    load_task_1(output["path_1"])
This works with a dictionary; it probably won't work with a tuple, but you can try.
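For context, a minimal end-to-end sketch of that pattern could look like the following; the load task bodies and the DAG arguments are placeholders added here, not part of the original answer:
from airflow.decorators import dag, task
from pendulum import datetime

@task(multiple_outputs=True)
def extract_task():
    # With multiple_outputs=True, each dict key is stored as its own XCom.
    return {
        "path_0": "s3://path0",
        "path_1": "s3://path1",
    }

@task
def load_task_0(path):
    print(f"loading from {path}")

@task
def load_task_1(path):
    print(f"loading from {path}")

@dag(start_date=datetime(2022, 12, 10), schedule=None, catchup=False)
def my_dag():
    output = extract_task()
    load_task_0(output["path_0"])
    load_task_1(output["path_1"])

my_dag()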

Asyncio in a for loop

So I am trying to do something like this,
for job in jobs:
    input = some_io_task(job)
    output_list.append(process(input))
return output_list
I want the loop to continue while some_io_task is being executed, then come back to that particular iteration and append to the output list. The order in which the appends happen does not matter. I am trying to do this using asyncio in Python. I am new to this and would really appreciate any help. Thanks!
Per your request, here is an example assuming some_io_task returns an awaitable:
import asyncio

async def some_io_task(job):
    ...

async def stuff(jobs):
    output_list = []
    # Create all coroutines for the jobs
    tasks = map(some_io_task, jobs)
    # Schedule all coroutines, yielding each one as it completes
    for task in asyncio.as_completed(tasks):
        result = await task
        output_list.append(process(result))
    return output_list
This is quite specific to your case. Keep in mind that I'm missing some information about some_io_task, but I believe this should explain it well enough.
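If the results do not need to be handled in completion order at all, a gather-based variant is a common alternative. A minimal sketch under the same assumptions (some_io_task is a coroutine and process is an ordinary function; both stubs below are placeholders):
import asyncio

async def some_io_task(job):
    ...  # placeholder for the real I/O-bound work

def process(result):
    ...  # placeholder for the real post-processing

async def handle(job):
    # Await the I/O-bound work, then post-process its result.
    result = await some_io_task(job)
    return process(result)

async def stuff(jobs):
    # Run all jobs concurrently; results come back in input order.
    return await asyncio.gather(*(handle(job) for job in jobs))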

How to use macros within functions in Airflow

I am trying to calculate a hash for each task in Airflow, using a combination of dag_id, task_id and execution_date. I am doing the calculation in the __init__ of a custom operator so that I can use it to compute a unique retry_delay for each task (I don't want to use exponential backoff).
I find it difficult to use the {{ execution_date }} macro inside a call to a hash function or the int function; in those cases Airflow does not replace it with the specific date (it just keeps the string {{ execution_date }}), so I get the same hash for all execution dates:
self.task_hash = int(hashlib.sha1("{}#{}#{}".format(self.dag_id,
                                                    self.task_id,
                                                    '{{execution_date}}')
                                  .encode('utf-8')).hexdigest(), 16)
I have put task_hash in template_fields, and I have also tried to do the calculation in a custom macro. That works for the hash part, but when I put it inside int() it is the same issue.
Is there any workaround? Or could I perhaps retrieve the execution_date in the __init__ of the operator, not from macros?
Thanks
Try:
# Quadruple braces survive str.format as a literal {{execution_date}} Jinja template.
self.task_hash = int(hashlib.sha1("{}#{}#{{{{execution_date}}}}".format(
    self.dag_id, self.task_id).encode('utf-8')).hexdigest(), 16)
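For completeness, one workaround (not from the original answer) is to keep the templated string in a field listed in template_fields and defer the hashing and the int() conversion to execute(), after Jinja has rendered the field. A minimal sketch, with the operator and field names chosen here purely for illustration:
import hashlib
from airflow.models.baseoperator import BaseOperator

class HashedTaskOperator(BaseOperator):
    # Airflow renders attributes listed in template_fields before execute() runs.
    template_fields = ("hash_source",)

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Stays a Jinja template string until the task instance is rendered.
        self.hash_source = "{{ dag.dag_id }}#{{ task.task_id }}#{{ execution_date }}"

    def execute(self, context):
        # By now hash_source is the rendered string, so the hash differs per execution date.
        task_hash = int(hashlib.sha1(self.hash_source.encode("utf-8")).hexdigest(), 16)
        self.log.info("task hash: %s", task_hash)
        return task_hash
Note that this only makes the value available at run time; a retry_delay computed in __init__ would still have to rely on values known at construction time, such as dag_id and task_id.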

Airflow: PythonOperator: why to include 'ds' arg?

While defining a function to be used later as a python_callable, why is 'ds' included as the first argument of the function?
For example:
def python_func(ds, **kwargs):
    pass
I looked into the Airflow documentation, but could not find any explanation.
This is related to the provide_context=True parameter. As per the Airflow documentation:
if set to true, Airflow will pass a set of keyword arguments that can be used in your function. This set of kwargs correspond exactly to what you can use in your jinja templates. For this to work, you need to define **kwargs in your function header.
ds is one of these keyword arguments and represents the execution date in the format "YYYY-MM-DD". For parameters that are marked as (templated) in the documentation, you can use the '{{ ds }}' default variable to pass the execution date. You can read more about default variables here:
https://pythonhosted.org/airflow/code.html?highlight=pythonoperator#default-variables (obsolete)
https://airflow.incubator.apache.org/concepts.html?highlight=python_callable
PythonOperator doesn't have templated parameters, so doing something like
python_callable=print_execution_date('{{ ds }}')
won't work. To print the execution date inside the callable function of your PythonOperator, you will have to do it as
def print_execution_date(ds, **kwargs):
    print(ds)
or
def print_execution_date(**kwargs):
    print(kwargs.get('ds'))
Hope this helps.
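To tie this together, here is a minimal sketch of how the callable might be wired up in the Airflow 1.x style this answer assumes (in Airflow 2.x the context is passed automatically and provide_context is no longer needed); the DAG id is illustrative:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def print_execution_date(ds, **kwargs):
    # ds is injected from the task context as a "YYYY-MM-DD" string.
    print(ds)

with DAG(
    dag_id="print_ds_example",  # illustrative name
    start_date=datetime(2022, 12, 10),
    schedule_interval=None,
    catchup=False,
) as dag:
    print_ds = PythonOperator(
        task_id="print_execution_date",
        python_callable=print_execution_date,
        provide_context=True,  # required in Airflow 1.x only
    )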

Use string in subprocess

I've written Python code to compute an IP programmatically, which I then want to use in an external connection program.
I don't know how to pass it to the subprocess:
import subprocess
from subprocess import call
some_ip = "192.0.2.0" # Actually the result of some computation,
# so I can't just paste it into the call below.
subprocess.call("given.exe -connect host (some_ip)::5631 -Password")
I've read what I could and found similar questions, but I truly cannot understand this step: how to use the value of some_ip in the subprocess call. If someone could explain this to me, it would be greatly appreciated.
If you don't use it with shell=True (and I don't recommend shell=True unless you really know what you're doing, as shell mode can have security implications), subprocess.call takes the command as a sequence (e.g. a list) of its components: first the executable name, then the arguments you want to pass to it. All of those should be strings, but whether they are string literals, variables holding a string, or function calls returning a string doesn't matter.
Thus, the following should work:
import subprocess

some_ip = "192.0.2.0"  # Actually the result of some computation.
subprocess.call(
    ["given.exe", "-connect", "host", "{}::5631".format(some_ip), "-Password"])
I'm using str's format method to replace the {} placeholder in "{}::5631" with the string in some_ip.
If you invoke it as subprocess.call(...), then
import subprocess
is sufficient and
from subprocess import call
is unnecessary. The latter would be needed if you want to invoke the function as just call(...). In that case the former import would be unneeded.
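On Python 3.5+, subprocess.run is the usual replacement for subprocess.call. Here is a sketch of the same command with an explicit exit-code check; given.exe and its flags are taken verbatim from the question:
import subprocess

some_ip = "192.0.2.0"  # Result of the earlier computation.
result = subprocess.run(
    ["given.exe", "-connect", "host", "{}::5631".format(some_ip), "-Password"],
    check=False,  # set check=True to raise CalledProcessError on a non-zero exit
)
print("exit code:", result.returncode)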
