Airflow: PythonOperator: why include the 'ds' arg?

While defining a function to be later used as a python_callable, why is 'ds' included as the first arg of the function?
For example:
def python_func(ds, **kwargs):
    pass
I looked into the Airflow documentation, but could not find any explanation.

This is related to the provide_context=True parameter. As per the Airflow documentation,
if set to true, Airflow will pass a set of keyword arguments that can be used in your function. This set of kwargs correspond exactly to what you can use in your jinja templates. For this to work, you need to define **kwargs in your function header.
ds is one of these keyword arguments and represents the execution date in the format "YYYY-MM-DD". For parameters that are marked as (templated) in the documentation, you can use the '{{ ds }}' default variable to pass the execution date. You can read more about default variables here:
https://pythonhosted.org/airflow/code.html?highlight=pythonoperator#default-variables (obsolete)
https://airflow.incubator.apache.org/concepts.html?highlight=python_callable
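For example, here is a minimal sketch (Airflow 2-style import, hypothetical task id) of passing the execution date through a templated parameter, in this case BashOperator's bash_command:
from airflow.operators.bash import BashOperator

# bash_command is a templated field, so '{{ ds }}' is rendered to the
# execution date string (e.g. "2023-01-01") at run time
print_date = BashOperator(
    task_id="print_date",  # hypothetical task id
    bash_command="echo {{ ds }}",
)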
PythonOperator doesn't have templated parameters, so doing something like
python_callable=print_execution_date('{{ ds }}')
won't work. To print the execution date inside the callable function of your PythonOperator, you will have to do it as
def print_execution_date(ds, **kwargs):
    print(ds)
or
def print_execution_date(**kwargs):
    print(kwargs.get('ds'))
Hope this helps.
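For context, here is a minimal sketch of wiring such a callable into a DAG. The DAG id, dates and schedule are hypothetical; the import path and provide_context=True are for Airflow 1.x, while in Airflow 2+ the import is airflow.operators.python and the context is passed automatically:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def print_execution_date(ds, **kwargs):
    # 'ds' arrives from the task context as a "YYYY-MM-DD" string
    print(ds)


dag = DAG(
    dag_id="print_ds_example",        # hypothetical DAG id
    start_date=datetime(2023, 1, 1),  # hypothetical start date
    schedule_interval=None,
)

print_ds = PythonOperator(
    task_id="print_ds",
    python_callable=print_execution_date,
    provide_context=True,  # required in Airflow 1.x; not needed in Airflow 2+
    dag=dag,
)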

Related

How can a variable be reused from one task to another task in a DAG?

So, I have a couple of tasks in a DAG. Let's say I calculate a value and assign it to a variable in the first task. I want to be able to use the variable in the subsequent tasks.
How can I do this?
In a Python program, I can just declare the variable in a function as global, so that I can use that variable in other functions.
How can I achieve something similar with Airflow, i.e. use a variable from the first task in the subsequent tasks?
I know I can use XCOMS. Is there any other way?
I tried declaring the variable as global in the function called by the first task and tried to use it in the subsequent tasks. It did not work.
As you can see in the answer below from a similar question, unless you use XCom or a persistent file, it is impossible to pass variables between tasks.
https://stackoverflow.com/a/60495564/19969296
You will need to either write the variable to a file that persists (it can be a local file, a relational database, or a file in blob storage), use XCom (see this guide for concrete examples), or use Airflow Variables. Is there a reason you don't want to use XCom?
An XCom example, which is a more common pattern than the Airflow Variable example further below:
from airflow.decorators import dag, task
from pendulum import datetime

@dag(
    start_date=datetime(2022, 12, 10),
    schedule=None,
    catchup=False,
)
def write_var():
    @task
    def set_var():
        return "bar"

    @task
    def retrieve_var(my_variable):
        print(my_variable)

    retrieve_var(set_var())

write_var()
The alternative to XCom would be to save the variable as an Airflow variable as shown in this DAG:
from airflow.decorators import dag, task
from pendulum import datetime
from airflow.models import Variable

@dag(
    start_date=datetime(2022, 12, 10),
    schedule=None,
    catchup=False,
)
def write_var():
    @task
    def set_var():
        Variable.set("foo", "bar")

    @task
    def retrieve_var():
        var = Variable.get("foo")
        print(var)

    set_var() >> retrieve_var()

write_var()

XCOM is a tuple, how to pass the right value to two different downstream tasks

I have an upstream extract task that extracts files into two different S3 paths. This operator returns a tuple of the two separate S3 paths as an XCom. How do I pass the appropriate XCom value to the appropriate task?
extract_task >> load_task_0
load_task_1
Probably a little late to the party, but will answer anyways.
With the TaskFlow API in Airflow 2.0 you can do something like this using decorators:
@task(multiple_outputs=True)
def extract_task():
    return {
        "path_0": "s3://path0",
        "path_1": "s3://path1",
    }
Then in your DAG:
@dag()
def my_dag():
    output = extract_task()
    load_task_0(output["path_0"])
    load_task_1(output["path_1"])
This works with a dictionary; it probably won't work with a tuple, but you can try.
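For completeness, a minimal runnable sketch of the same idea, with hypothetical load tasks standing in for the real loading logic:
from pendulum import datetime

from airflow.decorators import dag, task


@task(multiple_outputs=True)
def extract_task():
    # with multiple_outputs=True, each dictionary key becomes its own XCom entry
    return {"path_0": "s3://path0", "path_1": "s3://path1"}


@task
def load_task_0(path):
    # hypothetical load task; replace the print with real loading logic
    print("loading from", path)


@task
def load_task_1(path):
    # hypothetical load task
    print("loading from", path)


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def my_dag():
    output = extract_task()
    load_task_0(output["path_0"])
    load_task_1(output["path_1"])


my_dag()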

How to use macros within functions in Airflow

I am trying to calculate a hash for each task in Airflow, using a combination of dag_id, task_id & execution_date. I am doing the calculation in the __init__ of a custom operator, so that I can use it to calculate a unique retry_delay for each task (I don't want to use exponential backoff).
I find it difficult to use the {{ execution_date }} macro inside a call to a hash function or the int function; in those cases Airflow does not replace it with the specific date (it just keeps the string {{ execution_date }}), and I get the same hash for all execution dates:
self.task_hash = int(hashlib.sha1("{}#{}#{}".format(self.dag_id,
                                                    self.task_id,
                                                    '{{execution_date}}')
                                  .encode('utf-8')).hexdigest(), 16)
I have put task_hash in template_fields; I have also tried to do the calculation in a custom macro. That works for the hash part, but when I put it inside int() it's the same issue.
Any workaround? Or could I perhaps retrieve the execution_date in the __init__ of an operator, not from macros?
thanks
Try:
self.task_hash = int(hashlib.sha1("{}#{}#{{execution_date}}".format(
    self.dag_id, self.task_id).encode('utf-8')).hexdigest(), 16)
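As an alternative not covered by the answer above: since macros are only rendered at run time (and only on string fields listed in template_fields), one hedged option is to compute the hash inside execute(), where the execution date is available from the task context. The operator name below is hypothetical, and this does not help with values that must already exist at __init__ time, such as retry_delay:
import hashlib

from airflow.models import BaseOperator


class HashedTaskOperator(BaseOperator):  # hypothetical operator name
    def execute(self, context):
        # The execution date is read from the runtime task context,
        # so it differs per run instead of staying a literal template string.
        execution_date = context["execution_date"]
        task_hash = int(
            hashlib.sha1(
                "{}#{}#{}".format(self.dag_id, self.task_id, execution_date)
                .encode("utf-8")
            ).hexdigest(),
            16,
        )
        self.log.info("Task hash: %s", task_hash)
        return task_hash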

Use string in subprocess

I've written Python code to compute an IP address programmatically, which I then want to use in an external connection program.
I don't know how to pass it to the subprocess:
import subprocess
from subprocess import call

some_ip = "192.0.2.0"  # Actually the result of some computation,
                       # so I can't just paste it into the call below.
subprocess.call("given.exe -connect host (some_ip)::5631 -Password")
I've read what I could and found similar questions, but I truly cannot understand this step: how to use the value of some_ip in the subprocess call. If someone could explain this to me, it would be greatly appreciated.
If you don't use it with shell=True (and I don't recommend shell=True unless you really know what you're doing, as shell mode can have security implications), subprocess.call takes the command as a sequence (e.g. a list) of its components: first the executable name, then the arguments you want to pass to it. All of those should be strings, but whether they are string literals, variables holding a string, or function calls returning a string doesn't matter.
Thus, the following should work:
import subprocess

some_ip = "192.0.2.0"  # Actually the result of some computation.
subprocess.call(
    ["given.exe", "-connect", "host", "{}::5631".format(some_ip), "-Password"])
I'm using str's format method to replace the {} placeholder in "{}::5631" with the string in some_ip.
If you invoke it as subprocess.call(...), then
import subprocess
is sufficient and
from subprocess import call
is unnecessary. The latter would be needed if you want to invoke the function as just call(...). In that case the former import would be unneeded.
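As a small illustrative sketch of the two import styles:
# Style 1: import the module and call the function with the module prefix
import subprocess
subprocess.call(["echo", "hello"])

# Style 2: import the function directly and call it without the prefix
from subprocess import call
call(["echo", "hello"])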

Robot Framework is adding arguments where there are none: TypeError: takes no arguments (1 given)

Robot is telling me that I'm providing too many arguments to my keyword. I've boiled it down to a base case where I have a keyword that should do nothing:
def do_nothing():
    """
    Does absolutely nothing
    """
Calling this keyword like this:
*** Test Cases ***
testCaseOne
    do_nothing
Gives this result:
TypeError: do_nothing() takes no arguments (1 given)
Adding a parameter to the keyword definition fixes the problem. Why does Robot seem to pass one parameter to each keyword, even if there are no parameters in the test case?
I found the answer here.
The issue has nothing to do with Robot Framework and everything to do with Python: Python implicitly passes the current instance of the class to method calls, but I needed to explicitly declare the parameter. This is customarily named self:
def do_nothing(self):
This test runs.
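For context, a minimal sketch of how such a keyword typically lives inside a class-based library (the class name is hypothetical):
# MyLibrary.py -- hypothetical keyword library imported by the test suite
class MyLibrary:

    def do_nothing(self):
        """
        Does absolutely nothing
        """
        pass
Module-level functions, by contrast, do not receive an implicit instance argument, which is why a plain function without self also works when the library is a module rather than a class.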
