Airflow Macros In Python Operator

I'm trying to use the Airflow macros in my Python Operator but I keep receiving "airflow: error: unrecognized arguments:"
So I am importing a function that has 3 positional arguments: (sys.argv,start_date,end_date) and I am hoping to make the start_date and end_date the execution date in Airflow.
The function signature looks something like this:
def main(argv, start_date, end_date):
Here is the task I have in the DAG:
t1 = PythonOperator(
    task_id='Pull_DCM_Report',
    provide_context=True,
    python_callable=main,
    op_args=[sys.argv, '{{ ds }}', '{{ ds }}'],
    dag=dag)

Since you're passing in dates that need to be rendered by Airflow, you'll want to use the templates_dict parameter of the PythonOperator. This is the only field that Airflow will recognize as containing templates.
You can create a custom Python operator that recognizes more fields as templates by subclassing (or copying) the existing operator and adding the relevant fields to the template_fields tuple, as in the sketch after the example below.
def main(**kwargs):
    argv = kwargs.get('templates_dict').get('argv')
    start_date = kwargs.get('templates_dict').get('start_date')
    end_date = kwargs.get('templates_dict').get('end_date')

t1 = PythonOperator(task_id='Pull_DCM_Report',
                    provide_context=True,
                    python_callable=main,
                    templates_dict={'argv': sys.argv,
                                    'start_date': '{{ yesterday_ds }}',
                                    'end_date': '{{ ds }}'},
                    dag=dag)
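If you prefer keeping op_args as in the question, a minimal sketch of the subclassing approach mentioned above could look like the following (the class name is just for illustration; newer Airflow releases already include op_args and op_kwargs in PythonOperator.template_fields, so check your version first):
import sys

from airflow.operators.python_operator import PythonOperator


class TemplatedArgsPythonOperator(PythonOperator):
    # extend the stock template_fields so op_args is rendered by Jinja as well
    template_fields = (*PythonOperator.template_fields, 'op_args')


t1 = TemplatedArgsPythonOperator(
    task_id='Pull_DCM_Report',
    python_callable=main,
    op_args=[sys.argv, '{{ ds }}', '{{ ds }}'],
    dag=dag)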

You can "wrap" the call to the main function with the following:
t1 = PythonOperator(
    task_id='Pull_DCM_Report',
    provide_context=True,
    python_callable=lambda **context: main([], context["ds"], context["ds"]),
    dag=dag)
If lambdas aren't your cup of tea, you can define a named function, point python_callable at that, and have it call out to main, as in the sketch below.
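For example, a small wrapper (a sketch assuming the same main signature as in the question; the wrapper name is just for illustration):
def pull_dcm_report_wrapper(**context):
    # context["ds"] is the rendered execution date as a 'YYYY-MM-DD' string;
    # an empty argv is forwarded here - swap in sys.argv if you really need it
    return main([], context["ds"], context["ds"])

t1 = PythonOperator(
    task_id='Pull_DCM_Report',
    provide_context=True,
    python_callable=pull_dcm_report_wrapper,
    dag=dag)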

Related

How to set airflow `http_conn_id` with a param?

Running Airflow 2.2.2
I would like to parametrize the http_conn_id using the DAG input parameters as such:
with DAG(params={'api': 'my-api-id'}) as dag:
    post_op = SimpleHttpOperator(
        task_id='post_op',
        endpoint='custom-end-point',
        http_conn_id='{{ params.api }}',  # <- this doesn't get filled correctly
        dag=dag)
Where my-api-id is set in the Airflow Connections.
However, when executing, the operator evaluates http_conn_id as '{{ params.api }}'.
I'm suspecting this is not possible - or is an anti-pattern?
Airflow operators do not render all of their fields; they render only the fields listed in the template_fields attribute. For SimpleHttpOperator, those are only:
template_fields: Sequence[str] = (
    'endpoint',
    'data',
    'headers',
)
To get around the problem, you can create a new class that extends the official operator and just adds the extra fields you want rendered:
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator


class MyHttpOperator(SimpleHttpOperator):
    template_fields = (
        *SimpleHttpOperator.template_fields,
        'http_conn_id',
    )


with DAG(
    dag_id='http_dag',
    start_date=datetime.today(),
    params={'api': 'my-api-id'}
) as dag:
    post_op = MyHttpOperator(
        task_id='post_op',
        endpoint='custom-end-point',
        http_conn_id='{{ params.api }}',
        dag=dag
    )

Apache Airflow - use python result in the next steps

I am working on a simple Apache Airflow DAG. My goal is to:
1. calculate a date parameter based on the DAG run date - I try to achieve that with a PythonOperator.
2. pass the parameter calculated above as a BigQuery query parameter.
Any ideas are welcome.
My code is below - I have marked the two points I am struggling with using 'TODO' labels.
...
def set_date_param(dag_run_time):
    # a business logic applied here
    ....
    return "2020-05-28"  # example result

# --------------------------------------------------------
# DAG definition below
# --------------------------------------------------------

# Python operator
set_data_param = PythonOperator(
    task_id='set_data_param',
    python_callable=set_date_param,
    provide_context=True,
    op_kwargs={
        "dag_run_date":  # TODO - how to pass the DAG run date as a function input parameter
    },
    dag=dag
)

# bq operator
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT ccustomer_id, sales
           FROM `my_project.dataset1.table1`
           WHERE date_key = {date_key_param}
        """.format(
        date_key_param=  # TODO - how to get the python operator result from the previous step
    ),
    use_legacy_sql=False,
    destination_dataset_table="my_project.dataset2.table2",
    trigger_rule='all_success',
    dag=dag
)

set_data_param >> load_data_to_bq_table
For PythonOperator to pass the execution date to the python_callable, you only need to set provide_context=True (as has already been done in your example). This way, Airflow automatically passes a collection of keyword arguments to the python callable, such that the names and values of these arguments are equivalent to the template variables described here. That is, if you define the python callable as set_date_param(ds, **kwargs): ..., the ds parameter will automatically get the execution date as a string value in the format YYYY-MM-DD.
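For instance, a minimal sketch of such a callable (the body is a placeholder for your business logic):
def set_date_param(ds, **kwargs):
    # `ds` arrives pre-rendered as a 'YYYY-MM-DD' string, e.g. '2020-05-28';
    # apply the business logic here and return the value to be pushed to XCOM
    return ds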
XCOM allows task instances to exchange messages. To use the date returned by set_date_param() inside the sql query string of BigQueryOperator, you can combine XCOM with Jinja templating:
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {{ task_instance.xcom_pull(task_ids='set_data_param') }}
"""
The following complete example puts all pieces together. In the example, the get_date task creates a date string based on the execution date. After that, the use_date task uses XCOM and Jinja templating to retrieve the date string and writes it to a log.
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

default_args = {'start_date': days_ago(1)}


def calculate_date(ds, execution_date, **kwargs):
    return f'{ds} ({execution_date.strftime("%m/%d/%Y")})'


def log_date(date_string):
    logging.info(date_string)


with DAG(
    'a_dag',
    schedule_interval='*/5 * * * *',
    default_args=default_args,
    catchup=False,
) as dag:
    get_date = PythonOperator(
        task_id='get_date',
        python_callable=calculate_date,
        provide_context=True,
    )
    use_date = PythonOperator(
        task_id='use_date',
        python_callable=log_date,
        op_args=['Date: {{ task_instance.xcom_pull(task_ids="get_date") }}'],
    )
    get_date >> use_date
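Mapped back onto the BigQueryOperator from the question, the same idea would look roughly like this (a sketch; exact parameter names can vary between BigQueryOperator versions):
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    # `sql` is a templated field, so the xcom_pull call is rendered at run time
    sql="""SELECT ccustomer_id, sales
           FROM `my_project.dataset1.table1`
           WHERE date_key = '{{ task_instance.xcom_pull(task_ids="set_data_param") }}'
        """,
    use_legacy_sql=False,
    destination_dataset_table="my_project.dataset2.table2",
    dag=dag
)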

How to pass DAG run date to the tasks?

I'm new to Airflow and trying to figure out how to pass the DAG run date to each task. I have this in my DAG:
_tzinfo = tz.gettz('America/Los_Angeles')
dag_run_date = datetime.now(_tzinfo)

dag = DAG(
    'myDag',
    default_args=default_args,
    schedule_interval=None,
    params={
        "runDateTimeTz": dag_run_date.strftime("%Y-%m-%dT%H:%M:%S.%f%z")
    }
)
Then I try to pass the runDateTimeTz parameter to each of my tasks, something like this..
task1 = GKEPodOperator(
    image='gcr.io/myJar:1.0.1.45',
    cmds=['java'],
    arguments=["-jar", "myJar.jar", "{{ params.runDateTimeTz }}"],
    dag=dag)

task2 = GKEPodOperator(
    image='gcr.io/myJar2:1.0.1.45',
    cmds=['java'],
    arguments=["-jar", "myJar2.jar", "{{ params.runDateTimeTz }}"],
    dag=dag)
My tasks execute correctly, but I was expecting all of them to receive the same run date in params.runDateTimeTz. That didn't happen: for example, task1 gets params.runDateTimeTz=2020-04-16T07:42:47.412716-07:00 and task2 gets params.runDateTimeTz=2020-04-16T07:43:29.913289-07:00.
I suppose this behavior is related to the way Airflow fills in the params for the DAG; it looks like params.runDateTimeTz is evaluated only when each task starts to run, but I want to compute it once up front and send it to each task as an argument so that all tasks get the same value.
Can someone assist me on what I'm doing wrong?
You can use the execution_date or ds from Airflow Macros:
Details: https://airflow.apache.org/docs/stable/macros-ref#default-variables
task1 = GKEPodOperator(
    image='gcr.io/myJar:1.0.1.45',
    cmds=['java'],
    arguments=["-jar", "myJar.jar", "{{ ds }}"],
    dag=dag)

task2 = GKEPodOperator(
    image='gcr.io/myJar2:1.0.1.45',
    cmds=['java'],
    arguments=["-jar", "myJar2.jar", "{{ ds }}"],
    dag=dag)
If you need a timestamp you can use {{ ts }}
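For example, swapping the macro in the arguments gives the full execution timestamp (rendered as an ISO-8601-style string):
arguments=["-jar", "myJar.jar", "{{ ts }}"],  # e.g. 2020-04-16T00:00:00+00:00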

Airflow is taking jinja template as string

In Airflow I'm trying to use a Jinja template, but the problem is that it is not getting parsed and is instead treated as a string. Please see my code:
from datetime import datetime

from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG


def test_method(dag, network_id, schema_name):
    print("Schema_name in test_method", schema_name)
    third_task = PythonOperator(
        task_id='first_task_' + network_id,
        provide_context=True,
        python_callable=print_context2,
        dag=dag)
    return third_task


dag = DAG('testing_xcoms_pull', description='Testing Xcoms',
          schedule_interval='0 12 * * *',
          start_date=datetime.today(),
          catchup=False)


def print_context(ds, **kwargs):
    return 'Returning from print_context'


def print_context2(ds, **kwargs):
    return 'Returning from print_context2'


def get_schema(ds, **kwargs):
    # Returning schema name based on network_id
    schema_name = "my_schema"
    return schema_name


first_task = PythonOperator(
    task_id='first_task',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

second_task = PythonOperator(
    task_id='second_task',
    provide_context=True,
    python_callable=get_schema,
    dag=dag)

network_id = '{{ dag_run.conf["network_id"] }}'

first_task >> second_task >> test_method(
    dag=dag,
    network_id=network_id,
    schema_name='{{ ti.xcom_pull("second_task") }}')
The DAG creation is failing because '{{ dag_run.conf["network_id"] }}' is taken as a string by Airflow. Can anyone help me with the problem in my code?
Airflow operators have a variable called template_fields. This variable is usually declared at the top of the operator class; check out any of the operators in the GitHub code base.
If the field you are trying to pass Jinja template syntax into is not in the template_fields list, the Jinja syntax will appear as a string.
A DAG object, and its definition code, isn't parsed within the context of an execution; it's parsed with regard to the environment available to it when loaded by Python.
The network_id variable, which you use to build the task_id in your function, isn't templated prior to execution; it can't be, since there is no execution active. Even with templating you still need a valid, static, non-templated task_id value to instantiate the operator. A sketch of one workaround is shown below.
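A workaround consistent with this is to keep the task_id static and read the run-specific values only at execution time, from the context passed to the callable. A rough sketch (the task and function names here are just for illustration):
def print_schema_info(ds, **kwargs):
    # dag_run.conf and XCOM values only exist at run time, so read them here
    network_id = kwargs["dag_run"].conf.get("network_id")
    schema_name = kwargs["ti"].xcom_pull(task_ids="second_task")
    print("network_id:", network_id, "schema_name:", schema_name)

third_task = PythonOperator(
    task_id='third_task',  # a static, non-templated task_id
    provide_context=True,
    python_callable=print_schema_info,
    dag=dag)

first_task >> second_task >> third_task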

Apache airflow macro to get last dag run execution time

I thought the macro prev_execution_date listed here would get me the execution date of the last DAG run, but looking at the source code it seems to only get the last date based on the DAG schedule.
prev_execution_date = task.dag.previous_schedule(self.execution_date)
Is there any way via macros to get the execution date of the DAG when it doesn't run on a schedule?
Yes, you can define your own custom macro for this as follows:
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    if last_dag_run is None:
        return "no prev run"
    else:
        return last_dag_run.execution_date.strftime("%Y-%m-%d")

# add macro in user_defined_macros in dag definition
dag = DAG(dag_id="my_test_dag",
          schedule_interval='@daily',
          user_defined_macros={
              'last_dag_run_execution_date': get_last_dag_run
          })

# example of using it in practice
print_vals = BashOperator(
    task_id='print_vals',
    bash_command='echo {{ last_dag_run_execution_date(dag) }}',
    dag=dag
)
Note that dag.get_last_dagrun() is just one of the many functions available on the DAG object. Here's where I found it: https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/models.py#L3396
You can also tweak the formatting of the date string, and what you want output if there is no previous run.
You can also write your own custom macro function that uses the Airflow models to search the metadata database.
def get_last_dag_run(dag_id):
    # TODO: search the metadata DB
    return xxx

dag = DAG(
    'example',
    schedule_interval='0 1 * * *',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run,
    }
)
Then use the key in your template. A filled-in sketch of the TODO above follows.
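For instance, a hedged sketch of that TODO using the DagRun model (adjust imports and the query for your Airflow version):
from airflow.models import DagRun


def get_last_dag_run(dag_id):
    # query the metadata database for all runs of this DAG
    dag_runs = DagRun.find(dag_id=dag_id)
    if not dag_runs:
        return "no prev run"
    # sort explicitly instead of relying on the query's default ordering
    dag_runs.sort(key=lambda run: run.execution_date)
    return dag_runs[-1].execution_date.strftime("%Y-%m-%d")

Used in a template, that would be something like bash_command='echo {{ last_dag_run_execution_date(dag.dag_id) }}' (note this version of the macro takes the dag_id string rather than the DAG object).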
