When I do something like:
some_value = "{{ dag_run.get_task_instance('start').start_date }}"
print(f"some interpolated value: {some_value}")
I see this in the airflow logs:
some interpolated value: {{ dag_run.get_task_instance('start').start_date }}
but not the actual value itself. How can I easily see what the value is?
Everything in the DAG task run comes through as kwargs (before Airflow 1.10.12 you needed to add provide_context=True; from version 2 onward the full context is always provided).
To get something out of kwargs, do something like this in your python callable:
run_id = kwargs['run_id']
print(f'run_id = {run_id}')
Additional info:
To get the kwargs out, add them to your callable, so:
def my_func(**kwargs):
    run_id = kwargs['run_id']
    print(f'run_id = {run_id}')
And just call this from your DAG task like:
my_task = PythonOperator(
    task_id='my_task',
    dag=dag,
    python_callable=my_func,
)
I'm not sure what your current code structure looks like, I'm afraid, since you haven't provided more detail.
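If what you want is to see the rendered value of your original template, one option is to pass the string through a templated field such as op_kwargs so Airflow renders it before the callable runs. A minimal sketch (assuming Airflow 2.x, where op_kwargs is a templated field of PythonOperator; the task and function names here are hypothetical):

from airflow.operators.python import PythonOperator

def print_rendered(some_value, **kwargs):
    # By the time this runs, Airflow has rendered the Jinja template,
    # so this prints the actual start_date rather than the raw template string.
    print(f"some interpolated value: {some_value}")

print_task = PythonOperator(
    task_id='print_task',
    dag=dag,
    python_callable=print_rendered,
    op_kwargs={
        'some_value': "{{ dag_run.get_task_instance('start').start_date }}",
    },
)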
I am using a DAG (s3_sensor_dag) to trigger another DAG (advanced_dag), and I pass the tag_names configuration to the triggered DAG (advanced_dag) using the conf argument. It looks something like this:
s3_sensor_dag.py:
trigger_advanced_dag = TriggerDagRunOperator(
    task_id="trigger_advanced_dag",
    trigger_dag_id="advanced_dag",
    wait_for_completion=True,
    conf={"tag_names": "{{ task_instance.xcom_pull(key='tag_names', task_ids='get_file_paths') }}"}
)
In the advanced_dag, I am trying to access the dag_conf (tag_names) like this:
advanced_dag.py:
with DAG(
    dag_id="advanced_dag",
    start_date=datetime(2020, 12, 23),
    schedule_interval=None,
    is_paused_upon_creation=False,
    catchup=False,
    dagrun_timeout=timedelta(minutes=60),
) as dag:
    dag_parser = DagParser(
        home=HOME,
        env=env,
        global_cli_flags=GLOBAL_CLI_FLAGS,
        tag=dag_run.conf["tag_names"]
    )
But I get an error stating that dag_run does not exist. I realized from Accessing configuration parameters passed to Airflow through CLI that this is a runtime variable.
So I tried a solution mentioned in the comments there, which uses dag.get_dagrun(execution_date=dag.latest_execution_date).conf and goes something like:
dag_parser = DagParser(
    home=HOME,
    env=env,
    global_cli_flags=GLOBAL_CLI_FLAGS,
    tag=dag.get_dagrun(execution_date=dag.latest_execution_date).conf['tag_names']
)
But it looks like it didn't fetch the value either.
I was able to solve this issue by using Airflow Variables but I wanted to know if there is a way to use the dag_conf (which obviously gets data only during runtime) inside the dag() code and get the value.
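For reference, since dag_run and its conf only exist at run time, they are typically read inside a task rather than in top-level DAG code. A minimal sketch of that pattern (hypothetical task and callable names, reusing the tag_names key from above):

from airflow.operators.python import PythonOperator

def use_tag_names(**context):
    # dag_run (and its conf) is only available in the task's runtime context
    tag_names = context["dag_run"].conf.get("tag_names")
    print(f"tag_names = {tag_names}")

read_conf = PythonOperator(
    task_id="read_conf",
    python_callable=use_tag_names,
    dag=dag,
)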
I am trying to pass data from one PythonOperator, _etl_lasic, to another PythonOperator, _download_s3_data, which works fine, but I want to raise an exception when the value passed is None, which should mark the task as a failure.
import airflow
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.exceptions import AirflowFailException

def _etl_lasic(**context):
    path_s3 = None
    context["task_instance"].xcom_push(
        key="path_s3",
        value=path_s3,
    )

def _download_s3_data(templates_dict, **context):
    path_s3 = templates_dict["path_s3"]
    if not path_s3:
        raise AirflowFailException("Path to S3 was not passed!")
    else:
        print(f"Path to S3: {path_s3}")

with DAG(
    dag_id="02_lasic_retraining_without_etl",
    start_date=airflow.utils.dates.days_ago(3),
    schedule_interval="@once",
) as dag:
    etl_lasic = PythonOperator(
        task_id="etl_lasic",
        python_callable=_etl_lasic,
    )
    download_s3_data = PythonOperator(
        task_id="download_s3_data",
        python_callable=_download_s3_data,
        templates_dict={
            "path_s3": "{{task_instance.xcom_pull(task_ids='etl_lasic',key='path_s3')}}"
        },
    )

    etl_lasic >> download_s3_data
Logs:
[2021-08-17 04:04:41,128] {logging_mixin.py:103} INFO - Path to S3: None
[2021-08-17 04:04:41,128] {python.py:118} INFO - Done. Returned value was: None
[2021-08-17 04:04:41,143] {taskinstance.py:1135} INFO - Marking task as SUCCESS. dag_id=02_lasic_retraining_without_etl, task_id=download_s3_data, execution_date=20210817T040439, start_date=20210817T040440, end_date=20210817T040441
[2021-08-17 04:04:41,189] {taskinstance.py:1195} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-08-17 04:04:41,212] {local_task_job.py:118} INFO - Task exited with return code 0
Jinja-templated values are rendered as strings by default. In your case, even though you push an XCom value of None, when the value is pulled via "{{task_instance.xcom_pull(task_ids='etl_lasic',key='path_s3')}}" it is actually rendered as the string "None", which doesn't raise an exception under the current logic.
There are two options that will solve this:
Instead of setting path_s3 to None in the "_etl_lasic" function, set it to an empty string.
If you are using Airflow 2.1+, there is a parameter, render_template_as_native_obj, that can be set at the DAG level which will render Jinja-templated values as native Python types (list, dict, etc.). Setting that parameter to True will do the trick without changing how path_s3 is set in the function. A conceptual example is documented here.
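For the second option, a minimal sketch of the DAG definition (assuming Airflow 2.1+) would be:

with DAG(
    dag_id="02_lasic_retraining_without_etl",
    start_date=airflow.utils.dates.days_ago(3),
    schedule_interval="@once",
    # Render Jinja templates as native Python objects instead of strings,
    # so the pulled XCom value arrives as None rather than the string "None".
    render_template_as_native_obj=True,
) as dag:
    ...

With that setting, the pulled XCom arrives as the Python value None, so the if not path_s3 check raises AirflowFailException as intended.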
What I'm trying to do is use the dag_id and run_id as parts of the path in S3 that I want to land the data, but I'm starting to understand that these templated values are only applied in a task execution context.
Is there any way I can provide their values to the operator, like below, to control where the files go?
my_task = RedshiftToS3Transfer(
    task_id='my_task',
    schema='public',
    table='my_table',
    s3_bucket='bucket',
    s3_key='foo/bar/{{ dag_id }}/{{ run_id }}',
    redshift_conn_id='MY_CONN',
    aws_conn_id='AWS_DEFAULT',
    dag=dag
)
This is a two part answer.
FIRST PART:
How to get s3_key templated.
Recommended approach:
Your code will be templated just fine if you import the operator from providers. This is because the RedshiftToS3Transfer in providers has s3_key listed as a templated field.
Deprecated approach: (Will not be valid for Airflow > 2.0)
If you import the operator from Airflow core, you will need to write a custom operator that wraps RedshiftToS3Transfer:
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer

class MyRedshiftToS3Transfer(RedshiftToS3Transfer):
    template_fields = ['s3_key']

my_task = MyRedshiftToS3Transfer(
    task_id='my_task',
    schema='public',
    table='my_table',
    s3_bucket='bucket',
    s3_key='foo/bar/{{ dag_id }}/{{ run_id }}',
    redshift_conn_id='MY_CONN',
    aws_conn_id='AWS_DEFAULT',
    dag=dag
)
Which will give you:
SECOND PART:
How to choose the templated value.
Now, as you can see in the first part, the output isn't a real working path, as it contains invalid values.
I would recommend using task_instance_key_str. From the docs, it's a unique, human-readable key to the task instance, formatted as {dag_id}__{task_id}__{ds_nodash}.
So you can use it in your code:
s3_key='foo/bar/{{ task_instance_key_str }}'
Which will give you:
That's good for daily DAGs, but if your DAG runs on a smaller interval you can do:
s3_key='foo/bar/{{task.dag_id}}__{{task.task_id}}__{{ ts_nodash }}'
Which will give you:
Ended up doing
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer
from airflow.utils.decorators import apply_defaults

class TemplatedRedshiftToS3Transfer(RedshiftToS3Transfer):
    template_fields = ['s3_key']

    @apply_defaults
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

to create a new derived class from RedshiftToS3Transfer, which passes the s3_key field from instantiation through the templating engine.
I thought the macro prev_execution_date listed here would get me the execution date of the last DAG run, but looking at the source code it seems to only get the last date based on the DAG schedule.
prev_execution_date = task.dag.previous_schedule(self.execution_date)
Is there any way via macros to get the execution date of the DAG when it doesn't run on a schedule?
Yes, you can define your own custom macro for this as follows:
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    if last_dag_run is None:
        return "no prev run"
    else:
        return last_dag_run.execution_date.strftime("%Y-%m-%d")

# add macro in user_defined_macros in dag definition
dag = DAG(dag_id="my_test_dag",
          schedule_interval='@daily',
          user_defined_macros={
              'last_dag_run_execution_date': get_last_dag_run
          }
)

# example of using it in practice
print_vals = BashOperator(
    task_id='print_vals',
    bash_command='echo {{ last_dag_run_execution_date(dag) }}',
    dag=dag
)
Note that dag.get_last_dagrun() is just one of the many functions available on the DAG object. Here's where I found it: https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/models.py#L3396
You can also tweak the formatting of the string for the date format, and what you want output if there is no previous run.
You can also write your own custom macro function and use the Airflow models to query the metadata database.
def get_last_dag_run(dag_id):
    # TODO: search the metadata DB for the last run of dag_id
    return xxx

dag = DAG(
    'example',
    schedule_interval='0 1 * * *',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run,
    }
)
Then use the key ('last_dag_run_execution_date') in your template.
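As a minimal sketch of what the TODO above could look like (assuming Airflow 1.10-style imports, querying the DagRun table in the metadata database directly):

from airflow.models import DagRun
from airflow.utils.db import provide_session

@provide_session
def get_last_dag_run(dag_id, session=None):
    # Fetch the most recent run of the given DAG from the metadata database.
    last_run = (
        session.query(DagRun)
        .filter(DagRun.dag_id == dag_id)
        .order_by(DagRun.execution_date.desc())
        .first()
    )
    if last_run is None:
        return "no prev run"
    return last_run.execution_date.strftime("%Y-%m-%d")

In the template you would then call it with the DAG id, e.g. echo {{ last_dag_run_execution_date(dag.dag_id) }}.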
The Airflow docs say: "You can use Jinja templating with every parameter that is marked as “templated” in the documentation." It makes sense that specific parameters in the Airflow world (such as certain parameters to PythonOperator) get templated by Airflow automatically. I'm wondering what the best/correct way is to get a non-Airflow variable templated. My specific use case is something similar to:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

from somewhere import export_votes_data, export_queries_data
from elsewhere import ApiCaucus, ApiQueries

dag = DAG('export_training_data',
          description='Export training data for all active orgs to GCS',
          schedule_interval=None,
          start_date=datetime(2018, 3, 26), catchup=False)

HOST = "http://api-00a.dev0.solvvy.co"
BUCKET = "gcs://my-bucket-name/{{ ds }}/"  # I'd like this to get templated

votes_api = ApiCaucus.get_votes_api(HOST)
queries_api = ApiQueries.get_queries_api(HOST)

export_votes = PythonOperator(task_id="export_votes", python_callable=export_votes_data,
                              op_args=[BUCKET, votes_api], dag=dag)

export_queries = PythonOperator(task_id="export_queries", python_callable=export_queries_data,
                                op_args=[BUCKET, queries_api, export_solutions.task_id], dag=dag,
                                provide_context=True)
The provide_context argument for the PythonOperator will pass along the arguments that are used for templating. From the documentation:
provide_context (bool) – if set to true, Airflow will pass a set of keyword arguments that can be used in your function. This set of kwargs correspond exactly to what you can use in your jinja templates. For this to work, you need to define **kwargs in your function header.
With the context provided to your callable, you can then do the interpolation in your function:
def your_callable(bucket, api, **kwargs):
    bucket = bucket.format(**kwargs)
    [...]
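A minimal sketch of the full wiring under that approach (illustrative names; note that str.format expects Python-style {ds} placeholders rather than Jinja's {{ ds }}):

from airflow.operators.python_operator import PythonOperator

BUCKET = "gcs://my-bucket-name/{ds}/"  # Python format-style placeholder

def export_votes_to_bucket(bucket, api, **kwargs):
    # kwargs holds the same values that are available in Jinja templates (ds, run_id, ...)
    bucket = bucket.format(**kwargs)  # -> "gcs://my-bucket-name/2018-03-26/"
    print(f"Exporting votes from {api} to {bucket}")

export_votes = PythonOperator(
    task_id="export_votes",
    python_callable=export_votes_to_bucket,
    op_args=[BUCKET, votes_api],
    provide_context=True,  # unnecessary in Airflow 2.x, where context is always passed
    dag=dag,
)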
Inside an Operator's methods (execute/pre_execute/post_execute, or anywhere else you can get the Airflow context):
BUCKET = "gcs://my-bucket-name/{{ ds }}/" # I'd like this to get templated
jinja_context = context['ti'].get_template_context()
rendered_content = self.render_template('', BUCKET, jinja_context)
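Wrapped into a custom operator, that might look like the following sketch (assuming Airflow 1.x, where render_template takes the attribute name, the content, and the context; the operator class name is hypothetical):

from airflow.models import BaseOperator

class ExportToBucketOperator(BaseOperator):
    def execute(self, context):
        bucket = "gcs://my-bucket-name/{{ ds }}/"
        # Render the Jinja template against the task instance's template context.
        jinja_context = context['ti'].get_template_context()
        rendered_bucket = self.render_template('', bucket, jinja_context)
        self.log.info("Rendered bucket: %s", rendered_bucket)
        return rendered_bucket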