Dynamic task_id names in Airflow

I have a DAG with one DataflowTemplateOperator that can deal with different JSON files. When I trigger the DAG I pass some parameters via {{ dag_run.conf['param1'] }} and it works fine.
The issue I have is trying to rename the task_id based on param1,
i.e. task_id="df_operator_read_object_json_file_{{dag_run.conf['param1']}}",
which complains that only alphanumeric characters are allowed,
or
task_id="df_operator_read_object_json_file_{}".format(dag_run.conf['param1']),
which does not recognise dag_run, on top of the alphanumeric issue.
The whole idea behind this is that when I look at the Dataflow jobs console and a job has failed, I know who the offender is based on param1. Dataflow job names are based on the task_id, like this:
df-operator-read-object-json-file-8b9eecec
and what I need is this:
df-operator-read-object-param1-json-file-8b9eecec
Any ideas if this is possible?

There is no need to generate a new operator per file.
DataflowTemplatedJobStartOperator has a job_name parameter which is also templated, so it can be used with Jinja.
I didn't test it but this should work:
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

op = DataflowTemplatedJobStartOperator(
    task_id="df_operator_read_object_json_file",
    job_name="df_operator_read_object_json_file_{{ dag_run.conf['param1'] }}",
    template='gs://dataflow-templates/your_template',
    location='europe-west3',
)
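If in doubt, a quick way to confirm which arguments accept Jinja is to print the operator's declared template_fields; a small check sketch, nothing here is specific to your DAG:
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

# Any argument listed here is rendered with Jinja at runtime; 'job_name'
# should appear in the tuple, while 'task_id' never does, which is why
# the task_id itself cannot carry dag_run.conf values.
print(DataflowTemplatedJobStartOperator.template_fields)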

Related

Vertex AI Airflow Operators don't render XCom pulls (specifically CreateBatchPredictionJobOperator)

I am trying to run a batch prediction job using the Vertex AI Airflow operator CreateBatchPredictionJobOperator. This requires pulling a model id from XCom, which was pushed by a previous custom container training job. However, CreateBatchPredictionJobOperator doesn't seem to render XCom pulls as expected.
I am running Airflow 2.3.0 on my local machine.
My code looks something like this:
batch_job_task = CreateBatchPredictionJobOperator(
    gcp_conn_id="gcp_connection",
    task_id="batch_job_task",
    job_display_name=JOB_DISPLAY_NAME,
    model_name="{{ ti.xcom_pull(key='model_conf')['model_id'] }}",
    predictions_format="bigquery",
    bigquery_source=BIGQUERY_SOURCE,
    region=REGION,
    project_id=PROJECT_ID,
    machine_type="n1-standard-2",
    bigquery_destination_prefix=BIGQUERY_DESTINATION_PREFIX,
)
This results in a value error when the task runs:
ValueError: Resource {{ ti.xcom_pull(key='model_conf')['model_id'] }} is not a valid resource id.
The expected behaviour would be to pull that variable by key and render it as a string.
I can also confirm that I am able to see the model id (and other info) in XCom by navigating there in the UI. I attempted using the same syntax with xcom_pull with a PythonOperator and it works.
def print_xcom_value(value):
    print("VALUE:", value)

print_xcom_value_by_key = PythonOperator(
    task_id="print_xcom_value_by_key",
    python_callable=print_xcom_value,
    op_kwargs={"value": "{{ ti.xcom_pull(key='model_conf')['model_id'] }}"},
    provide_context=True,
)
> [2022-12-15, 13:11:19 UTC] {logging_mixin.py:115} INFO - VALUE: 3673414612827265024
CreateBatchPredictionJobOperator does not accept provide_context as an argument. I assumed it would render XCom pulls by default, since XCom pulls are used with CreateBatchPredictionJobOperator in an example in the Airflow docs (link here).
Is there any way I can provide context to this Vertex AI Operator to pull from the XCom storage?
Is something wrong with the syntax that I am not seeing? Is there anything I am misunderstanding in the docs?
UPDATE:
One thing that confuses me is that model_name is a templated field according to the Airflow docs (link here) but the field is not rendering the XCom template.
Did you set render_template_as_native_obj=True in your DAG definition?
What version of apache-airflow-providers-google do you use?
====
From OP:
Your answer was a step in the right direction.
The solution was to upgrade apache-airflow-providers-google to the latest version (at the moment, this is 8.6.0). I'm not able to pinpoint exactly where in the changelog this fix is mentioned.
Setting render_template_as_native_obj=True was not useful for this issue since it rendered the id pulled from XCom as an int, and I found no proper way to convert it to str when passed into CreateBatchPredictionJobOperator in the model_name arg.
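For what it's worth, a diagnostic that would have pointed at the provider version is to check whether the installed operator actually declares model_name as templated; a sketch, assuming the current provider module layout:
from airflow.providers.google.cloud.operators.vertex_ai.batch_prediction_job import (
    CreateBatchPredictionJobOperator,
)

# If 'model_name' is missing from this tuple, the Jinja expression is
# passed through unrendered and fails the resource-id validation with
# exactly the error shown above; upgrading the provider fixes that.
print(CreateBatchPredictionJobOperator.template_fields)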

How to set a default dag trigger configuration json

When triggering a DAG in Airflow, there is a window in which I am able to pass parameters to the DAG in JSON format.
This JSON is always empty, and I have to know in advance which parameters I can pass to the DAG. Instead, I would like to be able to prefill this JSON, so that when another user tries to trigger the DAG he can simply change the values of the JSON instead of having to look at the DAG's code first.
Is there any way to do this in the current version (2.0.0) of Airflow?
On Airflow 2.1.0 it is possible to set default arguments as follows:
from datetime import timedelta

from airflow import DAG

dag = DAG(
    dag_id="my_dag",
    schedule_interval=None,
    default_args={'retries': 3, 'retry_delay': timedelta(seconds=20)},
    catchup=False,
    tags=['maintenance'],
    params={"description": ""},  # Set parameters as a dictionary
)
In the trigger UI, this params dictionary then shows up as the pre-filled configuration JSON.
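A downstream task can then read the value through Jinja; a minimal sketch, where the BashOperator task and the echo command are only illustrative:
from airflow.operators.bash import BashOperator

# dag_run.conf holds whatever the user typed into the trigger window;
# params.description is the default declared on the DAG above.
show_description = BashOperator(
    task_id="show_description",
    bash_command="echo {{ dag_run.conf.get('description', params.description) }}",
    dag=dag,
)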
When writing my feature request I actually found a pull request, which is already merged and seems to do exactly what is described:
https://github.com/apache/airflow/pull/10839
An improvement of this feature also seems to be planned. See:
https://github.com/apache/airflow/issues/11054
No, it is currently not supported -- at least for Airflow 2.0.0

Using XCom with a BashOperator: Pushing multi-line stdout output to other tasks?

According to the documentation, "If xcom_push is True, the last line written to stdout will also be pushed to an XCom when the bash command completes."
However, I need to return a multi-line string that results from a python script being executed from the terminal. I would like to subsequently use this string in an EmailOperator.
So my question is: is it possible to push more than the last line via xcom_push? Ideally, it would be arbitrarily long. I would really appreciate your help, thanks!
EDIT: I have gotten around this problem by using a PythonOperator and calling the script, but I'm still curious if it's possible to push multi-line data to XCom from a BashOperator
As clearly stated in the source code, only the last line of the BashOperator's stdout is pushed if xcom_push=True.
:param xcom_push: If xcom_push is True, the last line written to stdout
will also be pushed to an XCom when the bash command completes.
However, you could easily create a custom operator inheriting from the BashOperator and implement the xcom_push of the full output yourself.
See the plugins doc on how to build custom operators with Airflow plugins.
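A minimal sketch of such an operator; rather than subclassing BashOperator and overriding its internals (which differ across Airflow versions), this one runs the command directly and returns the full stdout, which Airflow pushes to XCom as the return value. The class name is illustrative:
import subprocess

from airflow.models import BaseOperator


class BashFullOutputOperator(BaseOperator):
    """Runs a shell command and pushes its entire stdout to XCom."""

    template_fields = ("bash_command",)

    def __init__(self, bash_command, **kwargs):
        super().__init__(**kwargs)
        self.bash_command = bash_command

    def execute(self, context):
        result = subprocess.run(
            self.bash_command,
            shell=True,
            check=True,
            capture_output=True,
            text=True,
        )
        # The return value of execute() is pushed to XCom under the
        # default 'return_value' key, so every line is available to a
        # downstream EmailOperator via
        # {{ ti.xcom_pull(task_ids='<task_id>') }}.
        return result.stdout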

Apache Airflow - BigQueryOperator: How to dynamically set destination_dataset_table partition

I need a BigQueryOperator task like the following one, in which I save the result of a query to a partitioned table. However, the "month_start" needs to be derived from the actual DAG execution_date. I couldn't find any documents or examples on how to read the execution_date in my DAG definition script (in Python). Looking forward to some help here.
FYR: I'm on Airflow 1.8.2
t1_invalid_geohash_by_traffic = BigQueryOperator(
    task_id='invalid_geohash_by_traffic',
    bql='SQL/dangerous-area/InvalidGeohashByTraffic.sql',
    params=params,
    destination_dataset_table='mydataset.mytable${}'.format(month_start),
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id=CONNECTION_ID,
    use_legacy_sql=False,
)
I think I found the answer. Just ran into this blog: https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
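For reference, the usual pattern is to let Jinja render the execution date straight into the partition decorator; a sketch reusing the names from the question, and assuming destination_dataset_table is a templated field on your BigQueryOperator version (worth double-checking on 1.8.2):
t1_invalid_geohash_by_traffic = BigQueryOperator(
    task_id='invalid_geohash_by_traffic',
    bql='SQL/dangerous-area/InvalidGeohashByTraffic.sql',
    params=params,
    # Renders the first day of the execution month, e.g. 20170301;
    # use {{ ds_nodash }} instead for a daily partition.
    destination_dataset_table="mydataset.mytable${{ execution_date.strftime('%Y%m01') }}",
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id=CONNECTION_ID,
    use_legacy_sql=False,
)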

Example of using a parameter in Airflow?

Very often, the files I am downloading have a date in the filename.
csat_surveys_2017_03_05.csv
03062017_roster.csv
My code deals with this individually:
Compare the dates in the processed file list (based on slicing the filename) with the expected dates which should exist (some date range up until the current date)
For each file I process, add the filename to a database table and only process new files which have not been added to that table
Can I (and should I) use the airflow schedule date to replace the need of having to code this logic? Every day, my task is scheduled to run. I take that scheduled date (minus 1 day, perhaps) and use that value as a parameter to pass as part of the filename to read (in pandas). If so, can I please see a clear example that I can use as a template?
Is that a better approach, and would that cover me if a file is missing or delayed for a few days (I would want the task to fail, then keep trying each day until it succeeds or until I notice it and can raise the issue to our clients)?
I'd say yes, using the execution_date is probably best practice.
To access it, you'll need a templated field. Some default operators have those already, or you might want to create your own operator, which will then look something like this:
In your DAG, you'd have the task as:
my_task = MyOperator(
    task_id='t1',
    filename='prefix_{{ ds }}_suffix',
)
ds is the airflow macro for accessing the execution_date parameter as a string representation of a date.
And your MyOperator would look like:
from airflow.models import BaseOperator

class MyOperator(BaseOperator):
    template_fields = ('filename',)

    def __init__(self, filename, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.filename = filename

    def execute(self, context):
        download_file(self.filename)
        do_other_stuff()
You can find more about how to parametrize tasks in the Macros section https://airflow.incubator.apache.org/code.html#macros
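If you'd rather not write a custom operator, the same idea works with a plain PythonOperator, since op_kwargs is a templated field on reasonably recent Airflow versions; a sketch where the function name, filename pattern, and dag reference are illustrative (on Airflow 2 the import is airflow.operators.python):
import pandas as pd
from airflow.operators.python_operator import PythonOperator


def process_file(filename):
    # filename arrives with the execution date already rendered, e.g.
    # "csat_surveys_2017-03-05.csv"; read_csv raises if the file is
    # missing, so the task fails and can retry on later runs.
    df = pd.read_csv(filename)
    # ... load into the database, record the filename as processed ...
    return len(df)


process_daily_file = PythonOperator(
    task_id="process_daily_file",
    python_callable=process_file,
    # op_kwargs values are templated; swap {{ ds }} for ds_nodash or
    # execution_date.strftime(...) if your filenames use another format.
    op_kwargs={"filename": "csat_surveys_{{ ds }}.csv"},
    dag=dag,
)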
