I have a daily DAG that contains a subDAG. The subDAG has five tasks, T1 through T5, that must run in order (e.g. T1 >> T2 >> T3 >> T4 >> T5).
The DAG runs successfully for a few days, but then I discover a bug in T4. I fix the bug and want to re-run just T4 and T5 for all previous days. It's important NOT to re-run T1-T3, because those steps take much longer than T4-T5.
What I've tried, without success:
1. Select T4, clear downstream+recursive - nothing happens. The DAG tree view shows the subDAG as "success" even though T4 and T5 within it are cleared.
2. Select T4, clear downstream+recursive, select the subDAG, clear just that task - this re-runs the entire subDAG (T1-T5) even though T1-T3 were marked as success.
3. Select T4, clear downstream+recursive, select the subDAG, click run - same as #2; it re-runs the entire subDAG.
4. Select T4, clear downstream+recursive, manually set the subDAG to the "running" state - nothing happens. The tree view shows the subDAG in the "running" state, but no tasks actually get run.
This only seems to be a problem when trying to re-run part of a subDAG. With a bunch of tasks in a regular DAG, selecting a task in the middle and clearing downstream+recursive normally re-runs the DAG from that point.
Any suggestions would be appreciated.
You can restart the failed tasks inside a subDAG. This is how:
Zoom into the subDAG, clear the status of failed tasks.
Go back to the main DAG and select the subDAG.
Uncheck Recursive and/or Downstream.
Clear status of subDAG.
I use this command to run a subDAG successfully; hope it helps someone:
airflow backfill dag_name.subdag_name -s 2018-05-31 -e 2018-05-31 --reset_dagruns
I have a DAG with one DataflowTemplateOperator that can handle different JSON files. When I trigger the DAG I pass some parameters via {{ dag_run.conf['param1'] }} and it works fine.
The issue I have is trying to rename the task_id based on param1.
i.e. task_id="df_operator_read_object_json_file_{{dag_run.conf['param1']}}",
which complains that only alphanumeric characters are allowed,
or
task_id="df_operator_read_object_json_file_{}".format(dag_run.conf['param1']),
which does not recognise dag_run, in addition to the alphanumeric issue.
The whole idea behind this is that when I look at the Dataflow jobs console and a job has failed, I know who the offender is based on param1. Dataflow job names are based on task_id, like this:
df-operator-read-object-json-file-8b9eecec
and what I need is this:
df-operator-read-object-param1-json-file-8b9eecec
Any ideas if this is possible?
There is no need to generate a new operator per file.
DataflowTemplatedJobStartOperator has a job_name parameter which is also templated, so it can be used with Jinja.
I didn't test it but this should work:
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

op = DataflowTemplatedJobStartOperator(
    task_id="df_operator_read_object_json_file",
    job_name="df_operator_read_object_json_file_{{ dag_run.conf['param1'] }}",
    template='gs://dataflow-templates/your_template',
    location='europe-west3',
)
DockerOperator has a parameter xcom_push which, when set, pushes the output of the Docker container to XCom:
t1 = DockerOperator(task_id='run-hello-world-container',
image='hello-world',
xcom_push=True, xcom_all=True,
dag=dag)
In the admin interface under Xcom, I can see these values with key return_value. However, how can I access them in the DAG?
If I try:
t1_email_output = EmailOperator(task_id='t1_email_output',
to='user@example.com',
subject='Airflow sent you an email!',
html_content={{ ti.xcom_pull(task_ids='return_value') }},
dag=dag)
I get Broken DAG: [PATH] name 'ti' is not defined.
If I try:
t1_email_output = EmailOperator(task_id='t1_email_output',
to='user@example.com',
subject='Airflow sent you an email!',
html_content=t1.xcom_pull(task_ids='return_value'),
dag=dag)
I get Broken DAG: [PATH] xcom_pull() missing 1 required positional argument: 'context'.
You need to pass the task id from which you are pulling the XCom, not the variable name.
In your example it would be:
{{ ti.xcom_pull('run-hello-world-container') }}
Also, in the second snippet it should be "ti" instead of "t1":
html_content=ti.xcom_pull('run-hello-world-container'),
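Putting both corrections together, a minimal, untested sketch of the EmailOperator could look like this (the task id 'run-hello-world-container' comes from the DockerOperator above, and the Jinja expression is kept inside a string so it is only rendered when the task runs):

t1_email_output = EmailOperator(task_id='t1_email_output',
                                to='user@example.com',
                                subject='Airflow sent you an email!',
                                # html_content is a templated field, so the Jinja
                                # expression below is rendered at run time
                                html_content="{{ ti.xcom_pull('run-hello-world-container') }}",
                                dag=dag)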
I found the problem - turns out I was missing a quote and my parameter was also wrong:
t1_email_output = EmailOperator(task_id='t1_email_output',
to='user@example.com',
subject='Airflow sent you an email!',
html_content="{{ ti.xcom_pull(key='return_value') }}",
dag=dag)
This sends an email with the Docker container's output, as I expect.
I think what is happening is that the {{ }} syntax gets processed as a Jinja template by Airflow when the DAG is run, but not when it is loaded. So if I don't put quotes around it, Airflow hits Python exceptions when it tries to detect and load the DAG, because the template hasn't been rendered yet. With the quotes added, the templated expression is treated as a string and ignored by the Python interpreter when the DAG is loaded. When the EmailOperator is actually triggered during a DAG run, the template is rendered into actual references to the relevant data.
According to the documentation, "If xcom_push is True, the last line written to stdout will also be pushed to an XCom when the bash command completes."
However, I need to return a multi-line string produced by a Python script executed from the terminal. I would like to subsequently use this string in an EmailOperator.
So my question is: is it possible to push more than the last line via xcom_push? Ideally, it would be arbitrarily long. I would really appreciate your help, thanks!
EDIT: I have gotten around this problem by using a PythonOperator and calling the script, but I'm still curious if it's possible to push multi-line data to XCom from a BashOperator
As clearly stated in the source code, only the last line written to stdout by the BashOperator is pushed if xcom_push=True:
:param xcom_push: If xcom_push is True, the last line written to stdout
will also be pushed to an XCom when the bash command completes.
However, you could easily create a custom operator inheriting from BashOperator and implement the xcom_push behaviour you need (for example, pushing the full output).
See the plugins doc on how to build custom operators with Airflow plugins.
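As an illustration, here is a minimal, untested sketch of such a custom operator. The class name MultiLineBashOperator is made up, and the sketch bypasses BashOperator's own output handling by running the command itself with subprocess, so the entire stdout, not just the last line, is returned and pushed to XCom:

import subprocess

from airflow.operators.bash_operator import BashOperator


class MultiLineBashOperator(BashOperator):
    """Hypothetical operator that pushes the full stdout of the bash command to XCom."""

    def execute(self, context):
        # Run the command ourselves instead of delegating to BashOperator.execute(),
        # which only keeps the last line of stdout.
        result = subprocess.run(
            self.bash_command,
            shell=True,
            capture_output=True,
            text=True,
            check=True,  # raise if the command exits with a non-zero status
        )
        # The value returned from execute() is pushed to XCom under the key
        # 'return_value' (assuming XCom pushing is enabled for the task).
        return result.stdout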
Is there any way to set a variable in the Airflow UI to get today.date() or something similar to {{ ds }} in the DAG code?
I want the flexibility to set a hard-coded date in a variable, without changing the DAG code, for some use cases.
I am getting today's date in the DAG code right now:
today = datetime.today()
but wanted to get it like this:
today = models.Variable.get('todayVar')
This is a duplicate of the Stack Overflow post:
Airflow - Get start time of dag run
You can achieve what you want with:
{{ dag_run.start_date }}
In Airflow, the date you mean is also called the 'run_date'.
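To get the flexibility the question asks for (a hard-coded override without touching the DAG code), one minimal sketch, assuming a Variable named todayVar that may or may not be set in the UI, is to fall back to the templated date; note that the fallback string is only rendered when passed to a templated operator field:

from airflow.models import Variable

# Use the Variable if it has been set in the Airflow UI, otherwise fall back
# to the templated execution date (rendered at run time in templated fields).
override = Variable.get('todayVar', default_var=None)
run_date = override if override else '{{ ds }}'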
I need a BigQueryOperator task like the following one, in which I need to save the result of a query to a partitioned table. However, "month_start" needs to be derived from the actual DAG execution_date. I couldn't find any documents or examples on how to read the execution_date in my DAG definition script (in Python). Looking forward to some help here.
FYR: I'm on Airflow 1.8.2.
t1_invalid_geohash_by_traffic = BigQueryOperator(
    task_id='invalid_geohash_by_traffic',
    bql='SQL/dangerous-area/InvalidGeohashByTraffic.sql',
    params=params,
    destination_dataset_table='mydataset.mytable${}'.format(month_start),
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id=CONNECTION_ID,
    use_legacy_sql=False
)
I think I found the answer. Just ran into this blog: https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
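For reference, here is a minimal, untested sketch of what that approach boils down to, assuming destination_dataset_table is in the operator's template_fields in your Airflow version (if it is not, the same macro can be applied inside the SQL file instead). The partition suffix is then derived from the execution date at run time, with no month_start computed in the DAG file:

t1_invalid_geohash_by_traffic = BigQueryOperator(
    task_id='invalid_geohash_by_traffic',
    bql='SQL/dangerous-area/InvalidGeohashByTraffic.sql',
    params=params,
    # Render the month partition from the execution date at run time,
    # e.g. mydataset.mytable$20180501 for a run whose execution_date is in May 2018.
    destination_dataset_table="mydataset.mytable${{ execution_date.strftime('%Y%m01') }}",
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id=CONNECTION_ID,
    use_legacy_sql=False,
)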