I have an airflow DAG which I can run with some parameters using:
airflow trigger_dag 'my_dag' --conf '{"key":"value"}'
then I can get the 'value' in my DAG like this:
context['dag_run'].conf.get('key')
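For illustration, that lookup can be exercised with plain-Python stubs (no Airflow required; the class and task names below are hypothetical, not part of Airflow's API):

```python
# Minimal sketch: a stub mimicking how a task callable reads values
# from context['dag_run'].conf. FakeDagRun and my_task are made up.
class FakeDagRun:
    def __init__(self, conf):
        self.conf = conf or {}

def my_task(**context):
    # Same lookup pattern as inside a real PythonOperator callable
    return context['dag_run'].conf.get('key')

context = {'dag_run': FakeDagRun({'key': 'value'})}
print(my_task(**context))  # value
```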
I would like to do the same using backfill:
airflow backfill 'my_dag' --conf '{"key":"value"}' -s 2019-04-15 -e 2019-04-16
Is it possible to get the value passed in --conf for a backfill?
I came upon this question while having the same issue, and although it's a few years later I thought this might help someone.
As the OP suspected, prior DAG executions affect whether a backfill will use the conf provided on the command line. This was raised as an issue and a fix was merged: https://github.com/apache/airflow/pull/22837
Yes, the backfill command also has a conf parameter.
From: https://airflow.apache.org/1.10.3/cli.html#backfill
airflow backfill [-h] [-t TASK_REGEX] [-s START_DATE] [-e END_DATE] [-m] [-l]
[-x] [-i] [-I] [-sd SUBDIR] [--pool POOL]
[--delay_on_limit DELAY_ON_LIMIT] [-dr] [-v] [-c CONF]
[--reset_dagruns] [--rerun_failed_tasks] [-B]
dag_id
-c, --conf
JSON string that gets pickled into the DagRun’s conf attribute
I have an Airflow deployment running in a Kubernetes cluster. I'm trying to use the CLI to backfill one of my DAGs by doing the following:
I open a shell to my scheduler node by running the following command: kubectl exec --stdin --tty airflow-worker-0 -- /bin/bash
I then execute the following command to initiate the backfill - airflow dags backfill -s 2021-08-06 -e 2021-08-31 my_dag
It then hangs on the below log entry indefinitely until I terminate the process:
[2022-05-31 13:04:25,682] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags
I then get an error similar to the below, complaining that a random DAG that I don't care about can't be found:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/dags/__pycache__/example_dag-37.pyc'
Is there any way to address this? I don't understand why the CLI has to fill up the DagBag given that I've already told it exactly what DAG I want to execute - why is it then looking for random DAGs in the pycache folder that don't exist?
In Airflow 2.0, you can get the status of tasks in a dag by running CLI command: airflow tasks states-for-dag-run. (See docs here: https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#state_repeat1)
What's the equivalent in Airflow 1.10.12? I can't seem to find it in the 1.10.12 docs.
There is no direct equivalent, as this is a new CLI command in Airflow 2.0.
In Airflow 1.10.12 you can do (docs):
airflow task_state [-h] [-sd SUBDIR] dag_id task_id execution_date
I'm using apache-airflow 1.10.10.
My use case is: I have 4 DAGs, each of which triggers a common DAG named "dag_common" with different conf parameters using a BashOperator after some work.
airflow trigger_dag -c '{"id":"1"}' dag_common
airflow trigger_dag -c '{"id":"2"}' dag_common
airflow trigger_dag -c '{"id":"3"}' dag_common
airflow trigger_dag -c '{"id":"4"}' dag_common
Inside these DAGs I have to wait for the triggered DAG to finish, how can I accomplish this?
Dag1 has to wait until dag_common with conf id=1 finishes.
Is there any way to find all dag runs with specific conf?
It looks like a use case for SubDAGs: Implement dag_common as a subDAG and use SubDagOperator() in those four DAGs to run dag_common.
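The "find all dag runs with a specific conf" part of the question can be sketched as plain filtering logic over stubbed run records (no Airflow needed; the record structure and helper name below are made up — in a real deployment you would query DagRun objects instead):

```python
# Sketch (stub data): filter runs of dag_common by their conf values.
# runs_with_conf is a hypothetical helper, not an Airflow API.
def runs_with_conf(runs, **wanted):
    """Return runs whose conf contains all wanted key/value pairs."""
    return [r for r in runs
            if all(r['conf'].get(k) == v for k, v in wanted.items())]

runs = [
    {'run_id': 'r1', 'conf': {'id': '1'}, 'state': 'success'},
    {'run_id': 'r2', 'conf': {'id': '2'}, 'state': 'running'},
    {'run_id': 'r3', 'conf': {'id': '1'}, 'state': 'running'},
]
matching = runs_with_conf(runs, id='1')
print([r['run_id'] for r in matching])  # ['r1', 'r3']
```

Dag1 could poll such a query (or use a sensor) until the run with its conf reaches a terminal state.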
I have an Apache Airflow DAG with tens of thousands of tasks, and after a run, say, a handful of them failed.
I fixed the bug that caused some tasks to fail and I would like to re-run ONLY FAILED TASKS.
This SO post suggests using the GUI to "clear" failed task:
How to restart a failed task on Airflow
This approach works if you have a handful of failed tasks.
I am wondering if we can bypass the GUI and do it programmatically, through the command line, with something like:
airflow_clear_failed_tasks dag_id execution_date
Use the following command to clear only failed tasks:
airflow clear [-s START_DATE] [-e END_DATE] --only_failed dag_id
Documentation: https://airflow.readthedocs.io/en/stable/cli.html#clear
The command to clear only failed tasks was updated. It is now (Airflow 2.0 as of March 2021):
airflow tasks clear [-s START_DATE] [-e END_DATE] --only-failed dag_id
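If you need to script this, a thin wrapper that assembles the 2.x command line is straightforward. This is only a sketch: the helper name is made up, and the actual subprocess call is left as a comment so the snippet runs without Airflow installed.

```python
# Sketch: build the Airflow 2.x "tasks clear" argument list for a
# given DAG and date range. build_clear_cmd is a hypothetical helper.
def build_clear_cmd(dag_id, start_date, end_date, only_failed=True):
    cmd = ['airflow', 'tasks', 'clear',
           '-s', start_date, '-e', end_date,
           '-y']  # -y skips the interactive confirmation prompt
    if only_failed:
        cmd.append('--only-failed')
    cmd.append(dag_id)
    return cmd

cmd = build_clear_cmd('my_dag', '2021-03-01', '2021-03-02')
print(' '.join(cmd))
# then e.g. subprocess.run(cmd, check=True) where Airflow is available
```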
I have some failed DAG runs, let's say from 1st-Feb to 20th-Feb. From that date onward, all of them succeeded.
I tried to use the CLI (instead of doing it twenty times with the Web UI):
airflow clear -f -t * my_dags.my_dag_id
But I have a weird error:
airflow: error: unrecognized arguments: airflow-webserver.pid airflow.cfg airflow_variables.json my_dags.my_dag_id
EDIT 1:
Like @tobi6 explained, the * was indeed causing trouble.
Knowing that, I tried this command instead:
airflow clear -u -d -f -t ".*" my_dags.my_dag_id
But it only returns failed task instances (the -f flag). The -d and -u flags don't seem to work, because task instances downstream and upstream of the failed ones are ignored (not returned).
EDIT 2:
As @tobi6 suggested, the -s and -e flags select all DAG runs within a date range. Here is the command:
airflow clear -s "2018-04-01 00:00:00" -e "2018-04-01 00:00:00" my_dags.my_dag_id
However, adding the -f flag to the command above only returns failed task instances. Is it possible to select all failed task instances of all failed DAG runs within a date range?
If you use an asterisk * in a Linux shell, it is automatically expanded against the contents of the current directory.
Meaning the shell replaces the asterisk with all files in the current working directory and then executes your command.
Quoting the pattern avoids the automatic expansion:
airflow clear -f -t "*" my_dags.my_dag_id
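For reference, the expansion itself can be demonstrated without Airflow at all (a small sketch in any POSIX shell, using a throwaway directory):

```shell
# Demo of shell glob expansion (no Airflow needed).
demo=$(mktemp -d)
cd "$demo"
touch airflow.cfg airflow_variables.json
echo *      # the shell expands * to the file names in this directory
echo "*"    # quoted: prints a literal *
```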
One solution I've found so far is executing SQL directly (MySQL in my case):
update task_instance t left join dag_run d on d.dag_id = t.dag_id and d.execution_date = t.execution_date
set t.state=null,
d.state='running'
where t.dag_id = '<your_dag_id>'
and t.execution_date > '2020-08-07 23:00:00'
and d.state='failed';
It will clear all task states on failed dag runs, as if the 'Clear' button were pressed for the entire DAG run in the web UI.
In Airflow 2.2.4, the airflow clear command is deprecated.
You could now run:
airflow tasks clear -s <your_start_date> -e <end_date> <dag_id>