I have an Airflow deployment running in a Kubernetes cluster. I'm trying to use the CLI to backfill one of my DAGs by doing the following:
I open a shell into the airflow-worker-0 pod by running the following command: kubectl exec --stdin --tty airflow-worker-0 -- /bin/bash
I then execute the following command to initiate the backfill - airflow dags backfill -s 2021-08-06 -e 2021-08-31 my_dag
It then hangs indefinitely on the log entry below until I terminate the process:
[2022-05-31 13:04:25,682] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags
I then get an error similar to the below, complaining that a random DAG that I don't care about can't be found:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/dags/__pycache__/example_dag-37.pyc'
Is there any way to address this? I don't understand why the CLI has to fill up the DagBag given that I've already told it exactly what DAG I want to execute - why is it then looking for random DAGs in the pycache folder that don't exist?
In Airflow 2.0, you can get the status of tasks in a DAG by running the CLI command airflow tasks states-for-dag-run. (See docs here: https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#state_repeat1)
What's the equivalent in Airflow 1.10.12? I can't seem to find it in the 1.10.12 docs.
There is no direct equivalent as this is a new CLI command in Airflow 2.0.
In Airflow 1.10.12 you can do (docs):
airflow task_state [-h] [-sd SUBDIR] dag_id task_id execution_date
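For example (my_dag and my_task are placeholder names), this prints the state of a single task instance for a given execution date:
airflow task_state my_dag my_task 2021-08-06
Note that, unlike states-for-dag-run, this reports one task at a time, so you would have to loop over the task ids to cover a whole DAG run.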
I'm using apache-airflow 1.10.10.
My use case is: I have 4 DAGs, and all of them trigger a common DAG named "dag_common" with different conf parameters using a BashOperator after some work.
airflow trigger_dag -c '{"id":"1"}' dag_common
airflow trigger_dag -c '{"id":"2"}' dag_common
airflow trigger_dag -c '{"id":"3"}' dag_common
airflow trigger_dag -c '{"id":"4"}' dag_common
Inside these DAGs I have to wait for the triggered DAG to finish. How can I accomplish this?
Dag1 has to wait until dag_common with conf id=1 finishes.
Is there any way to find all DAG runs with a specific conf?
It looks like a use case for SubDAGs: Implement dag_common as a subDAG and use SubDagOperator() in those four DAGs to run dag_common.
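A minimal sketch, assuming Airflow 1.10.x imports and that dag_common can be refactored into a factory function; build_dag_common, dag1 and the conf id here are placeholders, not names from the question:

from datetime import datetime
from airflow import DAG
from airflow.operators.subdag_operator import SubDagOperator

default_args = {'owner': 'airflow', 'start_date': datetime(2021, 1, 1)}

def build_dag_common(parent_dag_id, child_dag_id, schedule_interval, conf_id):
    # SubDAG ids must follow the "<parent_dag_id>.<child_dag_id>" convention,
    # and the subDAG should share the parent's schedule_interval.
    subdag = DAG(
        dag_id='{}.{}'.format(parent_dag_id, child_dag_id),
        default_args=default_args,
        schedule_interval=schedule_interval,
    )
    # ... add the tasks of dag_common here, parameterised by conf_id ...
    return subdag

with DAG('dag1', default_args=default_args, schedule_interval='@daily') as dag1:
    run_common = SubDagOperator(
        task_id='dag_common',
        subdag=build_dag_common('dag1', 'dag_common', '@daily', conf_id='1'),
    )

Because the subDAG's tasks run as part of dag1's own run, dag1 only moves past the SubDagOperator once the embedded dag_common work has finished, which gives the "wait until dag_common with id=1 finishes" behaviour; each of the four parent DAGs gets its own copy with its own conf id.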
I have an Apache Airflow DAG with tens of thousands of tasks, and after a run, say, a handful of them failed.
I fixed the bug that caused some tasks to fail and I would like to re-run ONLY FAILED TASKS.
This SO post suggests using the GUI to "clear" the failed tasks:
How to restart a failed task on Airflow
This approach works if you only have a handful of failed tasks.
I am wondering if we can bypass the GUI and do it programmatically, through the command line, with something like:
airflow_clear_failed_tasks dag_id execution_date
Use the following command to clear only failed tasks:
airflow clear [-s START_DATE] [-e END_DATE] --only_failed dag_id
Documentation: https://airflow.readthedocs.io/en/stable/cli.html#clear
The command to clear only failed tasks was updated. It is now (Airflow 2.0 as of March 2021):
airflow tasks clear [-s START_DATE] [-e END_DATE] --only-failed dag_id
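For example (the DAG name and dates here are illustrative), to clear every failed task of my_dag between two dates, skipping the confirmation prompt with -y:
airflow tasks clear -s 2021-01-01 -e 2021-01-31 --only-failed -y my_dag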
I use Airflow to manage ETL task execution and scheduling. A DAG has been created and it works fine. But is it possible to pass parameters when manually triggering the DAG via the CLI?
For example:
My DAG runs every day at 01:30 and processes data for yesterday (time range from 01:30 yesterday to 01:30 today). There might be issues with the data source, and then I need to re-process that data (manually specifying the time range).
So can I create an Airflow DAG where, when it is scheduled, the default time range is from 01:30 yesterday to 01:30 today, and then, if anything is wrong with the data source, manually trigger the DAG and pass the time range as parameters?
As far as I know, airflow test has a -tp option that can pass params to the task, but that is only for testing a specific task, and airflow trigger_dag doesn't have a -tp option. So is there any way to trigger_dag and pass parameters to the DAG, so that the operators can read these parameters?
Thanks!
You can pass parameters from the CLI using --conf '{"key":"value"}' and then use them in the DAG file as "{{ dag_run.conf["key"] }}" in a templated field.
CLI:
airflow trigger_dag 'example_dag_conf' -r 'run_id' --conf '{"message":"value"}'
DAG File:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

args = {
    'start_date': datetime.utcnow(),
    'owner': 'airflow',
}

dag = DAG(
    dag_id='example_dag_conf',
    default_args=args,
    schedule_interval=None,
)

def run_this_func(ds, **kwargs):
    # The value passed with --conf is available on the DagRun object.
    print("Remotely received value of {} for key=message".format(
        kwargs['dag_run'].conf['message']))

run_this = PythonOperator(
    task_id='run_this',
    provide_context=True,
    python_callable=run_this_func,
    dag=dag,
)

# You can also access the DagRun object in templates
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: '
                 '{{ dag_run.conf["message"] if dag_run else "" }}"',
    dag=dag,
)
This should work, as per the airflow documentation: https://airflow.apache.org/cli.html#trigger_dag
airflow trigger_dag -c '{"key1":1, "key2":2}' dag_id
Make sure the value of -c is a valid JSON string; the double quotes wrapping the keys are necessary here.
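A small sketch of reading those keys back inside the DAG file, assuming a dag object like the one in the previous answer; the task_id and echo format are illustrative:

from airflow.operators.bash_operator import BashOperator

print_conf = BashOperator(
    task_id='print_conf',
    # dag_run.conf holds the JSON passed with -c / --conf
    bash_command="echo key1={{ dag_run.conf['key1'] }} key2={{ dag_run.conf['key2'] }}",
    dag=dag,  # assumes a surrounding DAG definition
)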
Suppose the value passed for a key in --conf is itself a list of strings:
key: ['param1=somevalue1', 'param2=somevalue2']
There are two ways to read it back.
First way:
"{{ dag_run.conf["key"] }}"
This will render the passed value as the string "['param1=somevalue1', 'param2=somevalue2']".
Second way:
def get_parameters(**kwargs):
    # Requires provide_context=True (Airflow 1.x) so dag_run is passed in kwargs.
    dag_run = kwargs.get('dag_run')
    parameters = dag_run.conf['key']
    return parameters
In this scenario, a list of strings is being passed and will be rendered as a list ['param1=somevalue1', 'param2=somevalue2']
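A minimal way to wire that callable up (Airflow 1.10.x imports; the task_id and the dag object are placeholders):

from airflow.operators.python_operator import PythonOperator

get_params_task = PythonOperator(
    task_id='get_parameters',
    python_callable=get_parameters,
    provide_context=True,  # makes dag_run available in kwargs on Airflow 1.x
    dag=dag,
)

The returned list is also pushed to XCom, so downstream tasks can pull it with xcom_pull if needed.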