Airflow find DAG Runs with specific conf

I'm using apache-airflow 1.10.10.
My use case is: I have 4 DAGs, and after some work each of them triggers a common DAG named "dag_common" with different conf parameters using a BashOperator.
airflow trigger_dag -c {"id":"1"} dag_common
airflow trigger_dag -c {"id":"2"} dag_common
airflow trigger_dag -c {"id":"3"} dag_common
airflow trigger_dag -c {"id":"4"} dag_common
Inside these DAGs I have to wait for the triggered DAG to finish. How can I accomplish this?
Dag1 has to wait until dag_common with conf id=1 finishes.
Is there any way to find all dag runs with specific conf?

This looks like a use case for SubDAGs: implement dag_common as a SubDAG and use SubDagOperator() in those four DAGs to run it, as sketched below.
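A minimal sketch of that SubDagOperator pattern, assuming Airflow 1.10 import paths; "dag1", the task ids and the echo commands below are hypothetical placeholders:

# Sketch of the SubDagOperator pattern (Airflow 1.10 import paths assumed).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.subdag_operator import SubDagOperator


def dag_common_subdag(parent_dag_id, child_task_id, conf_id, default_args):
    # Build dag_common's work as a SubDAG, parameterised by conf_id instead of dag_run.conf.
    subdag = DAG(
        dag_id='{}.{}'.format(parent_dag_id, child_task_id),
        default_args=default_args,
        schedule_interval=None,
    )
    BashOperator(
        task_id='do_common_work',
        bash_command='echo "processing id={}"'.format(conf_id),
        dag=subdag,
    )
    return subdag


default_args = {'start_date': datetime(2020, 1, 1)}

with DAG('dag1', default_args=default_args, schedule_interval=None) as dag1:
    some_work = BashOperator(task_id='some_work', bash_command='echo work')

    # Runs dag_common's logic inline, so dag1 waits for it to finish
    # before any downstream tasks run.
    run_common = SubDagOperator(
        task_id='dag_common',
        subdag=dag_common_subdag('dag1', 'dag_common', conf_id='1',
                                 default_args=default_args),
    )

    some_work >> run_common

If you would rather keep triggering dag_common externally, airflow.models.DagRun.find(dag_id='dag_common') returns the existing runs of that DAG, and each returned DagRun exposes a .conf dict you can filter on (for example inside a custom sensor's poke method).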

Related

Airflow - Can't backfill via CLI

I have an Airflow deployment running in a Kubernetes cluster. I'm trying to use the CLI to backfill one of my DAGs by doing the following:
I open a shell to my scheduler node by running the following command: kubectl exec --stdin --tty airflow-worker-0 -- /bin/bash
I then execute the following command to initiate the backfill - airflow dags backfill -s 2021-08-06 -e 2021-08-31 my_dag
It then hangs on the below log entry indefinitely until I terminate the process:
[2022-05-31 13:04:25,682] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags
I then get an error similar to the below, complaining that a random DAG that I don't care about can't be found:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/dags/__pycache__/example_dag-37.pyc'
Is there any way to address this? I don't understand why the CLI has to fill up the DagBag given that I've already told it exactly which DAG I want to execute - why is it then looking for random DAGs in the __pycache__ folder that don't exist?

How to read value passed to the airflow backfill --conf {"key": "value"}

I have an airflow DAG which I can run with some parameters using:
airflow trigger_dag 'my_dag' --conf '{"key":"value"}'
then I can get the 'value' in my DAG like this:
context['dag_run'].conf.get('key')
I would like to do the same using backfill:
airflow backfill 'my_dag' --conf '{"key":"value"}' -s 2019-04-15 -e 2019-04-16
Is it possible to get passed value in --conf for backfill?
I came upon this question while having the same issue, and although it's a few years later I thought this might help someone.
As the OP suspected, prior DAG executions affect whether a backfill will use the conf provided on the command line. This was recently raised in this issue and a fix was merged: https://github.com/apache/airflow/pull/22837
Yes, the backfill command also has a conf parameter.
From: https://airflow.apache.org/1.10.3/cli.html#backfill
airflow backfill [-h] [-t TASK_REGEX] [-s START_DATE] [-e END_DATE] [-m] [-l]
[-x] [-i] [-I] [-sd SUBDIR] [--pool POOL]
[--delay_on_limit DELAY_ON_LIMIT] [-dr] [-v] [-c CONF]
[--reset_dagruns] [--rerun_failed_tasks] [-B]
dag_id
-c, --conf
JSON string that gets pickled into the DagRun’s conf attribute
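For completeness, a small sketch of how a task can read that conf at runtime, in the Airflow 1.10 style used above; the DAG id, task id and key name are illustrative:

# Sketch: reading the DagRun conf inside a task (Airflow 1.10 style).
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def print_conf_value(**context):
    # conf passed via `trigger_dag -c` or `backfill -c` lands on the DagRun.
    dag_run = context['dag_run']
    value = dag_run.conf.get('key') if dag_run and dag_run.conf else None
    print('key = {}'.format(value))


with DAG('my_dag',
         start_date=datetime(2019, 4, 15),
         schedule_interval='@daily') as dag:
    read_conf = PythonOperator(
        task_id='read_conf',
        python_callable=print_conf_value,
        provide_context=True,  # required in 1.10 to receive the context kwargs
    )

Keep in mind the caveat above: DAG runs that already exist for the backfilled dates keep their original conf, so the -c value may only reach newly created runs on versions without the linked fix.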

How to re-run all failed tasks in Apache Airflow?

I have an Apache Airflow DAG with tens of thousands of tasks, and after a run, say, a handful of them failed.
I fixed the bug that caused some tasks to fail and I would like to re-run ONLY FAILED TASKS.
This SO post suggests using the GUI to "clear" failed tasks:
How to restart a failed task on Airflow
This approach works if you have a handful number of failed tasks.
I am wondering if we can bypass the GUI and do it programmatically, through the command line, with something like:
airflow_clear_failed_tasks dag_id execution_date
Use the following command to clear only failed tasks:
airflow clear [-s START_DATE] [-e END_DATE] --only_failed dag_id
Documentation: https://airflow.readthedocs.io/en/stable/cli.html#clear
The command to clear only failed tasks was updated. It is now (Airflow 2.0 as of March 2021):
airflow tasks clear [-s START_DATE] [-e END_DATE] --only-failed dag_id
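If you need to do this programmatically rather than through the CLI, here is a hedged sketch using Airflow 2's stable REST API; it assumes the API is reachable at http://localhost:8080, that the basic-auth backend is enabled, and that the dag id, dates and credentials below are placeholders:

# Sketch: clear only the failed task instances of a DAG via the stable REST API (Airflow 2).
import requests

AIRFLOW_API = 'http://localhost:8080/api/v1'  # adjust to your webserver
DAG_ID = 'my_dag'                              # illustrative dag id

payload = {
    'dry_run': False,      # set to True first to preview what would be cleared
    'only_failed': True,   # same effect as --only-failed on the CLI
    'start_date': '2021-01-01T00:00:00Z',
    'end_date': '2021-01-31T23:59:59Z',
}

resp = requests.post(
    '{}/dags/{}/clearTaskInstances'.format(AIRFLOW_API, DAG_ID),
    json=payload,
    auth=('admin', 'admin'),  # replace with real credentials
)
resp.raise_for_status()
print(resp.json())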

Airflow run specifying incorrect -sd or -SUBDIRECTORY

I have an Airflow process running every day with many DAGs. Today, all of a sudden, none of the DAGs can be run because when Airflow calls airflow run it misspecifies the -sd directory in which to find the DAG.
Here's the example:
[2018-09-26 15:18:10,406] {base_task_runner.py:115} INFO - Running: ['bash', '-c', 'airflow run daily_execution dag1 2018-09-26T13:17:50.511269 --job_id 1 --raw -sd DAGS_FOLDER/newfoldernewfolder/dags/all_dags.py']
As you can see, right after -sd the subdirectory repeats newfolder twice when it should only state DAGS_FOLDER/newfolder/dags/all_dags.py.
I tried running the DAG with the same files that were running two days ago (when everything was correct) but I get the same error. I'm guessing that something has changed in the Airflow configuration, but I'm not aware of any changes in airflow.cfg. I've only been managing the UI, and airflow run gets called automatically once I turn the DAG on.
Does anybody have an idea where airflow run might get this directory and how I can update it?

Airflow keeps showing example dags even after removing them from the configuration

Airflow example dags remain in the UI even after I have set load_examples = False in the config file.
The system reports that the dags are not present in the dag folder, but they remain in the UI because the scheduler has marked them as active in the metadata database.
I know one way to remove them would be to directly delete these rows in the database, but of course this is not ideal. How should I proceed to remove these dags from the UI?
There is currently no way of stopping a deleted DAG from being displayed on the UI except manually deleting the corresponding rows in the DB. The only other way is to restart the server after an initdb.
Airflow 1.10+:
Edit airflow.cfg and set load_examples = False
For each example dag, run the command airflow delete_dag example_dag_to_delete (a scripted version is sketched below).
This avoids resetting the entire airflow db.
(Since Airflow 1.10 there is a command to delete a dag from the database; see this answer.)
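If there are many example dags to clean up, a scripted version of that step might look like the sketch below; it assumes Airflow 1.10, that every example dag_id starts with "example_", and simply loops the delete_dag CLI command from the step above:

# Sketch: run `airflow delete_dag` for every example DAG still registered
# in the metadata DB (Airflow 1.10 assumed).
import subprocess

from airflow import settings
from airflow.models import DagModel

session = settings.Session()
example_dag_ids = [row.dag_id
                   for row in session.query(DagModel.dag_id)
                                     .filter(DagModel.dag_id.like('example_%'))]
session.close()

for dag_id in example_dag_ids:
    print('Deleting {}'.format(dag_id))
    # -y answers the confirmation prompt so the loop does not block
    subprocess.check_call(['airflow', 'delete_dag', '-y', dag_id])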
Assuming you have installed airflow through Anaconda; otherwise, look for airflow in your Python site-packages folder and follow the same steps.
After you follow the instructions at https://stackoverflow.com/a/43414326/1823570:
Go to $AIRFLOW_HOME/lib/python2.7/site-packages/airflow directory
Remove the directory named example_dags or just rename it to revert back
Restart your webserver
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p [port-number]
airflow resetdb also works here, though note that it resets the entire metadata database.
What I do is create multiple shell scripts for various purposes, like starting the webserver, starting the scheduler, refreshing dags, etc. I only need to run the appropriate script to do what I want. Here is the list:
(venv) (base) [pchoix@hadoop02 airflow]$ cat refresh_airflow_dags.sh
#!/bin/bash
cd ~
source venv/bin/activate
airflow resetdb
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow_scheduler.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow_webserver.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
Don't forget to chmod +x those scripts.
I hope you find this helpful.
