Airflow run specifying incorrect -sd or -SUBDIRECTORY - airflow

I have an Airflow process running every day with many DAGs. Today, all of a sudden, none of the DAGs can run because when Airflow calls airflow run it misspecifies the -sd directory used to find the DAG.
Here's the example:
[2018-09-26 15:18:10,406] {base_task_runner.py:115} INFO - Running: ['bash', '-c', 'airflow run daily_execution dag1 2018-09-26T13:17:50.511269 --job_id 1 --raw -sd DAGS_FOLDER/newfoldernewfolder/dags/all_dags.py']
As you can see, right after -sd the subdirectory repeats newfolder twice, when it should only state DAGS_FOLDER/newfolder/dags/all_dags.py.
I tried running the DAG with the same files that were working two days ago (when everything was correct), but I get the same error. I'm guessing something has changed in the Airflow configuration, but I'm not aware of any changes in airflow.cfg. I've only been managing the UI, and airflow run gets called automatically once I turn the DAG on.
Does anybody have an idea where airflow run might get this directory from, and how I can update it?

Related

Airflow dags being deleted

I have Airflow installed on a Kubernetes cluster. I've created some DAGs, yet any time I restart the webserver, for example with: cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
the server deletes all the DAGs I created. How can I keep my DAGs across webserver restarts?

Airflow - Can't backfill via CLI

I have an Airflow deployment running in a Kubernetes cluster. I'm trying to use the CLI to backfill one of my DAGs by doing the following:
I open a shell to my scheduler node by running the following command: kubectl exec --stdin --tty airflow-worker-0 -- /bin/bash
I then execute the following command to initiate the backfill - airflow dags backfill -s 2021-08-06 -e 2021-08-31 my_dag
It then hangs on the below log entry indefinitely until I terminate the process:
[2022-05-31 13:04:25,682] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags
I then get an error similar to the below, complaining that a random DAG that I don't care about can't be found:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/dags/__pycache__/example_dag-37.pyc'
Is there any way to address this? I don't understand why the CLI has to fill up the DagBag given that I've already told it exactly what DAG I want to execute - why is it then looking for random DAGs in the pycache folder that don't exist?

How to run an existing shell script using airflow?

I want to run some existing bash scripts using Airflow without modifying the code in the scripts themselves. Is that possible without copying the shell commands from the script into a task?
Not entirely sure if I understood your question, but you can load your shell commands into Variables through the Admin >> Variables menu as a JSON file. Then, in your DAG, read the variable and pass it as a parameter to the BashOperator (a minimal sketch follows the links below).
Airflow variables in more detail:
https://www.applydatascience.com/airflow/airflow-variables/
Example of variables file:
https://github.com/tuanavu/airflow-tutorial/blob/v0.7/examples/intro-example/dags/config/example_variables.json
How to read the variable:
https://github.com/tuanavu/airflow-tutorial/blob/v0.7/examples/intro-example/dags/example_variables.py
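For illustration, a minimal sketch of that pattern, assuming Airflow 2.x and a Variable named script_path created via Admin >> Variables; the variable key, DAG id, and path below are placeholders, not values from the links above.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

# "script_path" is a hypothetical Variable key, e.g. "/home/batcher/test.sh"
script_path = Variable.get("script_path")

with DAG(
    dag_id="run_existing_shell_script",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_script = BashOperator(
        task_id="run_script",
        # Trailing space keeps Airflow from treating the .sh path itself as a Jinja template file
        bash_command=f"{script_path} ",
    )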
I hope this post helps you.
As long as the shell script is on the same machine the Airflow worker runs on, you can just call it with the BashOperator, like the following:
from airflow.operators.bash import BashOperator

t2 = BashOperator(
    task_id='bash_example',
    # Just call the script; the trailing space keeps Airflow from trying
    # to render the .sh file itself as a Jinja template
    bash_command="/home/batcher/test.sh ",
    dag=dag)
You have to "link" the local folder where your shell script lives with the worker, which means you need to add a volume in the worker section of your docker-compose file.
So I added a volumes entry under the worker settings, and the worker now sees this folder from the local machine (a task that calls the mounted script is sketched after the snippet):
airflow-worker:
  <<: *airflow-common
  command: celery worker
  healthcheck:
    test:
      - "CMD-SHELL"
      - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
    interval: 10s
    timeout: 10s
    retries: 5
  restart: always
  volumes:
    - /LOCAL_MACHINE_FOLDER/WHERE_SHELL_SCRIPT_IS:/folder_in_root_folder_of_worker
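Once the volume is mounted, a task can call the script through its container-side path. A minimal sketch, assuming the mount above and a hypothetical script name:
from airflow.operators.bash import BashOperator

run_mounted_script = BashOperator(
    task_id="run_mounted_script",
    # Path as the worker container sees it, per the volumes entry above;
    # "your_script.sh" is a placeholder name
    bash_command="/folder_in_root_folder_of_worker/your_script.sh ",
    dag=dag,  # assumes an existing DAG object named `dag`
)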

How to re-run all failed tasks in Apache Airflow?

I have an Apache Airflow DAG with tens of thousands of tasks, and after a run, say, a handful of them failed.
I fixed the bug that caused some tasks to fail, and I would like to re-run ONLY the FAILED TASKS.
This SO post suggests using the GUI to "clear" the failed tasks:
How to restart a failed task on Airflow
That approach works if you have a handful of failed tasks.
I am wondering if we can bypass the GUI and do it programmatically, through the command line, with something like:
airflow_clear_failed_tasks dag_id execution_date
Use the following command to clear only failed tasks:
airflow clear [-s START_DATE] [-e END_DATE] --only_failed dag_id
Documentation: https://airflow.readthedocs.io/en/stable/cli.html#clear
The command to clear only failed tasks was updated. It is now (Airflow 2.0 as of March 2021):
airflow tasks clear [-s START_DATE] [-e END_DATE] --only-failed dag_id
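If you also want to skip the CLI and do it from Python, here is a hedged sketch, assuming Airflow 2.x; the DAG id and dates below are placeholders.
from datetime import datetime

from airflow.models import DagBag

dagbag = DagBag()               # parses DAGS_FOLDER, like the CLI does
dag = dagbag.get_dag("my_dag")  # hypothetical DAG id

# Clear only the failed task instances in the window so the scheduler
# re-runs them; successful tasks keep their state.
dag.clear(
    start_date=datetime(2021, 8, 1),
    end_date=datetime(2021, 8, 31),
    only_failed=True,
)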

Airflow will keep showing example dags even after removing it from configuration

The Airflow example DAGs remain in the UI even after I have set load_examples = False in the config file.
The system reports that these DAGs are not present in the DAG folder, but they remain in the UI because the scheduler has marked them as active in the metadata database.
I know one way to remove them would be to delete the corresponding rows in the database directly, but of course this is not ideal. How should I proceed to remove these DAGs from the UI?
There is currently no way of stopping a deleted DAG from being displayed on the UI except manually deleting the corresponding rows in the DB. The only other way is to restart the server after an initdb.
Airflow 1.10+:
Edit airflow.cfg and set load_examples = False
For each example DAG, run the command airflow delete_dag example_dag_to_delete (a scripted version of this step is sketched below)
This avoids resetting the entire Airflow DB.
(Since Airflow 1.10 there is a command to delete a DAG from the database; see this answer.)
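If there are many example DAGs, the delete_dag step can be scripted. A small sketch that just drives the CLI from Python, assuming Airflow 1.10 as above; the DAG ids listed are placeholders:
import subprocess

# Placeholder list; fill in whichever example DAG ids still show up in the UI
example_dag_ids = [
    "example_bash_operator",
    "example_python_operator",
]

for dag_id in example_dag_ids:
    # Same command as the step above; -y skips the confirmation prompt
    subprocess.run(["airflow", "delete_dag", "-y", dag_id], check=True)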
Assuming you have installed Airflow through Anaconda; otherwise, find airflow in your Python site-packages folder and follow the same steps below.
After you follow the instructions in https://stackoverflow.com/a/43414326/1823570:
Go to the $AIRFLOW_HOME/lib/python2.7/site-packages/airflow directory
Remove the directory named example_dags, or just rename it so you can revert later
Restart your webserver
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p [port-number]
airflow resetdb definitely works here.
What I do is create multiple shell scripts for various purposes, like starting the webserver, starting the scheduler, refreshing DAGs, etc., so I only need to run the right script for what I want. Here is the list:
(venv) (base) [pchoix@hadoop02 airflow]$ cat refresh_airflow_dags.sh
#!/bin/bash
cd ~
source venv/bin/activate
airflow resetdb
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow_scheduler.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow_webserver.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
Don't forget to chmod +x those scripts.
I hope you find this helpful.
