I have Airflow installed on a Kubernetes cluster. I've created some DAGs, but any time I restart the server, for example with: cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
the server deletes all the DAGs I created. How can I keep my DAGs across webserver restarts?
Related
I have an Airflow deployment running in a Kubernetes cluster. I'm trying to use the CLI to backfill one of my DAGs by doing the following:
I open a shell to my scheduler node by running the following command: kubectl exec --stdin --tty airflow-worker-0 -- /bin/bash
I then execute the following command to initiate the backfill - airflow dags backfill -s 2021-08-06 -e 2021-08-31 my_dag
It then hangs on the below log entry indefinitely until I terminate the process:
[2022-05-31 13:04:25,682] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags
I then get an error similar to the below, complaining that a random DAG that I don't care about can't be found:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/dags/__pycache__/example_dag-37.pyc'
Is there any way to address this? I don't understand why the CLI has to fill up the DagBag given that I've already told it exactly what DAG I want to execute - why is it then looking for random DAGs in the pycache folder that don't exist?
I am new to Airflow and tried to run a DAG by starting the Airflow webserver and scheduler. After I closed the scheduler and the webserver, the Airflow processes were still running.
ps aux | grep airflow shows 2 airflow webserver running, and scheduler running for all dags.
I tried running kill $(ps aux | grep airflow | awk '{print $2}') but it did not help.
I don't have sudo permissions and webserver UI access.
If you run Airflow locally and start it with the two commands airflow scheduler and airflow webserver, then those processes will run in the foreground. So, simply hitting Ctrl-C for each of them should terminate them and all their child processes.
If you don't have those two processes running in the foreground, there is another way. Airflow creates files with process IDs of the scheduler and gunicorn server in its home directory (by default ~/airflow/).
Running
kill $(cat ~/airflow/airflow-scheduler.pid)
should terminate the scheduler.
Unfortunately, airflow-webserver.pid contains the PID of the gunicorn server and not the initial Airflow command that started it (which is the parent of the gunicorn process). So, we will first have to find the parent PID of the gunicorn process and then kill the parent process.
Running
kill $(ps -o ppid= -p $(cat ~/airflow/airflow-webserver.pid))
should terminate the webserver.
If simply running kill (i.e., sending SIGTERM) for these processes does not work, you can send SIGKILL instead: kill -9 <pid>. SIGKILL cannot be caught or ignored, so the process is terminated without any chance to clean up.
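The "SIGTERM first, SIGKILL only if needed" approach above can be sketched as a small shell helper. This is a minimal sketch, not something Airflow ships: stop_pid and its timeout argument are hypothetical names, and the PID file paths in the comments assume the default ~/airflow/ home directory mentioned above.

```shell
#!/bin/bash
# Sketch: stop a process politely and escalate to SIGKILL only if needed.
# stop_pid is a hypothetical helper, not an Airflow command.
stop_pid() {
    local pid="$1" timeout="${2:-10}" i=0
    # kill -0 sends no signal; it only checks that the PID exists and
    # that we are allowed to signal it.
    kill -0 "$pid" 2>/dev/null || return 0
    kill "$pid" 2>/dev/null || true          # SIGTERM first
    while [ "$i" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -9 "$pid" 2>/dev/null || true       # SIGKILL as a last resort
}

# Example calls (assumed default paths, not executed here):
#   stop_pid "$(cat ~/airflow/airflow-scheduler.pid)"
#   stop_pid "$(ps -o ppid= -p "$(cat ~/airflow/airflow-webserver.pid)" | tr -d ' ')"
```

The tr -d ' ' in the second example strips the padding that ps puts in front of the parent PID. Waiting a few seconds before escalating gives the scheduler and gunicorn a chance to shut down their child processes cleanly.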
I have an Airflow process running every day with many DAGs. Today, all of a sudden, none of the DAGs can be run because when Airflow calls airflow run it misspecifies the -sd directory in which to find the DAG.
Here's the example:
[2018-09-26 15:18:10,406] {base_task_runner.py:115} INFO - Running: ['bash', '-c', 'airflow run daily_execution dag1 2018-09-26T13:17:50.511269 --job_id 1 --raw -sd DAGS_FOLDER/newfoldernewfolder/dags/all_dags.py']
As you can see, right after -sd, the subdirectory repeats newfolder twice when it should only state DAGS_FOLDER/newfolder/dags/all_dags.py.
I tried running the DAG with the same files that were running two days ago (when everything was correct) but I get the same error. I'm guessing that something has changed in Airflow configuration but I'm not aware of any changes in airflow.cfg. I've been only managing the UI and airflow run gets called automatically once I turn the DAG on.
Does anybody have an idea where airflow run might get this directory from and how I can update it?
Hi, I'm using Airflow and have put my Airflow project on EC2. How can I keep the Airflow scheduler running when my Mac goes to sleep or I exit the ssh session?
You have a few options, but none will keep it active on a sleeping laptop. On a server:
Use --daemon to run it as a daemon: airflow scheduler --daemon
Or run it in the background: airflow scheduler >& log.txt &
Or run it inside screen, detach with Ctrl-a d, and reattach as needed with screen -r. That works over an ssh connection.
I use nohup to keep the scheduler running and redirect the output to a log file like so:
nohup airflow scheduler >> ${AIRFLOW_HOME}/logs/scheduler.log 2>&1 &
Note: Assuming you are running the scheduler here on your EC2 instance and not on your laptop.
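If you go the nohup route, it helps to record the PID so you can stop the process later. A minimal sketch, assuming nothing beyond standard shell tools: start_detached and the PID file name in the usage line are hypothetical, not part of Airflow.

```shell
#!/bin/bash
# Sketch: run a command under nohup, append its output to a log file,
# and record its PID. start_detached is a hypothetical helper name.
start_detached() {
    local log_file="$1" pid_file="$2"
    shift 2
    # Create the log and PID directories if they do not exist yet.
    mkdir -p "$(dirname "$log_file")" "$(dirname "$pid_file")"
    nohup "$@" >> "$log_file" 2>&1 &
    # $! is the PID of the job that was just put in the background.
    echo "$!" > "$pid_file"
}
```

It could be called as start_detached "${AIRFLOW_HOME}/logs/scheduler.log" "${AIRFLOW_HOME}/scheduler-nohup.pid" airflow scheduler, and the scheduler stopped later with kill "$(cat "${AIRFLOW_HOME}/scheduler-nohup.pid")".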
In case you need more details on running it as a daemon, i.e. detaching completely from the terminal and redirecting stdout and stderr, here is an example:
airflow webserver -p 8080 -D --pid /your-path/airflow-webserver.pid --stdout /your-path/airflow-webserver.out --stderr /your-path/airflow-webserver.err
airflow scheduler -D --pid /your-path/airflow-scheduler.pid --stdout /your-path/airflow-scheduler.out --stderr /your-path/airflow-scheduler.err
The most robust solution would be to register it as a service on your EC2 instance. Airflow provides systemd and upstart scripts for that (https://github.com/apache/incubator-airflow/tree/master/scripts/systemd and https://github.com/apache/incubator-airflow/tree/master/scripts/upstart).
For Amazon Linux, you'd need the upstart scripts, and for e.g. Ubuntu, you would use the systemd scripts.
Registering it as a system service is much more robust because Airflow will be started upon reboot or when it crashes. This is not the case when you use e.g. nohup like other people suggest here.
Airflow example DAGs remain in the UI even after I have set load_examples = False in the config file.
The system reports that the DAGs are not present in the DAG folder, but they remain in the UI because the scheduler has marked them as active in the metadata database.
I know one way to remove them would be to delete these rows directly in the database, but of course this is not ideal. How should I proceed to remove these DAGs from the UI?
There is currently no way of stopping a deleted DAG from being displayed on the UI except manually deleting the corresponding rows in the DB. The only other way is to restart the server after an initdb.
Airflow 1.10+:
Edit airflow.cfg and set load_examples = False
For each example dag run the command airflow delete_dag example_dag_to_delete
This avoids resetting the entire airflow db.
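The per-DAG step above can be wrapped in a loop. This is only a sketch: delete_dags and the AIRFLOW_BIN override are hypothetical names, and the DAG ids you pass should be whatever your UI actually lists. The command may ask for confirmation for each DAG; check airflow delete_dag --help for an option to skip the prompt.

```shell
#!/bin/bash
# Sketch: delete several DAGs from the metadata database in one go.
# AIRFLOW_BIN is a hypothetical override hook, e.g. AIRFLOW_BIN=echo
# for a dry run that only prints the commands.
AIRFLOW_BIN="${AIRFLOW_BIN:-airflow}"

delete_dags() {
    local dag_id
    for dag_id in "$@"; do
        # Airflow 1.10 syntax, as in the steps above.
        "$AIRFLOW_BIN" delete_dag "$dag_id" || echo "could not delete $dag_id" >&2
    done
}
```

For example: delete_dags example_bash_operator example_python_operator. Running it once with AIRFLOW_BIN=echo first shows which commands would be executed without touching the database.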
(Since Airflow 1.10 there is a command to delete a DAG from the database, see this answer.)
Assuming you have installed airflow through Anaconda.
Else look for airflow in your python site-packages folder and follow the below.
After you follow the instructions https://stackoverflow.com/a/43414326/1823570
Go to $AIRFLOW_HOME/lib/python2.7/site-packages/airflow directory
Remove the directory named example_dags, or just rename it so that you can revert the change later
Restart your webserver
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p [port-number]
airflow resetdb definitely works here, but note that it drops and rebuilds the entire metadata database, so all run history is lost.
What I do is to create multiple shell scripts for various purposes like start webserver, start scheduler, refresh dag, etc. I only need to run the script to do what I want. Here is the list:
(venv) (base) [pchoix#hadoop02 airflow]$ cat refresh_airflow_dags.sh
#!/bin/bash
cd ~
source venv/bin/activate
airflow resetdb
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow_scheduler.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow_webserver.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
Don't forget to run chmod +x on those scripts.
I hope this helps.
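The start scripts above repeat the same log-path expression and assume the log directories already exist. One possible way to factor that out is sketched here; log_path and LOG_ROOT are hypothetical names, and the schd and web directory names are taken from the scripts above.

```shell
#!/bin/bash
# Sketch: build a timestamped log path per component and make sure the
# directory exists. log_path and LOG_ROOT are hypothetical names.
LOG_ROOT="${LOG_ROOT:-logs}"

log_path() {
    local component="$1"
    mkdir -p "$LOG_ROOT/$component"
    # %H (24-hour clock) keeps file names sortable; the scripts above
    # use %I together with %p instead.
    printf '%s/%s/%s.log\n' "$LOG_ROOT" "$component" "$(date +'%Y%m%d%H%M')"
}
```

A start script could then use nohup airflow scheduler >> "$(log_path schd)" 2>&1 & and nohup airflow webserver >> "$(log_path web)" 2>&1 &.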