Airflow dags and PYTHONPATH - airflow

I have some DAGs that can't seem to locate Python modules. In the Airflow UI, I see many variations of this message:
Broken DAG: [/home/airflow/source/airflow/dags/test.py] No module named 'paramiko'
If I modify sys.path directly inside a DAG file, the problem goes away:
import sys
sys.path.append('/home/airflow/.local/lib/python2.7/site-packages')
Having to set the path directly in my code doesn't feel right, though. I've tried exporting PYTHONPATH in the Airflow user account's .bashrc, but it doesn't seem to be read when the DAG jobs are executed. What's the correct way to go about this?
Thanks.
----- update -----
Thanks for the responses.
Below are my systemd unit files.
::::::::::::::
airflow-scheduler-airflow2.service
::::::::::::::
[Unit]
Description=Airflow scheduler daemon
[Service]
EnvironmentFile=/usr/local/airflow/instances/airflow2/etc/envars
User=airflow2
Group=airflow2
Type=simple
ExecStart=/usr/local/airflow/instances/airflow2/venv/bin/airflow scheduler
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
::::::::::::::
airflow-webserver-airflow2.service
::::::::::::::
[Unit]
Description=Airflow webserver daemon
[Service]
EnvironmentFile=/usr/local/airflow/instances/airflow2/etc/envars
User=airflow2
Group=airflow2
Type=simple
ExecStart=/usr/local/airflow/instances/airflow2/venv/bin/airflow webserver
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
These are the contents of the EnvironmentFile referenced above:
more /usr/local/airflow/instances/airflow2/etc/envars
PATH=/usr/local/airflow/instances/airflow2/venv/bin:/usr/local/bin:/usr/bin:/bin
AIRFLOW_HOME=/usr/local/airflow/instances/airflow2/home
AIRFLOW_CONFIG=/usr/local/airflow/instances/airflow2/etc/airflow.cfg

I had a similar issue, with two symptoms:
Python wasn't loaded from the virtualenv when running Airflow (fixing this resolved Airflow dependencies not being picked up from the virtualenv)
Submodules under the dags path weren't loaded because of a different base path (fixing this resolved importing my own modules under the dags folder)
I added the following lines to the environment file for the systemd service
(/usr/local/airflow/instances/airflow2/etc/envars in your case):
source /home/ubuntu/venv/airflow/bin/activate
PYTHONPATH=/home/ubuntu/venv/airflow/dags:$PYTHONPATH
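To confirm what the scheduler or webserver process actually sees, it can help to log the interpreter and import path from inside a DAG file. The following is a minimal diagnostic sketch written for Python 3; the helper name describe_import_environment is hypothetical and not part of Airflow.

```python
import importlib.util
import os
import sys


def describe_import_environment(module_name):
    """Report which interpreter is running and whether a top-level module is importable."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        "sys_path": list(sys.path),
        "module_found": spec is not None,
        "module_origin": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    # 'paramiko' is the module from the error message in the question.
    for key, value in describe_import_environment("paramiko").items():
        print(f"{key}: {value}")
```

If the executable printed here is not the one inside the virtualenv, the service is starting a different Python than the one the packages were installed into.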

It looks like your Python environment is inconsistent: you have multiple Python installations on your VM (Python 3.6 and Python 2.7) and multiple copies of pip. The pip being used belongs to Python 3.6, but all of your modules are installed under Python 2.7.
This can be solved easily by using symbolic links to redirect to 2.7.
Run each of these commands and see which version of Python starts (2.7.5, 2.7.14, 3.6, etc.):
python
python2
python2.7
Alternatively, run which python to find which Python installation your VM uses. You can also run which pip to see which pip is being used.
I am going to assume that python and which python lead to Python 3 (which you do not want to use), but that python2 and python2.7 lead to the installation you do want to use.
To make /home/airflow/.local/lib/python2.7/ the one that is used, create the following symbolic links:
cd /home/airflow/.local/lib/python2.7
ln -s python2 python
ln -s /home/airflow/.local/lib/python2.7 python2
The symbolic link syntax is: ln -s TARGET LINKNAME (the path being pointed to, then the name of the link).
You are essentially saying: when the command python is run, go to python2; when python2 is run, go to /home/airflow/.local/lib/python2.7. It's all being redirected.
Now re-run the three commands above (python, python2, python2.7). All of them should lead to the Python installation you want.
Hope this helps!
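The same check can be scripted. This is a small sketch, assuming a POSIX system, that reports which real file each command name resolves to once symlinks are followed; resolve_commands is a hypothetical helper name.

```python
import os
import shutil


def resolve_commands(names, path=None):
    """Map each command name to the real file it resolves to on PATH, or None."""
    resolved = {}
    for name in names:
        location = shutil.which(name, path=path)
        # realpath follows symlinks, so an alias such as python -> python2 becomes visible
        resolved[name] = os.path.realpath(location) if location else None
    return resolved


if __name__ == "__main__":
    for name, target in resolve_commands(["python", "python2", "python2.7", "pip"]).items():
        print(f"{name}: {target}")
```

Two names that print the same target are aliases of one installation; a name that prints None is not on the PATH of the process running the script.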

You can set this directly in the Airflow Dockerfile, as in the example below. If you have a .env file you can do ENV PYTHONPATH "${PYTHONPATH}:${AIRFLOW_HOME}".
FROM puckel/docker-airflow:1.10.6
RUN pip install --user psycopg2-binary
ENV AIRFLOW_HOME=/usr/local/airflow
# add persistent python path (for local imports)
ENV PYTHONPATH "/home/jovyan/work:${AIRFLOW_HOME}"
COPY ./airflow.cfg /usr/local/airflow/airflow.cfg
CMD ["airflow", "initdb"]

I still have the same problem when I trigger a DAG from the UI (it can't locate local Python modules, i.e. my_module.my_sub_module, etc.), but when I test with:
airflow test my_dag my_task 2021-04-01
It works fine!
I also have this line in my .bashrc (pointing to where the local Python modules are supposed to be found):
export PYTHONPATH="/home/my_user"
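One possible explanation, stated here as an assumption, is that the process serving the UI trigger was not started from a shell that read .bashrc, whereas the shell running airflow test was. The sketch below only demonstrates the underlying mechanism: a child Python process has a directory on sys.path only when PYTHONPATH is present in the environment it was started with. The helper name child_sys_path is hypothetical.

```python
import json
import os
import subprocess
import sys


def child_sys_path(pythonpath=None):
    """Start a fresh interpreter and return its sys.path as a list."""
    env = dict(os.environ)
    env.pop("PYTHONPATH", None)  # start from an environment without PYTHONPATH
    if pythonpath is not None:
        env["PYTHONPATH"] = pythonpath
    output = subprocess.check_output(
        [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
        env=env,
    )
    return json.loads(output.decode())


if __name__ == "__main__":
    # /home/my_user is the value exported in the .bashrc line above.
    print("/home/my_user" in child_sys_path("/home/my_user"))
    print("/home/my_user" in child_sys_path())
```

Setting the variable where the daemon is launched (for example in the service's environment file) rather than in .bashrc makes it visible regardless of how the process was started.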

Sorry, this topic is very old, but I had a lot of problems launching Airflow as a daemon, so I'm sharing my solution.
First I installed Anaconda in /home/myuser/anaconda3 and installed all the libraries that my DAGs use. Next, I created the following files:
/etc/systemd/system/airflow-webserver.service
[Unit]
Description=Airflow webserver daemon
After=network.target
[Service]
Environment="PATH=/home/ubuntu/anaconda3/envs/airflow/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
RuntimeDirectory=airflow
RuntimeDirectoryMode=0775
User=myuser
Group=myuser
Type=simple
ExecStart=/bin/bash -c 'source /home/myuser/anaconda3/bin/activate; airflow webserver -p 8080 --pid /home/myuser/airflow/webserver.pid'
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
The same for the scheduler daemon:
/etc/systemd/system/airflow-schedule.service
[Unit]
Description=Airflow schedule daemon
After=network.target
[Service]
Environment="PATH=/home/ubuntu/anaconda3/envs/airflow/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
RuntimeDirectory=airflow
RuntimeDirectoryMode=0775
User=myuser
Group=myuser
Type=simple
ExecStart=/bin/bash -c 'source /home/myuser/anaconda3/bin/activate; airflow scheduler'
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
Next, run these systemctl commands:
sudo systemctl daemon-reload
sudo systemctl enable airflow-webserver.service
sudo systemctl enable airflow-schedule.service
sudo systemctl start airflow-webserver.service
sudo systemctl start airflow-schedule.service

Related

Airflow tells me to delete ~/airflow/airflow.cfg, but when I do, it keeps getting re-created

AIRFLOW_HOME=/path/to/my/airflow_home
I get this warning ...
>airflow trigger_dag python_dag3
/Users/alexryan/miniconda3/envs/airflow/lib/python3.7/site-packages/airflow/configuration.py:627: DeprecationWarning: You have two airflow.cfg files: /Users/alexryan/airflow/airflow.cfg and /path/to/my/airflow_home/airflow.cfg. Airflow used to look at ~/airflow/airflow.cfg, even when AIRFLOW_HOME was set to a different value. Airflow will now only read /path/to/my/airflow_home/airflow.cfg, and you should remove the other file
I complied and deleted ~/airflow/airflow.cfg, but it keeps coming back.
Is there some way to tell airflow to stop re-creating this?
Running on macOS Mojave
>pip freeze | grep air
apache-airflow==1.10.6
Have you created a systemd service for the Airflow server?
I think the ~/airflow folder is re-created again and again because you run the webserver in daemon mode (maybe via launchctl? I am not familiar with macOS).
You should figure out how to pass the environment variables to the daemon process. On Linux, when the daemon configuration is created, "Environment=AIRFLOW_HOME=/path/to/my/airflow_home" is necessary:
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
EnvironmentFile=/etc/sysconfig/airflow
Environment="AIRFLOW_HOME=/path/to/my/airflow_home"
User=airflow
Group=airflow
Type=simple
ExecStart=/bin/airflow webserver
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
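To see why a daemon started without AIRFLOW_HOME falls back to ~/airflow, here is a simplified sketch of the lookup. It assumes the defaults named in the warning above (~/airflow as the home) and, as a further assumption, that AIRFLOW_CONFIG defaults to airflow.cfg inside that home. It is not Airflow's own code, and resolve_airflow_paths is a hypothetical name.

```python
import os


def resolve_airflow_paths(env):
    """Return (home, config) for an environment mapping, using the assumed defaults."""
    home = os.path.expanduser(env.get("AIRFLOW_HOME", "~/airflow"))
    config = os.path.expanduser(
        env.get("AIRFLOW_CONFIG", os.path.join(home, "airflow.cfg"))
    )
    return home, config


if __name__ == "__main__":
    # Environment of a shell where the variable is exported
    print(resolve_airflow_paths({"AIRFLOW_HOME": "/path/to/my/airflow_home"}))
    # Environment of a daemon that never received the variable
    print(resolve_airflow_paths({}))
```

The second call shows the fallback location, which is consistent with the file reappearing whenever some process runs without the variable set.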

How to resolve "Error: No module named 'airflow.www'" while starting the Airflow webserver

I'm getting the error below while starting the Airflow webserver:
balajee@Balajees-MacBook-Air.local:~$ airflow webserver -p 8080
[2018-12-03 00:29:37,066] {__init__.py:51} INFO - Using executor SequentialExecutor
[2018-12-03 00:29:38,776] {models.py:271} INFO - Filling up the DagBag from /Users/balajee/airflow/dags
Running the Gunicorn Server with:
Workers: 4 sync
Host: 0.0.0.0:8080
Timeout: 120
Logfiles: - -
Error: No module named 'airflow.www'
This fixed it for me:
pip3 uninstall -y gunicorn
pip3 install gunicorn==19.4.0
I ran into this problem this morning and found a strange solution that may help you. I think you may just need to change the directory you run the command from.
I installed Airflow's basic dependencies into my virtualenv directory venv with PyCharm's help, and used PyCharm's built-in Terminal tab to access venv directly. I ran airflow initdb to initialize the SQLite database that stores my logs and operations, then, following the official tutorial, ran airflow webserver to start the webserver. Today, however, I used the Mac terminal instead, activated the virtualenv, started airflow webserver, and got this:
Running the Gunicorn Server with:
Workers: 4 sync
Host: 0.0.0.0:8080
Timeout: 120
Logfiles: - -
=================================================================
Error: No module named 'airflow.www'
[2019-05-26 07:45:27,130] {cli.py:833} ERROR - No response from gunicorn master within 120 seconds
[2019-05-26 07:45:27,130] {cli.py:834} ERROR - Shutting down webserver
I tried @Evgeniy Sobolev's solution of reinstalling gunicorn and nothing changed, yet the webserver still ran successfully from the PyCharm terminal. I guess the directory in which you first init the db and run the webserver is critical. By default, when I use the PyCharm terminal to init the db and start the webserver, the working directory is the project root, like:
(venv) root@root:~/GitHub/FakeProject$ airflow webserver
But today I went to the directory containing venv to activate the virtualenv, so the working directory was different:
root@root:~/GitHub/FakeProject/SubDir$ source venv/bin/activate
(venv) root@root:~/GitHub/FakeProject/SubDir$ airflow webserver
** Error **
That is how I hit Error: No module named 'airflow.www'. So I changed back to the project root, and the webserver ran successfully, just like in the PyCharm terminal:
(venv) root@root:~/GitHub/FakeProject/SubDir$ cd ..
(venv) root@root:~/GitHub/FakeProject$ airflow webserver
** It works **
My guess is that Airflow stores some metadata (a path, maybe) the first time you init the Airflow db, so you cannot change the directory you run the command from.
I hope this helps somebody in the future. Just check your directory!
Looks like you have a problem with gunicorn.
Try running these two commands:
sudo -H pip3 uninstall -y gunicorn
sudo -H pip3 install gunicorn
This should resolve your problem, because Airflow shows an unclear error message for gunicorn-related problems.
These are the steps that led to the problem for me:
Create a separate virtualenv only for Airflow (I use the Anaconda distribution).
Activate this env with conda activate.
Install Airflow: pip install apache-airflow
At this point the error No module named 'airflow.www' appeared.
To fix it, follow these steps:
Find where your gunicorn is: whereis gunicorn
gunicorn should exist only in your virtualenv directory: /home/yourname/anaconda3/envs/airflow_env/bin/gunicorn
If it exists in two directories, keep it only in your Airflow environment and remove it from the other.
Another way to check whether gunicorn is in other directories is to print your PATH variable: echo $PATH. Look for gunicorn in /home/yourname/.local/bin and in the other Anaconda directories on the PATH. Remove all other copies. Remove gunicorn from the conda base env as well: pip uninstall gunicorn.
With these steps, I think your problem will be solved.
I used anaconda distribution, but I think the same process can be done without it. I used airflow 1.10.0 and python 3.6.
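The search for duplicate gunicorn installs can also be scripted. This is a minimal sketch, assuming a POSIX system, that lists every executable with a given name on the PATH in lookup order, so the first entry is the one a bare gunicorn command would run. find_all_on_path is a hypothetical helper name.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name` on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        is_executable = os.path.isfile(candidate) and os.access(candidate, os.X_OK)
        if is_executable and candidate not in matches:
            matches.append(candidate)
    return matches


if __name__ == "__main__":
    for location in find_all_on_path("gunicorn"):
        print(location)
```

More than one line of output means several copies are installed, and the first one listed takes precedence.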
If you defined a custom home directory for Airflow other than the default one (~/airflow) during installation:
You first need to export the custom path:
export AIRFLOW_HOME=/your/custom/path/airflow
Go to the Airflow directory and then run the webserver:
airflow webserver -p 8080
Run the scheduler too:
airflow scheduler
Please check whether gunicorn is already installed on the server. For me it was installed in /usr/local/bin and was taking precedence over the gunicorn version installed with Airflow. Uninstall the earlier one or fix the $PATH variable.
I solved this by starting the webserver from the airflow folder itself.
I was previously trying to start the server from the home directory, but the required modules could not be found, which may be the case here.
Late to the party, but this could help others who get here.
I got the same issue using the latest Airflow version, 2.5.0.
Make sure the environment variable AIRFLOW_HOME points to the right location.
Thanks all for sharing.
I added sudo and it actually worked just fine.
I got the same error today and sudo did the trick for me.

How to use Airflow scheduler with systemd?

The docs specify instructions for the integration
What I want is for the scheduler to be restarted on its own every time it stops working. Usually I start it manually with airflow scheduler -D, but sometimes it stops when I'm not available.
Reading the docs I'm not sure about the configs.
The GitHub contains the following files:
airflow
airflow-scheduler.service
airflow.conf
I'm running Ubuntu 16.04
Airflow is installed on:
home/ubuntu/airflow
I have path of:
etc/systemd
The docs say to:
Copy (or link) them to /usr/lib/systemd/system
Copy which of the files?
copy the airflow.conf to /etc/tmpfiles.d/
What is tmpfiles.d ?
What is # AIRFLOW_CONFIG= in the airflow file?
Or in other words... is there a more "down to earth" guide on how to do it?
Integrating Airflow with systemd makes watching your daemons easy, as systemd can take care of restarting a daemon on failure. It also makes it possible to start the Airflow webserver and scheduler automatically on system start.
Edit the airflow file from the systemd folder in the Airflow GitHub repository to match your current configuration, setting the environment variables AIRFLOW_CONFIG, AIRFLOW_HOME and SCHEDULER.
Copy the service files (the files with the .service extension) to /usr/lib/systemd/system on the VM.
Copy the airflow.conf file to /etc/tmpfiles.d/ or /usr/lib/tmpfiles.d/. Copying airflow.conf ensures /run/airflow is created with the right owner and permissions (0755 airflow airflow). Check whether /run/airflow exists and is owned by the airflow user and airflow group; if it doesn't, create the /run/airflow folder with those permissions.
Enable these services by issuing systemctl enable <service> on the command line, as shown below.
sudo systemctl enable airflow-webserver
sudo systemctl enable airflow-scheduler
The airflow-scheduler.service file should look like this:
[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=/bin/airflow scheduler
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
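On the "# AIRFLOW_CONFIG=" question: in an environment file such as the one named by EnvironmentFile, a line starting with # is a comment, so that variable is simply not set until the # is removed and a value is supplied. The sketch below is a deliberately simplified parser for illustration only; it ignores quoting rules and line continuations that systemd supports. parse_env_file and the sample values are hypothetical.

```python
def parse_env_file(text):
    """Parse simple KEY=VALUE lines; blank lines and comment lines are skipped."""
    variables = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or line.startswith(";"):
            continue  # comment or empty line
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not an assignment; ignored in this sketch
        variables[key.strip()] = value.strip().strip('"')
    return variables


# Hypothetical file contents: one commented-out variable, one active variable.
SAMPLE = """\
# AIRFLOW_CONFIG=
AIRFLOW_HOME=/home/ubuntu/airflow
"""

if __name__ == "__main__":
    print(parse_env_file(SAMPLE))
```

Only AIRFLOW_HOME appears in the result, which is how the service would see the file: the commented line contributes nothing.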
Your question is a bit old, but I just discovered it because I'm currently interested in the same subject. I think the answer to your question is here:
https://medium.com/@shahbaz.ali03/run-apache-airflow-as-a-service-on-ubuntu-18-04-server-b637c03f4722

Airflow keeps showing example DAGs even after removing them from the configuration

Airflow example DAGs remain in the UI even after I set load_examples = False in the config file.
The system reports that the DAGs are not present in the dag folder, but they remain in the UI because the scheduler has marked them as active in the metadata database.
I know one way to remove them would be to delete these rows directly in the database, but of course this is not ideal. How should I proceed to remove these DAGs from the UI?
There is currently no way of stopping a deleted DAG from being displayed on the UI except manually deleting the corresponding rows in the DB. The only other way is to restart the server after an initdb.
Airflow 1.10+:
Edit airflow.cfg and set load_examples = False
For each example dag run the command airflow delete_dag example_dag_to_delete
This avoids resetting the entire airflow db.
(Since Airflow 1.10 there is a command to delete a DAG from the database; see this answer.)
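Running the command once per example DAG is easy to script. The sketch below only builds the command lines for the Airflow 1.10 delete_dag command mentioned above and prints them. The DAG IDs are illustrative, and build_delete_commands is a hypothetical helper name.

```python
def build_delete_commands(dag_ids, airflow_bin="airflow"):
    """Return one argv list per DAG for the Airflow 1.10 `delete_dag` command."""
    return [[airflow_bin, "delete_dag", dag_id] for dag_id in dag_ids]


if __name__ == "__main__":
    # Illustrative IDs only; use the ones actually shown in your UI.
    example_ids = ["example_bash_operator", "example_python_operator"]
    for argv in build_delete_commands(example_ids):
        print(" ".join(argv))
```

Each argv list could then be passed to subprocess.run(argv, check=True) from the same environment in which the airflow command normally runs.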
This assumes you have installed Airflow through Anaconda.
Otherwise, look for airflow in your Python site-packages folder and follow the steps below.
After you follow the instructions at https://stackoverflow.com/a/43414326/1823570
Go to the $AIRFLOW_HOME/lib/python2.7/site-packages/airflow directory
Remove the directory named example_dags, or just rename it so that you can revert later
Restart your webserver:
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p [port-number]
Definitely airflow resetdb works here.
What I do is to create multiple shell scripts for various purposes like start webserver, start scheduler, refresh dag, etc. I only need to run the script to do what I want. Here is the list:
(venv) (base) [pchoix@hadoop02 airflow]$ cat refresh_airflow_dags.sh
#!/bin/bash
cd ~
source venv/bin/activate
airflow resetdb
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow_scheduler.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow_webserver.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
Don't forget to chmod +x those scripts.
I hope this helps.

How to gracefully reload a spawn-fcgi script for nginx

My stack is nginx running Python web.py FastCGI scripts via spawn-fcgi. I am using runit to keep the process alive as a daemon, and Unix sockets for the spawn-fcgi process.
Below is my runit service, called myserver, in /etc/sv/myserver, with the run file at /etc/sv/myserver/run.
exec spawn-fcgi -n -d /home/ubuntu/Servers/rtbTest/ -s /tmp/nginx9002.socket -u www-data -f /home/ubuntu/Servers/rtbTest/index.py >> /var/log/mylog.sys.log 2>&1
I need to push changes to the scripts to the production servers. I use paramiko to ssh into the box and update the index.py script.
My question is: how do I gracefully reload index.py, following best practice, so that the new code is picked up?
Do I use:
sudo /etc/init.d/nginx reload
Do I restart the runit script:
sudo sv start myserver
Or do I use both:
sudo /etc/init.d/nginx reload
sudo sv start myserver
Or none of the above?
Basically, you have to restart the process that has loaded your Python script. That is spawn-fcgi, not nginx itself. nginx only communicates with spawn-fcgi via the Unix socket and will happily reconnect if the connection is lost due to a restart of the spawn-fcgi process.
Therefore I'd suggest a simple sudo sv restart myserver. There is no need to restart or reload nginx itself.

Resources