Airflow keeps showing example DAGs even after removing them from the configuration - airflow

Airflow example DAGs remain in the UI even after I have set load_examples = False in the config file.
The system reports that the DAGs are not present in the dag folder, but they remain in the UI because the scheduler has marked them as active in the metadata database.
I know one way to remove them would be to delete these rows directly in the database, but of course this is not ideal. How should I proceed to remove these DAGs from the UI?

There is currently no way of stopping a deleted DAG from being displayed on the UI except manually deleting the corresponding rows in the DB. The only other way is to restart the server after an initdb.

Airflow 1.10+:
Edit airflow.cfg and set load_examples = False
For each example DAG, run the command airflow delete_dag example_dag_to_delete
This avoids resetting the entire Airflow DB.
(Since Airflow 1.10 there is a command to delete a DAG from the database; see this answer)
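If there are many example DAGs, a small loop saves some typing. This is only a sketch: the DAG IDs below are just a few of the bundled examples, and the -y flag skips the confirmation prompt.
# delete a handful of the bundled example DAGs in one go (Airflow 1.10 CLI)
for dag_id in example_bash_operator example_branch_operator example_python_operator; do
    airflow delete_dag -y "$dag_id"
done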

This assumes you have installed Airflow through Anaconda;
otherwise, look for airflow in your Python site-packages folder and follow the steps below.
After you follow the instructions in https://stackoverflow.com/a/43414326/1823570:
Go to the $AIRFLOW_HOME/lib/python2.7/site-packages/airflow directory
Remove the directory named example_dags, or just rename it so you can revert later
Restart your webserver
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p [port-number]
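If you are not sure where the airflow package actually lives (for example with a non-Anaconda install), one quick way to print its install directory, a sketch that works in any Python environment:
# print the directory the airflow package was imported from
python -c "import airflow, os; print(os.path.dirname(airflow.__file__))"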

airflow resetdb definitely works here.
What I do is create several shell scripts for different purposes, like starting the webserver, starting the scheduler, refreshing the DAGs, etc. I only need to run the relevant script to do what I want. Here is the list:
(venv) (base) [pchoix@hadoop02 airflow]$ cat refresh_airflow_dags.sh
#!/bin/bash
cd ~
source venv/bin/activate
airflow resetdb
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow_scheduler.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow_webserver.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
Don't forget to chmod +x those scripts.
I hope this helps.

Related

Airflow - Can't backfill via CLI

I have an Airflow deployment running in a Kubernetes cluster. I'm trying to use the CLI to backfill one of my DAGs by doing the following:
I open a shell to my scheduler node by running the following command: kubectl exec --stdin --tty airflow-worker-0 -- /bin/bash
I then execute the following command to initiate the backfill - airflow dags backfill -s 2021-08-06 -e 2021-08-31 my_dag
It then hangs on the below log entry indefinitely until I terminate the process:
[2022-05-31 13:04:25,682] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags
I then get an error similar to the below, complaining that a random DAG that I don't care about can't be found:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/dags/__pycache__/example_dag-37.pyc'
Is there any way to address this? I don't understand why the CLI has to fill up the DagBag given that I've already told it exactly what DAG I want to execute - why is it then looking for random DAGs in the pycache folder that don't exist?
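One thing that may help here, since the traceback points at a compiled file under __pycache__ for a DAG that no longer exists, is clearing the stale bytecode cache before re-running the backfill. This is only a guess from the log shown above; the path is simply the dags folder from that log entry.
# remove stale compiled files left behind by deleted DAG modules
find /opt/airflow/dags -name "*.pyc" -delete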

How do you access Airflow Web Interface?

Hi, I am taking a DataCamp class on how to use Airflow, and it shows how to create DAGs once you have access to an Airflow web interface.
Is there an easy way to create an account in the Airflow web interface? I am very lost on how to do this. Or is this just an enterprise tool where they provide you access once you pay?
You can do this from the terminal. Run these commands:
export AIRFLOW_HOME=~/airflow
AIRFLOW_VERSION=2.2.5
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
airflow standalone
In the output you will see the username and password it generated.
Then open Chrome and go to:
localhost:8080
and enter that username and password.
Airflow has a web interface by default as well, and the default username/password is airflow/airflow.
You can run it with:
airflow webserver --port 8080
then open the link: http://localhost:8080
If you want to create a new user, use this command:
airflow create_user [-h] [-r ROLE] [-u USERNAME] [-e EMAIL] [-f FIRSTNAME]
[-l LASTNAME] [-p PASSWORD] [--use_random_password]
learn more about Running Airflow locally
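Note that create_user is the Airflow 1.x syntax; on Airflow 2.x the equivalent subcommand is airflow users create. A sketch (the username, password, and e-mail below are placeholders):
# create an admin account on Airflow 2.x
airflow users create -u admin -p admin -f Ada -l Lovelace -r Admin -e admin@example.com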
You should install it; it is a Python package, not a website to register on.
The easiest way to install Airflow is:
pip install apache-airflow
if you need extra packages with it:
pip install apache-airflow[postgres,gcp]
Finally, run the webserver and the scheduler in separate terminals:
airflow webserver # port 8080 by default
airflow scheduler

Airflow dags and PYTHONPATH

I have some DAGs that can't seem to locate Python modules. In the Airflow UI, I see a ton of variations of this message:
Broken DAG: [/home/airflow/source/airflow/dags/test.py] No module named 'paramiko'
Inside a DAG file I can modify the Python sys.path directly, and that seems to mitigate my issue:
import sys
sys.path.append('/home/airflow/.local/lib/python2.7/site-packages')
That doesn't feel right, though, having to set my path in my code directly. I've tried exporting PYTHONPATH in the Airflow user account's .bashrc, but it doesn't seem to be read when the DAG jobs are executed. What's the correct way to go about this?
Thanks.
----- update -----
Thanks for the responses.
Below are my systemd unit files.
::::::::::::::
airflow-scheduler-airflow2.service
::::::::::::::
[Unit]
Description=Airflow scheduler daemon
[Service]
EnvironmentFile=/usr/local/airflow/instances/airflow2/etc/envars
User=airflow2
Group=airflow2
Type=simple
ExecStart=/usr/local/airflow/instances/airflow2/venv/bin/airflow scheduler
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
::::::::::::::
airflow-webserver-airflow2.service
::::::::::::::
[Unit]
Description=Airflow webserver daemon
[Service]
EnvironmentFile=/usr/local/airflow/instances/airflow2/etc/envars
User=airflow2
Group=airflow2
Type=simple
ExecStart=/usr/local/airflow/instances/airflow2/venv/bin/airflow webserver
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
This is the EnvironmentFile contents used above:
more /usr/local/airflow/instances/airflow2/etc/envars
PATH=/usr/local/airflow/instances/airflow2/venv/bin:/usr/local/bin:/usr/bin:/bin
AIRFLOW_HOME=/usr/local/airflow/instances/airflow2/home
AIRFLOW_CONFIG=/usr/local/airflow/instances/airflow2/etc/airflow.cfg
I had a similar issue:
Python wasn't loaded from the virtualenv when running Airflow (fixing this made the Airflow dependencies come from the virtualenv)
Submodules under the dags path weren't loaded due to a different base path (fixing this allowed importing my own modules under the dags folder)
I added the following lines to the environment file for the systemd service
(/usr/local/airflow/instances/airflow2/etc/envars in your case):
source /home/ubuntu/venv/airflow/bin/activate
PYTHONPATH=/home/ubuntu/venv/airflow/dags:$PYTHONPATH
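Applied to the envars file quoted earlier, the result might look like the sketch below. Keep in mind that a systemd EnvironmentFile is parsed as plain KEY=VALUE lines rather than run through a shell, so the source/activate step is better covered by keeping the venv's bin directory first in PATH, as that file already does. The dags path here is an assumption based on AIRFLOW_HOME:
# /usr/local/airflow/instances/airflow2/etc/envars (sketch)
PATH=/usr/local/airflow/instances/airflow2/venv/bin:/usr/local/bin:/usr/bin:/bin
AIRFLOW_HOME=/usr/local/airflow/instances/airflow2/home
AIRFLOW_CONFIG=/usr/local/airflow/instances/airflow2/etc/airflow.cfg
# assumed dags location; adjust to match dags_folder in airflow.cfg
PYTHONPATH=/usr/local/airflow/instances/airflow2/home/dags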
It looks like your python environment is degraded - you have multiple instances of python on your vm (python 3.6 and python 2.7) and multiple instances of pip. There is a pip with python3.6 that is trying to be used, but all of your modules are actually with your python 2.7.
This can be solved easily by using symbolic links to redirect to 2.7.
Type the commands and see which instance of python is used (2.7.5, 2.7.14, 3.6, etc):
python
python2
python2.7
or type which python to find which python instance is being used by your vm. You can also do which pip to see what pip instance is being used.
I am going to assume python and which python leads to python 3 (which you do not want to use), but python2 and python2.7 lead to the instance you do want to use.
To create a symbolic link so that /home/airflow/.local/lib/python2.7/ is used, do the following and create the following symbolic links:
cd /home/airflow/.local/lib/python2.7
ln -s python2 python
ln -s /home/airflow/.local/lib/python2.7 python2
Symbolic link structure is: ln -s #PATHDIRECTED #LINKNAME
You are essentially saying: when you run the command python, go to python2. When python2 is then run, go to /home/airflow/.local/lib/python2.7. It's all being redirected.
Now re-run the three commands above (python, python2, python2.7). All should lead to the Python instance you want.
Hope this helps!
You can add this directly to the Airflow Dockerfile, as in the example below. If you have a .env file you can do ENV PYTHONPATH "${PYTHONPATH}:${AIRFLOW_HOME}".
FROM puckel/docker-airflow:1.10.6
RUN pip install --user psycopg2-binary
ENV AIRFLOW_HOME=/usr/local/airflow
# add persistent python path (for local imports)
ENV PYTHONPATH "/home/jovyan/work:${AIRFLOW_HOME}"
COPY ./airflow.cfg /usr/local/airflow/airflow.cfg
CMD ["airflow", "initdb"]
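To try this out, a plain docker build is enough (the image tag here is arbitrary):
# build the customised image from the directory containing the Dockerfile and airflow.cfg
docker build -t my-airflow .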
I still have the same problem when I try to trigger a DAG from the UI (it can't locate local Python modules, i.e. my_module.my_sub_module, etc.), but when I test with:
airflow test my_dag my_task 2021-04-01
it works fine!
I also have this line in my .bashrc (which is where it is supposed to find the local Python modules):
export PYTHONPATH="/home/my_user"
Sorry guys, this topic is very old, but I had a lot of problems launching Airflow as a daemon, so I'll share my solution.
First I installed Anaconda in /home/myuser/anaconda3 and installed all the libraries I use in my DAGs, then created the following files:
/etc/systemd/system/airflow-webserver.service
[Unit]
Description=Airflow webserver daemon
After=network.target
[Service]
Environment="PATH=/home/ubuntu/anaconda3/envs/airflow/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
RuntimeDirectory=airflow
RuntimeDirectoryMode=0775
User=myuser
Group=myuser
Type=simple
ExecStart=/bin/bash -c 'source /home/myuser/anaconda3/bin/activate; airflow webserver -p 8080 --pid /home/myuser/airflow/webserver.pid'
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
The same for the scheduler daemon:
/etc/systemd/system/airflow-schedule.service
[Unit]
Description=Airflow schedule daemon
After=network.target
[Service]
Environment="PATH=/home/ubuntu/anaconda3/envs/airflow/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
RuntimeDirectory=airflow
RuntimeDirectoryMode=0775
User=myuser
Group=myuser
Type=simple
ExecStart=/bin/bash -c 'source /home/myuser/anaconda3/bin/activate; airflow scheduler'
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
Then run these systemctl commands:
sudo systemctl daemon-reload
sudo systemctl enable airflow-webserver.service
sudo systemctl enable airflow-schedule.service
sudo systemctl start airflow-webserver.service
sudo systemctl start airflow-schedule.service
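To check that both units came up and to follow their logs, the standard systemd tooling applies (nothing Airflow-specific here):
sudo systemctl status airflow-webserver.service airflow-schedule.service
# follow the webserver logs live
journalctl -u airflow-webserver.service -f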

how to clear failing DAGs using the CLI in airflow

I have some failing DAG runs, let's say from 1st-Feb to 20th-Feb. From that date onward, all of them succeeded.
I tried to use the CLI (instead of doing it twenty times with the web UI):
airflow clear -f -t * my_dags.my_dag_id
But I have a weird error:
airflow: error: unrecognized arguments: airflow-webserver.pid airflow.cfg airflow_variables.json my_dags.my_dag_id
EDIT 1:
As @tobi6 explained, the * was indeed causing trouble.
Knowing that, I tried this command instead:
airflow clear -u -d -f -t ".*" my_dags.my_dag_id
but it only returns failed task instances (the -f flag). The -d and -u flags don't seem to work, because task instances downstream and upstream of the failed ones are ignored (not returned).
EDIT 2:
As @tobi6 suggested, using -s and -e makes it possible to select all DAG runs within a date range. Here is the command:
airflow clear -s "2018-04-01 00:00:00" -e "2018-04-01 00:00:00" my_dags.my_dag_id.
However, adding the -f flag to the command above only returns failed task instances. Is it possible to select all failed task instances of all failed DAG runs within a date range?
If you use an asterisk * in the Linux bash, it will automatically be expanded against the contents of the current directory.
Meaning the shell replaces the asterisk with all files in the current working directory and then executes your command.
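You can see the effect with a plain echo (the file names printed are simply whatever sits in your working directory, which is exactly what produced the 'unrecognized arguments' error above):
echo *      # expands: prints airflow-webserver.pid airflow.cfg airflow_variables.json ...
echo "*"    # quoted: prints a literal *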
Quoting the asterisk prevents the automatic expansion:
airflow clear -f -t "*" my_dags.my_dag_id
One solution I've found so far is executing SQL (MySQL in my case):
update task_instance t left join dag_run d on d.dag_id = t.dag_id and d.execution_date = t.execution_date
set t.state=null,
d.state='running'
where t.dag_id = '<your_dag_id>'
and t.execution_date > '2020-08-07 23:00:00'
and d.state='failed';
It will clear all task states on failed dag_runs, just as if the 'clear' button were pressed for an entire DAG run in the web UI.
In Airflow 2.2.4 the airflow clear command is deprecated.
You can now run:
airflow tasks clear -s <your_start_date> -e <end_date> <dag_id>
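Applied to the original question (failed runs between 1st-Feb and 20th-Feb), this might look like the sketch below; the year and the --only-failed/-y flags are my assumptions, so check airflow tasks clear --help on your version first:
# clear only the failed task instances in the date range, skipping the confirmation prompt
airflow tasks clear -s 2018-02-01 -e 2018-02-20 --only-failed -y my_dags.my_dag_id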

How deployment works with Airflow?

I'm using the Celery Executor and the setup from this dockerfile.
I'm deploying my dag into /usr/local/airflow/dags directory into the scheduler's container.
I'm able to run my dag with the command:
$ docker exec airflow_webserver_1 airflow backfill mydag -s 2016-01-01 -e 2016-02-01
My dag contains a simple bash operator:
BashOperator(command = "test.sh" ... )
The operator runs the test.sh script.
However, if test.sh refers to other files, like callme.sh, then I receive a "cannot find file" error.
e.g.
$ pwd
/usr/local/airflow/dags/myworkflow.py
$ ls
myworkflow.py
test.sh
callme.sh
$ cat test.sh
echo "test file"
./callme.sh
$ cat callme.sh
echo "got called"
When running myworkflow, the task that calls test.sh is invoked, but it fails because it cannot find callme.sh.
I find this confusing. Is it my responsibility to share the code resource files with the worker, or Airflow's responsibility? If it's mine, what is the recommended approach to do so? I'm looking at using EFS mounted on all containers, but it looks very expensive to me.
For the Celery executor, it is your responsibility to make sure that each worker has all the files it needs to run a job.
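Assuming the dags folder (and therefore test.sh and callme.sh) is mounted at the same path on every worker in that docker-compose setup, one low-tech workaround is to have test.sh call its sibling script by absolute path instead of relying on the worker's current working directory. A sketch, using the /usr/local/airflow/dags path from the question:
#!/bin/bash
# test.sh (sketch): call the sibling script by absolute path so it is found
# regardless of which directory the Celery worker runs the task from
echo "test file"
/usr/local/airflow/dags/callme.sh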
