I am trying to migrate from Google Cloud Composer composer-1.16.4-airflow-1.10.15 to composer-2.0.1-airflow-2.1.4, but we are running into difficulties with the libraries: each time I upload them, the scheduler fails to work.
Here is my requirements.txt:
flashtext
ftfy
fsspec==2021.11.1
fuzzywuzzy
gcsfs==2021.11.1
gitpython
google-api-core
google-api-python-client
google-cloud
google-cloud-bigquery-storage==1.1.0
google-cloud-storage
grpcio
sklearn
slackclient
tqdm
salesforce-api
pyjwt
google-cloud-secret-manager==1.0.0
pymysql
gspread
fasttext
spacy
click==7.1.2
papermill==2.1.1
tornado>=6.1
jupyter
Here is the command I use to update the libs:
gcloud composer environments update $AIRFLOW_ENV \
--update-pypi-packages-from-file requirements.txt \
--location $AIRFLOW_LOCATION
The command succeeds, but then the DAG tasks are no longer scheduled and the scheduler heartbeat becomes red.
I have tried removing all the libs, and scheduling resumes some time later. I have also tried adding only simple libraries (pandas or flashtext) via the interface, but right after the update the scheduler turns red again and the tasks stay unscheduled.
I can't find any error in the log interface. Do you have an idea of how I could see some logs regarding these errors, or do you know why these libs are making my environment fail?
Thanks
We have found out what was happening. The root cause was the performance of the workers. To work properly, Composer expects the scanning of the DAGs to take less than 15% of the CPU resources; if it exceeds this limit, it fails to schedule or update the DAGs. We simply moved to bigger workers and it has worked well since.
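For reference, in Composer 2 the scheduler and worker resources can also be scaled from the command line. The sketch below assumes the current gcloud composer flags for workload resources; the CPU/memory values are just examples, not a sizing recommendation.
# Sketch: give the Composer 2 scheduler and workers more resources (values are examples only)
gcloud composer environments update $AIRFLOW_ENV \
  --location $AIRFLOW_LOCATION \
  --scheduler-cpu 2 --scheduler-memory 7.5 \
  --worker-cpu 2 --worker-memory 7.5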
When there is a task running, Airflow pops up a notice saying the scheduler does not appear to be running, and it keeps showing until the task finishes:
The scheduler does not appear to be running. Last heartbeat was received 5 minutes ago.
The DAGs list may not update, and new tasks will not be scheduled.
Actually, the scheduler process is running; I have checked the process. After the task finishes, the notice disappears and everything goes back to normal.
My task is kind of heavy and may run for a couple of hours.
I think this is expected for the SequentialExecutor. The SequentialExecutor runs one thing at a time, so it cannot run the heartbeat and a task at the same time.
Why do you need to use the SequentialExecutor / SQLite? The advice to switch to another DB/executor makes perfect sense.
You have started the airflow webserver but you haven't started your airflow scheduler.
Run the airflow scheduler in the background:
airflow scheduler > /console/scheduler_log.log &
I had the same issue.
I switched to PostgreSQL by updating the airflow.cfg file: sql_alchemy_conn = postgresql+psycopg2://airflow@localhost:5432/airflow
and executor = LocalExecutor
This link may help you set this up locally:
https://medium.com/@taufiq_ibrahim/apache-airflow-installation-on-ubuntu-ddc087482c14
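If you don't already have a PostgreSQL database for Airflow, a minimal local setup could look like the sketch below; the database name, user and password are just examples and should match whatever you put in sql_alchemy_conn.
# Sketch: create a local PostgreSQL role and database for Airflow (names/password are examples)
sudo -u postgres psql -c "CREATE USER airflow WITH PASSWORD 'airflow';"
sudo -u postgres psql -c "CREATE DATABASE airflow OWNER airflow;"
# then in airflow.cfg:
# sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow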
A quick fix could be to run the airflow scheduler separately. Perhaps not the best solution but it did work for me. To do so, run this command in the terminal:
airflow scheduler
I had a similar issue and have been trying to troubleshoot this for a while now.
I managed to fix it by setting this value in airflow.cfg:
scheduler_health_check_threshold = 240
PS: Based on a recent conversation in the Airflow Slack community, this can also happen due to contention on the database side, so another suggested workaround was to scale up the database. In my case, that was not a viable solution.
EDIT:
This was last tested with Airflow Version 2.3.3
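If you prefer not to edit airflow.cfg directly, the same option can be set through Airflow's AIRFLOW__SECTION__KEY environment-variable convention; 240 is simply the value used above, not a recommended default.
# Sketch: set the health check threshold via environment variable instead of airflow.cfg
export AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD=240
airflow scheduler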
I have solved this issue by deleting the airflow-scheduler.pid file and then running
airflow scheduler -D
Check the airflow-scheduler.err and airflow-scheduler.log files.
I got an error like this:
Traceback (most recent call last):
File "/home/myVM/venv/py_env/lib/python3.8/site-packages/lockfile/pidlockfile.py", ine 77, in acquire
write_pid_to_pidfile(self.path)
File "/home/myVM/venv/py_env/lib/python3.8/site-packages/lockfile/pidlockfile.py", line 161, in write_pid_to_pidfile
pidfile_fd = os.open(pidfile_path, open_flags, open_mode)
FileExistsError: [Errno 17] File exists: '/home/myVM/venv/py_env/airflow-scheduler.pid'
I removed the existing airflow-scheduler.pid file and started the scheduler again with airflow scheduler -D. It worked fine after that.
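As a sketch, the full sequence looks like this; the pid file path is whatever your FileExistsError reports (here it was inside the virtualenv directory), and the .err/.log files are the ones mentioned above.
# Sketch: remove the stale pid file reported by the traceback, restart the daemonized scheduler
rm /home/myVM/venv/py_env/airflow-scheduler.pid
airflow scheduler -D
# if it still fails, inspect the daemon's log files
tail -n 50 airflow-scheduler.err airflow-scheduler.log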
Our problem was that the file "logs/scheduler.log" was too large (1 TB). After cleaning up this file, everything was fine.
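A sketch of the cleanup, assuming the log file sits under $AIRFLOW_HOME/logs as in the path above; truncating empties the file without invalidating the handle a running process may still hold on it.
# Sketch: check the size, then empty the oversized scheduler log in place
du -sh $AIRFLOW_HOME/logs/scheduler.log
truncate -s 0 $AIRFLOW_HOME/logs/scheduler.log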
I had the same issue while using SQLite. There was a specific message in the Airflow logs: ERROR - Cannot use more than 1 thread when using sqlite. Setting max_threads to 1. If you use only 1 thread, the scheduler will be unavailable while executing a DAG.
So if you use SQLite, try switching to another database. If you don't, check the max_threads value in your airflow.cfg.
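To see what your installation is actually using, you can grep the active config; note that max_threads belongs to the [scheduler] section of older Airflow versions and was later renamed parsing_processes, so check whichever your version has.
# Sketch: inspect the executor, database connection and scheduler thread settings
grep -nE "^(executor|sql_alchemy_conn|max_threads|parsing_processes)" $AIRFLOW_HOME/airflow.cfg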
On the Composer page, click on your environment name to open the Environment details, then go to the PyPI Packages tab.
Click the Edit button and increase any package version.
For example:
I increased the version of the pymysql package, and this restarted the Airflow environment; it took a while to update. Once it was done, I no longer had this error.
You can also add a Python package; that will restart the Airflow environment as well.
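The same bump can be done from the command line; this sketch assumes the gcloud composer --update-pypi-package flag, and pymysql with that particular version is only an example.
# Sketch: bump a single PyPI package from the CLI, which also triggers an environment update
gcloud composer environments update $AIRFLOW_ENV \
  --location $AIRFLOW_LOCATION \
  --update-pypi-package "pymysql==1.0.2"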
I've had the same issue after changing the Airflow timezone. I then restarted the airflow-scheduler and it worked. You can also check whether the airflow-scheduler and airflow-worker are on different servers.
In simple words, using LocalExecutor and postgresql could fix this error.
I was running Airflow locally, following the instructions at https://airflow.apache.org/docs/apache-airflow/stable/start/local.html.
It has the default config:
executor = SequentialExecutor
sql_alchemy_conn = sqlite:////Users/yourusername/airflow/airflow.db
It uses the SequentialExecutor and SQLite by default, and it shows this "The scheduler does not appear to be running." error.
To fix it, I followed Jarek Potiuk's advice and changed the following config:
executor = LocalExecutor
sql_alchemy_conn = postgresql://postgres:masterpasswordforyourlocalpostgresql@localhost:5432
And then I reran airflow db init:
airflow db init
airflow users create \
--username admin \
--firstname Peter \
--lastname Parker \
--role Admin \
--email spiderman@superhero.org
After the db is initialized, run:
airflow webserver --port 8080
airflow scheduler
This fixed the airflow scheduler error.
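This walkthrough assumes a PostgreSQL server is already listening on localhost:5432. If you don't have one, a throwaway instance via Docker is one option; the image tag and password below are just examples and must match the sql_alchemy_conn above.
# Sketch: run a disposable local PostgreSQL for the LocalExecutor setup (credentials are examples)
docker run -d --name airflow-postgres \
  -e POSTGRES_PASSWORD=masterpasswordforyourlocalpostgresql \
  -p 5432:5432 postgres:13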
This happened to me when AIRFLOW_HOME was not set.
Once AIRFLOW_HOME is set to the correct path, the configured executor is picked up.
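A quick way to confirm which home directory and executor Airflow actually resolves (the export path is just the common default):
# Sketch: point Airflow at the intended home and check the resolved settings
export AIRFLOW_HOME=~/airflow
airflow config get-value core executor
# in newer versions the connection string may live under [database] instead of [core]
airflow config get-value core sql_alchemy_conn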
If it matters: somehow, the -D flag causes a lot of problems for me. The airflow webserver -D immediately crashes after starting, and airflow scheduler -D somehow does next to nothing for me.
Weirdly enough, it works without the detach flag. This means I can just run the program normally, and make it run in the background, with e.g. nohup airflow scheduler &.
After changing the executor from SequentialExecutor to LocalExecutor, it works!
in airflow.cfg:
executor = LocalExecutor
We have installed Airflow from a service account, say 'ABC', using sudo root in a virtual environment, but we are facing a few issues.
We call a Python script using the BashOperator. The Python script uses some environment variables from the Unix account 'ABC'. When running from Airflow, the environment variables are not picked up. To find out which user Airflow runs as, we created a dummy DAG with a BashOperator running 'whoami', and it returned the ABC user. So Airflow is using the same 'ABC' user. Then why are the environment variables not picked up?
We then tried sudo -u ABC python script. The environment variables are not picked up, due to the sudo usage. We did a workaround without the environment variables and it ran well in the development environment without issues. But when moving to a different environment, we got the error below, and we don't have permission to edit the sudoers file; admin policy doesn't allow it.
sudo: sorry, you must have a tty to run sudo
We then used the 'impersonation=ABC' option in the .cfg file and ran Airflow without sudo. This time, the bash command fails on the environment variables, and it asks for all the packages used by the script in the virtual environment.
My Questions:
Airflow was installed by ABC after sudoing to root. Why is ABC not honored while running the script?
Why are ABC's environment variables not picked up?
Why does even the impersonation option not pick up the environment variables?
Can Airflow be installed without a virtual environment?
What is the best approach to install Airflow: using a separate user and sudoing to root? We are using a dedicated user for running the Python script. Experts, kindly clarify.
It's always a good idea to use a virtualenv for installing any Python packages, so you should prefer installing Airflow in a virtualenv.
You can use systemd or supervisord and create programs for the Airflow webserver and scheduler. Example configuration for supervisor:
[program:airflow-webserver]
command=sh /home/airflow/scripts/start-airflow-webserver.sh
directory=/home/airflow
autostart=true
autorestart=true
startretries=3
stderr_logfile=/home/airflow/supervisor/logs/airflow-webserver.err.log
stdout_logfile=/home/airflow/supervisor/logs/airflow-webserver.log
user=airflow
environment=AIRFLOW_HOME='/home/airflow/'
[program:airflow-scheduler]
command=sh /home/airflow/scripts/start-airflow-scheduler.sh
directory=/home/airflow
autostart=true
autorestart=true
startretries=3
stderr_logfile=/home/airflow/supervisor/logs/airflow-scheduler.err.log
stdout_logfile=/home/airflow/supervisor/logs/airflow-scheduler.log
user=airflow
environment=AIRFLOW_HOME='/home/airflow/'
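After dropping these program definitions into your supervisor configuration, reload supervisor and start both programs; the commands below are standard supervisorctl usage, and the wrapped start-airflow-*.sh scripts are assumed to simply export AIRFLOW_HOME and exec the corresponding airflow command.
# Sketch: load the new program definitions and start the services
supervisorctl reread
supervisorctl update
supervisorctl start airflow-webserver airflow-scheduler
supervisorctl status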
We got the same issue:
sudo: sorry, you must have a tty to run sudo
The solution we found is:
su ABC python script
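If the intent is to run the script with ABC's login environment so its environment variables get sourced, a variant worth noting is su with a login shell and -c; the script path here is only a placeholder.
# Sketch: run the script as ABC through a login shell so ABC's profile/env vars are loaded
su - ABC -c "python /path/to/script.py"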
I ran airflow initdb for the second time and received the following error:
alembic.util.exc.CommandError: Can't locate revision identified by '9635ae0956e7'
So I understand that I need to remove the already registered revision, but I cannot seem to find it. When I open the MySQL CLI (using sudo mysql) I only see 4 databases: information_schema, mysql, performance_schema, sys.
How do I remove this revision and start fresh?
One solution that worked for me is to remove the airflow.db file under $AIRFLOW_HOME/ and run airflow initdb again. This recreates the file with a new revision.
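In command form (note this wipes all metadata in the local SQLite database, so only do it if you really want to start fresh):
# Sketch: drop the local SQLite metadata DB and recreate it from scratch
rm $AIRFLOW_HOME/airflow.db
airflow initdb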
While bootstrapping on AWS EMR, I am getting the following error. Any clues how to resolve it?
/mnt/var/lib/bootstrap-actions/1/STAR: /lib/libc.so.6: version 'GLIBC_2.14' not found (required by /mnt/var/lib/bootstrap-actions/1/STAR)
It's probably caused by not having a high enough version of libc6.
You can SSH into the EC2 instance that the EMR job created by following this: Open an SSH Tunnel to the Master Node.
Then update the packages. For example, if your instance uses Ubuntu, you should run sudo apt-get update. The exact command depends on which Linux distribution your EC2 instance is running; the default EMR job uses Debian, and Amazon Linux is based on Red Hat.
See if this would work.
If this is actually the problem, you can add this package update command (ignoring the Y/N prompt) at the start of your bootstrap script.
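As a sketch, the non-interactive update at the top of the bootstrap script could look like one of the lines below, depending on which distribution the cluster's AMI is based on.
# Sketch: non-interactive package update at the start of an EMR bootstrap action
sudo yum -y update
# or, on Debian/Ubuntu-based images:
# sudo apt-get update && sudo apt-get -y upgrade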