Airflow webserver CPU usage high even when idle

I set up an Airflow instance using the Docker Compose file defined in the quickstart. I switched to the LocalExecutor and removed Celery and the worker service; the only other change was to increase the healthcheck interval to 3600s (compose changes sketched below). Apart from that, everything uses default settings. The Airflow image version is 2.0.1.
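For context, the changes described above amount to roughly the following compose override (a sketch against the quickstart file, not the full thing; service names and the healthcheck command follow the quickstart but are worth double-checking against your copy):
x-airflow-common:
  &airflow-common
  image: apache/airflow:2.0.1
  environment:
    AIRFLOW__CORE__EXECUTOR: LocalExecutor   # switched from CeleryExecutor
services:
  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 3600s                        # raised from the quickstart default
  # Celery worker, Flower and Redis services removed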
On an EC2 t3a.medium instance this setup averages 20% CPU utilization even when idle, which simply eats up CPU credits. Looking at CPU utilization I see gunicorn processes popping up regularly. When I stop the webserver, utilization drops to 2%. Is there any configuration change that can lower the CPU usage, and what are the trade-offs involved?
The webserver logs look like this:
airflow-webserver_1 | [2021-04-12 14:21:09 +0000] [17] [INFO] Handling signal: ttou
airflow-webserver_1 | [2021-04-12 14:21:09 +0000] [17222] [INFO] Worker exiting (pid: 17222)
airflow-webserver_1 | [2021-04-12 14:21:28 +0000] [17] [INFO] Handling signal: ttin
airflow-webserver_1 | [2021-04-12 14:21:28 +0000] [17237] [INFO] Booting worker with pid: 17237
airflow-webserver_1 | [2021-04-12 14:21:40 +0000] [17] [INFO] Handling signal: ttou
airflow-webserver_1 | [2021-04-12 14:21:40 +0000] [17225] [INFO] Worker exiting (pid: 17225)
airflow-webserver_1 | [2021-04-12 14:21:59 +0000] [17] [INFO] Handling signal: ttin
airflow-webserver_1 | [2021-04-12 14:21:59 +0000] [17240] [INFO] Booting worker with pid: 17240
Thanks

I was able to reduce CPU usage by increasing the refresh and timeout intervals, by adding these environment variables to the airflow-webserver service:
AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL: 600
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT: 1200
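In a quickstart-style compose file these can go into the shared environment block (or directly under the airflow-webserver service, as described above). A minimal sketch, assuming the quickstart's x-airflow-common anchor:
x-airflow-common:
  &airflow-common
  environment:
    &airflow-common-env
    AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL: 600    # recycle gunicorn workers every 10 minutes instead of the much shorter default
    AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT: 1200  # allow 20 minutes before a worker is considered hung
The trade-off is simply that webserver workers are recycled less often, so a genuinely hung worker can take up to 20 minutes to be replaced.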

Related

SQLite disk I/O error when running Airflow commands

Upon running:
airflow scheduler
I get the following error:
[2022-08-10 13:26:53,501] {scheduler_job.py:708} INFO - Starting the scheduler
[2022-08-10 13:26:53,502] {scheduler_job.py:713} INFO - Processing each file at most -1 times
[2022-08-10 13:26:53,509] {executor_loader.py:105} INFO - Loaded executor: SequentialExecutor
[2022-08-10 13:26:53 -0400] [1388] [INFO] Starting gunicorn 20.1.0
[2022-08-10 13:26:53,540] {manager.py:160} INFO - Launched DagFileProcessorManager with pid: 1389
[2022-08-10 13:26:53,545] {scheduler_job.py:1233} INFO - Resetting orphaned tasks for active dag runs
.
.
.
[2022-08-10 13:26:53 -0400] [1391] [INFO] Booting worker with pid: 1391
Process DagFileProcessor10-Process:
Traceback (most recent call last):
File "/home/dromo/anaconda3/envs/airflow_env_2/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 998, in _commit_impl
self.engine.dialect.do_commit(self.connection)
File "/home/dromo/anaconda3/envs/airflow_env_2/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 672, in do_commit
dbapi_connection.commit()
sqlite3.OperationalError: disk I/O error
I get this 'disk I/O error' as well when I run the airflow webserver --port 8080 command, like so:
Workers: 4 sync
Host: 0.0.0.0:8080
Timeout: 120
Logfiles: - -
Access Logformat:
=================================================================
[2022-08-10 14:42:28 -0400] [2759] [INFO] Starting gunicorn 20.1.0
[2022-08-10 14:42:29 -0400] [2759] [INFO] Listening at: http://0.0.0.0:8080 (2759)
[2022-08-10 14:42:29 -0400] [2759] [INFO] Using worker: sync
.
.
.
[2022-08-10 14:42:55,149] {app.py:1455} ERROR - Exception on /static/appbuilder/datepicker/bootstrap-datepicker.css [GET]
Traceback (most recent call last):
File "/home/dromo/anaconda3/envs/airflow_env_2/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 998, in _commit_impl
self.engine.dialect.do_commit(self.connection)
File "/home/dromo/anaconda3/envs/airflow_env_2/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 672, in do_commit
dbapi_connection.commit()
sqlite3.OperationalError: disk I/O error
Any ideas as to what might be causing this and possible fixes?
It seems like Airflow can't find the database on disk; try initializing it:
airflow db init

Problems using MariaDB in Docker Swarm with NFS

I have problems using MariaDB within a Docker Swarm using an NFS share. The database suddenly stops accepting new connections after fdatasync() fails. This happens randomly, after a few hours or after a few days. If I remove the service and start it again, everything runs fine. The service does not seem to repair itself, but I think this error should not occur in the first place, even if the service could heal itself afterwards. I run the database as the persistence layer for the Nextcloud app.
This is my docker-compose file:
version: '3.3'
services:
  nextcloud_db:
    image: mariadb:10.7.4
    #container_name: nextcloud-db
    command:
      - "--transaction-isolation=READ-COMMITTED"
      - "--log-bin=ROW"
      - "--innodb_read_only_compressed=OFF"
      - "--character-set-server=utf8mb4"
      - "--collation-server=utf8mb4_unicode_ci"
      #- "--innodb-rollback-on-timeout=ON" # Tested this but did not help
    deploy:
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
      labels:
        - traefik.enable=false
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - db:/var/lib/mysql
    environment:
      - MYSQL_ROOT_PASSWORD=myrootpassword
      - MYSQL_PASSWORD=mymysqlpassword
      - MYSQL_DATABASE=nextcloud
      - MYSQL_USER=nextcloud
      - MYSQL_INITDB_SKIP_TZINFO=1
    networks:
      - nextcloud

  ### other services for running nextcloud ###

volumes:
  db:
    driver_opts:
      type: "nfs"
      o: "addr=<storage-server-ip>,nolock,soft,rw"
      device: ":/mnt/storage/nextcloud/db"

networks:
  traefik-public:
    external: true
  nextcloud:
    driver: overlay
    # driver_opts:
    #   encrypted: "true"
These are the logs from the moment the db died:
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | 2022-06-29 19:51:17 4671 [ERROR] [FATAL] InnoDB: fdatasync() returned 5
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | 220629 19:51:17 [ERROR] mysqld got signal 6 ;
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | This could be because you hit a bug. It is also possible that this binary
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | or one of the libraries it was linked against is corrupt, improperly built,
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | or misconfigured. This error can also be caused by malfunctioning hardware.
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 |
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | To report this bug, see https://mariadb.com/kb/en/reporting-bugs
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 |
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | We will try our best to scrape up some info that will hopefully help
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | diagnose the problem, but since we have already crashed,
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | something is definitely wrong and this may fail.
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 |
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | Server version: 10.7.4-MariaDB-1:10.7.4+maria~focal-log
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | key_buffer_size=134217728
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | read_buffer_size=131072
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | max_used_connections=10
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | max_threads=153
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | thread_count=11
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | It is possible that mysqld could use up to
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467995 K bytes of memory
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | Hope that's ok; if not, decrease some variables in the equation.
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 |
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | Thread pointer: 0x55d81db99108
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | Attempting backtrace. You can use the following information to find out
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | where mysqld died. If you see no messages after this, something went
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | terribly wrong...
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | stack_bottom = 0x7fcf10137d98 thread_stack 0x49000
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | mariadbd(my_print_stacktrace+0x32)[0x55d81b24de52]
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | mariadbd(handle_fatal_signal+0x485)[0x55d81ad282b5]
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | 2022-06-29 21:49:49 4673 [Warning] Aborted connection 4673 to db: 'nextcloud' user: 'nextcloud' host: '10.0.7.189' (Got an error reading communication packets)
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | 2022-06-29 21:49:49 4672 [Warning] Aborted connection 4672 to db: 'nextcloud' user: 'nextcloud' host: '10.0.7.189' (Got an error reading communication packets)
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | 2022-06-29 21:49:49 4674 [Warning] Aborted connection 4674 to db: 'nextcloud' user: 'nextcloud' host: '10.0.7.189' (Got an error reading communication packets)
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | 2022-06-29 22:16:02 4676 [Warning] Aborted connection 4676 to db: 'nextcloud' user: 'nextcloud' host: '10.0.7.189' (Got an error reading communication packets)
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | 2022-06-29 22:18:13 4678 [Warning] Aborted connection 4678 to db: 'nextcloud' user: 'nextcloud' host: '10.0.7.189' (Got an error reading communication packets)
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | 2022-06-29 22:24:46 4679 [Warning] Aborted connection 4679 to db: 'nextcloud' user: 'nextcloud' host: '10.0.7.189' (Got an error reading communication packets)
nc_nextcloud_db.1.1mfx9xkwd1sd#v220210169548138574 | 2022-07-01 21:49:02 7148 [Warning] Aborted connection 7148 to db: 'nextcloud' user: 'nextcloud' host: '10.0.7.189' (Got an error reading communication packets)
I found no other logs related to the issue.
Does anyone have a clue what's going on here?
Maybe the NFS share is unavailable for a few seconds and the database then has problems reading/writing? Is it possible for the MariaDB service to heal itself after this error occurs? There are no other problems as long as the database service is running: I can upload and delete files, etc., so it is not a permissions issue on the NFS share.
Further MariaDB metrics:
https://jpst.it/2TX-F
Host system info:
Docker node VM with Ubuntu:
Ubuntu 20.04.4 LTS
2 vCPUs
8 GB RAM
160 GB SSD System-Storage (Raid 10)
MySQL (and MariaDB) does not support initializing and running its data directory on an NFS mount; the fdatasync() error points at the NFS volume rather than at the database itself.
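If the data has to come off NFS, one option is to back the db volume with local storage on a single node and pin the service there. A minimal sketch, assuming a placeholder node hostname (not tested against the stack above):
services:
  nextcloud_db:
    # ... image, command, environment as above ...
    deploy:
      placement:
        constraints:
          - node.hostname == <storage-node>   # pin to the node that owns the local disk
    volumes:
      - db:/var/lib/mysql
volumes:
  db:
    driver: local                             # plain local volume instead of the NFS driver_opts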

Airflow server constantly restarting - Signal 15

I launch an airflow webserver command on my local machine to start an Airflow instance on port 8081. The server starts; however, the prompt constantly shows some warning messages in a loop. No error message appears, but the server doesn't work. These are the messages:
/usr/local/lib/python3.8/dist-packages/airflow/configuration.py:361 DeprecationWarning: The default_queue option in [celery] has been moved to the default_queue option in [operators] - the old setting has been used, but please update your config.
/usr/local/lib/python3.8/dist-packages/airflow/configuration.py:361 DeprecationWarning: The dag_concurrency option in [core] has been renamed to max_active_tasks_per_dag - the old setting has been used, but please update your config.
/usr/local/lib/python3.8/dist-packages/airflow/configuration.py:361 DeprecationWarning: The processor_poll_interval option in [scheduler] has been renamed to scheduler_idle_sleep_time - the old setting has been used, but please update your config.
[2022-06-13 15:11:57,355] {manager.py:779} WARNING - No user yet created, use flask fab command to do it.
[2022-06-13 15:12:01,925] {manager.py:512} WARNING - Refused to delete permission view, assoc with role exists DAG Runs.can_create User
[2022-06-13 15:12:19 +0000] [1117638] [INFO] Handling signal: ttou
[2022-06-13 15:12:19 +0000] [1120256] [INFO] Worker exiting (pid: 1120256)
[2022-06-13 15:12:19 +0000] [1117638] [WARNING] Worker with pid 1120256 was terminated due to signal 15
[2022-06-13 15:12:22 +0000] [1117638] [INFO] Handling signal: ttin
[2022-06-13 15:12:22 +0000] [1121568] [INFO] Booting worker with pid: 1121568
Do you know what could be happening?
Thank you in advance!

Airflow webserver not starting while using helm chart on minikube

I'm trying to run Airflow locally (to test it before deployment) using minikube and the stable/airflow Helm chart, but airflow-webserver doesn't start due to a gunicorn issue.
Helm: v2.14.3
Kubernetes: v1.15.2
Minikube: v1.3.1
Helm chart image: puckel/docker-airflow
These are the steps:
minikube start
helm install --namespace "airflow" --name "airflow" stable/airflow
Logs are:
Thu Sep 12 07:29:54 UTC 2019 - waiting for Postgres... 1/20
Thu Sep 12 07:30:00 UTC 2019 - waiting for Postgres... 2/20
waiting 60s...
executing webserver...
[2019-09-12 07:31:05,745] {{settings.py:213}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=1
/usr/local/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
[2019-09-12 07:31:06,030] {{__init__.py:51}} INFO - Using executor CeleryExecutor
____________ _____________
____ |__( )_________ __/__ /________ __
____ /| |_ /__ ___/_ /_ __ /_ __ \_ | /| / /
___ ___ | / _ / _ __/ _ / / /_/ /_ |/ |/ /
_/_/ |_/_/ /_/ /_/ /_/ \____/____/|__/
[2019-09-12 07:31:06,585] {{dagbag.py:90}} INFO - Filling up the DagBag from /usr/local/airflow/dags
Running the Gunicorn Server with:
Workers: 4 sync
Host: 0.0.0.0:8080
Timeout: 120
Logfiles: - -
=================================================================
[2019-09-12 07:31:07,676] {{settings.py:213}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=21
/usr/local/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
[2019-09-12 07:31:07 +0000] [21] [INFO] Starting gunicorn 19.9.0
[2019-09-12 07:31:07 +0000] [21] [INFO] Listening at: http://0.0.0.0:8080 (21)
[2019-09-12 07:31:07 +0000] [21] [INFO] Using worker: sync
[2019-09-12 07:31:07 +0000] [25] [INFO] Booting worker with pid: 25
[2019-09-12 07:31:07 +0000] [26] [INFO] Booting worker with pid: 26
[2019-09-12 07:31:07 +0000] [27] [INFO] Booting worker with pid: 27
[2019-09-12 07:31:07 +0000] [28] [INFO] Booting worker with pid: 28
[2019-09-12 07:31:08,444] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-09-12 07:31:08,446] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-09-12 07:31:08,545] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-09-12 07:31:08,669] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-09-12 07:31:10,047] {{dagbag.py:90}} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2019-09-12 07:31:20,932] {{cli.py:825}} ERROR - [0 / 0] some workers seem to have died and gunicorn did not restart them as expected
[2019-09-12 07:31:22,095] {{dagbag.py:90}} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2019-09-12 07:31:22 +0000] [25] [INFO] Parent changed, shutting down: <Worker 25>
[2019-09-12 07:31:22 +0000] [25] [INFO] Worker exiting (pid: 25)
[2019-09-12 07:31:32 +0000] [28] [INFO] Parent changed, shutting down: <Worker 28>
[2019-09-12 07:31:32 +0000] [28] [INFO] Worker exiting (pid: 28)
[2019-09-12 07:31:33,289] {{dagbag.py:90}} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2019-09-12 07:31:33,324] {{dagbag.py:90}} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2019-09-12 07:31:35 +0000] [26] [INFO] Parent changed, shutting down: <Worker 26>
[2019-09-12 07:31:35 +0000] [26] [INFO] Worker exiting (pid: 26)
[2019-09-12 07:31:35 +0000] [27] [INFO] Parent changed, shutting down: <Worker 27>
[2019-09-12 07:31:35 +0000] [27] [INFO] Worker exiting (pid: 27)
[2019-09-12 07:33:32,017] {{cli.py:832}} ERROR - No response from gunicorn master within 120 seconds
[2019-09-12 07:33:32,018] {{cli.py:833}} ERROR - Shutting down webserver
I can run that Docker image locally with docker-compose with no issues, but no luck using Helm: it fails and restarts constantly.
It turned out that the issue was that the minikube configuration wasn't making the Postgres pod reachable; editing the pod deployment with the IP of the Postgres instance made it work.
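For illustration, the kind of override meant here is an env entry on the webserver Deployment that points Airflow's metadata DB at the reachable Postgres address. This is a sketch only: the container name, credentials, and IP are placeholders, and the chart may wire the connection differently.
spec:
  template:
    spec:
      containers:
        - name: airflow-webserver
          env:
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              value: postgresql+psycopg2://airflow:airflow@<postgres-ip>:5432/airflow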

How do I set up a stand-alone Gunicorn App server if I already have an Nginx proxy set up?

I'm trying to set up multiple servers that look like:
Client Request ----> Nginx (Reverse-Proxy / Load-Balancer)
                      |
                     /|\
                    | | `-> App. Server I.  10.128.xxx.yy1:8080 # Our example
                    | `--> App. Server II. 10.128.xxx.yy2:8080
                    `----> ..
I understand that I need to put the App servers (Gunicorn in this case) behind an Nginx Proxy, but how do I set up the App servers by themselves?
I'm trying to set up the App server with systemd, and my configuration looks like:
[Unit]
Description=gunicorn daemon
After=network.target
[Service]
User=kyle
Group=www-data
WorkingDirectory=/home/kyle/do_app_one
ExecStart=/home/kyle/do_app_one/venv/bin/gunicorn --workers 3 --bind unix:/home/kyle/do_app_one/do_app_one.sock do_app_one.wsgi:application
[Install]
WantedBy=multi-user.target
I know the socket is being created because I can see it, but I can't access the Gunicorn server by itself when I hit the IP address, with or without the :8000 port attached. Without the systemd configuration, I can access the site if I do:
gunicorn --bind 0.0.0.0:8000 myproject.wsgi:application
but I want to do this the right way with an init system like systemd, and I don't think I'm supposed to be binding it directly to a port because I've read it's less efficient/secure than using a socket. Unless binding to a port is the only way, then I guess that's what I have to do.
Every tutorial I see says I need an Nginx server in front of my Gunicorn server, but I already have an Nginx server in front of them. Do I need another one in front of each server such that it looks like:
Client Request ----> Nginx (Reverse-Proxy / Load-Balancer)
                      |
                     /|\
                    | | `-> Nginx + App. Server I.  10.128.xxx.yy1:8080 # Our example
                    | `--> Nginx + App. Server II. 10.128.xxx.yy2:8080
                    `----> ..
If Nginx is an HTTP server, and Gunicorn is an HTTP server, why would I need another Nginx server in front of each App Server? It seems redundant.
And if I don't need another Nginx server in front of each Gunicorn server, how do I set up the Gunicorn server with systemd such that it can stand alone?
Edit:
I was curious as to why binding to a physical port was working but the socket wasn't, so I ran gunicorn status and got errors:
kyle@ubuntu-512mb-tor1-01-app:~/do_app_one$ . venv/bin/activate
(venv) kyle@ubuntu-512mb-tor1-01-app:~/do_app_one$ gunicorn status
[2016-12-03 20:19:49 +0000] [11050] [INFO] Starting gunicorn 19.6.0
[2016-12-03 20:19:49 +0000] [11050] [INFO] Listening at: http://127.0.0.1:8000 (11050)
[2016-12-03 20:19:49 +0000] [11050] [INFO] Using worker: sync
[2016-12-03 20:19:49 +0000] [11053] [INFO] Booting worker with pid: 11053
[2016-12-03 20:19:49 +0000] [11053] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 557, in spawn_worker
worker.init_process()
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/workers/base.py", line 126, in init_process
self.load_wsgi()
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/workers/base.py", line 136, in load_wsgi
self.wsgi = self.app.wsgi()
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 65, in load
return self.load_wsgiapp()
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 52, in load_wsgiapp
return util.import_app(self.app_uri)
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/util.py", line 357, in import_app
__import__(module)
ImportError: No module named 'status'
[2016-12-03 20:19:49 +0000] [11053] [INFO] Worker exiting (pid: 11053)
[2016-12-03 20:19:49 +0000] [11050] [INFO] Shutting down: Master
[2016-12-03 20:19:49 +0000] [11050] [INFO] Reason: Worker failed to boot.
Still not sure how to fix the problem though.
The right answer is to just bind Gunicorn to a port instead of a Unix socket. I'm not too sure about the details, but Unix sockets can only be used on the same machine, not across the network, according to:
https://unix.stackexchange.com/questions/91774/performance-of-unix-sockets-vs-tcp-ports
So when I changed the gunicorn.service file ExecStart line to:
ExecStart=/home/kyle/do_app_one/venv/bin/gunicorn --workers 3 --bind 0.0.0.0:8000 do_app_one.wsgi:application
I was able to access the server by itself, and connect it to my Nginx server that was on a different IP.
