Airflow 2 - error MySQL server has gone away

I am running Airflow with a MariaDB backend, and periodically, when a DAG task is being scheduled, I notice the following error in the Airflow worker:
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away').
I am not sure whether the issue occurs because of an Airflow misconfiguration, or because the backend is MariaDB, which, as far as I have seen, is not a recommended database.
Also, in the MariaDB logs, I see the following warning repeated almost every minute:
[Warning] Aborted connection 305627 to db: 'airflow' user: 'airflow' host: 'hostname' (Got an error reading communication packets)
I've seen some similar issues mentioned, but nothing I have tried so far has helped.
The question is: should I switch the database to MySQL, or does some configuration need to be done on MariaDB's end?
Airflow v2.0.1
MariaDB 10.5.5
SQLAlchemy 1.3.23

Hard to say - you need to look for the reason why your DB connection gets aborted. MariaDB might work for quick testing with a single scheduler, but something is causing your connection to the DB to be dropped.
There are a few things you can do:
Airflow has an airflow db check CLI command; run it to test whether the DB configuration is working - the errors you see might already make the problem obvious (see the sketch after this list).
Airflow also has another useful command, airflow db shell, which lets you connect to the DB and run SQL queries. This might tell you whether your connection is "stable": connect, run some queries, and see if the connection is interrupted in the meantime.
Look at more logs and at your network connectivity to see whether you have problems there.
Finally, check whether you have enough resources to run Airflow + DB. Often things like that happen when you do not have enough memory, for example. From experience, Airflow + DB needs at least 4 GB of RAM (depending on the DB configuration), and if you are on Mac or Windows and using Docker, the Docker VM has less memory than that available by default, so you need to increase it.
Look at other resources too - disk space, memory, number of connections, etc.; any of them can be your problem.
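For example, here is a quick way to exercise both of those checks together with the connection-recycling settings that usually matter for "server has gone away" errors (a sketch - the values shown are illustrative, not recommendations):
# verify Airflow can reach and use the metadata DB
airflow db check
# open an interactive session against the metadata DB; leave it idle for a while, then run a query again
airflow db shell
-- inside db shell (or the mysql client): how aggressively does MariaDB close idle connections,
-- and is max_allowed_packet large enough for your payloads?
SHOW VARIABLES LIKE 'wait_timeout';
SHOW VARIABLES LIKE 'max_allowed_packet';
# airflow.cfg, [core] section in 2.0.x - recycle pooled connections well below wait_timeout
# and ping connections before using them:
# sql_alchemy_pool_recycle = 1800
# sql_alchemy_pool_pre_ping = True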

Related

Airflow DAG getting psycopg2.OperationalError when running tasks with KubernetesPodOperator

Context
We are running Airflow 2.1.4 on an AKS cluster. The Airflow metadata database is an Azure managed PostgreSQL (8 CPUs). We have a DAG with about 30 tasks; each task uses a KubernetesPodOperator (via apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is deployed with the official Airflow Helm chart. The executor is Celery.
Issue
Usually the first five or so tasks execute successfully (taking one or two minutes each) and get marked as done (colored green) in the Airflow UI. The tasks after that are also executed successfully on AKS, but they are not marked as completed in Airflow. In the end this leads to the following error message, and the already-finished task is marked as failed:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866]{base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y - however, the Stack Overflow link in that post no longer works.
The metadata database (Azure managed PostgreSQL) is not overloaded, and the AKS node pool we are using does not show any sign of stress either. It seems like the scheduler cannot pick up / detect a finished task after a couple of tasks have run.
We also looked at several configuration options as stated here.
We have been trying for a number of days now to get this solved, but unfortunately without success.
Does anyone have any idea what the cause could be? Any help is appreciated!
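One thing we are experimenting with (we have not confirmed it as the fix) is forcing TCP keepalives on the metadata DB connection, so that idle connections between AKS and the managed PostgreSQL are not silently dropped. A sketch of what we mean - the credentials are masked placeholders, and the keepalive parameters are standard libpq/psycopg2 connection options:
# environment-variable form; with the official Helm chart the same string would go
# wherever the metadata connection is templated in your values
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:***@psql-airflow-dev-01.postgres.database.azure.com:5432/airflow?keepalives=1&keepalives_idle=30&keepalives_interval=10&keepalives_count=5"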

Airflow webserver not starting (first time)

My environment is a RHEL VM, Python 3.6, and a Postgres DB on AWS RDS.
I am a newbie just starting to learn how to use Airflow.
I followed the guidance from: https://medium.com/#klogic/introduction-of-airflow-tool-for-create-etl-pipeline-dc37ad049211
At the airflow init db stage, I created a new admin user using the command: FLASK_APP=airflow.www.app flask fab create-admin
The next step is airflow webserver -p 8080, but it is not working. The error it shows is:
Where are these 4 workers coming from? How do I resolve this issue? I checked my Postgres database and there are new tables added, namely 'job', 'dag' and 'dag_pickle'.
Thanks.
Check the memory available to your VM. The Airflow webserver starts 4 gunicorn workers by default and they are pretty memory-hungry (depending on your configuration). Increasing the memory should help.
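If adding memory is not an option right away, you can also reduce the number of webserver workers to test that theory (a sketch; 2 is just an example value):
airflow webserver -p 8080 --workers 2
# or persistently, in airflow.cfg:
# [webserver]
# workers = 2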

MariaDB has stopped responding - [ERROR] mysqld got signal 6

The MariaDB service stopped responding all of a sudden. It had been running continuously for more than 5 months without any issues. When we checked the MariaDB service status at the time of the incident, it showed as active (running) (service mariadb status). But we could not log into the MariaDB server; each login attempt simply hung without any response. All our web applications also failed to communicate with the MariaDB service. We also checked max_used_connections, and it was below the maximum value.
When going through the logs, we saw the error below (it had been triggered at the time of the incident).
210623 2:00:19 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.2.34-MariaDB-log
key_buffer_size=67108864
read_buffer_size=1048576
max_used_connections=139
max_threads=752
thread_count=72
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1621655 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4c008501e8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4c458a7d30 thread_stack 0x49000
2021-06-23 2:04:20 139966788486912 [Warning] InnoDB: A long semaphore wait:
--Thread 139966780094208 has waited at btr0sea.cc line 1145 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x55e1838d5ab0 created in file btr0sea.cc line 191
a writer (thread id 139966610978560) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0sea.cc line 1145
Last time write locked in file btr0sea.cc line 1218
We could not even stop the MariaDB service using the normal stop command (service mariadb stop). But we were able to forcefully kill the MariaDB process, and after that we could bring the MariaDB service back online.
What could be the reason for this failure? If you have faced similar issues, please share your experience and the actions you took to prevent such failures in the future. Your feedback is much appreciated.
Our Environment Details are as follows
Operating system: Red Hat Enterprise Linux 7
Mariadb version: 10.2.34-MariaDB-log MariaDB Server
I also face this issue on an AWS instance (c5a.4xlarge) hosting my database.
Server version: 10.5.11-MariaDB-1:10.5.11+maria~focal
It has already happened 3 times, occasionally. Like you, there was no way to stop the service; only rebooting the machine got it working again.
The logs at restart suggest that some tables crashed and should be repaired.
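One mitigation that is sometimes suggested when the long semaphore waits point into btr0sea.cc (the adaptive hash index code) is disabling the adaptive hash index. I cannot confirm it fixes this particular crash, but it is cheap to try (a sketch):
-- dynamic variable, takes effect immediately
SET GLOBAL innodb_adaptive_hash_index = OFF;
# to persist it across restarts, in my.cnf under [mysqld]:
# innodb_adaptive_hash_index = OFF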

mariadb 10.3.13 table_open_cache problems

I just upgraded MySQL 5.6 to MariaDB 10.3.13. Now, when the server hits open_tables = 2000, my PHP queries stop working; if I do a FLUSH TABLES, it starts working correctly again. This never happened when I was using MySQL; now I can't go a day without having to log in and do a FLUSH TABLES to get things working again.
I use WHM/cPanel to administer my VPS, and with the last WHM release it started warning me that the MySQL version I was running (I really can't remember which version it was - it was whatever was loaded when I got my VPS) was approaching end of life and that I would need to upgrade to MySQL 5.7 or MariaDB xxx. I had been wanting to move to MariaDB for a while anyway, so that is what I did - WHM recommended version 10.3.13.
After some more watching, it appears that what makes open_tables hit the 2000 max is the automatic cPanel backup routines, which back up all of my databases in one go. It doesn't crash anything, it just causes problems with my PHP application connections - I don't think the connections get rejected, they just don't return any data. I turned all of the automatic WHM/cPanel backups off and things have settled down a little.
table_definition_cache 400
table_open_cache 2000
I still do a mysqldump via cron for my database backups - there are only two live databases, and they still make open_tables grow to that 2000 max, just not as fast.
I now run a script every hour to show me some of the variables (a sketch of that check is shown below), and here is what I am seeing:
After a FLUSH TABLES command, both open_tables and open_table_definitions start increasing; once open_table_definitions hits 400 it stops increasing, while open_tables keeps increasing through the day.
Then, when the mysqldumps happen in the early morning hours, open_tables hits 2000 (the max setting) and my PHP queries are not executed.
I do not get a PHP error.
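The exact script doesn't matter much - it boils down to something like this (a simplified sketch; it assumes credentials in ~/.my.cnf, and the log path is a placeholder):
#!/bin/sh
# run hourly from cron: append current table-cache usage to a log
mysql -e "SHOW GLOBAL VARIABLES LIKE 'table_open_cache'; SHOW GLOBAL STATUS LIKE 'Open%table%';" >> /root/table_cache_watch.log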
I ran the following command so that I could see what was happening on the db side.
SET GLOBAL general_log = 'ON'
Looking at the log, when everything is running OK I see my application connecting, preparing the statement, executing the statement and then disconnecting ....
I did the same thing when it started acting up (i.e. when my PHP application stops getting results again).
Looking at the log, I see my application connecting, then preparing the statement, and then, instead of executing the statement, it prepares the same statement 2 more times and disconnects ...
I logged into MySQL and did a FLUSH TABLES command and everything went back to normal - the application connects, prepares a statement, executes it, disconnects ...
But this never happened before I moved to MariaDB. I never messed with the MySQL server settings at all; the only time MySQL was restarted was when I did a CentOS 6 system update and had to reboot the server. It would go months without my touching a thing on the server ...
Looks like that setting was the culprit: I changed table_open_cache to 400 and my PHP application is no longer having any issues preparing statements, even after the nightly database backups. Looking at older MySQL documentation, MySQL 5.6.7 had a default table_open_cache of 400, so when I upgraded to MariaDB 10.3.13 that default changed to 2000, which is when I started having problems.
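For anyone hitting the same thing, the change itself is just the following (a sketch; 400 mirrors the old MySQL 5.6 default, so tune it to your workload):
-- apply immediately without a restart
SET GLOBAL table_open_cache = 400;
# and to make it permanent, in my.cnf under [mysqld]:
# table_open_cache = 400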
Not quite sure what the following is telling me, but it might be of interest:
su - mysql
-bash-4.1$ ulimit -Hn
100
-bash-4.1$ ulimit -Sn
100
-bash-4.1$ exit
logout
[~]# ulimit -Hn
4096
[~]# ulimit -Sn
4096
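For what it's worth, a hard/soft nofile limit of 100 for the mysql shell user looks very low; the limit that actually matters is the one of the running mysqld process (visible in /proc/$(pidof mysqld)/limits), and if that is too low the server can effectively cap table_open_cache. Raising it would look roughly like this (a sketch; whether limits.conf is honoured depends on how the service is started):
# append to /etc/security/limits.conf (or a file in /etc/security/limits.d/):
mysql soft nofile 65535
mysql hard nofile 65535
# and/or in my.cnf, under [mysqld] or [mysqld_safe]:
open_files_limit = 65535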

MariaDB 10.1.33 keeps crashing

I have a standard master/multiple slave set up on CentOS 7 running on EC2.
I actually have three identical slaves (all spawned off the same AMI), but only one is crashing about once a day. I've posted the dump from error.log below, as well as the query log referencing the connection ID in the error dump.
I've tried looking at the MariaDB docs, but all they point to is resolve_stack_dump, with no real help on figuring things out from there (see the note after the query log below).
The slave that is crashing runs many batch-like queries, but according to the dump logs, the last connection ID is never one of the connections running those queries.
For this slave, I have the system turn off slave updates (SQL_THREAD), run queries for 15 minutes, stop the queries, start the slave until it has caught up, stop the slave updates again, and restart the queries. Repeat. This setup had been working pretty much non-stop and crash-free for years in my colo environment, before I moved to AWS.
My other two cloned slaves only run the replication queries as a hot-spare of the master (which I've never needed to use). Those servers never crash.
Thanks.
Error.log crash dump:
180618 13:12:46 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.1.33-MariaDB
key_buffer_size=268431360
read_buffer_size=268431360
max_used_connections=30
max_threads=42
thread_count=11
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 22282919 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4209f1c008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4348db90b0 thread_stack 0x48400
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55c19a7be10e]
/usr/sbin/mysqld(handle_fatal_signal+0x305)[0x55c19a2e1295]
sigaction.c:0(__restore_rt)[0x7f4348a835e0]
sql/sql_class.h:3406(sql_set_variables(THD*, List, bool))[0x55c19a0d2ecd]
sql/sql_list.h:179(base_list::empty())[0x55c19a14bcb8]
sql/sql_parse.cc:2007(dispatch_command(enum_server_command, THD*, char*, unsigned int))[0x55c19a15e85a]
sql/sql_parse.cc:1122(do_command(THD*))[0x55c19a160f37]
sql/sql_connect.cc:1330(do_handle_one_connection(THD*))[0x55c19a22d6da]
sql/sql_connect.cc:1244(handle_one_connection)[0x55c19a22d880]
pthread_create.c:0(start_thread)[0x7f4348a7be25]
/lib64/libc.so.6(clone+0x6d)[0x7f4346e1f34d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0):
Connection ID (thread ID): 15894
Status: NOT_KILLED
Optimizer switch:
index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off
Query Log for connection ID:
180618 13:11:01
15894 Connect ****#piper**** as anonymous on
15894 Query show status
15894 Prepare show full processlist /* m6clone1 */
15894 Execute show full processlist /* m6clone1 */
15894 Close stmt
15894 Query show slave status
15894 Query show variables
15894 Quit
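For anyone else digging into a dump like this: as far as I understand from the docs, resolve_stack_dump is used roughly as follows (paths are placeholders; it is mainly needed when the trace prints raw addresses instead of the symbol names already visible above):
# dump the mysqld symbol table once
nm -n /usr/sbin/mysqld > /tmp/mysqld.sym
# copy the raw stack addresses from the error log into a file, then resolve them
resolve_stack_dump -s /tmp/mysqld.sym -n /tmp/mysqld.stack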
