Intermittent 'Can't connect to MySQL server on <mariadb server>' - mariadb

I am getting an intermittent error message as shown in the subject:
Can't connect to MySQL server on <mariadb server>
where <mariadb server> is a server on the network.
There are only two webservers that connect to this DB server, and when this happens, they both give this error message.
All three machines (the webservers and the mariadb server) are CentOS 7 VMs, hosted using KVM. The hosts all connect directly into the same switch.
I have set the max_connections for the mariadb server to 8192:
MariaDB [(none)]> show variables where variable_name = 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 8192  |
+-----------------+-------+
1 row in set (0.00 sec)
/proc/sys/net/core/somaxconn on the mariadb server is set to 10000
The mariadb server uses a filesystem hosted on two SATA SSDs in a RAID 0 configuration.
"sar" shows that the mariadb server rarely exceeds 10% %user, 10% %system, 3% %iowait, or 0.1% %steal.
Overall, it doesn't appear to be resource-limited. There are no "CPU X stuck for Y seconds" messages in the logs.
There are no firewall issues (iptables and nftables are not configured).
The max_connections was set to 200 before, so I think 8192 should be sufficient.
Note that this only occurs in some circumstances, which probably result in a higher number of concurrent connections; normally everything works fine (and was working fine with 200 max_connections).
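One way to confirm whether the connection limit itself is being exhausted is to sample the server's connection counters while the error is occurring. A minimal sketch, assuming a local root login works; the 5-second interval and the /tmp log path are arbitrary placeholder choices:
while true; do
  date
  mysql -N -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL STATUS LIKE 'Max_used_connections'; SHOW GLOBAL STATUS LIKE 'Aborted_connects';"
  sleep 5
done >> /tmp/mariadb-conn-watch.log
If Max_used_connections stays far below 8192 while the errors appear, the bottleneck is more likely the TCP accept queue or the network path than max_connections itself.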

Related

mariadb slow query not logged

As the title suggests, nothing is written to the slow query log file even though the relevant settings are in place.
slow_query_log_file = /var/log/mysql/mariadb-slow.log
slow_query_log = 1
long_query_time = 1
log_slow_rate_limit = 1000
log_slow_verbosity = query_plan
log-queries-not-using-indexes
This is the relevant content of the MariaDB configuration file.
When I open the log file, it contains only the header lines:
Tcp port: 3306 Unix socket: /run/mysqld/mysqld.sock
Time Id Command Argument
logrotate seems to work fine.
After connecting to mysql, I ran select sleep(); to generate a slow query, but it did not work as expected.
The query returned 0, which looks normal, yet nothing was written to the log.
Why wouldn't it work?
New values in the configuration file are only picked up when the MariaDB server is restarted, so the solution, as mentioned in the comment, is to restart the MariaDB server instance.
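For example, on a systemd-based system, something like the following restart-and-verify sequence should do it (the service name mariadb is the usual one, but it can differ between distributions):
$ sudo systemctl restart mariadb
$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'slow_query%'; SHOW GLOBAL VARIABLES LIKE 'long_query_time';"
Note that these variables are also dynamic, so SET GLOBAL slow_query_log = 1; would enable logging on a running server; the config file is then only needed to keep the setting across restarts.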

Connection string for MariaDB

I'm running CentOS v7.9 with MariaDB v5.5.68. I'm trying to access the MariaDB databases from a Win10 machine using Visual Studio Code with SQLTools & MySQL/MariaDB extensions.
I have configured MariaDB for remote access per this link: Configuring MariaDB for Remote Client Access
[mysqld]
skip-networking=0
skip-bind-address
I created the users and added the privileges - tested by logging in locally with 'bob' and viewing permissions in mysql.user. (BTW, in case not readily apparent, the UID, host, and PWD aren't real.)
CREATE USER 'bob'@'1.2.3.%' IDENTIFIED BY 'myPWD';
GRANT ALL PRIVILEGES ON *.* TO 'bob'@'1.2.3.%' IDENTIFIED BY 'myPWD';
However, when I try to log in remotely (from another Linux box) using mysql -u userID -h hostIP -p, I get the error:
ERROR 2003 (HY000): Can't connect to MySQL server on '1.2.3.4' (110)
When I try to make the database connection using VS Code, SQLTools tells me I've connected, but it won't show any tables, I'm not able to make any queries, and I get this error: Request connection/GetChildrenForTreeItemRequest failed with message: Handshake inactivity timeout.
I have reviewed this SO page and others, but still can't get the connection to work.
UPDATED for clarity - provides mysql.user and netstat info:
MariaDB [(none)]> select user, host from mysql.user;
+------+-------------+
| user | host        |
+------+-------------+
| bob  | 10.0.2.15   |   # Can't connect
| rob  | 127.0.0.1   |   # Logs in locally via command line
| root | 127.0.0.1   |   # Logs in locally via command line
| bob  | 192.168.0.% |   # Can't connect
| root | 192.168.0.% |   # Can't connect
| root | ::1         |   # Logs in locally via command line
| rob  | localhost   |   # Logs in locally via command line
| root | localhost   |   # Logs in locally via command line
+------+-------------+
8 rows in set (0.00 sec)
$ > netstat -tulpen
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name
tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 27 33813 -
Any help is much appreciated as I've been working this problem for 2+ days and have not made any headway.
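Error (110) is a TCP connection timeout, and since netstat shows MariaDB already listening on 0.0.0.0:3306, that usually points at a host firewall silently dropping the packets rather than at MariaDB itself. A minimal check-and-open sketch, assuming the stock CentOS 7 firewalld setup (adjust the zone if you use a non-default one):
$ sudo firewall-cmd --state
$ sudo firewall-cmd --list-all
$ sudo firewall-cmd --permanent --add-port=3306/tcp
$ sudo firewall-cmd --reload
From the client, nc -zv 1.2.3.4 3306 helps distinguish the cases: "connection refused" means the packets reach the host but nothing accepts them, while a long hang ending in a timeout means something on the path is dropping them.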

airflow webserver suddenly stopped after long time of no issues, "No response from gunicorn"

I have had an airflow webserver -D daemon process (v1.10.7) running on a machine (CentOS 7) for a long time. Suddenly I saw that the webserver could no longer be accessed, and checking the airflow-webserver.log I saw...
[airflow#airflowetl airflow]$ cat airflow-webserver.log
2020-10-23 00:57:15,648 ERROR - No response from gunicorn master within 120 seconds
2020-10-23 00:57:15,649 ERROR - Shutting down webserver
(nothing of note in airflow-webserver.err)
[airflow#airflowetl airflow]$ cat airflow-webserver.err
/home/airflow/.local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
The airflow.cfg values for the webserver section look like...
[webserver]
# The base url of your website as airflow cannot guess what domain or
# cname you are using. This is used in automated emails that
# airflow sends to point links to the right web server
#base_url = http://localhost:8080
base_url = http://airflowetl.co.local:8080
# The ip specified when starting the web server
web_server_host = 0.0.0.0
# The port on which to run the web server
web_server_port = 8080
# Paths to the SSL certificate and key for the web server. When both are
# provided SSL will be enabled. This does not change the web server port.
web_server_ssl_cert =
web_server_ssl_key =
# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = 120
# Number of seconds the gunicorn webserver waits before timing out on a worker
#web_server_worker_timeout = 120
web_server_worker_timeout = 300
# Number of workers to refresh at a time. When set to 0, worker refresh is
# disabled. When nonzero, airflow periodically refreshes webserver workers by
# bringing up new ones and killing old ones.
worker_refresh_batch_size = 1
# Number of seconds to wait before refreshing a batch of workers.
worker_refresh_interval = 30
# Secret key used to run your flask app
secret_key = my_key
# Number of workers to run the Gunicorn web server
workers = 4
# The worker class gunicorn should use. Choices include
# sync (default), eventlet, gevent
worker_class = sync
Ultimately, I just restarted the process as a daemon again with airflow webserver -D (should I have deleted the old airflow-webserver.log and .err files first?), but I'm not sure what would make this happen, since it had been running for months before this without any problems.
Could anyone with more experience explain what could have happened after all this time and how I could prevent it in the future? Are there any issues with running DAGs, or anything else I should check for, that this temporary unexpected shutdown of the webserver may have caused?
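For what it's worth, if it happens again it may help to capture the state of the gunicorn processes before restarting; a rough sketch, assuming the default pid-file names under AIRFLOW_HOME (they may differ in your setup), with stale pid files removed only if the old processes are really gone:
$ ps -ef | grep -E 'gunicorn|airflow webserver' | grep -v grep
$ cat $AIRFLOW_HOME/airflow-webserver.pid $AIRFLOW_HOME/airflow-webserver-monitor.pid 2>/dev/null
$ airflow webserver -D
Deleting the old airflow-webserver.log and .err files beforehand is not required for the restart to work.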
I am experiencing the same issue, and it only started occurring (very infrequently) after I changed the following two config parameters in the webserver section.
worker_refresh_interval = 120
workers = 2
However, my parameters are also set quite differently from yours, so I will share them here.
rbac = True
web_server_host = 0.0.0.0
web_server_port = 8080
web_server_master_timeout = 600
web_server_worker_timeout = 600
default_ui_timezone = Europe/Amsterdam
reload_on_plugin_change = True
Comparing the two configs: the two parameters I changed are still at their defaults in yours (as they were in mine before I changed them), so it seems a combination of more parameters is involved.

Understanding Docker container resource usage

I have a server running Ubuntu 16.04 with Docker 17.03.0-ce running an Nginx container. That server also has ConfigServer Security & Firewall installed. Shortly after starting the Nginx container I start receiving emails about "Excessive resource usage" with the following details:
Time: Fri Mar 24 00:06:02 2017 -0400
Account: systemd-timesync
Resource: Process Time
Exceeded: 1820 > 1800 (seconds)
Executable: /usr/sbin/nginx
Command Line: nginx: worker process
PID: 2302 (Parent PID:2077)
Killed: No
I fully understand that I can add exe:/usr/sbin/nginx to csf.pignore to stop these email alerts but I would like to understand a few things first.
Why is the "systemd-timesync" account being reported? That does not seem to have anything to do with Docker.
Why does the host machine seem to be reporting the excessive resource usage (the extended process time) when that is something running in the container?
Why do the other Docker containers, which are not running Nginx, not result in excessive resource usage emails?
I'm sure there are other questions but basically, why is this being reported the way it is being reported?
I can at least answer the first two questions:
Unlike real VMs, Docker containers are simply a collection of processes run under the host system kernel. They just have a different view on certain system resources, including their own file hierarchy, their own PID namespace and their own /etc/passwd file. As a result, they will still show up if you ps aux on the host machine.
The nginx container's /etc/passwd includes a user 'nginx' with UID 104 that runs the nginx worker process. However, in the host's /etc/passwd, UID 104 might belong to a completely different user, such as systemd-timesync.
As a result, if you run ps aux | grep nginx in the container, you might see
nginx 7 0.0 0.0 32152 2816 ? S 11:20 0:00 nginx: worker process
while on the host, you see
systemd-timesync 22004 0.0 0.0 32152 2816 ? S 13:20 0:00 nginx: worker process
even though both are the same process (also note the different PID namespaces; in containers, PIDs are counted from 1 again).
As a result, container processes will still be subject to ConfigServer's resource monitoring, but they might show up with random, or even non-existent user accounts.
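One way to see this mapping directly is to compare the UID inside the container with what the host resolves that UID to; a quick sketch, where <nginx-container> is a placeholder for your container's name and UID 104 is taken from the example above:
$ docker exec <nginx-container> id nginx      # inside the container: uid=104(nginx) ...
$ getent passwd 104                           # on the host: may resolve to systemd-timesync
$ docker top <nginx-container>                # host-side view of the container's processes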
As to why nginx triggers the emails and other containers don't, I can only assume that nginx is the only one of your containers that crosses ConfigServer's resource thresholds.

JNDI over HTTP on JBoss 4.2.3GA

I've got a remote server on eapps.com that I'm using as my "production" server. I have my own computer at home that I'm using as my "development" server. I'm trying to use JNDI over HTTP to do some batch processing. The following works at home, but not on the eapps machine.
I'm connecting to some EJBs (stateless session), and have my jndi.properties set to this:
(this is for the eapps machine)
java.naming.factory.initial=org.jboss.naming.HttpNamingContextFactory
java.naming.provider.url=http://my.prodhost.com:8080/invoker/JNDIFactory
java.naming.factory.url.pkgs=org.jboss.naming.client:org.jnp.interfaces
# timeout is in milliseconds
jnp.timeout=15000
jnp.sotimeout=15000
jnp.maxRetries=3
(this is for my machine at home)
java.naming.factory.initial=org.jboss.naming.HttpNamingContextFactory
java.naming.provider.url=http://localhost:8080/invoker/JNDIFactory
java.naming.factory.url.pkgs=org.jnp.interfaces
java.naming.factory.url.pkgs=org.jboss.naming.client
# timeout is in milliseconds
jnp.timeout=15000
jnp.sotimeout=15000
jnp.maxRetries=3
As I said, it works at home, but when I try it remotely, I get:
Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://my.prodhost.com:4446//?dataType=invocation&enableTcpNoDelay=true&marshaller=org.jboss.invocation.unified.marshall.InvocationMarshaller&socketTimeout=600000&unmarshaller=org.jboss.invocation.unified.marshall.InvocationUnMarshaller]
...
Caused by: java.net.ConnectException: Connection timed out: connect
Am I doing something wrong here, or is it possibly a firewall issue? To the best of my knowledge, port 4446 is not blocked.
Are the differences in the jndi.properties intentional (at the java.naming.factory.url.pkgs property level)?
Also, can you run a netstat -a | grep 4446 on both machines and update the question with the output?
Update: If the netstat command didn't return anything for port 4446 (JBoss was running, right?), then the JBoss Remoting Connector for the UnifiedInvoker service is very likely not listening on your eApps host, hence the connection timeout. Maybe this service has been disabled by eApps, you should contact the support and discuss this with them.
Just in case, a sample Connector configuration can be found in the jboss-service.xml under the server node's conf directory. Maybe compare the remote one (if you have access to it) with your local file to confirm this (but if it's disabled, there must be a reason; discuss it with the support).
And by the way, this is what I get when I run the netstat command with JBoss 4.2.3.GA started on my GNU/Linux machine (default configuration):
$ netstat -a | grep 4446
tcp 0 0 localhost:4446 *:* LISTEN
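To rule out a network-level block from the client side as well, a quick reachability test against the remoting port may help (hostname taken from the error message; nc may need to be installed first):
$ nc -zv my.prodhost.com 4446
A "connection refused" would mean the host is reachable but nothing is listening on 4446, while a hang ending in a timeout suggests the port is filtered somewhere between the two machines. Combined with the server-side netstat above, that tells you whether the connector is down or the traffic is being blocked.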
