Galera Cluster Connection refused and gcs_group_handle_join_msg (): 736: Will never receive state. Need to abort - mariadb

I'm setting up a test Galera cluster using 3 virtual machines running Debian 9 and MariaDB 10.1.
Replication works well with the rsync method: even if a node drops out, it resynchronizes normally when it recovers. The problem arose when I turned off two nodes to see what would happen. Inserting data on the remaining node worked fine, but when I started the other nodes again, I got the following error:
ERROR 2002 (HY000): Can not connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111 "Connection refused")
And the log of one of the nodes that I am trying to bring up shows:
[ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():736: Will never receive state. Need to abort.
I have no idea what could be causing this. I need help to solve this problem. Thank you.

Don't use sockets to talk between VMs. Only use TCP/IP.
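As a rough way to confirm that the nodes can reach each other over TCP, the sketch below attempts a TCP connect to the ports a Galera node uses by default (3306 for clients, 4567 for group communication, 4568 for IST, 4444 for SST). The name check_ports is made up for illustration, and the port numbers are the defaults, so treat them as an assumption if your configuration overrides them. Ports 4444 and 4568 are typically only open while a state transfer is running, so "unreachable" there is not necessarily a problem.

```python
import socket

# Default ports of a Galera node (assumption: your config does not override them).
GALERA_PORTS = {
    3306: "MySQL client connections",
    4567: "Galera group communication",
    4568: "incremental state transfer (IST)",
    4444: "state snapshot transfer (SST)",
}


def check_ports(host, ports, timeout=3.0):
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

Calling check_ports("192.168.1.11", GALERA_PORTS) from each VM, with the placeholder address replaced by another node's IP, gives a quick picture of which ports are reachable in each direction.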

Related

Mariadb: MySQL server has gone away

In my application, I receive the "MySQL server has gone away" error during a fairly long-running transaction. I know this has been asked many times, but I tried my best to go through all possible causes.
The one thing that baffles me a lot, is this error message in the log of the MariaDB server:
[Warning] Aborted connection 6 to db: 'default' user: 'root' host: '10.0.0.18' (Got timeout reading communication packets)
This would explain why the client reports a broken connection, but this warning is logged 10-15 minutes before the client reports the "MySQL server has gone away" error. In the meantime, the client happily runs insert statements without an issue. But as soon as the client runs a select statement, the statement fails practically immediately.
I have already checked for these possible causes:
The server was running all the time
wait_timeout is set to 8 hours, which is way longer than the time the transaction needs to fail
max_allowed_packet is set to 512M which should be more than enough since the query is a very short select statement
The server does not run out of memory
I'm pretty sure the issue must be related to the "Got timeout reading communication packets" error from the MariaDB log. But I cannot wrap my head around why the client still can write data. And why this timeout occurs in the first place, since the wait_timeout is super high.
Some system information:
I'm running on MariaDB 10.5.1
The client uses Python 3.6 with mysqlclient (which uses libmysql) for the database connection
I hope maybe some of you have an idea what I should look for, because this is really driving me nuts.

How to recover with a retry from gremlin NoHostAvailableException

I am using Gremlin Java driver to connect to a local gremlin server.
Simple code flow
Creating client
Cluster cluster = Cluster.build().addContactPoint(<endp>).port(<port>).enableSsl(false).create();
Client client = cluster.connect();
Submit Script
client.submit("g.V().count()");
If the Gremlin server is down when I submit for the first time, subsequent retries still fail to connect even after the Gremlin server has been brought back up.
Exception First attempt when Gremlin Server is down:
org.apache.tinkerpop.gremlin.driver.exception.NoHostAvailableException: All hosts are considered unavailable due to previous exceptions. Check the error log to find the actual reason
Exception After Gremlin server is brought back up:
tinkerpop.gremlin.driver.exception.NoHostAvailableException: All hosts are considered unavailable due to previous exceptions
One thing to note is that I do not create a new client on retry; I just do
Submit Script
client.submit("g.V().count()");
It is quite possible that the Gremlin server may go down at any time. How can I recover in such circumstances? Fundamentally, is NoHostAvailableException recoverable?
Normally, the Client should attempt to reconnect to a host that was previously marked unavailable. By default, it should retry the host at 1-second intervals, as governed by the connectionPool.reconnectInterval setting. In your case, however, I think you've uncovered a bug where the reconnect attempts are never started because the Client was never able to reach the host in the first place. As of 3.4.11, you can only remedy this by recreating the Client, as you noted in your comments. I've created an issue to track this problem: TINKERPOP-2569
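The workaround of recreating the Client on failure can be sketched as a small wrapper. This is written in Python against a stand-in client rather than the Java driver: make_client is a hypothetical zero-argument factory for whatever builds your Cluster/Client pair, and RecreatingClient is an illustrative name, not part of TinkerPop.

```python
import time


class RecreatingClient:
    """Wraps a client and rebuilds it from scratch after any failed submit."""

    def __init__(self, make_client, attempts=5, delay=1.0, sleep=time.sleep):
        self._make_client = make_client  # hypothetical factory, see lead-in
        self._attempts = attempts
        self._delay = delay
        self._sleep = sleep
        self._client = None

    def _discard(self):
        # Drop the current client so the next attempt builds a new one.
        client, self._client = self._client, None
        if client is not None:
            try:
                client.close()
            except Exception:
                pass  # the client is already broken; nothing more to do

    def submit(self, script):
        last_error = None
        for attempt in range(1, self._attempts + 1):
            try:
                if self._client is None:
                    self._client = self._make_client()
                return self._client.submit(script)
            except Exception as exc:  # real code should catch the driver's own exception
                last_error = exc
                self._discard()
                if attempt < self._attempts:
                    self._sleep(self._delay)
        raise last_error
```

A healthy client is reused across calls; only a failed submit triggers a rebuild, so the cost of recreating the connection is paid only when the server was actually unreachable.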

Single Node Artifactory - deploy using AWS ECS fails with current node still available

Maybe I'm just approaching this wrong.
Single Instance mode (non-HA)
AWS-RDS Postgres Database
Deploying via ECS
Currently have Artifactory-Pro building a docker container and deploying to ECS via CI/CD. The initial deploy goes fine. Everything stands up, database migrations occur, and the instance runs.
However, when doing an update to the task, a new task spins up. It then adds entries to the access_topology with the new container-ip and unique node-id, but they stay unhealthy. The logs just then bomb out with failure messages (below - due to existing heartbeat of other node).
If I first stop the running task, and start a new task, it spins up properly (Probably due to heartbeat loss).
In a typical ECS deployment, the new task is spun up until it's deemed healthy, and then the older task is killed off.
Either scenario creates orphaned NODE records that stay healthy -- trying to also figure out how to garbage collect on those and purge.
Any thoughts on this?
Errors are below. It appears that the new node won't properly join because of the other node's active heartbeat and because this is not an HA setup. However, I want this node to stand up so I can topple the other. Thanks.
Cluster join: Successfully joined jfmd#01es5dmfhar6gcy5abyj4rwpkc with node id ip-10-10-3-248.us-XXXX-1.compute.internal
Application could not be initialized: Current Artifactory node last heartbeat is: 1607609142483. Stopping Artifactory since the local server is running as PRO/OSS but found other servers in registry
Error occurred when refreshing domain cache all domain endpoint failed : Fetch domains from http://localhost:8046/distribution/api/v1/events/domains failed (returned 404), Fetch domains from http://localhost:8046/artifactory/api/events/domains failed (returned 404), [domain_client]"
Retry 20 Elapsed 16.84 secs failed: Couldn't access another access peer. [localhost:8046]. Status code: UNAVAILABLE. HTTP status code 503
Status code: UNAVAILABLE. HTTP status code 503
1607609184634,invalid content-type: text/plain; charset=utf-8
1607609184634,"headers: Metadata(:status=503,content-type=text/plain; charset=utf-8,content-length=19,date=Thu, 10 Dec 2020 14:06:24 GMT)"
1607609184634,DATA-----------------------------
1607609184634,Service Unavailable. Trying again
This is not possible without an HA configuration. Since this is not an HA configuration, the application will not start up if there is another application still "alive". In this case, "alive" is defined as having written the heartbeat within X amount of seconds (I believe this is 10 by default).
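The "alive" rule described above can be sketched as a simple timestamp comparison. This is not Artifactory's code: is_node_alive is an illustrative name, and the 10-second default is only the guess from the answer, so both are assumptions. The heartbeat is taken to be epoch milliseconds, as in the logged value 1607609142483.

```python
import time


def is_node_alive(last_heartbeat_ms, now_ms=None, stale_after_s=10.0):
    """True if the last heartbeat (epoch milliseconds) is recent enough.

    stale_after_s defaults to 10 seconds, which is an assumption.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - last_heartbeat_ms) <= stale_after_s * 1000
```

Under this rule, a new task started while the old one is still writing heartbeats always sees a live peer, which matches the observation that the new task only starts cleanly once the old task has been stopped.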

IBM MQ :: Remote Configuration - Can't Start Sender Channel

I am working with IBM MQ. I managed to get a basic Handshake / Put Message(s) / Get Message(s) / Disconnect .net solution going on, a couple of days ago, but it only works on a local level, and I now need to update the solution so it works remotely as well.
After reading and experimenting for a while, I decided to follow IBM Knowledge Center's Point to Point scenario step by step. However, I can't start the Sender Channel as instructed by the guide's last step. The Sender Channel's status ping-pongs between Binding and Retrying, and the logs show the error codes AMQ9002, AMQ9202 and AMQ9999, meaning, as far as I can tell, that there is some kind of trouble finding and/or connecting to the host.
I have looked through a lot of questions about these errors in particular, and I have followed most of the proposed solutions: I made sure the Receiver's listener is running, I tried turning off firewalls, I tried different ports, I ran Telnet tests, I stopped/restarted/resolved the Sender channel a few times, and I tried setting this up from both the command line and MQ Explorer. Even so, I have yet to get successful communication going between two different PCs.
I am aware the error could be either temporary or the result of problems within the network itself, but I have been trying to establish a successful connection for almost three days now, and before I pass this on to my bosses I would like to make sure I have exhausted every other possibility.
How can I complete IBM's Point To Point set up guide, or is there anything that could point me towards a different / better approach to get two PCs talking with each other via IBM MQ v9?
Although hastily translated from Japanese, you can find the detailed error logs below.
2017/09/19 17:34:09 - Process (234212.1) User (MUSR_MQADMIN) Program (runmqchl.exe)
Host (DESKTOP-UP4D363) Installation (Installation1)
VRMF (9.0.3.0) QMgr (QM1)
Time (2017-09-19T08:34:09.201Z)
AMQ9002: Channel 'TO.QM2' is starting.
Description: Channel 'TO.QM2' is starting.
ACTION: None.
2017/09/19 17:34:30 - Process (234212.1) User (MUSR_MQADMIN) Program (runmqchl.exe)
Host (DESKTOP-UP4D363) Installation (Installation1)
VRMF (9.0.3.0) QMgr (QM1)
Time (2017-09-19T08:34:30.824Z)
AMQ9202: The remote host 'DESKTOP-1AV4LM3 (The correct ip address) (1415)' can not be used. Please try again later.
Description: The attempt to allocate a conversation using TCP/IP to host 'DESKTOP-1AV4LM3 (The correct ip address) (1415)' for channel TO.QM2 did not succeed. However, the error may be temporary, and it may be possible to allocate a TCP/IP conversation normally later. If the remote host cannot be determined, '????' is displayed.
ACTION: Please try the connection later. If the failure persists, record the error values and contact the system administrator. The return code from TCP/IP is 10060 (X'274C'). The cause of this failure may be that the host cannot reach the destination host. Alternatively, the listener on host 'DESKTOP-1AV4LM3 (The correct ip address) (1415)' may not be running. If that is the case, start the listener and try again.
2017/09/19 17:34:30 - Process (234212.1) User (MUSR_MQADMIN) Program (runmqchl.exe)
Host (DESKTOP-UP4D363) Installation (Installation1)
VRMF (9.0.3.0) QMgr (QM1)
Time (2017-09-19T08:34:30.825Z)
AMQ9999: Channel 'TO.QM2' for host 'DESKTOP-1AV4LM3 (1415)' terminated abnormally.
Description: The host 'DESKTOP-1AV4LM3 (1415)' cannot be determined.
ACTION: Check the error log for the preceding error message for this channel program. Please determine the cause of failure....
The 'interesting' bit of the error messages above is that the sender is attempting to start a channel to port 1415 on the destination and is getting a 10060 return code (WSAETIMEDOUT). This is different from an immediate rejection, which is what you would get if, for example, the other end didn't have a socket open.
You will also note that it's timing out after about 21 seconds, if your times are to be believed. The only time I've seen this kind of thing is with DNS resolution. There was an APAR, for example, showing that reverse DNS can cause delays in channel startup, and this could be for a successful or an unsuccessful startup:
http://www-01.ibm.com/support/docview.wss?uid=swg1IC96408
A new attribute was added to MQ to disable reverse DNS lookups if that is the cause. See https://www.ibm.com/support/knowledgecenter/en/SSFKSJ_8.0.0/com.ibm.mq.pro.doc/q113120_.htm#q113120___chlauth
If this is the case, on the receiving end (or both!) try runmqsc, 'ALTER QMGR REVDNS(DISABLED)'. You might have to restart the qmgr for it to take effect (I'm not sure, sorry).
I'd also echo the comment added to your question by JoshMc: check the receiving end's logs for messages (both the global error logs and, more likely, the qmgr-specific AMQERR01.LOG files) when this occurs. I have a feeling that the timeout is only part of your problem.
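One rough way to see whether reverse DNS is slow on a given machine is to time a lookup outside of MQ. The sketch below uses Python's standard socket.gethostbyaddr; time_reverse_lookup is an illustrative name, and the result only reflects the operating system's resolver, which is an assumption about what MQ would experience rather than a measurement of MQ itself.

```python
import socket
import time


def time_reverse_lookup(ip_address):
    """Return (seconds_elapsed, hostname_or_None) for a reverse DNS lookup."""
    start = time.monotonic()
    try:
        hostname = socket.gethostbyaddr(ip_address)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        hostname = None
    return time.monotonic() - start, hostname
```

Running this on the receiving machine with the sender's IP address (and the other way round) would show whether a lookup takes many seconds or fails outright, which would make REVDNS(DISABLED) worth trying.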

Unable to execute RSeval() in remotely connected client in R

I was executing RSeval(c,"4+5"); and encountered :
Error in RSeval(c,"4+5") : remote evaluation failed
This happened on a remote machine connected to a Linux server that is running the Rserve daemon; c is the connection object. The connection itself was successful.
Please share your insights. Thanks in advance.
That could be many things, e.g. if Rserve requires authentication. If that's the case, you can use RS.connect and RS.login from the more modern implementation of RSclient (see http://cran.r-project.org/web/packages/RSclient/RSclient.pdf).
