`ERROR 9001 (HY000): PD server timeout start timestamp may fall behind safepoint` in TiDB - distributed-database

While testing TiDB, I ran into error 9001 and couldn't figure out the cause. How should I deal with it?

Generally, this error occurs when TiDB fails to access PD. A worker in the TiDB background continuously queries the safepoint from PD, and this error is reported if the query does not succeed within 100 seconds.
Possible causes: a PD failure, or a network issue between TiDB and PD. Solution: check the state, monitoring, and logs of the PD server, as well as the network between the TiDB server and the PD server.
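If you want to rule out PD or the network quickly from the TiDB host, one crude check is to hit PD's HTTP health endpoint directly. Below is a minimal sketch in Java, assuming the default PD client port 2379 and that your PD version exposes /pd/api/v1/health; the address is a placeholder.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class PdHealthCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder PD address; 2379 is PD's default client port.
        String pdEndpoint = "http://127.0.0.1:2379";

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();

        // Query PD's health endpoint; a timeout or connection failure here
        // points at the same PD/network problems that cause ERROR 9001.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(pdEndpoint + "/pd/api/v1/health"))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("PD health status: " + response.statusCode());
        System.out.println(response.body());
    }
}

A healthy PD typically answers quickly with HTTP 200; anything else, or a hang, suggests the TiDB-to-PD path itself needs attention.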

Related

Mariadb: MySQL server has gone away

In my application, I have an issue where I receive the "MySQL server has gone away" error during a fairly long-running transaction. I know this has already been asked a lot, but I have tried my best to go through all the possible causes.
The one thing that baffles me a lot, is this error message in the log of the MariaDB server:
[Warning] Aborted connection 6 to db: 'default' user: 'root' host: '10.0.0.18' (Got timeout reading communication packets)
This would explain why the client reports a broken connection, but this error occurs 10-15 minutes before the client reports the "MySQL server has gone away" error. In the meantime, the client is happily running insert statements without an issue. But as soon as the client runs a select statement, the statement fails practically immediately.
I have already checked for these possible causes:
The server was running all the time
wait_timeout is set to 8 hours, which is way longer than the time the transaction needs to fail
max_allowed_packet is set to 512M which should be more than enough since the query is a very short select statement
The server does not run out of memory
I'm pretty sure the issue must be related to the "Got timeout reading communication packets" error from the MariaDB log. But I cannot wrap my head around why the client can still write data, and why this timeout occurs in the first place, given that wait_timeout is so high.
Some system information:
I'm running on MariaDB 10.5.1
The client uses Python 3.6 with mysqlclient (which uses libmysql) for the database connection
I hope maybe some of you have an idea what I should look for, because this is really driving me nuts.
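For diagnosis, one thing that can help is to check whether the connection is already dead right before the failing SELECT and reconnect if so. The setup above uses Python with mysqlclient, but the same idea in JDBC looks roughly like the sketch below; the host, database, and credentials are placeholders, and it assumes MariaDB Connector/J is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ValidatedSelect {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- replace with your own.
        String url = "jdbc:mariadb://127.0.0.1:3306/default";
        Connection conn = DriverManager.getConnection(url, "root", "password");

        // ... long-running inserts would happen here ...

        // If the server has silently dropped the connection (the "Aborted
        // connection ... Got timeout reading communication packets" case),
        // isValid() returns false and we can reconnect before the SELECT.
        if (!conn.isValid(5)) {
            conn.close();
            conn = DriverManager.getConnection(url, "root", "password");
        }

        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
        conn.close();
    }
}

Note that this only detects the broken connection; if it happens inside a single long transaction, the transaction itself has to be restarted after reconnecting.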

How to recover with a retry from gremlin NoHostAvailableException

I am using Gremlin Java driver to connect to a local gremlin server.
Simple code flow
Creating client
Cluster cluster = Cluster.build().addContactPoint(<endp>).port(<port>).enableSsl(false).create();
Client client = cluster.connect();
Submit Script
client.submit("g.V().count()");
If the Gremlin server is down when I submit the first time, the connection still fails to be created on subsequent retries, even after the Gremlin server is brought back up.
Exception First attempt when Gremlin Server is down:
org.apache.tinkerpop.gremlin.driver.exception.NoHostAvailableException: All hosts are considered unavailable due to previous exceptions. Check the error log to find the actual reason
Exception After Gremlin server is brought back up:
org.apache.tinkerpop.gremlin.driver.exception.NoHostAvailableException: All hosts are considered unavailable due to previous exceptions
One thing to note is that I do not recreate the client on retry; I just do
Submit Script
client.submit("g.V().count()");
It is quite possible that the Gremlin server may go down at any time; how do I recover in such circumstances? Fundamentally, is NoHostAvailableException recoverable?
Normally, the Client should attempt to reconnect to a host that was previously marked unavailable. By default, it should retry the host at 1-second intervals, as governed by this configuration: connectionPool.reconnectInterval. In your case, however, I think you've uncovered a bug where the reconnect attempts are never started because the Client was never able to reach the host in the first place. As of 3.4.11, you can only remedy this by recreating the Client, as you noted in your comments. I've created an issue to track this problem here: TINKERPOP-2569
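Until that is fixed, a workaround is to rebuild the Client (and Cluster) whenever a submit fails with NoHostAvailableException. A rough sketch of that retry loop, assuming TinkerPop 3.4.x and placeholder endpoint/port values:

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.ResultSet;
import org.apache.tinkerpop.gremlin.driver.exception.NoHostAvailableException;

public class GremlinRetry {

    // Placeholder connection details.
    private static final String ENDPOINT = "localhost";
    private static final int PORT = 8182;

    private Cluster cluster;
    private Client client;

    private void connect() {
        cluster = Cluster.build()
                .addContactPoint(ENDPOINT)
                .port(PORT)
                .enableSsl(false)
                .create();
        client = cluster.connect();
    }

    public ResultSet submitWithRetry(String gremlin, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                if (client == null) {
                    connect();
                }
                return client.submit(gremlin);
            } catch (NoHostAvailableException e) {
                // The driver never reached the host, so its own reconnect loop
                // may not kick in; tear everything down and rebuild on the next try.
                if (cluster != null) {
                    cluster.close();
                }
                client = null;
                cluster = null;
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(1000L * attempt); // simple backoff between attempts
            }
        }
    }
}

Rebuilding the Cluster is heavier than the driver's own reconnect, but it sidesteps the case above where the reconnect task is never scheduled because the host was unreachable from the start.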

Single Node Artifactory - deploy using AWS ECS fails with current node still available

Maybe I'm just approaching this wrong.
Single Instance mode (non-HA)
AWS-RDS Postgres Database
Deploying via ECS
I currently have Artifactory Pro built into a Docker container and deployed to ECS via CI/CD. The initial deploy goes fine: everything stands up, database migrations occur, and the instance runs.
However, when doing an update to the task, a new task spins up. It then adds entries to access_topology with the new container IP and a unique node id, but they stay unhealthy. The logs then bomb out with failure messages (below, due to the existing heartbeat of the other node).
If I first stop the running task and then start a new task, it spins up properly (probably due to the heartbeat being lost).
In the typical ECS world, the new task is spun up until it's deemed healthy, and then the older task is killed off.
Either scenario creates orphaned NODE records that stay marked healthy -- I'm also trying to figure out how to garbage collect and purge those.
Any thoughts on this?
The errors are below. It appears that the new node won't properly join because of an active heartbeat and because the deployment is not HA. However, I want this node to stand up so I can topple the other one. Thanks.
Cluster join: Successfully joined jfmd#01es5dmfhar6gcy5abyj4rwpkc with node id ip-10-10-3-248.us-XXXX-1.compute.internal
Application could not be initialized: Current Artifactory node last heartbeat is: 1607609142483. Stopping Artifactory since the local server is running as PRO/OSS but found other servers in registry
Error occurred when refreshing domain cache all domain endpoint failed : Fetch domains from http://localhost:8046/distribution/api/v1/events/domains failed (returned 404), Fetch domains from http://localhost:8046/artifactory/api/events/domains failed (returned 404), [domain_client]"
Retry 20 Elapsed 16.84 secs failed: Couldn't access another access peer. [localhost:8046]. Status code: UNAVAILABLE. HTTP status code 503
Status code: UNAVAILABLE. HTTP status code 503
1607609184634,invalid content-type: text/plain; charset=utf-8
1607609184634,"headers: Metadata(:status=503,content-type=text/plain; charset=utf-8,content-length=19,date=Thu, 10 Dec 2020 14:06:24 GMT)"
1607609184634,DATA-----------------------------
1607609184634,Service Unavailable. Trying again
This is not possible without an HA configuration. Since this is not an HA configuration, the application will not start up if there is another application still "alive". In this case, "alive" is defined as having written a heartbeat within the last X seconds (I believe this is 10 by default).

Galera Cluster: Connection refused and gcs_group_handle_join_msg(): 736: Will never receive state. Need to abort

I'm setting up an example Galera cluster using 3 virtual machines with Debian 9 and MariaDB 10.1.
Replication works well using the rsync method; even if a node is dropped, it synchronizes normally when it recovers. The problem arose when I turned off two nodes to see what would happen. Inserting data into the remaining node worked fine, but when I started the other nodes again they threw the following error:
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111 "Connection refused")
And the log of one of the nodes that I try to bring up shows:
[ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():736: Will never receive state. Need to abort.
I have no idea what could happen. I need help to solve this problem. Thank you.
Don't use sockets to talk between VMs. Only use TCP/IP.
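To make that concrete: give the client a host and port so it connects over TCP/IP rather than the local socket, and then you can also ask Galera how many nodes it currently sees. A small JDBC sketch, assuming MariaDB Connector/J and a placeholder node address and credentials:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class GaleraTcpCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder node address; specifying a host and port forces a TCP
        // connection instead of the unix socket (/var/run/mysqld/mysqld.sock).
        String url = "jdbc:mariadb://192.168.56.101:3306/test";

        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             Statement stmt = conn.createStatement();
             // wsrep_cluster_size reports how many nodes this node currently
             // sees in the cluster, which helps confirm whether a restarted
             // node actually rejoined.
             ResultSet rs = stmt.executeQuery("SHOW STATUS LIKE 'wsrep_cluster_size'")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " = " + rs.getString(2));
            }
        }
    }
}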

network error (Tcp error)

I am inside a network where I need proxy settings to access the internet.
I have a weird problem: the internet is working fine in general, but there is one particular case where I get this error:
Network Error (tcp_error)
A communication error occurred: "Operation timed out"
The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time.
For assistance, contact your network support team.
This happens when I use Hadoop in local mode.
I can access the UI and see the jobs running, but when I try to view the logs of each task, I am not able to access them.
UI --> job --> map --> task --> all <-- this is where the error is.
Any clues?
Thanks
I'm not sure exactly what your TCP action is, or about Hadoop or your proxy setup, but if you can reliably repeat the error, the timeout happens at approximately the same time on each test, and that time is on the order of minutes, my guess would be that you have a genuine processing delay at the server (perhaps caused by blocking somewhere), though not necessarily.
