How to recover with a retry from gremlin NoHostAvailableException - gremlin

I am using Gremlin Java driver to connect to a local gremlin server.
Simple code flow
Creating client
Cluster cluster = Cluster.build().addContactPoint(<endp>).port(<port>).enableSsl(false).create()
Client client = cluster.connect();
Submit Script
client.submit("g.V().count()");
If when i submit the first time the Gremlin server is down, on subsequent retries after bringing back gremlin server, connection still fails to create.
Exception First attempt when Gremlin Server is down:
org.apache.tinkerpop.gremlin.driver.exception.NoHostAvailableException: All hosts are considered unavailable due to previous exceptions. Check the error log to find the actual reason
Exception After Gremlin server is brought back up:
tinkerpop.gremlin.driver.exception.NoHostAvailableException: All hosts are considered unavailable due to previous exceptions
One thing to note is i do not create client on retry just do
Submit Script
client.submit("g.V().count()");
It is quite possible that Gremlin server may go down anytime, how to recover in such circumstances. Fundamentally is
NoHostAvailableExceptio
recoverable?

Normally, the Client should attempt to reconnect to a host that is previously marked unavailable. By default, it should be retrying the host at 1 second intervals as governed by this configuration: connectionPool.reconnectInterval. In your case, however I think you've uncovered a bug where the reconnect attempts are not started because the Client was never able to reach the host in the first place. As of 3.4.11, you can only remedy this by recreating the Client as you noted in your comments. I've created an issue to track this problem here: TINKERPOP-2569

Related

Continuously running send pipeline instance

An instance of a BizTalk send pipeline has started to run continuously. On 09/12/2021 an attempt was made to send a file via SFTP, which retried several times but ultimately failed due to a network issue. The error from the event logs is:
The adapter failed to transmit message going to send port "Deliver Outgoing - SFTP" with URL "sftp://xxx.xxxxxx.co.nz:22/To_****/%SourceFileName%". It will be retransmitted after the retry interval specified for this Send Port. Details:"WinSCP.SessionRemoteException: Network error: Software caused connection abort.
For some reason BizTalk made another send attempt at 1:49pm on 10/12/2021 which succeeded as confirmed by the administrator of the SFTP site. Despite this, BizTalk continued making intermittent send attempts and the pipeline instance is still running. The same file has been sent 4 times to the SFTP server.
The pipeline instance in theory should have suspended at 9:47pm on 09/12/2021. I have been able to confirm definitively whether anybody resumed it, but it seems unlikely at this stage. In any case, after sending successfully the pipeline instance should have terminated and should not be re-executing intermittently.
Does anybody know what could account for this behaviour? This is occurring on BTS2020 with CU2 applied.
I've sent messages over SFTP where the WinSCP interpretation of the date-modified attribute doesn't work with a specific type of SFTP server.
With the WinSCP GUI a dialogue box appears and you can disregard this error, but this option isn't available with BizTalk's GUI. This error appears when a file with the same filename already exists on the server and is supposed to be overwritten.
My solution was to create a pipeline component that removed %SourceFileName% on the server. The pipeline component (just like WinSCP GUI) can disregard the modified-date.

handle server shutdown while serving http request

Scenario : The server is in middle of processing a http request and the server shuts down. There are multiple points till where the code has executed. How are such cases typically handled ?. A typical example could be that some downstream http calls had to be made as a part of the incoming http request. How to find whether such calls were made or not made when the shutdown occurred. I assume that its not possible to persist every action in the code flow. Suggestions and views are welcome.
There are two kinds of shutdowns to consider here.
There are graceful shutdowns: when the execution environment politely asks your process to stop (e.g. systemd sends a SIGTERM) and expects it to exit on its own. If your process doesn’t exit within a few seconds, the environment proceeds to kill the process in a more forceful way.
A typical way to handle a graceful shutdown is:
listen for the signal from the environment
when you receive the signal, stop accepting new requests...
...and then wait for all current requests to finish
Exactly how you do this depends on your platform/framework. For instance, Go’s standard net/http library provides a Server.Shutdown method.
In a typical system, most shutdowns will be graceful. For example, when you need to restart your process to deploy a new version of code, you do a graceful shutdown.
There can also be unexpected shutdowns: e.g. when you suddenly lose power or network connectivity (a disconnected server is usually as good as a dead one). Such faults are harder to deal with. There’s an entire body of research dedicated to making distributed systems robust to arbitrary faults. In the simple case, when your server only writes to a single database, you can open a transaction at the beginning of a request and commit it before returning the response. This will guarantee that either all the changes are saved to the database or none of them are. But if you call multiple downstream services as part of one upstream HTTP request, you need to coordinate them, for example, with a saga.
For some applications, it may be OK to ignore unexpected shutdowns and simply deal with any inconsistencies manually if/when they arise. This depends on your application.

Sandbox error after multiple flex clients from one http session

Dealing with a Flex client that needs to connect through BlazeDS to a Java backend. Currently, we have a requirement for fifty-plus clients to be attached at any given time. We need to test the load of this requirement against the server to see if we are going to have any performance issues.
So, I have written a client emulator that will act like real clients and connect to the server though Blazeds. The emulator was to make this test require less hardware for I would connect 50 client from one machine (or two 25, basically minimize hardware needs). Instead of having fifty different machines to run clients on. The issue I am running into is a limitation of emulated clients allowed from one session. There seems to be a five client limit. This goes for both IExplorer and FireFox browsers. The problem is with the JMS subscriptions. The JMS topic seem to get connected but never subscribed.
I played around with some settings on the BlazeDS server side to no avail.
- max-streaming-connections-per-session
- max-streaming-clients
After the sixth connection I start getting a
16:18:21.578 [ERROR] com.ray.sv.flex.util.SocketLogTarget SocketLogTarget failed with SecurityError: [SecurityErrorEvent type="securityError" bubbles=false cancelable=false eventPhase=2 text="Error #2048: Security sandbox violation: http://xxx.xxx.xxx.xxx:8080/ClientEmulator/clientemulator.swf cannot load data from 127.0.0.1:1337."]
Very strange, but what is stranger is that client needs to recover from server disconnects. Which was a totally different issue all together with the remote object calls and subscribing timing issues, kept getting Duplicate Session Id errors. But now I have that ironed out, but I see the same issue when the client fails a connection five times.
The sixth time always fails with the same error. And this is with only one client connected.
Has anyone seen this same issue?
Thanks for your time.

FMS server died daily while enable rtmfp

I have a fms 3 server runing a video chat room application. It goes well except everyday it will die once or twice. After restarting the fms server, everything goes working again.
I really need to know the reason why fms server can die.
I checked its log, i saw many
"Server rejected an invalid flow."
Any hint will be greatest welcome.
This error can be caused by making an attempt to make a P2P connection to the server's peer ID. Connections to the server need to use
http://forums.adobe.com/thread/845685
i believe the problem is that you are attempting to make a P2P connection to the server's peer ID; that is, something like
var ns:NetStream = new NetStream(netConnection, netConnection.farID);
ns.play(...);
under the covers, this will open a new RTMFP flow to the server that will appear to the server as a new incoming client, but the initial handshake will be incorrect (the first/only command message is "play" instead of "connect"). i see this on Cirrus all the time.
it's possible that FMS doesn't account properly when rejecting these flows (leaving the connection count higher than it should be), or perhaps it leaves the flow open waiting for a "connect" message that will never come, so the connection count is legitimately higher than you think it is.
in any case, make sure you're not opening a P2P stream to the server's peer ID.
However, this error may not actually be related to the crashes. Additionally, are you even sure FMS is crashing and not just your application? If it's just your application, review your application logs (instead of the core FMS logs) and if you don't have anything useful add more logging to your application.

TcpListener stops accepting or accepts broken connections

We currently experience a problem with a self-written server application running on Windows (occurs on different versions). The server listens at a TCP port, accepts connections, exchanges some data and then closes the connections again. There are about 100 clients that connect from time to time.
Sometimes the server stops to work: Log files show that connections are still accepted, but that at the first read attempt a socket error (10054 - Connection reset by peer) occurs. I don't think it is a client issue because it suddenly stops working for all clients.
Now we found out, that the same problem occurs with our old server software, that is even written in another programming language. So it doesn't seem to be an error in our program - I think it has to be some kind of OS / firewall issue? Of course, firewalls have been deactivated, which didn't solve the issue yet.
Any ideas where to look into? Wireshark logs will follow soon..
Excerpt from the log (Timestamp, Thread Id, message)
11:37:56.137 T#3960 Connection from 10.21.13.3
11:37:56.138 T#3960 Client Exception: Socket Error # 10054
Connection reset by peer.
11:37:56.138 T#3960 ClientDisconnected
11:38:00.294 T#4144 Connection from 10.21.13.3
You can see that the exception occurs almost at the same time as the connection is accepted, in this case the client reconnects after a few seconds.
A "stateful" firewall or NAT keeps track of connections, and ought to send RSTs for connectiosn it doesn't know about. If the firewall loses track of connections for some reason, then you'll probably see random connections being reset.
Our router at work does this — it forgets about connections when the PPP connection dies, which is remarkably unhelpful when it rains and the DSL restart takes a bit too long. However, instead of resetting connections, it just drops packets (even more unhelpful!).
Sounds like a firewall or routing issue - maybe stale connections get disconnected after a timeout period. Are you using a ping/keepalive inside your protocol.
Otherwise you may ask Wireshark to see what is going on.
First, thanks for many hints - I'm afraid the problem was a completely different one which you couldn't possibly solve by reading my question.
The server application uses log4net, configured with a log file an ImmediateFlush = true. If every log statement is directly written into the file and multiple socket connections occur this slows down the whole application.
The server needed about a minute to really accept the connection. This was far more than the timeout on clientside. So in the log there was only shown "accepted" followed by "disconnected" - even the log was delayed!
Sorry for the inconvenience...
Have you tried changing the backlog and then see how much time or how many clients are served before this problem occurs
You don't say what Windows versions you're using for the server, but you should be aware that the Windows TCP/IP stack behaves differently in server and client OSes. There are limits on how many simultaneous incoming connections a client OS will allow, and they are significantly lower than you might expect.
What do the logs look like from the client side?
Since the error is stating that the client is dropping the connection; if you see the same error on the client side then it is a firewall or proxy that is dropping the connection (both side seeing the opposite side dropping the connection is indicative of a proxy/firewall).
If the error is not present on the client side; then I would say that your client side is where you will see the actual error.

Resources