Using QuickFixN, if I restart my trading application I occasionally am unable to logon, getting a "an existing connection was forcibly closed by the remote host" error.
The QuickFix engine retries to connect every 30secs, but always gets the same error.
If I close my application and re-open, it will connect correctly.
Speaking to my broker, it seems that they are rejecting my logins because they did not recognize my connection as being closed first time. 2nd time around, me forcing the application to close will tear-down the TCP connection, meaning that 3rd time logins work.
So my question is: is there a way to close and re-open the TCP connection without restarting the application?
Sounds like the problem is kinda on their end. Since the problem happens when you don't formally log out (e.g crash or abnormal termination), that means that their implementation apparently doesn't recognize the TCP termination.
At a higher-than-TCP layer, their FIX engine should somewhat compensate. If a few heartbeat durations occur after your disconnect, their implementation should realize you're not there anymore, since you're not responding to heartbeats.
So, neither their low-layer TCP handlers nor their FIX engine are able to set the right flag somewhere in their system that says you've gone offline. That's weird. I don't see what you can do about that, aside from intentionally doing a startup/shutdown to kludge their state flag for you.
I'm usually really hesitant to blame the other side (especially because I run the QF/n project), but that's where I'm at with the information provided.
Related
First off, there a similar question with the same issue here, but there is no answer, so I rewrote the question once again in more detail.
I am connected to an SMSC, and I noticed that there are a lot of messages are not delivered to us, we asked the SMSC to check the routing and it was fine, but SMSC noticed that there are too many connections established from your side to his side, although, we have one single connection only.
I was using NowSMS SMPP Client application to handle the connectivity, then, the SMSC asked me to change the application although I was thinking that NowSMS had no issues as I am using it 7 years ago, however, I asked NowSMS's team to investigate by opening a support ticket.
Later, I had to change NowSMS and install Kannel on a new Linux machine, after getting connected over Kannel to the SMSC, we got the same issue once again, and when I read all Kannel's logs, I found "System error (104): Connection reset by peer" which makes me, logically, to open a new connection with the SMSC. Accordingly, I suggested to have a live TCP trace from both sides at the same time, and I found the below packet in Wireshark trace file:
As you see, this is a RST/ACK from SMSC to me without requesting RST or anything from my side, and when I asked them why do you send RST/ACK or why do you RST the connection, I didn't get any useful answere, but they told me to read more about the RST/ACK and RST and I have no idea about networking, but when I read, I found that I had no control on RST connection as there was no requests from my side to the SMSC asking for the same. They always guid me to this post and what I see that it doesn't belong to me.
NOW: I just need to know what should I do or what should I ask whom about? As, I asked the Data Center's team about the same, and they confirmed that the VPN between me and the SMSC works normally without any exceptions. I believe, that there is no issue in application layer, but I cannot recognize the root of the issue.
P.S. Kannel's log file, and both TCP Trace file are here
Ask them to activate the Enquire link packet in order to drop inactive connections. It's clearly a problem from their side.
We currently experience a problem with a self-written server application running on Windows (occurs on different versions). The server listens at a TCP port, accepts connections, exchanges some data and then closes the connections again. There are about 100 clients that connect from time to time.
Sometimes the server stops to work: Log files show that connections are still accepted, but that at the first read attempt a socket error (10054 - Connection reset by peer) occurs. I don't think it is a client issue because it suddenly stops working for all clients.
Now we found out, that the same problem occurs with our old server software, that is even written in another programming language. So it doesn't seem to be an error in our program - I think it has to be some kind of OS / firewall issue? Of course, firewalls have been deactivated, which didn't solve the issue yet.
Any ideas where to look into? Wireshark logs will follow soon..
Excerpt from the log (Timestamp, Thread Id, message)
11:37:56.137 T#3960 Connection from 10.21.13.3
11:37:56.138 T#3960 Client Exception: Socket Error # 10054
Connection reset by peer.
11:37:56.138 T#3960 ClientDisconnected
11:38:00.294 T#4144 Connection from 10.21.13.3
You can see that the exception occurs almost at the same time as the connection is accepted, in this case the client reconnects after a few seconds.
A "stateful" firewall or NAT keeps track of connections, and ought to send RSTs for connectiosn it doesn't know about. If the firewall loses track of connections for some reason, then you'll probably see random connections being reset.
Our router at work does this — it forgets about connections when the PPP connection dies, which is remarkably unhelpful when it rains and the DSL restart takes a bit too long. However, instead of resetting connections, it just drops packets (even more unhelpful!).
Sounds like a firewall or routing issue - maybe stale connections get disconnected after a timeout period. Are you using a ping/keepalive inside your protocol.
Otherwise you may ask Wireshark to see what is going on.
First, thanks for many hints - I'm afraid the problem was a completely different one which you couldn't possibly solve by reading my question.
The server application uses log4net, configured with a log file an ImmediateFlush = true. If every log statement is directly written into the file and multiple socket connections occur this slows down the whole application.
The server needed about a minute to really accept the connection. This was far more than the timeout on clientside. So in the log there was only shown "accepted" followed by "disconnected" - even the log was delayed!
Sorry for the inconvenience...
Have you tried changing the backlog and then see how much time or how many clients are served before this problem occurs
You don't say what Windows versions you're using for the server, but you should be aware that the Windows TCP/IP stack behaves differently in server and client OSes. There are limits on how many simultaneous incoming connections a client OS will allow, and they are significantly lower than you might expect.
What do the logs look like from the client side?
Since the error is stating that the client is dropping the connection; if you see the same error on the client side then it is a firewall or proxy that is dropping the connection (both side seeing the opposite side dropping the connection is indicative of a proxy/firewall).
If the error is not present on the client side; then I would say that your client side is where you will see the actual error.
The client use ssh login and start up a server on remote machine, then the clinet create a tcp connect to the server.
The server need exit when the client has exit normally or crashed or network is dropped.
So the question is how to detect if the client which the server has connected to is crashed.
The first try is using error() signal, catch QAbsoluteSocket::NetworkError to determine the network has dropped. But I can't receive error() signal at all even if i pull out the network cable.
The second try is using the SocketState, i think whenever SocketState is UnconnectedState,the client may has exit normally and the server should exit too. This way works fine for "normal exit", but I don't know how to deal with "crash" and "dead network".
Help me, thanks!
I'd recommend using TCP keep alive. It is not exposed through the public QTcpSocket interface, but you can use setsockopt with QAbstractSocker::socketDescriptor to activate the SO_KEEPALIVE feature.
EDIT: It appears that keep alive was added to QAbstractSocket at some point. So, simply call QAbstractSocket::setSocketOption with QAbstractSocket::KeepAliveOption.
You can find information about adjusting the timeout of keep alive request here: http://www.gnugk.org/keepalive.html
Most of the time, the only way you will know there is a problem with a socket connection is when you try to read or write with it. There are some exceptions: Windows will change the state of sockets if the network cable is unplugged, Linux (in my experience) will not.
The most reliable way to detect connection problems is to have the client regularly send a small message at an agreed upon interval with the server. If the server does not see this message within a reasonable time, it should consider the client dead and drop the connection. This will also give both sides regular opportunities to detect a problem via reads and writes.
I have a tcpip socket interface to a third party software app. I've implemented this interface for several customer sites with no problem. The latest customer, though... problems. We've turned on logging in the apps on either end, and also installed Wireshark on the PC to log raw tcpip traffic. With that, we've proved that my server app successfully sends the message out, the pc receives the message, but the client app doesn't see it. (This is a totally intermittent problem, which is why it's such a pain to troubleshoot.)
The socket details are as simple as they come: one socket handling two way communications between the server and the pc. The messages are plain ascii text and fairly short (not XML). The server initiates communications by sending the first message, and then the client responds with several messages. The socket is kept open at all times while the apps are running. The client app is designed so that the end user can only process one case at a time, which prevents message collisions from happening. They have some sort of polling set up, their app "hibernates" until it sees the initiating message from the server.
The third party vendor has advised me to add a few second delay before I send them the initiating message. I can't see how that helps. If the client is "sleeping", just polling the socket waiting for a message, how does adding a delay before the first message help? It's not like we send two messages and the second one gets lost. It's losing the first message. So I don't see how it matters if we send that message now or two seconds from now.
I've asked them and they haven't given me details. It could be some proprietary details in their coding that they don't want to disclose to me, and that's fair. So I'm asking here because I'm always learning new things about socket programming. Maybe you guys can shed some light on how polling a tcpip socket can be affected by message timing?
Since its someone else's client and they won't tell you what its doing (other than saying 'insert a delay'), the answer is probably that their client is reading and discarding the message because its not yet in a state to deal with it. The delay will allow the client time to get into a state where it can respond to the message properly.
In other words, the client has a race condition. One easy way this can happen is if they have one thread for reading messages and another for dealing with them.
Short of running strace(1) on the client to see what system calls it is making, its tough to tell what the client is actually doing.
How long can I expect a client/server TCP connection to last in the wild?
I want it to stay permanently connected, but things happen, so the client will have to reconnect. At what point do I say that there's a problem in the code rather than there's a problem with some external equipment?
I agree with Zan Lynx. There's no guarantee, but you can keep a connection alive almost indefinitely by sending data over it, assuming there are no connectivity or bandwidth issues.
Generally I've gone for the application level keep-alive approach, although this has usually because it's been in the client spec so I've had to do it. But just send some short piece of data every minute or two, to which you expect some sort of acknowledgement.
Whether you count one failure to acknowledge as the connection having failed is up to you. Generally this is what I have done in the past, although there was a case I had wait for three failed responses in a row to drop the connection because the app at the other end of the connection was extremely flaky about responding to "are you there?" requests.
If the connection fails, which at some point it probably will, even with machines on the same network, then just try to reestablish it. If that fails a set number of times then you have a problem. If your connection persistently fails after it's been connected for a while then again, you have a problem. Most likely in both cases it's probably some network issue, rather than your code, or maybe a problem with the TCP/IP stack on your machine (has been known: I encountered issues with this on an old version of QNX--it'd just randomly fall over). Having said that you might have a software problem, and the only way to know for sure is often to attach a debugger, or to get some logging in there. E.g. if you can always connect successfully, but after a time you stop getting ACKs, even after reconnect, then maybe your server is deadlocking, or getting stuck in a loop or something.
What's really useful is to set up a series of long-running tests under a variety of load conditions, from just sending the keep alive are you there?/ack requests and responses, to absolutely battering the server. This will generally give you more confidence about your software components, and can be really useful in shaking out some really weird problems which won't necessarily cause a problem with your connection, although they might result in problems with the transactions taking place. For example, I was once writing a telecoms application server that provided services such as number translation, and we'd just leave it running for days at a time. The thing was that when Saturday came round, for the whole day, it would reject every call request that came in, which amounted to millions of calls, and we had no idea why. It turned out to be because of a single typo in some date conversion code that only caused a problem on Saturdays.
Hope that helps.
I think the most important idea here is theory vs. practice.
The original theory was that the connections had no lifetimes. If you had a connection, it stayed open forever, even if there was no traffic, until an event caused it to close.
The new theory is that most OS releases have turned on the keep-alive timer. This means that connections will last forever, as long as the system on the other end responds to an occasional TCP-level exchange.
In reality, many connections will be terminated after time, with a variety of criteria and situations.
Two really good examples are: The remote client is using DHCP, the lease expires, and the IP address changes.
Another example is firewalls, which seem to be increasingly intelligent, and can identify keep-alive traffic vs. real data, and close connections based on any high level criteria, especially idle time.
How you want to implement reconnect logic depends a lot on your architecture, the working environment, and your performance goals.
It shouldn't really matter, you should design your code to automatically reconnect if that is the desired behavior.
There really is no way to tell. There is nothing inherent to TCP that would cause the connection to just drop after a certain amount of time. Someone on a reliable connection could have years of uptime, while someone on a different connection could have to reconnect every 5 minutes. There is no way to tell or even guess.
You will need some data going over the connection periodically to keep it alive - many OS's or firewalls will drop an inactive connection.
Pick a value. One drop every hour is probably fine. Ten unexpected connection drops in 5 minutes probably indicates a problem.
TCP connections will generally last about two hours without any traffic. Either end can send keep-alive packets, which are, I think, just an ACK on the last received packet. This can usually be set per socket or by default on every TCP connection.
An application level keep-alive is also possible. For a telnet style protocol like FTP, SMTP, POP or IMAP something like sending return, newline and getting back a command prompt.