What are the causes for os_error: message too long - tcp

We have a client and server communicating each other with grpc. Previously the server was running on Windows Server, and the client running on Linux or MacOS. Everything works perfectly until we migrate the server from Windows Server to a docker container.
Then we observed some weird tcp broken when we send a large amount of request from client to server.
Then we dig into the grpc arena and run our client and server with GRPC_VERBOSITY=info and GRPC_TRACE=tcp. Then we found that the disconnection was caused from the server side, with error message below:
tcp_custom.cc:218] write complete on 029FCC20: error={"created":"#1594210168.896000000","description":"TCP Write failed","file":"d:\a\grpc-node\grpc-node\packages\grpc-native-core\deps\grpc\src\core\lib\iomgr\tcp_uv.cc","file_line":72,"grpc_status":14,"os_error":"message too long"}
So my question is what does the os_error: message too long really means? What is the next step for me to investigate?
Linked issue

The string "message too long" is the error message for the error code UV_EMSGSIZE, which corresponds to the Linux error code EMSGSIZE. That appears to generally mean that gRPC is trying to write a buffer that is too large to the socket, but I'm not sure what exactly could trigger that.

Related

Continuously running send pipeline instance

An instance of a BizTalk send pipeline has started to run continuously. On 09/12/2021 an attempt was made to send a file via SFTP, which retried several times but ultimately failed due to a network issue. The error from the event logs is:
The adapter failed to transmit message going to send port "Deliver Outgoing - SFTP" with URL "sftp://xxx.xxxxxx.co.nz:22/To_****/%SourceFileName%". It will be retransmitted after the retry interval specified for this Send Port. Details:"WinSCP.SessionRemoteException: Network error: Software caused connection abort.
For some reason BizTalk made another send attempt at 1:49pm on 10/12/2021 which succeeded as confirmed by the administrator of the SFTP site. Despite this, BizTalk continued making intermittent send attempts and the pipeline instance is still running. The same file has been sent 4 times to the SFTP server.
The pipeline instance in theory should have suspended at 9:47pm on 09/12/2021. I have been able to confirm definitively whether anybody resumed it, but it seems unlikely at this stage. In any case, after sending successfully the pipeline instance should have terminated and should not be re-executing intermittently.
Does anybody know what could account for this behaviour? This is occurring on BTS2020 with CU2 applied.
I've sent messages over SFTP where the WinSCP interpretation of the date-modified attribute doesn't work with a specific type of SFTP server.
With the WinSCP GUI a dialogue box appears and you can disregard this error, but this option isn't available with BizTalk's GUI. This error appears when a file with the same filename already exists on the server and is supposed to be overwritten.
My solution was to create a pipeline component that removed %SourceFileName% on the server. The pipeline component (just like WinSCP GUI) can disregard the modified-date.

network error (Tcp error)

I am inside a network where I need proxy settings to access the internet.
I have a weird problem.
The internet is working fine.
But it is one particular instance when i get this error:
Network Error (tcp_error)
A communication error occurred: "Operation timed out"
The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time.
For assistance, contact your network support team.
This happens when I use hadoop in local mode.
I can access the UI interface. I can see the jobs running. but when I try to see the logs of each task.. i am not able to access those logs.
UI--> job-->map--> task--> all <-- this is where the error is..
Any clues?
THanks
Not sure about exactly what your tcp action is, or about Hadoop or your proxy setup, but if you can reliably repeat the error, and the timeout error happens at approximately the same time each time you test, and that time is on the order of minutes, my guess would be that you've got a true processing delay (perhaps caused by blocking somewhere) at the server, but not necessarily.

ZeroMQ Client Lose Connection

I have a client (PULL) connect to the server (PUSH). At first they work just fine. But later the connection is broken, and client-side ZeroMQ doesn't try to reconnect to server.
One mysterious thing is that if I do netstat in client side and server side, the client side shows the connection is still ESTABLISHED, while the server side doesn't have the corresponding entry. I suppose this is the reason why client-side doesn't do reconnecting.
PS: client and server are in differenct IDC, and there is a band limit between them. But when the disconnection happens, our monitor shows it does not hit the band limit.
And, when I do netstat in server side (when the connection is fine), sometimes the Send-Q column is very big, and then drop down to 0.
That's all the information I have. If you need more details please tell me.
I realize this is a very old question, but I ran into almost the exact same issue and found this while trying to find a fix. I believe I have fixed my issue so hopefully this helps someone at some point.
I had the same scenario but with ROUTER -> ROUTER. Everything worked great at first but after ~15 minutes of not sending any messages, messages would no longer make it. Then I found: http://api.zeromq.org/3-2:zmq-setsockopt. The three socket options that worked for me were ( using pyzmq ):
# self.client is my socket here
self.client.setsockopt(zmq.TCP_KEEPALIVE, 1)
self.client.setsockopt(zmq.TCP_KEEPALIVE_IDLE, 300)
self.client.setsockopt(zmq.TCP_KEEPALIVE_INTVL, 300)
These override the OS settings and I'm no longer seeing the connection timeout or drop.

TcpListener stops accepting or accepts broken connections

We currently experience a problem with a self-written server application running on Windows (occurs on different versions). The server listens at a TCP port, accepts connections, exchanges some data and then closes the connections again. There are about 100 clients that connect from time to time.
Sometimes the server stops to work: Log files show that connections are still accepted, but that at the first read attempt a socket error (10054 - Connection reset by peer) occurs. I don't think it is a client issue because it suddenly stops working for all clients.
Now we found out, that the same problem occurs with our old server software, that is even written in another programming language. So it doesn't seem to be an error in our program - I think it has to be some kind of OS / firewall issue? Of course, firewalls have been deactivated, which didn't solve the issue yet.
Any ideas where to look into? Wireshark logs will follow soon..
Excerpt from the log (Timestamp, Thread Id, message)
11:37:56.137 T#3960 Connection from 10.21.13.3
11:37:56.138 T#3960 Client Exception: Socket Error # 10054
Connection reset by peer.
11:37:56.138 T#3960 ClientDisconnected
11:38:00.294 T#4144 Connection from 10.21.13.3
You can see that the exception occurs almost at the same time as the connection is accepted, in this case the client reconnects after a few seconds.
A "stateful" firewall or NAT keeps track of connections, and ought to send RSTs for connectiosn it doesn't know about. If the firewall loses track of connections for some reason, then you'll probably see random connections being reset.
Our router at work does this — it forgets about connections when the PPP connection dies, which is remarkably unhelpful when it rains and the DSL restart takes a bit too long. However, instead of resetting connections, it just drops packets (even more unhelpful!).
Sounds like a firewall or routing issue - maybe stale connections get disconnected after a timeout period. Are you using a ping/keepalive inside your protocol.
Otherwise you may ask Wireshark to see what is going on.
First, thanks for many hints - I'm afraid the problem was a completely different one which you couldn't possibly solve by reading my question.
The server application uses log4net, configured with a log file an ImmediateFlush = true. If every log statement is directly written into the file and multiple socket connections occur this slows down the whole application.
The server needed about a minute to really accept the connection. This was far more than the timeout on clientside. So in the log there was only shown "accepted" followed by "disconnected" - even the log was delayed!
Sorry for the inconvenience...
Have you tried changing the backlog and then see how much time or how many clients are served before this problem occurs
You don't say what Windows versions you're using for the server, but you should be aware that the Windows TCP/IP stack behaves differently in server and client OSes. There are limits on how many simultaneous incoming connections a client OS will allow, and they are significantly lower than you might expect.
What do the logs look like from the client side?
Since the error is stating that the client is dropping the connection; if you see the same error on the client side then it is a firewall or proxy that is dropping the connection (both side seeing the opposite side dropping the connection is indicative of a proxy/firewall).
If the error is not present on the client side; then I would say that your client side is where you will see the actual error.

Service not available, closing transmission channel. The server response was: 4.4.2 Timeout while waiting for command

I'm trying to send a message and we sometime get this error :
Service not available, closing transmission channel. The server response was: 4.4.2 Timeout while waiting for command.
Anyone know what to do with this? Because it only happens "sometime" and apperently, for no specific reason.
I saw many article saying :
442 The server started to deliver the message but then the connection was broke (Source : http://www.sorkincomputer.net/SMTP%20errors.htm)
This is typically a server side (the SMTP server you're delivering to) error or a network connectivity error. There isn't anything you can do about it via your code, you would need to get the related IT staff involved to figure out why your connection is getting closed or interrupted.

Resources