WebSphere 8.5 TCP Channel exceeded the maximum number of open connections

We are running a CXF 2.7.11 application on a WAS 8.5.5.2 server. The application uses the parent-last classloading policy, and we disabled the IBM JAX-WS engine as instructed in the CXF documentation.
The application runs fine for a couple of days. After that we get the exceptions below, and the TCP channel seems to be full.
Because the stack trace contains ws classes, I suspect CXF is behind this problem, but it may be the result of another problem.
The application is also a Spring MVC application that exposes REST resources.
[10.11.2014 05:00:20:887 EET] 00000049 TCPChannel W TCPC0004W: TCP Channel TCP_2 has exceeded the maximum number of open connections 20000.
[10.11.2014 05:02:16:343 EET] 0000023f SSLHandshakeE E SSLC0008E: Unable to initialize SSL connection. Unauthorized access was denied or security settings have expired. Exception is javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
at com.ibm.jsse2.b.a(b.java:56)
at com.ibm.jsse2.nc.a(nc.java:90)
at com.ibm.jsse2.nc.unwrap(nc.java:292)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:26)
at com.ibm.ws.ssl.channel.impl.SSLConnectionLink.readyInbound(SSLConnectionLink.java:535)
at com.ibm.ws.ssl.channel.impl.SSLConnectionLink.ready(SSLConnectionLink.java:295)
at com.ibm.ws.tcp.channel.impl.NewConnectionInitialReadCallback.sendToDiscriminators(NewConnectionInitialReadCallback.java:214)
at com.ibm.ws.tcp.channel.impl.NewConnectionInitialReadCallback.complete(NewConnectionInitialReadCallback.java:113)
at com.ibm.ws.tcp.channel.impl.AioReadCompletionListener.futureCompleted(AioReadCompletionListener.java:175)
at com.ibm.io.async.AbstractAsyncFuture.invokeCallback(AbstractAsyncFuture.java:217)
at com.ibm.io.async.AsyncChannelFuture.fireCompletionActions(AsyncChannelFuture.java:161)
at com.ibm.io.async.AsyncFuture.completed(AsyncFuture.java:138)
at com.ibm.io.async.ResultHandler.complete(ResultHandler.java:204)
at com.ibm.io.async.ResultHandler.runEventProcessingLoop(ResultHandler.java:775)
at com.ibm.io.async.ResultHandler$2.run(ResultHandler.java:905)
at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1864)

So this is a bit tricky. You can simply increase the number of connections, which can be done from the console:
Servers > WebSphere Application Servers > SERVER_NAME > web container > web container transport chains > TCP CHANNEL
The reason I say this is tricky is that there could be a larger underlying issue, for example a connection leak. Using up 20K connections is quite a lot; however, I don't know how much load you're expecting on this server. If this is simply a test environment, then you need to start looking into a possible connection leak.
Unable to initialize SSL connection. Unauthorized access was denied or security settings have expired. Exception is javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
This portion of the error message means that plaintext (non-SSL) connections are being made to an SSL port. You might also want to look into that and find out who is making these calls, because they add overhead.
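One way to find out who is sending plaintext to the SSL port is to look at the first bytes each connection sends. The sketch below is a heuristic and is not part of WebSphere. It assumes that TLS clients open with a TLS record header (content type 0x16 for a handshake, followed by a version whose major byte is 0x03), and that the offending clients speak plain HTTP/1.x. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/**
 * Heuristic classification of the first bytes received on a port that expects TLS.
 * A plaintext HTTP client sends ASCII such as "GET " instead of a TLS record,
 * which is what the JSSE message "Unrecognized SSL message, plaintext connection?" reports.
 */
final class TlsSniffer {
    private TlsSniffer() {}

    /** True if the bytes look like the 5-byte header of a TLS handshake record. */
    static boolean looksLikeTlsHandshake(byte[] firstBytes) {
        return firstBytes.length >= 5
                && firstBytes[0] == 0x16   // record content type: handshake
                && firstBytes[1] == 0x03;  // major version byte used by SSL 3.0 and TLS 1.x
    }

    /** True if the bytes start with a common HTTP/1.x method token. */
    static boolean looksLikePlainHttp(byte[] firstBytes) {
        String head = new String(firstBytes, 0, Math.min(firstBytes.length, 8),
                StandardCharsets.US_ASCII);
        for (String method : new String[] {"GET ", "POST ", "PUT ", "HEAD ", "DELETE ", "OPTIONS "}) {
            if (head.startsWith(method)) {
                return true;
            }
        }
        return false;
    }
}
```

In practice you would get these bytes from a packet capture, or from a peek at the socket, and log them together with the remote address. That address is what tells you which client to fix.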

Using 20000 connections is extremely high. You probably have bugs in your client code that are leaking connections. If you are using CXF as a client, take a look at https://issues.apache.org/jira/browse/CXF-5144.
Increasing the number of connections will not solve your issue; it will only delay it.
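The claim that "it will only delay it" follows from simple arithmetic. The sketch below assumes a constant leak rate. The figure of 400 leaked connections per hour used with it is hypothetical and does not come from the question.

```java
/**
 * Back-of-the-envelope model of a connection leak against a fixed limit.
 * Leaked connections accumulate linearly, so raising the limit only moves
 * the point of failure further out in time.
 */
final class LeakModel {
    private LeakModel() {}

    /** Hours until the limit is reached at a constant leak rate (connections per hour). */
    static double hoursToExhaustion(int maxConnections, double leakedPerHour) {
        if (leakedPerHour <= 0) {
            return Double.POSITIVE_INFINITY; // no leak: leaks alone never reach the limit
        }
        return maxConnections / leakedPerHour;
    }
}
```

Suppose 400 connections leak per hour. A limit of 20000 is then reached after 50 hours, which matches the "couple of days" in the question. Doubling the limit to 40000 only pushes failure out to 100 hours.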

Related

JMeter testing ASP.NET application gets timeout

Problem:
I am testing my ASP.NET Web API application on my server (using IIS) with the number of concurrent threads set to 2000 and the loop count set to forever. After several seconds I get a "Connection timed out: connect" error.
What I have tried:
Set the HTTP connect timeout and response timeout to 200000 ms in the JMeter GUI.
Set requestQueueLimit to 65535 and min process to 15 in IIS Manager.
Set minWorkerThreads and minIoThreads to 200 and the timeout to 20 minutes in the web.config file, and restarted my application in IIS.
None of the above worked. I also found that the server's CPU usage stayed low. Here are screenshots taken while running the JMeter test:
[Screenshot: CPU usage]
[Screenshot: JMeter]
Here is the error log:
org.apache.http.conn.HttpHostConnectException: Connect to XXX.XXX.com:80 [XXX.XXXX.com/XXXX] failed: Connection timed out: connect
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl$JMeterDefaultHttpClientConnectionOperator.connect(HTTPHC4Impl.java:408)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.executeRequest(HTTPHC4Impl.java:939)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.sample(HTTPHC4Impl.java:650)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerProxy.sample(HTTPSamplerProxy.java:66)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1301)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1290)
at org.apache.jmeter.threads.JMeterThread.doSampling(JMeterThread.java:651)
at org.apache.jmeter.threads.JMeterThread.executeSamplePackage(JMeterThread.java:570)
at org.apache.jmeter.threads.JMeterThread.processSampler(JMeterThread.java:501)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:268)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.ConnectException: Connection timed out: connect
at java.net.DualStackPlainSocketImpl.connect0(Native Method)

Check Maximum Concurrent Connections and the other limits in your website configuration: Advanced Settings -> Limits -> Maximum Concurrent Connections.
The problem may not be related to IIS at all. The timeout can happen at the website level because of incorrect database configuration or inefficient algorithms. Consider re-running your test with telemetry from a profiler such as YourKit or dotTrace; it will give you full information about what's going on under the hood.
Don't run load tests using the JMeter GUI. It's only for test development and debugging; when it comes to execution, run your JMeter tests in command-line non-GUI mode.
Remove all the listeners. They don't add any value and just consume valuable resources.
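In non-GUI mode, JMeter's command line takes `-n` (run without the GUI), `-t` (the test plan) and `-l` (the file that receives sample results). The sketch below builds that command for Java's `ProcessBuilder`. The file names are placeholders, and the sketch assumes `jmeter` is on the PATH.

```java
import java.util.List;

/** Builds the command line for a non-GUI JMeter run. */
final class JMeterCommand {
    private JMeterCommand() {}

    static List<String> nonGuiRun(String testPlan, String resultsFile) {
        return List.of(
                "jmeter",
                "-n",              // non-GUI mode
                "-t", testPlan,    // .jmx test plan to execute
                "-l", resultsFile  // where sample results are written
        );
    }
}
```

Usage: `new ProcessBuilder(JMeterCommand.nonGuiRun("plan.jmx", "results.jtl")).inheritIO().start();`. You can also type the same arguments directly in a shell.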

gRPC call, channel, connection and HTTP/2 lifecycle

I read the gRPC "Core concepts, architecture and lifecycle" article, but it doesn't go into the depth I'd like to see. There is the RPC call, the gRPC channel, the gRPC connection (not described in the article) and the HTTP/2 connection (not described in the article).
I'm interested in knowing how these fit together. For example, what happens to the channel when an RPC throws an exception? What happens to the gRPC connection when the channel is closed? When is the channel closed? When is the gRPC connection closed? Heartbeats? What if the deadline is exceeded?
Can anyone answer these questions, or point me to resources that can?

The connection is not a gRPC concept. It is not part of the normal API and is an implementation detail. This is fairly normal: HTTP libraries likewise provide details about HTTP exchanges without exposing connections.
It is best to view RPCs and connections as two mostly-separate systems.
The only real guarantee is that "connections are managed by channels," for varying definitions of "managed." You must shut down channels when no longer used if you want connections and other resources to be freed. Other details are either an implementation detail or an advanced API detail.
There is no "gRPC connection." A "gRPC connection" would just be a standard "HTTP/2 connection." And even that is an implementation detail of the transport in many gRPC implementations. This allows alternative "connection" types such as "inprocess" or QUIC (via Cronet, where there is no classic "connection" at all).
It is the channel's job to hold all the connections and reconnect as necessary. It delegates part of that responsibility to load balancers and the load balancing APIs do have a concept of connections (subchannels). By not exposing connections to the application, load balancers have a lot of freedom to operate.
I'll note that gRPC C-core based implementations share connections across channels.
What happens to the channel when an RPC throws an exception?
The channel and connection are not impacted by a failed RPC. Note that connection-level failures typically cause RPCs to fail, but things like retries could allow the RPC to be re-sent on a new connection.
What happens to the gRPC connection when the channel is closed?
The connections are closed, eventually. Channel shutdown isn't instantaneous because existing RPCs can continue, and connection shutdown isn't instantaneous either. Once all RPCs complete, the connections are closed. However, C-core won't shut down a connection until no channel is using it.
When is the channel closed?
Only when the user closes it.
When is the gRPC connection closed?
Lots of times. The client may close it when it is no longer needed. For example, say the server IP address changes and the client needs to connect to 1.1.1.2 instead of 1.1.1.1. A new connection will be created, and new RPCs will go to the new IP address. The client may also close connections it thinks are dead (e.g., via keepalive timeouts).
Servers have a lot of say in when to close connections. They may close them simply because they are old, because they have been idle, or because the server is overloaded. Those are just use cases; the server can shut down a connection at will.
What if the deadline is exceeded?
Deadline only applies to RPCs and doesn't impact the channel or a connection.
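The following is a toy model in Java that makes the separation between RPCs, channels and connections concrete. It is not the grpc-java API, and all names are made up. It encodes only the rules stated above: a failed or deadline-exceeded RPC leaves the channel usable, shutdown rejects new RPCs, and the connection closes once the in-flight RPCs have finished.

```java
/** Toy model of channel/RPC/connection lifecycles; not the grpc-java API. */
final class ToyChannel {
    enum Status { OK, FAILED, DEADLINE_EXCEEDED }

    private boolean shutdown;
    private boolean connectionOpen = true;
    private int inFlight;

    /** Starts an RPC; returns false if the channel has already been shut down. */
    boolean startRpc() {
        if (shutdown) {
            return false;
        }
        inFlight++;
        return true;
    }

    /** Completes an RPC. The status is per-call and never changes channel state. */
    void finishRpc(Status status) {
        if (inFlight == 0) {
            throw new IllegalStateException("no RPC in flight");
        }
        inFlight--;
        if (shutdown && inFlight == 0) {
            connectionOpen = false; // graceful: close only after the last RPC ends
        }
    }

    /** Stops accepting new RPCs; the connection closes once in-flight RPCs finish. */
    void shutdown() {
        shutdown = true;
        if (inFlight == 0) {
            connectionOpen = false;
        }
    }

    boolean isShutdown() { return shutdown; }

    boolean isConnectionOpen() { return connectionOpen; }
}
```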

I was actually waiting for Eric to answer this, as he is the expert here!
I have also been playing with gRPC for a while now, and I would like to add a few things for beginners. Anyone more experienced, please feel free to edit!
A channel is an abstraction over a long-lived connection. The client application creates a channel on startup. The channel can be reused and shared among multiple threads; it is thread-safe. One channel is enough (for most use cases) for multiple threads and for multiplexing concurrent requests. It is the channel's responsibility to close, reconnect and keep the connection alive, so as users we generally do not have to worry about this. The client application can close the channel any time it wants. Channel creation seems to be an expensive process, so we would not open and close a channel for every RPC.
When you use a gRPC load balancer/name resolver for a domain name, and the name resolver resolves the domain to multiple IP addresses, a channel creates multiple subchannels. Each subchannel is an abstraction over a connection to one server, so a channel can also represent multiple connections!
Adding some points from Eric's comment:
The default load balancer still only creates (approximately) one connection if the name resolver returns multiple addresses, as the default is pick_first. But if you change the load balancer to round_robin or virtually any other policy, then yes, there will be multiple connections in a channel. Even if a name resolver returns one address, the load balancer is free to create multiple connections (e.g., for higher throughput), but that's not common today.
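The pick_first versus round_robin difference can be modeled the same way. This is again a toy model with hypothetical names, not grpc-java's LoadBalancer API, and it assumes the first resolved address is reachable.

```java
import java.util.List;

/**
 * Toy model of how many connections a channel holds for the addresses returned
 * by the name resolver: pick_first connects to one address only, while
 * round_robin keeps one subchannel per address and rotates RPCs across them.
 */
final class ToyBalancer {
    enum Policy { PICK_FIRST, ROUND_ROBIN }

    private final List<String> connected;
    private int next;

    ToyBalancer(Policy policy, List<String> resolvedAddresses) {
        if (resolvedAddresses.isEmpty()) {
            throw new IllegalArgumentException("name resolver returned no addresses");
        }
        this.connected = policy == Policy.PICK_FIRST
                ? List.of(resolvedAddresses.get(0)) // assumption: first address is reachable
                : List.copyOf(resolvedAddresses);   // one subchannel per address
    }

    int connectionCount() {
        return connected.size();
    }

    /** Address the next RPC would be sent to. */
    String pick() {
        String address = connected.get(next);
        next = (next + 1) % connected.size();
        return address;
    }
}
```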
An underlying connection can be closed at any time for any reason. For example, the remote server may be shutting down gracefully for scheduled maintenance, or a connection may have been idle for a long time. In that case the server can send a GOAWAY frame to the client, and the client might disconnect and reconnect to another server. Alternatively, the server might crash because of an OOM error. In that case the channel will detect the connection failure and retry with a new connection to another server.
A channel can keep sending PING frames to the server to keep the connection alive. These are all configurable via the channel builder.
With the information above, let's look at your questions.
What happens to the channel when an RPC throws an exception?
Nothing happens to the channel. An unhandled exception on the server might fail the RPC on the client side, but the channel is still usable for any RPC calls.
What happens to the gRPC connection when the channel is closed?
The channel is an abstraction over the connection, so the connection will be closed. (Again, as Eric mentioned, there is no gRPC connection as such; it is an HTTP/2 connection.)
When is the channel closed?
Any time you want, but normally when the application shuts down.
When is the gRPC connection closed?
It is not our problem; the channel takes care of this.
Heartbeats?
The channel can send PING frames periodically (keepalive) to keep the connection alive, if this is configured on the channel builder.
What if the deadline is exceeded?
It is something like a timeout on the client side. When the deadline is exceeded, the client cancels the RPC. Once again, nothing happens to the channel. (But it might trigger an exception on the server side, which I noticed a few times: "Received DATA frame for an unknown stream", https://github.com/grpc/grpc-java/issues/3548. It seems to have been fixed now.)

ODP.NET Connection Pooling Issues - Fault Tolerance After Database Goes Down

I have a Web API service using ODP.NET to make connections to several Oracle databases. Normally the web service is hit several times a second and never has long periods of inactivity. On our test site, however, we did not use it for 2-3 days. This morning we hit the service and got "connection request timeout" exceptions from ODP.NET, suggesting that the connection pool was out of available connections. We are closing the connections after use. The service was working fine before that period, but today the very first query got the timeout exception. Our app pool in IIS is configured to never reset.
My question, then, is this: what can cause the connection pool to fill with bad connections after a period of inactivity, such that these connections are not cleaned up in the usual 3-minute cycle? It only happened to 2 of our 3 databases, and Validate Connection=true is set for all of them.
EDIT
I have since talked to the DBA. There is a difference between a connection/session being killed manually or by timeout, and the database server severing the TCP connections. In this case the TCP connections were severed as part of a regular backup (why is not important here). I guess this happens when the whole database server goes offline at once. I think the basis of the question still applies, though: why is ODP.NET unable to clean up severed connections over time? There is a performance counter that refers to "Stasis" connections; could these connections be stuck in that state? I would think ODP.NET should be able to see that a connection is no longer active (Validate Connection=true), kill it and not return it to the pool.
Granted, this problem can be solved by just resetting the app pool every time the database goes down. I would still like to configure ODP.NET connection pooling to be more fault tolerant.

I have run into this same issue, and the only solution I have found is to use the Connection Lifetime connection string parameter in conjunction with Validate Connection.
In my particular case, the connection timeout was set on the server. The connections in the pool would time out but not be removed from the pool, which resulted in errors.
Setting both the Connection Lifetime and the Validate Connection parameters has resolved the issue.
Make sure the Connection Lifetime value that you choose is less than the server connection inactivity timeout.
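The toy Java pool below (hypothetical names, not ODP.NET) shows why the two settings complement each other. Validate Connection discards dead connections as they are taken out of the pool. Connection Lifetime destroys a connection that is older than the limit when it is returned, instead of pooling it. The sketch assumes that Connection Lifetime is checked on return to the pool. Times are in seconds.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy pool mimicking Validate Connection and Connection Lifetime; not ODP.NET. */
final class ToyPool {
    static final class Conn {
        final long createdAt;
        boolean alive = true; // false once the server side has dropped the connection

        Conn(long createdAt) {
            this.createdAt = createdAt;
        }
    }

    private final Deque<Conn> idle = new ArrayDeque<>();
    private final long lifetimeSeconds;
    private final boolean validateConnection;
    private int created;

    ToyPool(long lifetimeSeconds, boolean validateConnection) {
        this.lifetimeSeconds = lifetimeSeconds;
        this.validateConnection = validateConnection;
    }

    /** Takes an idle connection, or opens a new one if none is usable. */
    Conn acquire(long now) {
        while (!idle.isEmpty()) {
            Conn c = idle.pop();
            if (!validateConnection || c.alive) {
                return c; // without validation, a dead connection is handed out as-is
            }
            // with validation, the dead connection is discarded and the loop keeps looking
        }
        created++;
        return new Conn(now);
    }

    /** Returns a connection; one older than the lifetime is destroyed instead of pooled. */
    void release(Conn c, long now) {
        if (now - c.createdAt <= lifetimeSeconds) {
            idle.push(c);
        }
    }

    /** Simulates the server severing every pooled TCP connection (e.g. during a backup). */
    void severAll() {
        for (Conn c : idle) {
            c.alive = false;
        }
    }

    int idleCount() { return idle.size(); }

    int createdCount() { return created; }
}
```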

The recommended solution is to use ODP.NET Fast Connection Failover (FCF). FCF will automatically remove invalid connections from the pool such that you don't need to use Validate Connection, Connection Lifetime, nor clear the pool.
To use FCF, set "HA events=true", use connection pooling, and have your DBA set up Fast Application Notification (FAN) on the server side. FAN is what alerts the ODP.NET pool when a DB service or node goes down or is rebooted. When it receives the message, ODP.NET knows which connections to remove from the pool and removes them, leaving all other valid connections untouched.
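FCF's selective eviction can be pictured as follows. This is a toy model only, not ODP.NET's implementation. The FAN "down" event is reduced to an instance name, and the instance names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of FCF reacting to a FAN "down" event: each pooled connection
 * belongs to a database instance, and only the affected instance's
 * connections are evicted while the rest stay pooled.
 */
final class ToyFcfPool {
    private final List<String> pooledInstances = new ArrayList<>();

    /** Records one pooled connection to the given instance. */
    void add(String instance) {
        pooledInstances.add(instance);
    }

    /** Handles a "down" notification; returns how many connections were evicted. */
    int onInstanceDown(String instance) {
        int before = pooledInstances.size();
        pooledInstances.removeIf(instance::equals);
        return before - pooledInstances.size();
    }

    int size() {
        return pooledInstances.size();
    }
}
```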

Something else is going on here. Min Pool Size and some of the other settings help when the connection is severed by things like DBA-configured idle timeouts and firewall TCP idle timeouts, but 'connection request timeout' occurs when creating a new connection.
This could be a simple network problem; something could be interfering with DNS resolution of the servers. Another cause is not having fully qualified entries in tnsnames. I've been bitten by the latter a couple of times.
The other issue is the one you've already recognized: a full pool.
Double-check that you don't have a connection leak somewhere. A missing .Close is one thing, but if you're not using a 'using' statement, a try/finally is required, because an unhandled exception could be thrown before the .Close.
I would use perfmon to monitor some of the connection statistics to start with: NumberOfPooledConnections, NumberOfActiveConnections, etc.
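The missing try/finally pitfall is not specific to .NET. Below is a Java sketch of it, with try-with-resources playing the role of C#'s `using` statement. `FakeConnection` is a hypothetical stand-in that only counts how many connections are open.

```java
/**
 * Shows how an exception thrown between open and close skips a plain close()
 * call, while try-with-resources closes the resource on every path.
 */
final class LeakDemo {
    static int open; // number of FakeConnection instances not yet closed

    static final class FakeConnection implements AutoCloseable {
        FakeConnection() { open++; }

        @Override
        public void close() { open--; }
    }

    /** Leaky: close() is never reached when the query throws. */
    static void leaky(boolean fail) {
        FakeConnection c = new FakeConnection();
        query(fail);
        c.close();
    }

    /** Safe: the connection is closed even when the query throws. */
    static void safe(boolean fail) {
        try (FakeConnection c = new FakeConnection()) {
            query(fail);
        }
    }

    private static void query(boolean fail) {
        if (fail) {
            throw new IllegalStateException("simulated query failure");
        }
    }
}
```

Every failed call through `leaky` leaves one more connection checked out. That is the slow pool exhaustion described above.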

TcpListener stops accepting or accepts broken connections

We are currently experiencing a problem with a self-written server application running on Windows (it occurs on different versions). The server listens on a TCP port, accepts connections, exchanges some data and then closes the connections again. About 100 clients connect from time to time.
Sometimes the server stops working. The log files show that connections are still accepted, but a socket error (10054, Connection reset by peer) occurs at the first read attempt. I don't think it is a client issue, because it suddenly stops working for all clients.
We have now found that the same problem occurs with our old server software, which is written in a different programming language. So it doesn't seem to be an error in our program; I think it has to be some kind of OS or firewall issue. Of course, we have deactivated the firewalls, but that didn't solve the issue.
Any ideas where to look? Wireshark logs will follow soon.
Excerpt from the log (Timestamp, Thread Id, message)
11:37:56.137 T#3960 Connection from 10.21.13.3
11:37:56.138 T#3960 Client Exception: Socket Error # 10054
Connection reset by peer.
11:37:56.138 T#3960 ClientDisconnected
11:38:00.294 T#4144 Connection from 10.21.13.3
You can see that the exception occurs almost at the same time as the connection is accepted, in this case the client reconnects after a few seconds.

A "stateful" firewall or NAT keeps track of connections and ought to send RSTs for connections it doesn't know about. If the firewall loses track of connections for some reason, you'll probably see random connections being reset.
Our router at work does this: it forgets about connections when the PPP connection dies, which is remarkably unhelpful when it rains and the DSL restart takes a bit too long. However, instead of resetting connections, it just drops packets (even more unhelpful!).

Sounds like a firewall or routing issue; maybe stale connections get disconnected after a timeout period. Are you using a ping/keepalive inside your protocol?
Otherwise, you could use Wireshark to see what is going on.

First, thanks for the many hints. I'm afraid the problem was a completely different one, which you couldn't possibly have solved by reading my question.
The server application uses log4net, configured with a log file and ImmediateFlush = true. Every log statement was written directly to the file, so when multiple socket connections came in at once, this slowed down the whole application.
The server needed about a minute to actually accept the connection, which was far longer than the client-side timeout. So the log only showed "accepted" followed by "disconnected", and even the log was delayed!
Sorry for the inconvenience...
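One way to keep slow, flushing log writes from delaying accept() is to hand them to a background thread. The sketch below is a generic producer/consumer pattern in Java. It is not a log4net configuration, and the in-memory list stands in for the log file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Network threads only enqueue messages; one writer thread does the slow I/O. */
final class AsyncLog implements AutoCloseable {
    // Unique sentinel object, compared by identity so no real message can match it.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> sink = new ArrayList<>(); // stands in for the log file
    private final Thread writer = new Thread(this::drain, "log-writer");

    AsyncLog() {
        writer.start();
    }

    /** Called from accept/read threads: only enqueues, never touches the disk. */
    void log(String message) {
        queue.add(message);
    }

    /** Runs on the writer thread until the stop sentinel is dequeued. */
    private void drain() {
        try {
            String message;
            while ((message = queue.take()) != STOP) {
                synchronized (sink) {
                    sink.add(message); // a real writer would append to the file and flush here
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Writes everything queued so far, then stops the writer thread. */
    @Override
    public void close() {
        queue.add(STOP);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Snapshot of what has been written so far. */
    List<String> written() {
        synchronized (sink) {
            return new ArrayList<>(sink);
        }
    }
}
```

The trade-off is that messages still in the queue are lost if the process dies abruptly. That is usually acceptable for diagnostics, but not for audit logs.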

Have you tried changing the backlog, and then checking how long it takes (or how many clients are served) before this problem occurs?

You don't say what Windows versions you're using for the server, but you should be aware that the Windows TCP/IP stack behaves differently in server and client OSes. There are limits on how many simultaneous incoming connections a client OS will allow, and they are significantly lower than you might expect.

What do the logs look like from the client side?
The error states that the client dropped the connection. If you see the same error on the client side, then a firewall or proxy is dropping the connection; both sides seeing the opposite side drop the connection indicates a proxy or firewall.
If the error is not present on the client side, then I would say the client side is where you will find the actual error.

Sandbox violation on second socket send

I have a Flex client using a Flash binary (TCP) socket for communication with a Java server. I have a localhost (Apache) server providing a crossdomain.xml file which is wide open just while I am testing.
My code successfully loads the policy file on startup.
I then connect the socket to the server without any difficulty and send a message and get a response. All good so far.
However, when I send a second message through the same socket, I get a pause of about 12 seconds and then a sandbox violation error:
Security Error: Error #2048: Security sandbox violation: file:///C:/apache_root/ttt1/ttt1.swf cannot load data from localhost:45455.
This is the same port and socket through which the first message succeeded.
I tried re-loading the policy file before every send, but I get the same result.
Any idea why this might be happening? I clearly have an open socket at one point. I am flushing the socket after each send, and I tried doing that after each read as well, with the same result.
Thanks in advance
EDIT:
If I recreate the socket prior to every call my code works. I am struggling to believe that this is correct, but maybe there is a Socket setting I am missing.

As far as I know, if you're using binary sockets, the crossdomain.xml is not loaded via HTTP.
Have you checked your Apache access logs to see whether the crossdomain file is even requested?
You might get a TCP connection from Flash asking for the file on your Java server. It does not use HTTP; it just sends the string "<policy-file-request/>" or similar. Look out for those requests. If you don't answer them within 3 seconds or so, Flash throws a sandbox violation.

The first thing you have to do when you want to make a socket connection is to load the policy file. This only has to be done once per load of the SWF.
Security.allowDomain(host);
Security.loadPolicyFile("xmlsocket://"+host+":"+port);
The request will be made on the assigned port (45455 in your case), so your server will have to listen on that port for the request "<policy-file-request/>" (without the quotes).
When that request arrives, you need to return the crossdomain.xml to the client,
with the node <allow-access-from domain="*" to-ports="*" />
After the cross-domain policy is sent, you need to close the socket on the server side.
On the client side you can ignore the policy response, because Flex will handle it. At that point you can reconnect to the socket server.
You can then do your data send/receive.
I have a feeling the reason it actually worked for you is because you were using the connection for the policy file to transmit your data before it timed out.
I would suggest reading up on the new style of cross-domain policies, and also on the protocol you are using for your socket server.
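Here is a minimal Java sketch of the server side of that exchange. It assumes the Flash socket policy convention: the request is terminated by a NUL byte, and Flash expects a NUL-terminated reply. The wide-open policy mirrors the test setup in the question and is not meant for production. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Answers a Flash socket policy request on one connection. */
final class PolicyHandler {
    static final String REQUEST = "<policy-file-request/>";
    static final String POLICY =
            "<?xml version=\"1.0\"?>"
            + "<cross-domain-policy>"
            + "<allow-access-from domain=\"*\" to-ports=\"*\" />"
            + "</cross-domain-policy>";

    private PolicyHandler() {}

    /** Bytes to send back for a request, or null if it is not a policy request. */
    static byte[] respond(String request) {
        if (!REQUEST.equals(request)) {
            return null;
        }
        byte[] xml = POLICY.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(xml, xml.length + 1); // extra slot stays 0: the NUL terminator
    }

    /** Reads up to the NUL, replies if appropriate; the caller then closes the socket. */
    static boolean serve(InputStream in, OutputStream out) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) > 0) { // stops at the NUL terminator (0) or end of stream (-1)
            buf.write(b);
        }
        byte[] reply = respond(new String(buf.toByteArray(), StandardCharsets.UTF_8));
        if (reply == null) {
            return false; // not a policy request; this sketch simply rejects it
        }
        out.write(reply);
        out.flush();
        return true;
    }
}
```

In an accept loop you would call `serve(socket.getInputStream(), socket.getOutputStream())` and then close the socket. The client then reconnects for the real data exchange.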

I think it depends on the sandbox policy used when compiling your SWF, not on your crossdomain.xml. Maybe this documentation helps: Security sandboxes.
But I'm not 100% sure.

This sounds like a cache problem. Perhaps the first socket connection is being served from cache, and the second one gets rejected because it's getting a 200 from the server.
You might want to add localhost to your Flash security exceptions list for debugging. That will quiet the sandbox errors until you move your piece to its production environment.
