netty client + keep-alive=true - http

I'm confused about how to deal with lots of connections in Netty (3.6.2.FINAL) with keep-alive=true.
I'm working on a Netty client that acts as a server-side connector, making HTTP calls to another service, and it should always keep connections open for performance (keep-alive=true).
The issue: there is a hard limit on the number of open channels, after which the client hangs when attempting to open a channel. Why does it hang rather than throw an exception? Is there a channel-timeout setting that controls this?
I can't seem to understand how Netty manages connections across worker threads overall:
With a blocking write/read client ChannelHandler (HTTP request/response), how do you detect that the connection pool is empty?
The handler can receive ChannelEvent(s), but nothing tells it the overall count available in the connection pool (it's very non-deterministic anyway). And if the channel is not open, does it make sense for the handler to initiate opening a new channel, given that it's running in a worker thread?
And if the connection pool is exhausted, how do you go and clean up some idle connections (from within the handler)?

I had to completely rip apart my handler to get the blocking client call to work without hanging. The issue was mostly resolved by not holding onto a local channel reference within the handler.
Now we just pass a ConnectionInterface#openConnection() [returns a new ChannelFuture] into the shared custom ChannelHandler#call( ConnectionInterface connectionInterface, HttpRequest request ).
It's better to open the channel within the handler's call method and pass that channel along, with checks on its state before channel.write(x): if !channel.isWritable(), then recycle the channel (obtain a fresh one via ConnectionInterface#openConnection()) and retry the write. There isn't even a need to close the channel (the pool handles that).
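A minimal sketch of that pattern against the Netty 3.x API; ConnectionInterface#openConnection is the custom abstraction described above, while the class name and the bounded retry count are assumptions added for illustration:

import org.jboss.netty.channel.Channel;
import org.jboss.netty.channel.ChannelFuture;
import org.jboss.netty.handler.codec.http.HttpRequest;

public class HttpCallHandler {

    // The custom abstraction from the text: openConnection() starts a new
    // client connection and returns its ChannelFuture.
    public interface ConnectionInterface {
        ChannelFuture openConnection();
    }

    public void call(ConnectionInterface connectionInterface, HttpRequest request) {
        Channel channel = connectionInterface.openConnection()
                                             .awaitUninterruptibly()
                                             .getChannel();
        for (int attempt = 0; attempt < 3; attempt++) { // bounded retry (assumption)
            if (channel.isWritable()) {
                channel.write(request);
                return; // no close here: the pool decides when a channel dies
            }
            // Channel went stale between opening and writing: recycle it by
            // opening a fresh client connection and retrying the write.
            channel = connectionInterface.openConnection()
                                         .awaitUninterruptibly()
                                         .getChannel();
        }
        throw new IllegalStateException("no writable channel after 3 attempts");
    }
}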
Just ran it with 500 threads / 5000 requests and it succeeds fine.

Related

gRPC call, channel, connection and HTTP/2 lifecycle

I read the gRPC Core concepts, architecture and lifecycle article, but it doesn't go into the depth I'd like to see. There is the RPC call, the gRPC channel, the gRPC connection (not described in the article) and the HTTP/2 connection (not described in the article).
I'm interested in knowing how these come together. For example, what happens to the channel when an RPC throws an exception? What happens to the gRPC connection when the channel is closed? When is the channel closed? When is the gRPC connection closed? Heartbeats? What if the deadline is exceeded?
Can anyone answer these questions, or point me to resources that can?
The connection is not a gRPC concept. It is not part of the normal API and is an implementation detail. This should be seen as fairly normal, like HTTP libraries providing details about HTTP exchanges but not exposing connections.
It is best to view RPCs and connections as two mostly-separate systems.
The only real guarantee is that "connections are managed by channels," for varying definitions of "managed." You must shut down channels when no longer used if you want connections and other resources to be freed. Other details are either an implementation detail or an advanced API detail.
There is no "gRPC connection." A "gRPC connection" would just be a standard "HTTP/2 connection." Except that is even an implementation detail of the transport in many gRPC implementations. That allows having alternative "connection" types like "inprocess" or QUIC (via Cronet, where there is not a classic "connection" at all).
It is the channel's job to hold all the connections and reconnect as necessary. It delegates part of that responsibility to load balancers and the load balancing APIs do have a concept of connections (subchannels). By not exposing connections to the application, load balancers have a lot of freedom to operate.
I'll note that gRPC C-core based implementations share connections across channels.
What happens to the channel when a RPC throws an exception?
The channel and connection are not impacted by a failed RPC. Note that connection-level failures typically cause RPCs to fail, but things like retries can allow the RPC to be re-sent on a new connection.
What happens to the gRPC connection when the channel is closed?
The connections are closed, eventually. Channel shutdown isn't instantaneous because existing RPCs can continue, and connection shutdown isn't instantaneous either. But once all RPCs complete, the connections are closed. (C-core, though, won't shut down a connection until no channels are using it.)
When is the channel closed?
Only when the user closes it.
When is the gRPC connection closed?
Lots of times. The client may close it when no longer needed. For example, let's say the server IP address changes and the client needs to connect to 1.1.1.2 instead of 1.1.1.1. A new connection will be created and new RPCs will go to the new IP address. The client may also close connections it thinks are dead (e.g., via keepalive timeouts).
Servers have a lot of say in when to close connections. They may close them simply because they are old, or because they have been idle, or because the server is overloaded. But those are simply use cases; the server can shut down a connection at will.
What if the deadline is exceeded?
Deadline only applies to RPCs and doesn't impact the channel or a connection.
I was actually waiting for Eric to answer this, as he is the expert on this!
I've also been playing with gRPC for a while now, and I would like to add a few things here for beginners. Anyone more experienced, please feel free to edit!
A channel is an abstraction over a long-lived connection! The client application creates a channel on startup. The channel can be reused/shared among multiple threads; it is thread-safe. One channel is enough (for most use cases) for multiple threads and for multiplexing concurrent requests. It is the channel's responsibility to close / reconnect / keep the connection alive, etc. We as users do not have to worry about this in general. The client application can close the channel any time it wants. Channel creation is an expensive process, so we would not open/close one for every RPC.
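For instance, a minimal grpc-java sketch of that lifecycle; the address is a placeholder and GreeterGrpc is a hypothetical generated stub:

import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class GrpcClientLifecycle {
    public static void main(String[] args) throws InterruptedException {
        // Created once at startup; thread-safe, shared by all threads and stubs.
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("greeter.example.com", 443) // placeholder address
                .build();
        try {
            // e.g. GreeterGrpc.newBlockingStub(channel) -- stubs are cheap,
            // the channel is the expensive, long-lived part we reuse.
        } finally {
            // Close the channel only when the application is done with it.
            channel.shutdown().awaitTermination(5, TimeUnit.SECONDS);
        }
    }
}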
When you use a gRPC load balancer/name resolver for a domain name and the name resolver resolves the domain to multiple IP addresses, the channel creates multiple subchannels, where each subchannel is an abstraction over a connection to one server. So a channel can also represent multiple connections!
Adding some points to note from Eric's comment:
“adding the default load balancer still only creates (approximately) one connection if the name resolver returns multiple addresses, as the default is pick_first. But if you change the load balancer to round_robin or virtually any other policy, then yes, there will be multiple connections in a channel. Even if a name resolver returns one address, the load balancer is free to create multiple connections (e.g., for higher throughput), but that's not common today”
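As a hedged illustration of that comment in grpc-java (the target URI is a placeholder):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class RoundRobinChannel {
    static ManagedChannel create() {
        return ManagedChannelBuilder
                .forTarget("dns:///greeter.example.com:443") // placeholder target
                // pick_first (the default) keeps roughly one connection; round_robin
                // opens a subchannel per address the name resolver returns.
                .defaultLoadBalancingPolicy("round_robin")
                .build();
    }
}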
An underlying connection can be closed at any time for any reason. For example: the remote server is shutting down gracefully for scheduled maintenance, or a connection has been idle for a long duration. In that case, the server may send a GOAWAY signal to the client, and the client might disconnect and reconnect to some other server. Or the server might crash due to an OOM error; in that case, the channel will detect the connection failure and will retry, establishing a new connection to some other server, etc.
A channel can keep sending PING frames to the server to keep the connection alive. These are all configurable via the channel builder.
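For example, in grpc-java those PINGs are enabled via the channel builder; the values below are arbitrary placeholders, not recommendations:

import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class KeepAliveChannel {
    static ManagedChannel create() {
        return ManagedChannelBuilder
                .forAddress("greeter.example.com", 443) // placeholder address
                .keepAliveTime(30, TimeUnit.SECONDS)    // PING after 30s of inactivity
                .keepAliveTimeout(10, TimeUnit.SECONDS) // drop connection if no ack in 10s
                .keepAliveWithoutCalls(true)            // ping even with no active RPCs
                .build();
    }
}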
With this information in mind, let's look at your questions:
what happens to the channel when a RPC throws an exception?
Nothing happens to the channel. An unhandled exception on the server might fail the RPC on the client side, but the channel is still usable for any RPC calls.
What happens to the gRPC connection when the channel is closed?
The channel is an abstraction over the connection, so the connection will be closed. (Again, there is no gRPC connection as such, as Eric mentioned. It would be an HTTP/2 connection.)
When is the channel closed?
Any time you want. But normally when the application shuts down.
When is the gRPC connection closed?
It is not our problem. The channel takes care of this.
Heart beats?
The channel sends PING frames periodically to keep the connection alive.
What if the deadline is exceeded?
It is something like a timeout on the client side. When the deadline is exceeded, the client might cancel the request. Once again, nothing happens to the channel. (But it might trigger an exception on the server side, which I noticed a few times: "Received DATA frame for an unknown stream", https://github.com/grpc/grpc-java/issues/3548. It seems to have been fixed now.)

Using connection keepalive in grpc for blockingStubs

I want to enable the connection keep-alive option for the gRPC API calls. The current code makes use of blocking stubs (synchronous calls using the Java client). I would like to know if the connection keep-alive options (described in the link below) are expected to work with the blocking stubs.
https://cs.mcgill.ca/~mxia3/2019/02/23/Using-gRPC-in-Production/
Desired behavior: blocking API calls should fail in a reasonable time if there are any issues with the server (say, the server crashes or is killed for some reason).
In grpc-java, the stubs are a thin layer on top of a more advanced API (ClientCall/ServerCall). The stub type does not impact channel-level features. The keepalive channel option will work independently of the stub type.
Keepalive will kill pending RPCs on a connection to a remote server that crashes/hangs/etc. It does not kill RPCs when the server just takes a long time to respond to the RPC.
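A sketch combining the two knobs under those semantics: keepalive to fail RPCs on a dead connection, plus a deadline to bound servers that are merely slow. The address and the GreeterGrpc stub are hypothetical placeholders:

import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class BlockingClientSetup {
    static ManagedChannel create() {
        return ManagedChannelBuilder
                .forAddress("greeter.example.com", 443) // placeholder address
                .keepAliveTime(30, TimeUnit.SECONDS)    // detect crashed/hung servers
                .keepAliveTimeout(10, TimeUnit.SECONDS)
                .build();
        // Usage with a blocking stub (hypothetical generated class):
        //   GreeterGrpc.newBlockingStub(channel)
        //       .withDeadlineAfter(15, TimeUnit.SECONDS) // bounds slow responses
        //       .sayHello(request);
    }
}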

High response time vs queuing

Say I have a web service used internally by other web services, with an average response time of 1 minute.
What are the pros and cons of such a service with "synchronous" responses, versus making the service return the id of the request, process it in the background, and make the clients poll for results?
Are there any cons to HTTP connections that stay active for more than one minute? Does the default TCP keep-alive matter here?
Depending on your application, it may matter. A couple of things are worth mentioning:
HTTP protocol is sync
There is a very wide misconception that HTTP is async. HTTP is a synchronous protocol, but your client can deal with it asynchronously. E.g., when you call any service using HTTP, your HTTP client may schedule it on a background thread (async). However, the HTTP call will wait until either it times out or the response comes back; during all this time, the HTTP call chain is waiting synchronously.
Sockets
HTTP uses sockets, and there is a hard limit on sockets. Every HTTP connection (if created fresh every time) opens up a new socket. If you have hundreds of requests at a time, you can imagine how many HTTP calls are scheduled synchronously, and you may run out of sockets. I'm not sure about other operating systems, but on Windows, even once you are done with a request, its socket is not disposed of straight away; it stays around (in TIME_WAIT) for a couple of minutes.
Network Connectivity
Keeping an HTTP connection alive for a long time is not recommended. What if you lose the network partially or completely? Your HTTP request would time out, and you wouldn't know the status at all.
Keeping all these things in mind, it's better to schedule long-running tasks on a background process.
If you keep the user waiting while your long job is running on server, you are tying up a valuable HTTP connection while waiting.
Best practice, from a RESTful point of view, is to reply with HTTP 202 (Accepted) and return a response with a link to poll (see the sketch at the end of this answer).
If you want to hang the client while waiting, you should set a request timeout at the client end.
If you have firewalls in between, they might drop connections that are inactive for some time.
Higher Response Throughput
Typically, you would want your OLTP system (web server) to respond as quickly as possible. Since you're queuing the task in the background, your web server can handle more requests, which results in higher response throughput and processing capacity.
More Memory Friendly
Queuing long-running tasks as background jobs via messaging queues prevents excessive use of web server memory. This is good because it keeps your application further from its out-of-memory threshold.
More Resilient to Server Crash
If you queue the task in the background and something goes wrong, the job can be moved to a dead-letter queue, which helps you ultimately fix the problem and re-process the requests that caused your unhandled exceptions.
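A minimal sketch of the 202-plus-polling pattern using the JDK's built-in HTTP server; the /jobs and /status paths, the in-memory job map, and the simulated one-minute task are all assumptions for illustration:

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncJobServer {
    // Hypothetical in-memory job store: id -> state ("pending" until finished).
    static final Map<String, String> jobs = new ConcurrentHashMap<>();
    static final ExecutorService workers = Executors.newFixedThreadPool(4);

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // POST /jobs (method check omitted): accept the work, answer 202
        // immediately with a link the client can poll.
        server.createContext("/jobs", exchange -> {
            String id = UUID.randomUUID().toString();
            jobs.put(id, "pending");
            workers.submit(() -> {
                sleepQuietly(60_000); // stand-in for the ~1 minute of real work
                jobs.put(id, "done");
            });
            exchange.getResponseHeaders().add("Location", "/status/" + id);
            respond(exchange, 202, "Accepted; poll /status/" + id);
        });

        // GET /status/{id}: the polling endpoint.
        server.createContext("/status", exchange -> {
            String id = exchange.getRequestURI().getPath().substring("/status/".length());
            String state = jobs.get(id);
            respond(exchange, state == null ? 404 : 200, state == null ? "unknown job" : state);
        });

        server.start();
    }

    static void respond(HttpExchange exchange, int code, String message) throws IOException {
        byte[] body = message.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(code, body.length);
        try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
    }

    static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}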

Client Reconnection

My understanding of the (JavaScript) hub client is that if a connection is lost, it enters a 'Reconnecting...' phase which attempts to reconnect. If it can't do so, it will enter a 'Disconnected' state, which is where it'll stay until asked to start again.
How long is the 'Reconnecting...' phase meant to last before it gives up? I've read 40 seconds before, but my client seems to take much less time - about 10, maybe less. [EDIT: Never mind this part; I had configured a 10-second disconnect timeout on the server as a test... and forgot. I understand this is set by the server during the negotiate. Makes sense!] ... I'd prefer to have the client continually retry until it is told to abort - can this be done, and would it cause issues?
Another question: during the 'Reconnecting...' phase, if I attempt to call a hub method (again, in JS), it never seems to complete. I'm using the returned Deferred to check for 'done' and 'fail' events, but neither seems to get called. Is this by design?
Thanks.
You can definitely have it continually reconnect.
Handle the disconnected event on the client and call connection.start:
$.connection.hub.disconnected(function() {
    setTimeout(function() {
        $.connection.hub.start();
    }, 5000); // Restart the connection after 5 seconds
});
The only issue this would cause is that you could potentially trigger infinite requests from client machines to a server that isn't there. This becomes even more troublesome when you introduce the mobile market into the situation (it drains the battery like crazy).
When you attempt to call a hub method while reconnecting, SignalR will try to send your command. Since there are two channels, one for receiving data and one for sending (for all transports except WebSockets), in some cases it can still be possible to send requests while you're offline. Therefore SignalR does not know that a request failed until the browser tells it that it could not successfully make the request.
Hope this helps!
I might have a clue... Touching the Web.config produces an appPool recycle, meaning that a new worker process is created for new requests while the existing process continues for a while, until the remaining requests end or the timeout is reached. Requests that do not end within the timeout period are terminated.
The SignalR client reconnects to the new process while the long-running task is running in the old process, so when, in the long-running task, you do
GlobalHost.ConnectionManager.GetHubContext<ForceHub>();
you actually get a reference to the "old" hub, while the client is connected to the "new" hub.
That's why the test performed by Wasp worked: he was making a new request to publish on the SignalR hub, which was processed in the newly created worker process.
You could try to configure a SignalR backplane (https://www.asp.net/signalr/overview/performance/scaleout-in-signalr); it's really easy to configure using SQL Server (https://www.asp.net/signalr/overview/performance/scaleout-with-sql-server). The backplane should be capable of connecting the two worker processes, and hopefully you will get the notification on the client.
If this is the problem, notifications generated by new requests will work even without the backplane. Note that the real purpose of the backplane is to scale out SignalR, that is, to connect a farm of web servers.
Also keep in mind that running long-running tasks inside IIS is a hard task to achieve, as, among other things, IIS does regular appPool recycles and has timeout limits for requests to execute. I recommend that you read the following post: http://www.hanselman.com/blog/HowToRunBackgroundTasksInASPNET.aspx
“If you think you can just write a background task yourself, it's likely you'll get it wrong. I'm not impugning your skills, I'm just saying it's subtle. Plus, why should you have to?”
Hope this helps

Azure Web Role - Long Running Request (Load Balancer Timeout?)

Our front-end MVC3 web application is using AsyncController, because each of our instances services many hundreds of long-running, IO-bound processes.
Since Azure will terminate "inactive" HTTP sessions after some pre-determined interval (which seems to vary depending on what website you read), how can we keep the connections alive?
Our clients MUST stay connected, and our processes will run from 30 seconds to 5 minutes or more. How can we keep the client connected/alive? I initially thought of having a timeout on the async method and just hitting the Response object with a few bytes of output, sort of like chunking the response, and then going back and waiting some more. However, I don't think this will work, since MVC3 handles the hookup of an IIS thread back to the asynchronous response, which will have already rendered a view at that time.
How can we run a really long process on an AsyncController, but have the client not be disconnected by the Azure Load Balancer? Sending an immediate response to the caller, and asking that caller to poll or check another resource URL is not acceptable.
The Azure load balancer idle timeout is 4 minutes. Can you try configuring TCP keep-alive on the client side with an interval of less than 4 minutes? That should keep the connection alive.
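A hedged sketch of client-side TCP keep-alive in Java; the per-socket idle/interval options come from jdk.net.ExtendedSocketOptions (JDK 11+, not available on every platform), and the host is a placeholder:

import java.net.InetSocketAddress;
import java.net.Socket;
import jdk.net.ExtendedSocketOptions;

public class KeepAliveClient {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            socket.setKeepAlive(true); // enable TCP keep-alive probes
            // Probe after 170s idle and every 30s thereafter -- safely under
            // the 4-minute balancer timeout. Platform-dependent (assumption):
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 170);
            socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 30);
            socket.connect(new InetSocketAddress("myapp.cloudapp.net", 80)); // placeholder host
            // ... issue the long-running request over this connection ...
        }
    }
}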
On the other hand, it's pretty expensive to keep a connection open per client for a long time. This will limit the number of clients you can handle per server. Also, I think IIS may still decide to close a connection, regardless of keep-alives, if it thinks it needs the connection to serve other requests.
