SignalR duplicating responses - nginx

I'm using SignalR with Redis as a message bus on a server that sits behind an Nginx proxy for load balancing. I used SignalR's PersistentConnection class to write a simple chat program that broadcasts messages to users belonging to the same group. Users are added to a group in OnConnectedAsync and removed in OnDisconnectAsync, and the user-to-group mapping is deterministic.
Currently, the client side falls back to long polling (I'm not entirely sure why). Seemingly at random, when the client sets up a new connection after waiting for and receiving a response, the server responds to the new connection immediately with the previous response, even though there has been only one POST.
The message IDs tend to differ by exactly one (the smaller ID coming first), with the rest of the response remaining the same. I logged some debug info and am quite positive that my override of OnReceivedAsync sends one response per request. I tried the same implementation without the Redis message bus and got the same problem. Running locally (with long polling), however, yielded good results, so I suspect the problem lies in how the message bus buffers messages to refresh clients that might not be caught up, combined with some odd timing in how connections are torn down and set up through the Nginx load balancer. Beyond that, I am very much at a loss.
Any help would be appreciated.
EDIT: Further investigation reveals that duplication occurs at somewhat regular intervals of approximately 20-30 seconds. I'm led to believe that the message expiration in the message bus might have something to do with the bug.
EDIT: Bug can be seen here: http://tinyurl.com/9q5t3va
The server is simply broadcasting a counter being sent by the client. You will notice some responses are duplicated every 20 or so.

Reducing the number of worker processes in the IIS (6.0) Server Manager from 2 to 1 solved the problem.

Related

Load balancing TCP traffic using Apache Camel with Netty leads to transaction failures

I am new to Apache Camel and Netty and this is my first project. I am trying to use Camel with the Netty component to load balance heavy traffic in a back end load test scenario. This is the setup I have right now:
from("netty:tcp://this-ip:9445?defaultCodec=false&sync=true").loadBalance().roundRobin().to("netty:tcp://backend1:9445?defaultCodec=false&sync=true", "netty:tcp://backend2:9445?defaultCodec=false&sync=true")
The issue is that the client system sending TCP traffic to Camel receives responses with unexpected buffer sizes. When I send multiple requests one after the other, I see no issues and the buffer size is as expected. But when I run multiple users sending similar requests to Camel on the same port, I intermittently see unexpected buffer sizes, ranging from 0 bytes to more than the expected number of bytes. I tried playing around with multiple options mentioned on the Camel-Netty page, like:
Increasing backlog
keepAlive
buffersizes
timeouts
poolSizes
workerCount
synchronous
stream caching (did not work)
disabled useOriginalMessage for performance
System level TCP parameters, etc. among others.
I have yet to resolve the issue, and I am not sure whether I'm fundamentally missing something. I did take a look at the encoders/decoders and wonder whether they could be the issue, but I don't understand why a load balancer needs to encode/decode messages. I have worked with other load balancers that only require endpoint configuration, so I am assuming that Camel does not require this either. Am I right? Please note that the issue is not with my client/backend: I ran a 2000 user load test from my client directly to the backend with less than 1% failures, but I see a large number of failures (though some requests do succeed) with Camel. I have the following questions:
1. Is this a valid use case for Apache Camel with Netty? Should I be looking at Mina or others?
2. Can I try to route TCP traffic to JMS or other components and then finally to the TCP endpoint?
3. Do I need encoders/decoders or should this configuration work?
4. Should I continue with this approach or try some other load balancer?
Please let me know if you have any other suggestions. TIA.
Edit1:
I also tried the same approach with netty4 and mina components. The route looks similar to the one in netty. The route with netty4 is as follows:
from("netty4:tcp://this-ip:9445?defaultCodec=false&sync=true").to("netty4:tcp://backend1:9445?defaultCodec=false&sync=true")
I read a few posts which had the same issue but did not find any solution relevant to my issue.
Edit2:
I increased the receive timeout at my client and immediately saw the buffer length mismatches fall to less than 1%. However, the response time for each transaction is far higher with Camel than without it, almost 10 times higher. Can you help me reduce the response time for each transaction? The message received back at my client varies from 5000 to 20000 bytes. Here is my latest route:
from("netty:tcp://this-ip:9445?sync=true&allowDefaultCodec=false&workerCount=20&requestTimeout=30000")
.threads(20)
.loadBalance()
.roundRobin()
.to("netty:tcp://backend-1:9445?sync=true&allowDefaultCodec=false","netty:tcp://backend-2:9445?sync=true&allowDefaultCodec=false")
I also used certain performance enhancements like:
context.setAllowUseOriginalMessage(false);
context.disableJMX();
context.setMessageHistory(false);
context.setLazyLoadTypeConverters(true);
Can you point me in the right direction about how I can reduce the individual transaction times?
For netty4 component there is no parameter called defaultCodec. It is called allowDefaultCodec. http://camel.apache.org/netty4.html
Also, try something like this first.
from("netty4:tcp://this-ip:9445?textline=true&sync=true").to("netty4:tcp://backend1:9445?textline=true&sync=true")
The above means the data being sent is plain text. If you are sending bytes or something else, you will need to provide decoding/encoding so that Netty can handle the data.
And a side note. Before running the Camel route, test manually to send test messages via a standard tcp tool like sockettest to verify that everything works. Then implement the same via Camel. You can find sockettest here http://sockettest.sourceforge.net/ .
I finally solved the issue with the same route settings as above. The problem was that the request and response delimiters were not configured properly. As a result, the connection was either closed too early, leading to unexpected buffer sizes, or kept waiting long after the entire buffer had been received, leading to high response times.
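To illustrate why a wrong delimiter produces both short buffers and long waits, here is a minimal sketch of delimiter-based framing over a byte stream. It is not Camel's or Netty's implementation; createFramer and the delimiter values are hypothetical, and the thread does not say which delimiter the backend actually used.

```javascript
// Minimal delimiter-based framer: feed it raw TCP chunks, get back complete messages.
// The delimiter is an assumption for illustration only.
function createFramer(delimiter = '\n') {
  let pending = ''; // data received so far that does not yet form a complete message
  return function feed(chunk) {
    pending += chunk;
    const messages = [];
    let end;
    // Emit every complete message currently sitting in the buffer.
    while ((end = pending.indexOf(delimiter)) !== -1) {
      messages.push(pending.slice(0, end));
      pending = pending.slice(end + delimiter.length);
    }
    return messages;
  };
}

const feed = createFramer('\n');
console.log(feed('HEL'));        // nothing complete yet, keep waiting
console.log(feed('LO\nWOR'));    // 'HELLO' is complete, 'WOR' stays buffered
console.log(feed('LD\nNEXT\n')); // 'WORLD' and 'NEXT' are complete
```

If the receiver expects a delimiter the sender never emits, the buffer never yields a message and the caller waits until a timeout. If the delimiter also occurs inside the payload, messages are cut short.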

ASP.Net MVC Delayed requests arriving long after client browser closed

I think I know what is happening here, but would appreciate a confirmation and/or reading material that can turn that "think" into just "know". The actual questions are at the end of the post in the TL;DR section:
Scenario:
I am in the middle of testing my MVC application for a case where one of the internal components is stalling (timeouts on connections to our database).
On one of my web pages there is a jQuery datatable which queries for an update via ajax every half a second. My current task is to display the correct error if that data request times out. To test this, I made a stored procedure that asks the DB server to wait 3 seconds before responding, which is longer than the configured timeout, so it guarantees a timeout exception for me to trap.
I am testing in Chrome browser, one client. Application is being debugged in VS2013 IIS Express
Problem:
Did not expect the following symptoms to show up when my purposeful slow down is activated:
1) After launching the page with the rigged datatable, the application slowed down in handling all requests from the client browser. There are 3 other components that send ajax update requests in parallel to the one I purposefully broke, and the same slowdown also applied to any action I took in the web application that would generate a request (like navigating to other pages). The browser's debugger showed the requests were being sent on time, but the corresponding breakpoints on the server side were getting hit much later (delays of over 10 seconds up to several minutes).
2) My server kept processing requests even after I closed the tab with the application. I closed the browser and made sure that the chrome.exe process was terminated, but breakpoints on various Controller actions were still getting hit for 20 minutes afterward, mostly on the actions that were "triggered" by automatically looping ajax requests from several pages I was trying to visit during my tests. Breakpoints were also hit on the main pages I was trying to navigate to. On the second test I used RawCap to monitor the loopback interface to make sure that nothing was still making requests in the background.
Theory I would like confirmed or denied with an alternate explanation:
So the above scenario was making looped requests at a frequency that the server couldn't handle - the client datatable loop was sending them every .5 seconds, and each one would take at least 3 seconds to generate the timeout. And obviously somewhere in IIS express there has to be a limit of how many concurrent requests it is able to handle...
What was a surprise for me was that I sort of assumed that if that limit (which I also assumed to exist) was reached, then requests would be denied - instead it appears they were queued for an absolutely useless amount of time to be processed later - I mean, under what scenario would it be useful to process a queued web request half an hour later?
So my questions so far are these:
TL;DR questions:
Does IIS Express (that comes with Visual Studio 2013) have a concurrent connection limit?
If yes :
{
Is this limit configurable somewhere, and if yes, where?
How does IIS express handle situations where that limit is reached - is that handling also configurable somewhere? ( i mean like queueing vs. immediate error like server is busy)
}
If no:
{
How does the server handle scenarios when requests are coming faster than they can be processed and can that handling be configured anywhere?
}
Here - http://www.iis.net/learn/install/installing-iis-7/iis-features-and-vista-editions
I found that IIS7 at least allows an unlimited number of simultaneous connections, but how does that actually work if the server is just not fast enough to process all requests? Can a limit be configured anywhere, along with how the server behaves when that limit is reached?
Would appreciate any links to online reading material on the above.
First, here's a brief web server 101. Production-class web servers are multithreaded, and roughly one thread = one request. You'll typically see some sort of setting for your web server called its "max requests", and this, again, roughly corresponds to how many threads it can spawn. Each thread has overhead in terms of CPU and RAM, so there's a very real upward limit to how many a web server can spawn given the resources the machine it's running on has.
When a web server reaches this limit, it does not start denying requests, but rather queues requests to be handled once threads free up. For example, suppose a web server has a max requests of 1000 (typical) and suddenly gets bombarded with 1500 requests. The first 1000 will be handled immediately and the remaining 500 will be queued until some of the initial requests have been responded to, freeing up threads and allowing some of the queued requests to be processed.
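The queueing effect can be sketched with a small simulation using the numbers from the question: a request every 0.5 seconds, each taking 3 seconds. This is a simplified model with a hypothetical simulateQueue helper and a fixed number of workers. It is not how IIS Express actually schedules requests, and the worker counts are assumptions.

```javascript
// Simplified model: requests arrive at a fixed interval, each occupies one worker
// for serviceMs, and requests that find no free worker wait in a FIFO queue.
// Returns how long each request waited before processing started.
function simulateQueue(numRequests, arrivalIntervalMs, serviceMs, workers) {
  const freeAt = new Array(workers).fill(0); // time at which each worker is next free
  const waits = [];
  for (let i = 0; i < numRequests; i++) {
    const arrival = i * arrivalIntervalMs;
    // Pick the worker that becomes free first.
    let w = 0;
    for (let j = 1; j < workers; j++) {
      if (freeAt[j] < freeAt[w]) w = j;
    }
    const start = Math.max(arrival, freeAt[w]);
    waits.push(start - arrival);
    freeAt[w] = start + serviceMs;
  }
  return waits;
}

// One worker: every request waits 2.5 s longer than the one before it.
console.log(simulateQueue(4, 500, 3000, 1));
// Six workers: capacity matches the arrival rate, so nothing waits.
console.log(simulateQueue(4, 500, 3000, 6));
```

In this model the backlog grows without bound whenever workers is smaller than serviceMs / arrivalIntervalMs, which is consistent with requests still being processed long after the browser was closed.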
A related topic area here is async, which in the context of a web application, allows threads to be returned to the "pool" when they're in a wait-state. For example, if you were talking to an API, there's a period of waiting, usually due to network latency, between sending the request and getting a response from the API. If you handled this asynchronously, then during that period, the thread could be returned to the pool to handle other requests (like those 500 queued up requests from the previous example). When the API finally responded, a thread would be returned to finish processing the request. Async allows the server to handle resources more efficiently by using threads that otherwise would be idle to handle new requests.
Then, there's the concept of client-server. In protocols like HTTP, the client makes a request and the server responds to that request. However, there's no persistent connection between the two. (This is somewhat untrue as of HTTP 1.1. Connections between the client and server are sometimes persisted, but this is only to allow faster future requests/responses, as the time it takes to initiate the connection is not a factor. However, there's no real persistent communication about the status of the client/server still in this scenario). The main point here is that if a client, like a web browser, sends a request to the server, and then the client is closed (such as closing the tab in the browser), that fact is not communicated to the server. All the server knows is that it received a request and must respond, and respond it will, even though there's technically nothing on the other end to receive it, any more. In other words, just because the browser tab has been closed, doesn't mean that the server will just stop processing the request and move on.
Then there are timeouts. Both clients and servers will have some timeout value they'll abide by. The distributed nature of the Internet (enabled by protocols like TCP/IP and HTTP) means that nodes in the network are assumed to be transient. There's no persistent connection (aside from the same note above) and network interruptions could occur between the client making a request and the server responding to it. If the client/server did not plan for this, they could simply sit there forever waiting. However, these timeouts can vary widely. A server will usually time out in responding to a request within 30 seconds (though it could potentially be set indefinitely). Clients like web browsers tend to be a bit more forgiving, having timeouts of 2 minutes or longer in some cases. When the server hits its timeout, the request will be aborted. Depending on why the timeout occurred, the client may receive various error responses. When the client times out, however, there's usually no notification to the server. That means that if the server's timeout is higher than the client's, the server will continue trying to respond, even though the client has already moved on. Closing a browser tab could be considered an immediate client timeout, but again, the server is none the wiser and keeps trying to do its job.
So, what all this boils down to is this. First, when polling (which is what you're doing by submitting an AJAX request repeatedly at some interval), you need to build in a cancellation scheme. For example, if the last 5 requests have timed out, you should stop polling, at least for some period of time. Even better would be to have the response of one AJAX request initiate the next. So, instead of using something like setInterval, you could use setTimeout and have the AJAX callback initiate it. That way, the requests only continue if the chain is unbroken. If one AJAX request fails, the polling stops immediately. However, in that scenario, you may need some fallback to re-initiate the request chain after some period of time. This prevents bombarding your already failing server endlessly with new requests. Also, there should always be some upper limit on how long polling continues. If the user leaves the tab open for days without using it, should you really keep polling the server for all that time?
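A minimal sketch of such a chained polling loop follows. The names nextPollDelay, startPolling and fetchUpdate are hypothetical, and the default values (5 failures, a 30 second cooldown, a 1 hour cap) are assumptions chosen only for illustration. fetchUpdate is assumed to return a promise or thenable, as $.ajax does.

```javascript
// Decide what happens after a poll settles. Returns the delay in ms before the
// next poll, or null when polling should stop entirely.
function nextPollDelay(consecutiveFailures, elapsedMs, opts = {}) {
  const {
    intervalMs = 500,     // normal gap between a response and the next request
    maxFailures = 5,      // failures in a row before backing off
    cooldownMs = 30000,   // pause used once maxFailures is reached
    maxTotalMs = 3600000, // upper limit on total polling time
  } = opts;
  if (elapsedMs >= maxTotalMs) return null;
  if (consecutiveFailures >= maxFailures) return cooldownMs;
  return intervalMs;
}

// Chains polls with setTimeout: the next request is scheduled only after the
// previous one has settled, so at most one request is in flight at a time.
function startPolling(fetchUpdate, onData, opts = {}) {
  const startedAt = Date.now();
  let failures = 0;
  let timer = null;
  let stopped = false;

  function poll() {
    new Promise((resolve) => resolve(fetchUpdate()))
      .then((data) => { failures = 0; onData(data); })
      .catch(() => { failures += 1; }) // also counts errors thrown by onData
      .then(() => {
        if (stopped) return;
        const delay = nextPollDelay(failures, Date.now() - startedAt, opts);
        if (delay !== null) timer = setTimeout(poll, delay);
      });
  }

  poll();
  // The returned function cancels any pending poll.
  return function stop() { stopped = true; clearTimeout(timer); };
}
```

With setInterval, a 3 second stall on the server would still produce a new request every half second. Here a stalled request simply delays the next one.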
On the server-side, you can use async with cancellation tokens. This does two things: 1) it gives your server a little more breathing room to handle more requests and 2) it provides a way to unwind the request if some portion of it should time out. More information about that can be found at: http://www.asp.net/mvc/overview/performance/using-asynchronous-methods-in-aspnet-mvc-4#CancelToken

Timeout vs no response from server, how can I separate these?

This question is regarding a bot of mine whose primary focus is scraping.
The path is mapped out correctly and it does what it needs to do.
Rate limits have been tested and I am certain they are not a factor; whenever they were, we received actual responses.
However, the webpage(s) I am trying to scrape seem to have built in some kind of unfamiliar security measure, something that I haven't come across before. I am wondering how it works and how I should deal with it appropriately.
While the scraper/bot is doing its thing, sending requests and getting responses, at random times it will encounter what I suspect is a security measure. There are simply no responses back from the server, not a 4xx error or anything at all.
At first sight the proxies just appear dead, but that's not it, because they are not. The proxies work just fine, and manually I can just browse the page on them, no issues here.
The server just stops giving responses.
Now to find a workaround for this, I would need to be able to tell the difference between a timeout (for my proxies) and a no response. They appear the same, but are not.
Does anyone have insight on this problem, maybe there is a genius way to separate those that I am not aware of.
Now to find a workaround for this, I would need to be able to tell the difference between a timeout (for my proxies) and a no response. They appear the same, but are not.
A timeout is when the server does not respond within a specific time. No response means that the server closes the connection without sending anything back, either before the timeout occurs or after it.
The first case can be easily detected by the connection close before timeout. If you want to detect instead if the server will close the connection without response only after your current timeout then your only option is to extend the timeout. There is nothing in the server which will indicate that the server will close the connection without response at some future time.
And since your only connection is with the proxy, there is no real way to detect whether the problem is at the proxy or the server. Your only hope might be to set your timeout for the proxy larger than the timeout the proxy uses when waiting for the server. This way you may get a response from the proxy indicating that the connection to the server timed out.
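The distinction drawn above can be sketched with Node's net module. The helper names classifyOutcome and probe are hypothetical. The probe only observes the connection to whatever it connects to, so with a proxy in between it reports what the proxy did, not necessarily what the origin server did.

```javascript
const net = require('net');

// Pure classification of what was observed on one connection attempt.
function classifyOutcome({ bytesReceived, peerClosed, timedOut, error }) {
  if (error) return 'error';                        // e.g. a connection reset
  if (bytesReceived > 0) return 'response';
  if (peerClosed) return 'closed-without-response'; // peer closed before our timeout
  if (timedOut) return 'timeout';                   // nothing happened within timeoutMs
  return 'unknown';
}

// Sends payload and resolves with one of the outcomes above.
function probe(host, port, payload, timeoutMs) {
  return new Promise((resolve) => {
    const seen = { bytesReceived: 0, peerClosed: false, timedOut: false, error: null };
    const socket = net.connect({ host, port }, () => socket.write(payload));
    const finish = () => { socket.destroy(); resolve(classifyOutcome(seen)); };
    socket.setTimeout(timeoutMs); // inactivity timeout; emits 'timeout' when it fires
    socket.on('data', (chunk) => { seen.bytesReceived += chunk.length; finish(); });
    socket.on('end', () => { seen.peerClosed = true; finish(); });
    socket.on('timeout', () => { seen.timedOut = true; finish(); });
    socket.on('error', (err) => { seen.error = err; finish(); });
  });
}
```

A 'timeout' result only says that nothing arrived within timeoutMs. Whether the peer would have closed the connection later can only be found out by waiting longer.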
They appear the same, but are not.
They are the same. There is no difference. A read timeout means that data didn't arrive within the timeout period. For whatever reason. TCP doesn't know, and can't tell you. At the C level, recv() returned -1 with errno == EAGAIN/EWOULDBLOCK. That's all the information there is.
What you are asking is tantamount to 'data didn't arrive: where didn't it arrive from?' It's not a meaningful question.

Signalr LongPollDelay and the buffer

We have Safari mobile clients that are affected by one of their 5 connections being blocked by SignalR. We have used the solution proposed here: https://github.com/SignalR/SignalR/issues/1406#issuecomment-14284093
Where we have these settings changed to the following for signalR 2.x
GlobalHost.Configuration.ConnectionTimeout = TimeSpan.FromMilliseconds(1000);
GlobalHost.Configuration.LongPollDelay = TimeSpan.FromMilliseconds(5000);
We are sending notifications from the server to the client with no message queue or acknowledgement framework. We don’t need to guarantee message delivery but we do want there to be a high probability of success. We think this should be possible due to our low message rate and a buffer size of 1000. However we have some questions:
Are messages held in a queue while the LongPollDelay occurs? Should they be sent during the next long poll using the settings above?
Our tests with a single message being sent during a 2 minute LongPollDelay suggest that they are not retrieved during the 1 second long poll request that follows. Are there any reasons for this, e.g. buffer flushing after 1 minute?
Does ConnectionTimeout affect all transports?
If ConnectionTimeout applies to all transports, is there a way of setting this for only Safari mobile users, e.g. have two connections available and use agent detection to point to a specific connection?
Is there a way of setting the LongPollDelay so that it also applies only to Safari mobile users?
All advice welcome and appreciated, Matt
[FOLLOW-UP QUESTIONS]
Thanks that helps a lot. We have retried with 30secs LongPollDelay and it works as expected. I have a couple of follow-up questions that you/someone might care to comment on:
1) During testing we also see the client sending a ping request to the server roughly every 5 minutes. Why is the ping period set to 5 minutes when the disconnect period is so much shorter, and what is the purpose of the client pinging the server if it assumes it is disconnected via an alternative mechanism?
2) w.r.t. Different configurations for different clients. Could we not set up another SignalR endpoint and point only Safari mobile to this? Something like the response to this post:
Can I reduce the Circular Buffer to "1"? Is that a good idea?
You are correct that SignalR will queue/buffer messages. Even if there wasn't a LongPollDelay configured, SignalR needs to do this because there is always a chance that messages are sent while clients are repolling/reconnecting.
SignalR assumes that the client has disconnected if the client hasn't been connected to the server within the last DisconnectTimeout. Once the DisconnectTimeout triggers, SignalR will call OnDisconnected and clear any message buffers belonging to the supposedly disconnected client so it doesn't leak memory. The DisconnectTimeout defaults to 30 seconds which is far less than the 2 minute LongPollDelay you configured, so that explains this behavior.
The ConnectionTimeout only affects long polling unless you've disabled keep alives. If keep alives are disabled, it applies to all transports.
There is no way to selectively configure the ConnectionTimeout for specific types of clients. But as I stated, it only affects long polling by default.
There is no way to selectively configure the LongPollDelay for specific types of clients.

Delay before sending message over socket - how does that help?

I have a tcpip socket interface to a third party software app. I've implemented this interface for several customer sites with no problem. The latest customer, though... problems. We've turned on logging in the apps on either end, and also installed Wireshark on the PC to log raw tcpip traffic. With that, we've proved that my server app successfully sends the message out, the pc receives the message, but the client app doesn't see it. (This is a totally intermittent problem, which is why it's such a pain to troubleshoot.)
The socket details are as simple as they come: one socket handling two way communications between the server and the pc. The messages are plain ascii text and fairly short (not XML). The server initiates communications by sending the first message, and then the client responds with several messages. The socket is kept open at all times while the apps are running. The client app is designed so that the end user can only process one case at a time, which prevents message collisions from happening. They have some sort of polling set up, their app "hibernates" until it sees the initiating message from the server.
The third party vendor has advised me to add a few second delay before I send them the initiating message. I can't see how that helps. If the client is "sleeping", just polling the socket waiting for a message, how does adding a delay before the first message help? It's not like we send two messages and the second one gets lost. It's losing the first message. So I don't see how it matters if we send that message now or two seconds from now.
I've asked them and they haven't given me details. It could be some proprietary details in their coding that they don't want to disclose to me, and that's fair. So I'm asking here because I'm always learning new things about socket programming. Maybe you guys can shed some light on how polling a tcpip socket can be affected by message timing?
Since it's someone else's client and they won't tell you what it's doing (other than saying 'insert a delay'), the answer is probably that their client is reading and discarding the message because it's not yet in a state to deal with it. The delay will allow the client time to get into a state where it can respond to the message properly.
In other words, the client has a race condition. One easy way this can happen is if they have one thread for reading messages and another for dealing with them.
Short of running strace(1) on the client to see what system calls it is making, it's tough to tell what the client is actually doing.
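One way such a race can look is sketched below with Node's EventEmitter. This is a purely hypothetical model of the vendor's client, not its actual code: the reader discards anything that arrives before the processing side has registered its handler.

```javascript
const { EventEmitter } = require('events');

// Hypothetical model: the client can only act on messages once its handler is
// registered. Anything delivered earlier is read and silently dropped.
function runScenario(sendBeforeReady) {
  const client = new EventEmitter();
  const seen = [];
  const becomeReady = () => client.on('message', (m) => seen.push(m));

  if (sendBeforeReady) {
    client.emit('message', 'INIT'); // no listener yet, so the message is lost
    becomeReady();
  } else {
    becomeReady();                  // the "delay" gave the client time to get ready
    client.emit('message', 'INIT');
  }
  return seen;
}

console.log(runScenario(true));  // the initiating message never reaches the handler
console.log(runScenario(false)); // the initiating message is handled
```

In this model a delay on the sender only hides the race. The robust fix would be for the client to buffer early messages until it is ready.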

Resources