Why does AWS CodeDeploy AllowTraffic take 3 minutes per instance?

I recently started using AWS CodeDeploy and noticed that the AllowTraffic step consistently takes between 3 and 4 minutes per instance. I've configured the health check interval to be 10 seconds and the health threshold to be 2, so I expect it to take 20 seconds. I'm using a Network Load Balancer.
I have polled the NLB's target group using describe-target-health and confirmed that the target is in the initial state for the 3+ minutes that CodeDeploy is waiting. I have also confirmed that the server on the health check port is responsive at the very beginning of those three minutes.
What are other possible reasons for CodeDeploy / NLB to be so slow?

The extra time you are seeing is not due to the health checks but because the initial registration of a target with the NLB takes time.
When you register a new target with your Network Load Balancer, it is expected to take between 30 and 90 seconds (and can go up to 120 seconds) to complete the registration process. After registration is complete, the Network Load Balancer health check systems begin sending health checks to the target. A newly registered target must then pass the configured number of health checks to enter service and receive traffic.
For example, if you configure your health check with a 10-second interval and require 2 health checks to become healthy, the minimum time before an instance starts receiving traffic is 30-120 seconds (registration) + 20 seconds (health checks).
An ALB is not affected by this initial registration delay, so it registers instances much faster; this is just how the NLB operates at this point in time.
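If you want to confirm where the time goes, you can poll the target group while CodeDeploy is waiting and measure how long the target sits in the initial state before its first healthy result. Below is a rough sketch using boto3; credentials and region are assumed to be configured, and the target group ARN and instance ID are placeholders.

```python
import time

import boto3

elbv2 = boto3.client("elbv2")

# Placeholders: substitute your own target group ARN and instance ID.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef"
INSTANCE_ID = "i-0123456789abcdef0"

start = time.time()
while True:
    resp = elbv2.describe_target_health(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": INSTANCE_ID}],
    )
    state = resp["TargetHealthDescriptions"][0]["TargetHealth"]["State"]
    print(f"{time.time() - start:6.1f}s  {state}")
    if state == "healthy":
        break  # registration + health checks are done; AllowTraffic can complete
    time.sleep(5)
```

If the registration delay described above is the cause, the target will report initial for a couple of minutes before the health checks even start counting toward the healthy threshold.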

Related

Tornado based python game server latency problem on amazon cloud

I have a Tornado-based game server hosted on an Amazon cloud instance with 1 GB RAM and a 10 GB hard disk. I have 500 users per day and 30+ concurrent users at any given time. Users are based around the world, and I host the cloud machine in US West since most of the users are from the USA.
I am facing a network latency issue. With a single user the response time is 1 second, which is already high, but as users grow to 10+ the response time degrades to 2 seconds, and for 50+ users it is 8 seconds.
I wrote a test script and ran some tests:
Test 1: master code and test script both running on my local machine: latency under 1000 ms (90% median 220 ms).
Test 2: the same code and the test script both running on the cloud machine: same result.
Test 3: game server running on the cloud and the test script running locally: latency of 8 seconds.
1. Network I/O
If you're doing network-bound operations in your request handlers (such as connecting to a database or sending requests to another API), then use await for those tasks so that Python can suspend the coroutine and process other requests concurrently.
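As a rough sketch of what that looks like in a handler (the upstream URL and handler name here are placeholders, not part of your setup):

```python
import tornado.web
from tornado.httpclient import AsyncHTTPClient

class LeaderboardHandler(tornado.web.RequestHandler):
    async def get(self):
        client = AsyncHTTPClient()
        # Awaiting the fetch suspends this coroutine while the request is in
        # flight, so the IOLoop can keep serving other players in the meantime.
        response = await client.fetch("https://example.com/scores")  # placeholder URL
        self.write(response.body)
```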
2. CPU tasks and Disk I/O
If you're doing CPU-bound operations (any code other than network I/O that takes more than a few milliseconds to run), then use IOLoop.run_in_executor to run those tasks in a separate thread. This frees up the main thread so the IOLoop can keep serving other requests.
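A minimal sketch of that pattern, with a made-up compute_move function standing in for your actual game logic:

```python
from concurrent.futures import ThreadPoolExecutor

import tornado.web
from tornado.ioloop import IOLoop

executor = ThreadPoolExecutor(max_workers=4)

def compute_move(state):
    # Placeholder for CPU-heavy game logic (pathfinding, physics, scoring, ...).
    return sum(ord(c) for c in state) % 8

class MoveHandler(tornado.web.RequestHandler):
    async def post(self):
        state = self.get_body_argument("state", "")
        # Run the CPU-bound work on a worker thread so the IOLoop (and with it
        # every other pending request) is not blocked while it runs.
        move = await IOLoop.current().run_in_executor(executor, compute_move, state)
        self.write({"move": move})
```

Note that a ThreadPoolExecutor still competes for the GIL, so for genuinely heavy computation a ProcessPoolExecutor (or moving the work out of the request path entirely) may be the better fit.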
Tornado is a single-threaded server. That means the CPU can run only one thing at a given time. So, if a single response takes 220ms of CPU time, and if you have 10 connected clients, it would take over 2 seconds to serve the 10th client. And this time just increases (though not always proportionally, as Python may reuse the CPU cache) as the number of clients goes up.
It appears the CPU on the cloud server is not as fast as your personal CPU, hence the increased latency.

Latency spikes on outgoing requests from a cloud run service

I have a Node.js service running on Cloud Run (Docker image based on node:14-alpine; europe-west1 region for both Cloud Run and the Firebase RTDB) with an average load of ~1 req/sec. Some requests experience high latency (a few seconds) on outgoing HTTP requests to the Homegraph API or the Firebase Realtime Database.
The latency percentiles are:
50%: < 0.21 s
95%: < 1 s
99%: < 2 s
Things I've tried
increasing the memory and the CPUs assigned to the service
setting 1 instance always available (Minimum number of instances)
using keepAlive in nodejs - this does improve things, but not that much
creating a pool of Firebase applications so each request has its own app/db connection
What can/should I try next? Is there an issue with Cloud Run?
For example: a 1 sec call to Homegraph, a 4 sec call to the Firebase DB (getUserDeviceTraits), and a cold start (loading secrets, etc.). The corresponding trace screenshots are not reproduced here.
Later edit (18 Oct 2021):
It simply resolved itself after a while, and now it has started happening again, about a week ago. After months where the 99th percentile was 1 sec at most and the average was ~200 ms, the 99th percentile now has spikes of up to 12 sec. I have already tried updating all the dependencies (just in case). It's definitely not code related, since I made no changes in that period.

Issues with NCache cluster cache running on a single instance, after the other instance goes offline

We have a replicated cluster cache setup with two instances; everything runs well when both instances are online. We are using Community Edition 4.8.
When we take one instance offline, cache management becomes very slow, and even stopping and starting the cache from the NCache Manager GUI takes a very long time and then shows a message stating that an instance is unreachable.
Also, when trying to fetch data from the cache or add data to it, it throws an operation timeout exception, and there is no response from the single instance that is still running.
From what I understand, this scenario should be handled by the cache service itself, since it is replicated, and it should handle the failure of an instance going offline.
Thank you,
I would like to explain the cause of the slowness in your application when one of the server nodes is removed from the cache cluster.
What happens is that whenever a node is removed from the cache cluster, the surviving node(s) go into a recovery process and try to re-establish the connection with the downed server node. By default this connection retry value is set to "2", which means the surviving nodes will try to reconnect with the downed node two times; after the reconnection has failed, the cache cluster considers the downed server offline and starts handling requests as before. Each reconnection attempt can take up to 90 seconds, as this is the default TCP/IP timeout interval, so with the connection retry set to "2" the recovery process can take up to around 200 seconds. Your application (or NCache Manager calls) can experience slowness or request timeouts during this 2-3 minute window while the cluster is in recovery mode, but once the recovery process is finished the application should start working again without any issues. If the slowness or request timeouts last more than a few minutes, the cause is likely something other than this recovery window.
The connection retry value can be changed in the NCache "Config.ncconf" file. Increasing the number of connection retries means the cluster spends more time in the recovery process. The purpose of this feature is that if there is a network glitch in the environment and the server nodes lose their connection with each other, the servers get reconnected automatically through this recovery process. This is why it is recommended to keep the connection retry value set to at least 1.

JMS Connection Latency

I am examining an application where a JBOSS Application Server communicates with satellite JBOSS Application Servers (1 main server, hundreds of satellites).
When observing in the Windows Resource Monitor I can view the connections and see the latency per satellite: most are sub-second, but I see 10 over 1 second, of those 4 over 2 seconds, and 1 over 4 seconds. This is a "moment in time" view, so as connections expire and are rebuilt when needed, the trend can shift. I observe that the same pair of systems has a ping latency matching what is seen on the connection list, so I suspect it is connection related (slow pipe, congestion, or anything in the line between points A and B).
My question is what a target latency should be, keeping in mind the satellites are connected over VPN from various field sites. I currently use 2 seconds as the dividing line for when I ask the network team to investigate. I'd like to survey what rule of thumb you use to decide when the latency for a transient connection is becoming a problem: is it anything over a second?

Odd Asp.Net threadpool sizing behavior

I am load testing a .NET 4.0 MVC application hosted on IIS 7.5 (default config, in particular processModel autoConfig=true), and am observing odd behavior in how .NET manages the threads.
http://msdn.microsoft.com/en-us/library/0ka9477y(v=vs.100).aspx mentions that "When a minimum is reached, the thread pool can create additional threads or wait until some tasks complete".
It seems the duration that threads are blocked for, plays a role in whether it creates new threads or waits for tasks to complete. Not necessarily resulting in optimal throughput.
Question: Is there any way to control that behavior, so threads are generated as needed and request queuing is minimized?
Observation:
I ran two tests on a test controller action that does not do much besides Thread.Sleep for an arbitrary time.
50 requests/second with the page sleeping 1 second
5 requests/second with the page sleeping for 10 seconds
For both cases .Net would ideally use 50 threads to keep up with incoming traffic. What I observe is that in the first case it does not do that, instead it chugs along executing some 20 odd requests concurrently, letting the incoming requests queue up. In the second case threads seem to be added as needed.
Both tests generated traffic for 100 seconds. Here are corresponding perfmon screenshots.
In both cases the Requests Queued counter is highlighted (note the 0.01 scaling)
50/sec Test
For most of the test 22 requests are handled concurrently (turquoise line). As each takes about a second, that means almost 30 requests/sec queue up, until the test stops generating load after 100 seconds and the queue gets slowly worked off. The concurrency briefly jumps to just above 40, but never to 50, the minimum needed to keep up with the traffic at hand.
It is almost as if the threadpool management algorithm determines that it doesn't make sense to create new threads, because it has a history of ~22 tasks completing (i.e. threads becoming available) per second. Completely ignoring the fact that it has a queue of some 2800 requests waiting to be handled.
5/sec Test
Conversely in the 5/sec test threads are added at a steady rate (red line). The server falls behind initially, and requests do queue up, but no more than 52, and eventually enough threads are added for the queue to be worked off with more than 70 requests executing concurrently, even while load is still being generated.
Of course the workload is higher in the 50/sec test, as 10x the number of http requests is being handled, but the server has no problem at all handling that traffic, once the threadpool is primed with enough threads (e.g. by running the 5/sec test).
It just seems to not be able to deal with a sudden burst of traffic, because it decides not to add any more threads to deal with the load (it would rather throw 503 errors than add more threads in this scenario, it seems). I find this hard to believe, as a 50 requests/second traffic burst is surely something IIS is supposed to be able to handle on a 16 core machine. Is there some setting that would nudge the threadpool towards erring slightly more on the side of creating new threads, rather than waiting for tasks to complete?
Looks like it's a known issue:
"Microsoft recommends that you tune the minimum number of threads only when there is load on the Web server for only short periods (0 to 10 minutes). In these cases, the ThreadPool does not have enough time to reach the optimal level of threads to handle the load."
Exactly describes the situation at hand.
Solution: slightly increase minWorkerThreads in machine.config to handle the expected traffic burst (a value of 4 gives us 64 threads on the 16-core machine, since the setting is per core).
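For reference, a sketch of what that tweak can look like; the value matches the example above, and since processModel is only honored in machine.config (not web.config) and its interaction with autoConfig can vary by framework version, it is worth verifying the effective values with ThreadPool.GetMinThreads after the change.

```xml
<!-- machine.config sketch: minWorkerThreads is a per-CPU-core value, so 4 on a
     16-core machine lets the pool grow to 64 worker threads without its usual
     slow ramp-up when a burst of requests arrives. -->
<system.web>
  <processModel autoConfig="true" minWorkerThreads="4" />
</system.web>
```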
