ASP.Net API App - continual HTTP 502.3 errors - asp.net

My team and I have been at this for 4 full days now, analyzing every log available to us, Azure Application Insights, you name it, we've analyzed it. And we can not get down to the cause of this issue.
We have a customer who is integrated with our API to make search calls and they are complaining of intermittent but continual 502.3 Bad Gateway errors.
Here is the flow of our architecture:
All resources are in Azure. The endpoint our customers call is a .NET Framework 4.7 Web App Service in Azure that acts as the stateless handler for all the API calls and responses.
This API app sends the calls to an Azure Service Fabric Cluster - that cluster load balances on the way in and distributes the API calls to our Search Service Application. The Search Service Application then generates and ElasticSearch query from the API call, and sends that query to our ElasticSearch cluster.
ElasticSearch then sends the results back to Service Fabric, and the process reverses from there until the results are sent back to the customer from the API endpoint.
What may separate our process from a typical API is that our response payload can be relatively large, based on the search. On average these last several days, the payload of a single response can be anywhere from 6MB to 12MB. Our searches simply return a lot of data from ElasticSearch. In any case, a normal search is typically executed and returned in 15 seconds or less. As of right now, we have already increased our timeout window to 5 minutes just to try to handle what is happening and reduce timeout errors for the fact their searches are taking so long. However, we increased the timeout via the following code in Startup.cs:
services.AddSingleton<HttpClient>(s => {
return new HttpClient() { Timeout = TimeSpan.FromSeconds(300) };
});
I've read in some places that you actually have to do this in the web.config file as opposed to here, or at least in addition to it. Not sure if this is true?
So The customer who is getting the 502.3 errors have significantly increased the volumes they are sending us over the last week, but we believe we are fully scaled to be able to handle it. They are still trying to put the issue on us, but after many days of research, I'm starting to wonder if the problem is actually on their side. Could it be possible that they are not equipped to take the increased payload on their side. Can it be that their integration architecture is not scaled enough to take the return payload from the increased volumes? When we observe our resources usages (CPU/RAM/IO) on all of the above applications, they are all normal - all below 50%. This also makes me wonder if this is on their side.
I know it's a bit of a subjective question, but I'm hoping for some insight from someone who may have experienced this before, but even more importantly, from someone who has experience with a .Net API app in Azure which return large datasets in it's responses.
Any code blocks of our API app, or screenshots from Application Insights are available to post upon request - just not sure what exactly anyone would want to see yet as I type this.

Related

Application insights | Sometimes End-to-end transaction details do not show all telemetry

I have .Net core App deployed on azure and enabled application insights.
Sometimes Azure application insights End-to-end transaction details do not display all telemetry.
Here it only logs the error and not request or maybe request logged but both do not display together over here(difficult to find out due to many people use it)
Should be like:
Sometimes request log but with no error log.
What could be the reason for happening this? do I need to look into application insights specific set-up/feature?
Edit:
As suggested by people here, try to disable the Sampling feature but still not works, Here is open question as well.
This usually happens due to sampling. By default, adaptive sampling is enabled in the ApplicationInsights.config which basically means that only a certain percentage of each telemetry item type (Event, Request, Dependency, Exception, etc.) is sent to Application insights. In your example probably one part of the end to end transaction got sent to the server, another part got sampled out. If you want, you can turn off sampling for specific types, or completely remove the
AdaptiveSamplingTelemetryProcessor
from the config which completely disables sampling. Bear in mind that this leads to higher ingestion traffic and higher costs.
You can also configure sampling in the code itself, if you prefer.
Please find here a good overview of how sampling works and can be configured.
This may be related to :
When using SDK 2.x, you have to track all events and send the telemetries to Application insights
When using auto-instrumentation with 3.x agent, in this case the agent collect automatically the traffic, logs ... and you have to pay attention to the sampling file applicationinsights.json where you can filter the events.
If you are using java, below the accepted Logging libraries :
-java.util.logging
-Log4j, which includes MDC properties
-SLF4J/Logback, which includes MDC properties

Microservice Design with SignalR

We have built a microservice architecture but have run into an issue where the messages going on to the bus are too large. (Discovered since moving to Azure Service bus as this only allows 256KB compared to RabbitMQ 4MB)
We have a design as the below diagram. Where we're struggling is with the data being returned.
An example is when performing a search and returning multiple results.
To step through our current process:
Web client sends a http request to the Web Api.
Web api then puts appropriate message on to the bus. (Web api responds to client with an Accepted response)
Microservice picks up this message.
Microservice queries its database for the records matching search criteria.
Results returned from database.
A SearchResult message is added to the bus. (This contains the results)
Our response microservice is listening for this SearchResult message.
The response microservice then posts to our SignalR api.
SignalR Api sends the results back to the web client.
My question is how do we deal with large results sets when designed in this way? If it's not possible how should the design be changed to handle large results sets?
I understand we could page the results but even so one result could be over the 256KB allowance, for example a document or a particularly large object.
There are 2 ways :-
Use Kafka like system which support large size messages.
If you can't go with the 1st approach (that's appear from your question), then Microservices can place 2 types of messages for response service
(1.) If size is small then place the complete message and
(2.) If size is more than supported then place message that contain link to Azure Storage Blob which have result
Based on message, response service can get proper result and return the same to Client.

Understanding App Insights end to end for occassional long response times

Background: I have an ASP.NET Core App and have an API method that takes a file name of a blob that the frontend has uploaded to Azure Blob. It then needs to create a thumbnail version of the blob and return the name of the newly uploaded thumbnail Blob. Sometimes, for exactly the same file size it can take up to 40 seconds to complete. Mostly, it's around 400ms.
Below is the end to end from App Insights, I have a few things I don't understand:
1) The request duration is 37.5 s but yet the other operations add up to nowhere near this time
2) Why are there calls to master db? We are using EF6 with multiple contexts
3) The app is using an Azure App Service and SQL Azure. I don't understand why the response time is so inconsistent.
Any help would be much appreciated!
I've noticed multiple time that the first request after an application is deployed to Azure or after a long period that no requests were made to the application, it takes significantly longer to get a response.
As far as I remember it was related to start-up time of the site (if you're using an App Service on Windows based underlying VM it still uses IIS as a reverse proxy).
I solved the issue by configuring health checks that occasionally perform requests to the app.
Also, in addition to Application Insights (which logs information only after the application has started), you can try the tools listed here to see more information.
Hope it helps!
1.
The way the request timeline is displayed gives you only the time-span for the whole request (37.5s) and the individual time-spans for each dependency.
A dependency being another call that sends its run-time to the application insights.
In your example each call to the database is automatically tracked as a dependency. The code running after each database call is not though.
So e.g. requesting a database entry which takes 200ms and then issuing a Thread.Sleep of 2 seconds and requesting another database entry which takes 300ms would result in a 2 second gap between the two database-call dependencies which will each be listed with 200/300ms respectively.
You can use TelemetryClient.TrackDependency to wrap parts of your own code into its own dependency. This way you will see your own code as an entry on the request timeline.
2.
Depending on your EntityFramework database-initialisier EF will connect to the master db on context creation. (E.g. to create the database if it does not exist).
3.
Try tracking your own code to find out what parts of it are slow. EF has a few performance issues to consider, try to understand the performance caveats of the libs you use. If your calls are inconsistently slow it might be an issue with resources being over-utilized or caches being emptied too early (like for EF warm vs. cold queries).

asp.net infinite loop - can this be done?

This question is about limits imposed to me by ASP.NET (like script timeout etc').
I have a service running under ASP.NET and I want to create a counterpart service for monitoring.
The main service's data is located at a database.
I was thinking about having the monitor service query the database in intervals of 1 second, within a loop, issued by an http request done by the remote client.
Now the actual serving of this monitoring will be done by a client http request, which will make the script loop (written in C#) and when new data is detected it'll aggregate that data into that one looping request output buffer, send it, and exit the loop, thus finishing the request.
The client will have to issue a new request in order to keep getting updates.
This is actually exactly like TCP (precisely like Windows IOCP); You request the service for data and wait for it. When it arrives you fire another request.
My actual question is: Have you done it before? How did it go? Am I limited by some (configurable) limits imposed by the IIS/ASP.NET framework? What are my limits in such situation, or, what are better options without complicating things too much?
Note that I do not expect many such monitoring requests at a time, maybe a few dozens.
This means however that 10 such concurrent monitoring requests will keep 10 threads busy, and the question is; Can it hurt IIS/performance? How will IIS handle 10 busy threads? Will it issue more? What are the limits? This is just one example of a limit I can think of.
I think you main concern in this situation would be timeouts, which are pretty much configurable. But I think that it is a wrong solution - you'd be better of with some background service, running constantly/periodically, and writing the monitoring data to some data store and then your monitoring page would just return it upon request.
if you want your page to display something only if the monitorign data is available- implement it with ajax - on page load query monitoring service, then if some monitoring events are available- render them, if not- sleep and query again.
IMO this would be a much better solution than a reallu long running requests.
I think it won't be a very good idea to monitor a service using ASP.NET due to the following reasons...
What happens when your application pool crashes?
What if you decide to do IISReset? Which application will come up first... the main app, or the monitoring app?
What if the monitoring application hangs due to load?
What if the load is already high on the Main Service. Wouldn't monitoring it every 1 sec, increase the load on the Primary Service, as well as IIS?
You get the idea...

Using a remote, external web service instead of a database

I am building an ASP.NET web application that will be deployed to a 4-node web farm.
My web application's farm is located in California.
Instead of a database for back-end data, I plan to use a set of web services served from a data center in New York.
I have a page /show-web-service-result.aspx that works like this:
1) User requests page /show-web-service-result.aspx?s=foo
2) Page's codebehind queries a web service that is hosted by the third party in New York.
3) When web service returns, the returned data is formatted and displayed to user in page response.
Does this architecture have potential scalability problems? Suppose I am getting hundreds of unique hits per second, e.g.
/show-web-service-result.aspx?s=foo1
/show-web-service-result.aspx?s=foo2
/show-web-service-result.aspx?s=foo3
etc...
Is it typical for web servers in a farm to be using web services for data instead of database? Any personal experience?
What change should I make to the architecture to improve scalability?
You have most definitely a scalability problem: the third-party web service. Unless you have a service-level agreement with that service (agreeing on the number of requests that you can submit per second), chances are real that you overload that service with your anticipated load. That you have four nodes yourself doesn't help you then.
So you should a) come up with an agreement with the third party, and b) test what the actual load is that they can take.
In addition, you need to make sure that your framework can use parallel connections for accessing the remote service. Suppose you have a round-trip time of 20ms from California to New York (which would be fairly good), you can not make more than 50 requests over a single TCP connection. Likewise, starting new TCP connections for every request will also kill performance, so you want pooling on these parallel connections.
I don't see a problem with this approach, we use it quite a bit where I work. However, here are some things to consider:
Is your page rendering going to be blocked while waiting for the web service to respond?
What if the response never comes, i.e. the service is down?
For the first problem I would look into using AJAX to update the page after you get a response back from the web service. You'll also want to consider how to handle the no response or timeout condition.
Finally, you should really think about how you could cache the web service data locally. For example if you are calling a stock quoting service then unless you have a real-time feed, there is no reason to call the web service with every request you get. Store the data locally for a period of time and return that until it becomes stale.
You may have scalability problems but most of these can be carefully engineered around.
I recommend you use ASP.NET's asynchronous tasks so that the web service is queued up, the thread is released while the request waits for the web service to respond, and then another thread picks up when the web service is done to finish off the request.
MSDN Magazine - Wicked Code - Asynchronous Pages in ASP.NET 2.0
Local caching is an absolute must. The fewer times you have to go from California to New York, the better. You might want to look into Microsoft's Velocity (although that's still in CTP) or NCache, or another distributed cache, so that each of your 4 web servers don't all have to make and cache the same data from the web service - once one server gets it, it should be available to all.
Microsoft Project Code Named "Velocity"
NCache
Other things that can go wrong that you should engineer around:
The web service is down (obviously) and data falls out of cache, and you can't get it back. Try to make it so that the data is not actually dropped from cache until you're sure you have an update available. Then the only risk is if the service is down and your application pool is reset, so don't reset it as a first-line troubleshooting maneuver!
There are two different timeouts on web requests, a connect and an overall timeout. Make sure both are set extremely low and you handle both of them timing out. If the service's DNS goes down, this can look like quite a different failure.
Watch perfmon for ASP.NET Queued Requests. This number will rise rapidly if the service goes down and you're not covering it properly.
Research and adjust ASP.NET performance registry settings so you have a highly optimized ASP.NET thread pool. I don't remember the specifics, but I seem to remember that there's a limit on IO Completion Ports and something else of that nature that are absurdly low for the powerful hardware I'm assuming you have on hand.
the trendy answer is REST. Any GET request can be HTTP Response cached (with lots of options on how that is configured) and it will be cached by the internet itself (your ISP, essentially).
Your project has an architecture that reflects they direction that Microsoft and many others in the SOA world want to take us. That said, many people try to avoid this type of real-time risk introduced by the web service.
Your system will have a huge dependency on the web service working in an efficient manner. If it doesn't work, or is slow, people will just see that your page isn't working properly.
At the very least, I would get a web stress tool and performance test your web service to at least the traffic levels you expect to get at peaks, and likely beyond this. When does it break (if ever?), when does it start to slow down? These are good metrics to know.
Other options to look at: perhaps you can get daily batches of data from the web service to a local database and hit the database for your web site. Then, if for some reason the web service is down or slow, you could use the most recently obtained data (if this is feasible for your data).
Overall, it should be doable, but you want to understand and measure the risks, and explore any potential options to minimize those risks.
It's fine. There are some scalability issues. Primarily, with the number of calls you are allowed to make to the external web service per second. Some web services (Yahoo shopping for example) limit how often you can call their service and will lock out your account if you call too often. If you have a large farm and lots of traffic, you might have to throttle your requests.
Also, it's typical in these situations to use an interstitial page that forks off a worker thread to go and do the web service call and redirects to the results page when the call returns. (Think a travel site when you do search, you get an interstitial page while they call out to an external source for the flight data and then you get redirected to a results page when the call completes). This may be unnecessary if your web service call returns quickly.
I recommend you be certain to use WCF, and not the legacy ASMX web services technology as the client. Use "Add Service Reference" instead of "Add Web Reference".
One other issue you need to consider, depending on the type of application and/or data you're pulling down: security.
Specifically, I'm referring to authentication and authorization, both of your end users, and the web application itself. Where are these things handled? All in the web app? by the WS? Or maybe the front-end app is authenticating the users, and flowing the user's identity to the back end WS, allowing that to verify that the user is allowed? How do you verify this? Since many other responders here mention a local data cache on the front end app (an EXCELLENT idea, BTW), this gets even MORE complicated: do you cache data that is allowed to userA, but not for userB? if so, how do you verify that userB cannot access data from the cache? What if the authorization is checked by the WS, how do you cache the permissions then?
On the other hand, how are you verifying that only your web app is allowed to access the WS (and an attacker doesn't directly access your WS data over the Internet, for instance)? For that matter, how do you ensure that your web app contacts the CORRECT WS server, and not a bogus one? And of course I assume that all the connection to the WS is only over TLS/SSL... (but of course also programmatically verify the cert applies to the accessed server...)
In short, its complicated, and many elements to consider here.... but it is NOT insurmountable.
(as far as input validation goes, that's actually NOT an issue, since this should be done by BOTH the front end app AND the back end WS...)
Another aspect here, as mentioned by #Martin, is the need for an SLA on whatever provider/hosting service you have for the NY WS, not just for performance, but also to cover availability. I.e. what happens if the server is inaccessible how quickly they commit to getting it back up, what happens if its down for extended periods of time, etc. That's the only way to legitimately transfer the risk of your availability being controlled by an externality.

Resources