Azure Application Insights Scalability - azure-application-insights

When Azure is hit with several thousand records (for load testing), Application Insights does not show all traces and exceptions. The requests are certainly sent, because the corresponding data is in the Azure table. I queried using both the Search and Analytics tabs in the Azure portal. Is this a known issue? What is the way or tool to view all traces, requests and exceptions?

Your telemetry is being sampled, either by ingestion sampling (on the Azure side) or by adaptive sampling (on the application side).
https://learn.microsoft.com/en-us/azure/application-insights/app-insights-sampling
The best recommendation I can make is to remove the AdaptiveSamplingTelemetryProcessor, or edit its parameters in the ApplicationInsights.config file to suit your needs.
Example 1
This limits the rate of ingestion to 1000 events per second, but does not apply any sampling to Dependency or Trace data:
<Add Type="Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel.AdaptiveSamplingTelemetryProcessor, Microsoft.AI.ServerTelemetryChannel">
  <MaxTelemetryItemsPerSecond>1000</MaxTelemetryItemsPerSecond>
  <ExcludedTypes>Dependency;Trace</ExcludedTypes>
</Add>
Example 2
This limits the rate of ingestion to 500 events per second for Requests only. All other types are not sampled:
<Add Type="Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel.AdaptiveSamplingTelemetryProcessor, Microsoft.AI.ServerTelemetryChannel">
  <MaxTelemetryItemsPerSecond>500</MaxTelemetryItemsPerSecond>
  <IncludedTypes>Request</IncludedTypes>
</Add>
ExcludedTypes note on GitHub
IncludedTypes note on GitHub

Related

Azure App Insights Operation count is inexplicably high

We are currently monitoring a web API using the data in the Performance page of Application Insights, to give us the number of requests received per operation.
The architecture of our API solution is to use APIM as the frontend and an App Service as the backend. Both instances have App Insights enabled, and we don't see a reasonable correlation between the number of requests to APIM and the requests to the App Service. Also, this is most noticeable only in a couple of operations.
For example,
Apim-GetUsers operation has a count of 60,000 requests per day (APIM's AI instance)
APIM App Insights Performance Page
AS-GetUsers operation has a count of 3,000,000 requests per day (App Service's AI instance)
App Service App Insights Performance Page
Apim-GetUsers routes the request to AS-GetUsers and Apim-GetUsers is the only operation that can call AS-GetUsers.
Given this, I would expect to see ~60,000 requests on the App Service's AI performance page for that operation. Instead we see that huge number.
I looked into this issue a little bit and found out about sampling and that some App Insights features use the itemCount property to find the exact number of requests. In summary,
Is my expectation correct, and if so what could cause this? Also, would disabling adaptive sampling and using a fixed sampling rate give me the expected result?
Is my expectation wrong, and if so, what is a good way to get the expected result? Should I not use the Performance page for that metric?
I haven't tried much yet, as I don't have access to change the settings until I can find a viable solution, but I looked into sampling and the itemCount property as mentioned above. APIM sampling is set to 100%.
I ran a query in Log Analytics on the requests table and when I just used the requests count, I got a number that was closer to the one I see in APIM, but when I use a sum of the itemCount, as suggested by some MS docs, I get that huge number as seen in the performance page.
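To make the difference between the two queries concrete, here is a minimal Python sketch. It is not the Application Insights API; the row layout and the numbers are made up for illustration. The assumption is the one described above: under sampling, each stored request row carries an itemCount saying how many original requests it represents, so counting rows and summing itemCount answer different questions.

```python
# Hypothetical sampled rows from a requests table. Each retained row carries
# an itemCount saying how many original requests it stands for.
rows = [
    {"name": "AS-GetUsers", "itemCount": 1},
    {"name": "AS-GetUsers", "itemCount": 50},
    {"name": "AS-GetUsers", "itemCount": 50},
]


def row_count(rows):
    """Number of records actually stored (what a plain count returns)."""
    return len(rows)


def estimated_requests(rows):
    """Estimate of the original volume (what a sum over itemCount returns)."""
    return sum(r["itemCount"] for r in rows)


stored = row_count(rows)
estimated = estimated_requests(rows)
print(stored, estimated)
```

If the two numbers differ, sampling was applied to at least some rows. Which of the two matches the real traffic depends on whether the itemCount values themselves are trustworthy, which is exactly what is in doubt here.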
List of NuGet packages and version that you are using:
Microsoft.Extensions.Logging.ApplicationInsights 2.14.0
Microsoft.ApplicationInsights.AspNetCore 2.14.0
Runtime version (e.g. net461, net48, netcoreapp2.1, netcoreapp3.1, etc. You can find this information from the *.csproj file):
netcoreapp3.1
Hosting environment (e.g. Azure Web App, App Service on Linux, Windows, Ubuntu, etc.):
App Service on Windows
Edit 1: Picture of operation_Id and itemCount

Application insights | Sometimes End-to-end transaction details do not show all telemetry

I have a .NET Core app deployed on Azure with Application Insights enabled.
Sometimes the End-to-end transaction details view in Application Insights does not display all telemetry.
Here it only logs the error and not the request. Or perhaps the request is logged, but the two are not displayed together here (this is hard to verify because many people use the app).
It should look like:
Sometimes the request is logged but with no error log.
What could be the reason for this? Do I need to look into a specific Application Insights setup or feature?
Edit:
As suggested by people here, I tried disabling the sampling feature, but it still does not work. There is an open question about this as well.
This usually happens due to sampling. By default, adaptive sampling is enabled in the ApplicationInsights.config, which basically means that only a certain percentage of each telemetry item type (Event, Request, Dependency, Exception, etc.) is sent to Application Insights. In your example, probably one part of the end-to-end transaction got sent to the server and another part got sampled out. If you want, you can turn off sampling for specific types, or completely remove the AdaptiveSamplingTelemetryProcessor from the config, which completely disables sampling. Bear in mind that this leads to higher ingestion traffic and higher costs.
You can also configure sampling in the code itself, if you prefer.
Please find here a good overview of how sampling works and can be configured.
This may be related to:
When using SDK 2.x, you have to track all events and send the telemetry to Application Insights.
When using auto-instrumentation with the 3.x agent, the agent collects traffic, logs and so on automatically, and you have to pay attention to the sampling settings in the applicationinsights.json file, where you can filter the events.
If you are using Java, the accepted logging libraries are:
- java.util.logging
- Log4j, which includes MDC properties
- SLF4J/Logback, which includes MDC properties

How do I correlate data from customEvents dataset to the requests dataset in Application Insight logs?

We built a React webapp that makes fetch calls to WebAPI2 services hosted on the same website. I've added Application Insights to the application and the server code. There are some external web requests that run in the WebAPI services and I wanted to track the timings of those calls and compare them to the overall request duration.
I can see Fetches getting populated in the requests data. I also see customEvents being recorded. The problem is that I can't seem to correlate these two datasets. None of the calls in the requests has an operation_Id that matches the operation_Id or operation_ParentId in the customEvents. I had thought that the whole purpose of these properties was to associate the calls with each other.
I saw this article that talks about some new W3C distributed tracing that can be used for correlation (https://learn.microsoft.com/en-us/azure/azure-monitor/app/correlation). I think that's for a different issue of dealing with server farms but even so, I've tried enabling those parameters without any luck either.
I also enabled enableCorsCorrelation in the JavaScript config, without any effect that I could tell. But I think that setting is only useful for correlating across different AI resources.
I am using a current version of AI's SDK. I notice that these entries come from different parts of the SDK. Our customEvents are written by dotnet:2.8.1-22898. The Fetch requests are written by web:2.8.1-19196.
Could the issue be sampling? I've tried to open the firehose: I have 100% Data Sampling on the Dashboard. I have left the defaults for the JavaScript config and applicationInsights.config on the server.
Has anyone had success correlating data in a customEvents dataset with other datasets?

Asp.Net HttpClient Performance

I'm load/stress testing using a Visual Studio load testing project with concurrent users set quite low, at 10. The application makes multiple System.Net.Http.HttpClient requests to a WebAPI service layer/application (which in some cases calls another WebAPI layer/application), and it's only getting around 30 requests/second when testing a single MVC4 controller (not parsing dependent requests).
These figures are based on a sample project where no real code is executed at any of the layers other than HttpClient calls and constructing a new collection of Models to return (so some serialization is happening too as part of WebAPI). I've tried both with and without async controllers, and the results seemed pretty similar (a slight improvement with async controllers).
Caching the calls obviously increases it dramatically (to around 400 RPS) although the nature of the data and the traffic means that a significant amount of the calls wouldn't be cached in real world scenarios.
Is HttpClient (or HTTP layer in general) simply too expensive or are there some ideas you have around getting higher throughput?
It might be worth noting that on my local development machine this load test also maxes out the CPU. Any ideas or suggestions would be very much appreciated.
HttpClient has a built-in default of 2 concurrent connections.
Have you changed the default?
See this answer
I believe if you chase the deps for HTTPClient, you will find that it relies on the settings for ServicePointManager.
See this link
You can change the default connection limit by setting System.Net.ServicePointManager.DefaultConnectionLimit.
You can increase the number of concurrent connections for HttpClient. The default connection limit is 2. To find a good starting value, try the formula 12 * the number of CPUs on your machine.
In code:
ServicePointManager.DefaultConnectionLimit = 48;
Or in the app.config or web.config file:
<system.net>
  <connectionManagement>
    <add address="*" maxconnection="48"/>
  </connectionManagement>
</system.net>
Check this link for more detail.
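As a quick check of the arithmetic behind the value 48 used above, here is a small Python sketch of the 12 * CPUs rule of thumb. The function name and the default multiplier parameter are hypothetical; the rule itself is only the heuristic quoted in this answer, not a guaranteed optimum, so treat the result as a starting point to measure against.

```python
import os


def suggested_connection_limit(cpu_count, per_cpu=12):
    """Rule of thumb from the answer above: 12 * number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return per_cpu * cpu_count


# A 4-CPU machine gives the value used in the snippets above.
limit_for_four_cpus = suggested_connection_limit(4)
print(limit_for_four_cpus)

# On the current machine (os.cpu_count() can return None, hence the fallback).
print(suggested_connection_limit(os.cpu_count() or 1))
```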

Apigee spike arrest applies to each API bundle or all API bundles

When I add a spike arrest policy, as pasted below, to my Apigee APIs, does it count all the API calls from that client IP to Apigee to calculate whether the limit was exceeded? Or does it maintain a count per API individually and apply the policy per API/API bundle?
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<SpikeArrest enabled="true" continueOnError="true" async="false" name="SpikeArrestCheck">
  <DisplayName>Spike Arrest Policy</DisplayName>
  <FaultRules/>
  <Properties/>
  <Identifier ref="proxy.client.ip"/>
  <Rate>100ps</Rate>
</SpikeArrest>
Count is maintained per API bundle, per policy name (org and env is a given). Even if you use the same Identifier across bundles, there is no way to tie different API bundle spike arrests together.
I tested this with a SpikeArrest policy by observing the value of ratelimit.<spike arrest policy name>.used.count across 2 different API bundles, with both policies having the same name and the same Identifier. The 2 buckets/counters are treated independently.
You can set a spike arrest identifier like this:
<SpikeArrest name="SpikeArrest">
  <Rate>10ps</Rate>
  <Identifier ref="someVariable" />
</SpikeArrest>
The scope of the spike arrest policy above is limited to the current organization, environment, bundle, and policy name. No traffic traveling through a different policy, bundle, environment, or organization will affect the spike arresting of the above policy. In addition, since an identifier is specified, only traffic that has the same value stored in "someVariable" will be "counted" together. If the policy had no identifier specified, all traffic for the same policy, bundle, environment and organization would be counted together.
Note that spike arrests are tracked separately per message processor. They are also currently implemented as rate limiting, not a count. If you specify 100 per second, it means that your requests can only come in one per 10 ms (1/100 sec). A second request within 10 ms on the same message processor will be rejected. A small number is generally not recommended. Even with a large number, if two requests come in nearly simultaneously to the same message processor, one will be rejected.
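The "one request per 10 ms" behaviour can be illustrated with a toy model in Python. This is only a sketch of the description above, not Apigee's actual implementation: the class name is hypothetical, it models a single message processor, and it assumes that a rejected request does not restart the interval.

```python
class IntervalLimiter:
    """Toy model: allow at most one request per (1 / rate_per_second) seconds."""

    def __init__(self, rate_per_second):
        # 100 per second -> a minimum gap of 10 ms between allowed requests.
        self.min_gap_ms = 1000.0 / rate_per_second
        self.last_allowed_ms = None

    def allow(self, now_ms):
        """Return True if the request arriving at now_ms is let through."""
        if (self.last_allowed_ms is None
                or now_ms - self.last_allowed_ms >= self.min_gap_ms):
            self.last_allowed_ms = now_ms
            return True
        return False  # arrived too soon after the previous allowed request


limiter = IntervalLimiter(100)      # models <Rate>100ps</Rate>
arrivals_ms = [0, 3, 10, 12, 25]    # made-up arrival times in milliseconds
decisions = [limiter.allow(t) for t in arrivals_ms]
print(decisions)
```

Only 5 requests arrive in 25 ms, far below 100 in a second, yet the two that land within 10 ms of an allowed request are rejected. That is the practical difference between smoothing a rate and counting requests per second.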
Some observations for best practice:
Ideally you should track traffic to your API based on a key that is static regardless of the source. Using an IP address as the key is too broad for consumer apps: each mobile device will have a different IP address assigned to it, so Spike Arrest policies never trigger. As a best practice, retrieve the consumer key either through the OAuthV2 policy after validating the token, or directly when the key is provided in the request. The exception to the rule is an API that is not publicly accessible to consumer apps, where access is provided to app servers only; even then you may want to manage traffic by implementing key verification.
The counter "bucket" is determined by how you use Identifier. If you don't specify Identifier, then the bucket is the entire API Proxy. Or you can use Identifier ref to make a more granular bucket. For example, if you wanted to make the bucket be per-developer (assuming you previously did a VerifyApiKey or VerifyAccessToken), you would do this:
<Identifier ref="client_id" />
And if you wanted to, you could set the bucket to be based on ip address by doing this:
<Identifier ref="client.ip"/>
So the way you did it, it would be per-client ip.
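The scoping described in these answers can be pictured as a counter keyed by organization, environment, bundle, policy name and identifier value. The Python sketch below only illustrates that idea; the names (acme, orders-api, users-api) and the traffic are made up, and it says nothing about how Apigee stores its counters internally.

```python
from collections import Counter


def bucket_key(org, env, bundle, policy, identifier=None):
    """Requests are counted together only if every part of the key matches.

    identifier is the resolved value of the Identifier ref (for example a
    client IP); None models a policy with no Identifier at all.
    """
    return (org, env, bundle, policy, identifier)


# Hypothetical traffic: (org, env, bundle, policy name, client IP).
traffic = [
    ("acme", "prod", "orders-api", "SpikeArrestCheck", "10.0.0.1"),
    ("acme", "prod", "orders-api", "SpikeArrestCheck", "10.0.0.1"),
    ("acme", "prod", "users-api", "SpikeArrestCheck", "10.0.0.1"),
    ("acme", "prod", "orders-api", "SpikeArrestCheck", "10.0.0.2"),
]

counts = Counter(bucket_key(*request) for request in traffic)
for key, count in counts.items():
    print(key, count)
```

The same client IP hitting two bundles lands in two separate buckets, even though the policy name and Identifier are identical, which matches the test result reported above.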

Resources