We are currently monitoring a web API using the data in the Performance page of Application Insights, to give us the number of requests received per operation.
The architecture of our API solution is to use APIM as the frontend and an App Service as the backend. Both instances have App Insights enabled, and we don't see a reasonable correlation between the number of requests to APIM and the requests to the App Service. Also, this is most noticeable only in a couple of operations.
For example,
Apim-GetUsers operation has a count of 60,000 requests per day (APIM's AI instance)
APIM App Insights Performance Page
AS-GetUsers operation has a count of 3,000,000 requests per day (App Service's AI instance)
App Service App Insights Performance Page
Apim-GetUsers routes the request to AS-GetUsers and Apim-GetUsers is the only operation that can call AS-GetUsers.
Given this, I would expect to see ~60,000 requests on the App Service's AI performance page for that operation, instead we see that huge number.
I looked into this issue a little bit and found out about sampling and that some App Insights features use the itemCount property to find the exact number of requests. In summary,
Is my expectation correct, and if so what could cause this? Also, would disabling adaptive sampling and using a fixed sampling rate give me the expected result?
Is my expectation wrong, and if so, what is a good way to get the expected result? Should I not use the Performance page for that metric?
Haven't tried a whole lot yet as I don't have access to play with the settings until I can find a viable solution, but I looked into sampling and itemCount property as mentioned above. APIM sampling is set to 100%.
I ran a query in Log Analytics on the requests table and when I just used the requests count, I got a number that was closer to the one I see in APIM, but when I use a sum of the itemCount, as suggested by some MS docs, I get that huge number as seen in the performance page.
List of NuGet packages and version that you are using:
Microsoft.Extensions.Logging.ApplicationInsights 2.14.0
Microsoft.ApplicationInsights.AspNetCore 2.14.0
Runtime version (e.g. net461, net48, netcoreapp2.1, netcoreapp3.1, etc. You can find this information from the *.csproj file):
netcoreapp3.1
Hosting environment (e.g. Azure Web App, App Service on Linux, Windows, Ubuntu, etc.):
App Service on Windows
Edit 1: Picture of operation_Id and itemCount
Related
I have .Net core App deployed on azure and enabled application insights.
Sometimes Azure application insights End-to-end transaction details do not display all telemetry.
Here it only logs the error and not request or maybe request logged but both do not display together over here(difficult to find out due to many people use it)
Should be like:
Sometimes request log but with no error log.
What could be the reason for happening this? do I need to look into application insights specific set-up/feature?
Edit:
As suggested by people here, try to disable the Sampling feature but still not works, Here is open question as well.
This usually happens due to sampling. By default, adaptive sampling is enabled in the ApplicationInsights.config which basically means that only a certain percentage of each telemetry item type (Event, Request, Dependency, Exception, etc.) is sent to Application insights. In your example probably one part of the end to end transaction got sent to the server, another part got sampled out. If you want, you can turn off sampling for specific types, or completely remove the
AdaptiveSamplingTelemetryProcessor
from the config which completely disables sampling. Bear in mind that this leads to higher ingestion traffic and higher costs.
You can also configure sampling in the code itself, if you prefer.
Please find here a good overview of how sampling works and can be configured.
This may be related to :
When using SDK 2.x, you have to track all events and send the telemetries to Application insights
When using auto-instrumentation with 3.x agent, in this case the agent collect automatically the traffic, logs ... and you have to pay attention to the sampling file applicationinsights.json where you can filter the events.
If you are using java, below the accepted Logging libraries :
-java.util.logging
-Log4j, which includes MDC properties
-SLF4J/Logback, which includes MDC properties
We've been using Visual Studio Load Test to exercise our .NET Framework 4.7.2 telemetry client where we can set up the load test to post metrics to our Rabbit MQ at a rate of about 250 metrics per second. Recently, we've had to migrate our telemetry client to .NET Core and need to run load testing and verify that it can still post metrics at the same rate. Now, Visual Studio Load Test (VSLT) is being deprecated and has no support for .NET Core framework so we've had to look to something like NBomber to use in place of VSLT.
With regards to NBomber, there doesn't seem to be enough documentation or support that I can get because I've tried all I know and cannot get NBomber to post more than 25 metrics per second. At the same time, I'm seeing 100% CPU usage.
Anyone has any insight to share with me? Thanks in advance for your help,
Tien
Turns out, my logic was bad. A senior developer and friend shared with me some insights where I was initializing a telemetry client for each posting of a metric. This was the key to high consumption of CPU and not allowing me to reach the performance I was expecting. I'm in the process of re-coding my test(s) so that NBomber can be used to initialize 250 telemetry clients posting a minimum of 1MM metrics within an hour. I ran a fix yesterday that posted 17K metrics within 56secs with just 1 telemetry client or of about a rate of 300 RPS. I thought VS LT was awesome, but I'm thinking NBomber is quite impressive.
Cheers to Load Testing with NBomber!!
Tien
If single instance of NBomber is consuming 100% of CPU and not conducting the necessary load you will need to set up another machine and run NBomber in distributed cluster mode
Why do you need cluster?
You reached the point that the capacity of one node is not enough to create a relevant load.
You want to delegate running multiple scenarios to different nodes. For example, you want to test the database by sending in parallel read and write queries. In this case, one node can send inserts and another one can send read queries.
You want to simulate a real production load that requires several nodes to participate. For example, you may have one node that periodically writes data to the Kafka broker and two nodes that constantly read this data from the Redis cache.
Also it seems that Microsoft recommends using Apache JMeter™ so it might worth giving it a try. JMeter is capable of sending messages to various MQ implementations and its documentation is more concise, i.e. see Building a JMS Topic Test Plan
My team and I have been at this for 4 full days now, analyzing every log available to us, Azure Application Insights, you name it, we've analyzed it. And we can not get down to the cause of this issue.
We have a customer who is integrated with our API to make search calls and they are complaining of intermittent but continual 502.3 Bad Gateway errors.
Here is the flow of our architecture:
All resources are in Azure. The endpoint our customers call is a .NET Framework 4.7 Web App Service in Azure that acts as the stateless handler for all the API calls and responses.
This API app sends the calls to an Azure Service Fabric Cluster - that cluster load balances on the way in and distributes the API calls to our Search Service Application. The Search Service Application then generates and ElasticSearch query from the API call, and sends that query to our ElasticSearch cluster.
ElasticSearch then sends the results back to Service Fabric, and the process reverses from there until the results are sent back to the customer from the API endpoint.
What may separate our process from a typical API is that our response payload can be relatively large, based on the search. On average these last several days, the payload of a single response can be anywhere from 6MB to 12MB. Our searches simply return a lot of data from ElasticSearch. In any case, a normal search is typically executed and returned in 15 seconds or less. As of right now, we have already increased our timeout window to 5 minutes just to try to handle what is happening and reduce timeout errors for the fact their searches are taking so long. However, we increased the timeout via the following code in Startup.cs:
services.AddSingleton<HttpClient>(s => {
return new HttpClient() { Timeout = TimeSpan.FromSeconds(300) };
});
I've read in some places that you actually have to do this in the web.config file as opposed to here, or at least in addition to it. Not sure if this is true?
So The customer who is getting the 502.3 errors have significantly increased the volumes they are sending us over the last week, but we believe we are fully scaled to be able to handle it. They are still trying to put the issue on us, but after many days of research, I'm starting to wonder if the problem is actually on their side. Could it be possible that they are not equipped to take the increased payload on their side. Can it be that their integration architecture is not scaled enough to take the return payload from the increased volumes? When we observe our resources usages (CPU/RAM/IO) on all of the above applications, they are all normal - all below 50%. This also makes me wonder if this is on their side.
I know it's a bit of a subjective question, but I'm hoping for some insight from someone who may have experienced this before, but even more importantly, from someone who has experience with a .Net API app in Azure which return large datasets in it's responses.
Any code blocks of our API app, or screenshots from Application Insights are available to post upon request - just not sure what exactly anyone would want to see yet as I type this.
We built a React webapp that makes fetch calls to WebAPI2 services hosted on the same website. I've added Application Insights to the application and the server code. There are some external web requests that run in the WebAPI services and I wanted to track the timings of those calls and compare them to the overall request duration.
I can see Fetches getting populated in the requests data. I also see customEvents being recorded. The problem is that I can't seem to correlate these two datasets. None of the calls in the requests have an operation_Id that match the operation_Id or operation_ParentId in the customEvents. I had thought that the whole purpose of these properties was to associate the calls with each other.
I saw this article that talks about some new W3C distributed tracing that can be used for correlation (https://learn.microsoft.com/en-us/azure/azure-monitor/app/correlation). I think that's for a different issue of dealing with server farms but even so, I've tried enabling those parameters without any luck either.
I also enabled the enableCorsCorrelation on the javascript config without that affecting anything I could tell. But I think that setting is only useful to correlate across different AI resources.
I am using a current version of AI's SDK. I notice that the source of these entries are coming from the different parts of the SDK. Our customEvents are written by dotnet:2.8.1-22898. The Fetch requests are written by web:2.8.1-19196.
Could the issue be sampling? I've tried to open the firehouse... I have 100% Data Sampling on the Dashboard. I have left the defaults for javascript config and applicationInsights.config on the server.
Has anyone had success correlating data in a customEvents dataset with other datasets?
We are applying unittests, integration tests and we are practicing test driven and behaviour driven development.
We are also monitoring our applications and servers from outside (with dedicated software in our network)
What is missing is some standard for a live monitoring inside the apllication.
I give an example:
There should be a cron-like process inside the application, that regularily checks some structural health inside our data structures
We need to monitor that users have done some regular stuff that does not endanger the health of the applications (there are some actions and input that we can not prevent them to do)
My question is, what is the correct name for this so I can further research in the literature. I did a lot of searching but I almosdt always find the xunit and bdd / integration test stuff that I already have.
So how is this called, what is the standard in professional application development, I would like to know if there is some standard structure like xunit, or could xunit libraries even bee used for it? I could not even find appropriate tagging for this question, so please if you read this and know some better tags, why not add them to this answer and remove the ones that don't fit.
I need this for applications written in python, erlang or javascript and those are mostly server side applications, web applications or daemons.
What we are already doing is that we created http gateway from inside the applications that report some stuff and this is monitored by the nagios infrastructure.
I have no problem rolling some cron-like controlled self health scheme inside the applications, but I am interested about knowing some professional standardized way of doing it.
I found this article, it already comes close: Link
It looks like you are asking about approaches how to monitor your application. In general, one can distinguish between active monitoring and passive monitoring.
In active monitoring, you create some artificial user load that would mimic real user behavior, and monitor your application based on these artificial responses from a non-existing user (active = you actively cause traffic to your application). Imagine that you have a web application which allows to get weather forecast for specific city. To have active monitoring, you will need to deploy another application that would call your web application with some predefined request ("get weather for Seattle") every N hours. If your application does not respond within the specified time interval, you will trigger alert based on that.
In passive monitoring, you observe real user behavior over time. You can use log parsing to get number of (un)successful requests/responses, or inject some code into your application that would update some values in database whenever successful or not successful response was returned (passive = you only check other users' traffic). Then, you can create graphs and check whether there is a significant deviation in user traffic. For example, if during the same time of the day one week ago your application served 1000 requests, and today you get only 200 requests, it may mean some problem with your software.