I have 3 custom metrics that track the elapsed time of HTTP requests to an external service, so I can see how performant they are. I was able to set up 3 alerts that tell me when they are taking too long, but I would also like alerts that tell me when the request rate is over or under a certain threshold, for each of the 3 calls.
I can see there is a general Request Rate alert, but it applies to the entire Application Insights resource, which I share with the rest of my company. So if I set up an alert for request rate >= 100 every 5 seconds, it will count not only my 3 requests but also a whole bunch of other requests I don't care about.
I want to end up with something like this, repeated for request B & C:
Does request A take longer than 3 seconds (avg over the last 5 mins)? - done
Are there more than 100 requests for request A (avg over the last 5 mins)?
Are there fewer than 100 requests for request A (avg over the last 24 hours)?
Is this possible? Should I be looking at some other way of dealing with the requests/metrics?
Yes, it is possible. One way to do this is to use a "Custom Log Search" query as the signal for the alert.
If you want to test your queries you can do so in Log Analytics (Application Insights -> Search -> Analytics).
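For example, a query along these lines could count one of your calls over the last 5 minutes (a sketch: `my-external-call-A` stands in for whatever name your custom metric actually uses, and your data might live in `dependencies` rather than `customMetrics` depending on how you track it):

```kusto
customMetrics
| where name == "my-external-call-A"
| where timestamp > ago(5m)
| summarize requestCount = count()
```

You would then configure the alert to fire when `requestCount` is above 100 (or, with a 24-hour window, below 100) and repeat the query per call.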
I have Stackdriver alerts/incidents on metrics like Cloud Run revision request latencies.
If a few calls a long time ago had high latency, but no new (low-latency) requests have arrived since, the incident fires permanently. This is because when no new requests come in, there are no new data points for the metric.
Is there a way to automatically stop an incident from firing when there are no recent data points for the underlying metric? Or is there an alternative way to alert on high request latencies in Cloud Run that switches the alarm off again when no new high-latency requests are coming in?
The solution in https://stackoverflow.com/a/63997540/6473907 does not work as-is, because the built-in Cloud Run request-count metric does not go to zero when requests stop coming in; it simply stops providing data points. The solution for us was to create a custom logs-based metric that counts the log entries Cloud Run writes for every request. The logs-based metric does go to zero, so we combined it with AND_WITH_MATCHING_RESOURCE as described in https://stackoverflow.com/a/63997540/6473907.
The chart compares the request count from the predefined metric run.googleapis.com/request_count (violet) with the custom logs-based metric (blue). Only the latter goes to zero when requests stop coming in.
Edit: this solution will not work, because the request count stops being sent to Stackdriver instead of dropping to zero. As explained in the other (more correct) answer, the fix is to create a logs-based metric for the requests, which does properly drop to zero when no more requests arrive.
This behaviour is documented in the alerting docs:
If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions.
The docs offer a few recommendations to mitigate this, but they all assume you are still collecting metrics; none of them covers your situation, where there are no metrics at all (because requests stopped arriving).
This is probably by design: even if you are not receiving additional requests, you might still want to check why all the latest requests had this increased latency.
To work around this feature, you could try to use multiple conditions in your alert policy:
One condition related to the latency: if latency > X
One condition related to the existence of requests: if request count > 1
If you combine these with AND_WITH_MATCHING_RESOURCE, the policy should only trigger when there is high latency and there are requests. The incident is resolved as soon as either condition stops being met: even if no new latency metrics are ingested (so the policy still thinks latency is high), the request-count condition stops matching after the specified duration.
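Assuming the Cloud Monitoring AlertPolicy REST representation, such a two-condition policy might be sketched like this (metric types, thresholds, and durations below are illustrative, and the second filter assumes a logs-based request-count metric as discussed in the other answer):

```json
{
  "displayName": "High latency while traffic is flowing",
  "combiner": "AND_WITH_MATCHING_RESOURCE",
  "conditions": [
    {
      "displayName": "Latency above threshold",
      "conditionThreshold": {
        "filter": "metric.type=\"run.googleapis.com/request_latencies\" resource.type=\"cloud_run_revision\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 2000,
        "duration": "300s",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_PERCENTILE_99" }
        ]
      }
    },
    {
      "displayName": "Requests are coming in",
      "conditionThreshold": {
        "filter": "metric.type=\"logging.googleapis.com/user/request_count\" resource.type=\"cloud_run_revision\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 1,
        "duration": "300s",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_RATE" }
        ]
      }
    }
  ]
}
```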
Below is one part of my request. I have 725 requests of this kind, covering each day in a 2-year span.
I am pulling analytics for 30 days of traffic for each dataset I am creating.
When I try to query the analytics for all 725 datasets, I hit the quota error "Requests per user per 100 seconds", even though I put time.sleep(2) before each request.
Is there something else I can do to avoid hitting the API quota?
{
  "reportRequests": [
    {
      "viewId": "104649158",
      "dateRanges": [
        { "startDate": "2017-12-01", "endDate": "2017-12-31" }
      ],
      "metrics": [
        { "expression": "ga:pageviews" },
        { "expression": "ga:uniquePageviews" },
        { "expression": "ga:pageviewsPerSession" },
        { "expression": "ga:timeOnPage" },
        { "expression": "ga:avgTimeOnPage" },
        { "expression": "ga:entrances" },
        { "expression": "ga:entranceRate" },
        { "expression": "ga:exitRate" },
        { "expression": "ga:exits" }
      ],
      "dimensions": [
        { "name": "ga:pagePathLevel2" }
      ],
      "dimensionFilterClauses": [
        {
          "filters": [
            {
              "dimensionName": "ga:pagePathLevel2",
              "operator": "REGEXP",
              "expressions": [
                "23708|23707|23706|23705|23704|23703|23702|23701|23700|23699|23698|23697|23696|23695|23694|23693|23692"
              ]
            }
          ]
        }
      ]
    }
  ]
}
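One way to cut the number of API calls (a sketch, not tested against your account): the Reporting API v4 batchGet endpoint accepts up to 5 ReportRequest objects per call, provided they share the same viewId, dateRanges, segments, and samplingLevel, so 725 per-dataset requests can travel in 145 calls. The `dataset_ids`, `analytics` client, and helper below are hypothetical placeholders:

```python
def chunk(items, size=5):
    """Yield successive groups; batchGet accepts up to 5 ReportRequest objects."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def build_report_request(dataset_id):
    # Illustrative: same shape as the request body above, one filter per dataset.
    return {
        "viewId": "104649158",
        "dateRanges": [{"startDate": "2017-12-01", "endDate": "2017-12-31"}],
        "metrics": [{"expression": "ga:pageviews"}],
        "dimensions": [{"name": "ga:pagePathLevel2"}],
        "dimensionFilterClauses": [{
            "filters": [{
                "dimensionName": "ga:pagePathLevel2",
                "operator": "REGEXP",
                "expressions": [dataset_id],
            }]
        }],
    }

dataset_ids = [str(n) for n in range(23000, 23725)]  # 725 hypothetical ids
batches = [
    {"reportRequests": [build_report_request(d) for d in group]}
    for group in chunk(dataset_ids)
]
# 145 API calls instead of 725:
# for body in batches:
#     analytics.reports().batchGet(body=body).execute()  # hypothetical client
#     time.sleep(1)  # stay well under 10 QPS per user
```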
1) You should increase the user quota to 1000 requests (if not done already) by going to your Google Cloud Console -> Top-left Menu -> APIs & Services -> Analytics Reporting API -> Quotas:
https://console.cloud.google.com/apis/api/analyticsreporting.googleapis.com/quotas
2) You could increase the time range and use the ga:yearMonth dimension to still get your monthly breakdown. However, you might face sampling issues: since your query is "custom" (you use a filter + dimension), sampling applies if, for the given time range, the total number of sessions at property level exceeds 500K (regardless of how many are actually included in the response). There is no absolute answer here; you have to find the time ranges that suit you best. samplesReadCounts / samplingSpaceSizes will help you detect sampling, and if required you will need to handle pagination.
While it is correct that you can request a quota increase, which will raise the total number of requests you can make, you are still limited.
In the API Console, there is a similar quota referred to as Requests per 100 seconds per user. By default, it is set to 100 requests per 100 seconds per user and can be adjusted to a maximum value of 1,000. But the number of requests to the API is restricted to a maximum of 10 requests per second per user.
more info
requests per user per 100 seconds
This is a user-based quota, linked to the maximum of 10 requests per second per user (10 queries per second (QPS) per IP address). It is basically flood protection: it prevents a single user from making too many requests against the API and thereby making it hard for everyone else to use it.
What you need to understand first is that "100 requests per user per 100 seconds" is enforced loosely. When you run your request, there is really no way to know which server it will land on; if you're the only one running on that server, it's possible to fire off 100 requests in 10 seconds and then be blocked for the next 90 seconds.
quotaUser
The second thing to know is that "user-based" normally means IP-based. Your requests may be going against different views, but if they all run from the same IP address, Google assumes they come from the same user. To get around that, you can use a parameter called quotaUser: send a random string with it on every request. It helps, but it won't eliminate the errors completely; Google tends to catch on to what you are doing eventually.
quotaUser An arbitrary string that uniquely identifies a user.
Lets you enforce per-user quotas from a server-side application even in cases when the user's IP address is unknown. This can occur, for example, with applications that run cron jobs on App Engine on a user's behalf.
You can choose any arbitrary string that uniquely identifies a user, but it is limited to 40 characters.
Learn more about Capping API usage.
Implementing exponential backoff
Google normally recommends that you implement something called exponential backoff. This basically means: try a request; if it fails, wait a few seconds and try again; if it fails again, wait twice as long as before and retry. Do this about 10 times and normally you are able to get through.
If you are using one of the official Google client libraries, most of them have exponential backoff already implemented.
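A minimal backoff wrapper might look like this (a sketch; in real code you would catch only the quota/rate-limit error rather than every Exception):

```python
import random
import time

def with_backoff(call, max_tries=10, base=1.0):
    """Retry `call` with exponential backoff plus random jitter,
    roughly as Google recommends for rate-limit errors."""
    for attempt in range(max_tries):
        try:
            return call()
        except Exception:  # narrow this to the specific quota error in practice
            if attempt == max_tries - 1:
                raise  # out of retries: surface the error
            # wait 1s, 2s, 4s, ... (scaled by `base`) plus up to 1s of jitter
            time.sleep(base * (2 ** attempt + random.random()))

# Usage (hypothetical client call):
# report = with_backoff(lambda: analytics.reports().batchGet(body=body).execute())
```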
flood buster
A while ago I wrote an article about something I call flood buster: a way of keeping track of how fast you are going, to try to prevent the user-quota error. The code is in C#; you may find it useful: flood buster
not really an issue
While these errors may be ugly, they don't really matter: just make the request again. Google does not count the error against you unless you trigger it constantly for hours at a time.
2000 requests per project per 100 seconds
You also need to remember that your project as a whole can make at most 2000 requests per 100 seconds. This cannot be increased.
So if you have two users each eating up 1000 requests per 100 seconds, you're going to hit the project-based quota and there is nothing you can do about it. Allowing a single user to eat your entire quota is, IMO, not a good idea unless this is a single-user application.
We're implementing Google Analytics in retail consumer kiosk software. There is no JavaScript, SDK, or web page involved: we craft a URL per the Measurement Protocol and POST it. We find that sometimes hits just stop getting counted. If we watch the Real-Time section on the GA web site, we can see that our hits continue to get posted, but over in the Behavior / Screens section the number of screen views for this device for today stops incrementing.
It's not just a "sometimes you have to wait 24 hours" thing, because Tuesday and Wednesday of last week still show zero today. If it's a rate limit, I can't see which one:
We're nowhere near 200k hits per day per user (from our point of view each kiosk is a user; we have no means to identify individual users).
We shouldn't be hitting 500 hits per session, because we send a session start (ec=Session&sc=Start) each time the user does something on the main menu and a session end (ec=Session&sc=End) each time the workflow finishes, which should never be more than 20 screens. The default idle-timeout definition of a session wouldn't work well for us: a user can legitimately spend 10 minutes or more on a single screen editing a picture, while the next user in line can start using the kiosk within a few seconds of the previous one finishing.
We shouldn't be sending events "too fast", because it takes a couple of seconds for a human to read the screen and reach out and touch a button.
What we observe is that some days it counts up to 340-360 and stops and some days it stays at 0 permanently. Any idea what's happening and how to fix it?
11/24: Today it went up to 352 and then stopped. This was about one hour of activity. All of this has been done with "Highest precision" selected.
12/1: Still same, counts for about one hour, to 347 screen views today, then stops incrementing.
When I look at Audience / Overview it says "Sessions: 1". There should be dozens of sessions, split up by when we send ec=Session&sc=Start. It must not be recognizing that as a session boundary; it must be falling back to the idle session timeout, keeping everything within a single session, and therefore limiting us to 500 hits (we send some events along with the screen views). And this is just wrong: the session should end when we say it does.
12/1: One correction, we actually do send sc=start and sc=end, with the values lower-case, as specified by Google.
My coworker did some experimenting and found that sc=start is ignored on t=event hits. It is recognized on t=pageview hits. I changed my reporting a bit to generate a fake pageview when a session starts, just so I could send the sc=start, and now the counts are accurate.
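For illustration, the fake session-start pageview hit could be assembled like this (the tid, cid, and dp values are placeholders; t=pageview is used because sc=start was observed to be ignored on t=event hits):

```python
import urllib.parse

params = {
    "v": "1",            # Measurement Protocol version
    "tid": "UA-XXXXX-Y", # placeholder property id
    "cid": "kiosk-0042", # placeholder client id (one per kiosk)
    "t": "pageview",     # pageview, so sc=start is honored
    "dp": "/session-start",  # hypothetical fake page marking the boundary
    "sc": "start",       # lower-case, as specified by Google
}
payload = urllib.parse.urlencode(params)
# POST `payload` to https://www.google-analytics.com/collect
```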
I am trying to stress test my Google Analytics setup and have sent around 100,000 requests to GA at a rate of about 3000/s. I received status code 200 for every ping to GA. All the requests sent are exactly identical.
But when I look at the real-time dashboard the numbers are wrong, showing only about 1/3 of the total requests sent. Has anybody observed similar behavior with GA?
Do you know that the standard (free) version of GA has many restrictions? For example, it limits the number of hits per second it collects. It is normal behaviour for it to drop collected data if you send 3000 hits per second.
As per documentation:
ga.js: Each ga.js tracker object starts with 10 hits that are replenished at a rate of 1 hit per second. Applies only to event type hits.
analytics.js: Each analytics.js tracker object starts with 20 hits that are replenished at a rate of 2 hits per second. Applies to all hits except ecommerce (item or transaction).
Android SDK: For each tracker instance on a device, each app instance starts with 60 hits that are replenished at a rate of 1 hit every 2 seconds. Applies to all hits except ecommerce (item or transaction).
iOS SDK: Each property starts with 60 hits that are replenished at a rate of 1 hit every 2 seconds. Applies to all hits except ecommerce (item or transaction).
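The replenishment rule quoted for analytics.js is a classic token bucket. A client-side sketch of it (useful for estimating roughly when hits start getting dropped; the class name and dropped-hit handling are my own):

```python
import time

class HitBucket:
    """Token bucket matching the analytics.js rule quoted above:
    start with 20 hits, replenished at 2 hits per second."""

    def __init__(self, capacity=20, refill_per_sec=2, now=time.monotonic):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = float(capacity)
        self.now = now           # injectable clock, eases testing
        self.last = now()

    def try_send(self):
        """Return True if a hit may be sent now, False if it would be dropped."""
        t = self.now()
        # replenish tokens for the time elapsed, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.refill)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: queue the hit instead of sending it
```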
Let's assume our goal is to set up offline event tracking using the Measurement Protocol. The only limitations on our side are that we need to post the events feed once daily, and we need a GA setup with correct standard reports in the GA UI.
GA limits:
Session timeout limit is 4 hours;
The max time delta between when the reported hit occurred and when it was sent (the qt parameter) is also limited to 4 hours.
Test case:
"0". Session timeout limit is set to the max 4 hours.
User visits site at 9 a.m first session is created.
It takes him 10 minutes to get the info needed for making a call.
User makes a call and an phone order at 9:10 a.m. Unique, non personally identifiable code is passed with the call to CRM and saved in GA dimension and uid.
At 6 p.m call-report CRM generates the call-report and passes it to GA using measurement protocol event upload HTTP requests.
At 6 p.m call-report CRM generates the transaction-report and passes the phone order value & number to GA using measurement protocol transaction requests.
Questions:
1) Does the qt parameter in the request described in step 5 of the test case need to be equal to:
1.1) The maximum possible value, 4 hours (because otherwise it may not be processed, per the rule "Values greater than four hours may lead to hits not being processed.")
1.2) Actual value - 8 hours & 50 minutes.
2) Does 1.1 result in a first session timeout?
3) Does 1.1 result in a second session being created, which:
start-time is equal to 4 p.m & 50 minutes;
end time is equal to 4 p.m & 50 minutes;
user-agent by-default is equal to the value which has been used in measurement protocol HTTP request;
by-default is not closed, so if a second user visit is on 4 p.m this visit's hits will be sent to this session;
4) Does this second session affect the values of standard report parameters, such as:
average session length;
average bounce rate & exit rate;
average pages per session;
5) Does the second session affect the flow reports or any other Google reports, making them incorrect?
It should be equal to the maximum possible value (or 0) if more than 4 hours have passed since the event, and you should write the actual date to a custom field and process the data later.
Yes, a new session will be created, if the last session expired.
Somewhat.
Yes.
Flow reports are based on users, not sessions.
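The qt capping described in the first answer can be sketched as follows (values in milliseconds; how you carry the true timestamp in a custom dimension is up to your setup):

```python
# Cap the Measurement Protocol queue-time (qt) parameter at 4 hours,
# and keep the real event age for later reprocessing.
FOUR_HOURS_MS = 4 * 60 * 60 * 1000

def qt_and_actual_age(event_age_ms):
    """Return (qt value to send, actual age to store in a custom dimension)."""
    qt = event_age_ms if event_age_ms <= FOUR_HOURS_MS else FOUR_HOURS_MS
    return qt, event_age_ms

# Call happened at 9:10, report sent at 18:00 -> the hit is 8 h 50 min old
age_ms = (8 * 60 + 50) * 60 * 1000
qt, actual = qt_and_actual_age(age_ms)
# qt is capped at 14,400,000 ms; the true age travels in the custom dimension
```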