Resolve stackdriver incident when no more timeseries with available data violate the policy - stackdriver

I have stackdriver alerts/incidents on metrics like cloud run revision request latencies.
If a few calls a long time ago had high latency, but there have been no new (low-latency) requests since then, the incident keeps firing permanently. This is because when there are no new requests coming in, there are no new data points for the metric.
Is there a way to automatically stop an incident from firing when there are no recent data points for the underlying metric? Or is there an alternative way to alert on high request latencies in Cloud Run that automatically switches the alarm off again when no new high-latency requests are coming in?

The solution in https://stackoverflow.com/a/63997540/6473907 does not work as-is, because the built-in Google Cloud Run request-count metric does not go to zero when there are no more requests coming in; it simply stops providing data points. The solution for us was to create a custom logs-based metric that counts the log entries Cloud Run writes for every request, because the logs-based metric does indeed go to zero, and then combine it with AND_WITH_MATCHING_RESOURCE as described in https://stackoverflow.com/a/63997540/6473907
The chart compares the request count as obtained from the pre-defined Google metric run.googleapis.com/request_count (in violet) with the count produced by the custom logs-based metric (in blue). Only the latter goes to zero when no more requests are coming in.
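For reference, such a logs-based counter metric can also be created programmatically. Below is a minimal sketch using the google-cloud-logging Python client; the metric name cloud_run_request_count and the log filter are assumptions you would adapt to your own project and services.

```python
from google.cloud import logging

# Minimal sketch (not the exact metric used above): create a logs-based
# counter that counts Cloud Run request log entries, so it drops to zero
# when no requests arrive. Metric name and filter are placeholders.
client = logging.Client()

metric = client.metric(
    "cloud_run_request_count",  # hypothetical metric name
    filter_=(
        'resource.type="cloud_run_revision" '
        'AND logName:"run.googleapis.com%2Frequests"'  # assumed request-log filter
    ),
    description="Counts Cloud Run request log entries",
)

if not metric.exists():
    metric.create()
```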

Edit: This solution will not work because the request count stops being sent to Stackdriver instead of dropping to zero. As explained in the other (more correct) answer, the solution is to create a logs-based metric for the requests, and this will properly drop to zero when there are no additional requests.
This behaviour is documented in the alerting docs:
If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions.
There are a few recommendations in there to mitigate this issue, but all the suggestions assume you're actually collecting metrics, which is not your situation: there are no metrics at all, because you stopped receiving requests.
This is probably by design: even if you are not receiving additional requests, you might still want to check why all the latest requests had this increased latency.
To work around this feature, you could try to use multiple conditions in your alert policy:
One condition related to the latency: if latency > X
One condition related to the existence of requests: if request count > 1
If you combine those with AND_WITH_MATCHING_RESOURCE, it should only trigger if there's high latency and there are requests. The incident should be resolved when one of the two conditions is not met. Even if no new metrics are ingested related to the latency (so the alerting policy still thinks the latency is high), the request count will stop matching after the specified duration period.
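As a rough illustration of such a policy, here is a sketch using the Cloud Monitoring Python client (google-cloud-monitoring). The metric filters, threshold values, durations, and the logs-based request-count metric name are all placeholders and assumptions, not the exact configuration from this thread.

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Sketch of a two-condition policy combined with AND_WITH_MATCHING_RESOURCE:
# it should only fire while latency is high AND requests are still arriving.
# Filters, thresholds and the logs-based metric name are placeholders.
client = monitoring_v3.AlertPolicyServiceClient()

latency_condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Request latency above threshold",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'resource.type="cloud_run_revision" '
            'AND metric.type="run.googleapis.com/request_latencies"'
        ),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_PERCENTILE_99,
            )
        ],
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=2000,  # e.g. 2000 ms, adjust to your latency budget
        duration=duration_pb2.Duration(seconds=300),
    ),
)

traffic_condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Requests are still arriving",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Assumes a logs-based request-count metric as in the other answer.
        filter=(
            'resource.type="cloud_run_revision" '
            'AND metric.type="logging.googleapis.com/user/cloud_run_request_count"'
        ),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_DELTA,
            )
        ],
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=1,
        duration=duration_pb2.Duration(seconds=300),
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="High Cloud Run latency (only while receiving traffic)",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND_WITH_MATCHING_RESOURCE,
    conditions=[latency_condition, traffic_condition],
)

client.create_alert_policy(name="projects/YOUR_PROJECT_ID", alert_policy=policy)
```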

Related

Google Analytics Hit Quotas

I wonder whether someone can help me please.
I have a user who, under a specific property, sporadically receives the following error:
Some hits sent on 03-Jul-2018 to property ...... exceeded one or more hit quotas and were therefore not processed.
Hits can be dropped when daily or monthly hit limits are exceeded. You can view your hit volume levels in Property Settings in Analytics.
Hits can also be dropped if visitor hit limits are exceeded. This can happen when your site is incorrectly generating the visitor ID for a GA session. Contact your website administrator to check that the visitor ID generation has been correctly implemented.
They are not using a Premium account, but when I look at the data for the day in question there aren't any issues with regard to 'High Cardinality', which, unless I've misunderstood, I'd expect to see.
Could someone please take a look and offer some guidance on where the issue may be? This area is fairly new to me.
Many thanks and kind regards
Chris
Collection limits are influenced by 2 factors:
The tracker: whether you use ga.js, gtag.js, analytics.js, etc.; here are the details.
The property type: whether you are using GA (10M hits / month) or GA 360 (2B hits / month).
In your case you are facing a property limit. To find out when such limits were reached, you can create a custom report using a time dimension (e.g. date + time) combined with the hits metric (a scripted alternative is sketched below this answer). You can also combine the hits metric with other dimensions (country, browser, device) to see if you find any patterns as to why you're getting so many hits.
Cardinality is something else: it refers to the number of unique value combinations for your dimensions. For instance, if you have 500K events where each event category is different, you'll have a cardinality of 500K on the event category dimension. The more hits, the more likely you'll have high cardinality, but the two aren't necessarily related (if you send 10B events with the same category, the cardinality on the category is 1).
So focus on identifying and solving your limits/quotas issue, as it's the real issue here:
If the number of hits is legitimate (you have a huge amount of traffic), then the only options are to upgrade to GA 360 or reduce the number of hits for each session
If the number of hits is abnormally high (e.g. traffic is stable but hits increased dramatically), look for implementation issues, especially generic event trackers such as error tracking with tools like Google Tag Manager
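If you'd rather script the check than build the custom report in the UI, here is a rough sketch using the Core Reporting API (v3) Python client. The view ID, date range, and key file are placeholders, and a read-only service-account credential plus the queryable ga:hits metric are assumptions.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Sketch: pull hit volume per day and hour to see when the quota was hit.
# View ID, dates and the key file are placeholders.
credentials = service_account.Credentials.from_service_account_file(
    "service-account-key.json",
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
service = build("analytics", "v3", credentials=credentials)

response = service.data().ga().get(
    ids="ga:VIEW_ID",
    start_date="2018-07-01",
    end_date="2018-07-03",
    metrics="ga:hits",
    dimensions="ga:date,ga:hour",
    sort="-ga:hits",
).execute()

for date, hour, hits in response.get("rows", []):
    print(date, hour, hits)
```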

Google Analytics data limits

We are using Google Analytics on a webshop. Recently we added enhanced ecommerce to measure more events so we can optimize the webshop. But now we are seeing fewer pageviews, and other data is missing.
I don't know what causes it, but on a specific page that we were no longer measuring, I removed some items from the ga:addImpression data, and now the pageview is measured again.
I can find limits for GA, but I can't find anything about the amount of data that can be sent to GA, and this does seem to be related to the amount of data that is sent: if I shorten the name of a product, the pageview is also measured again. GA is practically broken for us now because we are missing huge numbers of pageviews.
Where can I find these limits, or how will I ever know when I'm running into these limits?
On the one hand, I'm not sure how you are building your hits, but you should keep in mind the payload limit for sending information to GA (the limit is 8 KB).
On the other hand, there is in fact a limit you should consider (Docs):
This applies to analytics.js, the Android and iOS SDKs, and the Measurement Protocol.
200,000 hits per user per day
500 hits per session
If you go over either of these limits, additional hits will not be processed for that session / day, respectively. These limits apply to Analytics 360 as well.
My best advice is to regulate the number of events you send, really considering which information has value. No doubt EE data is really important, so if the problem is the size, you should partition the productImpression hits into multiple ones (as shown in the screenshot).
And finally, migrate to GTM.
EDIT: Steps to see what the dataLayer has in it (at a given moment)
A Google Analytics request can send at most about 8 KB of data:
POST:
payload_data – The BODY of the post request. The body must include exactly 1 URI encoded payload and must be no longer than 8192 bytes.
URL Endpoint:
The length of the entire encoded URL must be no longer than 8000 bytes.
If your hit exceeds that limit (happens e.g. with large product lists in EEC tracking) it is not (as far as I can tell) processed.
There are also restrictions on field length for some fields (e.g. custom dimensions with a maximum of 150 bytes; others are detailed in the parameter reference).
In some cases the data type is relevant, e.g. if in your event tracking the event value is set to a string the call might fail.
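As a rough illustration of that 8192-byte body limit (here with the Universal Analytics Measurement Protocol; all parameter values are placeholders), you could check the encoded payload size before sending a hit, and split large impression lists across several hits if it is exceeded:

```python
from urllib.parse import urlencode

import requests

MAX_BODY_BYTES = 8192  # documented limit for a single Measurement Protocol POST body

def send_hit(params):
    """Send one Measurement Protocol hit, refusing bodies over the 8 KB limit."""
    body = urlencode(params)
    if len(body.encode("utf-8")) > MAX_BODY_BYTES:
        raise ValueError(f"Payload is {len(body)} bytes; split it into smaller hits")
    requests.post(
        "https://www.google-analytics.com/collect",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )

# Placeholder pageview hit carrying one enhanced-ecommerce product impression
# (il<listIndex>...pi<productIndex>... follow the Measurement Protocol EE syntax).
send_hit({
    "v": "1",
    "tid": "UA-XXXXX-Y",   # placeholder property ID
    "cid": "555",          # placeholder client ID
    "t": "pageview",
    "dp": "/category/shoes",
    "il1nm": "Category page",
    "il1pi1id": "P12345",
    "il1pi1nm": "Example product",
})
```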
I think this is the page you are looking for: the Quota and limits page can help.
These limits apply to the Web Property / Property / Tracking ID.
10 million hits per month per property
If you go over this limit, the Google Analytics team might contact you and ask you upgrade to Analytics 360 or implement client sampling to reduce the amount of data being sent to Google Analytics.

Google Analytics Add-on limitations

I would like to get some data from GA via the Spreadsheet add-on, as I did a few weeks ago (I gathered ~200,000 rows). I am using the same metrics, dimensions and the rest of the settings, but I am still getting this error:
https://i.stack.imgur.com/hTpIg.png
I found that I do get some data when I do not set "max-results", but the default is 1000, which is not enough for my needs. Why?
What I have tried to solve this problem (none of it worked):
change GA views
change dimensions and metrics
change time range
create new spreadsheet
set up sharing settings of spreadsheet to "public on web"
I found the link regarding limits and quotas on the API (https://developers.google.com/analytics/devguides/config/mgmt/v3/limits-quotas#) and I should only get 50,000 requests per project, which I would actually exceed on the first run, so another question: how is it even possible to get more data than I am supposed to get?
Should I really order more requests, or does "request" mean something other than "one row"? And again: why, or what am I missing?
There is no explanation given for the error.
Perhaps I am missing something, appreciate your help.
In short: while one can only guess what causes your problem, it's most certainly not the API limit. Rows and requests are not at all the same thing; every request may fetch up to 10,000 rows.
"Request" is a call to the API, which might include one or many rows of data (unless your script somehow only requests one row at a time, which would be unusual).
If you exceeded your API quota the error message would say pretty much that.
The default is 1000 rows because that's a sensible default (compromise between convenience and performance). The API will return max 10,000 rows per request. To fetch 200 000 results the Add-on would have to do 20 requests, not 50 000.
Also, a Google spreadsheet supports at most 2 million cells, which might be exceeded by your result set.
"Service error" is a very unspecific error message which can be caused by a variety of causes from out-of-bound ranges to script timeouts or network latency. Sometimes the spreadsheet service dumps an additional error message in the browser console, so you should check your developer tools.

Google Map API Get Duration without traffic

I know that durationInTraffic is deprecated now.
I just can't find a way to get the duration without traffic using the Google API.
I already tried the matrix using DrivingOptions and trafficModel, but the duration and duration_in_traffic did not match the duration without traffic on Maps.Google.com.
Any help on how to get the duration without traffic data using the API?
Duration gives the average time and duration_in_traffic gives the time in traffic. Sometimes, if you run the Google API at midnight, you will find that the duration_in_traffic time is less than the duration time, which means duration is not a "without traffic" time.
According to this, the Google Maps API should return an element called duration as well as an element called duration_in_traffic. Duration will be the time without traffic, while duration_in_traffic is the time with traffic conditions.
Just to set the correct expectations, you shouldn't expect the Web Services API and the Google Maps website to work in the exact same way. These are different products managed by different teams at Google. The search stack is also different, so results may differ.
The result you get in the duration field is an approximation of the average travel time for the route. This takes into account average traffic conditions over the last several weeks for the time at which you execute the request. That means the duration can change during the day.
By applying the departure time and traffic_model you can add further information to your request to get the travel time in current traffic; this will be provided in the duration_in_traffic field.
Summing up: currently there is no way to get the duration without traffic. You will get the average travel time via the API.
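For completeness, here is a minimal sketch of a Directions web-service request that returns both fields (origin, destination, and the API key are placeholders):

```python
import requests

# Sketch: with a departure_time the response contains both `duration`
# (the average, traffic-agnostic estimate) and `duration_in_traffic`.
params = {
    "origin": "Brooklyn, NY",      # placeholder
    "destination": "Newark, NJ",   # placeholder
    "departure_time": "now",
    "traffic_model": "best_guess",
    "key": "YOUR_API_KEY",         # placeholder
}
response = requests.get(
    "https://maps.googleapis.com/maps/api/directions/json", params=params
).json()

leg = response["routes"][0]["legs"][0]
print("duration:           ", leg["duration"]["text"])
print("duration_in_traffic:", leg["duration_in_traffic"]["text"])
```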

Having trouble getting accurate numbers from graphite

I have an application that publishes a number of stats to graphite via statsd. One of the stats simply sends a stat increment to statsd every time a message is received by the service. I need to display a graph that shows the relative traffic over time for this stat. Generally speaking, I should be able to display a graph that refreshes every, say, 10 seconds, and displays how many messages were received in those 10 seconds, as well as the history for a given period of time. However, no matter how I format my API query I cannot seem to get accurate data. I've read a number of articles, including this one:
http://code.hootsuite.com/accurate-counting-with-graphite-and-statsd/
That seems to give some good insight but is still not quite giving me what I need. This is the closest I have come:
integral(hitcount(stats.recieved, "10seconds"))
However, I don't like the cumulative result of this, and when I run it I get statistics that come nowhere near what I see in my logs for messages received. I am OK with accepting some packet loss, but we're talking about orders of magnitude. I know I am doing something wrong; just hoping someone can give me some insight as to what.
A couple of things to check/try:
Configure Graphite for Statsd
Check to make sure that you've used the retention schema and aggregation settings in Graphite that match how Statsd will be sending data (i.e. it sends one data point per 10 second flush interval).
Run a single Statsd aggregator
Be sure you are only running one instance of Statsd, as running multiple statsd daemons will cause metrics to be dropped (since Graphite will be configured to only store one data point for its highest precision of 10s:6h).
Limit the time range in the UI or URL API to less than 6 hours
When displaying graphs with data that crosses over the 6 hour threshold (e.g. from now to 7 hours ago), you will begin seeing 1 minute worth of aggregated count data for the displayed graph (if you've configured Graphite for statsd with retentions = 10s:6h,1min:7d,10min:5y). Rollups will occur based on the oldest data point in the time range (e.g. now till 7+ days = you'll get 10 min rollups).
If sending sparse or "bursty" data AND displaying old time range (triggering aggregation)
Confirm that your xFilesFactor is low enough that aggregation produces non null values even with a high rate of nulls. For example, 100 requests in the first 10 seconds, and none for the remaining 50 seconds in a minute would cause a storage of 100, null, null, null, null, null which would be summed up to null when the data ages if the XFilesFactor is higher than 1/6. Using the statsd recommended graphite configuration handles this, but it is good to know about... as this can give the appearance of lost data.
Saving schema or aggregation changes
If you changed the graphite schema or aggregation settings after any metrics were stored (in whisper = graphite's storage) you'll need to either delete the .wsp files for the metric (graphite will recreate them) or run whisper-resize.py.
Validating settings
You can verify the settings against some whisper data by running whisper-info.py on a .wsp file. Find the .wsp file for one of your metrics in /graphite/storage/whisper/
Run whisper-info.py my_metric_data.wsp; its output should tell you more about how the storage settings are working.
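If you prefer doing that check from Python, the whisper module that whisper-info.py wraps can be used directly (the .wsp path below is a placeholder):

```python
import whisper

# Sketch: read the retention and aggregation settings straight from a .wsp file.
info = whisper.info("/graphite/storage/whisper/stats/recieved.wsp")  # placeholder path

print("aggregationMethod:", info["aggregationMethod"])  # expect 'sum' for statsd counts
print("xFilesFactor:     ", info["xFilesFactor"])        # expect 0 (or very low)
for archive in info["archives"]:
    print(f"{archive['secondsPerPoint']}s per point, "
          f"{archive['points']} points, "
          f"{archive['retention']}s total retention")
```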
TLDR;
You should ensure that Graphite is set to store one data point per 10 second interval for metrics coming from StatsD. You should make sure that Graphite is summing (not averaging) count data coming from Statsd. Both of these can be handled by using the recommended Statsd configuration settings. Don't run more than one Statsd aggregator. When using the UI, limit the data returned to less than 6 hours OR understand what rollup you are viewing when looking at data that crosses retention thresholds. Lastly, make sure the settings take effect (if you've already been sending metrics).
