Anomaly detection for Google Analytics in realtime - google-analytics

I'm trying to detect anomalies in Google Analytics events such as page views or custom events.
I tested Google's own custom alert feature. The period for those alerts is per day, week, or month. What I'm looking for is realtime detection. It would be useful to define alert rules such as a maximum divergence between two points in time, for example [now, now - 15 minutes], [now, now - 24 hours], or [now, now - 7 days]. Some solutions (like observe.io) provide alerts when a fixed threshold is passed. But that's not very helpful for highly fluctuating numbers that depend on weekday and time of day (like page views).
I would be thankful for any tips on how to detect anomalies in GA in realtime.

I agree that threshold solutions are not a good idea for detecting anomalies in time series. Thresholds are generally set by the user rather than learned, which can be a time-consuming and difficult process when monitoring many data streams.
Moreover, they need to be adjusted as the environment changes, so manual real-time maintenance is needed.
Besides, since they don’t take temporal sequences into account, simple thresholds cannot identify pattern changes that take place within the allowed range. I recommend you use methods for anomaly detection in time series or change point detection.
You can search for these topics and you'll find several algorithms. For realtime analysis, I can also recommend software such as MOA (http://moa.cms.waikato.ac.nz/) and Numenta (https://numenta.com/).
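As a rough illustration of the difference between a fixed threshold and a learned baseline, here is a minimal sketch that compares the current count with counts from the same time slot in previous weeks (the [now, now - 7 days] idea from the question). It is a simple z-score check, not one of the algorithms from MOA or Numenta, and the function name, the threshold of 3, and the sample counts are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(current, history, z_threshold=3.0):
    """Flag `current` if it deviates strongly from the values observed in the
    same time slot (e.g. same weekday and hour) in previous weeks."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past values were identical: any difference is a deviation.
        return current != mu
    return abs(current - mu) / sigma > z_threshold


# Hypothetical page-view counts for one weekly time slot over six weeks.
same_slot_history = [980, 1020, 1010, 995, 1005, 990]

print(is_anomalous(1000, same_slot_history))  # False: a typical value
print(is_anomalous(400, same_slot_history))   # True: a sharp drop
```

Because the baseline is built per time slot, a value that is normal on a Sunday night is not compared against Monday morning traffic, which is the main weakness of a single fixed threshold.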

Related

Google Analytics Real-time + historical data

I work for a non-profit that needs to see how our fundraising efforts are going in 'real-time'.
We look at results in blocks of about a half hour - so we need to report on how we finished the last 24 hours or so and also where we're at in the current half-hour. We're accomplishing this through google analytics, as we have multiple fundraising streams all pointing to a common GA account.
I have tried using Data Studio to report against the GA API, but that connector does not seem to refresh at a reliable rate: sometimes it'll pull fresh data within a minute, sometimes it can take twenty minutes to report on recent transactions. I believe the 'real-time' API could be used to get fresher GA data, but as far as I can tell, that will only report 'live' data, and not prior/historical data (say from four hours ago). Does anyone know what API, if any, I could use to pull all data from historical through the current datetime?
I apologize if this request is vague, but I'm just looking for a conceptual approach at this point to get the freshest data, preferably in one fell swoop (one API call). There is more complexity after data intake (I then have to compare it to goals we've set for each half-hour, amongst other nuances of the transactions themselves), so I wanted to start with this fundamental piece/question.
Thanks!
Given the context provided, I believe that the API solution would not be feasible. Among other reasons:
The real time API only offers a limited amount of dimensions and metrics. For example, e-commerce data is not available.
https://ga-dev-tools.appspot.com/dimensions-metrics-explorer/
https://developers.google.com/analytics/devguides/reporting/realtime/dimsmets
The standard intraday processing SLA for the Core Reporting API is < 24 hours for standard properties. The processing occurs on a best-effort basis, meaning that hourly availability can occur from time to time but cannot be guaranteed.
https://support.google.com/analytics/answer/7084038?hl=en
As an alternative approach to the API solution, you could consider the use of an App + Web property which would allow you to stream event data in real time to BigQuery. However, this solution has some cost implications and would introduce you to a new tracking paradigm.
https://developers.google.com/analytics/devguides/collection/app-web/tag-guide
https://support.google.com/firebase/answer/6318765?hl=en
https://www.simoahava.com/analytics/getting-started-with-google-analytics-app-web/
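Whichever source ends up supplying the transactions, the half-hour comparison described in the question can be sketched independently of it. The following is a minimal sketch assuming the transactions are already available as (timestamp, amount) pairs; the function names, sample data, and goal values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def half_hour_block(ts):
    """Round a timestamp down to the start of its half-hour block."""
    return ts.replace(minute=0 if ts.minute < 30 else 30, second=0, microsecond=0)


def totals_by_block(transactions):
    """Sum transaction amounts per half-hour block."""
    totals = defaultdict(float)
    for ts, amount in transactions:
        totals[half_hour_block(ts)] += amount
    return dict(totals)


def compare_to_goals(totals, goals):
    """Return {block: actual - goal} for every block that has a goal."""
    return {block: totals.get(block, 0.0) - goal for block, goal in goals.items()}


# Hypothetical sample data.
transactions = [
    (datetime(2020, 1, 1, 9, 5), 50.0),
    (datetime(2020, 1, 1, 9, 20), 25.0),
    (datetime(2020, 1, 1, 9, 45), 100.0),
]
goals = {
    datetime(2020, 1, 1, 9, 0): 100.0,
    datetime(2020, 1, 1, 9, 30): 80.0,
}

gaps = compare_to_goals(totals_by_block(transactions), goals)
for block, gap in sorted(gaps.items()):
    print(block.strftime("%H:%M"), gap)
# 09:00 -25.0
# 09:30 20.0
```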

Data sampling in google analytics goal flow report

The goal flow report on my google analytics account shows some strange sampling behavior. While I can usually select up to a month of data before sampling starts it seems to be different for the goal flow report.
As soon as I select more than one day of data, the data set used shrinks very quickly. At three days the report is based on only 50% of the sessions, which, according to Analytics, comes to only 35 sessions.
Has anyone experienced a similar behavior of sampling although only very small data-sets are used?
Sampling is induced when your request is calculation-intensive; there's no guaranteed point at which it trips.
Goal flow complexity will increase exponentially as you add goals, so even a low number of goals might make this report demand a lot of processing.
Meanwhile you'll find that most of the standard reports can cover large periods of time without sampling; they are preaggregated, so it's very cheap to load them.
If you want to know more about sampling, see here:
https://stackoverflow.com/a/37386181/5815149

Using analytics for non web-related project

I'm looking to use google analytics for its web interface only. A large dataset such as gasoline prices would be submitted to analytics via the api and viewed. Is this possible? Or is analytics purely tailored to viewing website statistics?
The Google Analytics data model is really geared toward datasets that can be thought of in terms of users, sessions, and hits (hits being things like pageviews and events).
If your data can be thought of in these terms, it will probably work. If, on the other hand, you're trying to do things like joins or calculate averages or other statistical operations, you're probably better off using something else.
While the others are correct that Google Analytics is geared towards users, sessions, and hits, it is nonetheless simply an application for data analysis. The question will be how to get the data into the system.
I think you need to give us a little more information about your data set. But let me assume a few things.
You have a dataset with gasoline prices over a period of days.
You have a dataset with gasoline prices for different gas stations.
It would be really nice if this weren't old data but new gas prices coming in.
If I had this dataset I could insert it into Google Analytics directly using the Measurement Protocol.
The Measurement Protocol has a few required parameters. The first is the hit type: 'pageview', 'screenview', 'event', 'transaction', 'item', 'social', 'exception', or 'timing'. The second is cid, the client ID.
I would probably set cid to identify the different gas stations and add a custom dimension with the gas station name.
As for the hit type, I would probably use screenview and create an application Google Analytics account, mainly because this isn't a website, so it's a little different.
Then every time the price of gas changes I would send a screenview with the cid of the station, the custom dimension for the station name, and a custom metric with the price.
The main problem you are going to have is that Google Analytics doesn't handle old data well. If you are going to insert this data with a date associated, the date and time can't be more than 4 hours in the past or the server won't process it.
Have you considered putting it in BigQuery instead?
This question really is too broad or opinion-based, but it was fun to consider.
It is possible to send all kinds of hits with the Measurement Protocol. But Philip is correct in stating that the data model is largely geared towards users, sessions and hits. But you could probably get a good ways with custom dimensions and metrics.
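To make the Measurement Protocol idea above concrete, here is a minimal sketch that builds the payload for one gas price update as a screenview hit. The parameter names (v, tid, cid, t, an, cd, cd1, cm1) are Universal Analytics Measurement Protocol parameters; everything else is an assumption for illustration: the tracking ID placeholder, the app and screen names, the use of custom dimension index 1 for the station name, and custom metric index 1 for the price. The sketch only builds the request body and does not send it.

```python
from urllib.parse import urlencode

# Universal Analytics Measurement Protocol collection endpoint.
COLLECT_URL = "https://www.google-analytics.com/collect"


def build_price_hit(tracking_id, station_id, station_name, price):
    """Build the URL-encoded body for one 'screenview' hit describing a price change."""
    params = {
        "v": "1",              # protocol version
        "tid": tracking_id,    # property ID (placeholder in the example below)
        "cid": station_id,     # client ID, used here to identify the station
        "t": "screenview",     # hit type
        "an": "GasPrices",     # app name (hypothetical)
        "cd": "price-update",  # screen name (hypothetical)
        "cd1": station_name,   # custom dimension 1 (assumed to be configured)
        "cm1": price,          # custom metric 1 (assumed to be configured)
    }
    return urlencode(params)


payload = build_price_hit("UA-XXXXX-Y", "station-42", "Main Street", 3.49)
print(payload)
```

The resulting string would be sent as the body of an HTTP POST to COLLECT_URL, one request per price change.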

Google analytics data adjustment?

I've been using an SSIS Integration component to download data from Google Analytics in order to keep a historical view of some websites and track their evolution. Basically, the metrics we track are Visits (now Sessions) and Visitors (now Users), and the dimensions are Year and Month. However, today I noticed that the data I downloaded for July showed a variation in the Users metric. I heard that Google Analytics uses an estimation method to "calculate" some (if not all) of its metrics. Could it be that the data is later "adjusted" with more accurate information? If so, is this mentioned in the documentation? (A link would be highly appreciated.) I ask because the users are complaining that we are not delivering the real GA data. I tried looking on the Google Analytics documentation page with no luck.
Thanks for your time.
PS: Sorry for my English, it isn't my native language.
If you are using the standard version of Google Analytics (you'll know if you are paying $150k for premium), data is sampled depending on volume. Have a read of this article can-you-trust-your-google-analytics-data
I have seen very slightly differing results being returned if you repeatedly call the API with the same historical parameters. In my case the figures only differed by 1-2 over a daily set of several thousand, but they differed nevertheless.
If you want to guarantee your results, consider upgrading to Premium.
Sampling could be an issue if what you are requesting is over 50,000 rows for the time period you are requesting. To avoid it you can download more often, such as daily.
But I think your issue is that there is a processing time for Google Analytics - if you are downloading at 3 am on the 1st it is probable that the processing for the previous day has not finished.
Google Analytics Premium SLA is for 4 hour data freshness, so even that would have trouble. Pragmatically you should allow 24 hours before you download data for the previous day, 48 hours for e-commerce data.
Thirdly, make sure it is not Unique Visitors you are requesting, as this metric depends on the time period you are requesting.
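A small example may help show why a unique-visitor count depends on the requested period: a visitor who returns on several days is counted once per day in daily figures but only once in a figure for the whole period. The data below is made up for illustration.

```python
# Hypothetical visitor IDs seen on three consecutive days.
daily_users = {
    "day 1": {"alice", "bob"},
    "day 2": {"bob", "carol"},
    "day 3": {"alice", "carol", "dave"},
}

# Adding up daily unique counts double-counts returning visitors.
sum_of_daily = sum(len(users) for users in daily_users.values())

# Unique visitors over the whole period: each ID is counted once.
users_in_period = len(set().union(*daily_users.values()))

print(sum_of_daily)     # 7
print(users_in_period)  # 4
```

So monthly Users figures cannot be reconstructed by summing daily downloads, and two requests over different date ranges are not expected to add up.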

Could "filling up" Google Analytics with millions of events slow down query performance / increase sampling?

Considering doing some relatively large scale event tracking on my website.
I estimate this would create up to 6 million new events per month in Google Analytics.
My questions are, would all of this extra data that I'm now hanging onto:
a) Slow down GA UI performance
and
b) Increase the amount of data sampling
Notes:
I have noticed that GA seems to be taking longer to retrieve results for longer timelines for my website lately, but I don't know if it has to do with the increased amount of event tracking I've been doing lately or not – it may be that GA is fighting for resources as it matures and as more and more people collect more and more data...
Finally, one might guess that adding events may only slow down reporting on events, but this isn't necessarily so is it?
Drewdavid,
The amount of data being loaded will influence the speed of GA performance, but nothing really dramatic I would say. I am running a website/app with 15+ million events per month and even though all the reporting is automated via API, every now and then we need to find something specific and use the regular GA UI.
More than speed, I would be worried about sampling. That's the reason we automated the reporting in the first place, as there are some ways to eliminate it (with some limitations). See this post, for instance, which describes using Analytics Canvas, one of my favorite tools (I am not affiliated in any way :-).
Also, let me ask what would be the purpose of your events? Think twice if you would actually use them later on...
Slow down GA UI performance
Standard Reports are precompiled and will display as usual. Reports that are generated ad hoc (because you apply filters, segments etc.) will take a little longer, but not so much that it hurts.
Increase the amount of data sampling
If by "sampling" you mean throwing away raw data, Google does not do that (I actually have that in writing from a Google representative). However, the reports might not be able to resolve all data points (e.g. you get Top 10 Keywords and everything else is lumped under "other").
However, those events will count towards your data limit, which is ten million interaction hits per month (pageviews, events, transactions, any single product in a transaction, user timings and possibly others). Google will not drop data or close your account without warning (again, I have that in writing from a Google Sales Manager), but they reserve the right to either force you to collect fewer interaction hits or to close your account some time after they have issued a warning (actually they will ask you to upgrade to Premium first, but chances are you don't want to spend that much money).
Google is pretty lenient when it comes to violations of the data limit, but other people's leniency is not a good basis for a reliable service, so you want to make sure that you stay within the limits.
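As a quick back-of-the-envelope check against that limit, the sketch below adds the estimated 6 million events from the question to a site's other hits. It assumes the ten million figure applies per month, and the pageview number is hypothetical.

```python
MONTHLY_HIT_LIMIT = 10_000_000  # limit cited in the answer above, assumed to be per month


def remaining_hits(pageviews, events, other_hits=0, limit=MONTHLY_HIT_LIMIT):
    """Return how many hits are left in the monthly budget (negative if over)."""
    return limit - (pageviews + events + other_hits)


# 6 million events (from the question) plus a hypothetical 3 million pageviews.
print(remaining_hits(pageviews=3_000_000, events=6_000_000))  # 1000000
```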
