Unsampled data with Google Analytics API - google-analytics

I am trying to automate a weekly report. Currently I pull the data for the report from the Google Analytics website, with the sampling level set to higher precision.
I tried to get the same data through the Google Analytics API with samplingLevel set to HIGHER_PRECISION, but I am still getting sampled data.
With FASTER the sample size is roughly 25% of sessions, whereas with DEFAULT and HIGHER_PRECISION it is roughly 50%.
On the Google Analytics website it says 'This report is based on 100% of sessions'. Can I get the same level of accuracy with the API? I am using Google Apps Script, and the response for HIGHER_PRECISION does not match the website.
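A minimal sketch of the kind of request being described, assuming the Apps Script Analytics advanced service is enabled and using a placeholder view ID and dates:

    // Sketch only: Core Reporting request with HIGHER_PRECISION sampling.
    // 'ga:XXXXXXXX' is a placeholder view (profile) ID.
    function getWeeklyReport() {
      var response = Analytics.Data.Ga.get(
          'ga:XXXXXXXX',            // view ID (placeholder)
          '2017-01-01',             // start date (placeholder)
          '2017-01-07',             // end date (placeholder)
          'ga:sessions,ga:users',   // metrics
          {
            dimensions: 'ga:date',
            samplingLevel: 'HIGHER_PRECISION'
          });
      Logger.log('Sampled: ' + response.containsSampledData);
      return response;
    }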

Sumit, the API and the Google Analytics UI are certainly different, and sampling's effect on each is a different beast that must be handled properly to get anything useful out of it.
As was mentioned in the comment, you can achieve high-precision, unsampled reports by (typically) shortening the date range you query and then "walking" the data.
To walk the data, you essentially just increment that small date range gradually as you move through the desired period.
The "unsampled reports API" is... well, not the best. Considering that unsampled data is what Google is avoiding giving the end user in the first place, the offering available is not a very good long-term or large-project-friendly solution. I would recommend small date ranges and a data walk, as in the sketch below.
Happy Coding

There are several solutions that avoid the sampling issue in Google Analytics by automating the export of data for short date ranges.
I prefer this tool, it's pretty simple to use: MadStats.io

Related

Data sampling in google analytics goal flow report

The goal flow report on my Google Analytics account shows some strange sampling behavior. While I can usually select up to a month of data before sampling starts, it seems to be different for the goal flow report.
As soon as I select more than one day of data, the data set used gets smaller very fast. At three days the report is based on only 50% of the sessions, which, according to Analytics, comes to only 35 sessions.
Has anyone experienced similar sampling behavior even though only very small data sets are used?
Sampling is induced when your request is calculation-intensive; there's no guaranteed point at which it trips.
Goal flow complexity increases exponentially as you add goals, so even a low number of goals might make this report demand a lot of processing.
Meanwhile, you'll find that most of the standard reports can cover large periods of time without sampling; they are pre-aggregated, so it's very cheap to load them.
If you want to know more about sampling, see here:
https://stackoverflow.com/a/37386181/5815149

Using analytics for non web-related project

I'm looking to use Google Analytics for its web interface only. A large dataset such as gasoline prices would be submitted to Analytics via the API and viewed. Is this possible? Or is Analytics purely tailored to viewing website statistics?
The Google Analytics data model is really geared toward datasets that can be thought of in terms of users, sessions, and hits (hits being things like pageviews and events).
If your data can be thought of in these terms, it will probably work. If, on the other hand, you're trying to do things like joins or calculate averages or other statistical operations, you're probably better off using something else.
While the others are correct that Google Analytics is geared towards users, sessions, and hits, it is nonetheless simply an application for data analysis. The question will be how to get the data into the system.
I think you need to give us a little more information about your data set, but let me assume a few things:
You have a dataset with gasoline prices over a period of days.
You have a dataset with gasoline prices for different gas stations.
It would be really nice if this wasn't old data, and that these are new gas prices coming in.
If I had this dataset, I could insert it into Google Analytics directly using the Measurement Protocol.
The Measurement Protocol has a few required things, the first being the hit type: 'pageview', 'screenview', 'event', 'transaction', 'item', 'social', 'exception', or 'timing'. The second would be the cid, the client ID.
Now, I think I would probably set the cid to the different gas stations, and probably add a custom dimension with the gas station name.
As for the hit type, I think I would probably use screenview and create an app-type Google Analytics property, mainly because this isn't a website, it's a little different.
Then, every time the price of gas changes, I would send a screenview with the cid of the station, the custom dimension of the station, and a custom metric with the price.
The main problem you are going to have is that Google Analytics doesn't handle old data well. If you are going to insert this data with a date associated, the date and time can't be more than 4 hours ago or the server won't process it.
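A minimal sketch of such a hit sent from Apps Script; the tracking ID is a placeholder, and custom dimension 1 and custom metric 1 are assumed to be configured for station name and price:

    // Sketch only: send one Measurement Protocol screenview per price change.
    // 'UA-XXXXXXXX-Y' is a placeholder tracking ID; cd1/cm1 assume custom
    // dimension 1 (station name) and custom metric 1 (price) are set up.
    function sendPriceHit(stationId, stationName, price) {
      var payload = {
        v: '1',                  // protocol version
        tid: 'UA-XXXXXXXX-Y',    // tracking ID (placeholder)
        cid: stationId,          // client ID: one per gas station
        t: 'screenview',         // hit type
        an: 'GasPrices',         // application name (assumed)
        cd: 'Price Update',      // screen name (assumed)
        cd1: stationName,        // custom dimension 1: station name
        cm1: String(price)       // custom metric 1: price
      };
      UrlFetchApp.fetch('https://www.google-analytics.com/collect', {
        method: 'post',
        payload: payload
      });
    }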
Have you considered putting it in BigQuery instead?
This question really is too broad or opinion-based, but it was fun to consider.
It is possible to send all kinds of hits with the Measurement Protocol. But Philip is correct in stating that the data model is largely geared towards users, sessions and hits. But you could probably get a good ways with custom dimensions and metrics.

Google analytics data adjustment?

I've been using an SSIS integration component to download data from Google Analytics in order to keep a historical view of some websites and track their evolution. Basically the metrics we track are Visits (now Sessions) and Visitors (now Users), and the dimensions are Year and Month. However, today I noticed that the data I downloaded for July had a variation on the Users metric. I heard that Google Analytics uses an estimation method to "calculate" some (if not all) of their metrics; could it be that they later "adjust" the data with more accurate information? If so, is this mentioned in the documentation? (a link would be highly appreciated) The users are complaining that we are not delivering the real GA data. I looked on the Google Analytics documentation page with no luck.
Thanks for your time.
PS: Sorry for my English, it isn't my native language
If you are using the standard version of Google Analytics (you'll know if you are paying $150k for Premium), data is sampled depending on volume. Have a read of this article: can-you-trust-your-google-analytics-data
I have seen slightly differing results being returned when repeatedly calling the API with the same historical parameters. In my case the figures only differed by 1-2 over a daily set of several thousand, but they nevertheless differed.
If you want to guarantee your results, consider upgrading to Premium.
Sampling could be an issue if what you are requesting covers more than 50,000 rows for the time period. To avoid it you can download more often, such as daily.
But I think your issue is that there is a processing time for Google Analytics: if you are downloading at 3 am on the 1st, it is probable that the processing for the previous day has not finished.
The Google Analytics Premium SLA is 4-hour data freshness, so even that would have trouble. Pragmatically you should allow 24 hours before you download data for the previous day, and 48 hours for e-commerce data.
Thirdly, make sure it is not Unique Visitors you are requesting, as that metric is dependent on the time period you are requesting.
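As for the processing delay mentioned above, a small Apps Script sketch of that buffer, assuming a 48-hour delay is enough for the data to be final:

    // Sketch only: query no further than two days back so the previous
    // day's processing (and e-commerce data) has had time to finish.
    var safeEnd = new Date();
    safeEnd.setDate(safeEnd.getDate() - 2);
    var endDate = Utilities.formatDate(safeEnd, 'GMT', 'yyyy-MM-dd');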

BigQuery vs Google Analytics, which data is more accurate?

As a Premium Google Analytics/BigQuery customer, our question is: which data is more accurate?
I tend to lean toward BigQuery being more accurate because we can actually see the raw data, but we have no insight into the method Google Analytics is using to calculate its numbers.
I also think that a lot of it has to do with SAMPLING.
When you calculate something simple like Total Pageviews for a single page, the Google Analytics numbers line up with BigQuery to within .00001%:

    sum(case when regexp_match(hits.page.pagepath,r'(?i:/contact.aspx)') and hits.type = "page" then 1 else 0 end) as total_pageviews
When you calculate something more complex like Unique Pageviews for a single page, the Google Analytics numbers are about 5% greater than BigQuery's. Note that the distinct count is approximated with the maximum threshold of 1 million:

    count(distinct (case when regexp_match(hits.page.pagepath,r'(?i:/contact.aspx)') and hits.type = "page" then concat(fullvisitorid, string(visitid)) end), 1000000) as unique_pageviews
I would love to know what others think or what the Google Developers themselves can explain.
If you are a Premium customer, I am assuming that's because you have a large website with a lot of data. The Google Analytics API will sample your data if there's too much of it. This is something you can try to prevent by raising the sampling level, but even with the sampling level set to high precision you will still get sampled data back from the API.
Check the JSON coming back from the API; it will tell you if your data is being sampled.
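For example, with the Apps Script Analytics advanced service (placeholder view ID and dates), the fields to look at are containsSampledData, sampleSize and sampleSpace:

    // Sketch only: inspect the sampling metadata on a Core Reporting response.
    var response = Analytics.Data.Ga.get(
        'ga:XXXXXXXX', '2017-01-01', '2017-01-31', 'ga:pageviews',
        { samplingLevel: 'HIGHER_PRECISION' });
    if (response.containsSampledData) {
      // Fraction of sessions the report was actually based on.
      var pct = 100 * Number(response.sampleSize) / Number(response.sampleSpace);
      Logger.log('Report based on ' + pct + '% of sessions');
    }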
BigQuery won't sample your data. There is a way for Premium customers to use the API without sampled data, but I think you have to contact Google about setting that up.
The bigger point in BigQuery's favor is that you aren't limited to 7 dimensions and 10 metrics like you are with the Google Analytics API.
Note: I am not a Google Developer but I am a Google Developer Expert for Google Analytics.
I am a big fan of BigQuery. I have also used Google Analytics quite a lot. So the question is about where the data is more accurate.
Well, the answer to such a question is always: "data is more accurate, the closer it is to where it originates". BigQuery is an underlying storage of all of Google's data. This is where data is collected, indexed, and then made accessible through a SQL interface.
Google Analytics is a tool that was developed with a lot of free accounts in mind. To support free accounts, GA needed to scale well. To scale, companies optimize on storage by pre-aggregating data.
So you are really comparing two things: pre-summarized/pre-aggregated data (GA) and raw accumulated data (BigQuery). Which would you trust?
Now, it sounds like there is also a second question: how do you get accurate aggregates from BigQuery? BigQuery uses a SQL dialect that is not ANSI compatible and is hard to remember for ad-hoc queries. You are better off connecting a BI tool on top of BigQuery, so that you can explore the data in a consistent manner (i.e. same thresholds/rounding).

Could "filling up" Google Analytics with millions of events slow down query performance / increase sampling?

I'm considering doing some relatively large scale event tracking on my website.
I estimate this would create up to 6 million new events per month in Google Analytics.
My questions are, would all of this extra data that I'm now hanging onto:
a) Slow down GA UI performance
and
b) Increase the amount of data sampling
Notes:
I have noticed that GA seems to be taking longer to retrieve results for longer date ranges on my website lately, but I don't know whether that has to do with the increased amount of event tracking I've been doing or not; it may be that GA is fighting for resources as it matures and as more and more people collect more and more data...
Finally, one might guess that adding events may only slow down reporting on events, but this isn't necessarily so is it?
Drewdavid,
The amount of data being loaded will influence the speed of the GA UI, but nothing really dramatic, I would say. I am running a website/app with 15+ million events per month, and even though all the reporting is automated via the API, every now and then we need to find something specific and use the regular GA UI.
More than speed, I would be worried about sampling. That's the reason we automated the reporting in the first place, as there are some ways you can eliminate it (with some limitations). See this post, for instance, which describes using Analytics Canvas, one of my favorite tools (I am not affiliated in any way :-).
Also, let me ask what the purpose of your events would be. Think twice about whether you would actually use them later on...
Slow down GA UI performance
Standard Reports are precompiled and will display as usual. Reports that are generated ad hoc (because you apply filters, segments etc.) will take a little longer, but not so much that it hurts.
Increase the amount of data sampling
If by "sampling" you mean throwing away raw data, Google does not do that (I actually have that in writing from a Google representative). However the reports might not be able to resolve all data points (e.g. you get Top 10 Keywords and everything else is lumped under "other").
However, those events will count towards your data limit, which is ten million interaction hits per month (pageviews, events, transactions, any single product in a transaction, user timings and possibly others). Google will not drop data or close your account without warning (again, I have that in writing from a Google Sales Manager), but they reserve the right to either force you to collect fewer interaction hits or to close your account some time after they issue a warning (actually they will ask you to upgrade to Premium first, but chances are you don't want to spend that much money).
Google is pretty lenient when it comes to violations of the data limit, but other people's leniency is not a good basis for a reliable service, so you want to make sure that you stay within the limits.
