As a Premium Google Analytics/BigQuery customer, our question is, Which data is more accurate?
I lean toward BigQuery being more accurate because we can actually see the raw data, whereas we have no insight into the method Google Analytics uses to calculate its numbers.
I also think that a lot of it has to do with SAMPLING.
When you calculate something simple like Total Pageviews for a single page, the Google Analytics numbers line up to BigQuery within .00001%:
sum(case when regexp_match(hits.page.pagepath,r'(?i:/contact.aspx)') and hits.type = "page" then 1 else 0 end) as total_pageviews
When you calculate something more complex like Unique Pageviews for a single page, the Google Analytics numbers are 5% greater than BigQuery's. Note that the distinct count below is set to the maximum threshold of 1 million:
count(distinct (case when regexp_match(hits.page.pagepath,r'(?i:/contact.aspx)') and hits.type = "page" then concat(fullvisitorid, string(visitid)) end), 1000000) as unique_pageviews
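To make the difference between the two metrics concrete, here is a small Python sketch over made-up hit rows. It assumes that Total Pageviews counts every matching page hit, while Unique Pageviews counts distinct (fullVisitorId, visitId) pairs among those hits, as the two queries above do. The sample rows, the constant names, and the function name are hypothetical and only serve the illustration.

```python
import re

# Hypothetical hit rows; the keys loosely mirror the fields used in the queries above.
HITS = [
    {"fullVisitorId": "A", "visitId": 1, "type": "PAGE", "pagePath": "/contact.aspx"},
    {"fullVisitorId": "A", "visitId": 1, "type": "PAGE", "pagePath": "/Contact.aspx"},
    {"fullVisitorId": "A", "visitId": 2, "type": "PAGE", "pagePath": "/contact.aspx"},
    {"fullVisitorId": "B", "visitId": 1, "type": "PAGE", "pagePath": "/contact.aspx"},
    {"fullVisitorId": "B", "visitId": 1, "type": "EVENT", "pagePath": "/contact.aspx"},
    {"fullVisitorId": "B", "visitId": 1, "type": "PAGE", "pagePath": "/home.aspx"},
]

# Case-insensitive partial match, like the (?i:...) pattern in the queries.
PAGE_PATTERN = re.compile(r"/contact\.aspx", re.IGNORECASE)


def pageview_counts(hits):
    """Return (total_pageviews, unique_pageviews) for the contact page."""
    matching = [
        h for h in hits
        if h["type"].upper() == "PAGE" and PAGE_PATTERN.search(h["pagePath"])
    ]
    total = len(matching)
    # One unique pageview per visitor/session pair that saw the page at least once.
    unique = len({(h["fullVisitorId"], h["visitId"]) for h in matching})
    return total, unique
```

Because the unique count depends on exact de-duplication, it is the metric where any approximation or sampling on either side is most likely to show up as a gap.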
I would love to know what others think or what the Google Developers themselves can explain.
If you are a premium customer, I assume that's because you have a large website with a lot of data. The Google Analytics API will sample your data if there's too much of it. You can try to reduce this by raising the sampling level, but even with the sampling level set to high precision you can still get sampled data back from the API.
Check the JSON coming back from the API; it will tell you whether your data is being sampled.
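As a rough sketch of that check in Python: the function below inspects an already-parsed response dictionary. The field names containsSampledData, sampleSize, and sampleSpace are what I recall from the Core Reporting API v3 response format, so treat them as an assumption and compare them with the JSON you actually receive. The function name is hypothetical.

```python
def sampling_summary(response):
    """Return (is_sampled, sampled_fraction) for a parsed API response dict.

    sampled_fraction is None when the response is unsampled or when the
    sample size fields are missing.
    """
    is_sampled = bool(response.get("containsSampledData", False))
    size = response.get("sampleSize")
    space = response.get("sampleSpace")
    if not is_sampled or size is None or space is None:
        return is_sampled, None
    space = int(space)  # the API returns these counts as strings
    if space == 0:
        return is_sampled, None
    return is_sampled, int(size) / space
```

For example, a response with sampleSize "248360" and sampleSpace "993440" would be reported as sampled, based on a quarter of the sessions.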
BigQuery won't sample your data. There is a way for premium customers to use the API without sampling, but I think you have to contact Google about setting that up.
The bigger point in BigQuery's favor is that you aren't limited to 7 dimensions and 10 metrics like you are with the Google Analytics API.
Note: I am not a Google Developer but I am a Google Developer Expert for Google Analytics.
I am a big fan of BigQuery. I have also used Google Analytics quite a lot. So the question is about where the data is more accurate.
Well, the answer to such a question is always: "data is more accurate, the closer it is to where it originates". BigQuery is an underlying storage of all of Google's data. This is where data is collected, indexed, and then made accessible through a SQL interface.
Google Analytics is a tool that was developed with a lot of free accounts in mind. To support free accounts, GA needed to scale well. To scale, companies optimize on storage by pre-aggregating data.
So you are really comparing two things: pre-summarized/pre-aggregated data (GA) and raw accumulated data (BigQuery). Which would you trust?
Now, it sounds like there is also a second question: "how to get accurate aggregates from BigQuery?" BigQuery's SQL dialect is not ANSI compatible and is hard to remember for ad-hoc queries. You are better off connecting a BI tool on top of BigQuery, so that you can explore data in a consistent manner (i.e. same threshold/rounding).
In a typical GA session, after picking a View ID and a date range, we can get a week's worth of data like this:
Users: 146,207
New Users: 124,582
Sessions: 186,191
The question is, what BQ field(s) to query in order to get this Users value?
Here is an example query with 2 methods (the 2nd method is commented out).
SELECT
count(DISTINCT CONCAT(CAST(visitID AS STRING), CAST(visitNumber AS STRING))) AS visitors,
-- count(DISTINCT fullVisitorId) AS visitors
I noticed the FVID method was fairly close to what I see in GA (with Users understated by about 3% in BQ), and if I use the commented-out method, I get a value that is about 15% overstated compared to GA. Is there a more reliable method in BQ to obtain the Users value shown in GA?
The COUNT(DISTINCT fullVisitorId) method is the most correct method, but it won't match what Analytics 360 reports by default. Since last year, Google Analytics 360 by default uses a different calculation for the Users metric than it previously did. The old calculation, which is still used in unsampled reports, is more likely to match what you get out of BigQuery. You can verify this by exporting your report as an unsampled report, or using the unsampled reporting features in the Management API.
If you want the numbers to match exactly, you can turn off the new calculation by using the instructions here. The new calculation's precise details are not public, so duplicating that value in BigQuery is quite difficult.
There are still some reasons you might see different numbers, even with the old calculation. One is if the site has implemented User ID, in which case the GA number will be lower than BigQuery for fullVisitorId. Another is sampling, though that's unlikely in Analytics 360 at the volumes you're talking about.
I am trying to automate a weekly report. Currently, I use the Google Analytics website to get the data for my report, with the sampling level set to higher precision.
I tried to get the same data through the Google Analytics API with samplingLevel set to HIGHER_PRECISION. However, I am still getting sampled data.
With FASTER, the sample is roughly 25% of sessions, whereas with DEFAULT and HIGHER_PRECISION it is roughly 50%.
The Google Analytics website says 'This report is based on 100% of sessions'. Can I get the same level of accuracy with the Google API? I am using Google Apps Script. The response for HIGHER_PRECISION does not match.
Sumit, the API and the Google Analytics UI are certainly different, and sampling behaves differently in each. It has to be handled properly to get anything useful out of the API.
As was mentioned in the comment, you can achieve high precision unsampled reports by (typically) shortening your date range that you're querying for and then "walking" the data.
To walk the data, you are essentially just gradually incrementing that small date range as you move through the desired data.
The "unsampled reports API" is... well, not the best. Considering that's what they are avoiding giving the end user in the first place, the offering available is not a very good long term or large project friendly solution. I would recommend small date ranges and then doing a data walk.
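The data walk described above can be sketched in Python as a generator of small, consecutive date windows. This is only an illustration: the function name and the days_per_step parameter are hypothetical, and the actual API request for each window is left out.

```python
from datetime import date, timedelta


def walk_date_ranges(start, end, days_per_step=1):
    """Yield inclusive (window_start, window_end) pairs that cover start..end."""
    if days_per_step < 1:
        raise ValueError("days_per_step must be at least 1")
    current = start
    while current <= end:
        # Clamp the last window so it never runs past the requested end date.
        window_end = min(current + timedelta(days=days_per_step - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)
```

Each window would then be sent as its own query and the results combined afterwards. Keep in mind that additive metrics such as sessions or pageviews can simply be summed across windows, but unique counts such as users cannot, because the same user can appear in several windows.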
Happy Coding
There are several solutions that avoid the sampling issue in Google Analytics by automating the export of data for short date ranges.
I prefer this tool, it's pretty simple to use: MadStats.io
I'm looking for a way to integrate all my DFP data (impressions, CTR, revenues, etc.) to my Google Analytics campaigns (and sources, medium, etc.).
From what I understand Google Analytics premium/360 has a feature which integrates with DFP data but I also see that it costs (at least) $150K/year which is a bit over my budget right now :)
It seems like a pretty common issue so I think I'm not the first one to need a functionality like that.
Otherwise, what are my options for tracking whether a specific marketing campaign is yielding the expected revenue?
Thanks !
I'm looking to use Google Analytics for its web interface only. A large dataset such as gasoline prices would be submitted to Analytics via the API and viewed there. Is this possible? Or is Analytics purely tailored to viewing website statistics?
The Google Analytics data model is really geared toward datasets that can be thought of in terms of users, sessions, and hits (hits being things like pageviews and events).
If your data can be thought of in these terms, it will probably work. If, on the other hand, you're trying to do things like joins or calculate averages or other statistical operations, you're probably better off using something else.
The others are correct that Google Analytics is geared towards users, sessions, and hits. It is nonetheless simply an application for data analysis. The question is how to get the data into the system.
I think you need to give us a little more information about your data set. But let me assume a few things.
You have a dataset with gasoline prices over a period of days.
You have a dataset with gasoline prices for different gas stations.
It would be really nice if this weren't old data but new gas prices coming in.
If I had this dataset, I could insert it into Google Analytics directly using the Measurement Protocol.
The Measurement Protocol has a few required parameters. The first is the hit type, one of 'pageview', 'screenview', 'event', 'transaction', 'item', 'social', 'exception', or 'timing'. The second is cid, the client ID.
I think I would set cid to the different gas stations and add a custom dimension with the gas station name.
As for the hit type, I would probably use screenview and create an application Google Analytics account, mainly because this isn't a website, so it's a little different.
Then every time the price of gas changes, I would send a screenview with the station's cid, the custom dimension holding the station name, and a custom metric holding the price.
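A minimal Python sketch of such a hit, assuming the Universal Analytics Measurement Protocol parameter names (v, tid, cid, t, an, cd, and the indexed cd1 and cm1 for the custom dimension and metric). The function name, the app name "GasPrices", the screen name "price-update", and the index values are all made up for illustration; the custom dimension and metric would have to be defined in the property first. The sketch only builds the payload and does not send anything.

```python
from urllib.parse import urlencode, parse_qs

# Collection endpoint that the payload would be POSTed to.
COLLECT_URL = "https://www.google-analytics.com/collect"


def build_price_hit(tracking_id, station_id, station_name, price,
                    dimension_index=1, metric_index=1):
    """Build a URL-encoded screenview payload for one gas price change."""
    params = {
        "v": "1",                 # protocol version
        "tid": tracking_id,       # property ID, e.g. "UA-XXXXX-Y"
        "cid": station_id,        # client ID, here one per gas station
        "t": "screenview",        # hit type
        "an": "GasPrices",        # app name (hypothetical)
        "cd": "price-update",     # screen name (hypothetical)
        f"cd{dimension_index}": station_name,  # custom dimension: station name
        f"cm{metric_index}": f"{price:.2f}",   # custom metric: price
    }
    return urlencode(params)
```

Note that the plain cd parameter is the screen name, while cd followed by an index is a custom dimension, which is easy to mix up. The returned string could be sent as the body of a POST request to COLLECT_URL with any HTTP client.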
The main problem you are going to have is that Google Analytics doesn't handle old data well. If you insert this data with a date associated, the date and time can't be more than 4 hours in the past or the server won't process it.
Have you considered putting it in BigQuery instead?
This question really is too broad or opinion based, but it was fun to consider.
It is possible to send all kinds of hits with the Measurement Protocol. But Philip is correct in stating that the data model is largely geared towards users, sessions and hits. But you could probably get a good ways with custom dimensions and metrics.
Google Analytics definitely has a problem with how it displays statistics.
I ran a query for the date range 14-19 Jul:
https://www.dropbox.com/s/fim4avm75ohypqs/gatrust2.png
It shows 12 transactions from iOS 5.1.1 and 0 from any other version, which is very strange.
OK, who knows, maybe there is some abnormal user behavior.
But then I ran the same request for a single day (18 Jul):
https://www.dropbox.com/s/m0q0lvuvzu4svy5/gatrust1.png
Now 6 transactions are shown from other versions.
I have a feeling that I may run into such inconsistencies in other Google Analytics queries,
where I cannot see clear proof of an inconsistency but feel that the information provided is not logical.
Does this mean that I can't trust the information provided by GA?
Should I just use it as some... sandbox tool?
Confused.
The inconsistencies are probably caused by Google Analytics sampling. When you have a lot of visits, Google Analytics (free plan) takes only a part of them to compute the corresponding stats.
If you look at the first of your screenshots, you can see a box in the upper right saying: "This report is based on 248.360 visits (8,36% of sessions)". So you are only seeing data from 8,36% of the real visits. You don't know what the other 91,64% did in that date range.
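To put the figures from that box in perspective, here is a small Python sketch of the arithmetic. It assumes the sampled share applies uniformly to the whole date range; the function name is hypothetical.

```python
def estimate_total_sessions(sampled_sessions, sampled_share_percent):
    """Estimate the full session count from the sample size and its share."""
    if not 0 < sampled_share_percent <= 100:
        raise ValueError("sampled_share_percent must be in (0, 100]")
    return sampled_sessions / (sampled_share_percent / 100.0)


# Figures taken from the report box quoted above.
total_sessions = estimate_total_sessions(248_360, 8.36)
unseen_sessions = total_sessions - 248_360
```

With these inputs the estimate is roughly 2.97 million sessions in total, of which about 2.72 million never entered the report. Rare events such as a handful of transactions per OS version can easily fall entirely inside or outside such a sample, which would explain rows that show 0 in one report and 6 in another.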
If you want reliable data with such a high number of visits, Google Analytics (free) is probably not your best option. You could use Google Analytics Premium (quite expensive, but it eliminates the sampling issue), other paid analytics software, or a free alternative like Piwik or Open Web Analytics.