Analytics API doesn't match web data - google-analytics

I understand that this is a question which has been asked elsewhere, but I haven't yet found an answer which is especially helpful.
The problem I'm having is that the data on the regular web version of analytics doesn't match the data I've pulled from the API.
From what I've read, this can sometimes be an issue with the type of query being used. Here's what I've been using:
var requiredArguments = {
  'dimensions': 'ga:medium',
  'metrics': 'ga:users, ga:sessions, ga:uniquePageviews, ga:newUsers',
  'sort': 'ga:medium',
  'start-index': '1',
  'max-results': '1000',
  'sampling-level': 'DEFAULT'
};
and then...
var results = Analytics.Data.Ga.get(
    tableId,
    startDate,
    finishDate,
    'ga:users, ga:sessions, ga:uniquePageviews, ga:newUsers',
    requiredArguments);
Sessions across a month, for instance, can sometimes vary by over 1,000. I've tried different sampling levels; I don't think sampling is the cause, because I'm not going over 50,000 sessions in a query.
Any help on this is much appreciated.

You need to check the result returned: if the data is sampled, the response will tell you so via the containsSampledData field:
"containsSampledData": false
samplingLevel
samplingLevel=DEFAULT
Optional. Use this parameter to set the sampling level (i.e. the number of sessions used to calculate the result) for a reporting query. The allowed values are consistent with the web interface and include:
•DEFAULT — Returns response with a sample size that balances speed and accuracy.
•FASTER — Returns a fast response with a smaller sample size.
•HIGHER_PRECISION — Returns a more accurate response using a large sample size, but this may result in the response being slower.
If not supplied, the DEFAULT sampling level will be used. See the Sampling section for details on how to calculate the percentage of sessions that were used for a query.
Sampling should return results that are close to, but not exactly the same as, the website. The only way to completely remove sampling from the API is to have a Premium Google Analytics account.
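If you want the closest possible match to the web interface, request HIGHER_PRECISION and then check the sampling flag on the response. A minimal, untested Apps Script sketch along the lines of your query (it assumes tableId, startDate and finishDate are defined as in your code, and that the optional argument uses the REST parameter name samplingLevel):
var options = {
  'dimensions': 'ga:medium',
  'sort': 'ga:medium',
  'max-results': '1000',
  // Ask for the largest sample the API will give you.
  'samplingLevel': 'HIGHER_PRECISION'
};
var results = Analytics.Data.Ga.get(
    tableId,
    startDate,
    finishDate,
    'ga:users,ga:sessions,ga:uniquePageviews,ga:newUsers',
    options);
// If this is true, the numbers are estimates and will not match the web UI exactly.
Logger.log('Sampled: ' + results.containsSampledData);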
Also remember to consider processing latency. If you request data that is under 48 hours old it will also be different from the website.

Related

Discrepancy in Google Analytics data when using segments

I'm having a tough time with Google Analytics, trying to understand why the value of metrics changes when segments are applied.
There is a standard audience overview report, which is based on 100% of sessions (no sampling) and the view is not filtered. The period is March of 2017.
Standard "All visitors" segment looks like this:
Then there is another built-in segment called "Bounced Sessions". When I apply this segment, the "All visitors" values change:
The number of users increases, but the count of pageviews decreases.
Any ideas how to explain this? Thank you in advance!
OK, there can be multiple reasons. Let me first explain how these numbers are calculated, and then we'll move on to your query.
There are two ways Google gathers and computes this data.
Pre-calculated data -- pre-aggregated tables
This is data that Google pre-calculates to speed up the UI. Google does not specify when this aggregation happens; it can be at any point in time. These are known as pre-aggregated tables.
Data calculated on the fly
Anything you do that requires extra computation or manipulation falls under this category, such as applying segments or creating custom reports.
Coming to your problem: when you apply a segment, every metric it affects is recalculated, which may result in numbers greater than those you see in the normal view.
The standard Audience Overview report is pre-aggregated at some point during the day. When you apply a segment, the results are recalculated from the fresh data. Since the latter is the most recent, it can give you higher numbers for some metrics; you can also see a decrease. It all depends on your data and user behaviour.
Resolution: if you are a Premium user, use BigQuery. Rely on BigQuery for every metric, as those numbers are fresh and calculated on the fly.
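If you want to see the same effect through the API, you can run the same date range with and without a segment and compare the totals. A rough, untested Apps Script sketch; the view ID is a placeholder, and the dynamic segment here only approximates the UI's built-in "Bounced Sessions" segment:
var viewId = 'ga:XXXXXXX';  // replace with your view (profile) ID
var metrics = 'ga:users,ga:pageviews';
// Unsegmented request (comparable to the pre-aggregated overview report).
var plain = Analytics.Data.Ga.get(viewId, '2017-03-01', '2017-03-31', metrics, {});
// Same request with a dynamic "bounced sessions" style segment applied,
// which forces the numbers to be recalculated on the fly.
var segmented = Analytics.Data.Ga.get(viewId, '2017-03-01', '2017-03-31', metrics, {
  'segment': 'sessions::condition::ga:bounces>0'
});
Logger.log('No segment:   ' + JSON.stringify(plain.totalsForAllResults));
Logger.log('With segment: ' + JSON.stringify(segmented.totalsForAllResults));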

Is the Google Analytics API containsSampledData field reliable?

We are running the Google Analytics free version and I'm seeing some inconsistent results regarding data sampling. I have tried my requests in Google Analytics Query Explorer, the GA Sheets add-on, and within the GA interface.
Basically, I am comparing results from a complete date range against the sum of results for that date range broken into smaller chunks (to reduce/remove the chance of sampling occurring). Metrics are sessions, transactions, and revenue. I have a session-level dynamic segment applied: sessions::condition::!ga:landingPagePath=#/thanks
As you might expect, the results from the single request are different from (lower than) those from summing the multiple smaller requests. For example, sessions are 45,311 vs. 51,596, and revenue is even further apart. This implies that sampling is being used for the larger request. The trouble is that the API response explicitly says that sampling is not used in any case, i.e. "Contains Sampled Data" equals "No", even for the full date range, within which our property should be exceeding the 500,000 session threshold for sampling to kick in.
I'm almost certain that the results from summing smaller date ranges are correct, as these are pretty close to what we see in our CMS analytics.
Can anyone explain the mechanics behind this? Is GA doing some sort of behind-the-scenes sampling to produce this inconsistency?
Thanks,
Daniel
Sounds like sampling. Check all your sources to see if they contain sampled data, and make sure you have the sampling level set to "HIGHER_PRECISION".
1) Google Sheets Google Analytics Add-on: in cell B6 of the data for each query, check whether it says "Yes" for "Contains Sampled Data".
2) Google Analytics Query Explorer: in the header below your profile name, check whether it says "Contains Sampled Data: Yes".
You are on the right track in breaking your query down into smaller chunks with smaller date ranges to avoid sampling. Here is a post on how to Avoid Google Analytics Sampling using Python
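As an illustration of that approach, here is a rough, untested Apps Script sketch that splits a year into per-month requests at HIGHER_PRECISION and sums the sessions (the view ID is whatever you pass in, and your own dynamic segment could be added via the 'segment' option):
function sumSessionsByMonth(viewId, year) {
  var total = 0;
  for (var month = 0; month < 12; month++) {
    // First and last day of the month, formatted as yyyy-MM-dd.
    var tz = Session.getScriptTimeZone();
    var start = Utilities.formatDate(new Date(year, month, 1), tz, 'yyyy-MM-dd');
    var end = Utilities.formatDate(new Date(year, month + 1, 0), tz, 'yyyy-MM-dd');
    var result = Analytics.Data.Ga.get(viewId, start, end, 'ga:sessions', {
      'samplingLevel': 'HIGHER_PRECISION'
    });
    // Warn if even the smaller chunk is still sampled.
    if (result.containsSampledData) {
      Logger.log(start + ' - ' + end + ' is still sampled');
    }
    total += Number(result.totalsForAllResults['ga:sessions']);
  }
  return total;
}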

Wrong users count returned

I'm seeing inconsistencies between reported ga:sessions, ga:users, ga:pageviews from a query spanning a year through the API, and the same date range from the GA website.
I've been able to match ga:sessions and ga:pageviews exactly by requesting every month separately and summing the values; however, in the case of ga:users I am still seeing wildly different figures between the numbers returned by the GA API and the GA website.
When I sum the monthly figures, the total is actually larger than the single year-long figure, and both numbers are higher than the values reported in the GA website.
What dimension/metric could GA be using for 'Users'?
I suspect you're having an issue with the sampling level. If the request you are making returns a large enough amount of data (in this case, selecting a full year's worth of data), the server will return sampled results.
Sampling
Google Analytics calculates certain combinations of dimensions and metrics on the fly. To return the data in a reasonable time, Google Analytics may only process a sample of the data.
You can specify the sampling level to use for a request by setting the samplingLevel parameter.
If a Core Reporting API response contains sampled data, then the containsSampledData response field will be true. In addition, 2 properties will provide information about the sampling level for the query: sampleSize and sampleSpace. With these 2 values you can calculate the percentage of sessions that were used for the query. For example, if sampleSize is 201,000 and sampleSpace is 220,000 then the report is based on (201,000 / 220,000) * 100 = 91.36% of sessions.
When requesting from the API, the DEFAULT sampling level is used in order to increase the speed of the request. You can change that by specifying the samplingLevel to use in your request.
samplingLevel=DEFAULT
Optional. Use this parameter to set the sampling level (i.e. the number of sessions used to calculate the result) for a reporting query. The allowed values are consistent with the web interface and include:
•DEFAULT — Returns response with a sample size that balances speed and accuracy.
•FASTER — Returns a fast response with a smaller sample size.
•HIGHER_PRECISION — Returns a more accurate response using a large sample size, but this may result in the response being slower.
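To see how much of your traffic a given response was actually based on, you can read those two fields back from the response. A small, untested Apps Script sketch; the view ID and date range are placeholders:
var response = Analytics.Data.Ga.get('ga:XXXXXXX', '2016-01-01', '2016-12-31',
    'ga:users,ga:sessions,ga:pageviews', {'samplingLevel': 'HIGHER_PRECISION'});
if (response.containsSampledData) {
  // e.g. 201,000 / 220,000 * 100 = 91.36% of sessions used for this report.
  var pct = (Number(response.sampleSize) / Number(response.sampleSpace)) * 100;
  Logger.log('Report based on ' + pct.toFixed(2) + '% of sessions');
} else {
  Logger.log('Report is unsampled');
}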

How to get unsampled data from Google Analytics API - even for one day?

I'm trying to obtain unsampled data from the Google Analytics API, but for some reason it always comes out sampled. Even if I select only one day and filter for one page only. This is what I've tried on Google's Query Explorer:
What do I need to do to overcome this? Also, is there a way to see how much of the data is sampled (without having to log into the Google Analytics page...)?
In your query you need to supply a sampling level.
samplingLevel=DEFAULT
Optional. Use this parameter to set the sampling level (i.e. the number of visits used to calculate the result) for a reporting query. The allowed values are consistent with the web interface and include:
•DEFAULT — Returns response with a sample size that balances speed and accuracy.
•FASTER — Returns a fast response with a smaller sample size.
•HIGHER_PRECISION — Returns a more accurate response using a large sample size, but this may result in the response being slower.
If not supplied, the DEFAULT sampling level will be used.
You haven't stated which language you are using, so you're going to have to check the library for that language and figure out how to send it.
Update: trying to help with code. I haven't tested this, but my guess is you would add it as an optional parameter. Let me know if it doesn't work and I will see if I can get it working.
$optParams = array(
    'dimensions'    => 'ga:dateHour,ga:hour',
    'filters'       => 'ga:pagePath=~' . $pagelink . '*',
    'max-results'   => 1,
    'sort'          => 'ga:dateHour',
    'samplingLevel' => 'HIGHER_PRECISION');
$results_starttime = $connect->data_ga->get(
    'ga:' . $signedupuser["google id"],
    $startdate_analysed,
    $enddate_analysed,
    'ga:uniquePageviews',
    $optParams);
Update 2: Make sure you downloaded the lib from GitHub (google/google-api-php-client). I checked, and Analytics.php from there does have code to support samplingLevel.
Update 3: You're getting the old version of the lib; get the one from GitHub, see the link above.
google-api-php-client
Status:
The beta release of the next major revision (1.0.1-beta) of the
library is available, please migrate when possible. The newer version
is published on GitHub. All new development, and issue tracking, will
occur on Github.
Did you by any chance exceed the daily limits (500,000 visits) for sampling? If so, getting unsampled data is technically not possible.
Otherwise, see this answer to a very similar question.

big difference in "visitor" count

I'm trying to pull out the (unique) visitor count for a certain directory using three different methods:
* with a profile
* using a dynamic advanced segment
* using a custom report filter
On a smaller site the three methods give the same result. But on the large site (> 5M visits/month) I get a big discrepancy between the profile on one hand and the advanced segment and filter on the other. This might be because of sampling, but the difference is smaller when it comes to pageviews. Is the estimation of visitors worse, and the discrepancy bigger, when using sampled data? Also, when extracting data from the API (using filters or profiles) I still get DIFFERENT data even though GA doesn't indicate that the data is sampled, i.e. I should be looking at unsampled data.
Another strange thing is that the pageviews are higher in the profile than in the filter, while the visitor count is higher for the filter than for the profile. I also applied a filter at the profile to force it to use sampled data, and I again get quite similar results to the filter and segment data.
            profile    filter   segment   filter#profile
unique        25550     37778     36433            37971
pageviews    202761    184130       n/a           202761
What I am trying to achieve is to find a way to get somewhat accurate data on unique visitors when I've run out of profiles to use.
More data with discrepancies can be found in this google docs: https://docs.google.com/spreadsheet/ccc?key=0Aqzq0UJQNY0XdG1DRFpaeWJveWhhdXZRemRlZ3pFb0E
Google Analytics (free version) tracks only 10 million page interactions [0] (pageviews and events; any tracker method that starts with "track" is an interaction) per month [1], so presumably the data for your larger site is already heavily sampled (I guess each of your 5 million visitors has more than two interactions) [2]. Ad hoc reports use only 1 million data points at most, so you have a sample of a sample. Naturally, aggregated values suffer more from smaller sample sizes.
And I'm pretty sure the data limits apply to API access too (Google says that there is "no assurance that the excess hits will be processed"), so for the large site the API returns sampled (or incomplete) data as well; you cannot really be looking at unsampled data.
As for the differences, I'd say that different ad hoc reports use different samples, so you end up with different results. With GA you shouldn't rely too much on absolute numbers anyway; look more for general trends.
[1] Analytics Premium tracks 50 million interactions per month (and has support from Google) but comes at 150,000 USD per year.
[2] Google suggests using "_setSampleRate()" on large sites to make sure you actually have sampled data for each day of the month instead of a random hit or miss after you exceed the data limits (see the snippet after the links below).
Data limits:
http://support.google.com/analytics/bin/answer.py?hl=en&answer=1070983).
setSampleRate:
https://developers.google.com/analytics/devguides/collection/gajs/methods/gaJSApiBasicConfiguration#_gat.GA_Tracker_._setSampleRate
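For reference, setting the sample rate with the classic ga.js tracker is a one-line addition before the pageview call; a rough sketch (the property ID and the 80% rate are just example values):
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-XXXXX-Y']);  // your web property ID
_gaq.push(['_setSampleRate', '80']);       // sample 80% of visitors client-side
_gaq.push(['_trackPageview']);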
Yes, the sampled data is less accurate, especially with visitor counts.
I've also seen them miss 500k pageviews over two days, only to see them appear in their reporting a few days later. It also doesn't surprise me to see different results from different interfaces. The quality of Google Analytics has diminished, even as they have tried to become more real-time. It appears that their codebase is inconsistent across APIs, and their algorithms are all over the map.
I usually stick with the same metrics and reporting methods, so that my results remain comparable to one another. I also run GA in tandem with Gaug.es, as a validation and sanity check. With that extra data, I choose the reporting method in GA that I am most confident with and I rely on that exclusively.
