Differences between BigQuery and Google Analytics outputs - google-analytics

I am trying to get the count of distinct authorized customers from both BigQuery and Google Analytics, but I see a huge difference.
BigQuery gives me 15% fewer counts than GA. Does this have something to do with unnesting, or is it some other issue?
bigquery:
CREATE TEMP FUNCTION customDimensionByIndex(indx INT64, arr ARRAY<STRUCT<index INT64, value STRING>>) AS ((
  SELECT x.value
  FROM UNNEST(arr) x
  WHERE indx = x.index
));
SELECT
  auth,
  COUNT(DISTINCT ID)
FROM (
  SELECT
    ID,
    customDimensionByIndex(2, hit.customDimensions) AS auth
  FROM
    `test-apis2.94566232234.ga_sessions_20180422`,
    UNNEST(hits) AS hit
)
GROUP BY 1
This query gives me:
Authorized - TRUE: 76,649
GA gives me 50,833.
But I see very close numbers for "FALSE".
I am trying to find out why there is such a huge difference for "TRUE". I would appreciate any help.
Thanks in advance!!

Related

Replicating Google Analytics All Sessions for a Custom Dimension in BigQuery

I am trying to replicate sessions by a custom dimension in BigQuery to match the Google Analytics UI. I am only a few sessions off and I can't figure out how to get an exact match.
My current understanding is that GA breaks sessions at midnight (because its data model relies on processing in day chunks). I tried to take this into account with the code below, but something is not quite right. Does anyone know how to get an exact match?
SELECT
  CD12,
  SUM(sessions) AS sessions
FROM (
  SELECT
    CD12,
    CASE WHEN hitNumber = first_hit THEN visits ELSE 0 END AS sessions
  FROM (
    SELECT
      fullVisitorId,
      visitStartTime,
      totals.visits,
      hits.hitNumber,
      CASE WHEN cd.index = 12 THEN cd.value END AS CD12,
      MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
    FROM `data-....`,
      UNNEST(hits) AS hits,
      UNNEST(hits.customDimensions) AS cd
  )
)
WHERE CD12 = '0'
GROUP BY
  CD12
ORDER BY
  sessions DESC

Big Query and Google Analytics UI do not match when ecommerce action filter applied

We are validating a query in BigQuery, and cannot get the results to match the Google Analytics UI. A similar question can be found here, but in our case the mismatch only occurs when we apply a specific filter on ecommerce_action.action_type.
Here is the query:
SELECT COUNT(distinct fullVisitorId+cast(visitid as string)) AS sessions
FROM (
SELECT
device.browserVersion,
geoNetwork.networkLocation,
geoNetwork.networkDomain,
geoNetwork.city,
geoNetwork.country,
geoNetwork.continent,
geoNetwork.region,
device.browserSize,
visitNumber,
trafficSource.source,
trafficSource.medium,
fullvisitorId,
visitId,
device.screenResolution,
device.flashVersion,
device.operatingSystem,
device.browser,
totals.pageviews,
channelGrouping,
totals.transactionRevenue,
totals.timeOnSite,
totals.newVisits,
totals.visits,
date,
hits.eCommerceAction.action_type
FROM
(select *
from TABLE_DATE_RANGE([zzzzzzzzz.ga_sessions_],
<range>) ))t
WHERE
hits.eCommerceAction.action_type = '2' and <stuff to remove bots>
)
From the UI using the built in shopping behavior report, we get 3.836M unique sessions with a product detail view, compared with 3.684M unique sessions in Big Query using the query above.
A few questions:
1) We are under the impression that the shopping behavior report's "Sessions with Product View" breakdown is based on the ecommerce_action.action_type filter. Is that true?
2) Is there a .totals pre-aggregated table that the UI may be pulling from?
It sounds like the issue is that COUNT(DISTINCT ...) is approximate when using legacy SQL, as noted in the migration guide, so the counts are not accurate. Either use standard SQL instead (preferred) or use EXACT_COUNT_DISTINCT with legacy SQL.
You're including product list views in your query.
As described in https://support.google.com/analytics/answer/3437719, you need to make sure that no product has isImpression = TRUE, because that would mean it is a product list view.
This query sums all sessions which contain any hit with action_type='2' for which all isImpression values are null or false:
SELECT
  SUM(totals.visits) AS sessions
FROM
  `project.123456789.ga_sessions_20180101` AS t
WHERE
  (
    SELECT
      LOGICAL_OR(h.ecommerceaction.action_type = '2')
    FROM
      t.hits AS h
    WHERE
      (SELECT LOGICAL_AND(isimpression IS NULL OR isimpression = FALSE) FROM h.product)
  )
For legacySQL you can adapt the example in the documentation.
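The filter in the standard SQL query above can be easier to follow on toy data. Here is a minimal Python sketch of the same logic; the session dictionaries and both function names are made up for illustration and are not part of the export or any API. Note one detail it tries to mirror: LOGICAL_AND over an empty product array yields NULL, so a hit without any products is filtered out.

```python
# Hypothetical, simplified sessions (not real export rows).
SESSIONS = [
    # Real product detail view: action_type '2', product is not an impression.
    {"visits": 1, "hits": [{"action_type": "2", "product": [{"isImpression": None}]}]},
    # Product list view: the product is flagged as an impression.
    {"visits": 1, "hits": [{"action_type": "2", "product": [{"isImpression": True}]}]},
    # No detail-view action at all.
    {"visits": 1, "hits": [{"action_type": "0", "product": [{"isImpression": None}]}]},
    # Detail view, but only non-interactive hits, so totals.visits is NULL.
    {"visits": None, "hits": [{"action_type": "2", "product": [{"isImpression": False}]}]},
]


def has_product_detail_view(session):
    """True if any hit has action_type '2' and none of its products is an impression."""
    for hit in session["hits"]:
        products = hit.get("product", [])
        # Mirrors LOGICAL_AND(isImpression IS NULL OR isImpression = FALSE);
        # an empty product list behaves like NULL and drops the hit.
        no_impressions = bool(products) and all(not p.get("isImpression") for p in products)
        if no_impressions and hit.get("action_type") == "2":
            return True
    return False


def count_detail_view_sessions(sessions):
    """Mirrors SUM(totals.visits): NULL visits contribute nothing."""
    return sum(s["visits"] or 0 for s in sessions if has_product_detail_view(s))


print(count_detail_view_sessions(SESSIONS))
```

Only the first session is counted here: the second is a list view, the third has no detail action, and the fourth passes the filter but has no visits value to add.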
In addition to the fact that COUNT(DISTINCT ...) is approximate when using legacy SQL, there could be sessions that contain only non-interactive hits. These are not counted as sessions in the Google Analytics UI, but they are counted by both COUNT(DISTINCT ...) and EXACT_COUNT_DISTINCT(...) because your query counts visit IDs.
Using SUM(totals.visits) you should get the same result as in the UI because SUM does not take into account NULL values of totals.visits (corresponding to sessions in which there are only non-interactive hits).
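As a rough illustration of that difference, here is a small Python sketch on invented rows (the field names follow the export schema, the values are made up). Counting distinct IDs includes the session with a NULL totals.visits, while summing the column skips it.

```python
# Hypothetical sessions; visits=None stands for a NULL totals.visits
# (a session with only non-interactive hits).
toy_sessions = [
    {"fullVisitorId": "A", "visitId": 1, "visits": 1},
    {"fullVisitorId": "A", "visitId": 2, "visits": None},
    {"fullVisitorId": "B", "visitId": 1, "visits": 1},
]

# Like COUNT(DISTINCT fullVisitorId + visitId): every session ID is counted.
distinct_ids = len({(s["fullVisitorId"], s["visitId"]) for s in toy_sessions})

# Like SUM(totals.visits): NULL values are ignored.
sum_visits = sum(s["visits"] for s in toy_sessions if s["visits"] is not None)

print(distinct_ids, sum_visits)
```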

Counting app screen-views using Google Analytics BigQuery export

I'm trying to count the number of app screen-views for a particular screen using the Google Analytics BigQuery data export. My approach would be to count the number of hits with a screen-view hits.type. For instance, to count the number of page-views on the web version of our app I would count the number of hits with hits.type = 'PAGE'. But I can't see how to do this for the app because there is no "SCREENVIEW" hits.type value.
This is the description of hits.type from Google (https://support.google.com/analytics/answer/3437719?hl=en):
The type of hit. One of: "PAGE", "TRANSACTION", "ITEM", "EVENT",
"SOCIAL", "APPVIEW", "EXCEPTION".
Is there another way to do this that I'm missing?
I've tried using the totals.screenviews metric:
SELECT
hits.appInfo.screenName,
SUM(totals.screenviews) AS screenViews
FROM (TABLE_DATE_RANGE([tableid.ga_sessions_], TIMESTAMP('2018-01-12'), TIMESTAMP('2018-01-12') ))
GROUP BY
hits.appInfo.screenName
But that returns numbers that are too high.
Legacy SQL automatically unnests your data, which explains why your SUM(totals.screenviews) ends up being much higher (basically this field gets duplicated).
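To see why the duplication inflates the sum, here is a simplified Python model of the flattening (an illustration only, with an invented session; it is not how BigQuery is implemented). The session-level screenviews value gets repeated once per hit row, so summing it multiplies it by the number of hits, whereas counting APPVIEW hits gives the real figure.

```python
# One hypothetical session: 3 screen views plus 1 event, totals.screenviews = 3.
toy_session = {
    "screenviews": 3,
    "hits": [
        {"type": "APPVIEW", "screenName": "Home"},
        {"type": "EVENT", "screenName": "Home"},
        {"type": "APPVIEW", "screenName": "Home"},
        {"type": "APPVIEW", "screenName": "Cart"},
    ],
}


def flattened_sum(sessions):
    """Sum the session-level total once per flattened hit row (the inflated result)."""
    totals = {}
    for s in sessions:
        for hit in s["hits"]:
            totals[hit["screenName"]] = totals.get(hit["screenName"], 0) + s["screenviews"]
    return totals


def appview_count(sessions):
    """Count hits of type APPVIEW per screen name."""
    counts = {}
    for s in sessions:
        for hit in s["hits"]:
            if hit["type"] == "APPVIEW":
                counts[hit["screenName"]] = counts.get(hit["screenName"], 0) + 1
    return counts


print(flattened_sum([toy_session]))
print(appview_count([toy_session]))
```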
I'd recommend solving this one in Standard SQL, it's much easier and faster. See if this works for you:
#standardSQL
SELECT
name,
SUM(views) views
FROM(
SELECT
ARRAY(SELECT AS STRUCT appInfo.screenName name, COUNT(1) views FROM UNNEST(hits) WHERE type = 'APPVIEW' GROUP BY 1) data
FROM `projectId.datasetId.ga_sessions_*`
WHERE TRUE
AND EXISTS(SELECT 1 FROM UNNEST(hits) WHERE type = 'APPVIEW')
AND _TABLE_SUFFIX BETWEEN('20180112') AND ('20180112')
), UNNEST(data)
GROUP BY 1
ORDER BY 2 DESC
Filter on hit.type = 'APPVIEW' so that events are not counted:
#standardSQL
SELECT
  hit.appInfo.screenName name,
  COUNT(hit.appInfo.screenName) view
FROM
  `project_id.dataset_id.ga_sessions_*`,
  UNNEST(hits) hit
WHERE hit.type = 'APPVIEW'
GROUP BY
  name

Results of joined queries in BQ don't match data in Google Analytics

Background
In BigQuery, I'm trying to find the number of visitors that both visit one of two pages and purchase a specific product.
When I run each of the sub-queries, the numbers match exactly what I see in Google Analytics.
However, when I join them, the number is different than what I see in GA. I've had someone bring the results of the two sub-queries into Excel and do the equivalent, and their results equal what I'm seeing in BQ.
Details
Here's the query:
SELECT
  ProductSessions.date AS date,
  SUM(ProductTransactions.totalTransactions) transactions,
  COUNT(ProductSessions.visitId) visited_product_sessions
FROM (
  SELECT
    visitId, date
  FROM
    `103554833.ga_sessions_20170219`
  WHERE
    EXISTS(
      SELECT 1 FROM UNNEST(hits) h
      WHERE REGEXP_CONTAINS(h.page.pagePath, r"^www.domain.com/(product|product2).html.*"))
  GROUP BY visitID, date)
  AS ProductSessions
LEFT JOIN (
  SELECT
    totals.transactions as totalTransactions,
    visitId,
    date
  FROM
    `103554833.ga_sessions_20170219`
  WHERE
    totals.transactions IS NOT NULL
    AND EXISTS(
      SELECT 1
      FROM
        UNNEST(hits) h,
        UNNEST(h.product) prod
      WHERE REGEXP_CONTAINS(prod.v2ProductName, r"^Product®$"))
  GROUP BY
    visitId, totals.transactions,
    date) AS ProductTransactions
ON
  ProductTransactions.visitId = ProductSessions.visitId
WHERE ProductTransactions.visitId is not null
GROUP BY
  date
ORDER BY
  date ASC
I'm expecting ProductTransactions.totalTransactions to replicate the number of transactions in Google Analytics when filtered with an advanced segment of both:
Sessions include Page matching RegEx: www.domain.com/(product|product2).html.*
Sessions include Product matches exactly: Product®
However, results in BQ are about 20% higher than in GA.
Why the difference?

Total Sessions in BigQuery vs Google Analytics Reports

I'm just learning BigQuery so this might be a dumb question, but we want to get some statistics there and one of those is the total sessions in a given day.
To do so, I've queried in BQ:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
group each by fullvisitorid
)
(I'm using the table_query because later on we might increase the range of days)
This results in 1,075,137.
But in our Google Analytics Reports, in the "Audience Overview" section, the same day results:
This report is based on 1,026,641 sessions (100% of sessions).
There's always this difference of roughly 5% regardless of the day. So I'm wondering, even though the query is quite simple, is there any mistake we've made?
Is this difference expected to happen? I read through BigQuery's documentation but couldn't find anything on this issue.
Thanks in advance,
standardsql
Simply use SUM(totals.visits), or, when using COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))), make sure to restrict to totals.visits = 1!
If you use visitId and you are not grouping per day, you will combine midnight-split sessions!
Here are all scenarios:
SELECT
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING) )) allSessionsUniquePerDay,
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING) )) allSessionsUniquePerSelectedTimeframe,
sum(totals.visits) interactiveSessionsUniquePerDay, -- equals GA UI sessions
COUNT(DISTINCT IF(totals.visits=1, CONCAT(fullVisitorId, CAST(visitId AS STRING)), NULL) ) interactiveSessionsUniquePerSelectedTimeframe,
SUM(IF(totals.visits=1,0,1)) nonInteractiveSessions
FROM
`project.dataset.ga_sessions_2017102*`
Wrap up:
fullVisitorId + visitId: useful to reconnect midnight-splits
fullVisitorId + visitStartTime: useful to take splits into account
totals.visits=1 for interaction sessions
fullVisitorId + visitStartTime where totals.visits=1: GA UI sessions (in case you need a session id)
SUM(totals.visits): simple GA UI sessions
fullVisitorId + visitId where totals.visits=1 and GROUP BY date: GA UI sessions with too many chances for errors and misunderstandings
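The scenarios above can be played through on a few invented rows. This Python sketch assumes, as the wrap-up implies, that a midnight-split session keeps its visitId but gets a new visitStartTime in the second daily table; the rows and the function name are hypothetical.

```python
# Hypothetical rows: (date, fullVisitorId, visitId, visitStartTime, totals.visits)
TOY_ROWS = [
    ("20171020", "A", 100, 100, 1),     # session starts shortly before midnight
    ("20171021", "A", 100, 200, 1),     # same session after midnight (assumed: same visitId)
    ("20171021", "B", 300, 300, None),  # only non-interactive hits
]


def session_counts(rows):
    """Compute the five counts from the query above on in-memory rows."""
    by_start = {(vis, start) for _, vis, _, start, _ in rows}
    by_visit_id = {(vis, vid) for _, vis, vid, _, _ in rows}
    interactive_by_visit_id = {(vis, vid) for _, vis, vid, _, v in rows if v == 1}
    return {
        "allSessionsUniquePerDay": len(by_start),
        "allSessionsUniquePerSelectedTimeframe": len(by_visit_id),
        "interactiveSessionsUniquePerDay": sum(v or 0 for _, _, _, _, v in rows),
        "interactiveSessionsUniquePerSelectedTimeframe": len(interactive_by_visit_id),
        "nonInteractiveSessions": sum(0 if v == 1 else 1 for _, _, _, _, v in rows),
    }


print(session_counts(TOY_ROWS))
```

In this toy data the split session counts twice per day (matching what the GA UI would show) but only once when keyed by visitId across the whole timeframe.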
After posting the question, we got in contact with Google support and found that in Google Analytics only sessions that had an "event" being fired are actually counted.
In BigQuery you will find all sessions, regardless of whether they had an interaction or not.
In order to get the same result as in GA, you should filter by sessions with totals.visits = 1 in your BQ query (totals.visits is 1 only for sessions that had an event being fired).
That is:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
where totals.visits = 1
group each by fullvisitorid
)
The problem could be due to "COUNT DISTINCT".
According to this post:
COUNT DISTINCT is a statistical approximation for all results greater than 1000
You could try setting an additional COUNT parameter to improve accuracy at the expense of performance (see post), but I would first try:
SELECT COUNT( CONCAT( fullvisitorid,'_', STRING(visitid))) as sessions
from (table_query([40663402], 'timestamp(right(table_id,8)) between
timestamp("20150519") and timestamp("20150519")'))
What worked for me was this:
SELECT count(distinct sessionId) FROM(
SELECT CONCAT(clientId, "-", visitNumber, "-", date) as sessionId FROM `project-id.dataset-id.ga_sessions_*`
WHERE _table_suffix BETWEEN "20191001" AND "20191031" AND totals.visits = 1)
The explanation (very well written in this article: https://adswerve.com/blog/google-analytics-bigquery-tips-users-sessions-part-one/) is that we should be careful when counting sessions because, by default, Google Analytics breaks sessions that carry over midnight (in the time zone of the view). The same session can therefore end up in two daily tables (the article illustrates this with an image).
The code provided creates a sessionID by combining:
client id + visit number + date
while acknowledging the session break; the result will be in a human-readable format. Finally to match sessions in the Google Analytics UI, make sure to filter to only those with totals.visits = 1.
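For illustration, the same session ID construction can be sketched in Python on invented rows (the values and the helper name `session_id` are hypothetical). A session that carries over midnight yields two IDs, one per date, and rows without totals.visits = 1 are dropped.

```python
# Hypothetical rows: (clientId, visitNumber, date, totals.visits)
EXPORT_ROWS = [
    ("c1", 5, "20191001", 1),     # session starting before midnight
    ("c1", 5, "20191002", 1),     # its continuation in the next daily table
    ("c2", 1, "20191001", None),  # non-interactive session, filtered out
    ("c2", 2, "20191001", 1),
]


def session_id(client_id, visit_number, date):
    """Build the human-readable ID: client id + visit number + date."""
    return f"{client_id}-{visit_number}-{date}"


# Keep only interactive sessions, then count the distinct IDs.
session_ids = {
    session_id(client_id, visit_number, date)
    for client_id, visit_number, date, visits in EXPORT_ROWS
    if visits == 1
}

print(len(session_ids))
```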

Resources