Total Sessions in BigQuery vs Google Analytics Reports - google-analytics

I'm just learning BigQuery so this might be a dumb question, but we want to get some statistics there and one of those is the total sessions in a given day.
To do so, I've queried in BQ:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
group each by fullvisitorid
)
(I'm using the table_query because later on we might increase the range of days)
This results in 1,075,137.
But in our Google Analytics Reports, in the "Audience Overview" section, the same day results:
This report is based on 1,026,641 sessions (100% of sessions).
There's always a difference of roughly 5%, regardless of the day. So I'm wondering: even though the query is quite simple, is there a mistake in it?
Is this difference expected to happen? I read through BigQuery's documentation but couldn't find anything on this issue.
Thanks in advance,

standardsql
Simply use SUM(totals.visits). If you use COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING) )) instead, make sure to filter on totals.visits=1!
If you use visitId and you are not grouping per day, sessions that were split at midnight are merged back into one!
Here are all scenarios:
SELECT
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING) )) allSessionsUniquePerDay,
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING) )) allSessionsUniquePerSelectedTimeframe,
sum(totals.visits) interactiveSessionsUniquePerDay, -- equals GA UI sessions
COUNT(DISTINCT IF(totals.visits=1, CONCAT(fullVisitorId, CAST(visitId AS STRING)), NULL) ) interactiveSessionsUniquePerSelectedTimeframe,
SUM(IF(totals.visits=1,0,1)) nonInteractiveSessions
FROM
`project.dataset.ga_sessions_2017102*`
Wrap up:
fullVisitorId + visitId: useful to reconnect midnight-splits
fullVisitorId + visitStartTime: useful to take splits into account
totals.visits=1 for interaction sessions
fullVisitorId + visitStartTime where totals.visits=1: GA UI sessions (in case you need a session id)
SUM(totals.visits): simple GA UI sessions
fullVisitorId + visitId where totals.visits=1 and GROUP BY date: GA UI sessions with too many chances for errors and misunderstandings
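The scenarios above can be checked on a handful of hand-made rows. The sketch below is Python rather than SQL so that it runs without a BigQuery project. The rows, the `visits` key (standing in for totals.visits) and the helper name `count_sessions` are made up for illustration; only the field names come from the query above.

```python
# Toy rows shaped like ga_sessions export rows (one row per session per daily table).
# "visits" stands in for totals.visits; None models a session without interaction hits.
ROWS = [
    # One session of visitor A, split at midnight: same visitId, new visitStartTime.
    {"fullVisitorId": "A", "visitId": 100, "visitStartTime": 100, "visits": 1},
    {"fullVisitorId": "A", "visitId": 100, "visitStartTime": 200, "visits": 1},
    # Visitor B only sent non-interaction hits, so totals.visits is NULL.
    {"fullVisitorId": "B", "visitId": 300, "visitStartTime": 300, "visits": None},
    {"fullVisitorId": "C", "visitId": 400, "visitStartTime": 400, "visits": 1},
]


def count_sessions(rows):
    """Mirror the five aggregates of the query above on in-memory rows."""
    interactive = [r for r in rows if r["visits"] == 1]
    return {
        "allSessionsUniquePerDay": len(
            {(r["fullVisitorId"], r["visitStartTime"]) for r in rows}
        ),
        "allSessionsUniquePerSelectedTimeframe": len(
            {(r["fullVisitorId"], r["visitId"]) for r in rows}
        ),
        # SUM skips NULLs, modelled here by treating None as 0.
        "interactiveSessionsUniquePerDay": sum(r["visits"] or 0 for r in rows),
        "interactiveSessionsUniquePerSelectedTimeframe": len(
            {(r["fullVisitorId"], r["visitId"]) for r in interactive}
        ),
        "nonInteractiveSessions": sum(1 for r in rows if r["visits"] != 1),
    }


result = count_sessions(ROWS)
print(result)
```

In this toy data the midnight-split session of visitor A is counted twice by the visitStartTime key and by the sum of visits, but only once by the visitId key.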

After posting the question we got into contact with Google support and found that in Google Analytics only sessions that had an "event" being fired are actually counted.
In BigQuery you will find all sessions, regardless of whether they had an interaction or not.
In order to find the same result as in GA, you should filter by sessions with totals.visits = 1 in your BQ query (totals.visits is 1 only for sessions that had an event being fired).
That is:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
where totals.visits = 1
group each by fullvisitorid
)

The problem could be due to "COUNT DISTINCT".
According to this post:
COUNT DISTINCT is a statistical approximation for all results greater than 1000
You could try setting an additional COUNT parameter to improve accuracy at the expense of performance (see post), but I would first try:
SELECT COUNT( CONCAT( fullvisitorid,'_', STRING(visitid))) as sessions
from (table_query([40663402], 'timestamp(right(table_id,8)) between
timestamp("20150519") and timestamp("20150519")'))

What worked for me was this:
SELECT count(distinct sessionId) FROM(
SELECT CONCAT(clientId, "-", visitNumber, "-", date) as sessionId FROM `project-id.dataset-id.ga_sessions_*`
WHERE _table_suffix BETWEEN "20191001" AND "20191031" AND totals.visits = 1)
The explanation (very well written in this article: https://adswerve.com/blog/google-analytics-bigquery-tips-users-sessions-part-one/) is that we should be careful when counting sessions because, by default, Google Analytics breaks sessions that carry over midnight (in the time zone of the view). The same session can therefore end up in two daily tables.
The code provided creates a sessionID by combining:
client id + visit number + date
while acknowledging the session break; the result will be in a human-readable format. Finally to match sessions in the Google Analytics UI, make sure to filter to only those with totals.visits = 1.
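As a rough illustration of how this key behaves, here is a small Python sketch on invented rows. The helper `session_id`, the sample values and the `visits` key (standing in for totals.visits) are assumptions made for the example, not part of the export.

```python
def session_id(client_id, visit_number, date):
    """Build the human-readable key: client id + visit number + date."""
    return f"{client_id}-{visit_number}-{date}"


ROWS = [
    # The same visit appears in two daily tables because it crossed midnight.
    {"clientId": "123.456", "visitNumber": 3, "date": "20191001", "visits": 1},
    {"clientId": "123.456", "visitNumber": 3, "date": "20191002", "visits": 1},
    # A session without interaction hits: totals.visits is NULL, so it is filtered out.
    {"clientId": "789.012", "visitNumber": 1, "date": "20191002", "visits": None},
]

session_ids = {
    session_id(r["clientId"], r["visitNumber"], r["date"])
    for r in ROWS
    if r["visits"] == 1
}
session_count = len(session_ids)
print(sorted(session_ids), session_count)
```

Because the date is part of the key, the two halves of the midnight-split session stay separate, which is how the split is acknowledged.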

Related

Is there a way to duplicate sessions from the app-warming issue in Firebase < 8.11.0 via BigQuery?

The release notes for version 8.12.1 of the Firebase Apple SDK mention an issue with session_start events being logged during app prewarming on iOS 15+, which inserts additional 'session_start' events. I've noticed that these additional session_start rows insert additional 'ga_session_id' values into the BigQuery table.
ga_session_id is a unique session identifier associated with each event that occurs within a session, so a new one is created whenever this additional session_start fires during app prewarming. Using the session_number field and calculating session length, it's possible to remove sessions with just one session_start and a small session length, but this does not seem to reduce the overall count of sessions by much.
This has impacted the reported number of sessions when querying the BigQuery table and counting distinct user_pseudo_id||ga_session_id.
Is there a way to isolate these sessions in a separate table, or to exclude them with an additional clause in the query, so that sessions which are not truly sessions are removed?
https://github.com/firebase/firebase-ios-sdk/issues/6161
https://firebase.google.com/support/release-notes/ios
A simplified version of said query I'm using:
with windowTemp as
(
select
PARSE_DATE("%Y%m%d",event_date) as date_formatted,
event_name,
user_pseudo_id,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_id
from
`firebase-XXXX.analytics_XXX.events_*`
where
_table_suffix between '20210201' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
group by
1,2,3,4
)
SELECT
date_formatted,
Count(DISTINCT user_pseudo_id) AS users,
Count(DISTINCT Concat(user_pseudo_id, CAST(session_id AS STRING))) AS sessions, -- session_id is INT64, so cast it before CONCAT
FROM
windowTemp
GROUP by 1
ORDER BY 1
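The heuristic mentioned in the question (drop sessions that contain only a session_start and have a very small length) can be sketched on toy data as follows. This is only an illustration in Python: the event tuples, the helper name `suspected_prewarm_sessions` and the threshold `MIN_SESSION_SECONDS` are all hypothetical, and as noted above the heuristic may not catch every prewarmed session.

```python
from collections import defaultdict

MIN_SESSION_SECONDS = 1  # hypothetical threshold; tune it against your own data

# (user_pseudo_id, ga_session_id, event_name, event_timestamp in microseconds)
EVENTS = [
    ("u1", 111, "session_start", 1_000_000),
    ("u1", 111, "screen_view", 31_000_000),
    ("u1", 222, "session_start", 90_000_000),  # lone session_start: suspected prewarm
    ("u2", 333, "session_start", 5_000_000),
    ("u2", 333, "user_engagement", 65_000_000),
]


def suspected_prewarm_sessions(events, min_seconds=MIN_SESSION_SECONDS):
    """Flag sessions whose only event is session_start and whose length is tiny."""
    by_session = defaultdict(list)
    for user, session, name, timestamp in events:
        by_session[(user, session)].append((name, timestamp))

    flagged = set()
    for key, session_events in by_session.items():
        names = {name for name, _ in session_events}
        timestamps = [timestamp for _, timestamp in session_events]
        duration_seconds = (max(timestamps) - min(timestamps)) / 1_000_000
        if names == {"session_start"} and duration_seconds < min_seconds:
            flagged.add(key)
    return flagged


flagged = suspected_prewarm_sessions(EVENTS)
all_sessions = {(user, session) for user, session, _, _ in EVENTS}
kept = all_sessions - flagged
print(len(all_sessions), len(kept), flagged)
```

The flagged keys could be written to a separate table, or the same condition could be applied as an exclusion before counting distinct sessions.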

Google analytics Users calculation

I am trying to mimic a chart in GA.
I have noticed that when I don't add date to my code the numbers match, but when I add date the numbers seem to double up.
Code:
SELECT
date,
COUNT(DISTINCT fullVisitorId) AS Users,
-- New Users (metric)
COUNT(DISTINCT(
CASE
WHEN totals.newVisits = 1 THEN fullVisitorId
ELSE
NULL
END)) AS New_Users,
-- Sessions (metric)
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS Sessions,
-- Bounces (metric)
COUNT(DISTINCT
CASE
WHEN totals.bounces = 1 THEN CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))
ELSE
NULL
END
) AS Bounces,
-- Transactions (metric)
COUNT(DISTINCT hits.transaction.transactionId) AS Transactions,
--Revenue (metric)
SUM(hits.transaction.transactionRevenue)/1000000 AS Revenue
FROM
`ABC-ca-web.123.ga_sessions_*`, Unnest(hits) hits
WHERE trafficSource.campaign LIKE '%ABC%' and date between '20200801' AND '20200831'
This also happens if you count users by date in GA, which is the usual way to query it.
You cannot sum users across periods. For example, if user X visited the site every day for a week, the number of users over the whole period is 1. Analyzed day by day, it is 1 on the first day, 1 on the second day, 1 on the third day, and so on, because the same user was there every day. Adding up the daily counts gives 7 users, but in reality there is 1 user.
I suppose that when you add "date" you also add a GROUP BY "date", without which the query would error out.
When you take the distinct count of users per day, a user can be included in several daily groups. When you drop the "date" field, that user is counted only once.
That's why you see double, but the sum of the daily counts could be any number equal to or greater than the count without "date".
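A tiny Python sketch on made-up data shows why daily distinct counts do not add up to the distinct count for the whole period. The (date, user) pairs are invented for the example.

```python
# (date, fullVisitorId) pairs: user X visits on three days, user Y on one.
VISITS = [
    ("20200801", "X"),
    ("20200802", "X"),
    ("20200803", "X"),
    ("20200801", "Y"),
]

# Distinct users over the whole period (no date in the query).
users_total = len({user for _, user in VISITS})

# Distinct users per day (date added, grouped by date).
users_by_day = {}
for date, user in VISITS:
    users_by_day.setdefault(date, set()).add(user)
users_per_day = {date: len(users) for date, users in users_by_day.items()}

# Summing the daily figures counts user X once per day visited.
sum_of_daily_users = sum(users_per_day.values())

print(users_total, users_per_day, sum_of_daily_users)
```

Here the period total is 2 users, while the daily counts add up to 4 because user X is counted on each of the three days.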

GA Funnel Analysis in Big Query - "Correct" Session Counts

Thank you for stopping by! I would be grateful to (re)create the ultimate GA Session Funnel in Big Query. The focus is on the funnel per session, with certain, but not necessarily sequentially visited pages during one session.
The solution should count sessions as COUNT( DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))).
Further, the funnel should be of the form that every funnel step can only be reached if the previous step has been completed within a session (e.g. the fourth step should only be counted if steps 1 - 3 have been visited during the session). However, the steps do not need to be performed consecutively.
That is, unfortunately, why this example, which I like a lot, would not work for me: it counts sessions based on totals.visits. Also, I need to use REGEXP_CONTAINS for the pages, as I do not have events (or custom dimensions) on my pages for the funnel steps. For the original query (for every respective step)
SUM((SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page
I tried:
COUNT( DISTINCT( SELECT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) FROM UNNEST(GA.hits) WHERE REGEXP_CONTAINS(hits.page.pagePath, r"myfunnelpage")))
However, my funnel step visits are actually more than my total “sessions” as per COUNT( DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS overday_sessions.
Another example looks at user sessions (I am incredibly impressed, also absolutely intimidated, props to #Martin)
Allegedly, there is a website that has it all, but it was down when I wrote this #StuffGettingLostOnline
My approach would look something like this. But it returns only sessions with single page views, not sequential ones:
SELECT
date,
COUNT( DISTINCT( SELECT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) FROM UNNEST(GA.hits) WHERE REGEXP_CONTAINS(hits.page.pagePath, r"productoverviewpage") LIMIT 1)) AS product_overview_s1,
COUNT( DISTINCT( SELECT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) FROM UNNEST(GA.hits) WHERE EXISTS(SELECT 1 FROM UNNEST(GA.hits) WHERE REGEXP_CONTAINS(hits.page.pagePath, r"productoverviewregex")) AND REGEXP_CONTAINS(hits.page.pagePath, r"cartoverviewregex") LIMIT 1)) AS cart_overview_s2
FROM
data as GA,
UNNEST(GA.hits) AS hits
WHERE hits.type = "PAGE"
AND
TRUE IN UNNEST(
[REGEXP_CONTAINS(hits.page.pagePath, r"productoverviewpage"),
REGEXP_CONTAINS(hits.page.pagePath, r"cartoverviewregex")]
)
Any ideas? Anyone able to recreate the ultimate big query funnel using the “correct” session count?
You can use inline subqueries to check for the individual steps of the funnel:
WITH
sessions AS (
SELECT
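(
-- Earliest hit on the first funnel page; LIMIT 1 keeps the subquery scalar
SELECT
hits
FROM
UNNEST(hits) hits
WHERE
hits.page.pagePath = "/"
ORDER BY
hits.hitNumber
LIMIT 1
) first_step,
(
-- Latest hit on the second funnel page, so any basket view after step one counts
SELECT
hits
FROM
UNNEST(hits) hits
WHERE
hits.page.pagePath = "/basket"
ORDER BY
hits.hitNumber DESC
LIMIT 1
) second_step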
FROM
`project.dataset.ga_sessions_*`)
SELECT
COUNT(first_step) sessions_step_one,
COUNTIF(first_step.hitNumber < second_step.hitNumber) sessions_step_two
FROM
sessions
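The funnel rule from the question (a step only counts if all previous steps were reached earlier in the same session, though not necessarily consecutively) can be sketched for any number of steps. The following Python illustration uses made-up sessions and patterns; `funnel_depth`, `STEP_PATTERNS` and the page paths are hypothetical names, and each list is assumed to hold the page paths of one session ordered by hitNumber.

```python
import re

# Hypothetical regexes, one per funnel step, in funnel order.
STEP_PATTERNS = [r"productoverview", r"cart", r"checkout"]

# Page paths per session, ordered by hitNumber (invented data).
SESSIONS = {
    "s1": ["/", "/productoverview", "/blog", "/cart"],
    "s2": ["/cart", "/productoverview"],  # cart seen before step one: only step 1 counts
    "s3": ["/blog"],
    "s4": ["/productoverview", "/cart", "/checkout"],
}


def funnel_depth(page_paths, step_patterns):
    """Number of funnel steps completed in order, not necessarily consecutively."""
    depth = 0
    for path in page_paths:
        # Only the next open step can be completed by this hit.
        if depth < len(step_patterns) and re.search(step_patterns[depth], path):
            depth += 1
    return depth


depths = {sid: funnel_depth(paths, STEP_PATTERNS) for sid, paths in SESSIONS.items()}

# Sessions that reached at least step i, for every step of the funnel.
step_counts = [
    sum(1 for depth in depths.values() if depth >= step)
    for step in range(1, len(STEP_PATTERNS) + 1)
]
print(depths, step_counts)
```

Each session contributes at most once per step, so no step count can exceed the total number of sessions.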

Replicating Google Analytics All Sessions for a Custom Dimension in BigQuery

I am trying to replicate sessions by a custom dimension in BigQuery to the Google Analytics AII. I am only a few sessions off and I can't figure out how to get an exact match.
My current understanding is that GA breaks sessions at midnight (because its data model relies on processing in day chunks). I tried to take this into account with the code below, but something is not quite right. Does anyone know how to get an exact match?
SELECT
CD12,
SUM(sessions) AS sessions
FROM (
SELECT
CD12,
CASE WHEN hitNumber = first_hit THEN visits ELSE 0 END AS sessions
FROM (
SELECT
fullVisitorId,
visitStartTime,
totals.visits,
hits.hitNumber,
CASE WHEN cd.index = 12 THEN cd.value END AS CD12,
MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
FROM `data-....`,
UNNEST(hits) AS hits,
UNNEST(hits.customDimensions) AS cd
)
)
WHERE CD12 ='0'
GROUP BY
CD12
ORDER BY
sessions DESC

Big Query and Google Analytics UI do not match when ecommerce action filter applied

We are validating a query in Big Query, and cannot get the results to match the Google Analytics UI. A similar question can be found here, but in our case the mismatch only occurs when we apply a specific filter on ecommerce_action.action_type.
Here is the query:
SELECT COUNT(distinct fullVisitorId+cast(visitid as string)) AS sessions
FROM (
SELECT
device.browserVersion,
geoNetwork.networkLocation,
geoNetwork.networkDomain,
geoNetwork.city,
geoNetwork.country,
geoNetwork.continent,
geoNetwork.region,
device.browserSize,
visitNumber,
trafficSource.source,
trafficSource.medium,
fullvisitorId,
visitId,
device.screenResolution,
device.flashVersion,
device.operatingSystem,
device.browser,
totals.pageviews,
channelGrouping,
totals.transactionRevenue,
totals.timeOnSite,
totals.newVisits,
totals.visits,
date,
hits.eCommerceAction.action_type
FROM
(select *
from TABLE_DATE_RANGE([zzzzzzzzz.ga_sessions_],
<range>) ))t
WHERE
hits.eCommerceAction.action_type = '2' and <stuff to remove bots>
)
From the UI using the built in shopping behavior report, we get 3.836M unique sessions with a product detail view, compared with 3.684M unique sessions in Big Query using the query above.
A few questions:
1) We are under the impression the shopping behavior report "Sessions with Product View" breakdown is based off of the ecommerce_action.actiontype filter. Is that true?
2) Is there a .totals pre-aggregated table that the UI may be pulling from?
It sounds like the issue is that COUNT(DISTINCT ...) is approximate when using legacy SQL, as noted in the migration guide, so the counts are not accurate. Either use standard SQL instead (preferred) or use EXACT_COUNT_DISTINCT with legacy SQL.
You're including product list views in your query.
As described in https://support.google.com/analytics/answer/3437719 you need to make sure that no product has isImpression = TRUE, because that would mean it is a product list view.
This query sums all sessions which contain any action_type='2' hit for which all isImpression values are null or false:
SELECT
SUM(totals.visits) AS sessions
FROM
`project.123456789.ga_sessions_20180101` AS t
WHERE
(
SELECT
LOGICAL_OR(h.ecommerceaction.action_type='2')
FROM
t.hits AS h
WHERE
(SELECT LOGICAL_AND(isimpression IS NULL OR isimpression = FALSE) FROM h.product))
For legacySQL you can adapt the example in the documentation.
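To make the filter logic easier to follow, here is a Python sketch of the same idea on invented sessions. The dictionaries, the `visits` key (standing in for totals.visits) and the helper names are assumptions for illustration. As I read the SQL, a hit with no products at all is also excluded there, because LOGICAL_AND over zero rows yields NULL, so the sketch requires at least one product.

```python
SESSIONS = [
    # Product detail view: action_type '2' and the product is not an impression.
    {"visits": 1, "hits": [{"action_type": "2", "products": [{"isImpression": None}]}]},
    # Product list view: the product is flagged as an impression.
    {"visits": 1, "hits": [{"action_type": "2", "products": [{"isImpression": True}]}]},
    # Detail view, but totals.visits is NULL, so it adds nothing to the sum.
    {"visits": None, "hits": [{"action_type": "2", "products": [{"isImpression": False}]}]},
    # No detail-view action at all.
    {"visits": 1, "hits": [{"action_type": "1", "products": []}]},
]


def is_detail_view_hit(hit):
    """True for an action_type '2' hit whose products are all non-impressions."""
    products = hit["products"]
    return (
        hit["action_type"] == "2"
        and len(products) > 0  # assumption: mirrors NULL from LOGICAL_AND on no rows
        and all(not product["isImpression"] for product in products)
    )


def sessions_with_product_view(sessions):
    """Sum visits over sessions that contain at least one detail-view hit."""
    return sum(
        session["visits"] or 0  # SUM skips NULL; modelled here as 0
        for session in sessions
        if any(is_detail_view_hit(hit) for hit in session["hits"])
    )


total = sessions_with_product_view(SESSIONS)
print(total)
```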
In addition to the fact that COUNT(DISTINCT ...) is approximate when using legacy SQL, there could be sessions in which there are only non-interactive hits. These are not counted as sessions in the Google Analytics UI, but they are counted by both COUNT(DISTINCT ...) and EXACT_COUNT_DISTINCT(...) because your query counts visit IDs.
Using SUM(totals.visits) you should get the same result as in the UI because SUM does not take into account NULL values of totals.visits (corresponding to sessions in which there are only non-interactive hits).
