Thank you for stopping by! I would be grateful for help (re)creating the ultimate GA session funnel in BigQuery. The focus is on the funnel per session: certain pages must be visited during one session, but not necessarily consecutively.
The solution should count sessions as COUNT( DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))).
Further, every funnel step should only count as reached if the previous steps have been completed within the same session (e.g. the fourth step should only be counted if steps 1-3 have been visited during the session). However, the steps do not need to be performed consecutively.
That is, unfortunately, why this example, which I like a lot, would not work for me: it returns visit counts based on totals.visits. Also, I need to use REGEXP_CONTAINS for the pages, as I do not have events (or custom dimensions) on my pages for the funnel steps. Instead of the original expression (one for each step):
SUM((SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page
I tried:
COUNT(DISTINCT (SELECT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) FROM UNNEST(GA.hits) WHERE REGEXP_CONTAINS(hits.page.pagePath, r"myfunnelpage")))
However, my funnel step visits are actually more than my total “sessions” as per COUNT( DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS overday_sessions.
Another example looks at user sessions (I am incredibly impressed, also absolutely intimidated, props to #Martin)
Allegedly there is a website that has it all, but it was down when I wrote this. #StuffGettingLostOnline
My approach would look something like this. But it returns only sessions with single page views, not sequential ones:
SELECT
  date,
  COUNT(DISTINCT (SELECT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) FROM UNNEST(GA.hits) WHERE REGEXP_CONTAINS(hits.page.pagePath, r"productoverviewpage") LIMIT 1)) AS product_overview_s1,
  COUNT(DISTINCT (SELECT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) FROM UNNEST(GA.hits) WHERE EXISTS(SELECT 1 FROM UNNEST(GA.hits) WHERE REGEXP_CONTAINS(hits.page.pagePath, r"productoverviewregex")) AND REGEXP_CONTAINS(hits.page.pagePath, r"cartoverviewregex") LIMIT 1)) AS cart_overview_s2
FROM
  data AS GA,
  UNNEST(GA.hits) AS hits
WHERE hits.type = "PAGE"
  AND TRUE IN UNNEST(
    [REGEXP_CONTAINS(hits.page.pagePath, r"productoverviewpage"),
     REGEXP_CONTAINS(hits.page.pagePath, r"cartoverviewregex")]
  )
GROUP BY
  date
Any ideas? Anyone able to recreate the ultimate big query funnel using the “correct” session count?
You can use inline subqueries to check for the individual steps of the funnel:
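WITH
sessions AS (
SELECT
(
SELECT
hits
FROM
UNNEST(hits) hits
WHERE
hits.page.pagePath = "/"
-- keep only the first matching hit: a scalar subquery must return at most one row
ORDER BY
hits.hitNumber
LIMIT 1
) first_step,
(
SELECT
hits
FROM
UNNEST(hits) hits
WHERE
hits.page.pagePath = "/basket"
-- same here: first matching hit only
ORDER BY
hits.hitNumber
LIMIT 1
) second_step
FROM
`project.dataset.ga_sessions_*`)
SELECT
COUNT(first_step) sessions_step_one,
COUNTIF(first_step.hitNumber < second_step.hitNumber) sessions_step_two
FROM
sessions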
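Building on the same idea, here is a minimal sketch that uses REGEXP_CONTAINS and the session key from the question. It assumes the public sample table bigquery-public-data.google_analytics_sample.ga_sessions_20170801 and two placeholder patterns (r'^/home' and r'^/basket'), so swap in your own table and regexes. Step 2 only counts if it happens after the first hit of step 1. Because hits is never unnested in the outer FROM clause, each session contributes exactly one row, so a funnel step should not exceed the session total.

```sql
#standardSQL
WITH step_one AS (
  SELECT
    date,
    CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) AS session_id,
    hits,
    -- hitNumber of the first pageview matching step 1 (NULL if never reached)
    (SELECT MIN(h.hitNumber)
     FROM UNNEST(hits) AS h
     WHERE h.type = 'PAGE'
       AND REGEXP_CONTAINS(h.page.pagePath, r'^/home')) AS step_one_hit  -- placeholder pattern
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
),
step_two AS (
  SELECT
    date,
    session_id,
    step_one_hit,
    -- first pageview matching step 2 that comes after step 1;
    -- stays NULL when step 1 was never reached
    (SELECT MIN(h.hitNumber)
     FROM UNNEST(hits) AS h
     WHERE h.type = 'PAGE'
       AND h.hitNumber > step_one_hit
       AND REGEXP_CONTAINS(h.page.pagePath, r'^/basket')) AS step_two_hit  -- placeholder pattern
  FROM step_one
)
SELECT
  date,
  COUNT(DISTINCT session_id) AS sessions,
  COUNT(DISTINCT IF(step_one_hit IS NOT NULL, session_id, NULL)) AS sessions_step_one,
  COUNT(DISTINCT IF(step_two_hit IS NOT NULL, session_id, NULL)) AS sessions_step_two
FROM step_two
GROUP BY date
ORDER BY date
```

Further steps can be chained the same way: add one more CTE per step that looks for the first matching hit after the previous step's hitNumber.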
Related
I am trying to mimic this chart in GA:
But I have noticed that when I don't add date to my query the numbers match, but when I add date the numbers seem to double.
Code:
SELECT
date,
COUNT(DISTINCT fullVisitorId) AS Users,
-- New Users (metric)
COUNT(DISTINCT(
CASE
WHEN totals.newVisits = 1 THEN fullVisitorId
ELSE
NULL
END)) AS New_Users,
-- Sessions (metric)
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS Sessions,
-- Bounces (metric)
COUNT(DISTINCT
CASE
WHEN totals.bounces = 1 THEN CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))
ELSE
NULL
END
) AS Bounces,
-- Transactions (metric)
COUNT(DISTINCT hits.transaction.transactionId) AS Transactions,
--Revenue (metric)
SUM(hits.transaction.transactionRevenue)/1000000 AS Revenue
FROM
`ABC-ca-web.123.ga_sessions_*`, Unnest(hits) hits
WHERE trafficSource.campaign LIKE '%ABC%' and date between '20200801' AND '20200831'
This also happens if you count users by date in GA, which is the usual query operation.
You cannot sum users from different periods. For example, if user X has visited the site every day in a week, analyzing the entire period the number of users is 1, but if you analyze it day by day it is 1 on the first day, 1 on the second day, 1 the third day, etc ... because the same user was there every day. If you count the users by days the result is that you have 7 users, but in reality you have 1 user because it is the same user.
I suppose that when you add "date" to the SELECT list you also add a GROUP BY "date", without which the query would fail.
When you take the distinct count of users per day, a user can be counted in several daily groups. When you drop the "date" field, that user is counted only once.
That's why the numbers look doubled, but the sum of the daily counts can be any number equal to or greater than the count without "date".
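To see the effect directly, here is a small sketch that compares the sum of daily distinct users with the distinct users over the whole period. It assumes the public sample dataset bigquery-public-data.google_analytics_sample and an arbitrary date range (July 2017), neither of which comes from the question; replace them with your own table and dates.

```sql
#standardSQL
WITH daily AS (
  -- distinct users counted separately for every day
  SELECT date, COUNT(DISTINCT fullVisitorId) AS users
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731'
  GROUP BY date
),
period AS (
  -- distinct users counted once over the whole range
  SELECT COUNT(DISTINCT fullVisitorId) AS users
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731'
)
SELECT
  (SELECT SUM(users) FROM daily) AS sum_of_daily_users,
  (SELECT users FROM period) AS users_whole_period
```

The first number can never be smaller than the second, because a returning user is counted once per day on which they were active.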
I have started exploring BigQuery and I am wondering: is it possible to combine, in BigQuery or GA, the number of unique users with the pages they have seen?
I want to see how many unique visitors (Y) viewed one or more given pages and, of these, what percentage (Z%) also viewed other pages (W).
I used the query below to get the Y unique visitors who viewed certain pages, but I am not able to see the percentage who also viewed the W pages.
#standardSQL
SELECT
hits.page.pagePath AS other_seen_pages,
COUNT(hits.page.pagePath) AS number_other_seen_pages
FROM `project.dataset.session`,UNNEST(hits) AS hits
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM `project.dataset.session`,UNNEST(hits) AS hits
WHERE hits.page.pagePath LIKE '%x_page%'
GROUP BY fullVisitorId )
AND hits.page.pagePath IS NOT NULL
AND hits.page.pagePath NOT LIKE '%x_page%'
GROUP BY other_seen_pages
ORDER BY number_other_seen_pages DESC;
I understand that you would like a query that shows, on top of the other pages that the same visitors visited, the number of visitors (from the same subset of visitors) that visited each of them, and that number as a percentage of the total number of users.
Here is some code that worked for me with the bigquery-public-data.google_analytics_sample.ga_sessions_20170801 Google Analytics public table and the '/google+redesign/electronics' pagePath.
It:
creates a table with the total number of distinct users in the table
creates a table like yours, with an additional field for the number of distinct users that visited both your page and the page of the row
selects the desired fields from these two tables and computes the percentage
WITH
t_total_users as (select count(DISTINCT fullVisitorId) as total_users from `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`),
t_other_pages as (SELECT
hits.page.pagePath AS other_seen_pages,
COUNT(hits.page.pagePath) AS number_other_seen_pages,
COUNT(DISTINCT fullvisitorID) as visitors_per_page
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`, UNNEST(hits) AS hits
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,UNNEST(hits) AS hits
WHERE hits.page.pagePath LIKE '/google+redesign/electronics'
GROUP BY fullVisitorId )
AND hits.page.pagePath IS NOT NULL
AND hits.page.pagePath NOT LIKE '/google+redesign/electronics'
GROUP BY other_seen_pages
ORDER BY number_other_seen_pages DESC)
SELECT
t_other_pages.other_seen_pages,
t_other_pages.number_other_seen_pages,
t_other_pages.visitors_per_page,
t_total_users.total_users,
(t_other_pages.visitors_per_page/t_total_users.total_users)*100 as percentage_visitants
FROM t_total_users, t_other_pages
If I misunderstood something about the goal of the query, please let me know!
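If the percentage should instead be relative to the visitors who saw the target page (the "Y" in the question) rather than to all users in the table, a variant could look like the sketch below. This is one possible reading of the question, not a confirmed requirement. It reuses the same public table and pagePath, and the CTE and column names are made up for illustration.

```sql
#standardSQL
WITH target_users AS (
  -- the Y visitors: everyone who saw the target page at least once
  SELECT DISTINCT fullVisitorId
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
    UNNEST(hits) AS h
  WHERE h.page.pagePath = '/google+redesign/electronics'
),
other_pages AS (
  -- for every other page, how many of those visitors also saw it
  SELECT
    h.page.pagePath AS other_seen_page,
    COUNT(DISTINCT s.fullVisitorId) AS visitors
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS s
  CROSS JOIN UNNEST(s.hits) AS h
  WHERE s.fullVisitorId IN (SELECT fullVisitorId FROM target_users)
    AND h.page.pagePath IS NOT NULL
    AND h.page.pagePath != '/google+redesign/electronics'
  GROUP BY other_seen_page
)
SELECT
  other_seen_page,
  visitors,
  (SELECT COUNT(*) FROM target_users) AS target_visitors,
  ROUND(100 * visitors / (SELECT COUNT(*) FROM target_users), 2) AS pct_of_target_visitors
FROM other_pages
ORDER BY visitors DESC
```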
I am trying to replicate sessions by a custom dimension in BigQuery to the Google Analytics AII. I am only a few sessions off and I can't figure out how to get an exact match.
My current understanding is that GA breaks sessions at midnight (because its data model relies on processing in day chunks). I tried to take this into account with the code below, but something is not quite right. Does anyone know how to get an exact match?
SELECT
CD12,
SUM(sessions) AS sessions
FROM (
SELECT
CD12,
CASE WHEN hitNumber = first_hit THEN visits ELSE 0 END AS sessions
FROM (
SELECT
fullVisitorId,
visitStartTime,
totals.visits,
hits.hitNumber,
CASE WHEN cd.index = 12 THEN cd.value END AS CD12,
MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
FROM `data-....`,
UNNEST(hits) AS hits,
UNNEST(hits.customDimensions) AS cd
)
)
WHERE CD12 ='0'
GROUP BY
CD12
ORDER BY
sessions DESC
I'm a bit surprised: how come successive sessions from the same user have visitNumber == 1 (it happens with more than one user)? Doesn't visitNumber (the session number for the user) increment with each successive session?
See the attached screenshot, please.
SELECT fullvisitorid, visitid, date, visitNumber, hitNumber, type, page.pagePath, isInteraction
FROM `122623284.ga_sessions_2017*` ga_sessions, unnest(hits) as ht
WHERE _TABLE_SUFFIX between '0101' and '0731'
AND fullvisitorid in ('3635735417215222540', '4036640811518552822', '800892955541145796')
ORDER BY fullvisitorid, visitid, hitnumber
Thanks in advance if anyone has any idea under what scenarios this can happen.
cheers!
UPDATE (after #WillianFuks response)
It's still the same, after re-running the query that #WillianFuks suggested,
The observation here is the stark date difference between the successive visits :
188 days (red)
210 days (green)
184 days (blue)
Analytics looks back for the user's last session in order to increment the visitNumber count, but there is a limit on the number of days it looks back, called the lookback window. I don't remember the exact value for Analytics, but the lookback window generally ranges from 90 to 180 days across Google products.
Since it cannot find a previous visit within the lookback window, it resets visitNumber to 1 again.
Update: By default it is 6 months for Google Analytics.
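One way to check this explanation against your own data is to list every session with visitNumber = 1 that is not the user's first session in the export, together with the gap to the previous session. The sketch below assumes the public sample dataset bigquery-public-data.google_analytics_sample and its full date range; the column aliases are made up for illustration.

```sql
#standardSQL
WITH ordered AS (
  SELECT
    fullVisitorId,
    visitNumber,
    TIMESTAMP_SECONDS(visitStartTime) AS started_at,
    -- start of the same visitor's previous session in the export (NULL for the first one)
    LAG(TIMESTAMP_SECONDS(visitStartTime)) OVER (
      PARTITION BY fullVisitorId ORDER BY visitStartTime) AS previous_started_at
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170801'
)
SELECT
  fullVisitorId,
  visitNumber,
  started_at,
  TIMESTAMP_DIFF(started_at, previous_started_at, DAY) AS days_since_previous_session
FROM ordered
WHERE visitNumber = 1
  AND previous_started_at IS NOT NULL
ORDER BY days_since_previous_session DESC
LIMIT 100
```

If the lookback explanation holds, the largest gaps should sit around or above the window length. Rows with very small gaps may have other causes, such as a session that was split at midnight.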
As Elliott suggested in his comment, the problem is most likely due to the duplication that happens when you apply UNNEST to the hits field.
You can confirm that by running this query:
SELECT
fullvisitorid fv,
visitid,
date,
visitNumber,
ARRAY(SELECT AS STRUCT hitNumber, type, page.pagePath AS pagePath, isInteraction FROM UNNEST(hits)) data
FROM `122623284.ga_sessions_2017*`
WHERE _TABLE_SUFFIX between '0101' and '0731'
AND fullvisitorid in ('3635735417215222540', '4036640811518552822', '800892955541145796')
LIMIT 1000
This will bring the fields inside hits without making the cross product (unnest operation) with the outer fields.
I'm just learning BigQuery so this might be a dumb question, but we want to get some statistics there and one of those is the total sessions in a given day.
To do so, I've queried in BQ:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
group each by fullvisitorid
)
(I'm using the table_query because later on we might increase the range of days)
This results in 1,075,137.
But in our Google Analytics Reports, in the "Audience Overview" section, the same day results:
This report is based on 1,026,641 sessions (100% of sessions).
There's always a difference of roughly 5%, regardless of the day. So I'm wondering, even though the query is quite simple, is there any mistake we've made?
Is this difference expected to happen? I read through BigQuery's documentation but couldn't find anything on this issue.
Thanks in advance,
standardsql
Simply SUM(totals.visits) or when using COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING) )) make sure totals.visits=1!
If you use visitId and you are not grouping per day, you will combine midnight-split-sessions!
Here are all scenarios:
SELECT
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING) )) allSessionsUniquePerDay,
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING) )) allSessionsUniquePerSelectedTimeframe,
sum(totals.visits) interactiveSessionsUniquePerDay, -- equals GA UI sessions
COUNT(DISTINCT IF(totals.visits=1, CONCAT(fullVisitorId, CAST(visitId AS STRING)), NULL) ) interactiveSessionsUniquePerSelectedTimeframe,
SUM(IF(totals.visits=1,0,1)) nonInteractiveSessions
FROM
`project.dataset.ga_sessions_2017102*`
Wrap up:
fullVisitorId + visitId: useful to reconnect midnight-splits
fullVisitorId + visitStartTime: useful to take splits into account
totals.visits=1 for interaction sessions
fullVisitorId + visitStartTime where totals.visits=1: GA UI sessions (in case you need a session id)
SUM(totals.visits): simple GA UI sessions
fullVisitorId + visitId where totals.visits=1 and GROUP BY date: GA UI sessions with too many chances for errors and misunderstandings
After posting the question we got in contact with Google support and found that Google Analytics only counts sessions in which an "event" was fired.
In BigQuery you will find all sessions, regardless of whether they had an interaction or not.
To get the same result as in GA, you should filter your BQ query to sessions with totals.visits = 1 (totals.visits is 1 only for sessions in which an event was fired).
That is:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
where totals.visits = 1
group each by fullvisitorid
)
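For readers on Standard SQL, a rough translation of the legacy query above could look like the sketch below. It assumes the public sample dataset and a single day (20170801) instead of the original table, and it replaces TABLE_QUERY with a _TABLE_SUFFIX filter so the date range can be widened later.

```sql
#standardSQL
SELECT SUM(sessions) AS total_sessions
FROM (
  SELECT
    fullVisitorId,
    COUNT(DISTINCT visitId) AS sessions
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20170801' AND '20170801'
    AND totals.visits = 1  -- keep only sessions that had an interaction
  GROUP BY fullVisitorId
)
```

Keep in mind that once the range spans more than one day, counting distinct visitId per visitor merges sessions that were split at midnight, so the result can drift from the GA UI.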
The problem could be due to "COUNT DISTINCT".
According to this post:
COUNT DISTINCT is a statistical approximation for all results greater than 1000
You could try setting an additional COUNT parameter to improve accuracy at the expense of performance (see post), but I would first try:
SELECT COUNT( CONCAT( fullvisitorid,'_', STRING(visitid))) as sessions
from (table_query([40663402], 'timestamp(right(table_id,8)) between
timestamp("20150519") and timestamp("20150519")'))
What worked for me was this:
SELECT count(distinct sessionId) FROM(
SELECT CONCAT(clientId, "-", visitNumber, "-", date) as sessionId FROM `project-id.dataset-id.ga_sessions_*`
WHERE _table_suffix BETWEEN "20191001" AND "20191031" AND totals.visits = 1)
The explanation (very well written up in this article: https://adswerve.com/blog/google-analytics-bigquery-tips-users-sessions-part-one/) is that we should be careful when counting and dealing with sessions because, by default, Google Analytics breaks sessions that carry over midnight (in the time zone of the view). The same session can therefore end up in two daily tables.
The code provided creates a sessionID by combining:
client id + visit number + date
while acknowledging the session break; the result will be in a human-readable format. Finally to match sessions in the Google Analytics UI, make sure to filter to only those with totals.visits = 1.