GA BigQuery Export - COUNT(DISTINCT(fullVisitorId)) with source/medium overcounting

I'm having an issue calculating unique users in our GA BigQuery export. I've reproduced the same error using the sample data.
SELECT sum(users) as users, sum(sessions) as sessions FROM (
SELECT
h.page.pagePath as page_path,
trafficSource.source,
trafficSource.medium,
COUNT(DISTINCT(fullVisitorId)) AS users,
COUNT(*) as sessions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170101`, UNNEST(hits) h
WHERE h.page.pagePath = "/home"
GROUP BY page_path, source, medium
)
UNION ALL
SELECT sum(users) as users, sum(sessions) as sessions FROM (
SELECT
h.page.pagePath as page_path,
COUNT(DISTINCT(fullVisitorId)) AS users,
COUNT(*) as sessions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170101`, UNNEST(hits) h
WHERE h.page.pagePath = "/home"
GROUP BY page_path
)
When I include the source and medium columns, the distinct fullVisitorId count is 10 higher than without them. How does including these columns cause an increased number of fullVisitorIds? This doesn't make sense to me.
What's causing this and how would I get an accurate count?

How does including these columns cause an increased number of fullVisitorIds? This doesn't make sense to me.
You can see why if you run your inner query like this:
SELECT
MAX(fullVisitorId) AS fullVisitorId,
h.page.pagePath as page_path,
trafficSource.source,
trafficSource.medium,
COUNT(DISTINCT(TRIM(fullVisitorId))) AS users,
COUNT(*) as sessions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170101`, UNNEST(hits) h
WHERE h.page.pagePath = "/home"
and fullVisitorId = '9902321252073939460'
GROUP BY page_path, source, medium
This returns two rows for that single visitor, one per source/medium combination.
Because the user arrives from two different source/medium pairs, the same user is counted once in each group, and summing the groups counts them twice. That is what causes the increase.
One option is to apply an aggregate function to source and medium and remove them from the GROUP BY, like this:
SELECT sum(users) as users, sum(sessions) as sessions FROM (
SELECT
h.page.pagePath as page_path,
MAX(trafficSource.source) as source,
MAX(trafficSource.medium) as medium,
COUNT(DISTINCT(TRIM(fullVisitorId))) AS users,
COUNT(*) as sessions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170101`, UNNEST(hits) h
WHERE h.page.pagePath = "/home"
GROUP BY page_path
)
UNION ALL
SELECT sum(users) as users, sum(sessions) as sessions FROM (
SELECT
h.page.pagePath as page_path,
COUNT(DISTINCT(TRIM(fullVisitorId))) AS users,
COUNT(*) as sessions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170101`, UNNEST(hits) h
WHERE h.page.pagePath = "/home"
GROUP BY page_path
)
Now both queries return the same number of users.
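The double counting can be reproduced on a toy table. The following is a minimal sketch using Python's built-in sqlite3 module rather than BigQuery; the `hits` table and its three rows are made up for illustration and only mimic the shape of the unnested export.

```python
import sqlite3

# Hypothetical stand-in for the unnested GA export: one row per hit on "/home".
# Visitor "A" arrives via two different source/medium pairs.
rows = [
    ("A", "google", "organic"),
    ("A", "(direct)", "(none)"),
    ("B", "google", "organic"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (fullVisitorId TEXT, source TEXT, medium TEXT)")
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)

# Distinct users per source/medium group, then summed:
# visitor A is distinct within each group, so A is counted in both.
summed_per_group = conn.execute("""
    SELECT SUM(users) FROM (
        SELECT COUNT(DISTINCT fullVisitorId) AS users
        FROM hits
        GROUP BY source, medium
    )
""").fetchone()[0]

# Distinct users over all rows: each visitor is counted once.
overall = conn.execute(
    "SELECT COUNT(DISTINCT fullVisitorId) FROM hits"
).fetchone()[0]

print(summed_per_group, overall)  # 3 2
```

Distinct counts are not additive across groups, so a per-group COUNT(DISTINCT ...) should only be summed when no user can appear in more than one group.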

Related

GA in bigquery does not show the correct result

I'm trying to pull GA data by product in BigQuery with the query below, but it returns no results.
Does anyone know why?
SELECT
product.productSKU,
COUNT(DISTINCT fullVisitorId) AS unique_user,
COUNT(fullVisitorId) AS page_view,
COUNT(DISTINCT visitId) AS unique_session
FROM
`.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
hits.type = 'PAGE'
GROUP BY
product.productSKU

How to query Direct returning visitor in BigQuery

I am trying to use BigQuery to figure out how many users returned as Direct users after first visiting the website as Organic.
This is what I have so far. To get the number of users who came back as Direct after visiting as Organic, I used
organic_user.visitNumber < direct_user.visitNumber
in the WHERE clause.
SELECT
organic_user.date,
COUNT (DISTINCT direct_user.fullVisitorId) AS return_direct_user
FROM
(
SELECT
date,
fullVisitorId,
visitNumber
FROM
`ga_sessions_*`,
UNNEST(hits) as hits
WHERE
DATE BETWEEN '20190814'
AND '20190911'
AND channelGrouping = 'Organic Search'
) AS organic_user
INNER JOIN (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`ga_sessions_*`,
UNNEST(hits) as hits
WHERE
DATE BETWEEN '20190814'
AND '20190911'
AND channelGrouping = 'Direct'
) AS direct_user ON organic_user.fullVisitorId = direct_user.fullVisitorId
WHERE
organic_user.visitNumber < direct_user.visitNumber
GROUP BY
date
ORDER BY
date ASC
Could anyone verify this query is correct?
If not, could you provide a solution for this?
With all the clarifications you provided in the comments, I was able to come up with some adaptations of your original query:
SELECT
direct_user.date,
COUNT (DISTINCT direct_user.fullVisitorId) AS return_direct_user
FROM (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`bigquery-public-data`.google_analytics_sample.`ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
DATE BETWEEN '20161214'
AND '20180911'
AND channelGrouping = 'Organic Search' ) AS organic_user
INNER JOIN (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`bigquery-public-data`.google_analytics_sample.`ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
DATE BETWEEN '20161214'
AND '20180911'
AND channelGrouping = 'Direct' ) AS direct_user
ON
organic_user.fullVisitorId = direct_user.fullVisitorId
AND organic_user.visitNumber < direct_user.visitNumber
GROUP BY
direct_user.date
ORDER BY
direct_user.date ASC
Here are some considerations about the changes I made:
It is important to qualify which subquery's date is used in the GROUP BY. Since we are counting 'Direct' visits per day, it makes sense to count them on the day they happen, so I group by direct_user.date.
I moved the organic_user.visitNumber < direct_user.visitNumber condition into the JOIN clause. For an INNER JOIN it makes no technical difference, but semantically I think it belongs there.
I hope this helps.
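To check the join logic on a small case, here is a rough sketch using Python's built-in sqlite3 module instead of BigQuery. The `sessions` table, its rows, and the visitor names U1 to U3 are invented for illustration; only the join condition mirrors the query above.

```python
import sqlite3

# Hypothetical session-level rows: (date, fullVisitorId, visitNumber, channelGrouping)
sessions = [
    ("20190814", "U1", 1, "Organic Search"),
    ("20190815", "U1", 2, "Direct"),          # U1 returns as Direct after Organic
    ("20190814", "U2", 1, "Direct"),
    ("20190815", "U2", 2, "Organic Search"),  # U2 was Direct first, so not counted
    ("20190814", "U3", 1, "Organic Search"),
    ("20190815", "U3", 2, "Direct"),
    ("20190816", "U3", 3, "Direct"),          # U3 returns as Direct on two days
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions "
    "(date TEXT, fullVisitorId TEXT, visitNumber INTEGER, channelGrouping TEXT)"
)
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?, ?)", sessions)

# Self-join: pair each Direct session with an earlier Organic session
# of the same visitor, then count distinct visitors per Direct date.
result = conn.execute("""
    SELECT d.date, COUNT(DISTINCT d.fullVisitorId) AS return_direct_user
    FROM sessions AS o
    JOIN sessions AS d
      ON o.fullVisitorId = d.fullVisitorId
     AND o.visitNumber < d.visitNumber
    WHERE o.channelGrouping = 'Organic Search'
      AND d.channelGrouping = 'Direct'
    GROUP BY d.date
    ORDER BY d.date
""").fetchall()

print(result)  # [('20190815', 2), ('20190816', 1)]
```

Grouping by the Direct session's date means a visitor who returns as Direct on several days is counted once on each of those days.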

Joining to landing pages query doubles the sessions per source

I'm trying to query sum of visits per source from a Big Query table of Google Analytics data, but will need to filter some sessions out at landing page level. Hence I'm pre-querying visitIDs by landing page and re-joining to session data like so:
#StandardSQL
WITH landingpages AS (
SELECT
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
# filters to be added here
)
SELECT
sessions.trafficSource.source,
SUM(sessions.totals.visits) AS visits
FROM `project.dataset.ga_sessions_*` AS sessions
JOIN
landingpages
ON
landingpages.visitID = sessions.visitID
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
GROUP BY
trafficSource.source
ORDER BY
visits DESC
This roughly doubles the number of sessions per source compared with what GA reports.
Can anyone point out what I've done wrong? (I suspect it is blindingly obvious)
I've tried examining the output of the first query and can't find anything wrong with it aside from a very small proportion of duplicated visitIDs. I've also tried various types of JOIN, all to no avail.
When querying GA data in BigQuery, it's imperative to keep in mind that a unique visit is identified by the combination of fullVisitorId and visitID, not by visitID alone. Only a join on both columns returns a meaningful data set.
Here's what I should have written:
#StandardSQL
WITH landingpages AS (
SELECT
fullVisitorId,
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
),
session_data AS (
SELECT
date AS ga_date, trafficSource.source AS source, fullVisitorId, visitID, SUM(totals.visits) AS visits
FROM
`project.dataset.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
AND
totals.visits > 0
GROUP BY ga_date, source, fullVisitorId, visitID
)
SELECT
ga_date, source, SUM(visits) AS Sessions
FROM
landingpages
JOIN
session_data
ON
landingpages.VisitID = session_data.VisitID
AND
landingpages.fullVisitorId = session_data.fullVisitorId
GROUP BY
ga_date, source
ORDER BY
Sessions DESC
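The effect of joining on visitID alone can be seen on a two-row toy case. This is a sketch with Python's built-in sqlite3 module, not BigQuery; the table contents are made up, with two different visitors that happen to share the same visitID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions "
    "(fullVisitorId TEXT, visitID INTEGER, source TEXT, visits INTEGER)"
)
conn.execute(
    "CREATE TABLE landingpages "
    "(fullVisitorId TEXT, visitID INTEGER, LandingPage TEXT)"
)

# Two different visitors whose sessions share visitID 100 (hypothetical data).
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?, ?)", [
    ("V1", 100, "google", 1),
    ("V2", 100, "bing", 1),
])
conn.executemany("INSERT INTO landingpages VALUES (?, ?, ?)", [
    ("V1", 100, "/a"),
    ("V2", 100, "/b"),
])

# Join on visitID only: every session matches both landing-page rows.
visit_id_only = conn.execute("""
    SELECT s.source, SUM(s.visits)
    FROM sessions AS s
    JOIN landingpages AS lp ON lp.visitID = s.visitID
    GROUP BY s.source ORDER BY s.source
""").fetchall()

# Join on fullVisitorId and visitID: each session matches exactly one row.
both_keys = conn.execute("""
    SELECT s.source, SUM(s.visits)
    FROM sessions AS s
    JOIN landingpages AS lp
      ON lp.visitID = s.visitID AND lp.fullVisitorId = s.fullVisitorId
    GROUP BY s.source ORDER BY s.source
""").fetchall()

print(visit_id_only)  # [('bing', 2), ('google', 2)]
print(both_keys)      # [('bing', 1), ('google', 1)]
```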

Need BigQuery SQL query to collect time on page from Google Analytics data

Can anyone help with a BigQuery SQL query to extract the time on page for a specific page from Google Analytics data, please?
For every visitorId who has visited a particular page, I would like the time on page for that page, so that I can calculate the median time on page rather than the mean.
I'm assuming that the visitorId, hits.hitNumber and hits.time dimensions will be needed, and that the hits.time of the hit where the page was viewed will somehow need to be subtracted from the hits.time of the following hit.
Any help much appreciated.
Try this:
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.time,
nextTime,
(nextTime - hits.time) as timeOnPage
FROM(
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.time,
LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.time ASC) as nextTime
FROM [PROJECTID:DATASETID.ga_sessions_YYYYMMDD]
WHERE hits.type = "PAGE"
)
The key to this code is the LEAD() function, which grabs the specified value from the next row in the partition, based on the PARTITION BY and ORDER BY qualifiers.
Hope that helps!
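To see what LEAD() does per partition, here is a small sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or later, which recent Python builds bundle). The flat `hits` table and its values are invented; `time_ms` plays the role of hits.time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits "
    "(fullVisitorId TEXT, visitNumber INTEGER, pagePath TEXT, time_ms INTEGER)"
)
# Hypothetical page hits: milliseconds since the start of the session.
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", [
    ("A", 1, "/home", 0),
    ("A", 1, "/products", 30000),
    ("A", 1, "/cart", 45000),
    ("B", 1, "/home", 0),
])

# LEAD() looks one row ahead within the same visitor/session partition,
# so the difference is the time until the next page hit.
rows = conn.execute("""
    SELECT
        fullVisitorId,
        pagePath,
        LEAD(time_ms) OVER (
            PARTITION BY fullVisitorId, visitNumber
            ORDER BY time_ms
        ) - time_ms AS time_on_page_ms
    FROM hits
    ORDER BY fullVisitorId, time_ms
""").fetchall()

for row in rows:
    print(row)
# ('A', '/home', 30000)
# ('A', '/products', 15000)
# ('A', '/cart', None)
# ('B', '/home', None)
```

The last page of each session has no following row in its partition, so LEAD() yields NULL there and the time on page is unknown.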
To account for the last page of a session, this query can be used. BigQuery has no way to measure the time spent on the last page, so the query returns zero there instead of NULL.
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits_time,
nextTime,
CASE
WHEN hits.isExit IS NOT NULL THEN last_interaction - hits_time
ELSE nextTime - hits_time
END
AS time_on_page
FROM(
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.isExit,
hits.time/1000 as hits_time,
LEAD(hits.time/1000, 1) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.time ASC) as nextTime,
MAX(IF(hits.isInteraction = TRUE,hits.time / 1000,0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction
FROM [PROJECTID:DATASETID.ga_sessions_YYYYMMDD]
WHERE hits.type = "PAGE"
)
Google Analytics 4
Time-Spent-On-Page
SELECT
user_pseudo_id,
event_timestamp,
ga_session_id,
page_location,
page_title,
next_hit_in_the_same_session,
(next_hit_in_the_same_session - event_timestamp)/1000000 AS time_on_page_in_seconds
FROM (
SELECT
user_pseudo_id,
event_timestamp,
(
SELECT value.int_value FROM
UNNEST(event_params)
WHERE
key = 'ga_session_id') AS ga_session_id,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'page_location') AS page_location,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'page_title') AS page_title,
LEAD(event_timestamp) OVER (PARTITION BY user_pseudo_id, (SELECT value.int_value FROM UNNEST(event_params)
WHERE
key = 'ga_session_id')
ORDER BY
event_timestamp ASC) AS next_hit_in_the_same_session
FROM
-- Replace table name.
`cloud-search.analytics_298504139.events_20220106` AS tableAlias
WHERE
event_name = 'page_view'
ORDER BY
user_pseudo_id,
ga_session_id,
event_timestamp ASC )

How to get the Google Analytics definition of unique page views in Bigquery

https://support.google.com/analytics/answer/1257084?hl=en-GB#pageviews_vs_unique_views
I'm trying to calculate the sum of unique page views per day, as shown in the Google Analytics interface.
How do I get the equivalent using BigQuery?
There are two ways to do this:
1) One is, as the linked documentation says, to combine the full visitor ID with the session ID (visitId) and count the distinct combinations:
SELECT
EXACT_COUNT_DISTINCT(combinedVisitorId)
FROM (
SELECT
CONCAT(fullVisitorId,string(VisitId)) AS combinedVisitorId
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE' )
2) The other is to just count distinct fullVisitorIds:
SELECT
EXACT_COUNT_DISTINCT(fullVisitorId)
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE'
If you want to try this out on a public sample dataset, there is a tutorial on how to add the sample dataset.
The other queries didn't match the Unique Pageviews metric in my Google Analytics account, but the following did:
SELECT COUNT(1) as unique_pageviews
FROM (
SELECT
hits.page.pagePath,
hits.page.pageTitle,
fullVisitorId,
visitNumber,
COUNT(1) as hits
FROM [my_table]
WHERE hits.type='PAGE'
GROUP BY
hits.page.pagePath,
hits.page.pageTitle,
fullVisitorId,
visitNumber
)
For uniquePageviews, something like this works better:
SELECT
date,
SUM(uniquePageviews) AS uniquePageviews
FROM (
SELECT
date,
CONCAT(fullVisitorId,string(VisitId)) AS combinedVisitorId,
EXACT_COUNT_DISTINCT(hits.page.pagePath) AS uniquePageviews
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE'
GROUP BY 1,2)
GROUP EACH BY 1;
As of 2022, EXACT_COUNT_DISTINCT() is only available in legacy SQL; in standard SQL, use COUNT(DISTINCT ...) instead.
Also, for me the following combination of fullVisitorId + visitNumber + visitStartTime + hits.page.pagePath was always more precise than the solutions above:
SELECT
SUM(Unique_PageViews)
FROM
(SELECT
COUNT(DISTINCT(CONCAT(fullvisitorid,"-",CAST(visitNumber AS string),"-",CAST(visitStartTime AS string),"-",hits.page.pagePath))) as Unique_PageViews
FROM
`mhsd-bigquery-project.8330566.ga_sessions_*`,
unnest(hits) as hits
WHERE
_table_suffix BETWEEN '20220307'
AND '20220313'
AND hits.type = 'PAGE')
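The difference between the counting approaches in this thread can be compared side by side on toy data. This is a sketch with Python's built-in sqlite3 module rather than BigQuery; the `hits` table is invented and holds one row per page hit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (fullVisitorId TEXT, visitId INTEGER, pagePath TEXT)"
)
# Hypothetical page hits: visitor A views /home twice in session 1.
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", [
    ("A", 1, "/home"),
    ("A", 1, "/home"),
    ("A", 1, "/cart"),
    ("A", 2, "/home"),
    ("B", 1, "/home"),
])

def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]

pageviews = scalar("SELECT COUNT(*) FROM hits")

# Unique pageviews: a page counts once per session, however often it is viewed.
unique_pageviews = scalar("""
    SELECT COUNT(*) FROM (
        SELECT DISTINCT fullVisitorId, visitId, pagePath FROM hits
    )
""")

# Distinct sessions and distinct visitors, for comparison.
sessions = scalar(
    "SELECT COUNT(*) FROM (SELECT DISTINCT fullVisitorId, visitId FROM hits)"
)
visitors = scalar("SELECT COUNT(DISTINCT fullVisitorId) FROM hits")

print(pageviews, unique_pageviews, sessions, visitors)  # 5 4 3 2
```

Counting distinct sessions or distinct visitors ignores the page dimension, which is one plausible reason those queries did not match the Unique Pageviews metric as soon as sessions contain more than one page.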

Resources