Counting google analytics unique events in BigQuery - google-analytics

I have managed to calculate total events by ISOweek but not unique events for a given Google Analytics Event using BigQuery. When checking GA, total_events matches the GA interface on the dot but unique_events are off. Do you know how I can solve this?
The query:
SELECT INTEGER(STRFTIME_UTC_USEC(PARSE_UTC_USEC(date),"%V")) iso8601_week_number,
hits.eventInfo.eventCategory,
hits.eventInfo.eventAction,
COUNT(hits.eventInfo.eventCategory) AS total_events,
EXACT_COUNT_DISTINCT(fullVisitorId) AS unique_events
FROM
TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2017-05-01'), TIMESTAMP('2017-05-07'))
WHERE
hits.type = 'EVENT' AND hits.eventInfo.eventCategory = 'BIG_Transaction'
GROUP BY
iso8601_week_number, hits.eventInfo.eventCategory, hits.eventInfo.eventAction
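For anyone double-checking the week bucketing outside BigQuery, here is a minimal Python sketch of what the "%V" format is expected to return for GA-style YYYYMMDD date strings. The helper name iso_week is made up for illustration; it is not part of any GA or BigQuery API.

```python
from datetime import datetime

def iso_week(yyyymmdd: str) -> int:
    """Return the ISO 8601 week number for a GA-style 'YYYYMMDD' date string."""
    return datetime.strptime(yyyymmdd, "%Y%m%d").date().isocalendar()[1]

# The queried range 2017-05-01 .. 2017-05-07 is exactly one ISO week (Mon-Sun).
weeks = {d: iso_week(d) for d in ("20170501", "20170507", "20170508")}
print(weeks)  # {'20170501': 18, '20170507': 18, '20170508': 19}
```

Note that ISO weeks can belong to the previous year near January 1, so a week number alone is ambiguous across year boundaries.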

Depending on the scope, you need to COUNT(DISTINCT ...) different things, but you always need to fulfill these conditions:
unique events refer to the combination of category, action and label
make sure eventAction is not NULL
make sure eventLabel is not NULL
eventCategory is allowed to be NULL
I'm using COALESCE() to avoid NULLs
Example Session Scope
SELECT
SUM( (SELECT COUNT(h.eventInfo.eventCategory) FROM t.hits h) ) events,
SUM( (SELECT COUNT(DISTINCT
CONCAT( h.eventInfo.eventCategory,
COALESCE(h.eventinfo.eventaction,''),
COALESCE(h.eventinfo.eventlabel, ''))
)
FROM
t.hits h ) ) uniqueEvents
FROM
`google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t
Example Hit Scope
SELECT
h.eventInfo.eventCategory,
COUNT(1) events,
-- we need to take sessions into account, so we add fullvisitorid and visitstarttime
COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS string),
COALESCE(h.eventinfo.eventaction,''),
COALESCE(h.eventinfo.eventlabel, ''))) uniqueEvents
FROM
`google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t,
t.hits h
WHERE
h.type='EVENT'
GROUP BY
1
ORDER BY
2 DESC
hth!
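To see why the COALESCE() calls matter, here is a rough Python model of the SQL behaviour, assuming CONCAT returns NULL when any argument is NULL and COUNT(DISTINCT ...) skips NULLs. The helpers sql_concat, coalesce and count_distinct and the sample hits are made up for illustration.

```python
from typing import Iterable, Optional

def sql_concat(*parts: Optional[str]) -> Optional[str]:
    """Mimic SQL CONCAT: the result is NULL (None) if any argument is NULL."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)

def coalesce(value: Optional[str], default: str) -> str:
    """Mimic COALESCE(value, default) for two arguments."""
    return default if value is None else value

def count_distinct(values: Iterable[Optional[str]]) -> int:
    """Mimic COUNT(DISTINCT ...): NULLs are ignored."""
    return len({v for v in values if v is not None})

# Hypothetical hits from one session: (category, action, label)
hits = [
    ("video", "play", None),     # event without a label
    ("video", "play", "intro"),
    ("video", "play", "intro"),  # repeat of the previous event
    (None, None, None),          # a pageview hit: no event info at all
]

without_coalesce = count_distinct(sql_concat(c, a, l) for c, a, l in hits)
with_coalesce = count_distinct(
    sql_concat(c, coalesce(a, ""), coalesce(l, "")) for c, a, l in hits
)
print(without_coalesce, with_coalesce)  # 1 2
```

Without COALESCE the label-less event disappears from the distinct count. The NULL category on the pageview hit is what keeps non-event hits out of the count, which is why eventCategory is left as it is.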

The definition of unique events in Google Analytics is:
A count of the number of times an event with the category/action/label
value was seen at least once within a session.
In other words, it is the number of sessions in which a specific event (defined by category, action AND label) was sent. Your query counts the number of unique visitors that had the event, whereas you need to count sessions. Also keep in mind that events with different labels count as different unique events, even though the query only groups by category and action.
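A small Python sketch on made-up hit rows may make the difference between these counts concrete. The tuple layout (fullVisitorId, visitId, category, action, label) and the values are assumptions for illustration only.

```python
# Hypothetical event hits: (fullVisitorId, visitId, category, action, label)
hits = [
    ("A", 1, "BIG_Transaction", "buy", "x"),
    ("A", 1, "BIG_Transaction", "buy", "x"),  # same event repeated in the session
    ("A", 1, "BIG_Transaction", "buy", "y"),  # same session, different label
    ("A", 2, "BIG_Transaction", "buy", "x"),  # same visitor, new session
    ("B", 1, "BIG_Transaction", "buy", "x"),
]

total_events = len(hits)
unique_visitors = len({visitor for visitor, *_ in hits})
sessions_with_event = len({(visitor, visit) for visitor, visit, *_ in hits})
# Unique events: distinct (session, category, action, label) combinations.
unique_events = len(set(hits))

print(total_events, unique_visitors, sessions_with_event, unique_events)  # 5 2 3 4
```

Counting distinct fullVisitorId gives 2 here, counting sessions gives 3, and the session plus category/action/label combination gives 4.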
A possible way to fix your code is:
SELECT
INTEGER(STRFTIME_UTC_USEC(PARSE_UTC_USEC(date),"%V")) iso8601_week_number,
hits.eventInfo.eventCategory,
hits.eventInfo.eventAction,
COUNT(hits.eventInfo.eventCategory) AS total_events,
EXACT_COUNT_DISTINCT(CONCAT(fullVisitorId,'-',string(visitId),'-',date,'-',ifnull(hits.eventInfo.eventLabel,'null'))) AS unique_events
FROM
TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2017-05-01'), TIMESTAMP('2017-05-07'))
WHERE
hits.type = 'EVENT' AND hits.eventInfo.eventCategory = 'BIG_Transaction'
GROUP BY
iso8601_week_number, hits.eventInfo.eventCategory, hits.eventInfo.eventAction
The results of this query should match with the data in the GA interface.

I believe the issue is that you are only counting the number of unique visitors who have completed the specified action, while GA defines unique events as "The number of times during a date range that a session contained the specific dimension".
Therefore, I would just change your code to the below:
SELECT INTEGER(STRFTIME_UTC_USEC(PARSE_UTC_USEC(date),"%V")) iso8601_week_number,
hits.eventInfo.eventCategory,
hits.eventInfo.eventAction,
COUNT(hits.eventInfo.eventCategory) AS total_events,
EXACT_COUNT_DISTINCT(CONCAT(fullVisitorId, STRING(visitId))) AS unique_events
FROM
TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2017-05-01'), TIMESTAMP('2017-05-07'))
WHERE
hits.type = 'EVENT' AND hits.eventInfo.eventCategory = 'BIG_Transaction'
GROUP BY
iso8601_week_number, hits.eventInfo.eventCategory, hits.eventInfo.eventAction
This should give you the distinct count of sessions that had the given events.

We did something similar to what @Martin was suggesting, with some cool CTEs, and we were able to get a 100% match between BigQuery and what was coming out of Google Analytics.
Check out the code snippet below, which returns a per-day count of sessions and unique Add to Cart events:
#standardSQL
WITH AN_ATC AS
(
SELECT
-- full date w/ hyphens (ie 2021-01-07)
CAST(format_date('%Y-%m-%d', parse_date("%Y%m%d", date)) AS DATE) as DATE,
-- COUNT OF SESSIONS
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS Sessions,
-- COUNT OF UNIQUE EVENTS PER SESSION
COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS string),
COALESCE(hits.eventinfo.eventaction,''),
COALESCE(hits.eventinfo.eventlabel, ''))) AS EVENTS
FROM `an-big-query.PROJECT_ID.ga_sessions_*` ,
UNNEST(hits) as hits
WHERE
-- start date
_table_suffix BETWEEN '20190101'
-- yesterday
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(),INTERVAL 1 DAY))
AND hits.eventInfo.eventAction = 'add to cart'
GROUP BY
date
)
SELECT
DATE,
SESSIONS,
EVENTS
FROM AN_ATC
ORDER BY date DESC
Where,
SESSIONS = Google Analytics ga:Sessions
and
EVENTS = Google Analytics ga:uniqueEvents
BOTH with eventAction=@add to cart
Hope that helps everyone that was searching/googling!

Related

BigQuery Google Analytics: count number of sessions and goal completion if custom dimension of the first hit of a session has a certain value

I am using BigQuery to pull Google Analytics sessions and goal completions where the first hit's custom dimension equals certain values. The session count was right, but the goal completion count was far lower than expected.
In GA, my segment used for QA is
include sessions sequence start first user interaction
custom dimension 10 is one of 1,5
Goal definition is
Event Action = 'event action value 1' or 'event action value 2'
Here is my code
SELECT
--- goal completion
COUNT(DISTINCT(
IF
(REGEXP_contains(hits.eventinfo.eventaction , '(event action value 1|event action value 2)'),
CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)),
NULL))) AS Goals,
---Sessions
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS Sessions,
FROM
`123456789.ga_sessions_20200201` as Data,
unnest(hits) as hits
WHERE totals.visits = 1
and EXISTS (select 1 from unnest(hits.customDimensions) where index = 10 and value in ('1', '5') and hits.time = 0)

How to query Direct returning visitor in BigQuery

I am trying to use BigQuery to figure out how many users returned as Direct users after first visiting the website as Organic.
This is what I have so far. To get the number of users who came back as Direct after visiting as Organic, I used
organic_user.visitNumber < direct_user.visitNumber
in the WHERE clause.
SELECT
organic_user.date,
COUNT (DISTINCT direct_user.fullVisitorId) AS return_direct_user
FROM
(
SELECT
date,
fullVisitorId,
visitNumber
FROM
`ga_sessions_*`,
UNNEST(hits) as hits
WHERE
DATE BETWEEN '20190814'
AND '20190911'
AND channelGrouping = 'Organic Search'
) AS organic_user
INNER JOIN (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`ga_sessions_*`,
UNNEST(hits) as hits
WHERE
DATE BETWEEN '20190814'
AND '20190911'
AND channelGrouping = 'Direct'
) AS direct_user ON organic_user.fullVisitorId = direct_user.fullVisitorId
WHERE
organic_user.visitNumber < direct_user.visitNumber
GROUP BY
date
ORDER BY
date ASC
Could anyone verify this query is correct?
If not, could you provide a solution for this?
With all the clarifications you provided in the comments, I was able to come up with some adaptations of your original query:
SELECT
direct_user.date,
COUNT (DISTINCT direct_user.fullVisitorId) AS return_direct_user
FROM (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`bigquery-public-data`.google_analytics_sample.`ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
DATE BETWEEN '20161214'
AND '20180911'
AND channelGrouping = 'Organic Search' ) AS organic_user
INNER JOIN (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`bigquery-public-data`.google_analytics_sample.`ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
DATE BETWEEN '20161214'
AND '20180911'
AND channelGrouping = 'Direct' ) AS direct_user
ON
organic_user.fullVisitorId = direct_user.fullVisitorId
AND organic_user.visitNumber < direct_user.visitNumber
GROUP BY
direct_user.date
ORDER BY
direct_user.date ASC
Here are some considerations about the changes I made:
I noticed it was important to specify which subquery's date we are using in the GROUP BY. Since we are counting 'Direct' visits per day, it makes sense to count them on the day they happen.
I moved the organic_user.visitNumber < direct_user.visitNumber condition to the JOIN clause. I know that for INNER JOINs it makes no technical difference, but semantically I thought it belongs there.
I hope this information is helpful to you.
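For anyone who wants to sanity-check the join logic on a handful of rows, here is a minimal Python sketch of the same idea: a user counts on a given date if they had a Direct session on that date whose visitNumber is higher than at least one of their Organic Search sessions. The session tuples are invented sample data, not taken from the public dataset.

```python
from collections import defaultdict

# Hypothetical sessions: (date, fullVisitorId, visitNumber, channelGrouping)
sessions = [
    ("20190814", "u1", 1, "Organic Search"),
    ("20190820", "u1", 2, "Direct"),          # u1 returns as Direct
    ("20190815", "u2", 1, "Direct"),
    ("20190821", "u2", 2, "Organic Search"),  # Direct came first: not counted
    ("20190816", "u3", 1, "Organic Search"),
    ("20190820", "u3", 3, "Direct"),          # u3 returns as Direct
    ("20190822", "u3", 4, "Direct"),
]

# Lowest Organic visitNumber per user. "Some Organic visit precedes this
# Direct visit" is equivalent to "the earliest Organic visit precedes it".
first_organic = {}
for _, user, visit_number, channel in sessions:
    if channel == "Organic Search":
        first_organic[user] = min(visit_number, first_organic.get(user, visit_number))

# Group by the date of the Direct session, as in the adapted query.
returning = defaultdict(set)
for date, user, visit_number, channel in sessions:
    if channel == "Direct" and user in first_organic and first_organic[user] < visit_number:
        returning[date].add(user)

return_direct_user = {date: len(users) for date, users in sorted(returning.items())}
print(return_direct_user)  # {'20190820': 2, '20190822': 1}
```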

Accessing Struct(s) and Array(s) in Firebase Closed Funnels through BigQuery

I stumbled upon this standard SQL BigQuery documentation this week, which got me started with a Firebase Analytics Closed Funnel. However, I got the wrong results (view image below). No user should have a "Tutorial_LessonCompleted" event without first having a "Tutorial_LessonStarted >> Lesson = 1" event. This could be because of various reasons.
Questions:
Is it wise to use the User Property = "first_open_time", or is it better to use the Event = "first_open"? What would the latter implementation look like?
I suspect I am perhaps not correctly drilling down to: Event (String = "Tutorial_LessonStarted") >> parameter (String = "LessonNumber") >> value (String = "lesson1")?
How would a filter on _TABLE_SUFFIX = '20170701' work? I read that this will be cheaper. Any optimised code suggestions are received with open arms and an up-vote!
#standardSQL
SELECT
step1, step2, step3, step4, step5, step6,
COUNT(*) AS funnel_count,
COUNT(DISTINCT user_id) AS users
FROM (
SELECT
user_dim.app_info.app_instance_id AS user_id,
event.timestamp_micros AS event_timestamp,
event.name AS step1,
LEAD(event.name, 1) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step2,
LEAD(event.name, 2) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step3,
LEAD(event.name, 3) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step4,
LEAD(event.name, 4) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step5,
LEAD(event.name, 5) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step6
FROM
`......`,
UNNEST(event_dim) AS event,
UNNEST(user_dim.user_properties) AS user_prop
WHERE user_prop.key = "first_open_time"
ORDER BY 1, 2, 3, 4, 5 ASC
)
WHERE step6 = "Tutorial_LessonStarted" AND EXISTS (
SELECT *
FROM `......`,
UNNEST(event_dim) AS event,
UNNEST(event.params)
WHERE key = 'LessonNumber' AND value.string_value = "lesson1") GROUP BY step1, step2, step3, step4, step5, step6
ORDER BY funnel_count DESC
LIMIT 100;
Note:
Enter your own table in the FROM clause, e.g. project_id.com_game_example_IOS.app_events_20170212.
I left out the funnel_count and user_count.
Output:
----------------------------------------------------------
Update since original question above:
@Elliot: I don't understand why you said: -- ensure that an event with lesson1 precedes Tutorial_LessonStarted.
Tutorial_LessonStarted has a parameter "LessonNumber" with values lesson1, lesson2, lesson3, lesson4.
I want to count all funnels whose last step has LessonNumber=lesson1.
So, applied to event log data for a brand new user's first session (i.e. a user that fired first_open_time), the answer would be the table below:
View.OnboardingWelcomePage
View.OnboardingFinalPage
View.JamLoading
View.JamLoading
Jam.UserViewsJam
Jam.ProjectOpened
View.JamMixer
Tutorial.LessonStarted (the "LessonNumber" parameter's value would be "lesson1")
Jam.ProjectPlayStarted
View.JamLoopSelector
View.JamMixer
View.JamLoopSelector
View.JamMixer
View.JamLoopSelector
View.JamMixer
Tutorial.LessonCompleted
Tutorial.LessonStarted (the "LessonNumber" parameter's value would be "lesson2")
So it is important to first get all the users that had a first_open_time on a specific day, then structure the events into a funnel whose last event matches a given event name and a specific parameter value, and then form the funnel "backwards" from there.
Let me go through some explanation, then see if I can suggest a query to get you started.
It looks like you want to analyze the sequence of events in your analytics data, but the sequence is already there for you--you have an array of the events. Looking at the Firebase schema for BigQuery, event_dim is the relevant column, and unless I'm misunderstanding something, these events are ordered by time. If you want to check what the sixth event's name was, you can use:
event_dim[SAFE_ORDINAL(6)].name
This will evaluate to NULL if there were fewer than six events, or else it will give you the string with the event name.
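If it helps, the behaviour described here can be modelled in a few lines of Python. The function name safe_ordinal is just a stand-in for illustration: the lookup is 1-based and yields None (standing in for NULL) rather than an error when the position does not exist.

```python
from typing import Optional, Sequence

def safe_ordinal(items: Sequence[str], n: int) -> Optional[str]:
    """1-based lookup that returns None instead of failing when n is out of range."""
    return items[n - 1] if 1 <= n <= len(items) else None

event_names = ["first_open", "View.JamLoading", "Tutorial_LessonStarted"]
print(safe_ordinal(event_names, 3))  # Tutorial_LessonStarted
print(safe_ordinal(event_names, 6))  # None
```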
Another observation is that you are attempting to analyze both event_dim and user_dim, but you are taking the cross product of the two, which will explode the number of rows and make it hard to reason about the results of the query. To look for a specific user property, use an expression of this form:
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'first_open_time') = '<expected property value>'
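As a rough illustration of the row explosion, here is a Python sketch with made-up events and user properties. Unnesting both arrays side by side pairs every event with every property, while the scalar lookup keeps one value per row.

```python
# Hypothetical contents of one row (names and values are invented).
events = ["first_open", "View.JamMixer", "Tutorial_LessonStarted"]
user_properties = {
    "first_open_time": "1498867200000",
    "language": "en",
    "plan": "free",
}

# UNNEST(event_dim), UNNEST(user_dim.user_properties): every event x every property.
cross_product = [(event, key) for event in events for key in user_properties]

# Scalar-subquery style: look up the single property that is needed.
first_open_time = user_properties.get("first_open_time")

print(len(cross_product))  # 9
print(first_open_time)     # 1498867200000
```

Filtering the cross product on one key brings it back to one row per event, but the intermediate result still grows with the number of properties.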
Combining these two filters, your FROM and WHERE clause would look something like this:
FROM `project_id.com_game_example_IOS.app_events_*`
WHERE _TABLE_SUFFIX = '20170701' AND
event_dim[SAFE_ORDINAL(6)].name = 'Tutorial_LessonStarted' AND
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'first_open_time') = '<expected property value>'
Using the bracket operator to access the steps from event_dim, we can do something like this:
WITH FilteredInput AS (
SELECT *
FROM `project_id.com_game_example_IOS.app_events_*`
WHERE _TABLE_SUFFIX = '20170701' AND
event_dim[SAFE_ORDINAL(6)].name = 'Tutorial_LessonStarted' AND
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'first_open_time') = '<expected property value>' AND
-- ensure that an event with lesson1 precedes Tutorial_LessonStarted
EXISTS (
SELECT 1
FROM UNNEST(event_dim) WITH OFFSET event_offset
CROSS JOIN UNNEST(params)
WHERE key = 'LessonNumber' AND
value.string_value = 'lesson1' AND
event_offset < 5
)
)
SELECT
event_dim[ORDINAL(1)].name AS step1,
event_dim[ORDINAL(2)].name AS step2,
event_dim[ORDINAL(3)].name AS step3,
event_dim[ORDINAL(4)].name AS step4,
event_dim[ORDINAL(5)].name AS step5,
event_dim[ORDINAL(6)].name AS step6,
COUNT(*) AS funnel_count,
COUNT(DISTINCT user_dim.user_id) AS users
FROM FilteredInput
GROUP BY step1, step2, step3, step4, step5, step6;
This will return all unique "paths" along with a count and number of distinct users for each. Note that I'm just writing this off the top of my head--I don't have representative data that I can try it on--so there may be syntax or other errors.
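To make the shape of the result concrete, here is a small Python sketch of the same grouping on invented rows. It uses three steps instead of six to keep the data short, and the user ids and event sequences are assumptions for illustration.

```python
from collections import defaultdict

STEPS = 3  # the query above uses 6; 3 keeps the toy data short
LAST_STEP = "Tutorial_LessonStarted"

# Hypothetical rows: (user_id, ordered event names in that row's event_dim)
rows = [
    ("u1", ["first_open", "View.JamMixer", "Tutorial_LessonStarted", "Jam.ProjectPlayStarted"]),
    ("u1", ["first_open", "View.JamMixer", "Tutorial_LessonStarted"]),
    ("u2", ["first_open", "View.JamMixer", "Tutorial_LessonStarted"]),
    ("u3", ["first_open", "View.JamLoading", "Tutorial_LessonStarted"]),
    ("u4", ["first_open", "Tutorial_LessonStarted"]),              # fewer than STEPS events
    ("u5", ["first_open", "View.JamMixer", "Jam.ProjectOpened"]),  # wrong final step
]

funnel_count = defaultdict(int)
users = defaultdict(set)
for user_id, events in rows:
    # Same role as the SAFE_ORDINAL filter: skip short rows and wrong last steps.
    if len(events) < STEPS or events[STEPS - 1] != LAST_STEP:
        continue
    path = tuple(events[:STEPS])
    funnel_count[path] += 1
    users[path].add(user_id)

# path -> (funnel_count, distinct users)
result = {path: (funnel_count[path], len(users[path])) for path in funnel_count}
for path, counts in result.items():
    print(path, counts)
```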

Joining to landing pages query doubles the sessions per source

I'm trying to query the sum of visits per source from a BigQuery table of Google Analytics data, but I need to filter some sessions out at landing page level. Hence I'm pre-querying visitIDs by landing page and re-joining to the session data like so:
#StandardSQL
WITH landingpages AS (
SELECT
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
# filters to be added here
)
SELECT
sessions.trafficSource.source,
SUM(sessions.totals.visits) AS visits
FROM `project.dataset.ga_sessions_*` AS sessions
JOIN
landingpages
ON
landingpages.visitID = sessions.visitID
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
GROUP BY
trafficSource.source
ORDER BY
visits DESC
This roughly doubles the number of sessions per each source as reported from GA.
Can anyone point out what I've done wrong? (I suspect it is blindingly obvious)
I've tried examining the data output from the first query and can't find anything wrong with it aside from a very small proportion of duplicated visitIDs. I've also tried various different types of JOIN, all to no avail.
When querying GA data from BigQuery, it's imperative to keep in mind that a unique visit is identified by the combination of fullVisitorId and visitId. Only a join on both will return a meaningful data set.
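A tiny Python sketch shows how the duplication arises. It assumes two different visitors who happen to share the same visitId, which is possible because visitId is only unique per visitor. The rows are invented.

```python
# Hypothetical rows: (fullVisitorId, visitId, value)
landing_pages = [("A", 100, "/home"), ("B", 100, "/pricing")]
sessions = [("A", 100, "google"), ("B", 100, "bing")]

# Join on visitId only: every session matches every landing page with that id.
join_on_visit_id = [
    (s, lp) for s in sessions for lp in landing_pages if s[1] == lp[1]
]

# Join on (fullVisitorId, visitId): each session matches only its own landing page.
join_on_both = [
    (s, lp) for s in sessions for lp in landing_pages if s[:2] == lp[:2]
]

print(len(join_on_visit_id), len(join_on_both))  # 4 2
```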
Here's what I should have written:
#StandardSQL
WITH landingpages AS (
SELECT
fullVisitorId,
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
),
session_data AS (
SELECT
date AS ga_date, trafficSource.source AS source, fullVisitorId, visitID, SUM(totals.visits) AS visits
FROM
`project.dataset.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
AND
totals.visits > 0
GROUP BY ga_date, source, fullVisitorId, visitID
)
SELECT
ga_date, source, SUM(visits) AS Sessions
FROM
landingpages
JOIN
session_data
ON
landingpages.VisitID = session_data.VisitID
AND
landingpages.fullVisitorId = session_data.fullVisitorId
GROUP BY
ga_date, source
ORDER BY
Sessions DESC

How to get the Google Analytics definition of unique page views in Bigquery

https://support.google.com/analytics/answer/1257084?hl=en-GB#pageviews_vs_unique_views
I'm trying to calculate the sum of unique page views per day, which Google Analytics shows in its interface.
How do I get the equivalent using BigQuery?
There are two ways this is used:
1) One is, as the original linked documentation says, to combine the full visitor ID with the session ID (visitId) and count those.
SELECT
EXACT_COUNT_DISTINCT(combinedVisitorId)
FROM (
SELECT
CONCAT(fullVisitorId,string(VisitId)) AS combinedVisitorId
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE' )
2) The other is just counting distinct fullVisitorIds
SELECT
EXACT_COUNT_DISTINCT(fullVisitorId)
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE'
If someone wants to try this out on a sample public dataset, there is a tutorial on how to add the sample dataset.
The other queries didn't match the Unique Pageviews metric in my Google Analytics account, but the following did:
SELECT COUNT(1) as unique_pageviews
FROM (
SELECT
hits.page.pagePath,
hits.page.pageTitle,
fullVisitorId,
visitNumber,
COUNT(1) as hits
FROM [my_table]
WHERE hits.type='PAGE'
GROUP BY
hits.page.pagePath,
hits.page.pageTitle,
fullVisitorId,
visitNumber
)
For uniquePageviews, you are better off using something like this:
SELECT
date,
SUM(uniquePageviews) AS uniquePageviews
FROM (
SELECT
date,
CONCAT(fullVisitorId,string(VisitId)) AS combinedVisitorId,
EXACT_COUNT_DISTINCT(hits.page.pagePath) AS uniquePageviews
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE'
GROUP BY 1,2)
GROUP EACH BY 1;
So, in 2022 EXACT_COUNT_DISTINCT() seems to be deprecated (it is a legacy SQL function; standard SQL uses COUNT(DISTINCT ...) instead).
Also, for me the following combination of fullVisitorId + visitNumber + visitStartTime + hits.page.pagePath was always more precise than the above solutions:
SELECT
SUM(Unique_PageViews)
FROM
(SELECT
COUNT(DISTINCT(CONCAT(fullvisitorid,"-",CAST(visitNumber AS string),"-",CAST(visitStartTime AS string),"-",hits.page.pagePath))) as Unique_PageViews
FROM
`mhsd-bigquery-project.8330566.ga_sessions_*`,
unnest(hits) as hits
WHERE
_table_suffix BETWEEN '20220307'
AND '20220313'
AND hits.type = 'PAGE')
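To compare the approaches in this thread on a few rows, here is a minimal Python sketch with invented page hits. It treats a unique pageview as a distinct (session, pagePath) pair, which is the idea behind the last query. One of the earlier answers also adds pageTitle to the key, so treat the exact key as an assumption to verify against your own GA view.

```python
# Hypothetical PAGE hits: (fullVisitorId, visitId, pagePath)
page_hits = [
    ("A", 1, "/home"),
    ("A", 1, "/home"),     # reload within the same session
    ("A", 1, "/pricing"),
    ("A", 2, "/home"),     # same visitor, new session
    ("B", 1, "/home"),
]

pageviews = len(page_hits)
visitors = len({visitor for visitor, _, _ in page_hits})
sessions = len({(visitor, visit) for visitor, visit, _ in page_hits})
# Unique pageviews: each page counts once per session.
unique_pageviews = len(set(page_hits))

print(pageviews, visitors, sessions, unique_pageviews)  # 5 2 3 4
```

Counting distinct visitors or distinct sessions undercounts here, since one session can contribute several unique pageviews.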

Resources