Firebase Events for Newly Installed Purchaser Cohort in Bigquery - firebase

Given the install date of android users, I would like to get the users' count for all our 200+ Firebase events on day0 to dayX for users which have already made at least one purchase in a defined period after installation. The first half of this question was previously solved in this question. I thought it would be helpful to share an added "purchaser"-cohort query for others to re-use.
My first attempt (which failed):
-- STANDARD SQL
-- NEW BIGQUERY EXPORT SCHEMA
SELECT
a.event_name AS event_name,
a._TABLE_SUFFIX as day,
COUNT(1) as users
FROM `xxxx.analytics_xxxx.events_*` as c
RIGHT JOIN (SELECT user_pseudo_id, event_date, event_timestamp, event_name
FROM `xxxx.analytics_xxxx.events_*`
WHERE user_first_touch_timestamp BETWEEN 1530453600000000 AND 1530468000000000
AND _TABLE_SUFFIX BETWEEN '20180630' AND '20180707'
AND platform = "ANDROID"
AND (event_name = 'in_app_purchase' OR event_name = 'ecommerce_purchase')
) as a
ON a.user_pseudo_id = c.user_pseudo_id
WHERE _TABLE_SUFFIX BETWEEN '20180630' AND '20180707'
GROUP BY event_name, day;

Answer:
-- STANDARD SQL
-- NEW BIGQUERY EXPORT SCHEMA
SELECT
event_name AS event_name,
_TABLE_SUFFIX as day,
COUNT(1) as users
FROM `xxxx.analytics_xxxx.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180630' AND '20180707'
AND user_pseudo_id IN (SELECT user_pseudo_id
FROM `xxxx.analytics_xxxx.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180630' AND '20180707'
AND user_first_touch_timestamp BETWEEN 1530453600000000 AND 1530468000000000
AND (event_name = 'in_app_purchase' OR event_name = 'ecommerce_purchase')
AND platform = "ANDROID")
GROUP BY event_name, day;
PS: Suggestions to optimize this script are always welcome :)

Related

BigQuery - How to order by event

I'm starting using BigQuery these days for work. Until now I managed to request what I wanted but I'm stuck.
I retrieve data from Firebase on my big query console. These data are events from a mobile game we are testing.
I would like to know how many players are there in each level by ABVersion. I can't figure out how to do it.
I did this:
SELECT
param.value.string_value AS Version,
COUNT (DISTINCT user_pseudo_id) AS Players,
param2.value.string_value AS Level
FROM
`*Name of the dataset*`,
UNNEST(event_params) AS param,
UNNEST(event_params) AS param2
WHERE
event_name = 'Level_end'
AND param.key = 'ABVersion'
AND param2.key = 'Level'
GROUP BY Version,Level
And I got this:
I would like to have the number of players per level, with the ABVersion provided.
Thank you for your help!
Level is an integer parameter instead of string. So you should use value.int_value for level.
For the thing you're trying to do, it looks like a better query to me:
SELECT
highest_level,
abversion,
count(*) as players
FROM (
SELECT
user_pseudo_id,
ANY_VALUE((SELECT value.string_value FROM UNNEST(params) WHERE key = 'ABVersion')) as abversion,
MAX((SELECT value.int64_value FROM UNNEST(params) WHERE key = 'Level')) as highest_level
FROM `*Name of the dataset*`,
WHERE
event_name = 'Level_end'
AND EXISTS (SELECT 1 FROM UNNEST(params) WHERE key IN ('Level', 'ABVersion'))
GROUP BY user_pseudo_id
)
GROUP BY 1,2
ORDER BY 1,2

How to query Direct returning visitor in BigQuery

I am trying to figure out how many users returned as Direct users after visiting the website as Organic using BigQuery
This is what I did so far. In order to get the number of users who came back as Direct after visiting as Organic, I used
organic_user.visitNumber < direct_user.visitNumber
in WHERE clause.
SELECT
organic_user.date,
COUNT (DISTINCT direct_user.fullVisitorId) AS return_direct_user
FROM
(
SELECT
date,
fullVisitorId,
visitNumber
FROM
`ga_sessions_*`,
UNNEST(hits) as hits
WHERE
DATE BETWEEN '20190814'
AND '20190911'
AND channelGrouping = 'Organic Search'
) AS organic_user
INNER JOIN (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`ga_sessions_*`,
UNNEST(hits) as hits
WHERE
DATE BETWEEN '20190814'
AND '20190911'
AND channelGrouping = 'Direct'
) AS direct_user ON organic_user.fullVisitorId = direct_user.fullVisitorId
WHERE
organic_user.visitNumber < direct_user.visitNumber
GROUP BY
date
ORDER BY
date ASC
Could anyone verify this query is correct?
If not, could you provide a solution for this?
With all the clarifications you provided in the comments, I was able to come up with some adaptations of your original query:
SELECT
direct_user.date,
COUNT (DISTINCT direct_user.fullVisitorId) AS return_direct_user
FROM (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`bigquery-public-data`.google_analytics_sample.`ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
DATE BETWEEN '20161214'
AND '20180911'
AND channelGrouping = 'Organic Search' ) AS organic_user
INNER JOIN (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`bigquery-public-data`.google_analytics_sample.`ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
DATE BETWEEN '20161214'
AND '20180911'
AND channelGrouping = 'Direct' ) AS direct_user
ON
organic_user.fullVisitorId = direct_user.fullVisitorId
AND organic_user.visitNumber < direct_user.visitNumber
GROUP BY
direct_user.date
ORDER BY
direct_user.date ASC
Here are some considerations about the changes I made:
I noticed it was important to specify the subquery group date we
are using for the group by. Since we are counting ‘Direct’ visits
per day, it makes sense we count when they happen.
I moved the organic_user.visitNumber < direct_user.visitNumber
condition to the JOIN clause, I know for INNER JOINs it does not
make any technical difference, but for semantic reasons I thought it
belong there.
I hope this information results to be helpful to you.

SQL Query to find users that install and uninstall an App on the same day

I am trying to find the users that install and uninstall the App on the same day using the data from Firebase Analytics in Google BigQuery
This is where I got so far.
I have a query that gives me users (or app_instance_id) who install or uninstall the App:
SELECT event.date,
user_dim.app_info.app_instance_id,
event.name
FROM `app_name.app_events_20180303`,
UNNEST(event_dim) AS event
WHERE (event.name = "app_remove" OR event.name = "first_open")
ORDER BY app_instance_id, event.date
It gives me the following result where I can see that row 1 and 2 are the same user that installs and uninstalls the App:
I´ve tried to modify the previous query by using
WHERE (event.name = "app_remove" AND event.name = "first_open")
which gives: Query returned zero records.
Do you have any suggestions on how to achieve this? Thanks.
Try this, although I did not test it;
SELECT date,
app_instance_id
FROM
(SELECT event.date,
user_dim.app_info.app_instance_id,
event.name
FROM `app_name.app_events_20180303`,
UNNEST(event_dim) AS event
WHERE (event.name = "app_remove" OR event.name = "first_open"))
GROUP BY app_instance_id, date
HAVING COUNT(*) = 2
ORDER BY app_instance_id, date
To start, it's worth noting that iOS does not yield app_remove, so this query only counts Android users who go through the install/uninstall pattern.
I created a sub-set of users who emitted first_open and app_remove, and counted those entries grouped by the date. I only kept instances where users installed and removed the app the same number of times in a day (greater than zero).
Then I tallied the distinct users.
SELECT COUNT(DISTINCT(user_id)) as transient_user_count
FROM (
SELECT event_date,
user_id,
COUNT(if(event_name = "first_open", user_id, NULL)) as user_first_open,
COUNT(if(event_name = "app_remove", user_id, NULL)) as user_app_remove
FROM `your_app.analytics_123456.events_*`
-- WHERE (_TABLE_SUFFIX between '20191201' and '20191211')
GROUP BY user_id, event_date
HAVING user_first_open > 0 AND user_first_open = user_app_remove
)
If you're not able to rely on user_id, then the documentation suggests that you may be able to rely on the user_pseudo_id
Usually we can join the table by itself to find out such result, something like:
SELECT t1.date, t1.app_instance_id
FROM event as t1, event as t2
WHERE t1.date = t2.date and t1.app_instance_id = t2.app_instance_id and t1.name = "app_remove" and t2.name = "first_open"
ORDER by t1.app_instance_id, t1.date

Joining to landing pages query doubles the sessions per source

I'm trying to query sum of visits per source from a Big Query table of Google Analytics data, but will need to filter some sessions out at landing page level. Hence I'm pre-querying visitIDs by landing page and re-joining to session data like so:
#StandardSQL
WITH landingpages AS (
SELECT
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
# filters to be added here
)
SELECT
sessions.trafficSource.source,
SUM(sessions.totals.visits) AS visits
FROM `project.dataset.ga_sessions_*` AS sessions
JOIN
landingpages
ON
landingpages.visitID = sessions.visitID
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
GROUP BY
trafficSource.source
ORDER BY
visits DESC
This roughly doubles the number of sessions per each source as reported from GA.
Can anyone point out what I've done wrong? (I suspect it is blindingly obvious)
I've tried examining the data output from the first query and can't find anything wrong with it aside from a very small proportion of duplicated visitIDs. I've also tried various different types of JOIN, all to now avail.
When querying ga data from GBQ it's imperative to know and keep in mind that a unique visit is represented by both a fullVisitorID and visitID. Only a double join on both will return a meaningful data set.
Here's what I should have written:
#StandardSQL
WITH landingpages AS (
SELECT
fullVisitorId,
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
),
session_data AS (
SELECT
date AS ga_date, trafficSource.source AS source, fullVisitorId, visitID, SUM(totals.visits) AS visits
FROM
`project.dataset.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
AND
totals.visits > 0
GROUP BY ga_date, source, fullVisitorId, visitID
)
SELECT
ga_date, source, SUM(visits) AS Sessions
FROM
landingpages
JOIN
session_data
ON
landingpages.VisitID = session_data.VisitID
AND
landingpages.fullVisitorId = session_data.fullVisitorId
GROUP BY
ga_date, source
ORDER BY
Sessions DESC

Need BigQuery SQL query to collect time on page from Google Analytics data

can anyone help with a BIgQuery SQL query to extract the time on page for a specific page from Google Analytics data please?
For every visitorId who has visited a particular page I would like the time on page for that page. This is so that I can calculate the median time on page rather than the mean.
I'm assuming that the visitorId, hits.hitsNumber and hits.time dimensions will be needed. Also that somehow the hits.time for the hit where the page was viewed will need to be subtracted from the hits.time of the following hit.
Any help much appreciated.
Try this:
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.time,
nextTime,
(nextTime - hits.time) as timeOnPage
FROM(
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.time,
LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.time ASC) as nextTime
FROM [PROJECTID:DATASETID.ga_sessions_YYYYMMDD]
WHERE hits.type = "PAGE"
)
The key to this code is the LEAD() function, which grabs the specified value from the next row in the partition, based on the PARTITION BY and ORDER BY qualifiers.
Hope that helps!
To Account for last page time, this query can be used, and it will give zero time on last page,since BQ doesn't have way to calculate time spent on last page, but it will at least give zero instead of null.
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.time,
nextTime,
CASE
WHEN hits.isExit IS NOT NULL THEN last_interaction - hit_time
ELSE next_pageview - hit_time
END
AS time_on_page
FROM(
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.isExit,
hits.time/1000 as hits_time,
LEAD(hits.time/1000, 1) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.time ASC) as nextTime,
MAX(IF(hits.isInteraction = TRUE,hits.time / 1000,0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction
FROM [PROJECTID:DATASETID.ga_sessions_YYYYMMDD]
WHERE hits.type = "PAGE"
)
Google Analytics 4
Time-Spent-On-Page
SELECT
user_pseudo_id,
event_timestamp,
ga_session_id,
page_location,
page_title,
next_hit_in_the_same_session,
(next_hit_in_the_same_session - event_timestamp)/1000000 AS time_on_page_in_seconds
FROM (
SELECT
user_pseudo_id,
event_timestamp,
(
SELECT value.int_value FROM
UNNEST(event_params)
WHERE
key = 'ga_session_id') AS ga_session_id,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'page_location') AS page_location,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'page_title') AS page_title,
LEAD(event_timestamp) OVER (PARTITION BY (SELECT value.int_value FROM UNNEST(event_params)
WHERE
key = 'ga_session_id')
ORDER BY
event_timestamp ASC) AS next_hit_in_the_same_session
FROM
-- Replace table name.
`cloud-search.analytics_298504139.events_20220106` AS tableAlias
WHERE
event_name = 'page_view'
ORDER BY
user_pseudo_id,
ga_session_id,
event_timestamp ASC )

Resources