How to flatten Google Analytics custom dimensions with a UDF in BigQuery?

Based on a post by Robert Sahlin, I want to use a BigQuery UDF to access any Google Analytics custom dimension in BigQuery by its index. In the proposed solution Robert uses a JavaScript UDF, and I'm wondering if it's possible to do the same with a SQL UDF - since a SQL UDF should perform better than a JS one.
The proposed JS UDF:
CREATE TEMPORARY FUNCTION customDimensionByIndex(index INT64, arr ARRAY<STRUCT<index INT64, value STRING>>)
RETURNS STRING
LANGUAGE js AS """
for (var j = 0; j < arr.length; j++){
if(arr[j].index == index){
return arr[j].value;
}
}
""";
SELECT
fullvisitorId,
visitId,
hit.hitnumber,
customDimensionByIndex(6, hit.customDimensions) as author,
customDimensionByIndex(7, hit.customDimensions) as category
FROM `123456.ga_sessions_YYYYMMDD`
JOIN
UNNEST(hits) as hit

With a SQL UDF:
#standardSQL
CREATE TEMP FUNCTION customDimensionByIndex(indx INT64, arr ARRAY<STRUCT<index INT64, value STRING>>) AS (
(SELECT x.value FROM UNNEST(arr) x WHERE indx=x.index)
);
SELECT
fullvisitorId,
visitId,
hit.hitnumber,
customDimensionByIndex(1, hit.customDimensions),
customDimensionByIndex(2, hit.customDimensions),
customDimensionByIndex(3, hit.customDimensions)
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`, UNNEST(hits) hit
LIMIT 1000
I'm not sure why the original solution looks at "hit" instead of the column "hits" on the sample dataset - so to get to individual hits I had to UNNEST() them too.
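To make the lookup semantics concrete, here is a minimal Python sketch of what the SQL UDF computes for a single hit. The function name custom_dimension_by_index and the sample data are hypothetical. The one behaviour carried over from BigQuery is that a scalar subquery returning more than one row is an error, so duplicate indexes are rejected here as well.

```python
from typing import Dict, List, Optional, Union


def custom_dimension_by_index(
    index: int, dims: List[Dict[str, Union[int, str]]]
) -> Optional[str]:
    """Return the value of the custom dimension with the given index.

    Mirrors the SQL UDF: no match gives None (NULL in SQL); more than one
    match is treated as an error, like a scalar subquery with several rows.
    """
    matches = [d["value"] for d in dims if d["index"] == index]
    if len(matches) > 1:
        raise ValueError(f"more than one custom dimension with index {index}")
    return matches[0] if matches else None


# Hypothetical hit.customDimensions array for a single hit.
hit_custom_dimensions = [
    {"index": 6, "value": "Jane Doe"},
    {"index": 7, "value": "News"},
]

author = custom_dimension_by_index(6, hit_custom_dimensions)
category = custom_dimension_by_index(7, hit_custom_dimensions)
missing = custom_dimension_by_index(9, hit_custom_dimensions)
```

An index that is not set on the hit simply yields None, which corresponds to the NULL the SQL version returns.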

Related

Calculate nested field without losing export schema in BigQuery

Calculation on a field leads to a loss of the original export schema in BigQuery.
I have a standard enhanced e-commerce schema and want to change the transactionRevenue to a different currency. I want to keep the general export schema structure. The calculated field "transactionRevenueNewCurrency" should be in hits.transaction.transactionRevenueNewCurrency.
#standardSQL
SELECT
s.*,
ARRAY(SELECT COALESCE( x.transaction.transactionRevenue*1.17,0)
FROM UNNEST(hits) AS x) AS transactionRevenueNewCurrency
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` as s , UNNEST(hits) as h
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20160831'
AND transaction.transactionRevenue >0
LIMIT 10000
The new field is attached to the session instead of to each hit.
Below is for BigQuery Standard SQL
#standardSQL
SELECT * REPLACE(
ARRAY(
SELECT AS STRUCT * REPLACE(
(SELECT AS STRUCT * REPLACE(
COALESCE(CAST(transactionRevenue * 1.17 AS INT64), 0
) AS transactionRevenue)
FROM UNNEST([transaction])
) AS transaction)
FROM UNNEST(hits) hit
) AS hits)
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20160831'

Google BigQuery - Updating nested Revenue fields

I tried to apply the solution from "Google BigQuery - Updating a nested repeated field" to the field hits.transaction.transactionRevenue, but I receive the error message:
Scalar subquery produced more than one element
I have tried to run the following query:
UPDATE `project_id.dataset_id.table`
SET hits = ARRAY(
SELECT AS STRUCT * REPLACE (
(SELECT AS STRUCT transaction.* REPLACE (1 AS transactionRevenue)) AS transaction
)
FROM UNNEST(hits) as transactionRevenue
)
WHERE (select h.transaction.transactionId from unnest(hits) as h) LIKE 'ABC123XYZ'
Are there any obvious mistakes on my part? Would be great if anyone could share some tips or experiences that could help me with this.
What I basically want to do is to set the revenue of a specific transaction to 1.
Many thanks in advance,
David
This is the problem:
WHERE (select h.transaction.transactionId from unnest(hits) as h) LIKE 'ABC123XYZ'
If there is more than one hit in the array, this will cause the error that you are seeing. You probably want this instead:
WHERE EXISTS (select 1 from unnest(hits) as h WHERE h.transaction.transactionId LIKE 'ABC123XYZ')
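But note that your UPDATE will now rewrite every element of the array for any row where this condition is true. Moving the condition into a WHERE inside the ARRAY subquery would not help, because that would drop the non-matching hits from the array. What you probably want is to make the replacement itself conditional:
UPDATE `project_id.dataset_id.table`
SET hits = ARRAY(
SELECT AS STRUCT * REPLACE (
(SELECT AS STRUCT transaction.* REPLACE (
IF(transaction.transactionId LIKE 'ABC123XYZ', 1, transaction.transactionRevenue) AS transactionRevenue)) AS transaction
)
FROM UNNEST(hits) as h
)
WHERE EXISTS (select 1 from unnest(hits) as h WHERE h.transaction.transactionId LIKE 'ABC123XYZ')
Now the replacement only applies to hits with a transaction ID matching the pattern, and all other hits are kept unchanged.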
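When rewriting a repeated field this way, it helps to keep in mind that a WHERE clause inside ARRAY(SELECT ...) filters the elements, while a conditional expression inside REPLACE keeps every element and only changes the matching ones. The Python sketch below illustrates the difference on a hypothetical hits list. The field names follow the export schema, but the data and the helper set_revenue are made up for illustration.

```python
import copy

# Hypothetical hits array for one session row.
hits = [
    {"hitNumber": 1, "transaction": {"transactionId": "OTHER", "transactionRevenue": 500}},
    {"hitNumber": 2, "transaction": {"transactionId": "ABC123XYZ", "transactionRevenue": 900}},
]


def set_revenue(hit: dict, revenue: int) -> dict:
    """Return a copy of the hit with transactionRevenue replaced."""
    updated = copy.deepcopy(hit)
    updated["transaction"]["transactionRevenue"] = revenue
    return updated


# Analogue of a WHERE inside ARRAY(...): non-matching hits are dropped.
filtered = [
    set_revenue(h, 1)
    for h in hits
    if h["transaction"]["transactionId"] == "ABC123XYZ"
]

# Analogue of IF(...) inside REPLACE: every hit is kept, only matches change.
conditional = [
    set_revenue(h, 1) if h["transaction"]["transactionId"] == "ABC123XYZ" else h
    for h in hits
]
```

The filtered list ends up with a single hit, whereas the conditional list still contains both hits, with only the matching transaction set to 1.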

Accessing Struct(s) and Array(s) in Firebase Closed Funnels through BigQuery

I stumbled upon this standard SQL BigQuery documentation this week, which got me started with a Firebase Analytics closed funnel. However, I got the wrong results (see the output below). No user should have a "Tutorial_LessonCompleted" event without first having a "Tutorial_LessonStarted >> Lesson = 1" event. There could be various reasons for this.
Questions:
Is it wise to use the User Property = "first_open_time", or is it better to use the Event = "first_open"? What would the latter implementation look like?
I suspect I am perhaps not correctly drilling down to: Event (String = "Tutorial_LessonStarted") >> parameter (String = "LessonNumber") >> value (String = "lesson1")?
How would a filter on _TABLE_SUFFIX = '20170701' work? I read that this is cheaper. Any optimised code suggestions are received with open arms and an up-vote!
#standardSQL
SELECT
step1, step2, step3, step4, step5, step6,
COUNT(*) AS funnel_count,
COUNT(DISTINCT user_id) AS users
FROM (
SELECT
user_dim.app_info.app_instance_id AS user_id,
event.timestamp_micros AS event_timestamp,
event.name AS step1,
LEAD(event.name, 1) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step2,
LEAD(event.name, 2) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step3,
LEAD(event.name, 3) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step4,
LEAD(event.name, 4) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step5,
LEAD(event.name, 5) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step6
FROM
`......`,
UNNEST(event_dim) AS event,
UNNEST(user_dim.user_properties) AS user_prop
WHERE user_prop.key = "first_open_time"
ORDER BY 1, 2, 3, 4, 5 ASC
)
WHERE step6 = "Tutorial_LessonStarted" AND EXISTS (
SELECT *
FROM `......`,
UNNEST(event_dim) AS event,
UNNEST(event.params)
WHERE key = 'LessonNumber' AND value.string_value = "lesson1")
GROUP BY step1, step2, step3, step4, step5, step6
ORDER BY funnel_count DESC
LIMIT 100;
Note:
Enter your own table in the FROM clause, e.g. project_id.com_game_example_IOS.app_events_20170212.
I left out the funnel_count and user_count.
Output: (image not included)
Update since original question above:
@Elliot: I don't understand why you said: -- ensure that an event with lesson1 precedes Tutorial_LessonStarted.
Tutorial_LessonStarted has a parameter "LessonNumber" with values lesson1,lesson2,lesson3,lesson4.
I want to count all funnels that took place with a last step in the funnel equal to LessonNumber=lesson1.
So, applied to the event log data for a brand-new user's first session (i.e. a user who fired first_open_time), the answer would be the table below:
View.OnboardingWelcomePage
View.OnboardingFinalPage
View.JamLoading
View.JamLoading
Jam.UserViewsJam
Jam.ProjectOpened
View.JamMixer
Tutorial.LessonStarted (the value of the parameter "LessonNumber" would be "lesson1")
Jam.ProjectPlayStarted
View.JamLoopSelector
View.JamMixer
View.JamLoopSelector
View.JamMixer
View.JamLoopSelector
View.JamMixer
Tutorial.LessonCompleted
Tutorial.LessonStarted (the value of the parameter "LessonNumber" would be "lesson2")
So it is important first to get all the users that had a first_open_time on a specific day, and then to structure the events into a funnel so that the last event in the funnel is one that matches an event name and a specific parameter value, forming the funnel "backwards" from there.
Let me go through some explanation, then see if I can suggest a query to get you started.
It looks like you want to analyze the sequence of events in your analytics data, but the sequence is already there for you--you have an array of the events. Looking at the Firebase schema for BigQuery, event_dim is the relevant column, and unless I'm misunderstanding something, these events are ordered by time. If you want to check what the sixth event's name was, you can use:
event_dim[SAFE_ORDINAL(6)].name
This will evaluate to NULL if there were fewer than six events, or else it will give you the string with the event name.
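As a rough illustration of these semantics, the Python sketch below mimics SAFE_ORDINAL on a plain list: positions are 1-based, and an out-of-range position yields None instead of an error. The helper name safe_ordinal and the event names are hypothetical.

```python
from typing import List, Optional


def safe_ordinal(items: List[str], position: int) -> Optional[str]:
    """Return the element at a 1-based position, or None if out of range."""
    if 1 <= position <= len(items):
        return items[position - 1]
    return None


# Hypothetical event names for one row of event_dim.
event_names = ["first_open", "View.JamLoading", "Tutorial_LessonStarted"]

third = safe_ordinal(event_names, 3)  # within range
sixth = safe_ordinal(event_names, 6)  # fewer than six events
```

Plain ORDINAL, by contrast, raises an error for an out-of-range position, so it is only safe to use once the rows have been filtered to those with enough events.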
Another observation is that you are attempting to analyze both event_dim and user_dim, but you are taking the cross product of the two, which will explode the number of rows and make it hard to reason about the results of the query. To look for a specific user property, use an expression of this form:
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'first_open_time') = '<expected property value>'
Combining these two filters, your FROM and WHERE clause would look something like this:
FROM `project_id.com_game_example_IOS.app_events_*`
WHERE _TABLE_SUFFIX = '20170701' AND
event_dim[SAFE_ORDINAL(6)].name = 'Tutorial_LessonStarted' AND
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'first_open_time') = '<expected property value>'
Using the bracket operator to access the steps from event_dim, we can do something like this:
WITH FilteredInput AS (
SELECT *
FROM `project_id.com_game_example_IOS.app_events_*`
WHERE _TABLE_SUFFIX = '20170701' AND
event_dim[SAFE_ORDINAL(6)].name = 'Tutorial_LessonStarted' AND
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'first_open_time') = '<expected property value>' AND
-- ensure that an event with lesson1 precedes Tutorial_LessonStarted
EXISTS (
SELECT 1
FROM UNNEST(event_dim) WITH OFFSET event_offset
CROSS JOIN UNNEST(params)
WHERE key = 'LessonNumber' AND
value.string_value = 'lesson1' AND
event_offset < 5
)
)
SELECT
event_dim[ORDINAL(1)].name AS step1,
event_dim[ORDINAL(2)].name AS step2,
event_dim[ORDINAL(3)].name AS step3,
event_dim[ORDINAL(4)].name AS step4,
event_dim[ORDINAL(5)].name AS step5,
event_dim[ORDINAL(6)].name AS step6,
COUNT(*) AS funnel_count,
COUNT(DISTINCT user_dim.user_id) AS users
FROM FilteredInput
GROUP BY step1, step2, step3, step4, step5, step6;
This will return all unique "paths" along with a count and number of distinct users for each. Note that I'm just writing this off the top of my head--I don't have representative data that I can try it on--so there may be syntax or other errors.

How to get the Google Analytics definition of unique page views in Bigquery

https://support.google.com/analytics/answer/1257084?hl=en-GB#pageviews_vs_unique_views
I'm trying to calculate the sum of unique page views per day, as shown in the Google Analytics interface.
How do I get the equivalent using BigQuery?
There are two ways this is commonly computed:
1) One, as the linked documentation describes, is to combine the fullVisitorId with the session ID (visitId) and count the distinct combinations.
SELECT
EXACT_COUNT_DISTINCT(combinedVisitorId)
FROM (
SELECT
CONCAT(fullVisitorId,string(VisitId)) AS combinedVisitorId
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE' )
2) The other is just counting distinct fullVisitorIds
SELECT
EXACT_COUNT_DISTINCT(fullVisitorId)
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE'
If you want to try this out on a public sample dataset, there is a tutorial on how to add the sample dataset.
The other queries didn't match the Unique Pageviews metric in my Google Analytics account, but the following did:
SELECT COUNT(1) as unique_pageviews
FROM (
SELECT
hits.page.pagePath,
hits.page.pageTitle,
fullVisitorId,
visitNumber,
COUNT(1) as hits
FROM [my_table]
WHERE hits.type='PAGE'
GROUP BY
hits.page.pagePath,
hits.page.pageTitle,
fullVisitorId,
visitNumber
)
For uniquePageviews, you are better off using something like this:
SELECT
date,
SUM(uniquePageviews) AS uniquePageviews
FROM (
SELECT
date,
CONCAT(fullVisitorId,string(VisitId)) AS combinedVisitorId,
EXACT_COUNT_DISTINCT(hits.page.pagePath) AS uniquePageviews
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE'
GROUP BY 1,2)
GROUP EACH BY 1;
As of 2022, EXACT_COUNT_DISTINCT() is only available in legacy SQL; in standard SQL, COUNT(DISTINCT ...) does the same job.
Also, for me the combination of fullvisitorid + visitNumber + visitStartTime + hits.page.pagePath was always more precise than the solutions above:
SELECT
SUM(Unique_PageViews)
FROM
(SELECT
COUNT(DISTINCT(CONCAT(fullvisitorid,"-",CAST(visitNumber AS string),"-",CAST(visitStartTime AS string),"-",hits.page.pagePath))) as Unique_PageViews
FROM
`mhsd-bigquery-project.8330566.ga_sessions_*`,
unnest(hits) as hits
WHERE
_table_suffix BETWEEN '20220307'
AND '20220313'
AND hits.type = 'PAGE')
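The idea behind these queries is that a page counts once per session, no matter how often it was viewed in that session. Here is a small Python sketch of that counting logic, assuming flattened hit rows with hypothetical values and using the same key as the last query (fullVisitorId, visitNumber, visitStartTime, pagePath):

```python
# Hypothetical flattened hits: one dict per hit.
hits = [
    {"fullVisitorId": "A", "visitNumber": 1, "visitStartTime": 100, "type": "PAGE", "pagePath": "/home"},
    {"fullVisitorId": "A", "visitNumber": 1, "visitStartTime": 100, "type": "PAGE", "pagePath": "/home"},
    {"fullVisitorId": "A", "visitNumber": 1, "visitStartTime": 100, "type": "PAGE", "pagePath": "/pricing"},
    {"fullVisitorId": "A", "visitNumber": 2, "visitStartTime": 900, "type": "PAGE", "pagePath": "/home"},
    {"fullVisitorId": "B", "visitNumber": 1, "visitStartTime": 150, "type": "EVENT", "pagePath": "/home"},
]

# Only PAGE hits are considered, as in the WHERE clause of the queries.
page_hits = [h for h in hits if h["type"] == "PAGE"]

# Pageviews: every PAGE hit counts.
pageviews = len(page_hits)

# Unique pageviews: each (session, page) combination counts once.
unique_pageviews = len({
    (h["fullVisitorId"], h["visitNumber"], h["visitStartTime"], h["pagePath"])
    for h in page_hits
})
```

In this sample, the repeated view of /home within visitor A's first session is counted only once, while the view of /home in the second session counts again.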

SQLite Group By Limit

I have a web service that generates radio station playlists and I'm trying to ensure that playlists never have tracks from the same artist more than n times.
So for example (unless it is Mandatory Metallica --haha) then no artist should ever dominate any 8 hour programming segment.
Today we use a query similar to this which generates smaller randomized playlists out of existing very large playlists:
SELECT FilePath FROM vwPlaylistTracks
WHERE Owner='{0}' COLLATE NOCASE AND
Playlist='{1}' COLLATE NOCASE
ORDER BY RANDOM()
LIMIT {2};
Someone then has to manually review the playlists and do some manual editing if the same artist appears consecutively or more than the desired limit.
Supposing the producer wants to ensure that no artist appears more than twice in the span of the playlist generated in this query (and assuming there is an artist field in the vwPlaylistTracks view; which there is) is GROUP BY the correct way to accomplish this?
I've been messing around with the view trying to accomplish this but this query always only returns 1 track from each artist.
SELECT
a.Name as 'Artist',
f.parentPath || '\' || f.fileName as 'FilePath',
p.name as 'Playlist',
u.username as 'Owner'
FROM mp3_file f,
mp3_track t,
mp3_artist a,
mp3_playlist_track pt,
mp3_playlist p,
mp3_user u
WHERE f.file_id = t.track_id
AND t.artist_id = a.artist_id
AND t.track_id = pt.track_id
AND pt.playlist_id = p.playlist_id
AND p.user_id = u.user_id
--AND p.Name = 'Alternative Rock'
GROUP BY a.Name
--HAVING Count(a.Name) < 3
--ORDER BY RANDOM()
--LIMIT 50;
GROUP BY creates exactly one result record for each distinct value in the grouped column, so this is not what you want.
You have to count any previous records with the same artist, which is not easy because the random ordering is not stable.
However, this is possible with a temporary table, which is ordered by its rowid:
CREATE TEMPORARY TABLE RandomTracks AS
SELECT a.Name as Artist, parentPath, name, username
FROM ...
WHERE ...
ORDER BY RANDOM();
CREATE INDEX RandomTracks_Artist on RandomTracks(Artist);
SELECT *
FROM RandomTracks AS r1
WHERE -- filter out if there are any two previous records with the same artist
(SELECT COUNT(*)
FROM RandomTracks AS r2
WHERE r2.Artist = r1.Artist
AND r2.rowid < r1.rowid
) < 2
AND -- filter out if the directly previous record has the same artist
r1.Artist IS NOT (SELECT Artist
FROM RandomTracks AS r3
WHERE r3.rowid = r1.rowid - 1)
LIMIT 50;
DROP TABLE RandomTracks;
It might be easier and faster to just read the entire playlist and to filter and reorder it in your code.
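A minimal sketch of that in-code approach in Python, assuming the rows have already been read from vwPlaylistTracks into dictionaries with Artist and FilePath keys. The function name pick_tracks, its parameters and the sample rows are hypothetical. It shuffles the tracks and then greedily skips any track that would exceed the per-artist limit or repeat the previous artist.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


def pick_tracks(
    tracks: List[Dict[str, str]],
    limit: int,
    max_per_artist: int = 2,
    seed: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Pick up to `limit` tracks in random order.

    No artist appears more than `max_per_artist` times, and the same artist
    never appears twice in a row. Tracks that break a rule are skipped.
    """
    rng = random.Random(seed)
    shuffled = tracks[:]
    rng.shuffle(shuffled)

    playlist: List[Dict[str, str]] = []
    per_artist: Counter = Counter()
    for track in shuffled:
        if len(playlist) >= limit:
            break
        artist = track["Artist"]
        if per_artist[artist] >= max_per_artist:
            continue  # artist already at the limit
        if playlist and playlist[-1]["Artist"] == artist:
            continue  # would repeat the previous artist
        playlist.append(track)
        per_artist[artist] += 1
    return playlist


# Hypothetical rows, as they might come back from vwPlaylistTracks.
rows = [
    {"Artist": f"Artist {i % 4}", "FilePath": f"C:\\music\\track{i}.mp3"}
    for i in range(20)
]
playlist = pick_tracks(rows, limit=6, seed=42)
```

Because skipped tracks are simply discarded, the result can be shorter than the requested limit when the source playlist is dominated by a few artists. A production version might retry with a different shuffle in that case.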
