Firestore BigQuery extension - performance - firebase

I am using the Firestore BigQuery extension to stream data to Google BigQuery.
The data is stored in JSON format, so I think best practice is to generate schema views with this library.
When I now run my BI tool on those views to aggregate and filter some data, I see really poor performance for the resulting queries in BigQuery.
Is there a better approach to get this done? I was thinking of using materialized views, but the schema view script already builds views on top of other views, and you can't do that with materialized views. I think I am missing something in my whole setup, because I am only talking about collections with a few thousand records in them.
EDIT
A real-life example from my prod environment:
The raw data table transaction_raw_changelog comes directly from the Firestore extension.
Generating schema views creates 2 views:
transaction_raw_latest
-- Retrieves the latest document change events for all live documents.
-- timestamp: The Firestore timestamp at which the event took place.
-- operation: One of INSERT, UPDATE, DELETE, IMPORT.
-- event_id: The id of the event that triggered the cloud function that mirrored the event.
-- data: A raw JSON payload of the current state of the document.
-- document_id: The document id as defined in the Firestore database
SELECT
  document_name,
  document_id,
  timestamp,
  event_id,
  operation,
  data
FROM (
  SELECT
    document_name,
    document_id,
    FIRST_VALUE(timestamp) OVER(
      PARTITION BY document_name ORDER BY timestamp DESC
    ) AS timestamp,
    FIRST_VALUE(event_id) OVER(
      PARTITION BY document_name ORDER BY timestamp DESC
    ) AS event_id,
    FIRST_VALUE(operation) OVER(
      PARTITION BY document_name ORDER BY timestamp DESC
    ) AS operation,
    FIRST_VALUE(data) OVER(
      PARTITION BY document_name ORDER BY timestamp DESC
    ) AS data,
    FIRST_VALUE(operation) OVER(
      PARTITION BY document_name ORDER BY timestamp DESC
    ) = "DELETE" AS is_deleted
  FROM
    `swipedrinks-app.transaction.transaction_raw_changelog`
  ORDER BY
    document_name,
    timestamp DESC
)
WHERE
  NOT is_deleted
GROUP BY
  document_name,
  document_id,
  timestamp,
  event_id,
  operation,
  data
transaction_schema_transaction_schema_latest
-- Given a user-defined schema over a raw JSON changelog, returns the
-- schema elements of the latest set of live documents in the collection.
-- timestamp: The Firestore timestamp at which the event took place.
-- operation: One of INSERT, UPDATE, DELETE, IMPORT.
-- event_id: The event that wrote this row.
-- <schema-fields>: This can be one, many, or no typed-columns
-- corresponding to fields defined in the schema.
SELECT
  * EXCEPT (orderitem)
FROM (
  SELECT
    document_name,
    document_id,
    timestamp,
    operation,
    amount,
    bartenderId,
    eventStandId,
    event_id,
    paymentMethod,
    type,
    orderitem,
    toUserId
  FROM (
    SELECT
      document_name,
      document_id,
      FIRST_VALUE(timestamp) OVER(
        PARTITION BY document_name ORDER BY timestamp DESC
      ) AS timestamp,
      FIRST_VALUE(operation) OVER(
        PARTITION BY document_name ORDER BY timestamp DESC
      ) AS operation,
      FIRST_VALUE(operation) OVER(
        PARTITION BY document_name ORDER BY timestamp DESC
      ) = "DELETE" AS is_deleted,
      `swipedrinks-app.transaction.firestoreNumber`(
        FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.amount')) OVER(
          PARTITION BY document_name ORDER BY timestamp DESC
        )
      ) AS amount,
      FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.bartenderId')) OVER(
        PARTITION BY document_name ORDER BY timestamp DESC
      ) AS bartenderId,
      FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.eventStandId')) OVER(
        PARTITION BY document_name ORDER BY timestamp DESC
      ) AS eventStandId,
      FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.event_id')) OVER(
        PARTITION BY document_name ORDER BY timestamp DESC
      ) AS event_id,
      `swipedrinks-app.transaction.firestoreNumber`(
        FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.paymentMethod')) OVER(
          PARTITION BY document_name ORDER BY timestamp DESC
        )
      ) AS paymentMethod,
      FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.type')) OVER(
        PARTITION BY document_name ORDER BY timestamp DESC
      ) AS type,
      `swipedrinks-app.transaction.firestoreArray`(
        FIRST_VALUE(JSON_EXTRACT(data, '$.order')) OVER(
          PARTITION BY document_name ORDER BY timestamp DESC
        )
      ) AS orderitem,
      FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.toUserId')) OVER(
        PARTITION BY document_name ORDER BY timestamp DESC
      ) AS toUserId
    FROM
      `swipedrinks-app.transaction.transaction_raw_latest`
  )
  WHERE
    NOT is_deleted
) transaction_raw_latest
LEFT JOIN UNNEST(transaction_raw_latest.orderitem) AS orderitem_member WITH OFFSET orderitem_index
GROUP BY
  document_name,
  document_id,
  timestamp,
  operation,
  amount,
  bartenderId,
  eventStandId,
  event_id,
  paymentMethod,
  type,
  toUserId,
  orderitem_index,
  orderitem_member
The view transaction_schema_transaction_schema_latest is the one that is easy to query, with all my recent data and a column per document property of the collection.
I'd like to query for the sum of amounts per transaction per event_id:
SELECT SUM(amount) FROM `swipedrinks-app.transaction.transaction_schema_transaction_schema_latest`
This query takes around 12 s, and I have 196,697 rows in this table.
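For clarity, the per-event_id aggregate I am after would presumably just be a grouped variant of the same query (it scans the same view):
SELECT event_id, SUM(amount) AS total_amount
FROM `swipedrinks-app.transaction.transaction_schema_transaction_schema_latest`
GROUP BY event_id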

This may help:
Try to use standard JSON extraction functions like JSON_VALUE instead of legacy JSON extraction functions like JSON_EXTRACT_SCALAR, as the docs say:
While these functions are supported by Google Standard SQL, we recommend using the functions in the previous table.
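For example, in the generated schema view the extraction would change along these lines (a sketch based on one column from the view above; JSON_VALUE is the GoogleSQL equivalent of the legacy JSON_EXTRACT_SCALAR):
-- legacy
FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.bartenderId')) OVER(
  PARTITION BY document_name ORDER BY timestamp DESC
) AS bartenderId,
-- GoogleSQL equivalent
FIRST_VALUE(JSON_VALUE(data, '$.bartenderId')) OVER(
  PARTITION BY document_name ORDER BY timestamp DESC
) AS bartenderId,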
Try to use BigQuery BI Engine; in some cases, with this optimization activated, it helps reduce the actual BigQuery response time. As far as I know, 12 s for a response in BQ is not an issue in itself; you have to watch whether this time scales as your data grows. Maybe BigQuery is not great for your solution; have you considered Cloud Bigtable? Take a look at this material.
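If BI Engine is not enough: since a materialized view can't be defined on top of these nested views (as you noted), one common workaround is to snapshot the schema view into a plain table with a scheduled query and point the BI tool at the table. A minimal sketch, reusing the view name from your question (the snapshot table name is made up for this example):
-- Run as a scheduled query, e.g. hourly:
CREATE OR REPLACE TABLE `swipedrinks-app.transaction.transaction_schema_latest_snapshot` AS
SELECT *
FROM `swipedrinks-app.transaction.transaction_schema_transaction_schema_latest`;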

Related

Finding the oldest customers in a sql database

I'm trying to find the oldest person in a SQL database that has the following configuration:
Customers (
cardNo INTEGER PRIMARY KEY,
first TEXT,
last TEXT,
sex CHAR,
dob DATE
)
I'm trying to find the oldest customers in the database, of which there are 28 (they have the same dob). I'm not sure how to get multiple results from the MIN() function.
You can do this with a subquery.
Something like:
SELECT first, last FROM Customers
WHERE
dob = (SELECT MIN(dob) FROM Customers);
I believe that MIN() / MAX() are aggregate functions, which means they return a single scalar value.
More info on aggregate functions can be found here: Aggregate functions info
But to solve your problem, the query should be like this.
MS SQL
SELECT
  c.first,
  c.last
FROM Customers c
WHERE c.dob IS NOT NULL
  AND c.dob = (
    SELECT TOP 1 cc.dob
    FROM Customers cc
    WHERE cc.dob IS NOT NULL
    ORDER BY cc.dob
  )
ORDER BY c.dob
SQLite
SELECT
  c.first,
  c.last
FROM Customers c
WHERE c.dob IS NOT NULL
  AND c.dob = (
    SELECT cc.dob
    FROM Customers cc
    WHERE cc.dob IS NOT NULL
    ORDER BY cc.dob
    LIMIT 1
  )
ORDER BY c.dob
I think it will still need optimization. Hope this helps. :)

Firebase Events for Newly Installed Purchaser Cohort in Bigquery

Given the install date of Android users, I would like to get the user count for all our 200+ Firebase events on day0 to dayX, for users who have already made at least one purchase in a defined period after installation. The first half of this question was previously solved in this question. I thought it would be helpful to share an added "purchaser" cohort query for others to reuse.
My first attempt (which failed):
-- STANDARD SQL
-- NEW BIGQUERY EXPORT SCHEMA
SELECT
a.event_name AS event_name,
a._TABLE_SUFFIX as day,
COUNT(1) as users
FROM `xxxx.analytics_xxxx.events_*` as c
RIGHT JOIN (SELECT user_pseudo_id, event_date, event_timestamp, event_name
FROM `xxxx.analytics_xxxx.events_*`
WHERE user_first_touch_timestamp BETWEEN 1530453600000000 AND 1530468000000000
AND _TABLE_SUFFIX BETWEEN '20180630' AND '20180707'
AND platform = "ANDROID"
AND (event_name = 'in_app_purchase' OR event_name = 'ecommerce_purchase')
) as a
ON a.user_pseudo_id = c.user_pseudo_id
WHERE _TABLE_SUFFIX BETWEEN '20180630' AND '20180707'
GROUP BY event_name, day;
Answer:
-- STANDARD SQL
-- NEW BIGQUERY EXPORT SCHEMA
SELECT
event_name AS event_name,
_TABLE_SUFFIX as day,
COUNT(1) as users
FROM `xxxx.analytics_xxxx.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180630' AND '20180707'
AND user_pseudo_id IN (SELECT user_pseudo_id
FROM `xxxx.analytics_xxxx.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180630' AND '20180707'
AND user_first_touch_timestamp BETWEEN 1530453600000000 AND 1530468000000000
AND (event_name = 'in_app_purchase' OR event_name = 'ecommerce_purchase')
AND platform = "ANDROID")
GROUP BY event_name, day;
PS: Suggestions to optimize this script are always welcome :)
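One possible tweak (untested, just a sketch of the idea): COUNT(1) counts event rows, so users here is really events per event_name and day. If the goal is unique users, COUNT(DISTINCT user_pseudo_id) would give that instead:
SELECT
  event_name,
  _TABLE_SUFFIX AS day,
  COUNT(DISTINCT user_pseudo_id) AS users
FROM `xxxx.analytics_xxxx.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180630' AND '20180707'
  AND user_pseudo_id IN (
    SELECT user_pseudo_id
    FROM `xxxx.analytics_xxxx.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20180630' AND '20180707'
      AND user_first_touch_timestamp BETWEEN 1530453600000000 AND 1530468000000000
      AND event_name IN ('in_app_purchase', 'ecommerce_purchase')
      AND platform = "ANDROID")
GROUP BY event_name, day;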

Firebase BigQuery schema migration: Move into a partitioned table?

I got the email with instructions to migrate my previous Firebase tables in BigQuery to the new schema. They point to these instructions:
https://support.google.com/analytics/answer/7029846?#migrationscript
But I'd prefer to:
Instead of running a bash script, I'd rather run only one query that executes the migration.
Instead of creating a number of new tables, I'd rather move all the previous results to a new date partitioned table.
I took the script on the documentation and made some changes.
Look at all the --Fh comments. Those are my modifications.
Choose your destination table.
Choose your date range for Android and IOS.
Note that I'm adding a new column with a real timestamp for partitioning (and your convenience).
Instead of getting a number of new tables, you'll only get one - but partitioned by date.
Modified script:
#standardSQL
CREATE OR REPLACE TABLE `fh-bigquery.deleting.delete`
PARTITION BY DATE(ts)
AS
WITH sources AS ( --Fh
SELECT * FROM (
SELECT *, _table_suffix event_date, 'ANDROID' operating_system
FROM `firebase-public-project.com_firebase_demo_ANDROID.app_events_*`
UNION ALL SELECT *, _table_suffix event_date, 'IOS' operating_system
FROM `firebase-public-project.com_firebase_demo_IOS.app_events_*`
)
WHERE event_date BETWEEN '20180503' AND '20180504' --Fh: choose your timerange
)
SELECT
event_date, --Fh: extracted from original table name
TIMESTAMP_MICROS(event.timestamp_micros) ts, --Fh: adding a real timestamp column
event.timestamp_micros AS event_timestamp,
event.previous_timestamp_micros AS event_previous_timestamp,
event.name AS event_name,
event.value_in_usd AS event_value_in_usd,
user_dim.bundle_info.bundle_sequence_id AS event_bundle_sequence_id,
user_dim.bundle_info.server_timestamp_offset_micros as event_server_timestamp_offset,
(
SELECT
ARRAY_AGG(STRUCT(event_param.key AS key,
STRUCT(event_param.value.string_value AS string_value,
event_param.value.int_value AS int_value,
event_param.value.double_value AS double_value,
event_param.value.float_value AS float_value) AS value))
FROM
UNNEST(event.params) AS event_param) AS event_params,
user_dim.first_open_timestamp_micros AS user_first_touch_timestamp,
user_dim.user_id AS user_id,
user_dim.app_info.app_instance_id AS user_pseudo_id,
"" AS stream_id,
user_dim.app_info.app_platform AS platform,
STRUCT( user_dim.ltv_info.revenue AS revenue,
user_dim.ltv_info.currency AS currency ) AS user_ltv,
STRUCT( user_dim.traffic_source.user_acquired_campaign AS name,
user_dim.traffic_source.user_acquired_medium AS medium,
user_dim.traffic_source.user_acquired_source AS source ) AS traffic_source,
STRUCT( user_dim.geo_info.continent AS continent,
user_dim.geo_info.country AS country,
user_dim.geo_info.region AS region,
user_dim.geo_info.city AS city ) AS geo,
STRUCT( user_dim.device_info.device_category AS category,
user_dim.device_info.mobile_brand_name,
user_dim.device_info.mobile_model_name,
user_dim.device_info.mobile_marketing_name,
user_dim.device_info.device_model AS mobile_os_hardware_model,
operating_system, --Fh
user_dim.device_info.platform_version AS operating_system_version,
user_dim.device_info.device_id AS vendor_id,
user_dim.device_info.resettable_device_id AS advertising_id,
user_dim.device_info.user_default_language AS language,
user_dim.device_info.device_time_zone_offset_seconds AS time_zone_offset_seconds,
IF(user_dim.device_info.limited_ad_tracking, "Yes", "No") AS is_limited_ad_tracking ) AS device,
STRUCT( user_dim.app_info.app_id AS id,
'app_id' AS firebase_app_id, --Fh: choose your app id
user_dim.app_info.app_version AS version,
user_dim.app_info.app_store AS install_source ) AS app_info,
( SELECT ARRAY_AGG(STRUCT(user_property.key AS key,
STRUCT(user_property.value.value.string_value AS string_value,
user_property.value.value.int_value AS int_value,
user_property.value.value.double_value AS double_value,
user_property.value.value.float_value AS float_value,
user_property.value.set_timestamp_usec AS set_timestamp_micros ) AS value))
FROM UNNEST(user_dim.user_properties) AS user_property
) AS user_properties
FROM sources -- Fh
, UNNEST(event_dim) AS event
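Once the data is in the partitioned table, queries can prune on the partition column, which is the main payoff of this migration. A quick usage sketch against the destination table created above:
SELECT event_name, COUNT(*) AS events
FROM `fh-bigquery.deleting.delete`
WHERE DATE(ts) = '2018-05-03'  -- partition pruning: only this day's partition is scanned
GROUP BY event_name;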

Accessing Struct(s) and Array(s) in Firebase Closed Funnels through BigQuery

I stumbled onto this standard SQL BigQuery documentation this week, which got me started with a Firebase Analytics closed funnel. However, I got the wrong results (see the output below). There should be no users that had a "Tutorial_LessonCompleted" before they first started a "Tutorial_LessonStarted >> Lesson = 1". This could be for various reasons.
Questions:
Is it wise to use the user property "first_open_time", or is it better to use the event "first_open"? What would the latter implementation look like?
I suspect I am perhaps not correctly drilling down to: Event (String = "Tutorial_LessonStarted") >> parameter (String = "LessonNumber") >> value (String = "lesson1")?
How would a filter on _TABLE_SUFFIX = '20170701' work? I read this will be cheaper. Any optimised code suggestions are received with open arms and an upvote!
#standardSQL
SELECT
step1, step2, step3, step4, step5, step6,
COUNT(*) AS funnel_count,
COUNT(DISTINCT user_id) AS users
FROM (
SELECT
user_dim.app_info.app_instance_id AS user_id,
event.timestamp_micros AS event_timestamp,
event.name AS step1,
LEAD(event.name, 1) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step2,
LEAD(event.name, 2) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step3,
LEAD(event.name, 3) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step4,
LEAD(event.name, 4) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step5,
LEAD(event.name, 5) OVER (
PARTITION BY user_dim.app_info.app_instance_id
ORDER BY event.timestamp_micros ASC) as step6
FROM
`......`,
UNNEST(event_dim) AS event,
UNNEST(user_dim.user_properties) AS user_prop
WHERE user_prop.key = "first_open_time"
ORDER BY 1, 2, 3, 4, 5 ASC
)
WHERE step6 = "Tutorial_LessonStarted" AND EXISTS (
SELECT *
FROM `......`,
UNNEST(event_dim) AS event,
UNNEST(event.params)
WHERE key = 'LessonNumber' AND value.string_value = "lesson1")
GROUP BY step1, step2, step3, step4, step5, step6
ORDER BY funnel_count DESC
LIMIT 100;
Note:
Enter your table in the FROM clause, e.g. project_id.com_game_example_IOS.app_events_20170212.
I left out the funnel_count and user_count.
Output: (screenshot omitted)
Update since original question above:
@Elliot: I don't understand why you said "-- ensure that an event with lesson1 precedes Tutorial_LessonStarted".
Tutorial_LessonStarted has a parameter "LessonNumber" with values lesson1, lesson2, lesson3, lesson4.
I want to count all funnels that took place with a last step in the funnel equal to LessonNumber = lesson1.
So, applied to the event log data for a brand-new user's first session (i.e. a user that fired first_open_time), the answer would be the table below:
View.OnboardingWelcomePage
View.OnboardingFinalPage
View.JamLoading
View.JamLoading
Jam.UserViewsJam
Jam.ProjectOpened
View.JamMixer
Tutorial.LessonStarted (this parameter "LessonNumber"'s value would be equal to "lesson1")
Jam.ProjectPlayStarted
View.JamLoopSelector
View.JamMixer
View.JamLoopSelector
View.JamMixer
View.JamLoopSelector
View.JamMixer
Tutorial.LessonCompleted
Tutorial.LessonStarted (this parameter "LessonNumber"'s value would be equal to "lesson2")
So it is important to first get all the users that had a first_open_time on a specific day, then structure the events into a funnel so that the last event in the funnel is one that matches a specific event name and parameter value, and then form the funnel "backwards" from there.
Let me go through some explanation, then see if I can suggest a query to get you started.
It looks like you want to analyze the sequence of events in your analytics data, but the sequence is already there for you--you have an array of the events. Looking at the Firebase schema for BigQuery, event_dim is the relevant column, and unless I'm misunderstanding something, these events are ordered by time. If you want to check what the sixth event's name was, you can use:
event_dim[SAFE_ORDINAL(6)].name
This will evaluate to NULL if there were fewer than six events, or else it will give you the string with the event name.
Another observation is that you are attempting to analyze both event_dim and user_dim, but you are taking the cross product of the two, which will explode the number of rows and make it hard to reason about the results of the query. To look for a specific user property, use an expression of this form:
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'first_open_time') = '<expected property value>'
Combining these two filters, your FROM and WHERE clause would look something like this:
FROM `project_id.com_game_example_IOS.app_events_*`
WHERE _TABLE_SUFFIX = '20170701' AND
event_dim[SAFE_ORDINAL(6)].name = 'Tutorial_LessonStarted' AND
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'first_open_time') = '<expected property value>'
Using the bracket operator to access the steps from event_dim, we can do something like this:
WITH FilteredInput AS (
SELECT *
FROM `project_id.com_game_example_IOS.app_events_*`
WHERE _TABLE_SUFFIX = '20170701' AND
event_dim[SAFE_ORDINAL(6)].name = 'Tutorial_LessonStarted' AND
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'first_open_time') = '<expected property value>' AND
-- ensure that an event with lesson1 precedes Tutorial_LessonStarted
EXISTS (
SELECT 1
FROM UNNEST(event_dim) WITH OFFSET event_offset
CROSS JOIN UNNEST(params)
WHERE key = 'LessonNumber' AND
value.string_value = 'lesson1' AND
event_offset < 5
)
)
SELECT
event_dim[ORDINAL(1)].name AS step1,
event_dim[ORDINAL(2)].name AS step2,
event_dim[ORDINAL(3)].name AS step3,
event_dim[ORDINAL(4)].name AS step4,
event_dim[ORDINAL(5)].name AS step5,
event_dim[ORDINAL(6)].name AS step6,
COUNT(*) AS funnel_count,
COUNT(DISTINCT user_dim.user_id) AS users
FROM FilteredInput
GROUP BY step1, step2, step3, step4, step5, step6;
This will return all unique "paths" along with a count and number of distinct users for each. Note that I'm just writing this off the top of my head--I don't have representative data that I can try it on--so there may be syntax or other errors.

group_concat sqlite and order by

I am trying to order within group_concat in sqlite3. This is not supported in the way that it is in MySQL, so I cannot do this:
group_concat(logID order by logDateTime DESC)
My full query is below; I need logID to be ordered by logDateTime. I read about a subquery method here (Sqlite group_concat ordering), but I was unable to get it to work in my query.
SELECT logID,
NULL AS sessionID,
logDateTime,
NULL AS sessionName,
NULL AS NamesInRandomOrder
FROM logs
WHERE sessionID IS NULL
UNION ALL
SELECT NULL,
sessions.sessionID,
MAX(logDateTime),
sessions.sessionName,
group_concat(logID)
FROM logs
JOIN sessions ON logs.sessionID = sessions.sessionID
GROUP BY sessions.sessionID
ORDER BY 3 DESC
This orders the subquery, but SQLite doesn't guarantee that the order remains in all cases:
SELECT logID,
NULL AS sessionID,
logDateTime,
logName,
NULL AS NamesInRandomOrder
FROM logs
WHERE sessionID IS NULL
UNION ALL
SELECT NULL,
sessionID,
MAX(logDateTime),
sessionName,
group_concat(logName)
FROM (SELECT sessions.sessionID AS sessionID,
logDateTime,
sessionName,
logName
FROM logs JOIN sessions ON logs.sessionID = sessions.sessionID
ORDER BY sessions.sessionID,
logName)
GROUP BY sessionID
ORDER BY 3 DESC
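A side note for readers on newer versions: since SQLite 3.44, aggregate functions accept an ORDER BY clause directly, so the MySQL-style syntax from the question works as-is. A minimal sketch of just the aggregate part:
-- requires SQLite 3.44.0 or later
SELECT sessionID,
       group_concat(logID ORDER BY logDateTime DESC) AS logIDs
FROM logs
GROUP BY sessionID;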
