Daily schedule in BigQuery using data from Firebase analytics - firebase

I have created a daily schedule in BigQuery using the "Append to table" preference, so every day it adds yesterday's data to my specified table. The query is scheduled to run every day at 9 AM, but the issue is that Firebase sometimes creates the previous day's table in BigQuery later than 9 AM.
The example of daily scheduled SELECT I would be using is:
SELECT * FROM `analytics.events_*` WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
What would be the best practice to schedule a daily update for the previous day in BigQuery from Firebase, so there are no times where I am missing days?

BigQuery schedules are set to run at fixed times. If your incoming data varies in delivery time, then BigQuery schedules are not what you're looking for.
But if you insist on using BigQuery schedules, you could relax the WHERE condition and catch "missing" days the next time the schedule runs. That flips your problem: you now need to avoid appending rows that were already appended (which also increases query cost):
-- E.* returns only the source columns, so the result matches the target schema
SELECT E.*
FROM `analytics.events_*` AS E
LEFT JOIN `[target dataset].[target table]` AS T
USING (event_name, event_timestamp, user_pseudo_id)
WHERE T.event_name IS NULL
AND T.event_timestamp IS NULL
AND T.user_pseudo_id IS NULL
AND _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))
Or you could alternatively modify the query into an INSERT statement where you insert records and handle duplications similarly:
INSERT `[target dataset].[target table]`
-- E.* inserts only the source columns, not the joined target columns
SELECT E.*
FROM `analytics.events_*` AS E
LEFT JOIN `[target dataset].[target table]` AS T
USING (event_name, event_timestamp, user_pseudo_id)
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))
AND T.event_name IS NULL
AND T.event_timestamp IS NULL
AND T.user_pseudo_id IS NULL
Then you wouldn't need to configure a destination table for the schedule.
Furthermore, if your target table is partitioned by timestamp, you can reduce the amount of data scanned in the target table by adding a WHERE condition that restricts the scan to a single date instead of the entire table:
...
AND DATE(T.event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
...
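The anti-join used in both queries can also be sketched outside of SQL. The snippet below is a minimal illustration in Python, not BigQuery code: it assumes events are plain dictionaries and that (event_name, event_timestamp, user_pseudo_id) identifies a row, as in the USING clause above. The names event_key and rows_to_append are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Event = Dict[str, object]

# Assumption: these three fields together identify one event row.
KEY_FIELDS = ("event_name", "event_timestamp", "user_pseudo_id")


def event_key(event: Event) -> Tuple[object, ...]:
    """Build the join key for one event."""
    return tuple(event[field] for field in KEY_FIELDS)


def rows_to_append(source: Iterable[Event], target: Iterable[Event]) -> List[Event]:
    """Return the source rows whose key is not yet present in the target."""
    existing = {event_key(event) for event in target}
    return [event for event in source if event_key(event) not in existing]


# The target already holds one of the two source rows.
target = [
    {"event_name": "session_start", "event_timestamp": 1000, "user_pseudo_id": "u1"},
]
source = [
    {"event_name": "session_start", "event_timestamp": 1000, "user_pseudo_id": "u1"},
    {"event_name": "screen_view", "event_timestamp": 2000, "user_pseudo_id": "u1"},
]

new_rows = rows_to_append(source, target)
print(new_rows)
```

Running the same function again after the append returns nothing new, which is what makes a wider lookback window safe to re-run.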

Related

Is there a way to duplicate sessions from the app-warming issue in Firebase < 8.11.0 via BigQuery?

The release notes for version 8.12.1 of the Firebase Apple SDK mention an issue with session_start events being logged during app prewarming on iOS 15+, which inserts additional session_start events. I've noticed that these additional session_start rows also insert additional ga_session_id values into the BigQuery table.
ga_session_id is a unique session identifier associated with each event that occurs within a session, so a new one is created whenever this additional session_start fires during app prewarming. Using the session_number field and calculating session length, it's possible to remove sessions with just one session_start and a small session length, but this does not seem to reduce the overall count of sessions by much.
This has inflated the reported number of sessions when querying the BigQuery table and counting distinct user_pseudo_id||ga_session_id.
Is there a way to isolate these sessions in a separate table, or to exclude them with an additional clause in the query, so that sessions which are not truly sessions are removed?
https://github.com/firebase/firebase-ios-sdk/issues/6161
https://firebase.google.com/support/release-notes/ios
A simplified version of said query I'm using:
with windowTemp as
(
select
PARSE_DATE("%Y%m%d",event_date) as date_formatted,
event_name,
user_pseudo_id,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_id
from
`firebase-XXXX.analytics_XXX.events_*`
where
_table_suffix between '20210201' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
group by
1,2,3,4
)
SELECT
date_formatted,
Count(DISTINCT user_pseudo_id) AS users,
Count(DISTINCT Concat(user_pseudo_id, CAST(session_id AS STRING))) AS sessions,
FROM
windowTemp
GROUP by 1
ORDER BY 1

How to find time spent (engagement_time) on our app by the users in BigQuery?

I am trying to calculate the total time spent by users on my app. We have integrated Firebase Analytics data in BigQuery. Can I use the sum of the values of engagement_time_msec/1000 in the SELECT statement of my query? This is what I am trying:
SELECT SUM(x.value.int_value) FROM `[dataset]`, UNNEST(event_params) AS x WHERE x.key = "engagement_time_msec"
I am getting very big values after executing this query (it gives a huge number of hours per day). I am not sure whether it is OK to use SUM("engagement_time_msec") to calculate the total time spent by users on the app.
I am not expecting users to spend this much time on the app. Is this the right way to calculate engagement time, or which event is best for calculating it?
Any help would be highly appreciated.
As per the Google Analytics docs, engagement_time_msec is defined as "The additional engagement time (ms) since the last user_engagement event". Therefore, if you only look at this, you are losing all the previous time spent by users before the mentioned user_engagement event is triggered.
What I'd do, since now ga_session_id is defined, would be to grab the maximum and minimum for each ga_session_id timestamp, use the TIMESTAMP_DIFF() function for each case, and sum the results of all the sessions for a given day:
WITH ga_sessions AS (
SELECT
user_pseudo_id,
event_timestamp,
event_date,
params.value.int_value AS ga_session_id
FROM
`analytics_123456789.events_*`, UNNEST(event_params) AS params
WHERE
params.key = "ga_session_id"
),
session_length AS (
-- one row per session: ga_session_id is only unique per user, so group by both
SELECT
event_date,
user_pseudo_id,
ga_session_id,
TIMESTAMP_DIFF(MAX(TIMESTAMP_MICROS(event_timestamp)), MIN(TIMESTAMP_MICROS(event_timestamp)), SECOND) AS session_duration_seconds
FROM
ga_sessions
WHERE
ga_session_id IS NOT NULL
GROUP BY
1, 2, 3
),
final AS (
SELECT
event_date,
SUM(session_duration_seconds) as total_seconds_in_app
FROM
session_length
GROUP BY
1
ORDER BY
1 DESC
)
SELECT * FROM final
OUTPUT (data extracted from the app I work at):
event_date | total_seconds_in_app
-----------+--------------------
20210920 | 45600
20210919 | 43576
20210918 | 44539
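The per-session approach described above (minimum and maximum timestamp for each session, then a sum per day) can be checked on a handful of rows outside BigQuery. This is a rough Python sketch with made-up sample rows; it keys sessions by user_pseudo_id as well as ga_session_id, on the assumption that ga_session_id alone can repeat across users. The function name total_seconds_per_day is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (event_date, user_pseudo_id, ga_session_id, event_timestamp in microseconds)
Row = Tuple[str, str, int, int]


def total_seconds_per_day(rows: Iterable[Row]) -> Dict[str, int]:
    """Sum (last event - first event) per session, grouped by event_date."""
    bounds: Dict[Tuple[str, str, int], List[int]] = {}
    for event_date, user, session, ts in rows:
        key = (event_date, user, session)
        if key not in bounds:
            bounds[key] = [ts, ts]  # [earliest, latest]
        else:
            bounds[key][0] = min(bounds[key][0], ts)
            bounds[key][1] = max(bounds[key][1], ts)

    totals: Dict[str, int] = defaultdict(int)
    for (event_date, _user, _session), (first, last) in bounds.items():
        totals[event_date] += (last - first) // 1_000_000  # whole seconds
    return dict(totals)


# Made-up sample data: two sessions on one day, a single-event session on another.
rows = [
    ("20210920", "u1", 1, 0),
    ("20210920", "u1", 1, 90_000_000),
    ("20210920", "u2", 1, 10_000_000),
    ("20210920", "u2", 1, 40_000_000),
    ("20210919", "u1", 7, 5_000_000),
]

totals = total_seconds_per_day(rows)
print(totals)
```

A session with a single event contributes zero seconds, so this measure undercounts very short sessions.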

BigQuery -firebase export working different when using wildcard character and _TABLE_SUFFIX compared to without using it

My requirement:
To append unnested data to a separate table and use it for visualization and analytics.
Implementation:
I am not sure at what time exactly events_intraday_YYYYMMDD syncs into events_YYYYMMDD, so I do the following:
0- Created an events_normalized table once at the start (this is done once, not daily) using
create table analytics_data_export.events_normalized AS
SELECT .....
FROM
`analytics_xxxxxx.events_*`
to collect all the data from events_YYYYMMDD
1- Creating/Replacing a daily temp table with
create or replace table analytics_data_export.daily_data_temp AS
SELECT...
_TABLE_SUFFIX BETWEEN
FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY)) AND
FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
I have seen multiple days of data syncing together, so to be on the safe side I am using the last 1-4 days of data.
2- Deleting from events_normalized the rows that match daily_data_temp (the inner join of both tables) to remove any duplicates. For example, if events_normalized has data up to the 18th and daily_data_temp has data from the 16th to the 19th, all rows from the 16th to the 18th are removed from events_normalized.
3- Reinserting daily_data_temp into events_normalized
Questions:
1- Is there a more optimized way of implementing these requirements?
2- In step 0, while creating the events_normalized table, if I use:
WHERE
_TABLE_SUFFIX <=
FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 0 DAY))
I get different results compared to when I am using
create table analytics_data_export.events_normalized AS
SELECT .....
FROM
`analytics_xxxxxx.events_*`
The difference is that the latter also has the current date's data, whereas in events_YYYYMMDD I can only see data up to yesterday. I don't understand this behavior.
For example, if the current day is 20th July, in events_YYYYMMDD I can only see tables up to events_20200719.
To optimize, you can follow the steps below:
Create a hash from event_timestamp and the other fields that uniquely identify a row, and use it to filter the data.
Instead of deleting duplicate rows from the larger initial table, delete them from the small temp table and then insert what remains.
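A minimal Python sketch of the two steps, for illustration only. Which fields make a row unique is an assumption here (event_timestamp, event_name, user_pseudo_id), and the helper names row_hash and drop_already_loaded are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Set

# Assumption: these fields together identify one exported row.
HASH_FIELDS = ("event_timestamp", "event_name", "user_pseudo_id")


def row_hash(row: Dict[str, object]) -> str:
    """Hash the identifying fields, joined with a separator unlikely to occur in data."""
    joined = "\x1f".join(str(row[field]) for field in HASH_FIELDS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def drop_already_loaded(temp_rows: Iterable[Dict[str, object]],
                        loaded_hashes: Set[str]) -> List[Dict[str, object]]:
    """Filter the small temp table instead of deleting from the large one."""
    return [row for row in temp_rows if row_hash(row) not in loaded_hashes]


normalized = [
    {"event_timestamp": 1, "event_name": "first_open", "user_pseudo_id": "a"},
]
daily_temp = [
    {"event_timestamp": 1, "event_name": "first_open", "user_pseudo_id": "a"},
    {"event_timestamp": 2, "event_name": "screen_view", "user_pseudo_id": "a"},
]

loaded = {row_hash(row) for row in normalized}
to_insert = drop_already_loaded(daily_temp, loaded)
print(to_insert)
```

Storing the hash as a column in the large table would avoid recomputing it on every run.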
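As for the second question: it's because the wildcard analytics_xxxxxx.events_* matches both the daily events tables and the intraday tables, which are named like events_intraday_20200721. For an intraday table, _TABLE_SUFFIX is a string such as intraday_20200721, which sorts after any date string, so the <= filter excludes it.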
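The effect of the wildcard can be reproduced with plain string handling. This is a small Python sketch using the table names from the question; the helper name wildcard_suffixes is made up for illustration and only mimics how the part after the events_ prefix becomes the suffix.

```python
from typing import List


def wildcard_suffixes(table_names: List[str], prefix: str = "events_") -> List[str]:
    """Return the part of each table name that follows the wildcard prefix."""
    return [name[len(prefix):] for name in table_names if name.startswith(prefix)]


tables = ["events_20200719", "events_20200720", "events_intraday_20200721"]
suffixes = wildcard_suffixes(tables)

# String comparison: "intraday_..." sorts after any digit-only date string.
today = "20200721"
kept = [suffix for suffix in suffixes if suffix <= today]

print(suffixes)
print(kept)
```

Without the filter all three tables are read, including the intraday one with the current day's data; with the filter only the two daily tables remain.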

Firebase vs BigQuery Active Users Discrepancies

I've integrated my Firebase project with BigQuery. Now I'm facing a data discrepancy issue while trying to get 1 day active users, for the selected date i.e. 20190210, with following query from BigQuery;
SELECT COUNT(DISTINCT user_pseudo_id) AS `1_day_active_users_count`
FROM `MY_TABLE.events_*`
WHERE event_name = 'user_engagement' AND _TABLE_SUFFIX = '20190210'
But the figures returned from BigQuery don't match the ones reported on the Firebase Analytics dashboard for the same date. Any clue what's possibly going wrong here?
The following sample query mentioned by the Firebase team, here https://support.google.com/firebase/answer/9037342?hl=en&ref_topic=7029512, is not so helpful, as it takes the current time into consideration and gets users accordingly.
N-day active users
/**
* Builds an audience of N-Day Active Users.
*
* N-day active users = users who have logged at least one user_engagement
* event in the last N days.
*/
SELECT
COUNT(DISTINCT user_id) AS n_day_active_users_count
FROM
-- PLEASE REPLACE WITH YOUR TABLE NAME.
`YOUR_TABLE.events_*`
WHERE
event_name = 'user_engagement'
-- Pick events in the last N = 20 days.
AND event_timestamp >
UNIX_MICROS(TIMESTAMP_SUB(CURRENT_TIMESTAMP, INTERVAL 20 DAY))
-- PLEASE REPLACE WITH YOUR DESIRED DATE RANGE.
AND _TABLE_SUFFIX BETWEEN '20180521' AND '20240131';
So given the small discrepancy here, I believe the issue is one of timezones.
When you're looking at a "day" in the Firebase Console, you're looking at the time interval from midnight to midnight in whatever time zone you've specified when you first set up your project. When you're looking at a "day" in BigQuery, you're looking at the time interval from midnight to midnight in UTC.
If you want to make sure you're looking at the events that match up with what's in your console, you should query the event_timestamp value in your BigQuery table (and remember that it might span multiple tables) to match up with what's in your timezone.
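To illustrate the point above, here is a small Python sketch showing how the same event_timestamp (microseconds since the epoch) falls on different calendar days depending on the time zone. The fixed UTC-8 offset is an assumption chosen for the example and ignores daylight saving time; substitute your project's actual time zone. The function name local_event_date is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumption for illustration: the project reports in a fixed UTC-8 zone.
PROJECT_TZ = timezone(timedelta(hours=-8))


def local_event_date(event_timestamp_micros: int, tz: timezone) -> str:
    """Convert a microsecond timestamp to a YYYYMMDD date string in the given zone."""
    moment = EPOCH + timedelta(microseconds=event_timestamp_micros)
    return moment.astimezone(tz).strftime("%Y%m%d")


# An event logged at 03:00 UTC on 2019-02-11.
event_ts = int(datetime(2019, 2, 11, 3, 0, tzinfo=timezone.utc).timestamp() * 1_000_000)

utc_day = local_event_date(event_ts, timezone.utc)
project_day = local_event_date(event_ts, PROJECT_TZ)
print(utc_day, project_day)
```

Under these assumptions the event belongs to 20190211 in UTC but to 20190210 in the project's zone, so a user active only at that moment is counted on different days by the two views.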

Optimize the performance of product scoped query

I would like to build the following table every day to store some aggregate data on the page performance of a website. However, each day's worth of data is over 15 million rows.
What steps can I take to improve performance? I am intending to save them as sharded tables, but I would like to improve further, could I nest the data within each table to improve performance further? What would be the best way to do this?
SELECT
device.devicecategory AS device,
hits_product.productListName AS list_name,
UPPER(hits_product.productSKU) AS SKU,
AVG(hits_product.productListPosition) AS avg_plp_position
FROM `mindful-agency-136314.43786551.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(hits.product) AS hits_product
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 1 day) and
DATE_sub(current_date(), interval 1 day)
AND hits_product.productListName != "(not set)"
GROUP BY
device,
list_name,
SKU
Since you're using productSku and productListName as dimensions/groups, there is no way around cross joining with the product array.
Cross joining with product can be dangerous in general, because this array is sometimes missing and the cross join then drops the whole row; typically you'd use a left join. But in this case it's fine, because you're only interested in product fields.
You should, however, be clear about whether you want to see list clicks or list impressions, using hits.product.isImpression and hits.product.isClick. At the moment I don't see a distinction there. Maybe filter with WHERE hits_product.isImpression in the case of list views?
Instead of shards, you might want to consider adding a date field and using PARTITION BY date as well as CLUSTER BY list_name. See INSERT Statement for updating and DDL Syntax to start the table. This is more performant than shards when it comes to querying the table later.
Starting the table could look something like this:
CREATE TABLE `x.y.z`
PARTITION BY date
CLUSTER BY list_name
AS (
SELECT
PARSE_DATE('%Y%m%d',date) AS date,
device.devicecategory AS device,
hits_product.productListName AS list_name,
UPPER(hits_product.productSKU) AS SKU,
AVG(IF(hits_product.isClick, hits_product.productListPosition, NULL)) AS avg_plp_click_position,
AVG(IF(hits_product.isImpression, hits_product.productListPosition, NULL)) AS avg_plp_view_position
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(hits.product) AS hits_product
WHERE
parse_date('%y%m%d', _table_suffix)
between
DATE_sub(current_date(), interval 1 day)
and DATE_sub(current_date(), interval 1 day)
AND hits_product.productListName != "(not set)"
GROUP BY
date,
device,
list_name,
SKU
)
Inserting new records is quite similar; you just need to list the fields up front, as described in the documentation.

Resources