Discrepancies between Firebase console and BigQuery - firebase

I have recently linked a Firebase project to BigQuery; the project contains both an iOS app and an Android app.
I need to run a query to export the count of the "session_start" event grouped by platform.
My query looks like this:
SELECT
  event_date,
  platform,
  COUNT(CASE WHEN event_name = 'session_start' THEN 1 ELSE NULL END) AS app_sessions
FROM
  `xxx.analytics_xxx.events_*`
WHERE
  _TABLE_SUFFIX BETWEEN "20191028" AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY
  event_date, platform
ORDER BY
  event_date
I found differences between the query results and the Firebase console. The discrepancy does not occur for all dates, so I'm wondering why this is happening.
Am I querying the count of session_start in the wrong way?
Update:
The discrepancy is around 1% and the numbers I got from the query are greater than the ones I see in the console (I attached a table with some data from one of the platforms for clarification).
I read the post
Discrepancies on “active users metric” between Firebase Analytics dashboard and BigQuery export
especially the part regarding the time needed for data to be fully uploaded. In my case I noticed the discrepancy for dates older than three days, even though the difference is very small (I ran the query on the 14th).
I can live with these variances, but since I'm new to BigQuery I would like to know whether I'm querying the data in the right way.
In particular, I don't know whether I should expect exactly the same numbers from BigQuery and the Firebase console, or whether data from the two sources can be very close with small differences.
Thank you

Related

BigQuery New Users count strongly differ from displayed Firebase Analytics data

During my inspections, I have found that after the 14th of January, new users count strongly differs between Google BigQuery and Google Firebase Analytics.
The discrepancy is higher than the traditional 0.5-2% rate that can be attributed to the HyperLogLog algorithm used to make computation faster.
I wasn't able to find a precise answer on how exactly new users are computed in Firebase Analytics, so I could not write a query that returns identical results. Since the discrepancy is above 30%, the problem is now much more significant.
Do you have the same problem? How can I better explain this strange behavior, for example by running other queries to find more details about the issue?
This is the query used to compare results:
SELECT
  APPROX_COUNT_DISTINCT(user_pseudo_id),
  event_date
FROM `practical-bot-198011.analytics_184597160.events_*`
WHERE event_name = 'first_open'
  AND _TABLE_SUFFIX BETWEEN '20200110' AND '20200127'
GROUP BY event_date
ORDER BY event_date ASC
and this is the result I get:
but in the Google Firebase Analytics Dashboard:
One reason the count in the Analytics dashboard doesn't match the BigQuery results is that data for the most recent three days is updated every 4-5 hours in Analytics, while in BigQuery data is only exported once per day. Queries that include the most recent three days will show different results between Analytics and BigQuery.
In legacy SQL, COUNT(DISTINCT) is an approximation; to get an exact count of unique IDs there, use EXACT_COUNT_DISTINCT(). In standard SQL, which the query above uses, APPROX_COUNT_DISTINCT() is the approximate function and COUNT(DISTINCT ...) returns an exact count. Refer to this Stack Overflow thread.
Additionally, take a look at the official documentation.
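To keep the most recent days out of a comparison, the table-suffix bounds can be computed before the query is run. The snippet below is only a sketch: the function name `stable_suffix_range` and the `settle_days` parameter are made up for illustration, and treating "three days before today" as the cutoff is an assumption based on the explanation above, not a documented rule.

```python
from datetime import date, timedelta


def stable_suffix_range(start, today, settle_days=3):
    """Return (first, last) _TABLE_SUFFIX strings, ending settle_days before today.

    settle_days=3 is an assumption taken from the answer above.
    """
    last = today - timedelta(days=settle_days)
    if last < start:
        raise ValueError("no settled daily tables in this range yet")
    return start.strftime("%Y%m%d"), last.strftime("%Y%m%d")


first, last = stable_suffix_range(date(2020, 1, 10), date(2020, 1, 30))
print(first, last)  # 20200110 20200127
```

The two strings can then be used in the `_TABLE_SUFFIX BETWEEN ... AND ...` filter.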

When should I run daily ETL jobs for Firebase Analytics data exported to BigQuery?

We use Firebase Analytics to collect events from our apps. We have enabled events export to BigQuery. Every day we run some ETL jobs to create more friendly analytics tables in BigQuery (e.g. sessions, purchases).
The question is when should we run these ETL jobs?
We know that Firebase Analytics creates an 'events_intraday_' table in BigQuery, which is changed to 'events_' some hours after midnight. We also understand that some events might be reported later if the client is not connected to the internet, but this is not the problem.
Our theory is that the 'events_intraday_' table is some kind of temporary table and that we should run the ETL jobs once it changes to 'events_'. Unfortunately we could not find any documentation about it. Is this a good solution?
From Announcing Realtime Exporting of your Analytics Data into BigQuery:
At the end of the day [1], this data will be moved into its permanent appevents_ home, and the old intraday table will be automatically cleaned up for you.
With:
[1] This is determined by looking at the developer's time zone.
So it looks like the daily table is created at midnight for your timezone.
Thanks to Frank van Puffelen I found an article on the Firebase Blog,
How Long Does it Take for My Firebase Analytics Data to Show Up?, which says that analytics data exported to BigQuery can be delayed by a little more than 1 hour. Based on this information, ETL jobs should be run at about, let's say, 2 AM UTC+0, and the query should just UNION ALL the events table with the events_intraday table.
So if today is 2019-04-02 and I want to query data from last month, the query should look like:
SELECT * FROM
(
SELECT *
FROM `<PROJECT_ID>.analytics_<ANALYTICS_ID>.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20190301' AND '20190401'
)
UNION ALL
(
SELECT *
FROM `<PROJECT_ID>.analytics_<ANALYTICS_ID>.events_intraday_*`
WHERE _TABLE_SUFFIX = '20190401'
)
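If the ETL job is scheduled from a script, the date literals in the query above could be generated rather than typed by hand. This is a rough sketch only: `build_events_query` and the dataset string are hypothetical, and the rule "daily tables up to yesterday plus yesterday's intraday table" is simply copied from the example query, not taken from any documentation.

```python
from datetime import date, timedelta


def build_events_query(dataset, first_day, today):
    """Build the UNION ALL query text for first_day..yesterday.

    dataset is a hypothetical "<project>.<analytics_dataset>" string.
    """
    fmt = "%Y%m%d"
    yesterday = today - timedelta(days=1)
    return (
        "SELECT * FROM `{ds}.events_*`\n"
        "WHERE _TABLE_SUFFIX BETWEEN '{a}' AND '{b}'\n"
        "UNION ALL\n"
        "SELECT * FROM `{ds}.events_intraday_*`\n"
        "WHERE _TABLE_SUFFIX = '{b}'"
    ).format(ds=dataset, a=first_day.strftime(fmt), b=yesterday.strftime(fmt))


query = build_events_query("my-project.analytics_123456", date(2019, 3, 1), date(2019, 4, 2))
print(query)
```

The function only builds the SQL text; running it is left to whatever BigQuery client the ETL job already uses.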

Calculate time spent by users on app using Google BigQuery Firebase analytics data

I am trying to calculate the total time spent by users on my app. We have integrated Firebase Analytics data in BigQuery. Can I sum the values of engagement_time_msec in the SELECT statement of my query? This is what I am trying:
SELECT SUM(x.value.int_value)
FROM "[dataset]",
UNNEST(event_params) AS x WHERE x.key = "engagement_time_msec"
I am getting very big values after executing this query. I am not sure whether it is OK to use SUM("engagement_time_msec") to calculate the total time spent by users on the app.
Any help would be highly appreciated.
It really depends on what dataset you have. Ideally, you would want login and logout timestamps, if you have them. Take the time difference between the values, grouping by user, device, loadsequence, etc. Anything which defines a single event.
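One thing worth checking before distrusting the sum: as the parameter name suggests, engagement_time_msec is in milliseconds, so a total over all users and days will look very large until it is converted. A minimal sketch of the conversion (the helper name `msec_to_hours` and the input figure are made up for illustration):

```python
def msec_to_hours(total_msec):
    """Convert a summed engagement_time_msec value to hours."""
    return total_msec / 1000 / 3600


# Hypothetical total: 7.2 billion milliseconds.
print(msec_to_hours(7_200_000_000))  # 2000.0
```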

Discrepancy in Daily user engagement between Firebase Analytics dashboard and BigQuery

I'm trying to develop a query against Firebase Analytics data linked to BigQuery to reproduce the "Daily user engagement" graph from the Firebase Analytics dashboard (to include in a Google Data Studio report).
According to Firebase Help documentation, Daily user engagement is defined as "Average daily engagement per user for the date range, including the fluctuation by percentage from the previous date range." So, my attempt is to sum the engagement_time_msec (the additional engagement time (ms) since the last user_engagement event according to https://support.google.com/firebase/answer/7061705?hl=en) for user_engagement events, divided by the count of users (identified by user_dim.app_info.app_instance_id) per day. The query looks like this:
SELECT
  ((total_engagement_time_msec / 1000) / users) AS average_engagement_time_sec,
  date
FROM (
  SELECT
    SUM(params.value.int_value) AS total_engagement_time_msec,
    COUNT(DISTINCT(user_dim.app_info.app_instance_id)) AS users,
    e.date
  FROM `com_artermobilize_alertable_IOS.app_events_*`, UNNEST(event_dim) AS e, UNNEST(e.params) AS params
  WHERE e.name = 'user_engagement'
    AND params.key = 'engagement_time_msec'
  GROUP BY e.date
)
ORDER BY date DESC
The results are close to what's displayed in the Firebase console graph of Daily user engagement, but the values from my query are consistently a few seconds higher (BigQuery results shown here on the left, Firebase Console graph values on the right).
To note, we're not setting user_dim.user_id and not using IDFA, so my understanding is the correct/only way to count "users" is the user_dim.app_info.app_instance_id, and I imagine the same would be true for the Firebase console.
Can anyone suggest what might be different between how I'm determining the average engagement time from BigQuery, and how that's being determined in the Firebase console graph?
To note, I've seen a similar question posed here, but I don't believe the suggested answer applies for my query since 1) the discrepancies are present over multiple days, 2) I'm already querying for user_engagement events and 3) the event date being used in the query is stated to be based on the registered timezone of your app (according to this).
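For anyone checking the arithmetic by hand, the calculation in the query above (total engagement time per day divided by distinct app instances that day) can be reproduced on a small sample. This sketch only mirrors the query in this question; whether the Firebase console computes the metric the same way is exactly what is unknown. The function name and sample rows are invented.

```python
from collections import defaultdict


def average_engagement_sec(rows):
    """rows: iterable of (date, app_instance_id, engagement_time_msec)."""
    total_msec = defaultdict(int)
    users = defaultdict(set)
    for day, instance_id, msec in rows:
        total_msec[day] += msec
        users[day].add(instance_id)
    # Seconds of engagement per distinct app instance, per day.
    return {day: total_msec[day] / 1000 / len(users[day]) for day in total_msec}


# Invented sample: two instances on the first day, one on the second.
rows = [
    ("20170101", "a", 30_000),
    ("20170101", "a", 30_000),
    ("20170101", "b", 60_000),
    ("20170102", "a", 10_000),
]
print(average_engagement_sec(rows))  # {'20170101': 60.0, '20170102': 10.0}
```

Note that the denominator only includes instances that logged at least one user_engagement event that day, which is one place where a dashboard definition could differ.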

Discrepancies on "active users metric" between Firebase Analytics dashboard and BigQuery export

According to Firebase Analytics docs (https://support.google.com/firebase/answer/6317517#active-users), the active number of users is the number of unique users who initiated sessions on a given day. Also according to the docs, every time a session is started an event with the name session_start is sent. I am trying to get that metric using BigQuery's export, but my query is giving me different results (15636 on BigQuery, 14908 on FB analytics).
I have also tried converting to different timezones to see if that might be the issue, but no matter which timezone I try I never get the same (or similar) results.
Which query should I run to get the same results I get on Firebase Analytics dashboard for active users?
My query is
SELECT EXACT_COUNT_DISTINCT(user_dim.app_info.app_instance_id)
FROM table_date_range([XXXXX.app_events_], timestamp('2016-11-26'), timestamp('2016-11-29'))
WHERE DATE(event_dim.timestamp_micros) = '2016-11-27'
AND event_dim.name ='session_start'
Thanks
Update
After #djabi's answer I changed my query to use user_engagement rather than session_start and it works much better now. Still some minor differences though (they range from under ten to under 50 out of 16K, depending on the date).
I have tried once again using different timezones by playing around with DATE(date_add(event_dim.timestamp_micros,1,'hour')) but I never got the exact number I get on Firebase Analytics dashboard.
The new numbers are good enough to be considered statistically acceptable, but I'm wondering whether anyone has a suggestion to improve the query and get exact results.
The current query is:
SELECT
COUNT(*) AS active_users
FROM (
SELECT
COALESCE(user_dim.user_id, user_dim.app_info.app_instance_id) AS user_id
FROM
TABLE_DATE_RANGE([XXXXX.app_events_], TIMESTAMP('2016-11-24'), TIMESTAMP('2016-11-29'))
WHERE
DATE(event_dim.timestamp_micros) = '2016-11-25'
AND event_dim.name ='user_engagement'
GROUP BY
user_id )
Note: At the moment we are not sending user_id, so the COALESCE will always return the app_instance_id, in case anyone was going to suggest that could be the problem
You need to wait a full 3 days for data from offline devices to be uploaded. Your query correctly filters the events based on the event timestamp, and you pull data from 3 days, but that is only a day and a half from today, which is not enough for all data to be uploaded. Try including 3 days from yesterday.
Also try using the user_engagement event instead of session_start. I believe the active user count is based on user_engagement and not on session_start events.
Also, FB reports take a while to process, so you might want to check the FB reports the next day.
FB reports are done in the time zone of the account, while events are timestamped in UTC, so the day in FB reports is different from the UTC calendar day. You want to control for that discrepancy as well to get matching numbers.
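To see how the time zone point changes which day an event lands on, here is a small sketch that converts a timestamp_micros value into a calendar date under a fixed UTC offset. The function name `report_date` and the -8 hour offset are assumptions for illustration; a fixed offset ignores daylight saving, so a real job would more likely use a named zone (for example via Python's `zoneinfo`).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def report_date(timestamp_micros, utc_offset_hours):
    """Return the YYYYMMDD date of a UTC microsecond timestamp at a fixed offset."""
    utc_time = EPOCH + timedelta(microseconds=timestamp_micros)
    local = utc_time.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.strftime("%Y%m%d")


# An event at 2016-11-27 02:30 UTC, expressed in microseconds since the epoch.
ts = int((datetime(2016, 11, 27, 2, 30, tzinfo=timezone.utc) - EPOCH).total_seconds()) * 1_000_000

print(report_date(ts, 0))   # 20161127
print(report_date(ts, -8))  # 20161126
```

The same event counts toward the 27th in UTC but toward the 26th for an account eight hours behind UTC, which is enough to shift daily active user counts.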
By default, a session is counted only after 10 seconds of user activity in the app, which you can change. Try lowering that minimum session duration to the smallest value possible, and you may arrive at a number closer to what you are expecting.
For Android stats I used:
user_dim.device_info.resettable_device_id
instead of
user_dim.app_info.app_instance_id
and it produced better results.

Resources