During my checks, I have found that after the 14th of January the new-user counts differ significantly between Google BigQuery and Google Firebase Analytics.
The discrepancy is well above the typical 0.5-2% that can be attributed to the HyperLogLog algorithm used to speed up the computation.
I wasn't able to find a precise description of how exactly new users are computed in Firebase Analytics, so that I could write an equivalent query and get identical results. Since the discrepancy is above 30%, the magnitude of the problem is now significant.
Do you have the same problem? How can I investigate this strange behavior further (e.g., by running other queries to find more details about the issue)?
This is the query used to compare results:
SELECT
  APPROX_COUNT_DISTINCT(user_pseudo_id) AS new_users,
  event_date
FROM `practical-bot-198011.analytics_184597160.events_*`
WHERE event_name = 'first_open'
  AND _TABLE_SUFFIX BETWEEN '20200110' AND '20200127'
GROUP BY event_date
ORDER BY event_date ASC
and this is the result I get:
but in the Google Firebase Analytics Dashboard:
One reason the count in the Analytics dashboard doesn't match the BigQuery results is that the data for the most recent three days is updated every 4-5 hours in Analytics, while data is exported to BigQuery only once per day. Queries that include the most recent three days will therefore show different results between Analytics and BigQuery.
APPROX_COUNT_DISTINCT() is an approximation. To get an exact count of unique IDs, use COUNT(DISTINCT ...) in standard SQL (or EXACT_COUNT_DISTINCT() in legacy SQL). Refer to this Stack Overflow thread.
Additionally, take a look at the official documentation.
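As a sketch, the question's query with an exact distinct count instead of the approximate one would look like this (project and dataset names are the ones from the question):

```sql
SELECT
  -- COUNT(DISTINCT ...) is exact in BigQuery standard SQL,
  -- unlike APPROX_COUNT_DISTINCT(), which uses HyperLogLog
  COUNT(DISTINCT user_pseudo_id) AS new_users,
  event_date
FROM `practical-bot-198011.analytics_184597160.events_*`
WHERE event_name = 'first_open'
  AND _TABLE_SUFFIX BETWEEN '20200110' AND '20200127'
GROUP BY event_date
ORDER BY event_date ASC
```

Note that an exact count only removes the 0.5-2% HyperLogLog error, so a 30% gap would have to come from something else, such as the export timing described above.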
Related
I have recently linked a Firebase project to BigQuery, the project contains both an iOS app and an Android app.
I need to run a query to export the count of the "session_start" event grouped by platform.
My query looks like this:
SELECT
  event_date,
  platform,
  COUNT(CASE WHEN event_name = 'session_start' THEN 1 ELSE NULL END) AS app_sessions
FROM
  `xxx.analytics_xxx.events_*`
WHERE
  _TABLE_SUFFIX BETWEEN "20191028" AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY
  event_date, platform
ORDER BY
  event_date
I found differences between the query results and the Firebase console; the discrepancy does not occur for all dates, so I'm wondering why this is happening.
Am I querying the count of session_start in the wrong way?
Update:
The discrepancy is around 1% and the numbers I got from the query are greater than the ones I see in the console (I attached a table with some data from one of the platforms for clarification).
I read the post
Discrepancies on “active users metric” between Firebase Analytics dashboard and BigQuery export
especially the part regarding the time needed for data to be fully uploaded. In my case I noticed the discrepancy for dates older than three days, even though the difference is very small (I ran the query on the 14th).
I can live with these variances; since I'm new to BigQuery, I would like to know whether I'm querying the data in the right way or not.
In other words, I don't know whether I should expect exactly the same numbers from BigQuery and the Firebase console, or whether the two sources can be very close but small differences may occur.
Thank you
In a typical GA session, after picking a View ID and a date range, we can get a week's worth of data like this:
Users: 146,207
New Users: 124,582
Sessions: 186,191
The question is, what BQ field(s) to query in order to get this Users value?
Here is an example query with two methods (the second method is commented out).
SELECT
  COUNT(DISTINCT CONCAT(CAST(visitID AS STRING), CAST(visitNumber AS STRING))) AS visitors
  -- COUNT(DISTINCT fullVisitorId) AS visitors
I noticed the FVID method was fairly close to what I see in GA (with Users understated by about 3% in BQ), and if I use the commented-out method, I get a value that is about 15% overstated compared to GA. Is there a more reliable method in BQ to acquire the Users value shown in GA?
The COUNT(DISTINCT fullVisitorId) method is the most correct method, but it won't match what Analytics 360 reports by default. Since last year, Google Analytics 360 by default uses a different calculation for the Users metric than it previously did. The old calculation, which is still used in unsampled reports, is more likely to match what you get out of BigQuery. You can verify this by exporting your report as an unsampled report, or using the unsampled reporting features in the Management API.
If you want the numbers to match exactly, you can turn off the new calculation by using the instructions here. The new calculation's precise details are not public, so duplicating that value in BigQuery is quite difficult.
There are still some reasons you might see different numbers, even with the old calculation. One is if the site has implemented User ID, in which case the GA number will be lower than BigQuery for fullVisitorId. Another is sampling, though that's unlikely in Analytics 360 at the volumes you're talking about.
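To illustrate the fullVisitorId approach, a hedged sketch of a per-day Users count over the GA 360 BigQuery export might look like this (standard SQL; the project and dataset names are placeholders):

```sql
SELECT
  date,
  -- closest to the *old* Users calculation; per-day counts cannot be summed,
  -- since a user active on several days is counted once for each day
  COUNT(DISTINCT fullVisitorId) AS users
FROM `project.dataset.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20180101' AND '20180107'
GROUP BY date
ORDER BY date
```

If User ID is implemented on the site, this count will run higher than the GA report, as noted above.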
I'm trying to develop a query against Firebase Analytics data linked to BigQuery to reproduce the "Daily user engagement" graph from the Firebase Analytics dashboard (to include in a Google Data Studio report).
According to Firebase Help documentation, Daily user engagement is defined as "Average daily engagement per user for the date range, including the fluctuation by percentage from the previous date range." So, my attempt is to sum the engagement_time_msec (the additional engagement time (ms) since the last user_engagement event according to https://support.google.com/firebase/answer/7061705?hl=en) for user_engagement events, divided by the count of users (identified by user_dim.app_info.app_instance_id) per day. The query looks like this:
SELECT
  (total_engagement_time_msec / 1000) / users AS average_engagement_time_sec,
  date
FROM (
  SELECT
    SUM(params.value.int_value) AS total_engagement_time_msec,
    COUNT(DISTINCT user_dim.app_info.app_instance_id) AS users,
    e.date
  FROM
    `com_artermobilize_alertable_IOS.app_events_*`,
    UNNEST(event_dim) AS e,
    UNNEST(e.params) AS params
  WHERE e.name = 'user_engagement'
    AND params.key = 'engagement_time_msec'
  GROUP BY e.date
)
ORDER BY date DESC
The results are close to what's displayed in the Firebase console graph of Daily user engagement, but the values from my query are consistently a few seconds higher (BigQuery results shown here on the left, Firebase Console graph values on the right).
To note, we're not setting user_dim.user_id and not using IDFA, so my understanding is that the correct/only way to count "users" is user_dim.app_info.app_instance_id, and I imagine the same would be true for the Firebase console.
Can anyone suggest what might be different between how I'm determining the average engagement time from BigQuery and how it is being determined in the Firebase console graph?
To note, I've seen a similar question posed here, but I don't believe the suggested answer applies to my query, since 1) the discrepancies are present over multiple days, 2) I'm already querying for user_engagement events, and 3) the event date used in the query is stated to be based on the registered timezone of your app (according to this).
According to the Firebase Analytics docs (https://support.google.com/firebase/answer/6317517#active-users), the number of active users is the number of unique users who initiated sessions on a given day. Also according to the docs, every time a session is started, an event with the name session_start is sent. I am trying to get that metric using BigQuery's export, but my query is giving me different results (15,636 on BigQuery, 14,908 on Firebase Analytics).
I have also tried converting to different timezones to see if that might be the issue, but no matter which timezone I try I never get the same (or similar) results
Which query should I run to get the same results I get on Firebase Analytics dashboard for active users?
My query is
SELECT
  EXACT_COUNT_DISTINCT(user_dim.app_info.app_instance_id)
FROM
  TABLE_DATE_RANGE([XXXXX.app_events_], TIMESTAMP('2016-11-26'), TIMESTAMP('2016-11-29'))
WHERE
  DATE(event_dim.timestamp_micros) = '2016-11-27'
  AND event_dim.name = 'session_start'
Thanks
Update
After #djabi's answer I changed my query to use user_engagement rather than session_start, and it works much better now. There are still some minor differences, though (they range from under ten to under 50 out of ~16K, depending on the date).
I have tried once again using different timezones by playing around with DATE(date_add(event_dim.timestamp_micros, 1, 'hour')), but I never got the exact number shown on the Firebase Analytics dashboard.
The new numbers are close enough to be statistically acceptable, but I'm wondering if anyone has a suggestion to improve the query and get exact results?
The current query is:
SELECT
COUNT(*) AS active_users
FROM (
SELECT
COALESCE(user_dim.user_id, user_dim.app_info.app_instance_id) AS user_id
FROM
TABLE_DATE_RANGE([XXXXX.app_events_], TIMESTAMP('2016-11-24'), TIMESTAMP('2016-11-29'))
WHERE
DATE(event_dim.timestamp_micros) = '2016-11-25'
AND event_dim.name ='user_engagement'
GROUP BY
user_id )
Note: At the moment we are not sending user_id, so the COALESCE will always return the app_instance_id, in case anyone was going to suggest that could be the problem
You need to wait a full 3 days for data from offline devices to be uploaded. Your query correctly filters the events based on the event timestamp, and you pull data from 3 days, but that is only a day and a half back from today, which is not enough time for all data to be uploaded. Try including 3 full days counted back from yesterday.
Also try using the user_engagement event instead of session_start. I believe the active user count is based on user_engagement events, not on session_start events.
Also, FB reports take a while to process, so you might want to check the FB reports again the next day.
FB reports are computed in the time zone of the account, while events are timestamped in UTC, so the day in FB reports is different from the UTC calendar day. You want to control for that discrepancy as well to get matching numbers.
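As a sketch of that time-zone adjustment in standard SQL, using the newer Firebase events_* export schema rather than the thread's legacy app_events_ schema (the table name and the 'America/New_York' zone are placeholders for the account's reporting time zone):

```sql
SELECT
  -- shift the UTC event timestamp into the reporting time zone
  -- before truncating to a calendar day
  DATE(TIMESTAMP_MICROS(event_timestamp), 'America/New_York') AS report_day,
  COUNT(DISTINCT user_pseudo_id) AS active_users
FROM `project.dataset.events_*`
WHERE event_name = 'user_engagement'
  AND _TABLE_SUFFIX BETWEEN '20161126' AND '20161129'
GROUP BY report_day
ORDER BY report_day
```

Grouping by the shifted date rather than the raw UTC date is what lines the query's day boundaries up with the dashboard's.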
By default, a session is counted after 10 seconds of user activity in the app, and this threshold can be changed. Try lowering the session-start threshold to the smallest possible value, and you may arrive at a number closer to what you are expecting.
For Android stats I used:
user_dim.device_info.resettable_device_id
instead of
user_dim.app_info.app_instance_id
and it produced better results.
As a Premium Google Analytics/BigQuery customer, our question is, Which data is more accurate?
I tend to lean toward BigQuery being more accurate because we can actually see the raw data, but we have no insight into the method Google Analytics is using to calculate its numbers.
I also think that a lot of it has to do with SAMPLING.
When you calculate something simple like Total Pageviews for a single page, the Google Analytics numbers line up to BigQuery within .00001%:
sum(case when regexp_match(hits.page.pagepath,r'(?i:/contact.aspx)') and hits.type = "page" then 1 else 0 end) as total_pageviews
When you calculate something more complex like Unique Pageviews for a single page, Google Analytics numbers are 5% greater than BigQuery. Note that it is sampling by the max 1 Million:
count(distinct (case when regexp_match(hits.page.pagepath,r'(?i:/contact.aspx)') and hits.type = "page" then concat(fullvisitorid, string(visitid)) end), 1000000) as unique_pageviews
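For comparison, here is a hedged sketch of the same unique-pageview count as an exact, non-approximate aggregate in BigQuery standard SQL, which avoids the legacy COUNT(DISTINCT ..., 1000000) threshold entirely (the table name is a placeholder):

```sql
SELECT
  -- exact distinct count of (user, session) pairs that viewed the page;
  -- standard SQL COUNT(DISTINCT ...) is not approximate
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS unique_pageviews
FROM `project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE REGEXP_CONTAINS(h.page.pagePath, r'(?i)/contact\.aspx')
  AND h.type = 'page'
```

Any remaining gap against the GA report would then be attributable to sampling or to GA's own calculation, not to the approximation threshold.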
I would love to know what others think or what the Google Developers themselves can explain.
If you are a premium customer, I assume that's because you have a large website with a lot of data. The Google Analytics API will sample your data if there's too much of it. You can try to prevent this by raising the sampling level, but even with the sampling level set to high precision you will still get sampled data back from the API.
Check the JSON coming back from the API; it will tell you if your data is being sampled.
BigQuery won't sample your data. There is a way for premium customers to use the API without sampling, but I think you have to contact Google to set that up.
The bigger point in BigQuery's favor is that you aren't limited to 7 dimensions and 10 metrics like you are with the Google Analytics API.
Note: I am not a Google Developer, but I am a Google Developer Expert for Google Analytics.
I am a big fan of BigQuery. I have also used Google Analytics quite a lot. So the question is about where the data is more accurate.
Well, the answer to such a question is always: data is more accurate the closer it is to where it originates. BigQuery is an underlying storage layer for Google's data. This is where the data is collected, indexed, and then made accessible through a SQL interface.
Google Analytics is a tool that was developed with a lot of free accounts in mind. To support free accounts, GA needed to scale well. To scale, companies optimize on storage by pre-aggregating data.
So you are really comparing two things: pre-summarized/pre-aggregated data (GA) and raw accumulated data (BigQuery). Which would you trust?
Now, it sounds like there is also a second question: "how do I get accurate aggregates from BigQuery?" BigQuery's (legacy) SQL dialect is not ANSI-compatible and is hard to remember for ad-hoc queries. You are better off connecting a BI tool on top of BigQuery, so that you can explore the data in a consistent manner (i.e. same thresholds/rounding).