I have a Google Analytics report for goal completions whose numbers differ from what I see in BigQuery. I am using the query below to get the goal completions. The discrepancy is small, roughly 1 to 20 completions.
SELECT DISTINCT
  visitId
FROM
  `gcp_project.ganalytics.ga_sessions_*` AS sessions,
  UNNEST(hits) AS hits
WHERE
  REGEXP_CONTAINS(hits.page.pagePath, '/booking/complete*')
  AND _TABLE_SUFFIX = '20210424'
  AND totals.visits = 1
gcp_project is in the US region and the Goals report is based on a French web page. Does the timezone make a difference?
Timezone is one factor. The BigQuery export dataset will be in UTC whereas in the GA UI, the timezone is user defined for each property.
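To see how a day boundary can move a completion from one report date to another, here is a minimal Python sketch. It assumes the hit timestamp is stored in UTC and that the GA side reports in French local time. The fixed +02:00 offset (CEST in late April) and the sample timestamp are illustrative assumptions, not values from the question.

```python
from datetime import datetime, timedelta, timezone

# Assumption: France is on CEST (UTC+02:00) on the date in question.
PARIS_SUMMER = timezone(timedelta(hours=2))

def report_dates(hit_utc: datetime) -> tuple[str, str]:
    """Return the calendar date of a hit in UTC and in French local time."""
    local = hit_utc.astimezone(PARIS_SUMMER)
    return hit_utc.strftime("%Y%m%d"), local.strftime("%Y%m%d")

# Hypothetical booking completed at 22:30 UTC on 24 April 2021.
hit = datetime(2021, 4, 24, 22, 30, tzinfo=timezone.utc)
utc_day, paris_day = report_dates(hit)
print(utc_day, paris_day)  # 20210424 20210425
```

A handful of sessions close to midnight is enough to produce a difference of the size described in the question.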
In the GA UI, the count is an approximation computed with HyperLogLog functions, whereas your BigQuery query does an exact DISTINCT. An equivalent in BigQuery would be to use the approximate aggregate functions in Standard SQL. However, even then you might see minor discrepancies due to the different implementations of HyperLogLog in GA and BigQuery.
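Separately from the timezone and HyperLogLog points, the pattern in the question's query may be worth a look. In '/booking/complete*' the * applies only to the final e, and REGEXP_CONTAINS is unanchored, so the filter accepts any path that contains '/booking/complet'. The sketch below uses Python's re.search as a stand-in for REGEXP_CONTAINS. The page paths and the stricter pattern are made up for illustration, and whether the difference matters depends on how the goal is defined in GA.

```python
import re

PATTERN = r"/booking/complete*"           # pattern from the question's query
STRICT = r"^/booking/complete(\?|$)"      # hypothetical stricter alternative

def contains(path: str, pattern: str = PATTERN) -> bool:
    """Unanchored search, analogous to REGEXP_CONTAINS."""
    return re.search(pattern, path) is not None

# Hypothetical page paths, not taken from the question.
paths = [
    "/booking/complete",
    "/booking/complete?ref=mail",
    "/booking/completion-survey",
    "/fr/booking/complet",
    "/booking/cancel",
]

matches = [p for p in paths if contains(p)]
strict_matches = [p for p in paths if contains(p, STRICT)]
print(matches)         # the first four paths
print(strict_matches)  # only the first two paths
```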
Related
While inspecting the data, I found that after the 14th of January the new users count differs strongly between Google BigQuery and Google Firebase Analytics.
The discrepancy is higher than the usual 0.5-2% that can be attributed to the HyperLogLog algorithm used to make the computation faster.
I wasn't able to find a precise answer on how exactly new users are computed in Firebase Analytics, so I could not write a query that gives identical results. Since the discrepancy is above 30%, the problem is now significant.
Do you have the same problem? How can I better explain this strange behavior, for example by running other queries to find more details about the issue?
This is the query used to compare results:
SELECT
  APPROX_COUNT_DISTINCT(user_pseudo_id) AS new_users,
  event_date
FROM `practical-bot-198011.analytics_184597160.events_*`
WHERE event_name = 'first_open'
  AND _TABLE_SUFFIX BETWEEN '20200110' AND '20200127'
GROUP BY event_date
ORDER BY event_date ASC
and this is the result I get:
but in the Google Firebase Analytics Dashboard:
One reason the count in the Analytics dashboard doesn't match the BigQuery results is that Analytics updates the data for the most recent three days every 4-5 hours, while data is exported to BigQuery only once per day. Queries that include the most recent three days will therefore show different results in Analytics and BigQuery.
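One way to act on this is to compare only days that have had time to settle. The sketch below builds a _TABLE_SUFFIX range that stops before the most recent three days. The helper name and the choice to count today as one of those three days are assumptions for illustration.

```python
from datetime import date, timedelta

def stable_suffix_range(start: date, today: date, settle_days: int = 3) -> tuple[str, str]:
    """Return (first, last) table suffixes, skipping the most recent days.

    Assumption: today counts as one of the `settle_days` unsettled days,
    so the last included day is `today - settle_days`.
    """
    end = today - timedelta(days=settle_days)
    if end < start:
        raise ValueError("no settled days in the requested range yet")
    return start.strftime("%Y%m%d"), end.strftime("%Y%m%d")

first, last = stable_suffix_range(date(2020, 1, 10), today=date(2020, 1, 30))
print(f"_TABLE_SUFFIX BETWEEN '{first}' AND '{last}'")
```

If the numbers still differ for days inside this range, the export delay is probably not the main cause.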
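COUNT(DISTINCT) is an approximation in legacy SQL. To get an exact count of unique IDs there, use EXACT_COUNT_DISTINCT(). In standard SQL, which the query above uses, COUNT(DISTINCT ...) is exact and APPROX_COUNT_DISTINCT() is the approximate version. Refer to this Stack Overflow thread.
Additionally, take a look at the official documentation.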
I have recently linked a Firebase project to BigQuery, the project contains both an iOS app and an Android app.
I need to run a query to export the count of the "session_start" event grouped by platform.
My query looks like this:
SELECT
  event_date,
  platform,
  COUNTIF(event_name = 'session_start') AS app_sessions
FROM
  `xxx.analytics_xxx.events_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20191028'
    AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY
  event_date, platform
ORDER BY
  event_date
I found differences between the query results and the Firebase console. The discrepancy does not occur for all dates, so I'm wondering why this is happening.
Am I querying the count of session_start in the wrong way?
Update:
The discrepancy is around 1% and the numbers I got from the query are greater than the ones I see in the console (I attached a table with some data from one of the platforms for clarification).
I read the post
Discrepancies on “active users metric” between Firebase Analytics dashboard and BigQuery export
especially the part regarding the needed time for data to be fully uploaded. In my case I noticed the discrepancy for dates older than three days even if the difference is very little (I ran the query on the 14th).
I can live with these variances, but since I'm new to BigQuery I would like to know whether I'm querying the data in the right way.
In particular, I don't know whether I should expect exactly the same numbers from BigQuery and the Firebase console, or whether data from the two sources can be very close with small differences.
Thank you
In a typical GA session, after picking a View ID and a date range, we can get a week's worth of data like this:
Users: 146,207
New Users: 124,582
Sessions: 186,191
The question is, what BQ field(s) to query in order to get this Users value?
Here is an example query with 2 methods (the 2nd method is commented out).
SELECT
COUNT(DISTINCT CONCAT(CAST(visitId AS STRING), CAST(visitNumber AS STRING))) AS visitors,
-- count(DISTINCT(fullVisitorId)) as visitors
I noticed the fullVisitorId (FVID) method, commented out above, was fairly close to what I see in GA (with Users understated by about 3% in BQ), whereas the visitId/visitNumber method gives a value that is about 15% overstated compared to GA. Is there a more reliable method in BQ to get the Users value shown in GA?
The COUNT(DISTINCT fullVisitorId) method is the most correct method, but it won't match what Analytics 360 reports by default. Since last year, Google Analytics 360 by default uses a different calculation for the Users metric than it previously did. The old calculation, which is still used in unsampled reports, is more likely to match what you get out of BigQuery. You can verify this by exporting your report as an unsampled report, or using the unsampled reporting features in the Management API.
If you want the numbers to match exactly, you can turn off the new calculation by using the instructions here. The new calculation's precise details are not public, so duplicating that value in BigQuery is quite difficult.
There are still some reasons you might see different numbers, even with the old calculation. One is if the site has implemented User ID, in which case the GA number will be lower than BigQuery for fullVisitorId. Another is sampling, though that's unlikely in Analytics 360 at the volumes you're talking about.
I'm trying to develop a query against Firebase Analytics data linked to BigQuery to reproduce the "Daily user engagement" graph from the Firebase Analytics dashboard (to include in a Google Data Studio report).
According to Firebase Help documentation, Daily user engagement is defined as "Average daily engagement per user for the date range, including the fluctuation by percentage from the previous date range." So, my attempt is to sum the engagement_time_msec (the additional engagement time (ms) since the last user_engagement event according to https://support.google.com/firebase/answer/7061705?hl=en) for user_engagement events, divided by the count of users (identified by user_dim.app_info.app_instance_id) per day. The query looks like this:
SELECT
  (total_engagement_time_msec / 1000) / users AS average_engagement_time_sec,
  date
FROM (
  SELECT
    SUM(params.value.int_value) AS total_engagement_time_msec,
    COUNT(DISTINCT user_dim.app_info.app_instance_id) AS users,
    e.date
  FROM
    `com_artermobilize_alertable_IOS.app_events_*`,
    UNNEST(event_dim) AS e,
    UNNEST(e.params) AS params
  WHERE e.name = 'user_engagement'
    AND params.key = 'engagement_time_msec'
  GROUP BY e.date
)
ORDER BY date DESC
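For reference, the arithmetic the query performs can be checked by hand on a few rows. This is a minimal Python sketch with made-up rows (the dates, instance IDs and durations are not from the real dataset). It only mirrors the query above and makes no claim about how the Firebase console computes its graph.

```python
from collections import defaultdict

# Hypothetical rows: (date, app_instance_id, engagement_time_msec)
rows = [
    ("20180301", "a", 30_000),
    ("20180301", "a", 15_000),
    ("20180301", "b", 45_000),
    ("20180302", "a", 20_000),
]

def average_engagement_sec(events):
    """Sum engagement per day, then divide by the distinct users seen that day."""
    total_msec = defaultdict(int)
    users = defaultdict(set)
    for day, instance_id, msec in events:
        total_msec[day] += msec
        users[day].add(instance_id)
    return {day: total_msec[day] / 1000 / len(users[day]) for day in total_msec}

print(average_engagement_sec(rows))  # {'20180301': 45.0, '20180302': 20.0}
```

Note that the denominator only includes users who logged at least one user_engagement event on that day, exactly as in the query.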
The results are close to what's displayed in the Firebase console graph of Daily user engagement, but the values from my query are consistently a few seconds higher (BigQuery results shown here on the left, Firebase Console graph values on the right).
To note, we're not setting user_dim.user_id and not using IDFA, so my understanding is the correct/only way to count "users" is the user_dim.app_info.app_instance_id, and I imagine the same would be true for the Firebase console.
Can anyone suggest what might be different between how I'm determining the average engagement time from BigQuery, and how that's being determined in the Firebase console graph?
To note, I've seen a similar question posed here, but I don't believe the suggested answer applies for my query since 1) the discrepancies are present over multiple days, 2) I'm already querying for user_engagement events and 3) the event date being used in the query is stated to be based on the registered timezone of your app (according to this).
As a Premium Google Analytics/BigQuery customer, our question is: which data is more accurate?
I lean toward BigQuery being more accurate because we can actually see the raw data, whereas we have no insight into the method Google Analytics uses to calculate its numbers.
I also think that a lot of it has to do with SAMPLING.
When you calculate something simple like Total Pageviews for a single page, the Google Analytics numbers line up to BigQuery within .00001%:
sum(case when regexp_match(hits.page.pagepath,r'(?i:/contact.aspx)') and hits.type = "page" then 1 else 0 end) as total_pageviews
When you calculate something more complex like Unique Pageviews for a single page, Google Analytics numbers are 5% greater than BigQuery. Note that it is sampling by the max 1 Million:
count(distinct (case when regexp_match(hits.page.pagepath,r'(?i:/contact.aspx)') and hits.type = "page" then concat(fullvisitorid, string(visitid)) end), 1000000) as unique_pageviews
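To make the difference between the two metrics concrete, here is a small Python sketch that applies the same logic as the two expressions above to a few made-up hits: total pageviews counts every matching page hit, while unique pageviews counts distinct (fullvisitorid, visitid) pairs. The rows are hypothetical, and re.search is assumed to behave like the partial match in the queries.

```python
import re

PAGE = re.compile(r"(?i:/contact.aspx)")  # pattern from the queries above

# Hypothetical hits: (fullvisitorid, visitid, page_path, hit_type)
hits = [
    ("u1", 100, "/contact.aspx", "page"),
    ("u1", 100, "/contact.aspx", "page"),   # same session, second view
    ("u1", 200, "/Contact.aspx", "page"),   # new session, different case
    ("u2", 100, "/contact.aspx", "page"),   # same visitid, different visitor
    ("u2", 100, "/contact.aspx", "event"),  # not a pageview
    ("u2", 100, "/home", "page"),           # different page
]

def pageview_counts(rows):
    """Return (total pageviews, unique pageviews) for the contact page."""
    total = 0
    sessions = set()
    for visitor, visit, path, hit_type in rows:
        if hit_type == "page" and PAGE.search(path):
            total += 1
            sessions.add((visitor, visit))
    return total, len(sessions)

print(pageview_counts(hits))  # (4, 3)
```

The total is a plain sum, so it is cheap to compute exactly. The unique count needs a distinct operation over session keys, which is where approximation or sampling can creep in on either side.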
I would love to know what others think or what the Google Developers themselves can explain.
If you are a premium customer I am assuming that's because you have a large website with a lot of data. The Google Analytics API will sample your data if there's too much of it. This is something you can try and prevent by putting the sampling level up. Even with the sampling level set to high precision you will still get sampled data back from the API.
Check the JSON coming back from the API; it will tell you if your data is being sampled.
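As a rough illustration, the check could look like the sketch below. It assumes the Core Reporting API v3 response shape, where containsSampledData is a boolean and sampleSize and sampleSpace are numeric strings. The payload is a trimmed, made-up example, and other API versions report sampling differently.

```python
import json

# Trimmed, made-up response body; only the sampling-related fields are shown.
raw = '{"containsSampledData": true, "sampleSize": "499630", "sampleSpace": "15328013"}'

def sampling_fraction(body: str):
    """Return the sampled fraction of sessions, or None if the data is unsampled."""
    data = json.loads(body)
    if not data.get("containsSampledData", False):
        return None
    return int(data["sampleSize"]) / int(data["sampleSpace"])

fraction = sampling_fraction(raw)
if fraction is None:
    print("report is unsampled")
else:
    print(f"report is based on about {fraction:.1%} of sessions")
```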
BigQuery won't sample your data. There is a way for premium customers to use the API without sampling, but I think you have to contact Google about setting that up.
The bigger point in BigQuery's favor is that you aren't limited to 7 dimensions and 10 metrics like you are with the Google Analytics API.
Note: I am not a Google Developer but I am a Google Developer Expert for Google Analytics.
I am a big fan of BigQuery. I have also used Google Analytics quite a lot. So the question is about where the data is more accurate.
Well, the answer to such a question is always: "data is more accurate, the closer it is to where it originates". BigQuery is an underlying storage of all of Google's data. This is where data is collected, indexed, and then made accessible through a SQL interface.
Google Analytics is a tool that was developed with a lot of free accounts in mind. To support free accounts, GA needed to scale well. To scale, companies optimize on storage by pre-aggregating data.
So you are really comparing two things: pre-summarized/pre-aggregated data (GA) and raw accumulated data (BigQuery). Which would you trust?
Now, it sounds like there is also a 2nd question: "how to get accurate aggregates from BigQuery?" BigQuery is full on ANSI incompatible SQL that is hard to remember for ad-hoc queries. You are better off connecting a BI tool on top of BigQuery, so that you can explore data in a consistent manner (i.e. same threshold/rounding).