About 1-2% of the Firebase Analytics events exported to BigQuery seem to be duplicates. What are the best practices for removing them?
At the moment the client does not send a per-session counter with the events. A counter would provide an unambiguous way of removing duplicate events, so I recommend that Firebase implement one. Until then, what would be a good way to remove the duplicates? Look at the user_pseudo_id, event_timestamp, and event_name fields and keep only one row for each identical triple?
How does the event_bundle_sequence_id field work? Will duplicates have the same value in this field, or different values? That is, are duplicate events sent within the same bundle, or in different bundles?
Is Firebase planning to remove these duplicates earlier in the processing, either for Firebase Analytics itself, or in the export to BigQuery?
Standard SQL to check for duplicates in one day's events:
with n_dups as
(
SELECT event_name, event_timestamp, user_pseudo_id, count(1)-1 as n_duplicates
FROM `project.dataset.events_20190610`
group by event_name, event_timestamp, user_pseudo_id
)
select n_duplicates, count(1) as n_cases
from n_dups
group by n_duplicates
order by n_cases desc
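For anyone who wants to sanity-check the logic on a small extract outside BigQuery, here is a rough Python sketch of the same histogram. The sample rows and the function name `duplicate_histogram` are made up for illustration; only the grouping on (event_name, event_timestamp, user_pseudo_id) comes from the query above.

```python
from collections import Counter

# Hypothetical sample rows: (event_name, event_timestamp, user_pseudo_id).
events = [
    ("login", 1666529002225262, "u1"),
    ("login", 1666529002225262, "u1"),   # exact repeat of the row above
    ("logout", 1666529009000000, "u1"),
    ("login", 1666529002225262, "u2"),   # same time, different user: not a duplicate
]

def duplicate_histogram(rows):
    """Map n_duplicates -> n_cases, mirroring the SQL check."""
    per_key = Counter(rows)                           # GROUP BY the triple
    return Counter(n - 1 for n in per_key.values())   # count(1) - 1 per group

histogram = duplicate_histogram(events)
print(dict(histogram))
```

A key of 0 in the result means the triple occurred exactly once; any higher key is the number of extra copies.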
We use the QUALIFY clause to deduplicate Firebase events in BigQuery:
SELECT
*
FROM
`project.dataset.events_*`
QUALIFY
ROW_NUMBER() OVER (
PARTITION BY
user_pseudo_id,
event_name,
event_timestamp,
TO_JSON_STRING(event_params)
) = 1
Qualifying columns:
- name: user_pseudo_id
description: Autogenerated pseudonymous ID for the user -
Unique identifier for a specific installation of application on a client device,
e.g. "938642951.1666427135".
All events generated by that device will be tagged with this pseudonymous ID,
so that you can relate events from the same user together.
- name: event_name
description: Event name, e.g. "app_launch", "session_start", "login", "logout" etc.
- name: event_timestamp
description: The time (in microseconds, UTC) at which the event was logged on the client,
e.g. "1666529002225262".
- name: event_params
description: A repeated record (ARRAY) of the parameters associated with this event.
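The same keep-one-row-per-key idea, as a minimal Python sketch for checking a small sample locally. The helper name `dedupe_events` and the sample rows are assumptions for illustration, and `json.dumps` only stands in for TO_JSON_STRING. One difference to be aware of: this sketch keeps the first row it sees, whereas ROW_NUMBER() without an ORDER BY does not guarantee which of the identical rows is kept.

```python
import json

def dedupe_events(rows):
    """Keep one row per (user_pseudo_id, event_name, event_timestamp, params)."""
    seen = set()
    unique = []
    for row in rows:
        key = (
            row["user_pseudo_id"],
            row["event_name"],
            row["event_timestamp"],
            json.dumps(row["event_params"]),  # stand-in for TO_JSON_STRING
        )
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical sample data shaped like the export's event_params array.
params_a = [{"key": "ga_session_id", "value": {"int_value": 1666529000}}]
params_b = [{"key": "ga_session_id", "value": {"int_value": 1666530000}}]
base = {
    "user_pseudo_id": "938642951.1666427135",
    "event_name": "login",
    "event_timestamp": 1666529002225262,
}
rows = [
    {**base, "event_params": params_a},
    {**base, "event_params": params_a},  # full duplicate, dropped
    {**base, "event_params": params_b},  # same triple, different params, kept
]

deduped = dedupe_events(rows)
print(len(deduped))
```

Including the serialized parameters in the key is what keeps two genuinely different events that happen to share a user, name, and timestamp.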
Version 8.12.1 of the Firebase Apple SDK is associated with an issue in which session_start events are logged during app prewarming on iOS 15+, inserting additional 'session_start' events. I've noticed that these additional session_start rows insert additional 'ga_session_id' values into the BigQuery table.
ga_session_id is a unique session identifier attached to every event in a session, so a new one is created whenever this additional session_start fires during app prewarming. Using the session_number field and a calculated session length, it's possible to remove sessions with just one session_start and a small session length, but this does not seem to reduce the overall session count by much.
This has impacted the reported number of sessions when querying the BigQuery table and counting distinct user_pseudo_id||ga_session_id.
Is there a way to isolate these sessions in a separate table, or to exclude them with an additional clause in the query, so that sessions which are not truly sessions are removed?
https://github.com/firebase/firebase-ios-sdk/issues/6161
https://firebase.google.com/support/release-notes/ios
A simplified version of said query I'm using:
with windowTemp as
(
select
PARSE_DATE("%Y%m%d",event_date) as date_formatted,
event_name,
user_pseudo_id,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_id
from
`firebase-XXXX.analytics_XXX.events_*`
where
_table_suffix between '20210201' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
group by
1,2,3,4
)
SELECT
date_formatted,
Count(DISTINCT user_pseudo_id) AS users,
Count(DISTINCT Concat(user_pseudo_id, CAST(session_id AS STRING))) AS sessions,
FROM
windowTemp
GROUP by 1
ORDER BY 1
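To make the filtering heuristic described above concrete (drop sessions that contain nothing but session_start and are very short), here is a rough Python sketch. The threshold `MAX_PREWARM_LENGTH_US`, the function name `real_sessions`, and the sample events are all assumptions for illustration. This is only the heuristic from the question, not a confirmed way to identify prewarm sessions, and as noted it may not remove many of them.

```python
from collections import defaultdict

# Hypothetical cutoff: sessions shorter than this (in microseconds) count as "small".
MAX_PREWARM_LENGTH_US = 1_000_000

def real_sessions(events):
    """Return (user_pseudo_id, ga_session_id) keys that pass the heuristic."""
    by_session = defaultdict(list)
    for e in events:
        by_session[(e["user_pseudo_id"], e["ga_session_id"])].append(e)

    kept = set()
    for key, evts in by_session.items():
        names = {e["event_name"] for e in evts}
        stamps = [e["event_timestamp"] for e in evts]
        length = max(stamps) - min(stamps)
        only_start = names == {"session_start"}
        # Drop only sessions that are both start-only and very short.
        if not (only_start and length <= MAX_PREWARM_LENGTH_US):
            kept.add(key)
    return kept

# Hypothetical sample events.
events = [
    {"user_pseudo_id": "u1", "ga_session_id": 100, "event_name": "session_start", "event_timestamp": 0},
    {"user_pseudo_id": "u1", "ga_session_id": 100, "event_name": "screen_view", "event_timestamp": 5_000_000},
    {"user_pseudo_id": "u1", "ga_session_id": 200, "event_name": "session_start", "event_timestamp": 9_000_000},
    {"user_pseudo_id": "u2", "ga_session_id": 100, "event_name": "session_start", "event_timestamp": 0},
    {"user_pseudo_id": "u2", "ga_session_id": 100, "event_name": "user_engagement", "event_timestamp": 2_000_000},
]

kept = real_sessions(events)
print(sorted(kept))
```

In BigQuery the equivalent would be a per-session aggregate followed by a filter on that aggregate, with the surviving keys joined back to the session count.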
We have configured a link between the GA4 property and Google BigQuery via the GA interface (without any additional code). It works fine and we see the migrated data in the BigQuery tables; however, we face an issue with how this data is written to those tables.
If we look at any table, we can see that events from different users can be recorded in one session (with different clientIDs and even different userIDs, which we pass when authorizing a user).
An example is the result of executing the following query:
SELECT
event_name,
user_pseudo_id,
user_id,
device.category,
device.mobile_brand_name,
device.mobile_model_name,
device.operating_system_version,
geo.region,
geo.city,
params.key,
params.value.int_value
FROM `%project_name%.analytics_256374149.events_20210331`, unnest(event_params) AS params
WHERE event_name="page_view"
AND params.value.int_value=1617218965
ORDER BY event_timestamp
As a result, you can see that within one session different users from different regions, with different devices and identifiers are combined. It is, of course, impossible to use such data for reporting purposes. Once again, it is a default GA4 → BigQuery setup in the GA4 interface (no add-ons).
We do not understand what the error is (in import, in requests, or somewhere else) and would like to get advice on this issue.
Thanks.
You should look at the combination of user_pseudo_id and the event_param ga_session_id. This combination is unique and used for measuring unique sessions across a property.
For example, this query counts the number of unique event names in each session:
SELECT
user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
COUNT(DISTINCT event_name) AS unique_event_name_count
FROM `<project>.<dataset>.events_*`
GROUP BY user_pseudo_id, ga_session_id
I checked the official docs here:
https://support.google.com/firebase/answer/7061705?hl=en
But when I checked my data, there were several user_id and user_pseudo_id values within one ga_session_id.
How is it possible?
ga_session_id is not supposed to be globally unique (afaik it's based on a skewed in-device timestamp) but in most circumstances (except for edge cases) it should be locally unique for a given user_pseudo_id
session_id is not unique (two or more users can have the same session_id).
It is just the timestamp of when the session started, so we need to concatenate session_id with user_pseudo_id or user_id.
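A tiny Python sketch of why the combined key matters. The rows are invented for illustration (the session id value is the one from the question); the point is only that two users whose sessions start in the same second share a ga_session_id.

```python
# Hypothetical sample rows: two users share one ga_session_id.
events = [
    {"user_pseudo_id": "u1", "ga_session_id": 1617218965},
    {"user_pseudo_id": "u2", "ga_session_id": 1617218965},  # same start time, other user
    {"user_pseudo_id": "u1", "ga_session_id": 1617218965},
    {"user_pseudo_id": "u1", "ga_session_id": 1617222000},
]

# Counting on ga_session_id alone merges sessions of different users.
by_id_only = {e["ga_session_id"] for e in events}

# Counting on the (user_pseudo_id, ga_session_id) pair keeps them apart.
by_user_and_id = {(e["user_pseudo_id"], e["ga_session_id"]) for e in events}

print(len(by_id_only), len(by_user_and_id))
```

Using a tuple (or, in SQL, grouping by both columns) also avoids any ambiguity that plain string concatenation without a separator could introduce.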
I think the problem is the query used. With the following query I see the unique values associated correctly in my data:
SELECT event_timestamp, user_id, user_pseudo_id, event_name, event_params.value.int_value AS session_id
FROM `MYTABLE`,
UNNEST (event_params) AS event_params
WHERE event_params.key = "ga_session_id" LIMIT 1000
I'm trying to calculate average session length using BigQuery for my Firebase + Unity setup.
I followed tutorials for default Unity setup. I can gather data, and see where new sessions begin.
However, I can't seem to find proper session length. I'm able to gather the time between sessions, however I can't seem to find an event which signals sessions expiring (I know they do after 30 minutes of inactivity).
My alternative path has proven a bit difficult...I attempted to get the last interaction event when a session starts, and subtract the event_previous_timestamp from it, no luck because session_start isn't actually the first event thrown when starting a new session!
Here is the query I attempted:
#standardsql
SELECT event_name, session_length, time_between_sessions
FROM
(SELECT user_pseudo_id, event_name, event_timestamp,
event_previous_timestamp,
LAG(event_timestamp, 1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS last_triggered_event,
(LAG(event_timestamp, 1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) - event_previous_timestamp) / 60000000 AS session_length,
(event_timestamp - event_previous_timestamp) / 60000000 AS time_between_sessions
FROM `insertyourtablename`
ORDER BY event_timestamp)
WHERE
event_name = "session_start"
I hope there is an easier way to do this, or I'm close! Thank you :)
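One way to think about it without relying on session_start at all: split each user's events wherever the gap exceeds the 30-minute inactivity timeout mentioned above, then take last-minus-first timestamp per chunk. The sketch below does that in Python on invented data. The function name `session_lengths_minutes` and the sample timestamps are assumptions, and this gap-based reconstruction is only an approximation of how Firebase itself delimits sessions.

```python
from collections import defaultdict

SESSION_TIMEOUT_US = 30 * 60 * 1_000_000  # 30 minutes, in microseconds
MINUTE_US = 60_000_000

def session_lengths_minutes(events):
    """events: iterable of (user_pseudo_id, event_timestamp in microseconds)."""
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    lengths = []
    for stamps in by_user.values():
        stamps.sort()
        start = prev = stamps[0]
        for ts in stamps[1:]:
            if ts - prev > SESSION_TIMEOUT_US:
                # Gap too large: close the current session, start a new one.
                lengths.append((prev - start) / MINUTE_US)
                start = ts
            prev = ts
        lengths.append((prev - start) / MINUTE_US)  # close the last session
    return lengths

# Hypothetical sample: u1 has two sessions, u2 has one.
events = [
    ("u1", 0), ("u1", 5 * MINUTE_US), ("u1", 10 * MINUTE_US),
    ("u1", 50 * MINUTE_US), ("u1", 52 * MINUTE_US),
    ("u2", 0), ("u2", 3 * MINUTE_US),
]

lengths = session_lengths_minutes(events)
average = sum(lengths) / len(lengths)
print(lengths, average)
```

In SQL the same shape would be a LAG over event_timestamp to flag session boundaries, a running sum of those flags to number the sessions, and then MAX minus MIN of event_timestamp per user and session number.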
We are validating a query in BigQuery and cannot get the results to match the Google Analytics UI. A similar question can be found here, but in our case the mismatch only occurs when we apply a specific filter on ecommerce_action.action_type.
Here is the query:
SELECT COUNT(distinct fullVisitorId+cast(visitid as string)) AS sessions
FROM (
SELECT
device.browserVersion,
geoNetwork.networkLocation,
geoNetwork.networkDomain,
geoNetwork.city,
geoNetwork.country,
geoNetwork.continent,
geoNetwork.region,
device.browserSize,
visitNumber,
trafficSource.source,
trafficSource.medium,
fullvisitorId,
visitId,
device.screenResolution,
device.flashVersion,
device.operatingSystem,
device.browser,
totals.pageviews,
channelGrouping,
totals.transactionRevenue,
totals.timeOnSite,
totals.newVisits,
totals.visits,
date,
hits.eCommerceAction.action_type
FROM
(select *
from TABLE_DATE_RANGE([zzzzzzzzz.ga_sessions_],
<range>) ))t
WHERE
hits.eCommerceAction.action_type = '2' and <stuff to remove bots>
)
From the UI using the built in shopping behavior report, we get 3.836M unique sessions with a product detail view, compared with 3.684M unique sessions in Big Query using the query above.
A few questions:
1) We are under the impression that the shopping behavior report's "Sessions with Product View" breakdown is based on the ecommerce_action.action_type filter. Is that true?
2) Is there a .totals pre-aggregated table that the UI may be pulling from?
It sounds like the issue is that COUNT(DISTINCT ...) is approximate when using legacy SQL, as noted in the migration guide, so the counts are not accurate. Either use standard SQL instead (preferred) or use EXACT_COUNT_DISTINCT with legacy SQL.
You're including product list views in your query.
As described in https://support.google.com/analytics/answer/3437719 you need to make sure that no product has isImpression = TRUE, because that would mean it is a product list view.
This query sums all sessions which contain any hit with action_type='2' for which all isImpression values are null or false:
SELECT
SUM(totals.visits) AS sessions
FROM
`project.123456789.ga_sessions_20180101` AS t
WHERE
(
SELECT
LOGICAL_OR(h.ecommerceaction.action_type='2')
FROM
t.hits AS h
WHERE
(SELECT LOGICAL_AND(isimpression IS NULL OR isimpression = FALSE) FROM h.product))
For legacySQL you can adapt the example in the documentation.
In addition to the fact that COUNT(DISTINCT ...) is approximate when using legacy SQL, there could be sessions that contain only non-interactive hits. These are not counted as sessions in the Google Analytics UI, but they are counted by both COUNT(DISTINCT ...) and EXACT_COUNT_DISTINCT(...) because your query counts visit IDs.
Using SUM(totals.visits) you should get the same result as in the UI because SUM does not take into account NULL values of totals.visits (corresponding to sessions in which there are only non-interactive hits).
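A small Python sketch of the difference, on invented rows. Here `visits` stands for totals.visits and `None` plays the role of SQL NULL; the rows themselves are hypothetical.

```python
# Hypothetical session rows; visits is None when the session
# contains only non-interactive hits.
sessions = [
    {"fullVisitorId": "a", "visitId": 1, "visits": 1},
    {"fullVisitorId": "a", "visitId": 2, "visits": None},
    {"fullVisitorId": "b", "visitId": 1, "visits": 1},
]

# Counting distinct visitor/visit ids includes the non-interactive session.
distinct_ids = len({(s["fullVisitorId"], s["visitId"]) for s in sessions})

# Summing visits skips NULLs, as SQL SUM does.
summed_visits = sum(s["visits"] for s in sessions if s["visits"] is not None)

print(distinct_ids, summed_visits)
```

The gap between the two numbers is the number of sessions with only non-interactive hits in this toy data.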