Streaming Google Analytics 4 data to BigQuery causing data collection issues - google-analytics

We have configured a link between the GA4 property and Google BigQuery via the GA interface (without any additional code). The export works and we can see the data in the BigQuery tables; however, we have an issue with how this data is written to those tables.
If we look at any table, we can see that events from different users are recorded within one session (with different clientIDs, and even different userIDs, which we pass when a user signs in). See the example below.
This is the result of executing the following query:
SELECT
event_name,
user_pseudo_id,
user_id,
device.category,
device.mobile_brand_name,
device.mobile_model_name,
device.operating_system_version,
geo.region,
geo.city,
params.key,
params.value.int_value
FROM `%project_name%.analytics_256374149.events_20210331`, unnest(event_params) AS params
WHERE event_name="page_view"
AND params.value.int_value=1617218965
ORDER BY event_timestamp
As a result, you can see that within one session, different users from different regions, with different devices and identifiers, are combined. Such data is, of course, unusable for reporting purposes. Once again, this is the default GA4 → BigQuery setup from the GA4 interface (no add-ons).
We do not understand where the error is (in the import, in the queries, or somewhere else) and would like advice on this issue.
Thanks.

You should look at the combination of user_pseudo_id and the event parameter ga_session_id. This combination is unique and is used for measuring unique sessions across a property.
For example, this query counts the number of unique event names in each session:
SELECT
user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
COUNT(DISTINCT event_name) AS unique_event_name_count
FROM `<project>.<dataset>.events_*`
GROUP BY user_pseudo_id, ga_session_id

Related

Is there a way to deduplicate sessions from the app-warming issue in Firebase < 8.11.0 via BigQuery?

In version 8.12.1 of the Firebase Apple SDK, an issue with session_start events being logged during app prewarming on iOS 15+ was addressed; affected versions insert additional 'session_start' events. I've noticed that, as a result, these additional session_start rows insert additional 'ga_session_id' values into the BigQuery table.
ga_session_id is a unique session identifier associated with each event that occurs within a session, and a new one is thus created when this additional session_start fires during app prewarming. Using the session_number field and calculating session length, it's possible to remove sessions with just one session_start and a small session length, but this does not seem to reduce the overall count of sessions by much.
This has impacted the reported number of sessions when querying the BigQuery table and counting distinct user_pseudo_id||ga_session_id.
Is there a way to isolate these sessions in a separate table, or to exclude them from the query with an additional clause, so as to remove the sessions which are not truly sessions?
https://github.com/firebase/firebase-ios-sdk/issues/6161
https://firebase.google.com/support/release-notes/ios
A simplified version of the query I'm using:
with windowTemp as
(
select
PARSE_DATE("%Y%m%d",event_date) as date_formatted,
event_name,
user_pseudo_id,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_id
from
`firebase-XXXX.analytics_XXX.events_*`
where
_table_suffix between '20210201' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
group by
1,2,3,4
)
SELECT
date_formatted,
Count(DISTINCT user_pseudo_id) AS users,
Count(DISTINCT Concat(user_pseudo_id, CAST(session_id AS STRING))) AS sessions
FROM
windowTemp
GROUP by 1
ORDER BY 1
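The filtering approach described in the question (drop sessions that contain only a lone session_start and last almost no time) can be sketched as an exclusion step. This is only an illustration: the one-second threshold and the "only session_start events" heuristic are assumptions, not a documented fix.

```sql
-- Sketch: flag sessions whose only activity is session_start within a very
-- short window, then exclude them. Threshold values are assumptions.
WITH sessions AS (
  SELECT
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
    COUNTIF(event_name = 'session_start') AS session_starts,
    COUNT(*) AS event_count,
    (MAX(event_timestamp) - MIN(event_timestamp)) / 1e6 AS session_length_seconds
  FROM `firebase-XXXX.analytics_XXX.events_*`
  GROUP BY user_pseudo_id, session_id
)
SELECT user_pseudo_id, session_id
FROM sessions
-- keep only sessions that are not "session_start-only and near-instant"
WHERE NOT (event_count = session_starts AND session_length_seconds < 1)
```

The surviving (user_pseudo_id, session_id) pairs can then be joined back to the events table, or the exclusion folded into the windowTemp CTE above.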

How to write a query to find Firebase event details for the last 28 days using BigQuery, with platform, stream_id, and event_name filters, in Power BI?

Firebase console output for first_open: 8,787 / 8,575.
I'm trying to create a query to get the event details using BigQuery, but it does not produce the same result.
My Query is
select platform, count(s.platform) from (SELECT * FROM `Table.events_*` where event_name = "first_open" and stream_id = "1757261196" or stream_id = "1759866139"
UNION ALL
SELECT * FROM `Table.events_intraday_*` where event_name = "first_open" and stream_id = "1757261196" or stream_id = "1759866139" ) s where and event_date between "20191204" and "20200101" group by s.platform
My filter is
Stream_id = ["1757261196","1759866139"]
platform = ["ios","android"]
dateRanges = last 28days
event_name = first_open
BigQuery Result:
[
{
"platform": "ANDROID",
"f0_": "428"
},
{
"platform": "IOS",
"f0_": "38"
}
]
But the Firebase console output and the BigQuery output are different. I think it is due to a query issue; please help me write the correct query.
Your query has some issues in the WHERE statement (note the stray AND right after WHERE), so I'm not even sure the query you shared runs.
From what I can observe, your WHERE statement also has a problem with the precedence of the AND and OR operators.
What you have is:
SELECT * FROM my_table WHERE event_name = "first_open" AND stream_id = "1757261196" OR stream_id = "1759866139"
This returns the union of two sets:
One with event_name = "first_open" and stream_id = "1757261196"
A second one with stream_id = "1759866139", regardless of event_name
This means that the first condition is not applied to the second set.
I recommend using the following structure:
SELECT * FROM my_table WHERE event_name = "first_open" AND (stream_id = "1757261196" OR stream_id = "1759866139")
This way you group together the stream_id conditions, which are the only ones that should be affected by the OR operator, and the first condition is always applied.
After this take a good look at the final WHERE:
...) s where and event_date between "20191204" and "20200101" group by s.platform
This may not work as you expect because of that column's data type and how you are passing the values to BETWEEN. Make sure it is a DATE type and in the same format; you can always cast the values with DATE() if it is something else.
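Putting both fixes together, a corrected version of the query might look like this (a sketch using the table names from the question; if the events_* wildcard also matches the intraday tables, you may additionally need to restrict _TABLE_SUFFIX to avoid double counting):

```sql
SELECT s.platform, COUNT(s.platform) AS first_opens
FROM (
  SELECT * FROM `Table.events_*`
  WHERE event_name = "first_open"
    AND (stream_id = "1757261196" OR stream_id = "1759866139")
  UNION ALL
  SELECT * FROM `Table.events_intraday_*`
  WHERE event_name = "first_open"
    AND (stream_id = "1757261196" OR stream_id = "1759866139")
) s
-- event_date in the export is a STRING in YYYYMMDD format,
-- so a lexicographic BETWEEN on these literals works
WHERE s.event_date BETWEEN "20191204" AND "20200101"
GROUP BY s.platform
```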
EDIT:
After you link a project to BigQuery, the first daily export of events
creates a corresponding dataset in the associated BigQuery project.
Then, each day, raw event data for each linked app populates a new
daily table in the associated dataset, and raw event data is streamed
into a separate intraday BigQuery table in real-time. Data prior to
linking to BigQuery is not available for import (except for
Performance Monitoring data). By default, all web data from your App +
Web properties in Google Analytics will be exported as well.
Source
The query seems to be OK, there is only one more consideration to make:
Be careful when you use wildcards: Table.events_* also matches Table.events_intraday_* if they are in the same dataset in BigQuery. This can lead to duplicated data in your query and cause a mismatch in your counts.
Besides that, assuming the issue is not the query itself, I recommend the following steps:
Verify that the tables for every single day you are querying exist in BigQuery. The smaller count looks like a single day's worth versus the larger amount.
Validate that the BigQuery tables contain the same data as the "Events" dataset in Firebase; you could be comparing two different datasets, in which case the numbers will never match.
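One way to avoid the wildcard overlap (a sketch, not the only option): since the intraday tables have suffixes like intraday_20191204, a lexicographic _TABLE_SUFFIX filter on date strings excludes them automatically.

```sql
SELECT platform, COUNT(*) AS first_opens
FROM `Table.events_*`
-- "intraday_..." suffixes sort after digit-only dates,
-- so this BETWEEN keeps only the daily tables
WHERE _TABLE_SUFFIX BETWEEN "20191204" AND "20200101"
  AND event_name = "first_open"
  AND (stream_id = "1757261196" OR stream_id = "1759866139")
GROUP BY platform
```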

How to get gender and age in BigQuery from Firebase Analytics?

I am using Analytics events and trying to take advantage of the user data.
I can get quite a lot of data with this query:
SELECT
*
FROM
`test-project-23471.analytics_205774787.events_20191120`,
UNNEST(event_params) AS event_params
WHERE
event_name ='select_content'
AND event_params.value.string_value = 'a_item_open'
However, I don't need all of it, so I did:
SELECT
event_params.value.string_value,
event_previous_timestamp,
device,
geo,
app_info
FROM
`test-project-23471.analytics_205774787.events_20191120`,
UNNEST(event_params) AS event_params
WHERE
event_name ='select_content'
AND event_params.value.string_value = 'a_item_open'
Then I realized that the result doesn't include gender or age data, although the documentation says Firebase collects this information automatically. I'd like to combine gender and age (or age group) with the result of the query above.
How can I get it?
Note that that document is just an example of how to query BigTable data using BigQuery; it is not about Firebase data.
The Firebase layout mentions that it has a RECORD field named "user_properties" which has a "key" STRING field.
Thus, you could try:
SELECT DISTINCT up.key
FROM
`test-project-23471.analytics_205774787.events_20191120`,
UNNEST(user_properties) AS up
to retrieve the correct name of the gender/sex property, and then include it in your query. For instance:
SELECT
event_params.value.string_value,
event_previous_timestamp,
device,
geo,
app_info,
user_properties.value.string_value AS gender
FROM
`test-project-23471.analytics_205774787.events_20191120`,
UNNEST(event_params) AS event_params,
UNNEST(user_properties) AS user_properties
WHERE
event_name ='select_content'
AND event_params.value.string_value = 'a_item_open'
AND user_properties.key = "Gender"
Nevertheless, if you don't find the gender info, please consider this. Otherwise, I suggest reaching out to Firebase support.
Hope it helps.
For privacy reasons, these fields are not available in the BigQuery export. You can only see aggregated data for gender and age in the Firebase Analytics console.
You can't even use them for targeting in other Firebase features, like RemoteConfig, so user-level granularity is not possible.

Firebase events dedup in Big Query - best practices?

There seem to be 1-2% duplicates among the Firebase Analytics events exported to BigQuery. What are the best practices for removing them?
At the moment, the client does not send a per-session counter with the events. That would provide an unambiguous way of removing duplicate events, so I recommend Firebase implement it. In the meantime, what would be a good way to remove the duplicates? Look at the user_pseudo_id, event_timestamp, and event_name fields and remove all rows except one with the same triple?
How does the event_bundle_sequence_id field work? Will duplicates have the same value in this field, or different ones? That is, are duplicate events sent within the same bundle, or in different bundles?
Is Firebase planning to remove these duplicates earlier in the processing, either for Firebase Analytics itself or in the export to BigQuery?
Standard SQL to check for duplicates in one day's events:
with n_dups as
(
SELECT event_name, event_timestamp, user_pseudo_id, count(1)-1 as n_duplicates
FROM `project.dataset.events_20190610`
group by event_name, event_timestamp, user_pseudo_id
)
select n_duplicates, count(1) as n_cases
from n_dups
group by n_duplicates
order by n_cases desc
We use the QUALIFY clause for deduplicating Firebase events in BigQuery:
SELECT
*
FROM
`project.dataset.events_*`
QUALIFY
ROW_NUMBER() OVER (
PARTITION BY
user_pseudo_id,
event_name,
event_timestamp,
TO_JSON_STRING(event_params)
) = 1
Qualifying columns:
- name: user_pseudo_id
description: Autogenerated pseudonymous ID for the user -
Unique identifier for a specific installation of application on a client device,
e.g. "938642951.1666427135".
All events generated by that device will be tagged with this pseudonymous ID,
so that you can relate events from the same user together.
- name: event_name
description: Event name, e.g. "app_launch", "session_start", "login", "logout" etc.
- name: event_timestamp
description: The time (in microseconds, UTC) at which the event was logged on the client,
e.g. "1666529002225262".
- name: event_params
description: A repeated record (ARRAY) of the parameters associated with this event.

Big Query and Google Analytics UI do not match when ecommerce action filter applied

We are validating a query in Big Query and cannot get the results to match the Google Analytics UI. A similar question can be found here, but in our case the mismatch only occurs when we apply a specific filter on ecommerce_action.action_type.
Here is the query:
SELECT COUNT(distinct fullVisitorId+cast(visitid as string)) AS sessions
FROM (
SELECT
device.browserVersion,
geoNetwork.networkLocation,
geoNetwork.networkDomain,
geoNetwork.city,
geoNetwork.country,
geoNetwork.continent,
geoNetwork.region,
device.browserSize,
visitNumber,
trafficSource.source,
trafficSource.medium,
fullvisitorId,
visitId,
device.screenResolution,
device.flashVersion,
device.operatingSystem,
device.browser,
totals.pageviews,
channelGrouping,
totals.transactionRevenue,
totals.timeOnSite,
totals.newVisits,
totals.visits,
date,
hits.eCommerceAction.action_type
FROM
(select *
from TABLE_DATE_RANGE([zzzzzzzzz.ga_sessions_],
<range>) ))t
WHERE
hits.eCommerceAction.action_type = '2' and <stuff to remove bots>
)
From the UI, using the built-in shopping behavior report, we get 3.836M unique sessions with a product detail view, compared with 3.684M unique sessions in Big Query using the query above.
A few questions:
1) We are under the impression that the shopping behavior report's "Sessions with Product View" breakdown is based on the ecommerce_action.action_type filter. Is that true?
2) Is there a .totals pre-aggregated table that the UI may be pulling from?
It sounds like the issue is that COUNT(DISTINCT ...) is approximate when using legacy SQL, as noted in the migration guide, so the counts are not accurate. Either use standard SQL instead (preferred) or use EXACT_COUNT_DISTINCT with legacy SQL.
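For reference, an exact distinct-session count in standard SQL could be sketched like this (dataset placeholder taken from the question; the bot-removal conditions are omitted):

```sql
SELECT COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS sessions
FROM `zzzzzzzzz.ga_sessions_*`
WHERE EXISTS (
  -- sessions with at least one product-detail-view hit
  SELECT 1 FROM UNNEST(hits) AS h
  WHERE h.eCommerceAction.action_type = '2'
)
```

COUNT(DISTINCT ...) is exact in standard SQL, so no EXACT_COUNT_DISTINCT equivalent is needed.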
You're including product list views in your query.
As described in https://support.google.com/analytics/answer/3437719 you need to make sure, that no product has isImpression = TRUE because that would mean it is a product list view.
This query sums all sessions which contain any action_type='2' for which all isImpression values are null or false:
SELECT
SUM(totals.visits) AS sessions
FROM
`project.123456789.ga_sessions_20180101` AS t
WHERE
(
SELECT
LOGICAL_OR(h.ecommerceaction.action_type='2')
FROM
t.hits AS h
WHERE
(SELECT LOGICAL_AND(isimpression IS NULL OR isimpression = FALSE) FROM h.product))
For legacy SQL, you can adapt the example in the documentation.
In addition to the fact that COUNT(DISTINCT ...) is approximate when using legacy SQL, there could be sessions containing only non-interactive hits. These are not counted as sessions in the Google Analytics UI, but they are counted by both COUNT(DISTINCT ...) and EXACT_COUNT_DISTINCT(...), because in your query they count visit ids.
Using SUM(totals.visits) you should get the same result as in the UI, because SUM ignores NULL values of totals.visits (which correspond to sessions containing only non-interactive hits).
