Calculate nested field without losing export schema in BigQuery - google-analytics

Calculation on a field leads to a loss of the original export schema in BigQuery.
I have a standard enhanced e-commerce schema and want to change the transactionRevenue to a different currency. I want to keep the general export schema structure. The calculated field "transactionRevenueNewCurrency" should be in hits.transaction.transactionRevenueNewCurrency.
#standardSQL
SELECT
s.*,
ARRAY(SELECT COALESCE( x.transaction.transactionRevenue*1.17,0)
FROM UNNEST(hits) AS x) AS transactionRevenueNewCurrency
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` as s , UNNEST(hits) as h
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20160831'
AND transaction.transactionRevenue >0
LIMIT 10000
The new field is attached to the session instead of to each hit.

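The query below is for BigQuery Standard SQL. It rewrites hits.transaction.transactionRevenue in place with nested SELECT AS STRUCT * REPLACE, so every other field keeps its name and position in the export schema.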
#standardSQL
SELECT * REPLACE(
ARRAY(
SELECT AS STRUCT * REPLACE(
(SELECT AS STRUCT * REPLACE(
COALESCE(CAST(transactionRevenue * 1.17 AS INT64), 0
) AS transactionRevenue)
FROM UNNEST([transaction])
) AS transaction)
FROM UNNEST(hits) hit
) AS hits)
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20160831'
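To see why this keeps the schema intact, here is a minimal Python sketch of what the nested REPLACE does to a single session row. It is not BigQuery code: the row is modelled as plain dicts and lists, and the helper name convert_revenue and the sample values are made up for illustration. Only the 1.17 factor comes from the query above.

```python
import copy

RATE = 1.17  # conversion factor taken from the query above


def convert_revenue(session, rate=RATE):
    """Return a copy of the session with each hit's transactionRevenue converted.

    Mirrors SELECT * REPLACE(...): only one leaf value changes, every other
    field keeps its name and position.
    """
    result = copy.deepcopy(session)
    for hit in result["hits"]:
        transaction = hit["transaction"]
        revenue = transaction["transactionRevenue"]
        # COALESCE(CAST(revenue * rate AS INT64), 0): a NULL revenue becomes 0.
        # round() stands in for the CAST; tie-breaking may differ from BigQuery.
        transaction["transactionRevenue"] = 0 if revenue is None else round(revenue * rate)
    return result


# Hypothetical session with one hit without a transaction and one with.
session = {
    "fullVisitorId": "123",
    "hits": [
        {"hitNumber": 1, "transaction": {"transactionId": None, "transactionRevenue": None}},
        {"hitNumber": 2, "transaction": {"transactionId": "T1", "transactionRevenue": 100000000}},
    ],
}
converted = convert_revenue(session)
print(converted["hits"][1]["transaction"])
```

The input row is left untouched and the output has the same keys at every level, which is the property the asker wanted to keep.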

Related

How to query Direct returning visitor in BigQuery

I am trying to figure out how many users returned as Direct users after visiting the website as Organic, using BigQuery.
This is what I did so far. To get the number of users who came back as Direct after visiting as Organic, I used
organic_user.visitNumber < direct_user.visitNumber
in the WHERE clause.
SELECT
organic_user.date,
COUNT (DISTINCT direct_user.fullVisitorId) AS return_direct_user
FROM
(
SELECT
date,
fullVisitorId,
visitNumber
FROM
`ga_sessions_*`,
UNNEST(hits) as hits
WHERE
DATE BETWEEN '20190814'
AND '20190911'
AND channelGrouping = 'Organic Search'
) AS organic_user
INNER JOIN (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`ga_sessions_*`,
UNNEST(hits) as hits
WHERE
DATE BETWEEN '20190814'
AND '20190911'
AND channelGrouping = 'Direct'
) AS direct_user ON organic_user.fullVisitorId = direct_user.fullVisitorId
WHERE
organic_user.visitNumber < direct_user.visitNumber
GROUP BY
date
ORDER BY
date ASC
Could anyone verify this query is correct?
If not, could you provide a solution for this?
With all the clarifications you provided in the comments, I was able to come up with some adaptations of your original query:
SELECT
direct_user.date,
COUNT (DISTINCT direct_user.fullVisitorId) AS return_direct_user
FROM (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`bigquery-public-data`.google_analytics_sample.`ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
DATE BETWEEN '20161214'
AND '20180911'
AND channelGrouping = 'Organic Search' ) AS organic_user
INNER JOIN (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`bigquery-public-data`.google_analytics_sample.`ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
DATE BETWEEN '20161214'
AND '20180911'
AND channelGrouping = 'Direct' ) AS direct_user
ON
organic_user.fullVisitorId = direct_user.fullVisitorId
AND organic_user.visitNumber < direct_user.visitNumber
GROUP BY
direct_user.date
ORDER BY
direct_user.date ASC
Here are some considerations about the changes I made:
I noticed it was important to specify which subquery's date we are grouping by. Since we are counting 'Direct' visits per day, it makes sense to count them on the day they happen, so the query groups by direct_user.date.
I moved the organic_user.visitNumber < direct_user.visitNumber condition to the JOIN clause. I know that for INNER JOINs it makes no technical difference, but for semantic reasons I thought it belongs there.
I hope this helps.
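As a sanity check on the join logic, here is a small Python sketch of the same idea on made-up session rows. The tuples, user ids, and the function name returning_direct_users are hypothetical; the two conditions inside the loop are the ones from the ON clause above.

```python
from collections import defaultdict

# Hypothetical session rows: (date, fullVisitorId, visitNumber, channelGrouping).
sessions = [
    ("20170101", "A", 1, "Organic Search"),
    ("20170103", "A", 2, "Direct"),
    ("20170103", "B", 1, "Direct"),          # never came via Organic Search
    ("20170102", "C", 1, "Direct"),
    ("20170104", "C", 2, "Organic Search"),  # Organic visit came after Direct
    ("20170103", "D", 1, "Organic Search"),
    ("20170103", "D", 2, "Direct"),
]


def returning_direct_users(rows):
    """Count distinct users per Direct-visit date who had an earlier Organic visit."""
    organic = [r for r in rows if r[3] == "Organic Search"]
    direct = [r for r in rows if r[3] == "Direct"]
    users_by_date = defaultdict(set)
    for d_date, d_user, d_visit, _ in direct:
        for _, o_user, o_visit, _ in organic:
            # Same two conditions as the ON clause of the INNER JOIN.
            if o_user == d_user and o_visit < d_visit:
                users_by_date[d_date].add(d_user)
    # COUNT(DISTINCT ...) grouped by the Direct visit's date.
    return {day: len(users) for day, users in sorted(users_by_date.items())}


result = returning_direct_users(sessions)
print(result)
```

Only users A and D are counted: B never had an Organic visit, and C's Organic visit has a higher visitNumber than the Direct one.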

Firebase BigQuery schema migration: Move into a partitioned table?

I got the email with instructions to migrate my previous Firebase tables in BigQuery to the new schema. They point to these instructions:
https://support.google.com/analytics/answer/7029846?#migrationscript
But I'd prefer two things:
Run a single query that performs the migration, instead of running a bash script.
Move all the previous results into one new date-partitioned table, instead of creating a number of new tables.
I took the script on the documentation and made some changes.
Look at all the --Fh comments. Those are my modifications.
Choose your destination table.
Choose your date range for Android and iOS.
Note that I'm adding a new column with a real timestamp for partitioning (and your convenience).
Instead of getting a number of new tables, you'll only get one - but partitioned by date.
Modified script:
#standardSQL
CREATE OR REPLACE TABLE `fh-bigquery.deleting.delete`
PARTITION BY DATE(ts)
AS
WITH sources AS ( --Fh
SELECT * FROM (
SELECT *, _table_suffix event_date, 'ANDROID' operating_system
FROM `firebase-public-project.com_firebase_demo_ANDROID.app_events_*`
UNION ALL SELECT *, _table_suffix event_date, 'IOS' operating_system
FROM `firebase-public-project.com_firebase_demo_IOS.app_events_*`
)
WHERE event_date BETWEEN '20180503' AND '20180504' --Fh: choose your timerange
)
SELECT
event_date, --Fh: extracted from original table name
TIMESTAMP_MICROS(event.timestamp_micros) ts, --Fh: adding a real timestamp column
event.timestamp_micros AS event_timestamp,
event.previous_timestamp_micros AS event_previous_timestamp,
event.name AS event_name,
event.value_in_usd AS event_value_in_usd,
user_dim.bundle_info.bundle_sequence_id AS event_bundle_sequence_id,
user_dim.bundle_info.server_timestamp_offset_micros as event_server_timestamp_offset,
(
SELECT
ARRAY_AGG(STRUCT(event_param.key AS key,
STRUCT(event_param.value.string_value AS string_value,
event_param.value.int_value AS int_value,
event_param.value.double_value AS double_value,
event_param.value.float_value AS float_value) AS value))
FROM
UNNEST(event.params) AS event_param) AS event_params,
user_dim.first_open_timestamp_micros AS user_first_touch_timestamp,
user_dim.user_id AS user_id,
user_dim.app_info.app_instance_id AS user_pseudo_id,
"" AS stream_id,
user_dim.app_info.app_platform AS platform,
STRUCT( user_dim.ltv_info.revenue AS revenue,
user_dim.ltv_info.currency AS currency ) AS user_ltv,
STRUCT( user_dim.traffic_source.user_acquired_campaign AS name,
user_dim.traffic_source.user_acquired_medium AS medium,
user_dim.traffic_source.user_acquired_source AS source ) AS traffic_source,
STRUCT( user_dim.geo_info.continent AS continent,
user_dim.geo_info.country AS country,
user_dim.geo_info.region AS region,
user_dim.geo_info.city AS city ) AS geo,
STRUCT( user_dim.device_info.device_category AS category,
user_dim.device_info.mobile_brand_name,
user_dim.device_info.mobile_model_name,
user_dim.device_info.mobile_marketing_name,
user_dim.device_info.device_model AS mobile_os_hardware_model,
operating_system, --Fh
user_dim.device_info.platform_version AS operating_system_version,
user_dim.device_info.device_id AS vendor_id,
user_dim.device_info.resettable_device_id AS advertising_id,
user_dim.device_info.user_default_language AS language,
user_dim.device_info.device_time_zone_offset_seconds AS time_zone_offset_seconds,
IF(user_dim.device_info.limited_ad_tracking, "Yes", "No") AS is_limited_ad_tracking ) AS device,
STRUCT( user_dim.app_info.app_id AS id,
'app_id' AS firebase_app_id, --Fh: choose your app id
user_dim.app_info.app_version AS version,
user_dim.app_info.app_store AS install_source ) AS app_info,
( SELECT ARRAY_AGG(STRUCT(user_property.key AS key,
STRUCT(user_property.value.value.string_value AS string_value,
user_property.value.value.int_value AS int_value,
user_property.value.value.double_value AS double_value,
user_property.value.value.float_value AS float_value,
user_property.value.set_timestamp_usec AS set_timestamp_micros ) AS value))
FROM UNNEST(user_dim.user_properties) AS user_property
) AS user_properties
FROM sources -- Fh
, UNNEST(event_dim) AS event
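The table is partitioned on DATE(ts), where ts is built from the event's microsecond timestamp. DATE() on a TIMESTAMP uses UTC unless a time zone is passed, so the partition an event lands in is its UTC date. A small Python sketch of that mapping, with a hypothetical helper name and made-up timestamps around a day boundary:

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_date(timestamp_micros: int) -> date:
    """Return the UTC date that DATE(TIMESTAMP_MICROS(x)) would yield."""
    return (EPOCH + timedelta(microseconds=timestamp_micros)).date()


# Hypothetical event timestamps on either side of midnight UTC.
last_micro_of_may_3 = 1525391999999999
first_micro_of_may_4 = 1525392000000000
print(partition_date(last_micro_of_may_3))   # 2018-05-03
print(partition_date(first_micro_of_may_4))  # 2018-05-04
```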

How to flatten Google Analytics custom dimensions with a UDF in BigQuery?

Based on a post by Robert Sahlin, I want to use a BigQuery UDF to access any Google Analytics custom dimension in BigQuery by its index. In the proposed solution Robert uses a JavaScript UDF, and I'm wondering if it's possible to do the same with a SQL UDF - since a SQL UDF should perform better than a JS one.
The proposed JS UDF:
CREATE TEMPORARY FUNCTION customDimensionByIndex(index INT64, arr ARRAY<STRUCT<index INT64, value STRING>>)
RETURNS STRING
LANGUAGE js AS """
for (var j = 0; j < arr.length; j++){
if(arr[j].index == index){
return arr[j].value;
}
}
""";
SELECT
fullvisitorId,
visitId,
hit.hitnumber,
customDimensionByIndex(6, hit.customDimensions) as author,
customDimensionByIndex(7, hit.customDimensions) as category
FROM `123456.ga_sessions_YYYYMMDD`
JOIN
UNNEST(hits) as hit
With a SQL UDF:
#standardSQL
CREATE TEMP FUNCTION customDimensionByIndex(indx INT64, arr ARRAY<STRUCT<index INT64, value STRING>>) AS (
(SELECT x.value FROM UNNEST(arr) x WHERE indx=x.index)
);
SELECT
fullvisitorId,
visitId,
hit.hitnumber,
customDimensionByIndex(1, hit.customDimensions),
customDimensionByIndex(2, hit.customDimensions),
customDimensionByIndex(3, hit.customDimensions)
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`, UNNEST(hits) hit
LIMIT 1000
I'm not sure why the original solution looks at "hit" instead of the column "hits" on the sample dataset - so to get to individual hits I had to UNNEST() them too.
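One behavioural difference is worth keeping in mind. The JS UDF returns the first match, while the SQL UDF is a scalar subquery: it yields NULL when no element matches and raises a runtime error if more than one element matches. A rough Python model of the SQL version, with a hypothetical function name and sample data:

```python
def custom_dimension_by_index(index, dimensions):
    """Model of the SQL UDF: a scalar subquery over UNNEST(arr)."""
    matches = [d["value"] for d in dimensions if d["index"] == index]
    if len(matches) > 1:
        # A scalar subquery that produces more than one row is an error in BigQuery.
        raise ValueError("scalar subquery produced more than one element")
    # No matching row means the scalar subquery evaluates to NULL.
    return matches[0] if matches else None


# Hypothetical customDimensions array for one hit.
custom_dimensions = [
    {"index": 6, "value": "Jane Doe"},
    {"index": 7, "value": "Sports"},
]
author = custom_dimension_by_index(6, custom_dimensions)
missing = custom_dimension_by_index(9, custom_dimensions)
print(author, missing)
```

In the GA export an index should normally appear at most once per hit, so the error case is an edge case rather than something to expect.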

Joining to landing pages query doubles the sessions per source

I'm trying to query the sum of visits per source from a BigQuery table of Google Analytics data, but I need to filter some sessions out at landing page level. Hence I'm pre-querying visitIDs by landing page and re-joining to the session data like so:
#StandardSQL
WITH landingpages AS (
SELECT
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
# filters to be added here
)
SELECT
sessions.trafficSource.source,
SUM(sessions.totals.visits) AS visits
FROM `project.dataset.ga_sessions_*` AS sessions
JOIN
landingpages
ON
landingpages.visitID = sessions.visitID
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
GROUP BY
trafficSource.source
ORDER BY
visits DESC
This roughly doubles the number of sessions for each source compared with what GA reports.
Can anyone point out what I've done wrong? (I suspect it is blindingly obvious.)
I've tried examining the output of the first query and can't find anything wrong with it aside from a very small proportion of duplicated visitIDs. I've also tried various different types of JOIN, all to no avail.
When querying GA data in BigQuery, it's imperative to keep in mind that a unique visit is identified by the combination of fullVisitorId and visitId. Only a join on both keys will return a meaningful data set.
Here's what I should have written:
#StandardSQL
WITH landingpages AS (
SELECT
fullVisitorId,
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
),
session_data AS (
SELECT
date AS ga_date, trafficSource.source AS source, fullVisitorId, visitID, SUM(totals.visits) AS visits
FROM
`project.dataset.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
AND
totals.visits > 0
GROUP BY ga_date, source, fullVisitorId, visitID
)
SELECT
ga_date, source, SUM(visits) AS Sessions
FROM
landingpages
JOIN
session_data
ON
landingpages.VisitID = session_data.VisitID
AND
landingpages.fullVisitorId = session_data.fullVisitorId
GROUP BY
ga_date, source
ORDER BY
Sessions DESC
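The effect of joining on visitId alone can be reproduced on a few made-up rows. In this Python sketch (hypothetical data and function name), two different visitors happen to share a visitId, so a single-key join matches each of their sessions twice:

```python
from collections import Counter

# Hypothetical sessions: (fullVisitorId, visitId, source, visits).
sessions = [
    ("u1", 1001, "google", 1),
    ("u2", 1001, "google", 1),  # different visitor, same visitId
    ("u3", 1002, "bing", 1),
]
# One landing-page row per session, as the landingpages CTE would produce.
landing_pages = [(visitor, visit_id, "/home") for visitor, visit_id, _, _ in sessions]


def visits_per_source(sessions, landing_pages, use_full_key):
    """Nested-loop inner join, summing visits per source."""
    totals = Counter()
    for visitor, visit_id, source, visits in sessions:
        for lp_visitor, lp_visit_id, _page in landing_pages:
            same_visit = visit_id == lp_visit_id
            same_visitor = visitor == lp_visitor
            if same_visit and (same_visitor or not use_full_key):
                totals[source] += visits
    return dict(totals)


single_key = visits_per_source(sessions, landing_pages, use_full_key=False)
double_key = visits_per_source(sessions, landing_pages, use_full_key=True)
print(single_key)  # google is counted 4 times instead of 2
print(double_key)
```

With the composite key each session matches exactly one landing-page row, so the totals agree with the number of sessions.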

Need BigQuery SQL query to collect time on page from Google Analytics data

Can anyone help with a BigQuery SQL query to extract the time on page for a specific page from Google Analytics data, please?
For every visitorId who has visited a particular page I would like the time on page for that page. This is so that I can calculate the median time on page rather than the mean.
I'm assuming that the visitorId, hits.hitNumber and hits.time dimensions will be needed, and that the hits.time of the hit where the page was viewed will somehow need to be subtracted from the hits.time of the following hit.
Any help much appreciated.
Try this (legacy SQL):
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.time,
nextTime,
(nextTime - hits.time) as timeOnPage
FROM(
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.time,
LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.time ASC) as nextTime
FROM [PROJECTID:DATASETID.ga_sessions_YYYYMMDD]
WHERE hits.type = "PAGE"
)
The key to this code is the LEAD() function, which grabs the specified value from the next row in the partition, based on the PARTITION BY and ORDER BY qualifiers.
Hope that helps!
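For readers who want to check the LEAD() logic outside BigQuery, here is a plain-Python sketch of the same computation on made-up hits, followed by the median the question asks for. The tuples and the function name time_on_page are hypothetical; hits.time is in milliseconds since the start of the session.

```python
from itertools import groupby
from statistics import median

# Hypothetical page hits: (fullVisitorId, visitNumber, pagePath, hits.time in ms).
hits = [
    ("A", 1, "/pricing", 30000),
    ("A", 1, "/home", 0),
    ("A", 1, "/signup", 45000),
    ("A", 2, "/home", 0),
    ("A", 2, "/pricing", 10000),
    ("B", 1, "/home", 0),
]


def time_on_page(hits):
    """Emulate LEAD(time) OVER (PARTITION BY visitor, visitNumber ORDER BY time)."""
    ordered = sorted(hits, key=lambda h: (h[0], h[1], h[3]))
    rows = []
    for _, group in groupby(ordered, key=lambda h: (h[0], h[1])):
        session_hits = list(group)
        # The last hit of each session has no following row, so LEAD gives NULL.
        next_times = [h[3] for h in session_hits[1:]] + [None]
        for (visitor, visit, path, time_ms), next_ms in zip(session_hits, next_times):
            diff = None if next_ms is None else next_ms - time_ms
            rows.append((visitor, visit, path, diff))
    return rows


rows = time_on_page(hits)
home_times = [diff for _, _, path, diff in rows if path == "/home" and diff is not None]
print(rows)
print(median(home_times))
```

Visitor B's only pageview has no following hit, so it contributes no time on page and is left out of the median.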
To account for the last page, this query can be used. BigQuery has no way to calculate the time spent on the last page of a session, but this version at least returns zero there instead of null.
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hit_time,
next_pageview,
CASE
WHEN hits.isExit IS NOT NULL THEN last_interaction - hit_time
ELSE next_pageview - hit_time
END
AS time_on_page
FROM(
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.isExit,
-- hits.time is in milliseconds; convert to seconds
hits.time/1000 as hit_time,
LEAD(hits.time/1000, 1) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.time ASC) as next_pageview,
MAX(IF(hits.isInteraction = TRUE,hits.time / 1000,0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction
FROM [PROJECTID:DATASETID.ga_sessions_YYYYMMDD]
WHERE hits.type = "PAGE"
)
For Google Analytics 4, time spent on page can be computed like this:
SELECT
user_pseudo_id,
event_timestamp,
ga_session_id,
page_location,
page_title,
next_hit_in_the_same_session,
(next_hit_in_the_same_session - event_timestamp)/1000000 AS time_on_page_in_seconds
FROM (
SELECT
user_pseudo_id,
event_timestamp,
(
SELECT value.int_value FROM
UNNEST(event_params)
WHERE
key = 'ga_session_id') AS ga_session_id,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'page_location') AS page_location,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'page_title') AS page_title,
LEAD(event_timestamp) OVER (
-- ga_session_id is only unique per user, so partition by both.
PARTITION BY user_pseudo_id, (SELECT value.int_value FROM UNNEST(event_params)
WHERE
key = 'ga_session_id')
ORDER BY
event_timestamp ASC) AS next_hit_in_the_same_session
FROM
-- Replace table name.
`cloud-search.analytics_298504139.events_20220106` AS tableAlias
WHERE
event_name = 'page_view'
ORDER BY
user_pseudo_id,
ga_session_id,
event_timestamp ASC )

Resources