Unnesting the custom dimensions is duplicating/inflating transaction revenue in BigQuery - google-analytics

Unnesting hits.customDimensions and hits.product.customDimensions is inflating the transaction revenue:
SELECT
sum(totals.totalTransactionRevenue)/1000000 as revenue,
(SELECT MAX(IF(index=10,value,NULL)) FROM UNNEST(product.customDimensions)) AS product_CD10,
(SELECT MAX(IF(index=1,value,NULL)) FROM UNNEST(hits.customDimensions)) AS CD1
FROM
`XXXXXXXXXXXXXXX.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) as product
WHERE
_TABLE_SUFFIX BETWEEN "20180608"
AND "20180608"
group by product_CD10,CD1
Is there a way to get a flat table such that summing the revenue gives the correct result?

Move your UNNEST() to the top sub-queries - then the rows won't get duplicated:
SELECT row
, (SELECT MAX(letter) FROM UNNEST(row), UNNEST(qq)) max_letter
, (SELECT MAX(n) FROM UNNEST(row), UNNEST(qq), UNNEST(qb) n) max_number
FROM (
SELECT [
STRUCT(1 AS p,[STRUCT('a' AS letter, [4,5,6] AS qb)] AS qq)
, STRUCT(2,[STRUCT('b', [7,8,9])])
, STRUCT(3,[STRUCT('c', [10,11,12])])
] AS row
)
Haven't tested this, though:
SELECT
sum(totals.totalTransactionRevenue)/1000000 as revenue,
(SELECT MAX(IF(index=10,value,NULL)) FROM UNNEST(hits) AS hit, UNNEST(hit.product) product, UNNEST(product.customDimensions)) AS product_CD10,
(SELECT MAX(IF(index=1,value,NULL)) FROM UNNEST(hits) AS hit, UNNEST(hit.customDimensions)) AS CD1
FROM `XXXXXXXXXXXXXXX.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN "20180608" AND "20180608"
group by product_CD10,CD1
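To see why flattening in the FROM clause inflates session-level revenue, here is a minimal plain-Python model of a single session row. It is only a sketch: the dictionary layout and the sample values are made up for illustration, and it imitates the row fan-out rather than running anything in BigQuery.

```python
# Toy model of one GA session row: revenue is stored once at session level,
# while hits and products are nested arrays (field names mirror the export schema,
# the values are invented).
session = {
    "totalTransactionRevenue": 50_000_000,  # micros, i.e. 50.00
    "hits": [
        {"product": [{"sku": "A"}, {"sku": "B"}]},
        {"product": [{"sku": "C"}]},
    ],
}

# Flattening in FROM: one output row per (hit, product) pair,
# and every such row carries a copy of the session-level revenue.
flattened = [
    session["totalTransactionRevenue"]
    for hit in session["hits"]
    for _product in hit["product"]
]
inflated = sum(flattened) / 1_000_000

# Unnesting inside a sub-query: the session stays a single row,
# so the revenue is counted exactly once.
correct = session["totalTransactionRevenue"] / 1_000_000

print(inflated, correct)
```

With three (hit, product) pairs, the flattened sum is three times the real session revenue, which is the inflation described in the question.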

Related

Google Analytics BigQuery Export Event Count Issue

I'm trying to get a total events count for a particular event in BigQuery along with a custom dimension for versions of the site.
This query works perfectly but does not include my custom dimension:
SELECT
hits.eventInfo.eventCategory AS eventCategory,
COUNT(*) AS total_events
FROM `ga_sessions_*`,
UNNEST(hits) AS hits
WHERE _TABLE_SUFFIX = '20200630'
AND totals.visits = 1
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Cart CTA'
GROUP BY
hits.eventInfo.eventCategory
But when I add the UNNEST for customDimensions, I get a total that is twice the correct total.
SELECT
hits.eventInfo.eventCategory AS eventCategory,
COUNT(*) AS total_events
FROM `ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.customDimensions) AS cd
WHERE _TABLE_SUFFIX = '20200630'
AND totals.visits = 1
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Cart CTA'
GROUP BY
hits.eventInfo.eventCategory
I think there is something wrong with the customDimensions unnest, but I don't know how to solve it. I've tried using LEFT JOIN with the UNNEST but I get the same result.
"I think there is something wrong with customDimension unnest"
This is by design! When you UNNEST a row that has N records in the unnested column, you generate N rows in place of that one row. So, obviously, COUNT(*) will be different ...
"I don't know how to solve"
... unless you filter by a specific value of that unnested field.
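The effect can be sketched in plain Python. The two sample hits below are invented, and each is assumed to carry exactly two custom dimensions, which would explain a total that is twice the expected count.

```python
# Two event hits, each with two hit-level custom dimensions (sample data).
hits = [
    {"type": "EVENT", "customDimensions": [{"index": 1, "value": "v1"},
                                           {"index": 2, "value": "x"}]},
    {"type": "EVENT", "customDimensions": [{"index": 1, "value": "v2"},
                                           {"index": 2, "value": "y"}]},
]

# Equivalent of `UNNEST(hits) AS hits, UNNEST(hits.customDimensions) AS cd`:
# one row per (hit, custom dimension) pair.
rows = [(hit, cd) for hit in hits for cd in hit["customDimensions"]]

count_all = len(rows)  # COUNT(*) over the fanned-out rows
count_filtered = sum(1 for _hit, cd in rows if cd["index"] == 1)  # WHERE cd.index = 1

print(count_all, count_filtered)
```

Filtering on a single index brings the count back to one row per hit, provided every hit actually has that index set; hits without it would drop out of the result.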

Unnest hits and Unnesting session scoped custom dimension BigQuery code filter

I am trying to filter a funnel based on users who have certain custom dimension values. Unfortunately, the custom dimension in question is session-scoped rather than hit-scoped, so I cannot use hits.customDimensions in this particular query. What is the best way to achieve the desired result?
Find my progress so far:
#standardSQL
SELECT
SUM((SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page' LIMIT 1)) One_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/two - Page' LIMIT 1)) Two_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/three - Page' LIMIT 1)) Three_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/four - Page' LIMIT 1)) Four_Page
FROM `xxxxxxx.ga_sessions_*`,
UNNEST(hits) AS h,
UNNEST(customDimensions) AS cusDim
WHERE
_TABLE_SUFFIX BETWEEN '20190320' AND '20190323'
AND h.hitNumber = 1
AND cusDim.index = 6
AND cusDim.value IN ('60','70')
Segmentation with Custom Dimensions
You can filter for sessions based on conditions in custom dimensions. Simply write a sub-query counting cases of interest and set to ">0". Example for sample data:
SELECT
fullvisitorid,
visitstarttime,
customdimensions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
-- there should be at least one case with index=4 and value='EMEA' ... you can use your index and desired value
-- unnest() turns customdimensions into table format, so we can apply SQL to this array
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
limit 100
You can comment out the WHERE clause to see all the data.
Funnel
First you might want to get an overview of what is going on in your hits array:
SELECT
fullvisitorid,
visitstarttime,
-- get an overview over relevant hits data
-- select as struct feeds hits fields into a new array created by array()-function
ARRAY(select as struct hitnumber, page from unnest(hits) where type='PAGE') hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
limit 100
Now that you made sure the data makes sense you can create a funnel array containing the hit numbers of the relevant steps:
SELECT
fullvisitorid,
visitstarttime,
-- create array with relevant info
-- cross join hit numbers from step pages to get all combinations so that we can check later which came after the other
ARRAY(
select as struct * from
(select hitnumber as step1 from unnest(hits) where type='PAGE' and page.pagePath='/home') left join
(select hitnumber as step2 from unnest(hits) where type='PAGE' and page.pagePath like '/google+redesign/%') on true left join
(select hitnumber as step3 from unnest(hits) where type='PAGE' and page.pagePath='/basket.html') on true
) AS funnel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
limit 100
Put this into a WITH statement for more clarity and run your analysis by summarizing the corresponding cases:
WITH f AS (
SELECT
fullvisitorid,
visitstarttime,
totals.visits,
-- create array with relevant info
-- cross join hit numbers from step pages to get all combinations so that we can check later which came after the other
ARRAY(
select as struct * from
(select hitnumber as step1 from unnest(hits) where type='PAGE' and page.pagePath='/home') left join
(select hitnumber as step2 from unnest(hits) where type='PAGE' and page.pagePath like '/google+redesign/%') on true left join
(select hitnumber as step3 from unnest(hits) where type='PAGE' and page.pagePath='/basket.html') on true
) AS funnel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
)
SELECT
COUNT(DISTINCT fullvisitorid) as users,
SUM(visits) as allSessions,
SUM( IF(array_length(funnel)>0,visits,0) ) sessionsWithFunnelPages,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null ) ,visits,0) ) sessionsWithStep1,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null and step1<step2 ) ,visits,0) ) sessionsFunnelToStep2,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null and step1<step2 and step2<step3 and step1<step3) ,visits,0) ) sessionsFunnelToStep3
FROM f
Please test before using.
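The ordering check behind the funnel array can be modelled in plain Python. This is only a sketch: `funnel_rows` and `reached_step3` are hypothetical helper names, and `None` stands in for SQL NULL.

```python
from itertools import product


def funnel_rows(step1_hits, step2_hits, step3_hits):
    """Mimic `s1 LEFT JOIN s2 ON TRUE LEFT JOIN s3 ON TRUE`: every combination
    of hit numbers, with None where a later step has no hits at all."""
    if not step1_hits:
        return []  # an empty left side yields no rows
    return list(product(step1_hits, step2_hits or [None], step3_hits or [None]))


def reached_step3(rows):
    """True if some combination visits the steps in increasing hit order."""
    return any(
        s2 is not None and s3 is not None and s1 < s2 < s3
        for s1, s2, s3 in rows
    )


# Home page at hits 1 and 5, product page at hit 3, basket at hit 4.
in_order = funnel_rows([1, 5], [3], [4])
# Home page seen only after the other two steps.
out_of_order = funnel_rows([5], [3], [4])

print(reached_step3(in_order), reached_step3(out_of_order))
```

Generating all combinations is what lets a session qualify when at least one visit to each step page happened in the right order, even if the same page was also viewed again later.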

ga:itemQuantity in ga_sessions_YYYYMMDD (Big Query)

I'm trying to replicate the GA Quantity metric (ga:itemQuantity) using standardSQL and querying the GA export to BigQuery date partitioned tables (ga_sessions_YYYYMMDD).
I have tried the following, but 'quantity' is always null:
#standardSQL
SELECT
sum(hit.item.itemQuantity) as quantity
FROM `precise-armor-133520.1500218.ga_sessions_20170801` t
CROSS JOIN
UNNEST(t.hits) AS hit
order by 1 ASC;
Other metrics work and match 100% with the GA UI so I am assuming it's not a data export problem. For example:
SELECT
sum( totals.totalTransactionRevenue ) as revenue, sum( totals.transactions ) as transactions
FROM `precise-armor-133520.1500218.ga_sessions_201708*` t
CROSS JOIN
UNNEST(t.hits) AS hit
group by `date`
order by `date` asc
These totals match Revenue and Transactions (metrics) in GA UI respectively.
What is the standardSQL query for the GA metric quantity (ga:itemQuantity)?
In order to match "Quantity" in GA's web UI by each date, use the following standard SQL:
SELECT
SUM(product.productQuantity)
,`date`
FROM
`precise-armor-133520.1500218.ga_sessions_*`
,UNNEST(hits) AS hits
,UNNEST(hits.product) AS product
WHERE hits.eCommerceAction.action_type = "6"
and _TABLE_SUFFIX between '20170801' and FORMAT_DATE("%Y%m%d", CURRENT_DATE)
group by 2
order by 2 asc
Does this work?
#standardSQL
SELECT
sku,
SUM(qtd) qtd
FROM(
SELECT
ARRAY(SELECT AS STRUCT productSKU sku, productQuantity qtd FROM UNNEST(hits), UNNEST(product) WHERE ecommerceAction.action_type = '6') data
FROM `precise-armor-133520.1500218.ga_sessions_20170801`
),
UNNEST(data)
GROUP BY sku
ORDER BY qtd DESC
LIMIT 1000
Not sure how you managed to unnest the product fields, maybe this solves your issue.

Recreate GA Funnel on BigQuery

I am trying to recreate the GA funnel (custom report on Google360) using BigQuery. The funnel on GA is using the unique count of events that happen on each page. I found this query online that is working for the most part:
SELECT
COUNT( s0.firstHit) AS Landing_Page,
COUNT( s1.firstHit) AS Model_Selection
from(
SELECT
s0.fullvisitorID,
s0.firstHit,
s1.firstHit,
FROM (
# Begin Subquery #1 aka s0
SELECT
fullvisitorID,
MIN(hits.hitNumber) AS firstHit
FROM [64269470.ga_sessions_20170720]
WHERE
hits.eventInfo.eventAction in ('landing_page')
AND totals.visits = 1
GROUP BY
fullvisitorID
) s0
# End Subquery #1 aka s0
left join (
# Begin Subquery #2 aka s1
SELECT
fullvisitorID,
MIN(hits.hitNumber) AS firstHit
FROM [64269470.ga_sessions_20170720]
WHERE
hits.eventInfo.eventAction in ('model_selection_page')
AND totals.visits = 1
GROUP BY
fullvisitorID,
) s1
ON
s0.fullvisitorID = s1.fullvisitorID
)
The query works fine and the value for landing page is the same as I can get on GA, but Model_Selection is about 10% higher. This difference also increases along the funnel (I only posted 2 steps for clarity).
Any idea what I am missing here?
This query does what you need but in Standard SQL Version:
#standardSQL
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM `64269470.ga_sessions_20170720`
Just that. 4 lines, way faster and cheaper.
You can also play with simulated data, something like:
#standardSQL
WITH data AS(
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo)] AS hits
)
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM data
Note that building this type of report in GA might be a bit more difficult, as you need to select visitors who fired the 'landing_page' event at least once and then fired the 'model_selection_page' event. Make sure this report is also built correctly in GA (one way might be to first build a custom report with only the visitors who fired 'landing_page' and then apply a second filter for 'model_selection_page').
[EDIT]:
You asked in your comment about bringing this counting on the session and user level. For counting each session, you can limit the results to 1 for each sub-query evaluation, like so:
SELECT
SUM((SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page') AND eventInfo.eventAction = 'model_selection_page' LIMIT 1)) Model_Selection
FROM data
For counting distinct users, the idea is the same but you'd have to apply a COUNT(DISTINCT) operation, like so:
SELECT
COUNT(DISTINCT(SELECT fullvisitorid FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page,
COUNT(DISTINCT(SELECT fullvisitorid FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page') AND eventInfo.eventAction = 'model_selection_page' LIMIT 1)) Model_Selection
FROM data
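The three counting levels can be compared side by side with a plain-Python model of the simulated data above (four sessions, all belonging to visitor '1'). This is a sketch of the counting logic only. Like the queries, it checks that both events occur in a session, not the order in which they occur.

```python
LANDING = "landing_page"
MODEL = "model_selection_page"

# (fullvisitorid, event actions of the session's hits), mirroring the WITH data AS (...) rows.
sessions = [
    ("1", [LANDING]),
    ("1", [LANDING, LANDING]),
    ("1", [LANDING, MODEL]),
    ("1", [MODEL, MODEL]),
]

# Hit level: count every matching event (COUNTIF per session, then SUM).
hit_level = (
    sum(events.count(LANDING) for _uid, events in sessions),
    sum(events.count(MODEL) for _uid, events in sessions if LANDING in events),
)

# Session level: at most 1 per session (the LIMIT 1 variant).
session_level = (
    sum(1 for _uid, events in sessions if LANDING in events),
    sum(1 for _uid, events in sessions if LANDING in events and MODEL in events),
)

# User level: distinct visitor ids (the COUNT(DISTINCT ...) variant).
user_level = (
    len({uid for uid, events in sessions if LANDING in events}),
    len({uid for uid, events in sessions if LANDING in events and MODEL in events}),
)

print(hit_level, session_level, user_level)
```

Comparing the three tuples shows how much the chosen level changes the funnel numbers, which is worth checking against how the GA report itself counts.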

Limit a view to select between two date partitions

I wish to define a view for Google Analytics landing pages. I've tried to set this up by saving the following query as a view:
SELECT
date,
fullVisitorId AS fv,
visitID AS v,
h.page.pagePath AS landing_page
FROM
`project-id.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
In the queries that join to this view I plan to limit them to between two date partitions like so:
SELECT
sessions.date,
fullVisitorId AS fv,
visitId AS v,
landing_page
FROM `project-id.dataset.ga_sessions_*` AS sessions, UNNEST(hits) AS h
JOIN `project-id.dataset.landing_pages` AS landing_pages
ON landing_pages.fv = sessions.fullVisitorId
AND landing_pages.date = sessions.date
AND landing_pages.v = sessions.visitId
WHERE
_TABLE_SUFFIX BETWEEN '20170108' AND '20170108'
This still appears to select a large volume of data ~5GB rather than ~60MB that would be expected for one day.
How can I re-write the view so that it only selects the relevant date partitions as defined by the consuming query?
Make sure to include the _TABLE_SUFFIX in the view definition so that you can reference it in queries over the view. Here's an example that converts the _TABLE_SUFFIX to a date:
SELECT
date,
fullVisitorId AS fv,
visitID AS v,
h.page.pagePath AS landing_page,
PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS sessions_date
FROM
`project-id.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1;
Now try a query over the view:
SELECT
COUNT(DISTINCT fv) AS total_visitors
FROM `dataset.view_name`
WHERE sessions_date = '2017-01-08';
