I'm trying to get a total events count for a particular event in BigQuery along with a custom dimension for versions of the site.
This query works perfectly but does not include my custom dimension:
SELECT
hits.eventInfo.eventCategory AS eventCategory,
COUNT(*) AS total_events
FROM `ga_sessions_*`,
UNNEST(hits) AS hits
WHERE _TABLE_SUFFIX = '20200630'
AND totals.visits = 1
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Cart CTA'
GROUP BY
hits.eventInfo.eventCategory
But when I add the UNNEST for customDimensions, I get a total that is twice the correct total.
SELECT
hits.eventInfo.eventCategory AS eventCategory,
COUNT(*) AS total_events
FROM `ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.customDimensions) AS cd
WHERE _TABLE_SUFFIX = '20200630'
AND totals.visits = 1
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Cart CTA'
GROUP BY
hits.eventInfo.eventCategory
I think there is something wrong with customDimension unnest, but I don't know how to solve. I've tried using LEFT JOIN with the UNNEST but I get the same result.
I think there is something wrong with customDimension unnest
This is by design! When you UNNEST a row that has N elements in the unnested array, you get N rows in place of that one row. So COUNT(*) will naturally be different ...
I don't know how to solve
... unless you filter on a specific value of that unnested field (for example, a single custom dimension index)
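To make the row multiplication concrete, here is a minimal self-contained sketch. The inline `sessions` data is made up for illustration, and so is the assumption that custom dimension index 1 holds the site version; substitute the index you actually use.

```sql
-- Hypothetical inline data: one session with two EVENT hits,
-- each hit carrying two custom dimensions.
WITH sessions AS (
  SELECT [
    STRUCT('EVENT' AS type,
           [STRUCT(1 AS index, 'v1' AS value),
            STRUCT(2 AS index, 'desktop' AS value)] AS customDimensions),
    STRUCT('EVENT' AS type,
           [STRUCT(1 AS index, 'v2' AS value),
            STRUCT(2 AS index, 'desktop' AS value)] AS customDimensions)
  ] AS hits
)
SELECT
  -- One row per hit: 2
  (SELECT COUNT(*)
   FROM sessions, UNNEST(hits) AS h) AS hits_only,
  -- One row per hit per custom dimension: 4 (the doubled total)
  (SELECT COUNT(*)
   FROM sessions, UNNEST(hits) AS h, UNNEST(h.customDimensions) AS cd) AS all_custom_dimensions,
  -- Keeping a single index brings it back to one row per hit: 2
  (SELECT COUNT(*)
   FROM sessions, UNNEST(hits) AS h, UNNEST(h.customDimensions) AS cd
   WHERE cd.index = 1) AS one_custom_dimension
```

With two custom dimensions per hit the unfiltered count is exactly twice the hit count, which matches the symptom in the question.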
I'm looking to get the average latencyTracking values per visitId out of our GA 360 export.
I set up the following query but get the error below, and I'm not sure why, since all of these are aggregate functions: SELECT list expression references hits.latencyTracking.serverResponseTime which is neither grouped nor aggregated at [3:5]
select
TIMESTAMP_SECONDS(visitStartTime) as visitStartTime,
AVG(hits.latencyTracking.serverResponseTime) OVER (PARTITION BY visitid) as avgServerResponseTime,
AVG(hits.latencyTracking.serverConnectionTime) OVER (PARTITION BY visitid) as avgServerConnectionTime,
AVG(hits.latencyTracking.domInteractiveTime) OVER (PARTITION BY visitid) as avgdomInteractiveTime,
AVG(hits.latencyTracking.pageLoadTime) OVER (PARTITION BY visitid) as avgpageLoadTime
from `xxx.xxx.ga_sessions_2018*`,
UNNEST(hits) AS hits
where hits.latencyTracking.serverResponseTime is not null
group by visitStartTime
The way your query is written, AVG() is not a plain aggregate function but an analytic (window) function.
To make it work, you can remove OVER() so that AVG() really becomes an aggregate function that corresponds to the GROUP BY:
select
TIMESTAMP_SECONDS(visitStartTime) as visitStartTime,
AVG(hits.latencyTracking.serverResponseTime) as avgServerResponseTime,
AVG(hits.latencyTracking.serverConnectionTime) as avgServerConnectionTime,
AVG(hits.latencyTracking.domInteractiveTime) as avgdomInteractiveTime,
AVG(hits.latencyTracking.pageLoadTime) as avgpageLoadTime
from `xxx.xxx.ga_sessions_2018*`,
UNNEST(hits) AS hits
where hits.latencyTracking.serverResponseTime is not null
group by visitStartTime
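As a rough illustration of the difference, the sketch below runs both forms over a tiny made-up table (`sample_hits` and its values are hypothetical, not taken from the GA export) and compares how many rows each one returns.

```sql
-- Hypothetical sample: visit 1 has two hits, visit 2 has one.
WITH sample_hits AS (
  SELECT 1 AS visitId, 100 AS serverResponseTime UNION ALL
  SELECT 1, 300 UNION ALL
  SELECT 2, 50
),
aggregated AS (
  -- Aggregate AVG(): collapses each group into a single row.
  SELECT visitId, AVG(serverResponseTime) AS avgServerResponseTime
  FROM sample_hits
  GROUP BY visitId
),
windowed AS (
  -- Analytic AVG() OVER (...): keeps every input row and
  -- attaches the partition average to each of them.
  SELECT visitId,
         AVG(serverResponseTime) OVER (PARTITION BY visitId) AS avgServerResponseTime
  FROM sample_hits
)
SELECT 'aggregate' AS variant, COUNT(*) AS row_count FROM aggregated  -- 2 rows
UNION ALL
SELECT 'analytic' AS variant, COUNT(*) AS row_count FROM windowed     -- 3 rows
```

Because the analytic form does not collapse rows, its argument still counts as an ungrouped column reference when a GROUP BY is present, which is what the error message is complaining about.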
Using window functions and GROUP BY together can be confusing.
In your case neither is necessary, and neither is the flattening - you can write simple subqueries to get your numbers per session:
SELECT
TIMESTAMP_SECONDS(visitStartTime) AS visitStartTime,
(
SELECT AVG(latencyTracking.serverResponseTime)
FROM t.hits
WHERE latencyTracking.serverResponseTime IS NOT NULL) AS avgServerResponseTime,
(
SELECT AVG(latencyTracking.serverConnectionTime)
FROM t.hits
WHERE latencyTracking.serverConnectionTime IS NOT NULL) AS avgServerConnectionTime,
(
SELECT AVG(latencyTracking.domInteractiveTime)
FROM t.hits
WHERE latencyTracking.domInteractiveTime IS NOT NULL ) AS avgdomInteractiveTime,
(
SELECT AVG(latencyTracking.pageLoadTime)
FROM t.hits
WHERE latencyTracking.pageLoadTime IS NOT NULL ) AS avgpageLoadTime
FROM `xxx.xxx.ga_sessions_2018*` AS t
It also doesn't involve grouping which makes it faster.
Unnesting hits.customDimensions and hits.product.customDimensions is inflating the transaction revenue
SELECT
sum(totals.totalTransactionRevenue)/1000000 as revenue,
(SELECT MAX(IF(index=10,value,NULL)) FROM UNNEST(product.customDimensions)) AS product_CD10,
(SELECT MAX(IF(index=1,value,NULL)) FROM UNNEST(hits.customDimensions)) AS CD1
FROM
`XXXXXXXXXXXXXXX.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) as product
WHERE
_TABLE_SUFFIX BETWEEN "20180608"
AND "20180608"
group by product_CD10,CD1
Is there a way to get a flat table such that summing the revenue gives the correct result?
Move your UNNEST() calls into subqueries in the SELECT list - then the rows won't get duplicated:
SELECT row
, (SELECT MAX(letter) FROM UNNEST(row), UNNEST(qq)) max_letter
, (SELECT MAX(n) FROM UNNEST(row), UNNEST(qq), UNNEST(qb) n) max_number
FROM (
SELECT [
STRUCT(1 AS p,[STRUCT('a' AS letter, [4,5,6] AS qb)] AS qq)
, STRUCT(2,[STRUCT('b', [7,8,9])])
, STRUCT(3,[STRUCT('c', [10,11,12])])
] AS row
)
I haven't tested this, though:
SELECT
sum(totals.totalTransactionRevenue)/1000000 as revenue,
(SELECT MAX(IF(index=10,value,NULL)) FROM UNNEST(hits) AS hit, UNNEST(hit.product) product, UNNEST(product.customDimensions)) AS product_CD10,
(SELECT MAX(IF(index=1,value,NULL)) FROM UNNEST(hits) AS hit, UNNEST(hit.customDimensions)) AS CD1
FROM `XXXXXXXXXXXXXXX.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN "20180608" AND "20180608"
group by product_CD10,CD1
I'm trying to replicate the GA Quantity metric (ga:itemQuantity) using standardSQL and querying the GA export to BigQuery date partitioned tables (ga_sessions_YYYYMMDD).
I have tried the following, but 'quantity' is always null:
#standardSQL
SELECT
sum(hit.item.itemQuantity) as quantity
FROM `precise-armor-133520.1500218.ga_sessions_20170801` t
CROSS JOIN
UNNEST(t.hits) AS hit
order by 1 ASC;
Other metrics work and match 100% with the GA UI so I am assuming it's not a data export problem. For example:
SELECT
sum( totals.totalTransactionRevenue ) as revenue, sum( totals.transactions ) as transactions
FROM `precise-armor-133520.1500218.ga_sessions_201708*` t
CROSS JOIN
UNNEST(t.hits) AS hit
group by `date`
order by `date` asc
These totals match Revenue and Transactions (metrics) in GA UI respectively.
What is the standardSQL query for the GA metric quantity (ga:itemQuantity)?
In order to match "Quantity" in GA's web UI by each date, use the following standard SQL:
SELECT
SUM(product.productQuantity)
,`date`
FROM
`precise-armor-133520.1500218.ga_sessions_*`
,UNNEST(hits) AS hits
,UNNEST(hits.product) AS product
WHERE hits.eCommerceAction.action_type = "6"
and _TABLE_SUFFIX between '20170801' and FORMAT_DATE("%Y%m%d", CURRENT_DATE)
group by 2
order by 2 asc
Does this work?
#standardSQL
SELECT
sku,
SUM(qtd) qtd
FROM(
SELECT
ARRAY(SELECT AS STRUCT productSKU sku, productQuantity qtd FROM UNNEST(hits), UNNEST(product) WHERE ecommerceAction.action_type = '6') data
FROM `precise-armor-133520.1500218.ga_sessions_20170801`
),
UNNEST(data)
GROUP BY sku
ORDER BY qtd DESC
LIMIT 1000
I'm not sure how you managed to unnest the product fields, but maybe this solves your issue.
I wish to define a view for Google Analytics landing pages. I've tried to set this up by saving the following query as a view:
SELECT
date,
fullVisitorId AS fv,
visitID AS v,
h.page.pagePath AS landing_page
FROM
`project-id.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
In the queries that join to this view, I plan to limit the scan to a range of date partitions, like so:
SELECT
sessions.date,
fullVisitorId AS fv,
visitId AS v,
landing_page
FROM `project-id.dataset.ga_sessions_*` AS sessions, UNNEST(hits) AS h
JOIN `project-id.dataset.landing_pages` AS landing_pages
ON landing_pages.fv = sessions.fullVisitorId
AND landing_pages.date = sessions.date
AND landing_pages.v = sessions.visitId
WHERE
_TABLE_SUFFIX BETWEEN '20170108' AND '20170108'
This still appears to scan a large volume of data (~5 GB) rather than the ~60 MB I would expect for one day.
How can I re-write the view so that it only selects the relevant date partitions as defined by the consuming query?
Make sure to include the _TABLE_SUFFIX in the view definition so that you can reference it in queries over the view. Here's an example that converts the _TABLE_SUFFIX to a date:
SELECT
date,
fullVisitorId AS fv,
visitID AS v,
h.page.pagePath AS landing_page,
PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS sessions_date
FROM
`project-id.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1;
Now try a query over the view:
SELECT
COUNT(DISTINCT fv) AS total_visitors
FROM `dataset.view_name`
WHERE sessions_date = '2017-01-08';
I'm working on counting all visitors that submitted a postcode on our homepage. I came up with the following query in legacy SQL:
SELECT fullVisitorId, visitStartTime
FROM TABLE_DATE_RANGE([ga_sessions_], TIMESTAMP('2017-01-29'), CURRENT_TIMESTAMP())
where hits.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and hits.type = 'EVENT'
and hits.eventInfo.eventCategory = 'Homepage'
and hits.eventInfo.eventAction = 'Submit Postcode';
I then wanted to convert it to standard SQL to use within a CTE and came up with this one, which doesn't seem right:
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions_*`, UNNEST(hits) as h
where
_TABLE_SUFFIX > '2017-01-29'
AND h.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and h.type = 'EVENT'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode';
The first one processes 327 MB and returns 4117 results, the second one processes 6.98 GB and returns 60745 results.
I've looked at the migration guide, but it didn't prove very helpful for me.
ga_sessions has the standard schema of the GA export to BigQuery.
It looks like the difference comes from the fact that with standard SQL you are flattening the table on hits when you CROSS JOIN UNNEST(hits) in the FROM clause, which adds more rows to the result. A more equivalent query would be:
#standardSQL
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions_*`
where
_TABLE_SUFFIX >= '20170129'
and visitStartTime > 1483228800
and EXISTS(
SELECT 1 FROM UNNEST(hits) h
WHERE h.type = 'EVENT'
and h.page.pagePath = '/broadband/'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode');
What happened here is that _TABLE_SUFFIX is a string, so when you do:
_TABLE_SUFFIX > '2017-01-29'
you end up selecting many more tables than expected, because string comparison differs from numeric comparison.
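A small sketch of that effect, using a handful of made-up suffix values (the list is hypothetical; the first filter literal is the one from the question):

```sql
-- Strings compare character by character, and '-' sorts before '0',
-- so every 2017 suffix compares greater than '2017-01-29'.
SELECT
  suffix,
  suffix > '2017-01-29' AS passes_string_filter,
  suffix > '20170129'   AS passes_intended_filter
FROM UNNEST(['20161231', '20170101', '20170129', '20170130']) AS suffix
ORDER BY suffix
-- suffix     passes_string_filter  passes_intended_filter
-- 20161231   false                 false
-- 20170101   true                  false
-- 20170129   true                  false
-- 20170130   true                  true
```

So the dashed literal also pulls in every table from 20170101 through 20170129, which would explain part of the extra bytes processed.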
One possible way to fix that is by parsing the string to DATE type:
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions*`, UNNEST(hits) as h
where parse_date("%Y%m%d", regexp_extract(_table_suffix, r'.*_(.*)')) >= parse_date("%Y-%m-%d", '2017-01-29')
AND h.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and h.type = 'EVENT'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode';
Here the parse_date operation first converts the string to a DATE, and then the comparison is made.
Notice as well that I changed the wildcard to ga_sessions* and use REGEXP_EXTRACT to consider only what comes after the last "_" character. By doing so, you'll be able to select "intraday" tables as well.
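To see what the extraction does, here is a minimal sketch over two example suffix values. They are assumed shapes of what _TABLE_SUFFIX would look like with the ga_sessions* wildcard for a daily and an intraday table, not values read from a real dataset.

```sql
-- The greedy '.*_' consumes everything up to the last underscore,
-- so the capture group is just the trailing date in both cases.
SELECT
  suffix,
  PARSE_DATE('%Y%m%d', REGEXP_EXTRACT(suffix, r'.*_(.*)')) AS table_date
FROM UNNEST(['_20170129', '_intraday_20170130']) AS suffix
-- _20170129           -> 2017-01-29
-- _intraday_20170130  -> 2017-01-30
```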