I'm looking to get the average latencyTracking values per visitId out of our GA 360 export.
I set up the following query but get the error below, and I'm not sure why, since these are all aggregate functions: SELECT list expression references hits.latencyTracking.serverResponseTime which is neither grouped nor aggregated at [3:5]
select
TIMESTAMP_SECONDS(visitStartTime) as visitStartTime,
AVG(hits.latencyTracking.serverResponseTime) OVER (PARTITION BY visitid) as avgServerResponseTime,
AVG(hits.latencyTracking.serverConnectionTime) OVER (PARTITION BY visitid) as avgServerConnectionTime,
AVG(hits.latencyTracking.domInteractiveTime) OVER (PARTITION BY visitid) as avgdomInteractiveTime,
AVG(hits.latencyTracking.pageLoadTime) OVER (PARTITION BY visitid) as avgpageLoadTime
from `xxx.xxx.ga_sessions_2018*`,
UNNEST(hits) AS hits
where hits.latencyTracking.serverResponseTime is not null
group by visitStartTime
The way your query is written, AVG() is not a plain aggregate function but an analytic (window) function, because of the OVER() clause.
To make it work, remove OVER() so that AVG() becomes a true aggregate function that corresponds to the GROUP BY:
select
TIMESTAMP_SECONDS(visitStartTime) as visitStartTime,
AVG(hits.latencyTracking.serverResponseTime) as avgServerResponseTime,
AVG(hits.latencyTracking.serverConnectionTime) as avgServerConnectionTime,
AVG(hits.latencyTracking.domInteractiveTime) as avgdomInteractiveTime,
AVG(hits.latencyTracking.pageLoadTime) as avgpageLoadTime
from `xxx.xxx.ga_sessions_2018*`,
UNNEST(hits) AS hits
where hits.latencyTracking.serverResponseTime is not null
group by visitStartTime
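For contrast, the analytic form is fine on its own when there is no GROUP BY: it keeps every hit row and repeats the per-visit average on each one. A minimal sketch with made-up inline data (the hit_rows CTE and its values are hypothetical, not taken from the GA export):

```sql
#standardSQL
-- hit_rows is hypothetical sample data standing in for unnested hits.
WITH hit_rows AS (
  SELECT 1 AS visitId, 100 AS serverResponseTime UNION ALL
  SELECT 1, 300 UNION ALL
  SELECT 2, 50
)
SELECT
  visitId,
  serverResponseTime,
  -- Window function: there is no GROUP BY, so all three input rows are returned.
  AVG(serverResponseTime) OVER (PARTITION BY visitId) AS avgServerResponseTime
FROM hit_rows
```

This returns three rows (200.0 on both hits of visit 1, 50.0 for visit 2), whereas the GROUP BY version collapses the hits to one row per group.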
Combining window functions and GROUP BY can be confusing.
In your case neither is necessary, nor is the flattening: you can write simple subqueries to get your numbers per session:
SELECT
TIMESTAMP_SECONDS(visitStartTime) AS visitStartTime,
(
SELECT AVG(latencyTracking.serverResponseTime)
FROM t.hits
WHERE latencyTracking.serverResponseTime IS NOT NULL) AS avgServerResponseTime,
(
SELECT AVG(latencyTracking.serverConnectionTime)
FROM t.hits
WHERE latencyTracking.serverConnectionTime IS NOT NULL) AS avgServerConnectionTime,
(
SELECT AVG(latencyTracking.domInteractiveTime)
FROM t.hits
WHERE latencyTracking.domInteractiveTime IS NOT NULL ) AS avgdomInteractiveTime,
(
SELECT AVG(latencyTracking.pageLoadTime)
FROM t.hits
WHERE latencyTracking.pageLoadTime IS NOT NULL ) AS avgpageLoadTime
FROM `xxx.xxx.ga_sessions_2018*` AS t
It also doesn't involve grouping, which typically makes it faster.
Related
I'm trying to get a total events count for a particular event in BigQuery along with a custom dimension for versions of the site.
This query works perfectly but does not include my custom dimension:
SELECT
hits.eventInfo.eventCategory AS eventCategory,
COUNT(*) AS total_events
FROM `ga_sessions_*`,
UNNEST(hits) AS hits
WHERE _TABLE_SUFFIX = '20200630'
AND totals.visits = 1
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Cart CTA'
GROUP BY
hits.eventInfo.eventCategory
But when I add the UNNEST for customDimensions, I get a total that is twice the correct total.
SELECT
hits.eventInfo.eventCategory AS eventCategory,
COUNT(*) AS total_events
FROM `ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.customDimensions) AS cd
WHERE _TABLE_SUFFIX = '20200630'
AND totals.visits = 1
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Cart CTA'
GROUP BY
hits.eventInfo.eventCategory
I think there is something wrong with customDimension unnest, but I don't know how to solve. I've tried using LEFT JOIN with the UNNEST but I get the same result.
I think there is something wrong with customDimension unnest
This is by design! When you UNNEST a row that has N records in the unnested column, you generate N rows in place of that one row. So, naturally, COUNT(*) will be different ...
I don't know how to solve
... unless you filter by specific value of that unnested field
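A sketch of what that filter could look like, based on the query from the question. The index value 1 is an assumption (use whichever index holds your site-version dimension), and site_version is just an illustrative alias:

```sql
#standardSQL
SELECT
  hits.eventInfo.eventCategory AS eventCategory,
  cd.value AS site_version,
  COUNT(*) AS total_events
FROM `ga_sessions_*`,
  UNNEST(hits) AS hits,
  UNNEST(hits.customDimensions) AS cd
WHERE _TABLE_SUFFIX = '20200630'
  AND totals.visits = 1
  AND hits.type = 'EVENT'
  AND hits.eventInfo.eventCategory = 'Cart CTA'
  -- ASSUMPTION: index 1 is a placeholder for the custom dimension you need.
  -- Keeping a single index leaves at most one row per hit, so hits are not double counted.
  AND cd.index = 1
GROUP BY
  eventCategory,
  site_version
```

Note that hits which do not carry that custom dimension at all drop out of the result with this form, so the total can be lower than in the first query.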
select
concat(fullvisitorid,cast(visitid as string)) as unique_session_id
,case
when h.item.productSku is not null then h.hitNumber
else max(h.hitnumber)
end
,h.item.transactionid
,h.item.itemrevenue/pow(10,6)
,h.item.productSku
from `myproject.mydataset.ga_sessions_20180101`, unnest(hits) as h
group by 1
Looking at the CASE statement above (line 3):
How do I return the hitNumber where productSku is populated,
and otherwise return the max hitNumber, and then group this by unique_session_id?
How can I filter out transactionids that contain '_ABC' at the same time?
I would suggest doing the grouping and finding the max hit number in a subquery. If you are going to use an aggregate function like MAX() in the select clause, then you need to group on or have aggregate functions for the other fields in the select. It can be useful to do aggregate sub-queries using common table expressions.
WITH data AS (
SELECT
CONCAT(fullvisitorid, CAST(visitid AS string)) AS unique_session_id,
h.hitNumber,
h.item.transactionid,
h.item.itemrevenue/POW(10,6) AS itemRevenue,
h.item.productSku
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
UNNEST(hits) AS h
),
max_hits AS (
SELECT
unique_session_id,
MAX(hitNumber) AS max_hit_number
FROM data
GROUP BY 1
)
SELECT
d.unique_session_id,
CASE
WHEN d.productSku IS NOT NULL THEN d.hitNumber
ELSE m.max_hit_number
END,
d.transactionid,
d.itemrevenue,
d.productSku
FROM
data AS d JOIN max_hits AS m
ON d.unique_session_id = m.unique_session_id
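For the '_ABC' part of the question, one option is a filter on the unnested hits. This is only a sketch: it uses STRPOS rather than LIKE because "_" is a single-character wildcard in LIKE patterns, and it assumes that hits without a transaction id should be kept.

```sql
#standardSQL
SELECT
  CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS unique_session_id,
  h.hitNumber,
  h.item.transactionId,
  h.item.productSku
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
  UNNEST(hits) AS h
WHERE
  -- ASSUMPTION: hits with no transaction id stay in the result.
  h.item.transactionId IS NULL
  -- STRPOS returns 0 when the literal substring '_ABC' is not found.
  OR STRPOS(h.item.transactionId, '_ABC') = 0
```

The same WHERE clause could go inside the data CTE above so that both the max-hit lookup and the final select see the filtered rows.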
I have the Google Analytics export to BigQuery set up and activated.
This is a query for previous page path, page:
SELECT
LAG(hits.page.pagePath, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber ASC) AS Previous,
hits.page.pagePath AS Page
FROM
[xxxxxxxx.table]
WHERE
hits.type="PAGE"
LIMIT
100
I am trying to also get a custom dimension for the previous page request but I am stuck.
Basically I want to retrieve a custom dimension (which is a nested value) with LAG.
This works but it also throws a lot of extra null rows:
LAG(IF(hits.customDimensions.index = 10, hits.customDimensions.value, NULL), 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber ASC) AS Previous_PT
If I use max (https://support.google.com/analytics/answer/4419694?hl=en#query7_MultipleCDs ) it throws an error.
Any help would be much appreciated.
Thanks.
Does it work if you just move the "hits.customDimensions.index = 10" into WHERE clause?
For future reference & seekers, I managed to solve this:
MAX(...) WITHIN RECORD is a scoped aggregate, and you cannot nest it inside LAG.
The only way I managed to get the custom dimension X for the previous request is by self joining the same table ON hitnumber:
SELECT
hits.page.pagePath AS Page,
fullVisitorId,
visitId,
LAG(hits.hitNumber, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber ASC) AS Previous_Hit,
LAG(hits.page.pagePath, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber ASC) AS Previous,
MAX(IF (hits.customDimensions.index = 6, hits.customDimensions.value, NULL)) WITHIN RECORD AS BLABLA1,
MAX(IF (hits.customDimensions.index = 8, hits.customDimensions.value, NULL)) WITHIN RECORD AS BLABLA2,
MAX(IF (hits.customDimensions.index = 10, hits.customDimensions.value, NULL)) WITHIN RECORD AS BLABLA3,
hits.hitNumber AS hitNumber
FROM
FLATTEN([xxxxxxxxx], hits)
WHERE
hits.type="PAGE" ) AS T1
LEFT JOIN
FLATTEN([xxxxxxxxxx], hits) AS T2
ON
T2.hits.hitNumber = T1.Previous_Hit
AND T1.fullVisitorId = T2.fullVisitorId
AND T1.visitId = T2.visitId
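For anyone on standard SQL rather than legacy SQL, a different approach avoids the self join: pull the custom dimension out with a scalar subquery first, then apply LAG to the resulting column. A sketch, assuming index 10 as in the question and using the public sample table as a stand-in for the real export (page_hits, cd10 and previous_cd10 are illustrative names):

```sql
#standardSQL
WITH page_hits AS (
  SELECT
    fullVisitorId,
    visitId,
    h.hitNumber AS hitNumber,
    h.page.pagePath AS page,
    -- MAX() guarantees a single value even if the index were repeated.
    (SELECT MAX(cd.value)
     FROM UNNEST(h.customDimensions) AS cd
     WHERE cd.index = 10) AS cd10
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
    UNNEST(hits) AS h
  WHERE h.type = 'PAGE'
)
SELECT
  page,
  LAG(page) OVER w AS previous_page,
  cd10,
  LAG(cd10) OVER w AS previous_cd10
FROM page_hits
WINDOW w AS (PARTITION BY fullVisitorId, visitId ORDER BY hitNumber)
```

Because the dimension is already a plain column by the time LAG runs, there are no extra NULL rows from the repeated customDimensions field.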
I'm working on counting all visitors that submitted a postcode on our homepage. I came up with the following query in legacy SQL:
SELECT fullVisitorId, visitStartTime
FROM TABLE_DATE_RANGE([ga_sessions_], TIMESTAMP('2017-01-29'), CURRENT_TIMESTAMP())
where hits.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and hits.type = 'EVENT'
and hits.eventInfo.eventCategory = 'Homepage'
and hits.eventInfo.eventAction = 'Submit Postcode';
I then wanted to convert it to standard SQL to use within a CTE and came up with this one, which doesn't seem right:
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions_*`, UNNEST(hits) as h
where
_TABLE_SUFFIX > '2017-01-29'
AND h.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and h.type = 'EVENT'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode';
The first one processes 327 MB and returns 4117 results, the second one processes 6.98 GB and returns 60745 results.
I've looked at the migration guide, but it didn't prove very helpful for me.
ga_sessions has standard schema of GA import into Bigquery.
It looks like the difference comes from the fact that with Standard SQL you are flattening the table on hits when you CROSS JOIN UNNEST(hits) in the FROM clause, which adds more rows to the result. A closer equivalent would be:
#standardSQL
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions_*`
where
_TABLE_SUFFIX > '20170129'
and visitStartTime > 1483228800
and EXISTS(
SELECT 1 FROM UNNEST(hits) h
WHERE h.type = 'EVENT'
and h.page.pagePath = '/broadband/'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode');
What happened here is that _TABLE_SUFFIX is a string, so when you do:
_TABLE_SUFFIX > '2017-01-29'
you end up selecting far more tables than expected, because string comparison differs from numeric comparison.
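The effect can be checked without touching any tables. A small sketch with a few hand-picked suffix values (the values are hypothetical examples in the GA export's YYYYMMDD format):

```sql
#standardSQL
SELECT
  suffix,
  -- Character-by-character comparison: after '2017', '0' sorts above '-'.
  suffix > '2017-01-29' AS passes_dashed_filter,
  suffix >= '20170129' AS passes_compact_filter
FROM UNNEST(['20161231', '20170128', '20170129', '20170130']) AS suffix
```

Here '20170128' passes the dashed filter but not the compact one: any suffix starting with '2017' followed by a digit compares greater than '2017-01-29', so every 2017 table is scanned, including those before 29 January.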
One possible way to fix that is by parsing the string to DATE type:
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions*`, UNNEST(hits) as h
where parse_date("%Y%m%d", regexp_extract(_table_suffix, r'.*_(.*)')) >= parse_date("%Y-%m-%d", '2017-01-29')
AND h.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and h.type = 'EVENT'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode';
Here parse_date first converts the string to a DATE, and then the comparison is made.
Notice as well that I changed the wildcard selection to ga_sessions* and, using REGEXP_EXTRACT, consider only what comes after the last "_" character. By doing so, you'll be able to select "intraday" tables as well.
I have been trying to do a calculation based on the result of either the LAG or LEAD functions.
Encapsulating the function in the INTEGER() casting function seems to cause an issue with the OVER function within and throws the following error:
Unrecognized Analytic Function: INT64 cannot be used with an OVER() clause
The following is the base code that works just fine, but when I add a function, it produces an error:
LEAD(hits.hitNumber, 1) OVER (PARTITION BY fullvisitorID, visitid, visitnumber ORDER BY hits.hitNumber DESC) as nextHit
The code that I was using to produce this error is as follows:
INTEGER(LEAD(hits.hitNumber, 1)) OVER (PARTITION BY fullvisitorID, visitid ORDER BY hits.hitNumber DESC) as nextHit
The following doesn't seem to work either:
INTEGER(LEAD(hits.hitNumber, 1) OVER (PARTITION BY fullvisitorID, visitid ORDER BY hits.hitNumber DESC))as nextHit
Encountered " "OVER" "OVER "" at line 8, column 36. Was expecting: ")"
Do I really need to make this a sub-query to make this work or is there a different solution?
2 possible solutions:
As Jordan says, bring the INTEGER() cast inside LEAD():
SELECT LEAD(INTEGER(hits.hitNumber), 1) OVER (PARTITION BY fullvisitorID, visitid, visitnumber ORDER BY hits.hitNumber DESC) as nextHit
FROM [dataset.ga_sessions_20140107]
Or as in your suggestion, with a sub-query:
SELECT INTEGER(nextHit) FROM (
SELECT LEAD(hits.hitNumber, 1) OVER (PARTITION BY fullvisitorID, visitid, visitnumber ORDER BY hits.hitNumber DESC) as nextHit
FROM [dataset.ga_sessions_20140107]
)
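As a side note, in standard SQL (as opposed to the legacy SQL used above) the result of an analytic function can be used directly in an expression, so neither workaround is needed there. A sketch against the public sample table, ordering ascending so that LEAD returns the following hit; hits_to_next is an illustrative name:

```sql
#standardSQL
SELECT
  fullVisitorId,
  visitId,
  h.hitNumber,
  -- Arithmetic on the window function result, with no wrapping subquery.
  LEAD(h.hitNumber) OVER (
    PARTITION BY fullVisitorId, visitId
    ORDER BY h.hitNumber
  ) - h.hitNumber AS hits_to_next
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
  UNNEST(hits) AS h
```

The last hit of each session gets NULL, since LEAD has no following row to read from.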