Lag custom dimension in Bigquery export - google-analytics

I have the Google Analytics export to BigQuery set up.
This query returns the previous page path alongside the current page:
SELECT
LAG(hits.page.pagePath, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber ASC) AS Previous,
hits.page.pagePath AS Page
FROM
[xxxxxxxx.table]
WHERE
hits.type="PAGE"
LIMIT
100
I am trying to also get a custom dimension for the previous page request but I am stuck.
Basically I want to retrieve a custom dimension (which is a nested value) with LAG.
This works, but it also returns a lot of extra NULL rows:
LAG(IF(hits.customDimensions.index = 10, hits.customDimensions.value, NULL), 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber ASC) AS Previous_PT
If I use MAX (https://support.google.com/analytics/answer/4419694?hl=en#query7_MultipleCDs), it throws an error.
Any help would be much appreciated.
Thanks.

Does it work if you just move the "hits.customDimensions.index = 10" condition into the WHERE clause?

For future reference, I managed to solve this.
MAX ... WITHIN RECORD is an aggregate function, and you cannot nest aggregate functions inside LAG.
The only way I managed to get custom dimension X for the previous request is by self-joining the same table on the hit number:
SELECT
*
FROM (
SELECT
hits.page.pagePath AS Page,
fullVisitorId,
visitId,
LAG(hits.hitNumber, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber ASC) AS Previous_Hit,
LAG(hits.page.pagePath, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber ASC) AS Previous,
MAX(IF (hits.customDimensions.index = 6, hits.customDimensions.value, NULL)) WITHIN RECORD AS BLABLA1,
MAX(IF (hits.customDimensions.index = 8, hits.customDimensions.value, NULL)) WITHIN RECORD AS BLABLA2,
MAX(IF (hits.customDimensions.index = 10, hits.customDimensions.value, NULL)) WITHIN RECORD AS BLABLA3,
hits.hitNumber AS hitNumber
FROM
FLATTEN([xxxxxxxxx], hits)
WHERE
hits.type="PAGE" ) AS T1
LEFT JOIN
FLATTEN([xxxxxxxxxx], hits) AS T2
ON
T2.hits.hitNumber = T1.Previous_Hit
AND T1.fullVisitorId = T2.fullVisitorId
AND T1.visitId = T2.visitId
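To make the intent of the self-join easier to follow, here is a minimal Python sketch of the same logic: order the page hits within each session, remember the previous hit (what LAG does), and read the custom dimension from that previous hit. The sample rows, the `cd` mapping, and the `with_previous` helper are all hypothetical and only stand in for the flattened export; this is not how BigQuery executes the query.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical flattened PAGE hits; "cd" maps custom dimension index -> value.
hits = [
    {"fullVisitorId": "A", "visitId": 1, "hitNumber": 1, "pagePath": "/home", "cd": {10: "landing"}},
    {"fullVisitorId": "A", "visitId": 1, "hitNumber": 3, "pagePath": "/list", "cd": {10: "category"}},
    {"fullVisitorId": "A", "visitId": 1, "hitNumber": 4, "pagePath": "/item", "cd": {}},
    {"fullVisitorId": "B", "visitId": 7, "hitNumber": 1, "pagePath": "/home", "cd": {10: "landing"}},
]


def with_previous(rows, cd_index):
    """Attach the previous page and its custom dimension within each session."""
    session_key = itemgetter("fullVisitorId", "visitId")
    ordered = sorted(rows, key=lambda r: (session_key(r), r["hitNumber"]))
    out = []
    for _, session_hits in groupby(ordered, key=session_key):
        previous = None  # like LAG: no previous row for the first hit of a partition
        for hit in session_hits:
            out.append({
                "page": hit["pagePath"],
                "previous": previous["pagePath"] if previous else None,
                "previous_cd": previous["cd"].get(cd_index) if previous else None,
            })
            previous = hit
    return out


result = with_previous(hits, cd_index=10)
```

The first hit of every session gets None for both "previous" fields, which corresponds to the NULL that LAG returns at the start of a partition.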

Related

Unnest hits and Unnesting session scoped custom dimension BigQuery code filter

I am trying to filter a funnel based on users who have certain custom dimension values. Sadly, the custom dimension in question is session-scoped and not hit-scoped, so I cannot use hits.customDimensions in this particular query. What is the best way to achieve the desired result?
Here is my progress so far:
#standardSQL
SELECT
SUM((SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page' LIMIT 1)) One_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/two - Page' LIMIT 1)) Two_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/three - Page' LIMIT 1)) Three_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/four - Page' LIMIT 1)) Four_Page
FROM `xxxxxxx.ga_sessions_*`,
UNNEST(hits) AS h,
UNNEST(customDimensions) AS cusDim
WHERE
_TABLE_SUFFIX BETWEEN '20190320' AND '20190323'
AND h.hitNumber = 1
AND cusDim.index = 6
AND cusDim.value IN ('60','70')
Segmentation with Custom Dimensions
You can filter sessions based on conditions on their custom dimensions: write a subquery that counts the matching entries and require the count to be greater than 0. Example using the public sample data:
SELECT
fullvisitorid,
visitstarttime,
customdimensions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
-- there should be at least one case with index=4 and value='EMEA' ... you can use your index and desired value
-- unnest() turns customdimensions into table format, so we can apply SQL to this array
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
limit 100
You can comment out the WHERE clause to see all the data.
Funnel
First you might want to get an overview of what is going on in your hits array:
SELECT
fullvisitorid,
visitstarttime,
-- get an overview over relevant hits data
-- select as struct feeds hits fields into a new array created by array()-function
ARRAY(select as struct hitnumber, page from unnest(hits) where type='PAGE') hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
limit 100
Now that you made sure the data makes sense you can create a funnel array containing the hit numbers of the relevant steps:
SELECT
fullvisitorid,
visitstarttime,
-- create array with relevant info
-- cross join hit numbers from step pages to get all combinations so that we can check later which came after the other
ARRAY(
select as struct * from
(select hitnumber as step1 from unnest(hits) where type='PAGE' and page.pagePath='/home') left join
(select hitnumber as step2 from unnest(hits) where type='PAGE' and page.pagePath like '/google+redesign/%') on true left join
(select hitnumber as step3 from unnest(hits) where type='PAGE' and page.pagePath='/basket.html') on true
) AS funnel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
limit 100
Put this into a WITH statement for more clarity and run your analysis by summarizing the corresponding cases:
WITH f AS (
SELECT
fullvisitorid,
visitstarttime,
totals.visits,
-- create array with relevant info
-- cross join hit numbers from step pages to get all combinations so that we can check later which came after the other
ARRAY(
select as struct * from
(select hitnumber as step1 from unnest(hits) where type='PAGE' and page.pagePath='/home') left join
(select hitnumber as step2 from unnest(hits) where type='PAGE' and page.pagePath like '/google+redesign/%') on true left join
(select hitnumber as step3 from unnest(hits) where type='PAGE' and page.pagePath='/basket.html') on true
) AS funnel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
)
SELECT
COUNT(DISTINCT fullvisitorid) as users,
SUM(visits) as allSessions,
SUM( IF(array_length(funnel)>0,visits,0) ) sessionsWithFunnelPages,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null ) ,visits,0) ) sessionsWithStep1,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null and step1<step2 ) ,visits,0) ) sessionsFunnelToStep2,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null and step1<step2 and step2<step3 and step1<step3) ,visits,0) ) sessionsFunnelToStep3
FROM f
Please test before using.
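As a rough illustration of the ordering check in the final query (step1 < step2 < step3 over all combinations of matching hit numbers), here is a small Python sketch. The `steps_reached` helper, the matcher lambdas, and the sample sessions are hypothetical; the page paths are the ones used in the queries above.

```python
from itertools import product


def steps_reached(page_hits, matchers):
    """Return how many funnel steps a session completed in order.

    page_hits: list of (hitNumber, pagePath) tuples for one session.
    matchers:  one predicate per funnel step.
    """
    # Hit numbers that match each step, mirroring the per-step subqueries.
    candidates = [[n for n, path in page_hits if match(path)] for match in matchers]
    reached = 0
    for depth in range(1, len(matchers) + 1):
        # Is there a combination of hits for the first `depth` steps
        # whose hit numbers are strictly increasing?
        combos = product(*candidates[:depth])
        if any(all(a < b for a, b in zip(c, c[1:])) for c in combos):
            reached = depth
        else:
            break
    return reached


matchers = [
    lambda p: p == "/home",
    lambda p: p.startswith("/google+redesign/"),
    lambda p: p == "/basket.html",
]

in_order = [(1, "/home"), (2, "/google+redesign/apparel"), (3, "/basket.html")]
basket_first = [(1, "/basket.html"), (2, "/home"), (3, "/google+redesign/bags")]
no_home = [(1, "/google+redesign/bags")]
```

Here `steps_reached(basket_first, matchers)` stops at step 2 because the only basket hit comes before the other two pages, which is the case the `step2<step3` condition is meant to exclude.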

Average Analytics Function On GA360 Visit ID for LatencyTracking

Looking to get the average latencyTracking for a visitid out of our GA 360 export.
I set up the following query but get the error below, and I'm not sure why, since these are all aggregate functions: SELECT list expression references hits.latencyTracking.serverResponseTime which is neither grouped nor aggregated at [3:5]
select
TIMESTAMP_SECONDS(visitStartTime) as visitStartTime,
AVG(hits.latencyTracking.serverResponseTime) OVER (PARTITION BY visitid) as avgServerResponseTime,
AVG(hits.latencyTracking.serverConnectionTime) OVER (PARTITION BY visitid) as avgServerConnectionTime,
AVG(hits.latencyTracking.domInteractiveTime) OVER (PARTITION BY visitid) as avgdomInteractiveTime,
AVG(hits.latencyTracking.pageLoadTime) OVER (PARTITION BY visitid) as avgpageLoadTime
from `xxx.xxx.ga_sessions_2018*`,
UNNEST(hits) AS hits
where hits.latencyTracking.serverResponseTime is not null
group by visitStartTime
The way your query is written, AVG() is not a plain aggregate function but an aggregate analytic function, because of the OVER() clause.
To make it work, you can remove OVER() so that AVG() really becomes an aggregate function that corresponds to the GROUP BY:
select
TIMESTAMP_SECONDS(visitStartTime) as visitStartTime,
AVG(hits.latencyTracking.serverResponseTime) as avgServerResponseTime,
AVG(hits.latencyTracking.serverConnectionTime) as avgServerConnectionTime,
AVG(hits.latencyTracking.domInteractiveTime) as avgdomInteractiveTime,
AVG(hits.latencyTracking.pageLoadTime) as avgpageLoadTime
from `xxx.xxx.ga_sessions_2018*`,
UNNEST(hits) AS hits
where hits.latencyTracking.serverResponseTime is not null
group by visitStartTime
Using window functions and GROUP BY together can be confusing.
In your case neither is necessary, and neither is the flattening: you can write simple subqueries to get your numbers per session:
SELECT
TIMESTAMP_SECONDS(visitStartTime) AS visitStartTime,
(
SELECT AVG(latencyTracking.serverResponseTime)
FROM t.hits
WHERE latencyTracking.serverResponseTime IS NOT NULL) AS avgServerResponseTime,
(
SELECT AVG(latencyTracking.serverConnectionTime)
FROM t.hits
WHERE latencyTracking.serverConnectionTime IS NOT NULL) AS avgServerConnectionTime,
(
SELECT AVG(latencyTracking.domInteractiveTime)
FROM t.hits
WHERE latencyTracking.domInteractiveTime IS NOT NULL ) AS avgdomInteractiveTime,
(
SELECT AVG(latencyTracking.pageLoadTime)
FROM t.hits
WHERE latencyTracking.pageLoadTime IS NOT NULL ) AS avgpageLoadTime
FROM `xxx.xxx.ga_sessions_2018*` AS t
It also doesn't involve grouping which makes it faster.
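The idea behind the subquery version is that each session row already carries its own hits array, so the average can be computed per row without joining or grouping. A minimal Python sketch of that idea, with made-up session data and a hypothetical `avg_metric` helper:

```python
from statistics import mean

# Hypothetical sessions, each with its nested hits (as in the GA export schema).
sessions = [
    {"visitId": 1, "hits": [
        {"serverResponseTime": 100},
        {"serverResponseTime": None},
        {"serverResponseTime": 300},
    ]},
    {"visitId": 2, "hits": [{"serverResponseTime": None}]},
]


def avg_metric(session, field):
    """Average one latency field over a session's hits, ignoring missing values."""
    values = [h[field] for h in session["hits"] if h.get(field) is not None]
    return mean(values) if values else None


# One result per session, computed from that session's own array.
averages = {s["visitId"]: avg_metric(s, "serverResponseTime") for s in sessions}
```

Note that a session without any latency data still produces a row (with no value), whereas the flattened query with its WHERE filter drops such sessions.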

Limit a view to select between two date partitions

I wish to define a view for Google Analytics landing pages. I've tried to set this up by saving the following query as a view:
SELECT
date,
fullVisitorId AS fv,
visitID AS v,
h.page.pagePath AS landing_page
FROM
`project-id.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
In the queries that join to this view I plan to limit them to between two date partitions like so:
SELECT
sessions.date,
fullVisitorId AS fv,
visitId AS v,
landing_page
FROM `project-id.dataset.ga_sessions_*` AS sessions, UNNEST(hits) AS h
JOIN `project-id.dataset.landing_pages` AS landing_pages
ON landing_pages.fv = sessions.fullVisitorId
AND landing_pages.date = sessions.date
AND landing_pages.v = sessions.visitId
WHERE
_TABLE_SUFFIX BETWEEN '20170108' AND '20170108'
This still appears to select a large volume of data: ~5 GB rather than the ~60 MB that would be expected for one day.
How can I re-write the view so that it only selects the relevant date partitions as defined by the consuming query?
Make sure to include the _TABLE_SUFFIX in the view definition so that you can reference it in queries over the view. Here's an example that converts the _TABLE_SUFFIX to a date:
SELECT
date,
fullVisitorId AS fv,
visitID AS v,
h.page.pagePath AS landing_page,
PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS sessions_date
FROM
`project-id.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1;
Now try a query over the view:
SELECT
COUNT(DISTINCT fv) AS total_visitors
FROM `dataset.view_name`
WHERE sessions_date = '2017-01-08';

BigQuery filtering records in standard sql

I'm working on counting all visitors that submitted a postcode on our homepage. I came up with the following query in legacy SQL:
SELECT fullVisitorId, visitStartTime
FROM TABLE_DATE_RANGE([ga_sessions_], TIMESTAMP('2017-01-29'), CURRENT_TIMESTAMP())
where hits.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and hits.type = 'EVENT'
and hits.eventInfo.eventCategory = 'Homepage'
and hits.eventInfo.eventAction = 'Submit Postcode';
I then wanted to convert it to standard SQL to use it within a CTE and came up with this one, which doesn't seem right:
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions_*`, UNNEST(hits) as h
where
_TABLE_SUFFIX > '2017-01-29'
AND h.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and h.type = 'EVENT'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode';
The first one processes 327 MB and returns 4117 results, the second one processes 6.98 GB and returns 60745 results.
I've looked at the migration guide, but it didn't prove very helpful for me.
ga_sessions has standard schema of GA import into Bigquery.
It looks like the difference comes from the fact that in standard SQL you flatten the table on hits when you CROSS JOIN UNNEST(hits) in the FROM clause, which adds one row per matching hit to the result. A closer equivalent would be:
#standardSQL
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions_*`
where
_TABLE_SUFFIX >= '20170129'
and visitStartTime > 1483228800
and EXISTS(
SELECT 1 FROM UNNEST(hits) h
WHERE h.type = 'EVENT'
and h.page.pagePath = '/broadband/'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode');
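The row-count difference between flattening and EXISTS can be pictured with a small Python sketch. The session data and the `is_match` predicate are made up for illustration; "submit" stands in for a hit that satisfies all the event conditions.

```python
# Hypothetical sessions with simplified hits.
sessions = [
    {"fullVisitorId": "A", "hits": ["submit", "submit", "other"]},
    {"fullVisitorId": "B", "hits": ["other"]},
    {"fullVisitorId": "C", "hits": ["submit"]},
]


def is_match(hit):
    return hit == "submit"


# Flattening (a join with the hits array): one output row per matching hit.
flattened = [s["fullVisitorId"] for s in sessions for h in s["hits"] if is_match(h)]

# EXISTS over the array: at most one output row per session.
exists_based = [s["fullVisitorId"] for s in sessions if any(is_match(h) for h in s["hits"])]
```

Visitor A submitted twice in one session, so the flattened version lists A twice while the EXISTS version lists A once.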
What happened here is that _TABLE_SUFFIX is a string, so when you do:
_TABLE_SUFFIX > '2017-01-29'
you end up selecting far more tables than expected, because string comparison differs from date or number comparison.
One possible way to fix that is by parsing the string to DATE type:
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions*`, UNNEST(hits) as h
where parse_date("%Y%m%d", regexp_extract(_table_suffix, r'.*_(.*)')) >= parse_date("%Y-%m-%d", '2017-01-29')
AND h.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and h.type = 'EVENT'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode';
Here parse_date first converts the string to a DATE, and then the comparison is made on dates.
Notice as well that I changed the wildcard to ga_sessions* and use REGEXP_EXTRACT to keep only what comes after the last "_" character. That way you can select the "intraday" tables as well.
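The string-comparison pitfall is easy to reproduce outside BigQuery. The Python sketch below uses a made-up list of table suffixes and a hypothetical `suffix_to_date` helper; it compares them against the hyphenated literal, against the matching `YYYYMMDD` literal, and then parses suffixes the way the regular expression above does.

```python
import re
from datetime import date, datetime

# Hypothetical daily table suffixes for the ga_sessions_* wildcard.
suffixes = ["20161231", "20170101", "20170128", "20170129", "20170201"]

# '0' sorts after '-', so every 2017 suffix compares greater than '2017-01-29'.
naive = [s for s in suffixes if s > "2017-01-29"]

# Comparing against a literal in the same YYYYMMDD format behaves as intended.
same_format = [s for s in suffixes if s >= "20170129"]


def suffix_to_date(table_suffix):
    """Parse suffixes such as '_20170129' or '_intraday_20170130' (ga_sessions* wildcard)."""
    match = re.match(r".*_(.*)", table_suffix)  # greedy: keeps what follows the last "_"
    return datetime.strptime(match.group(1), "%Y%m%d").date()


cutoff = date(2017, 1, 29)
parsed = [s for s in ["_20170128", "_20170129", "_intraday_20170130"]
          if suffix_to_date(s) >= cutoff]
```

In this sample the naive filter keeps every 2017 table, including those before January 29, which is consistent with the much larger number of bytes processed by the second query.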

Can't cast the result of LAG or LEAD into an Integer

I have been trying to do a calculation based on the result of either the LAG or LEAD functions.
Encapsulating the function in the INTEGER() casting function seems to cause an issue with the OVER function within and throws the following error:
Unrecognized Analytic Function: INT64 cannot be used with an OVER() clause
The following is the base code that works just fine, but when I add a function, it produces an error:
LEAD(hits.hitNumber, 1) OVER (PARTITION BY fullvisitorID, visitid, visitnumber ORDER BY hits.hitNumber DESC) as nextHit
The code that I was using to produce this error is as follows:
INTEGER(LEAD(hits.hitNumber, 1)) OVER (PARTITION BY fullvisitorID, visitid ORDER BY hits.hitNumber DESC) as nextHit
The following doesn't seem to work either:
INTEGER(LEAD(hits.hitNumber, 1) OVER (PARTITION BY fullvisitorID, visitid ORDER BY hits.hitNumber DESC))as nextHit
Encountered " "OVER" "OVER "" at line 8, column 36. Was expecting: ")"
Do I really need to make this a sub-query to make this work or is there a different solution?
Two possible solutions:
As Jordan says, bring the INTEGER() cast inside LEAD():
SELECT LEAD(INTEGER(hits.hitNumber), 1) OVER (PARTITION BY fullvisitorID, visitid, visitnumber ORDER BY hits.hitNumber DESC) as nextHit
FROM [dataset.ga_sessions_20140107]
Or as in your suggestion, with a sub-query:
SELECT INTEGER(nextHit) FROM (
SELECT LEAD(hits.hitNumber, 1) OVER (PARTITION BY fullvisitorID, visitid, visitnumber ORDER BY hits.hitNumber DESC) as nextHit
FROM [dataset.ga_sessions_20140107]
)
