BigQuery: Converting Legacy SQL Query to Standard SQL - google-analytics

I am converting legal SQL Query to Standard SQL Query in Bigquery to calculate google analytics bounce rate. But on Converting the query there is some difference in the output result.
Legacy SQL Query
SELECT session_bounceRate,
t1.source as source,t1.medium as medium,
total_session
FROM (
SELECT
IFNULL(t1.session_bounceSessionCount, 0) / t2.session_sessionId_distinct_count AS session_bounceRate,
t1.source,t1.medium,
t2.total_session_distinct_count AS total_session
FROM (
SELECT
INTEGER(session_bounceSession_distinct_count) AS session_bounceSessionCount,
source,
medium
FROM (
SELECT
COUNT(DISTINCT session_sessionId, 10000000) AS session_bounceSession_distinct_count, --Changes done (Count(x) in legacy sql stands for approx ,
source, --,so replaced it with APPROX_COUNT_DISTINCT to get approx count
medium from (
SELECT
SUM(IF(session_hitsType = 'event'
AND session_isInteraction_first = 1, 1, 0)) AS session_isEventInteraction_sum,
session_sessionId AS session_sessionId,
SUM(session_pageViews_sum) AS session_pageViews_sum_sum,
source,
medium
FROM (
SELECT
hits.type AS session_hitsType,
sessionId AS session_sessionId,
SUM(totals.pageviews) AS session_pageViews_sum,
FIRST(hits.isInteraction) AS session_isInteraction_first,
trafficSource.source AS source,
trafficSource.medium AS medium,
FROM
TABLE_DATE_RANGE([[test:test.session_streaming_], TIMESTAMP('2018-04-01'), TIMESTAMP('2018-04-30')) AS session_streaming
GROUP BY
source,
medium,
session_hitsType,
session_sessionId )
GROUP BY
source,
medium,
session_sessionId )
WHERE
(session_isEventInteraction_sum = 0
AND session_pageViews_sum_sum = 1)
GROUP BY
source,
medium ) ) AS t1
JOIN EACH (
SELECT
COUNT(DISTINCT sessionId, 10000000) AS session_sessionId_distinct_count,
trafficSource.source AS source,
trafficSource.medium AS medium,
COUNT(DISTINCT sessionId, 10000000) AS total_session_distinct_count
FROM
TABLE_DATE_RANGE([test:Test.session_streaming_], TIMESTAMP('2018-04-01'), TIMESTAMP('2018-04-30')) AS session_streaming
GROUP BY
source,
medium ) AS t2
ON
t1.source = t2.source
and t1.medium=t2.medium)
where t1.medium='zadv_display'
ORDER BY
total_session DESC
Standard SQL Query
SELECT
IFNULL(t1.session_bounceSessionCount, 0) / t2.session_sessionId_distinct_count AS session_bounceRate,
t1.source,
t1.medium,
t2.total_session_distinct_count AS total_session
FROM (
SELECT
CAST(session_bounceSession_distinct_count AS INT64) AS session_bounceSessionCount,
source,
medium
FROM (
SELECT
APPROX_COUNT_DISTINCT(DISTINCT session_sessionId) AS session_bounceSession_distinct_count,
source,
medium
FROM (
SELECT
SUM(IF(session_hitsType = 'event'
AND session_isInteraction_first = 1, 1, 0)) AS session_isEventInteraction_sum,
session_sessionId AS session_sessionId,
SUM(session_pageViews_sum) AS session_pageViews_sum_sum,
source,
medium
FROM (
SELECT
session_hitsType,
session_sessionId,
source,
medium,
CASE
WHEN session_isInteraction_first = TRUE THEN 1
ELSE 0
END AS session_isInteraction_first,
SUM(session_pageViews_sum) AS session_pageViews_sum
FROM (
SELECT
hits.Type AS session_hitsType,
sessionId AS session_sessionId,
trafficSource.source AS source,
trafficSource.medium AS medium,
totals.pageviews AS session_pageViews_sum,
FIRST_VALUE(hits.isInteraction) OVER(PARTITION BY sessionId ORDER BY TIMESTAMP_SECONDS(hits.Time)) AS session_isInteraction_first
FROM
`test.Test.session_streaming_*`,
unNEST(hits) hits
WHERE
_table_suffix BETWEEN '20180401'
AND '20180430' )
GROUP BY
session_hitsType,
session_sessionId,
source,
medium,
session_isInteraction_first )
GROUP BY
source,
medium,
session_sessionId)
WHERE
(session_isEventInteraction_sum = 0
AND session_pageViews_sum_sum = 1)
GROUP BY
source,
medium)) AS t1
JOIN (
SELECT
APPROX_COUNT_DISTINCT(DISTINCT sessionId) AS session_sessionId_distinct_count,
trafficSource.source AS source,
trafficSource.medium AS medium,
APPROX_COUNT_DISTINCT(DISTINCT sessionId) AS total_session_distinct_count
FROM
`test.Test.session_streaming_*`
WHERE _table_suffix BETWEEN '20180401' AND '20180430'
GROUP BY
source,
medium
) AS t2
ON
t1.source = t2.source
AND t1.medium=t2.medium
where t1.medium='zadv_display'
order by total_session desc
We have replace First function in value with First_value in standard sql that is the noticeable change made in the query.
Could Someone guide me is there some issue on conversion as standard sql query query output should match with legacy output?

I'm not really sure you really need all those nested queries.
But you should make use of sub-queries on arrays - a lot. E.g. the first event interaction information in a session goes like this:
SELECT
date,
visitStartTime,
(SELECT isInteraction FROM t.hits WHERE type='EVENT' ORDER BY hitNumber ASC LIMIT 1) AS isInteraction
FROM
`project.dataset.ga_sessions_20180624` AS t
LIMIT
1000
So simply treat (struct-)arrays like smaller tables within a bigger table.
If you only have aggregations to session level, you shouldn't flatten the table at all.
Here's also a document explaining the migration to standard sql: https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql

Related

How to replicate the Model Comparison Tool report from Google Analytics to Google BigQuery

I have the following report in the demo account of Google Analytics:
https://analytics.google.com/analytics/web/?utm_source=demoaccount&utm_medium=demoaccount&utm_campaign=demoaccount#/report/bf-roi-calculator/a54516992w87479473p92320289/_u.date00=20211101&_u.date01=20211128&_r.attrSel2=preset6&_r.attrSel1=preset1&_r.attrSel3=preset7/
In this report, we can see the different models of conversion attribution, e.g. Last Interaction, Last Non-Direct Click, and Last Google Ads Click. There are also other models, like First Interaction, Position based. Here's Google's documentation about the multi-channel funnels report:
https://support.google.com/analytics/topic/1191164?hl=en&ref_topic=1631741
So far, I have managed to build the following query:
-- Sessions with source/medium, hits, and page path
WITH table_1 AS (
SELECT
fullVisitorId,
visitStartTime,
CONCAT(fullVisitorId, visitId, date) AS session,
trafficSource.medium,
trafficSource.source,
ANY_VALUE(social.hasSocialSourceReferral) AS social_source_referral,
trafficSource.campaign,
ARRAY_AGG(hitNumber ORDER BY hitNumber) AS hit_number,
ARRAY_AGG(page.pagePath ORDER BY hitNumber) AS page_path
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`, UNNEST(hits) AS hits_
WHERE _TABLE_SUFFIX BETWEEN '20170727' AND '20170801'
GROUP BY fullVisitorId, visitStartTime, session, medium, source, campaign),
-- Adding the MCF channel grouping and creating a field that indicates sessions with conversions
table_2 AS (
SELECT
fullVisitorId,
visitStartTime,
CASE
WHEN source = '(direct)' AND (medium = '(not set)' OR medium = '(none)') THEN 'Direct'
WHEN medium = 'organic' THEN 'Organic Search'
WHEN social_source_referral = 'Yes' AND REGEXP_CONTAINS(medium, r'^(social|social-network|social-media|sm|social network|social media)$') THEN 'Social'
WHEN medium = 'email' THEN 'Email'
WHEN medium = 'affiliate' THEN 'Affiliate'
WHEN medium = 'referral' THEN 'Referral'
WHEN REGEXP_CONTAINS(medium, r'^(cpc|ppc|paidsearch)$') THEN 'Paid Search'
WHEN REGEXP_CONTAINS(medium, r'^(cpv|cpa|cpp|content-text)$') THEN 'Other Advertising'
WHEN REGEXP_CONTAINS(medium, r'^(display|cpm|banner)$') THEN 'Display'
ELSE 'Other'
END AS mcf_channel_grouping,
medium,
source,
campaign,
CAST(
EXISTS(
SELECT *
FROM UNNEST(page_path) AS x
WHERE REGEXP_CONTAINS(x, r'^/ordercompleted\.html')
)
AS INT64
) AS conversion
FROM table_1
ORDER BY fullVisitorId
),
-- Filtering by sessions with conversions
table_3 AS (
SELECT *
FROM table_2
WHERE TRUE
QUALIFY COUNTIF(conversion = 1) OVER (PARTITION BY fullVisitorId) > 0
),
-- Adding the attribution models
table_4 AS (
SELECT
fullVisitorId,
DATE(TIMESTAMP_SECONDS(visitStartTime)) AS date,
visitStartTime AS date_sec,
mcf_channel_grouping,
medium,
source,
campaign,
conversion,
CASE
WHEN conversion > 0 AND visitStartTime > LAG(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
WHEN conversion > 0 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
ELSE 'null'
END AS last_touch_attribution,
CASE
WHEN conversion > 0 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
WHEN conversion = 0 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN 'null'
WHEN conversion = 0 AND LAG(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN 'null'
WHEN SUM(conversion) OVER (PARTITION BY fullVisitorId) > 0 AND visitStartTime < LAST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
AND LEAD(source) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) = 'direct' AND source != 'direct' THEN '1'
WHEN conversion > 0 AND visitStartTime = LAST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AND source != 'direct' THEN '1'
ELSE 'null'
END AS last_non_direct,
CASE
WHEN MAX(conversion) OVER (PARTITION BY fullVisitorId) = 1 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
ELSE 'null'
END AS first_touch_attribution,
CASE
WHEN MAX(conversion) OVER (PARTITION BY fullVisitorId) = 1 THEN '1'
ELSE 'null'
END AS any_touch_attribution,
CASE
WHEN MAX(conversion) OVER (PARTITION BY fullVisitorId) = 1 AND source = 'blog' THEN '1'
ELSE 'null'
END AS blog_only
FROM table_3
ORDER BY fullVisitorId, visitStartTime
)
SELECT *
FROM table_4
The issue I have is that the Last Non-Direct model is not calculated correctly and I don't know how to create the look-back window that allows me to set n days prior to conversion.
How could we replicate this report in BigQuery using Standard SQL? Thanks.

Unnest hits and Unnesting session scoped custom dimension BigQuery code filter

I am trying to filter a funnel based on users who have certain custom dimension values. Sadly, the custom dimension in question is session-scoped and not hit-based, so i cannot use hits.customDimensions in this particular query. What is the best way to do this and achieve the desired result?
Find my progress so far:
#standardSQL
SELECT
SUM((SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page' LIMIT 1)) One_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/two - Page' LIMIT 1)) Two_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/three - Page' LIMIT 1)) Three_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/four - Page' LIMIT 1)) Four_Page
FROM `xxxxxxx.ga_sessions_*`,
UNNEST(hits) AS h,
UNNEST(customDimensions) AS cusDim
WHERE
_TABLE_SUFFIX BETWEEN '20190320' AND '20190323'
AND h.hitNumber = 1
AND cusDim.index = 6
AND cusDim.value IN ('60','70)
Segmentation with Custom Dimensions
You can filter for sessions based on conditions in custom dimensions. Simply write a sub-query counting cases of interest and set to ">0". Example for sample data:
SELECT
fullvisitorid,
visitstarttime,
customdimensions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
-- there should be at least one case with index=4 and value='EMEA' ... you can use your index and desired value
-- unnest() turns customdimensions into table format, so we can apply SQL to this array
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
limit 100
You comment the WHERE statement to see all the data.
Funnel
First you might want to get an overview of what is going on in your hits array:
SELECT
fullvisitorid,
visitstarttime,
-- get an overview over relevant hits data
-- select as struct feeds hits fields into a new array created by array()-function
ARRAY(select as struct hitnumber, page from unnest(hits) where type='PAGE') hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
limit 100
Now that you made sure the data makes sense you can create a funnel array containing the hit numbers of the relevant steps:
SELECT
fullvisitorid,
visitstarttime,
-- create array with relevant info
-- cross join hit numbers from step pages to get all combinations so that we can check later which came after the other
ARRAY(
select as struct * from
(select hitnumber as step1 from unnest(hits) where type='PAGE' and page.pagePath='/home') left join
(select hitnumber as step2 from unnest(hits) where type='PAGE' and page.pagePath like '/google+redesign/%') on true left join
(select hitnumber as step3 from unnest(hits) where type='PAGE' and page.pagePath='/basket.html') on true
) AS funnel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
limit 100
Put this into a WITH statement for more clarity and run your analysis by summarizing the corresponding cases:
WITH f AS (
SELECT
fullvisitorid,
visitstarttime,
totals.visits,
-- create array with relevant info
-- cross join hit numbers from step pages to get all combinations so that we can check later which came after the other
ARRAY(
select as struct * from
(select hitnumber as step1 from unnest(hits) where type='PAGE' and page.pagePath='/home') left join
(select hitnumber as step2 from unnest(hits) where type='PAGE' and page.pagePath like '/google+redesign/%') on true left join
(select hitnumber as step3 from unnest(hits) where type='PAGE' and page.pagePath='/basket.html') on true
) AS funnel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
)
SELECT
COUNT(DISTINCT fullvisitorid) as users,
SUM(visits) as allSessions,
SUM( IF(array_length(funnel)>0,visits,0) ) sessionsWithFunnelPages,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null ) ,visits,0) ) sessionsWithStep1,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null and step1<step2 ) ,visits,0) ) sessionsFunnelToStep2,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null and step1<step2 and step2<step3 and step1<step3) ,visits,0) ) sessionsFunnelToStep3
FROM f
Please test before using.

Recreate GA Funnel on BigQuery

I am trying to recreate the GA funnel (custom report on Google360) using BigQuery. The funnel on GA is using the unique count of events that happen on each page. I found this query online that is working for the most part:
SELECT
COUNT( s0.firstHit) AS Landing_Page,
COUNT( s1.firstHit) AS Model_Selection
from(
SELECT
s0.fullvisitorID,
s0.firstHit,
s1.firstHit,
FROM (
# Begin Subquery #1 aka s0
SELECT
fullvisitorID,
MIN(hits.hitNumber) AS firstHit
FROm [64269470.ga_sessions_20170720]
WHERE
hits.eventInfo.eventAction in ('landing_page')
AND totals.visits = 1
GROUP BY
fullvisitorID
) s0
# End Subquery #1 aka s0
left join (
# Begin Subquery #2 aka s1
SELECT
fullvisitorID,
MIN(hits.hitNumber) AS firstHit
FROM [64269470.ga_sessions_20170720]
WHERE
hits.eventInfo.eventAction in ('model_selection_page')
AND totals.visits = 1
GROUP BY
fullvisitorID,
) s1
ON
s0.fullvisitorID = s1.fullvisitorID
)
The query works fine and the value for landing page is the same as I can get on GA, but Model_Selection is about 10% higher. This difference also increases along the funnel (I only posted 2 steps for clarity).
Any idea what am I missing here?
This query does what you need but in Standard SQL Version:
#standardSQL
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM `64269470.ga_sessions_20170720`
Just that. 4 lines, way faster and cheaper.
You can also play with simulated data, something like:
#standardSQL
WITH data AS(
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo)] AS hits
)
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM data
Notice that building this type of report in GA might be a bit more difficult as you need to select visitors who had at least fired once the event 'landing_page' and then had the event 'model_selection_page' fired. Make sure you got this report built correctly as well in your GA (one way might be to first build a customized report with only customers who had 'landing_page' fired and then apply the second filter looking for 'model_selection_page').
[EDIT]:
You asked in your comment about bringing this counting on the session and user level. For counting each session, you can limit the results to 1 for each sub-query evaluation, like so:
SELECT
SUM((SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page') AND eventInfo.eventAction = 'model_selection_page' LIMIT 1)) Model_Selection
FROM data
For counting distinct users, the idea is the same but you'd have to apply a COUNT(DISTINCT) operation, like so:
SELECT
COUNT(DISTINCT(SELECT fullvisitorid FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page,
COUNT(DISTINCT(SELECT fullvisitorid FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page') AND eventInfo.eventAction = 'model_selection_page' LIMIT 1)) Model_Selection
FROM data

Limit a view to select between two date partitions

I wish to define a view for Google Analytics landing pages. I've tried to set this up by saving the following query as a view:
SELECT
date,
fullVisitorId AS fv,
visitID AS v,
h.page.pagePath AS landing_page
FROM
`project-id.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
In the queries that join to this view I plan to limit them to between two date partitions like so:
SELECT
sessions.date,
fullVisitorId AS fv,
visitId AS v,
landing_page
FROM `project-id.dataset.ga_sessions_*` AS sessions, UNNEST(hits) AS h
JOIN `project-id.dataset.landing_pages` AS landing_pages
ON landing_pages.fv = sessions.fullVisitorId
AND landing_pages.date = sessions.date
AND landing_pages.v = sessions.visitId
WHERE
_TABLE_SUFFIX BETWEEN '20170108' AND '20170108'
This still appears to select a large volume of data ~5GB rather than ~60MB that would be expected for one day.
How can I re-write the view so that it only selects the relevant date partitions as defined by the consuming query?
Make sure to include the _TABLE_SUFFIX in the view definition so that you can reference it in queries over the view. Here's an example that converts the _TABLE_SUFFIX to a date:
SELECT
date,
fullVisitorId AS fv,
visitID AS v,
h.page.pagePath AS landing_page,
PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS sessions_date
FROM
`project-id.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1;
Now try a query over the view:
SELECT
COUNT(DISTINCT fullVisitorId) AS total_visitors
FROM `dataset.view_name`
WHERE sessions_date = '2017-01-08';

BigQuery filtering records in standard sql

I'm working on counting all visitors that submitted postcode on our homepage. I came up with following query in legacy SQL:
SELECT fullVisitorId, visitStartTime
FROM TABLE_DATE_RANGE([ga_sessions_], TIMESTAMP('2017-01-29'), CURRENT_TIMESTAMP())
where hits.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and hits.type = 'EVENT'
and hits.eventInfo.eventCategory = 'Homepage'
and hits.eventInfo.eventAction = 'Submit Postcode';
I then wanted to convert it to standard SQL to use within CTE and came up with this one that doesn't seem right though.
SELECT fullVisitorId, visitStartTime
FROM ``ga_sessions_*``, UNNEST(hits) as h
where
_TABLE_SUFFIX > '2017-01-29'
AND h.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and h.type = 'EVENT'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode';
The first one processes 327 MB and returns 4117 results, the second one processes 6.98 GB and returns 60745 results.
I've looked at the migration guide, but it didn't prove very helpful for me.
ga_sessions has standard schema of GA import into Bigquery.
It looks like difference is coming from the fact that with Standard SQL you are flattening the table on hits when you CROSS JOIN UNNEST(hits) in the FROM clause, and therefore adding more rows to the result. More equivalent query would be:
#standardSQL
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions_*`
where
_TABLE_SUFFIX > '20170129'
and visitStartTime > 1483228800
and EXISTS(
SELECT 1 FROM UNNEST(hits) h
WHERE h.type = 'EVENT'
and h.page.pagePath = '/broadband/'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode');
What happened here is that as _TABLE_SUFFIX is a string so when you do:
_TABLE_SUFFIX > '2017-01-29'
You will end up selecting way more tables then expected as string comparisons is different from number comparisons.
One possible way to fix that is by parsing the string to DATE type:
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions*`, UNNEST(hits) as h
where parse_date("%Y%m%d", regexp_extract(_table_suffix, r'.*_(.*)')) >= parse_date("%Y-%m-%d", '2017-01-29')
AND h.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and h.type = 'EVENT'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode';
Where the parse_date operation first casts the string to DATE and then the comparison is made.
Notice as well that I changed the wildcard selection to ga_sessions and then using the REGEX_EXTRACT I consider only what comes after the "_" character. By doing so, you'll be able to select "intraday" tables as well.

Resources