I'm using BigQuery to report on Google Analytics data. I'm trying to recreate landing page data using BigQuery.
The following query reports 18% fewer sessions than in the Google Analytics interface:
SELECT DISTINCT
fullVisitorId,
visitID,
h.page.pagePath AS LandingPage
FROM
`project-name.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND h.type = 'PAGE'
AND _TABLE_SUFFIX BETWEEN '20170331' AND '20170331'
ORDER BY fullVisitorId DESC
Where am I going wrong with my approach? Why can't I get to within a small margin of the number in the GA interface's reported figure?
Multiple reasons :
1.Big Query for equivalent landing page:
SELECT
LandingPage,
COUNT(sessionId) AS Sessions,
100 * SUM(totals.bounces)/COUNT(sessionId) AS BounceRate,
AVG(totals.pageviews) AS AvgPageviews,
SUM(totals.timeOnSite)/COUNT(sessionId) AS AvgTimeOnSite,
from(
SELECT
CONCAT(fullVisitorId,STRING(visitId)) AS sessionID,
totals.bounces,
totals.pageviews,
totals.timeOnSite,
hits.page.pagePath AS landingPage
FROM (
SELECT
fullVisitorId,
visitId,
hits.page.pagePath,
totals.bounces,
totals.pageviews,
totals.timeOnSite,
MIN(hits.hitNumber) WITHIN RECORD AS firstHit,
hits.hitNumber AS hitNumber
FROM (TABLE_DATE_RANGE ([XXXYYYZZZ.ga_sessions_],TIMESTAMP('2016-08-01'), TIMESTAMP ('2016-08-31')))
WHERE
hits.type = 'PAGE'
AND hits.page.pagePath'')
WHERE
hitNumber = firstHit)
GROUP BY
LandingPage
ORDER BY
Sessions DESC,
LandingPage
Next :
Pre-calculated data -- pre-aggregated tables
These are the precalculated data that Google uses to speed up the UI. Google does not specify when this is done but it can be at any point of the time. These are known as pre-aggregated tables
So if you compare the numbers from GA UI to your Big Query output, you will always see a discrepancy. Please go ahead and rely on your big query data .
You can achieve the same thing by simply adding the below to your select statement:
,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
I can get a 1 to 1 match with the GA UI on my end when I run something like below, which is a bit more concise than the original answer:
SELECT DISTINCT
a.landingpage
,COUNT(DISTINCT(a.sessionId)) sessions
,SUM(a.bounces) bounces
,AVG(a.avg_pages) avg_pages
,(SUM(tos)/COUNT(DISTINCT(a.sessionId)))/60 session_duration
FROM
(
SELECT DISTINCT
CONCAT(CAST(fullVisitorId AS STRING),CAST(visitStartTime AS STRING)) sessionId
,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
,totals.bounces bounces
,totals.timeonsite tos
,(SELECT COUNT(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE') avg_pages
FROM `tablename_*`
WHERE _TABLE_SUFFIX >= '20180801'
AND _TABLE_SUFFIX <= '20180808'
AND totals.visits = 1
) a
GROUP BY 1
another way here! you can get the same number :
SELECT
LandingPage,
COUNT(DISTINCT(sessionID)) AS sessions
FROM(
SELECT
CONCAT(fullVisitorId,CAST(visitId AS STRING)) AS sessionID,
FIRST_VALUE(hits.page.pagePath) OVER (PARTITION BY CONCAT(fullVisitorId,CAST(visitId AS STRING)) ORDER BY hits.hitNumber ASC ) AS LandingPage
FROM
`xxxxxxxx1.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND hits.type ='PAGE'
GROUP BY fullVisitorId, visitId, sessionID,hits.page.pagePath,hits.hitNumber
)
GROUP BY LandingPage
ORDER BY sessions DESC
There is a hit.isEntrance field in the schema that can be used for this purpose.
The example below would show you yesterday's landing pages:
#standardSQL
select
date,
hits.page.pagePath as landingPage,
sum(totals.visits) as visits,
sum(totals.bounces) as bounces,
sum(totals.transactions) as transactions
from
`project.dataset.ga_sessions_*`,
unnest(hits) as hits
where
(_table_suffix
between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
and hits.isEntrance = True
and totals.visits = 1 #avoid counting midnight-split sessions
group by
1, 2
order by 3 desc
There is still one source of discrepancy though, which comes from the sessions without a landing page (if you check in GA in the landing pages report, there will sometimes be a (not set) value.
In order to include those as well, you can do:
with
landing_pages_set as (
select
concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string)) as fullVisitId,
hits.page.pagePath as virtualPagePath
from
`project.dataset.ga_sessions_*`,
unnest(hits) as hits
where
(_table_suffix
between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
and totals.visits = 1 #avoid counting midnight-split sessions
and hits.isEntrance = TRUE
group by 1, 2
),
landing_pages_not_set as (
select
concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string)) as fullVisitId,
date,
"(not set)" as virtualPagePath,
count(distinct concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string))) as visits,
sum(totals.bounces) as bounces,
sum(totals.transactions) as transactions
from
`project.dataset.ga_sessions_*`
where
(_table_suffix
between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
and totals.visits = 1 #avoid counting midnight-split sessions
group by 1, 2, 3
),
landing_pages as (
select
l.fullVisitId as fullVisitId,
date,
coalesce(r.virtualPagePath, l.virtualPagePath) as virtualPagePath,
visits,
bounces,
transactions
from
landing_pages_not_set l left join landing_pages_set r on l.fullVisitId = r.fullVisitId
)
select virtualPagePath, sum(visits) from landing_pages group by 1 order by 2 desc
Related
I have the following report in the demo account of Google Analytics:
https://analytics.google.com/analytics/web/?utm_source=demoaccount&utm_medium=demoaccount&utm_campaign=demoaccount#/report/bf-roi-calculator/a54516992w87479473p92320289/_u.date00=20211101&_u.date01=20211128&_r.attrSel2=preset6&_r.attrSel1=preset1&_r.attrSel3=preset7/
In this report, we can see the different models of conversion attribution, e.g. Last Interaction, Last Non-Direct Click, and Last Google Ads Click. There are also other models, like First Interaction, Position based. Here's Google's documentation about the multi-channel funnels report:
https://support.google.com/analytics/topic/1191164?hl=en&ref_topic=1631741
So far, I have managed to build the following query:
-- Sessions with source/medium, hits, and page path
WITH table_1 AS (
SELECT
fullVisitorId,
visitStartTime,
CONCAT(fullVisitorId, visitId, date) AS session,
trafficSource.medium,
trafficSource.source,
ANY_VALUE(social.hasSocialSourceReferral) AS social_source_referral,
trafficSource.campaign,
ARRAY_AGG(hitNumber ORDER BY hitNumber) AS hit_number,
ARRAY_AGG(page.pagePath ORDER BY hitNumber) AS page_path
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`, UNNEST(hits) AS hits_
WHERE _TABLE_SUFFIX BETWEEN '20170727' AND '20170801'
GROUP BY fullVisitorId, visitStartTime, session, medium, source, campaign),
-- Adding the MCF channel grouping and creating a field that indicates sessions with conversions
table_2 AS (
SELECT
fullVisitorId,
visitStartTime,
CASE
WHEN source = '(direct)' AND (medium = '(not set)' OR medium = '(none)') THEN 'Direct'
WHEN medium = 'organic' THEN 'Organic Search'
WHEN social_source_referral = 'Yes' AND REGEXP_CONTAINS(medium, r'^(social|social-network|social-media|sm|social network|social media)$') THEN 'Social'
WHEN medium = 'email' THEN 'Email'
WHEN medium = 'affiliate' THEN 'Affiliate'
WHEN medium = 'referral' THEN 'Referral'
WHEN REGEXP_CONTAINS(medium, r'^(cpc|ppc|paidsearch)$') THEN 'Paid Search'
WHEN REGEXP_CONTAINS(medium, r'^(cpv|cpa|cpp|content-text)$') THEN 'Other Advertising'
WHEN REGEXP_CONTAINS(medium, r'^(display|cpm|banner)$') THEN 'Display'
ELSE 'Other'
END AS mcf_channel_grouping,
medium,
source,
campaign,
CAST(
EXISTS(
SELECT *
FROM UNNEST(page_path) AS x
WHERE REGEXP_CONTAINS(x, r'^/ordercompleted\.html')
)
AS INT64
) AS conversion
FROM table_1
ORDER BY fullVisitorId
),
-- Filtering by sessions with conversions
table_3 AS (
SELECT *
FROM table_2
WHERE TRUE
QUALIFY COUNTIF(conversion = 1) OVER (PARTITION BY fullVisitorId) > 0
),
-- Adding the attribution models
table_4 AS (
SELECT
fullVisitorId,
DATE(TIMESTAMP_SECONDS(visitStartTime)) AS date,
visitStartTime AS date_sec,
mcf_channel_grouping,
medium,
source,
campaign,
conversion,
CASE
WHEN conversion > 0 AND visitStartTime > LAG(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
WHEN conversion > 0 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
ELSE 'null'
END AS last_touch_attribution,
CASE
WHEN conversion > 0 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
WHEN conversion = 0 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN 'null'
WHEN conversion = 0 AND LAG(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN 'null'
WHEN SUM(conversion) OVER (PARTITION BY fullVisitorId) > 0 AND visitStartTime < LAST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
AND LEAD(source) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) = 'direct' AND source != 'direct' THEN '1'
WHEN conversion > 0 AND visitStartTime = LAST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AND source != 'direct' THEN '1'
ELSE 'null'
END AS last_non_direct,
CASE
WHEN MAX(conversion) OVER (PARTITION BY fullVisitorId) = 1 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
ELSE 'null'
END AS first_touch_attribution,
CASE
WHEN MAX(conversion) OVER (PARTITION BY fullVisitorId) = 1 THEN '1'
ELSE 'null'
END AS any_touch_attribution,
CASE
WHEN MAX(conversion) OVER (PARTITION BY fullVisitorId) = 1 AND source = 'blog' THEN '1'
ELSE 'null'
END AS blog_only
FROM table_3
ORDER BY fullVisitorId, visitStartTime
)
SELECT *
FROM table_4
The issue I have is that the Last Non-Direct model is not calculated correctly and I don't know how to create the look-back window that allows me to set n days prior to conversion.
How could we replicate this report in BigQuery using Standard SQL? Thanks.
I am trying to filter a funnel based on users who have certain custom dimension values. Sadly, the custom dimension in question is session-scoped and not hit-based, so i cannot use hits.customDimensions in this particular query. What is the best way to do this and achieve the desired result?
Find my progress so far:
#standardSQL
SELECT
SUM((SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page' LIMIT 1)) One_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/two - Page' LIMIT 1)) Two_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/three - Page' LIMIT 1)) Three_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/four - Page' LIMIT 1)) Four_Page
FROM `xxxxxxx.ga_sessions_*`,
UNNEST(hits) AS h,
UNNEST(customDimensions) AS cusDim
WHERE
_TABLE_SUFFIX BETWEEN '20190320' AND '20190323'
AND h.hitNumber = 1
AND cusDim.index = 6
AND cusDim.value IN ('60','70)
Segmentation with Custom Dimensions
You can filter for sessions based on conditions in custom dimensions. Simply write a sub-query counting cases of interest and set to ">0". Example for sample data:
SELECT
fullvisitorid,
visitstarttime,
customdimensions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
-- there should be at least one case with index=4 and value='EMEA' ... you can use your index and desired value
-- unnest() turns customdimensions into table format, so we can apply SQL to this array
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
limit 100
You comment the WHERE statement to see all the data.
Funnel
First you might want to get an overview of what is going on in your hits array:
SELECT
fullvisitorid,
visitstarttime,
-- get an overview over relevant hits data
-- select as struct feeds hits fields into a new array created by array()-function
ARRAY(select as struct hitnumber, page from unnest(hits) where type='PAGE') hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
limit 100
Now that you made sure the data makes sense you can create a funnel array containing the hit numbers of the relevant steps:
SELECT
fullvisitorid,
visitstarttime,
-- create array with relevant info
-- cross join hit numbers from step pages to get all combinations so that we can check later which came after the other
ARRAY(
select as struct * from
(select hitnumber as step1 from unnest(hits) where type='PAGE' and page.pagePath='/home') left join
(select hitnumber as step2 from unnest(hits) where type='PAGE' and page.pagePath like '/google+redesign/%') on true left join
(select hitnumber as step3 from unnest(hits) where type='PAGE' and page.pagePath='/basket.html') on true
) AS funnel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
limit 100
Put this into a WITH statement for more clarity and run your analysis by summarizing the corresponding cases:
WITH f AS (
SELECT
fullvisitorid,
visitstarttime,
totals.visits,
-- create array with relevant info
-- cross join hit numbers from step pages to get all combinations so that we can check later which came after the other
ARRAY(
select as struct * from
(select hitnumber as step1 from unnest(hits) where type='PAGE' and page.pagePath='/home') left join
(select hitnumber as step2 from unnest(hits) where type='PAGE' and page.pagePath like '/google+redesign/%') on true left join
(select hitnumber as step3 from unnest(hits) where type='PAGE' and page.pagePath='/basket.html') on true
) AS funnel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
)
SELECT
COUNT(DISTINCT fullvisitorid) as users,
SUM(visits) as allSessions,
SUM( IF(array_length(funnel)>0,visits,0) ) sessionsWithFunnelPages,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null ) ,visits,0) ) sessionsWithStep1,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null and step1<step2 ) ,visits,0) ) sessionsFunnelToStep2,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null and step1<step2 and step2<step3 and step1<step3) ,visits,0) ) sessionsFunnelToStep3
FROM f
Please test before using.
I've got this query that I'd like to add additional metric of "product details views" this is hits.ecommerceaction.action_type = 2.
I understand generally how these queries work, but this one is already complicated for me, and I'm struggling to add these additional nested hits into the mix.
This query I have already works to give me landing page and additional dimensions, so all I want to do now is add in product detail views.
SELECT DISTINCT
a.date
,a.landingpage
,a.medium
,a.sources
,a.campaign
,a.device
,a.content
,a.country
,COUNT(DISTINCT(a.sessionId)) sessions
,SUM(a.bounces) bounces
,SUM(a.trans) trans
,SUM(a.rev)/1000000 rev
,AVG(a.avg_pages) avg_pages
,(SUM(tos)/COUNT(DISTINCT(a.sessionId)))/60 session_duration
,COUNT(DISTINCT(a.user)) users
FROM
(
SELECT DISTINCT
CONCAT(CAST(fullVisitorId AS STRING),CAST(visitStartTime AS STRING)) sessionId
,fullvisitorid user
,(SELECT sourcePropertyInfo.sourcePropertyDisplayName FROM UNNEST(hits) where hitnumber = (SELECT MIN(hitnumber) from UNNEST(hits) where type = 'PAGE')) country
,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
,date
,trafficSource.medium medium
,trafficSource.source sources
,trafficSource.campaign campaign
,trafficSource.adContent content
,device.deviceCategory device
,totals.bounces bounces
,totals.timeonsite tos
,totals.transactions trans
,totals.transactionRevenue as rev
,(SELECT COUNT(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE') avg_pages
FROM `ghd-analytics-235112.132444882.ga_sessions_*`
WHERE _TABLE_SUFFIX >= '20190417' /*date start*/
AND _TABLE_SUFFIX <= '20190417' /*date end*/
AND totals.visits = 1
) a
GROUP BY landingpage,medium,device,sources,campaign,content,date,country
ORDER BY sessions desc
Any thoughts/help much appreciated!
I've found a solution, which I had tried other variations of, but this seems to work now.
,(SELECT COUNT(eventinfo.eventaction) FROM UNNEST(hits) WHERE eventinfo.eventaction = 'productDetail') pviews
Full Query here for anyone else who would like it.
/* landing page, medium, source, campaign, adcontent, device, country, sessions, bounces, avg pages per session, time on site, transactions, revenue
add additional dimensions and metrics into the second select statement, aggregate in the top select statement, order by any new dimensions
*/
SELECT DISTINCT
a.date
,a.landingpage
,a.medium
,a.sources
,a.campaign
,a.device
,a.content
,a.country
,COUNT(DISTINCT(a.sessionId)) sessions
,SUM(a.bounces) bounces
,SUM(a.trans) trans
,SUM(a.rev)/1000000 rev
,AVG(a.avg_pages) avg_pages
,(SUM(tos)/COUNT(DISTINCT(a.sessionId)))/60 session_duration
,COUNT(DISTINCT(a.user)) users
,sum(a.pviews) pviews
FROM
(
SELECT DISTINCT
CONCAT(CAST(fullVisitorId AS STRING),CAST(visitStartTime AS STRING)) sessionId
,fullvisitorid user
,(SELECT sourcePropertyInfo.sourcePropertyDisplayName FROM UNNEST(hits) where hitnumber = (SELECT MIN(hitnumber) from UNNEST(hits) where type = 'PAGE')) country
,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
,date
,trafficSource.medium medium
,trafficSource.source sources
,trafficSource.campaign campaign
,trafficSource.adContent content
,device.deviceCategory device
,totals.bounces bounces
,totals.timeonsite tos
,totals.transactions trans
,totals.transactionRevenue as rev
,(SELECT COUNT(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE') avg_pages
,(SELECT COUNT(eventinfo.eventaction) FROM UNNEST(hits) WHERE eventinfo.eventaction = 'productDetail') pviews
FROM `ghd-analytics-XXXXXX.XXXXXXX.ga_sessions_*`
WHERE _TABLE_SUFFIX >= '20190417' /*date start*/
AND _TABLE_SUFFIX <= '20190417' /*date end*/
AND totals.visits = 1
) a
GROUP BY landingpage,medium,device,sources,campaign,content,date,country
ORDER BY sessions desc
I am trying to recreate the GA funnel (custom report on Google360) using BigQuery. The funnel on GA is using the unique count of events that happen on each page. I found this query online that is working for the most part:
SELECT
COUNT( s0.firstHit) AS Landing_Page,
COUNT( s1.firstHit) AS Model_Selection
from(
SELECT
s0.fullvisitorID,
s0.firstHit,
s1.firstHit,
FROM (
# Begin Subquery #1 aka s0
SELECT
fullvisitorID,
MIN(hits.hitNumber) AS firstHit
FROm [64269470.ga_sessions_20170720]
WHERE
hits.eventInfo.eventAction in ('landing_page')
AND totals.visits = 1
GROUP BY
fullvisitorID
) s0
# End Subquery #1 aka s0
left join (
# Begin Subquery #2 aka s1
SELECT
fullvisitorID,
MIN(hits.hitNumber) AS firstHit
FROM [64269470.ga_sessions_20170720]
WHERE
hits.eventInfo.eventAction in ('model_selection_page')
AND totals.visits = 1
GROUP BY
fullvisitorID,
) s1
ON
s0.fullvisitorID = s1.fullvisitorID
)
The query works fine and the value for landing page is the same as I can get on GA, but Model_Selection is about 10% higher. This difference also increases along the funnel (I only posted 2 steps for clarity).
Any idea what am I missing here?
This query does what you need but in Standard SQL Version:
#standardSQL
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM `64269470.ga_sessions_20170720`
Just that. 4 lines, way faster and cheaper.
You can also play with simulated data, something like:
#standardSQL
WITH data AS(
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo)] AS hits
)
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM data
Notice that building this type of report in GA might be a bit more difficult as you need to select visitors who had at least fired once the event 'landing_page' and then had the event 'model_selection_page' fired. Make sure you got this report built correctly as well in your GA (one way might be to first build a customized report with only customers who had 'landing_page' fired and then apply the second filter looking for 'model_selection_page').
[EDIT]:
You asked in your comment about bringing this counting on the session and user level. For counting each session, you can limit the results to 1 for each sub-query evaluation, like so:
SELECT
SUM((SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page') AND eventInfo.eventAction = 'model_selection_page' LIMIT 1)) Model_Selection
FROM data
For counting distinct users, the idea is the same but you'd have to apply a COUNT(DISTINCT) operation, like so:
SELECT
COUNT(DISTINCT(SELECT fullvisitorid FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page,
COUNT(DISTINCT(SELECT fullvisitorid FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page') AND eventInfo.eventAction = 'model_selection_page' LIMIT 1)) Model_Selection
FROM data
I wish to define a view for Google Analytics landing pages. I've tried to set this up by saving the following query as a view:
SELECT
date,
fullVisitorId AS fv,
visitID AS v,
h.page.pagePath AS landing_page
FROM
`project-id.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
In the queries that join to this view I plan to limit them to between two date partitions like so:
SELECT
sessions.date,
fullVisitorId AS fv,
visitId AS v,
landing_page
FROM `project-id.dataset.ga_sessions_*` AS sessions, UNNEST(hits) AS h
JOIN `project-id.dataset.landing_pages` AS landing_pages
ON landing_pages.fv = sessions.fullVisitorId
AND landing_pages.date = sessions.date
AND landing_pages.v = sessions.visitId
WHERE
_TABLE_SUFFIX BETWEEN '20170108' AND '20170108'
This still appears to select a large volume of data ~5GB rather than ~60MB that would be expected for one day.
How can I re-write the view so that it only selects the relevant date partitions as defined by the consuming query?
Make sure to include the _TABLE_SUFFIX in the view definition so that you can reference it in queries over the view. Here's an example that converts the _TABLE_SUFFIX to a date:
SELECT
date,
fullVisitorId AS fv,
visitID AS v,
h.page.pagePath AS landing_page,
PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS sessions_date
FROM
`project-id.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1;
Now try a query over the view:
SELECT
COUNT(DISTINCT fullVisitorId) AS total_visitors
FROM `dataset.view_name`
WHERE sessions_date = '2017-01-08';