For each fullvisitorId, I'm trying to get all visitId values between date_1 and date_2, which of course differ for each user.
Can anyone give any pointers how I can go about doing this?
for example:
user_1: I'd like all visitId between 1st & 20th June
user_2: I'd like all visitId between 12th & 27th June
... and so on
date_1 and date_2 correspond to important actions (event hits) the user took on the site: Download Trial and Purchase.
Thanks in advance for any leads in this.
One possible way of solving this is using analytical functions. As an example:
#standardSQL
WITH data AS(
select '1' as user, '1' as visitid, '20170520' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('event1' as eventCategory) as eventInfo)] hits UNION ALL
select '1' as user, '2' as visitid, '20170521' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('' as eventCategory) as eventInfo)] hits UNION ALL
select '1' as user, '3' as visitid, '20170522' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('event2' as eventCategory) as eventInfo)] hits UNION ALL
select '1' as user, '4' as visitid, '20170523' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('' as eventCategory) as eventInfo)] hits UNION ALL
select '2' as user, '1' as visitid, '20170520' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('event1' as eventCategory) as eventInfo)] hits UNION ALL
select '2' as user, '2' as visitid, '20170521' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('event2' as eventCategory) as eventInfo)] hits UNION ALL
select '2' as user, '3' as visitid, '20170522' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('' as eventCategory) as eventInfo)] hits union all
select '3' as user, '1' as visitid, '20170520' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('event1' as eventCategory) as eventInfo)] hits UNION ALL
select '3' as user, '2' as visitid, '20170521' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('' as eventCategory) as eventInfo)] hits UNION ALL
select '3' as user, '3' as visitid, '20170522' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('' as eventCategory) as eventInfo)] hits
)
SELECT
user,
visitid,
date
FROM(
SELECT
user,
visitid,
date,
MIN(CASE WHEN hits.eventInfo.eventCategory = 'event1' THEN date END) OVER(PARTITION BY user) min_date,
MAX(CASE WHEN hits.eventInfo.eventCategory = 'event2' THEN date END) OVER(PARTITION BY user) max_date
FROM data,
UNNEST(hits) hits
)
WHERE date BETWEEN min_date AND max_date
Here, data is a simulation of your ga_sessions data (I named 'fullvisitorid' as 'user').
This assumes that a given user can have several events for date 1 and for date 2 (so the query takes the MIN and the MAX respectively), and that you are saving the event in the eventCategory field. Given that your "Download" and "Purchase" events are defined at the session level, I recommend using the customDimensions field instead of hits.eventInfo.eventCategory.
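To make the windowing logic easier to follow outside BigQuery, here is a minimal Python sketch of the same idea. It is not part of the query above: the function name visits_between_events and the flattened tuple layout are hypothetical, and each session is assumed to carry a single eventCategory, as in the simulated data.

```python
from collections import defaultdict

# Simulated sessions mirroring the `data` CTE: (user, visitid, date, eventCategory).
sessions = [
    ("1", "1", "20170520", "event1"),
    ("1", "2", "20170521", ""),
    ("1", "3", "20170522", "event2"),
    ("1", "4", "20170523", ""),
    ("2", "1", "20170520", "event1"),
    ("2", "2", "20170521", "event2"),
    ("2", "3", "20170522", ""),
    ("3", "1", "20170520", "event1"),
    ("3", "2", "20170521", ""),
    ("3", "3", "20170522", ""),
]


def visits_between_events(rows, start_event="event1", end_event="event2"):
    """Keep visits dated between a user's first start_event and last end_event."""
    starts = defaultdict(list)
    ends = defaultdict(list)
    for user, _visit, date, category in rows:
        if category == start_event:
            starts[user].append(date)
        if category == end_event:
            ends[user].append(date)

    result = []
    for user, visit, date, _category in rows:
        # A user missing either bound is dropped, like a NULL bound in BETWEEN.
        if not starts[user] or not ends[user]:
            continue
        # YYYYMMDD strings sort chronologically, so string comparison is enough.
        if min(starts[user]) <= date <= max(ends[user]):
            result.append((user, visit, date))
    return result


selected = visits_between_events(sessions)
```

On the simulated data this keeps visits 1 to 3 for user 1 and visits 1 to 2 for user 2, while user 3 (who never fired 'event2') is dropped entirely.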
Other than analytical functions, you can also work with ARRAYs and STRUCTs of the Standard SQL version:
SELECT
user,
ARRAY(SELECT AS STRUCT visitid, date FROM UNNEST(user_data) WHERE date BETWEEN min_date AND max_date) user_data
FROM(
SELECT
user,
ARRAY_AGG((SELECT AS STRUCT visitid, date)) user_data,
MIN(CASE WHEN EXISTS(SELECT 1 FROM UNNEST(hits) hits WHERE hits.eventInfo.eventCategory = 'event1') then date END) min_date,
MAX(CASE WHEN EXISTS(SELECT 1 FROM UNNEST(hits) hits WHERE hits.eventInfo.eventCategory = 'event2') THEN date END) max_date
FROM data
GROUP BY user
)
WHERE ARRAY_LENGTH(ARRAY(SELECT AS STRUCT visitid, date FROM UNNEST(user_data) WHERE date BETWEEN min_date AND max_date)) > 0
If the assumptions I made are not aligned with your data you can adapt these techniques to query for what you want. You can also use the simulated data for testing purposes (as well as adapting it to better suit your dataset).
I'm having a hard time replicating the metrics found when accessing in the Analytics console. Particularly user and session metrics when split by campaign details (extracted from 'campaign_details' event and attributed on a last interaction basis over 90 days). If I query the data without considering campaign_details my values are as in the console. I'd be interested to know if anyone has worked with this previously and managed to get the data as in the console or if it's even possible to expect parity?
with initial_prep as (
SELECT
(select max(value.string_value) from unnest(user_properties) where key='store') store,
device.operating_system as operating_system,
event_date,
user_pseudo_id,
event_timestamp,
TIMESTAMP_MICROS(event_timestamp) AS ts,
LAG(TIMESTAMP_MICROS(event_timestamp)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS prev_evt_ts,
IF(event_name = "session_start", 1, 0) AS is_session_start_event,
IF(event_name = "first_open", 1, 0) AS is_first_visit_event,
IF(event_name = "screen_view", 1, 0) AS is_screen_view,
IF(event_name = "purchase",1,0) as is_purchase,
ecommerce.purchase_revenue as value,
ecommerce.shipping_value as shipping,
ecommerce.total_item_quantity as quantity,
FROM
`[PROJECT DETAILS REDACTED].events_20*`
WHERE
parse_date('%y%m%d', _table_suffix) between DATE_sub(current_date(), interval 1 day) and DATE_sub(current_date(), interval 1 day)
and
device.operating_system = 'IOS'
), user_sources as (
select
user_pseudo_id,
TIMESTAMP_MICROS(event_timestamp) AS ts,
(select max(value.string_value) from unnest(event_params) where key='source' and event_name in( 'campaign_details')) source,
(select max(value.string_value) from unnest(event_params) where key='campaign' and event_name in( 'campaign_details')) campaign
from
`[PROJECT DETAILS REDACTED].events_20*`
WHERE
parse_date('%y%m%d', _table_suffix) between DATE_sub(current_date(), interval 90 day) and DATE_sub(current_date(), interval 1 day)
and
event_name in ('campaign_details')
)
, session_id_created as (
SELECT
*,
SUM(is_session_start_event) OVER (PARTITION BY user_pseudo_id ORDER BY ts) AS session_id
FROM initial_prep
)
, session_details as (
SELECT
si.user_pseudo_id,
store,
operating_system,
event_date,
event_timestamp,
session_id,
MAX(is_session_start_event) OVER (PARTITION BY si.user_pseudo_id, session_id) AS has_session_start_event,
is_session_start_event,
MAX(is_first_visit_event) OVER (PARTITION BY si.user_pseudo_id, session_id) AS has_first_visit_event,
is_first_visit_event,
is_screen_view,
MAX(event_timestamp) OVER (PARTITION BY si.user_pseudo_id, session_id) AS max_timestamp,
MIN(event_timestamp) OVER (PARTITION BY si.user_pseudo_id, session_id) AS min_timestamp,
is_purchase,
value,
shipping,
quantity,
us.source,
us.campaign,
row_number() over (partition by si.user_pseudo_id, event_timestamp order by us.ts desc) rank,
us.ts time_campaign
from session_id_created si
left join user_sources us on us.user_pseudo_id = si.user_pseudo_id and si.ts >= us.ts -->= timestamp_sub(us.ts,interval 3600000 MICROSECOND)
)
, session_fin as (
select user_pseudo_id,
store,
operating_system,
source,
campaign,
event_date,
session_id,
has_session_start_event,
has_first_visit_event,
max_timestamp,
min_timestamp,
sum(is_session_start_event) sessions_alt,
sum(is_screen_view) screen_views,
sum(value) revenue,
sum(is_purchase) transactions,
sum(shipping) shipping,
sum(quantity) item_quantity
from session_details
where rank =1
group by
user_pseudo_id,
store,
event_date,
operating_system,
session_id,
source,
campaign,
has_session_start_event,
has_first_visit_event,
max_timestamp,
min_timestamp
)
select store
, operating_system
, source
, campaign
, event_date applicabledate
, sum(sessions_alt) sessions
, sum(transactions) transactions
, sum(revenue) local_revenue
, sum(shipping) local_shipping
, sum(item_quantity) item_quantity
, avg((max_timestamp - min_timestamp)/1000000) avgsessionduration -- event_timestamp is in microseconds, so divide by 1,000,000 to get seconds
, count(distinct user_pseudo_id) users
, count(distinct case when has_first_visit_event = 1 then user_pseudo_id end) new_users
, sum(screen_views) screenviews
from session_fin
group by store, event_date, operating_system
, source
, campaign
order by users desc
/**/
I've got this query, to which I'd like to add an additional metric, "product detail views", that is, hits where hits.ecommerceaction.action_type = 2.
I understand generally how these queries work, but this one is already complicated for me, and I'm struggling to add these additional nested hits into the mix.
The query below already gives me the landing page and additional dimensions, so all I want to do now is add product detail views.
SELECT DISTINCT
a.date
,a.landingpage
,a.medium
,a.sources
,a.campaign
,a.device
,a.content
,a.country
,COUNT(DISTINCT(a.sessionId)) sessions
,SUM(a.bounces) bounces
,SUM(a.trans) trans
,SUM(a.rev)/1000000 rev
,AVG(a.avg_pages) avg_pages
,(SUM(tos)/COUNT(DISTINCT(a.sessionId)))/60 session_duration
,COUNT(DISTINCT(a.user)) users
FROM
(
SELECT DISTINCT
CONCAT(CAST(fullVisitorId AS STRING),CAST(visitStartTime AS STRING)) sessionId
,fullvisitorid user
,(SELECT sourcePropertyInfo.sourcePropertyDisplayName FROM UNNEST(hits) where hitnumber = (SELECT MIN(hitnumber) from UNNEST(hits) where type = 'PAGE')) country
,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
,date
,trafficSource.medium medium
,trafficSource.source sources
,trafficSource.campaign campaign
,trafficSource.adContent content
,device.deviceCategory device
,totals.bounces bounces
,totals.timeonsite tos
,totals.transactions trans
,totals.transactionRevenue as rev
,(SELECT COUNT(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE') avg_pages
FROM `ghd-analytics-235112.132444882.ga_sessions_*`
WHERE _TABLE_SUFFIX >= '20190417' /*date start*/
AND _TABLE_SUFFIX <= '20190417' /*date end*/
AND totals.visits = 1
) a
GROUP BY landingpage,medium,device,sources,campaign,content,date,country
ORDER BY sessions desc
Any thoughts/help much appreciated!
I've found a solution. I had tried other variations of it, but this one seems to work:
,(SELECT COUNT(eventinfo.eventaction) FROM UNNEST(hits) WHERE eventinfo.eventaction = 'productDetail') pviews
Full Query here for anyone else who would like it.
/* landing page, medium, source, campaign, adcontent, device, country, sessions, bounces, avg pages per session, time on site, transactions, revenue
add additional dimensions and metrics into the second select statement, aggregate in the top select statement, order by any new dimensions
*/
SELECT DISTINCT
a.date
,a.landingpage
,a.medium
,a.sources
,a.campaign
,a.device
,a.content
,a.country
,COUNT(DISTINCT(a.sessionId)) sessions
,SUM(a.bounces) bounces
,SUM(a.trans) trans
,SUM(a.rev)/1000000 rev
,AVG(a.avg_pages) avg_pages
,(SUM(tos)/COUNT(DISTINCT(a.sessionId)))/60 session_duration
,COUNT(DISTINCT(a.user)) users
,sum(a.pviews) pviews
FROM
(
SELECT DISTINCT
CONCAT(CAST(fullVisitorId AS STRING),CAST(visitStartTime AS STRING)) sessionId
,fullvisitorid user
,(SELECT sourcePropertyInfo.sourcePropertyDisplayName FROM UNNEST(hits) where hitnumber = (SELECT MIN(hitnumber) from UNNEST(hits) where type = 'PAGE')) country
,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
,date
,trafficSource.medium medium
,trafficSource.source sources
,trafficSource.campaign campaign
,trafficSource.adContent content
,device.deviceCategory device
,totals.bounces bounces
,totals.timeonsite tos
,totals.transactions trans
,totals.transactionRevenue as rev
,(SELECT COUNT(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE') avg_pages
,(SELECT COUNT(eventinfo.eventaction) FROM UNNEST(hits) WHERE eventinfo.eventaction = 'productDetail') pviews
FROM `ghd-analytics-XXXXXX.XXXXXXX.ga_sessions_*`
WHERE _TABLE_SUFFIX >= '20190417' /*date start*/
AND _TABLE_SUFFIX <= '20190417' /*date end*/
AND totals.visits = 1
) a
GROUP BY landingpage,medium,device,sources,campaign,content,date,country
ORDER BY sessions desc
I am a newbie with BigQuery (a couple of weeks of experience) and am trying to improve my skills. I've got a practical question about the following very interesting query, which was posted
(Recreate GA Funnel on BigQuery) by user Willian Fuks. It uses GA data in BigQuery to reproduce a funnel in an efficient way.
#standardSQL
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM `64269470.ga_sessions_20170720`
In the example, eventInfo.eventAction is used. I tried several things to get it working with a customDimension as well, but I failed. Does anyone know how I can reproduce the query, segmenting by a customDimension instead of eventInfo.eventAction?
I worked with this:
(SELECT MAX(IF(index=1,page1, NULL))FROM UNNEST(hits.customDimensions))
Working with customDimensions is a bit more challenging because this field is also an ARRAY type (repeated). Still, in practice, the main difference is that another UNNEST operation is required. Other than that, it's the same logic.
Here's some data to show you that:
#standardSQL
WITH data AS(
SELECT '1' AS fullvisitorid, 1 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(1 AS index, 'value1' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, 2 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits UNION ALL
SELECT '2' AS fullvisitorid, 1 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(1 AS index, 'value1' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits UNION ALL
SELECT '3' AS fullvisitorid, 1 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(3 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits UNION ALL
SELECT '3' AS fullvisitorid, 2 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(3 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(3 AS hitNumber, [STRUCT(3 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits UNION ALL
SELECT '4' AS fullvisitorid, 1 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(3 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(3 AS hitNumber, [STRUCT(3 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits
)
Each user (fullvisitorid) and each session (visitid) has its own hits ARRAY. Notice I've separated each hit by its hitNumber, which I find makes things a bit easier to understand.
The following query computes the total number of sessions where a customDimension with index 1 and value 'landing_page' occurred, as well as those where the index was 3 and the value was 'model_selection_page':
#standardSQL
SELECT
SUM((SELECT 1 FROM UNNEST(hits), UNNEST(customDimension) WHERE index = 1 AND value = 'landing_page' LIMIT 1)) Landing_Page,
SUM((SELECT 1 FROM UNNEST(hits), UNNEST(customDimension) WHERE EXISTS(SELECT 1 FROM UNNEST(hits), UNNEST(customDimension) WHERE index = 1 AND value = 'landing_page') AND index = 3 AND value = 'model_selection_page' LIMIT 1)) Model_Selection
FROM
data
You can play around with the simulated data to better understand what's going on here. In a nutshell, notice two UNNESTs happen, first to get the values inside the hits and the second to get the values inside customDimension.
The field Model_Selection is a bit more complex as it first has to evaluate whether the dimension 'landing_page' was fired, as you can see in this expression:
EXISTS(SELECT 1 FROM UNNEST(hits), UNNEST(customDimension) WHERE index = 1 AND value = 'landing_page')
If any hit in the session carries the 'landing_page' dimension, this expression evaluates to TRUE in the WHERE clause.
You can also bring results on the user level, like so:
#standardSQL
SELECT
COUNT(DISTINCT (SELECT fullvisitorid FROM UNNEST(hits), UNNEST(customDimension) WHERE index = 1 AND value = 'landing_page' LIMIT 1)) Landing_Page,
COUNT(DISTINCT (SELECT fullvisitorid FROM UNNEST(hits), UNNEST(customDimension) WHERE EXISTS(SELECT 1 FROM UNNEST(hits), UNNEST(customDimension) WHERE index = 1 AND value = 'landing_page') AND index = 3 AND value = 'model_selection_page' LIMIT 1)) Model_Selection
FROM
data
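If it helps to see the nesting without BigQuery, the following Python sketch mirrors the simulated data and both counts. It is only an illustration: the helper names has_dimension and funnel are hypothetical, and the two nested loops stand in for UNNEST(hits), UNNEST(customDimension).

```python
# Each session: (fullvisitorid, visitid, hits). Each hit is a list of
# (index, value) custom dimensions, mirroring the simulated `data` CTE.
LP = (1, "landing_page")
MS = (3, "model_selection_page")
V1 = (1, "value1")
V2 = (2, "value2")

sessions = [
    ("1", 1, [[LP, V2], [V1, V2]]),
    ("1", 2, [[LP, V2], [LP, V2]]),
    ("2", 1, [[(1, "model_selection_page"), V2], [V1, V2]]),
    ("3", 1, [[LP, V2], [MS, V2]]),
    ("3", 2, [[LP, V2], [MS, V2], [MS, V2]]),
    ("4", 1, [[LP, V2], [MS, V2], [MS, V2]]),
]


def has_dimension(hits, index, value):
    # Outer loop over hits, inner loop over each hit's custom dimensions.
    return any(i == index and v == value for dims in hits for i, v in dims)


def funnel(rows):
    counts = {"landing_sessions": 0, "model_sessions": 0}
    landing_users, model_users = set(), set()
    for user, _visit, hits in rows:
        landed = has_dimension(hits, 1, "landing_page")
        # The second step only counts if the same session also landed (the EXISTS check).
        selected = landed and has_dimension(hits, 3, "model_selection_page")
        if landed:
            counts["landing_sessions"] += 1
            landing_users.add(user)
        if selected:
            counts["model_sessions"] += 1
            model_users.add(user)
    counts["landing_users"] = len(landing_users)
    counts["model_users"] = len(model_users)
    return counts


result = funnel(sessions)
```

Note that user 2 has 'model_selection_page' at index 1 rather than 3 and never landed, so that session counts toward neither step.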
As you are learning BigQuery, I recommend playing around with the simulated data and testing each step observing the outputs. You can play around with the UNNEST, run some queries testing its outputs and so on to get a better and deeper understanding of how to use these techniques.
I am trying to recreate the GA funnel (custom report on Google360) using BigQuery. The funnel on GA is using the unique count of events that happen on each page. I found this query online that is working for the most part:
SELECT
COUNT( s0.firstHit) AS Landing_Page,
COUNT( s1.firstHit) AS Model_Selection
from(
SELECT
s0.fullvisitorID,
s0.firstHit,
s1.firstHit,
FROM (
# Begin Subquery #1 aka s0
SELECT
fullvisitorID,
MIN(hits.hitNumber) AS firstHit
FROM [64269470.ga_sessions_20170720]
WHERE
hits.eventInfo.eventAction in ('landing_page')
AND totals.visits = 1
GROUP BY
fullvisitorID
) s0
# End Subquery #1 aka s0
left join (
# Begin Subquery #2 aka s1
SELECT
fullvisitorID,
MIN(hits.hitNumber) AS firstHit
FROM [64269470.ga_sessions_20170720]
WHERE
hits.eventInfo.eventAction in ('model_selection_page')
AND totals.visits = 1
GROUP BY
fullvisitorID,
) s1
ON
s0.fullvisitorID = s1.fullvisitorID
)
The query works fine and the value for landing page is the same as I can get on GA, but Model_Selection is about 10% higher. This difference also increases along the funnel (I only posted 2 steps for clarity).
Any idea what I am missing here?
This query does what you need but in Standard SQL Version:
#standardSQL
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM `64269470.ga_sessions_20170720`
Just that. 4 lines, way faster and cheaper.
You can also play with simulated data, something like:
#standardSQL
WITH data AS(
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo)] AS hits
)
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM data
Notice that building this type of report in GA might be a bit more difficult, as you need to select visitors who fired the 'landing_page' event at least once and then fired the 'model_selection_page' event. Make sure this report is built correctly in GA as well (one way might be to first build a custom report with only the customers who fired 'landing_page' and then apply a second filter looking for 'model_selection_page').
[EDIT]:
You asked in your comment about bringing this counting on the session and user level. For counting each session, you can limit the results to 1 for each sub-query evaluation, like so:
SELECT
SUM((SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page') AND eventInfo.eventAction = 'model_selection_page' LIMIT 1)) Model_Selection
FROM data
For counting distinct users, the idea is the same but you'd have to apply a COUNT(DISTINCT) operation, like so:
SELECT
COUNT(DISTINCT(SELECT fullvisitorid FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page,
COUNT(DISTINCT(SELECT fullvisitorid FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page') AND eventInfo.eventAction = 'model_selection_page' LIMIT 1)) Model_Selection
FROM data
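The three variants (COUNTIF per hit, LIMIT 1 per session, COUNT(DISTINCT) per user) return different numbers on the same data. The Python sketch below reproduces them on the simulated rows; the function name funnel_counts is hypothetical and the code is just a way to check the logic, not a replacement for the queries.

```python
# Each row is one session: (fullvisitorid, [eventAction of each hit]),
# copied from the simulated `data` CTE above.
sessions = [
    ("1", ["landing_page"]),
    ("1", ["landing_page", "landing_page"]),
    ("1", ["landing_page", "model_selection_page"]),
    ("1", ["model_selection_page", "model_selection_page"]),
]


def funnel_counts(rows):
    hits = {"landing": 0, "model": 0}
    session_counts = {"landing": 0, "model": 0}
    users = {"landing": set(), "model": set()}
    for user, actions in rows:
        landings = actions.count("landing_page")
        # Model hits only count when the same session has a landing hit (the EXISTS filter).
        models = actions.count("model_selection_page") if landings else 0
        hits["landing"] += landings
        hits["model"] += models
        if landings:
            session_counts["landing"] += 1
            users["landing"].add(user)
        if models:
            session_counts["model"] += 1
            users["model"].add(user)
    user_counts = {step: len(ids) for step, ids in users.items()}
    return hits, session_counts, user_counts


hit_level, session_level, user_level = funnel_counts(sessions)
```

Here the hit-level count gives 4 landing hits, the session-level count gives 3 sessions, and the user-level count gives 1 user, so it matters which level the GA report you are comparing against uses.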
I'm using BigQuery to report on Google Analytics data. I'm trying to recreate landing page data using BigQuery.
The following query reports 18% fewer sessions than in the Google Analytics interface:
SELECT DISTINCT
fullVisitorId,
visitID,
h.page.pagePath AS LandingPage
FROM
`project-name.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND h.type = 'PAGE'
AND _TABLE_SUFFIX BETWEEN '20170331' AND '20170331'
ORDER BY fullVisitorId DESC
Where am I going wrong with my approach? Why can't I get to within a small margin of the number in the GA interface's reported figure?
There are multiple reasons:
1. The BigQuery (legacy SQL) query for the equivalent landing page report:
SELECT
LandingPage,
COUNT(sessionId) AS Sessions,
100 * SUM(totals.bounces)/COUNT(sessionId) AS BounceRate,
AVG(totals.pageviews) AS AvgPageviews,
SUM(totals.timeOnSite)/COUNT(sessionId) AS AvgTimeOnSite,
from(
SELECT
CONCAT(fullVisitorId,STRING(visitId)) AS sessionID,
totals.bounces,
totals.pageviews,
totals.timeOnSite,
hits.page.pagePath AS landingPage
FROM (
SELECT
fullVisitorId,
visitId,
hits.page.pagePath,
totals.bounces,
totals.pageviews,
totals.timeOnSite,
MIN(hits.hitNumber) WITHIN RECORD AS firstHit,
hits.hitNumber AS hitNumber
FROM (TABLE_DATE_RANGE ([XXXYYYZZZ.ga_sessions_],TIMESTAMP('2016-08-01'), TIMESTAMP ('2016-08-31')))
WHERE
hits.type = 'PAGE'
AND hits.page.pagePath != '')
WHERE
hitNumber = firstHit)
GROUP BY
LandingPage
ORDER BY
Sessions DESC,
LandingPage
2. Pre-calculated data (pre-aggregated tables):
Google uses precalculated data, known as pre-aggregated tables, to speed up the UI. Google does not specify when these are computed; it can happen at any point in time.
So if you compare the numbers in the GA UI with your BigQuery output, you may well see a discrepancy. In that case, rely on your BigQuery data.
You can achieve the same thing by simply adding the below to your select statement:
,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
I can get a 1 to 1 match with the GA UI on my end when I run something like below, which is a bit more concise than the original answer:
SELECT DISTINCT
a.landingpage
,COUNT(DISTINCT(a.sessionId)) sessions
,SUM(a.bounces) bounces
,AVG(a.avg_pages) avg_pages
,(SUM(tos)/COUNT(DISTINCT(a.sessionId)))/60 session_duration
FROM
(
SELECT DISTINCT
CONCAT(CAST(fullVisitorId AS STRING),CAST(visitStartTime AS STRING)) sessionId
,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
,totals.bounces bounces
,totals.timeonsite tos
,(SELECT COUNT(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE') avg_pages
FROM `tablename_*`
WHERE _TABLE_SUFFIX >= '20180801'
AND _TABLE_SUFFIX <= '20180808'
AND totals.visits = 1
) a
GROUP BY 1
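One plausible contributor to the undercount in the question is the filter hitNumber = 1 AND type = 'PAGE': it drops any session whose first hit is not a pageview (for example an event), whereas the subquery above takes the first PAGE hit wherever it occurs. A small Python sketch with made-up sessions shows the difference; the session ids, paths, and function names are hypothetical.

```python
# Hypothetical sessions: session id -> list of (hitNumber, type, pagePath).
sessions = {
    "s1": [(1, "PAGE", "/home"), (2, "PAGE", "/pricing")],
    "s2": [(1, "EVENT", None), (2, "PAGE", "/blog"), (3, "PAGE", "/home")],
    "s3": [(1, "EVENT", None)],
}


def landing_strict(hits):
    """Landing page only if the very first hit is a pageview (the question's filter)."""
    for number, hit_type, path in hits:
        if number == 1 and hit_type == "PAGE":
            return path
    return None


def landing_first_page(hits):
    """Landing page taken from the PAGE hit with the lowest hitNumber."""
    pages = [(number, path) for number, hit_type, path in hits if hit_type == "PAGE"]
    return min(pages)[1] if pages else None


strict = {sid: landing_strict(hits) for sid, hits in sessions.items()}
first_page = {sid: landing_first_page(hits) for sid, hits in sessions.items()}
```

Session s2 gets no landing page under the strict filter but '/blog' under the first-PAGE rule; s3 has no pageview at all and stays without a landing page in both.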
Here is another way to get the same number:
SELECT
LandingPage,
COUNT(DISTINCT(sessionID)) AS sessions
FROM(
SELECT
CONCAT(fullVisitorId,CAST(visitId AS STRING)) AS sessionID,
FIRST_VALUE(hits.page.pagePath) OVER (PARTITION BY CONCAT(fullVisitorId,CAST(visitId AS STRING)) ORDER BY hits.hitNumber ASC ) AS LandingPage
FROM
`xxxxxxxx1.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND hits.type ='PAGE'
GROUP BY fullVisitorId, visitId, sessionID,hits.page.pagePath,hits.hitNumber
)
GROUP BY LandingPage
ORDER BY sessions DESC
There is a hits.isEntrance field in the schema that can be used for this purpose.
The example below would show you yesterday's landing pages:
#standardSQL
select
date,
hits.page.pagePath as landingPage,
sum(totals.visits) as visits,
sum(totals.bounces) as bounces,
sum(totals.transactions) as transactions
from
`project.dataset.ga_sessions_*`,
unnest(hits) as hits
where
(_table_suffix
between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
and hits.isEntrance = True
and totals.visits = 1 #avoid counting midnight-split sessions
group by
1, 2
order by 3 desc
There is still one source of discrepancy, though, which comes from sessions without a landing page (if you check the landing pages report in GA, there will sometimes be a "(not set)" value).
In order to include those as well, you can do:
with
landing_pages_set as (
select
concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string)) as fullVisitId,
hits.page.pagePath as virtualPagePath
from
`project.dataset.ga_sessions_*`,
unnest(hits) as hits
where
(_table_suffix
between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
and totals.visits = 1 #avoid counting midnight-split sessions
and hits.isEntrance = TRUE
group by 1, 2
),
landing_pages_not_set as (
select
concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string)) as fullVisitId,
date,
"(not set)" as virtualPagePath,
count(distinct concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string))) as visits,
sum(totals.bounces) as bounces,
sum(totals.transactions) as transactions
from
`project.dataset.ga_sessions_*`
where
(_table_suffix
between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
and totals.visits = 1 #avoid counting midnight-split sessions
group by 1, 2, 3
),
landing_pages as (
select
l.fullVisitId as fullVisitId,
date,
coalesce(r.virtualPagePath, l.virtualPagePath) as virtualPagePath,
visits,
bounces,
transactions
from
landing_pages_not_set l left join landing_pages_set r on l.fullVisitId = r.fullVisitId
)
select virtualPagePath, sum(visits) from landing_pages group by 1 order by 2 desc
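The LEFT JOIN plus COALESCE above boils down to "start from every session, use its entrance page if one exists, otherwise fall back to (not set)". A minimal Python sketch of that fallback, with hypothetical session ids and paths:

```python
from collections import Counter

# Hypothetical inputs: every session of the day, and the subset that has
# an entrance hit (the role played by landing_pages_set in the query).
all_sessions = ["a", "b", "c", "d"]
entrance_pages = {"a": "/home", "b": "/home", "d": "/pricing"}

# dict.get with a default plays the role of COALESCE over the LEFT JOIN.
landing_counts = Counter(
    entrance_pages.get(session_id, "(not set)") for session_id in all_sessions
)
```

Session "c" has no entrance hit, so it is counted under "(not set)" instead of disappearing from the total.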