I am newbie using BiqQuery (couple of weeks experience) and trying to improve my skills. I've got a pratical question about the following very interesting query which was posted
(Recreate GA Funnel on BigQuery) by user Willian Fuks. It is about GA data in BigQuery reproducing a funnel in an efficient way.
#standardSQL
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM `64269470.ga_sessions_20170720`
In the example, eventInfo.eventAction is used. I tried several things to get it also working with customDimension, but I failed. Does anyone know how can I reproduce the query segmenting it with a customDimension instead of eventInfo.eventAction?
I worked with this:
(SELECT MAX(IF(index=1,page1, NULL))FROM UNNEST(hits.customDimensions))
Working with customDimensions is a bit more challenging because this field is also an ARRAY type (repeated). Still, in practice, the main difference is that another UNNEST operation is required. Other than that, it's the same logic.
Here's some data to show you that:
#standardSQL
WITH data AS(
SELECT '1' AS fullvisitorid, 1 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(1 AS index, 'value1' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, 2 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits UNION ALL
SELECT '2' AS fullvisitorid, 1 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(1 AS index, 'value1' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits UNION ALL
SELECT '3' AS fullvisitorid, 1 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(3 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits UNION ALL
SELECT '3' AS fullvisitorid, 2 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(3 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(3 AS hitNumber, [STRUCT(3 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits UNION ALL
SELECT '4' AS fullvisitorid, 1 AS visitid, ARRAY<STRUCT< hitNumber INT64, customDimension ARRAY<STRUCT<index INT64, value STRING> > >> [STRUCT(1 AS hitNumber, [STRUCT(1 AS index, 'landing_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(2 AS hitNumber, [STRUCT(3 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension),
STRUCT(3 AS hitNumber, [STRUCT(3 AS index, 'model_selection_page' AS value), STRUCT(2 AS index, 'value2' AS value)] AS customDimension)] AS hits
)
Each user (fullvisitorid) and each session (visitid) have its hits ARRAY. Notice I've separated each hit by its hitNumber which I find makes things a bit easier to understand.
The following query computes total sessions where a customDimension with index 1 and value 'landing_page' happened, as well as where index was 3 and value is 'model_selection_page':
#standardSQL
SELECT
SUM((SELECT 1 FROM UNNEST(hits), UNNEST(customDimension) WHERE index = 1 AND value = 'landing_page' LIMIT 1)) Landing_Page,
SUM((SELECT 1 FROM UNNEST(hits), UNNEST(customDimension) WHERE EXISTS(SELECT 1 FROM UNNEST(hits), UNNEST(customDimension) WHERE index = 1 AND value = 'landing_page') AND index = 3 AND value = 'model_selection_page' LIMIT 1)) Model_Selection
FROM
data
You can play around with the simulated data to better understand what's going on here. In a nutshell, notice two UNNESTs happen, first to get the values inside the hits and the second to get the values inside customDimension.
The field Model_Selection is a bit more complex as it first has to evaluate whether the dimension 'landing_page' was fired, as you can see in this expression:
EXISTS(SELECT 1 FROM UNNEST(hits), UNNEST(customDimension) WHERE index = 1 AND value = 'landing_page')
If hits had somewhere the 'landing_page' dimension, then this expression returns True in the WHERE clause.
You can also bring results on the user level, like so:
#standardSQL
SELECT
COUNT(DISTINCT (SELECT fullvisitorid FROM UNNEST(hits), UNNEST(customDimension) WHERE index = 1 AND value = 'landing_page' LIMIT 1)) Landing_Page,
COUNT(DISTINCT (SELECT fullvisitorid FROM UNNEST(hits), UNNEST(customDimension) WHERE EXISTS(SELECT 1 FROM UNNEST(hits), UNNEST(customDimension) WHERE index = 1 AND value = 'landing_page') AND index = 3 AND value = 'model_selection_page' LIMIT 1)) Model_Selection
FROM
data
As you are learning BigQuery, I recommend playing around with the simulated data and testing each step observing the outputs. You can play around with the UNNEST, run some queries testing its outputs and so on to get a better and deeper understanding of how to use these techniques.
Related
I have the following report in the demo account of Google Analytics:
https://analytics.google.com/analytics/web/?utm_source=demoaccount&utm_medium=demoaccount&utm_campaign=demoaccount#/report/bf-roi-calculator/a54516992w87479473p92320289/_u.date00=20211101&_u.date01=20211128&_r.attrSel2=preset6&_r.attrSel1=preset1&_r.attrSel3=preset7/
In this report, we can see the different models of conversion attribution, e.g. Last Interaction, Last Non-Direct Click, and Last Google Ads Click. There are also other models, like First Interaction, Position based. Here's Google's documentation about the multi-channel funnels report:
https://support.google.com/analytics/topic/1191164?hl=en&ref_topic=1631741
So far, I have managed to build the following query:
-- Sessions with source/medium, hits, and page path
WITH table_1 AS (
SELECT
fullVisitorId,
visitStartTime,
CONCAT(fullVisitorId, visitId, date) AS session,
trafficSource.medium,
trafficSource.source,
ANY_VALUE(social.hasSocialSourceReferral) AS social_source_referral,
trafficSource.campaign,
ARRAY_AGG(hitNumber ORDER BY hitNumber) AS hit_number,
ARRAY_AGG(page.pagePath ORDER BY hitNumber) AS page_path
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`, UNNEST(hits) AS hits_
WHERE _TABLE_SUFFIX BETWEEN '20170727' AND '20170801'
GROUP BY fullVisitorId, visitStartTime, session, medium, source, campaign),
-- Adding the MCF channel grouping and creating a field that indicates sessions with conversions
table_2 AS (
SELECT
fullVisitorId,
visitStartTime,
CASE
WHEN source = '(direct)' AND (medium = '(not set)' OR medium = '(none)') THEN 'Direct'
WHEN medium = 'organic' THEN 'Organic Search'
WHEN social_source_referral = 'Yes' AND REGEXP_CONTAINS(medium, r'^(social|social-network|social-media|sm|social network|social media)$') THEN 'Social'
WHEN medium = 'email' THEN 'Email'
WHEN medium = 'affiliate' THEN 'Affiliate'
WHEN medium = 'referral' THEN 'Referral'
WHEN REGEXP_CONTAINS(medium, r'^(cpc|ppc|paidsearch)$') THEN 'Paid Search'
WHEN REGEXP_CONTAINS(medium, r'^(cpv|cpa|cpp|content-text)$') THEN 'Other Advertising'
WHEN REGEXP_CONTAINS(medium, r'^(display|cpm|banner)$') THEN 'Display'
ELSE 'Other'
END AS mcf_channel_grouping,
medium,
source,
campaign,
CAST(
EXISTS(
SELECT *
FROM UNNEST(page_path) AS x
WHERE REGEXP_CONTAINS(x, r'^/ordercompleted\.html')
)
AS INT64
) AS conversion
FROM table_1
ORDER BY fullVisitorId
),
-- Filtering by sessions with conversions
table_3 AS (
SELECT *
FROM table_2
WHERE TRUE
QUALIFY COUNTIF(conversion = 1) OVER (PARTITION BY fullVisitorId) > 0
),
-- Adding the attribution models
table_4 AS (
SELECT
fullVisitorId,
DATE(TIMESTAMP_SECONDS(visitStartTime)) AS date,
visitStartTime AS date_sec,
mcf_channel_grouping,
medium,
source,
campaign,
conversion,
CASE
WHEN conversion > 0 AND visitStartTime > LAG(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
WHEN conversion > 0 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
ELSE 'null'
END AS last_touch_attribution,
CASE
WHEN conversion > 0 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
WHEN conversion = 0 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN 'null'
WHEN conversion = 0 AND LAG(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN 'null'
WHEN SUM(conversion) OVER (PARTITION BY fullVisitorId) > 0 AND visitStartTime < LAST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
AND LEAD(source) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) = 'direct' AND source != 'direct' THEN '1'
WHEN conversion > 0 AND visitStartTime = LAST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AND source != 'direct' THEN '1'
ELSE 'null'
END AS last_non_direct,
CASE
WHEN MAX(conversion) OVER (PARTITION BY fullVisitorId) = 1 AND visitStartTime = FIRST_VALUE(visitStartTime) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime) THEN '1'
ELSE 'null'
END AS first_touch_attribution,
CASE
WHEN MAX(conversion) OVER (PARTITION BY fullVisitorId) = 1 THEN '1'
ELSE 'null'
END AS any_touch_attribution,
CASE
WHEN MAX(conversion) OVER (PARTITION BY fullVisitorId) = 1 AND source = 'blog' THEN '1'
ELSE 'null'
END AS blog_only
FROM table_3
ORDER BY fullVisitorId, visitStartTime
)
SELECT *
FROM table_4
The issue I have is that the Last Non-Direct model is not calculated correctly and I don't know how to create the look-back window that allows me to set n days prior to conversion.
How could we replicate this report in BigQuery using Standard SQL? Thanks.
I am trying to filter a funnel based on users who have certain custom dimension values. Sadly, the custom dimension in question is session-scoped and not hit-based, so i cannot use hits.customDimensions in this particular query. What is the best way to do this and achieve the desired result?
Find my progress so far:
#standardSQL
SELECT
SUM((SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page' LIMIT 1)) One_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/two - Page' LIMIT 1)) Two_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/three - Page' LIMIT 1)) Three_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE page.pagePath = '/one - Page') AND page.pagePath = '/four - Page' LIMIT 1)) Four_Page
FROM `xxxxxxx.ga_sessions_*`,
UNNEST(hits) AS h,
UNNEST(customDimensions) AS cusDim
WHERE
_TABLE_SUFFIX BETWEEN '20190320' AND '20190323'
AND h.hitNumber = 1
AND cusDim.index = 6
AND cusDim.value IN ('60','70)
Segmentation with Custom Dimensions
You can filter for sessions based on conditions in custom dimensions. Simply write a sub-query counting cases of interest and set to ">0". Example for sample data:
SELECT
fullvisitorid,
visitstarttime,
customdimensions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
-- there should be at least one case with index=4 and value='EMEA' ... you can use your index and desired value
-- unnest() turns customdimensions into table format, so we can apply SQL to this array
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
limit 100
You comment the WHERE statement to see all the data.
Funnel
First you might want to get an overview of what is going on in your hits array:
SELECT
fullvisitorid,
visitstarttime,
-- get an overview over relevant hits data
-- select as struct feeds hits fields into a new array created by array()-function
ARRAY(select as struct hitnumber, page from unnest(hits) where type='PAGE') hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
limit 100
Now that you made sure the data makes sense you can create a funnel array containing the hit numbers of the relevant steps:
SELECT
fullvisitorid,
visitstarttime,
-- create array with relevant info
-- cross join hit numbers from step pages to get all combinations so that we can check later which came after the other
ARRAY(
select as struct * from
(select hitnumber as step1 from unnest(hits) where type='PAGE' and page.pagePath='/home') left join
(select hitnumber as step2 from unnest(hits) where type='PAGE' and page.pagePath like '/google+redesign/%') on true left join
(select hitnumber as step3 from unnest(hits) where type='PAGE' and page.pagePath='/basket.html') on true
) AS funnel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
limit 100
Put this into a WITH statement for more clarity and run your analysis by summarizing the corresponding cases:
WITH f AS (
SELECT
fullvisitorid,
visitstarttime,
totals.visits,
-- create array with relevant info
-- cross join hit numbers from step pages to get all combinations so that we can check later which came after the other
ARRAY(
select as struct * from
(select hitnumber as step1 from unnest(hits) where type='PAGE' and page.pagePath='/home') left join
(select hitnumber as step2 from unnest(hits) where type='PAGE' and page.pagePath like '/google+redesign/%') on true left join
(select hitnumber as step3 from unnest(hits) where type='PAGE' and page.pagePath='/basket.html') on true
) AS funnel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170505` t
WHERE
(select count(1)>0 from unnest(customdimensions) where index=4 and value='EMEA')
and totals.pageviews>3
)
SELECT
COUNT(DISTINCT fullvisitorid) as users,
SUM(visits) as allSessions,
SUM( IF(array_length(funnel)>0,visits,0) ) sessionsWithFunnelPages,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null ) ,visits,0) ) sessionsWithStep1,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null and step1<step2 ) ,visits,0) ) sessionsFunnelToStep2,
SUM( IF( (select count(1)>0 from unnest(funnel) where step1 is not null and step1<step2 and step2<step3 and step1<step3) ,visits,0) ) sessionsFunnelToStep3
FROM f
Please test before using.
Unnesting hits.customdimension and hits.product.customdimension is inflating the transaction revenue
SELECT
sum(totals.totalTransactionRevenue)/1000000 as revenue,
(SELECT MAX(IF(index=10,value,NULL)) FROM UNNEST(product.customDimensions)) AS product_CD10,
(SELECT MAX(IF(index=1,value,NULL)) FROM UNNEST(hits.customDimensions)) AS CD1
FROM
`XXXXXXXXXXXXXXX.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) as product
WHERE
_TABLE_SUFFIX BETWEEN "20180608"
AND "20180608"
group by product_CD10,CD1
Is there a way I could get a flat table in such a way that if I apply sum of revenue, its should give the correct result.
Move your UNNEST() to the top sub-queries - then the rows won't get duplicated:
SELECT row
, (SELECT MAX(letter) FROM UNNEST(row), UNNEST(qq)) max_letter
, (SELECT MAX(n) FROM UNNEST(row), UNNEST(qq), UNNEST(qb) n) max_number
FROM (
SELECT [
STRUCT(1 AS p,[STRUCT('a' AS letter, [4,5,6] AS qb)] AS qq)
, STRUCT(2,[STRUCT('b', [7,8,9])])
, STRUCT(3,[STRUCT('c', [10,11,12])])
] AS row
)
Haven't tested this tho:
SELECT
sum(totals.totalTransactionRevenue)/1000000 as revenue,
(SELECT MAX(IF(index=10,value,NULL)) FROM UNNEST(hits) AS hit, UNNEST(hit.products) product, UNNEST(product.customDimensions)) AS product_CD10,
(SELECT MAX(IF(index=1,value,NULL)) FROM UNNEST(hits) AS hit, UNNEST(hit.customDimensions)) AS CD1
FROM `XXXXXXXXXXXXXXX.ga_sessions_*`,
WHERE _TABLE_SUFFIX BETWEEN "20180608" AND "20180608"
group by product_CD10,CD1
I am trying to recreate the GA funnel (custom report on Google360) using BigQuery. The funnel on GA is using the unique count of events that happen on each page. I found this query online that is working for the most part:
SELECT
COUNT( s0.firstHit) AS Landing_Page,
COUNT( s1.firstHit) AS Model_Selection
from(
SELECT
s0.fullvisitorID,
s0.firstHit,
s1.firstHit,
FROM (
# Begin Subquery #1 aka s0
SELECT
fullvisitorID,
MIN(hits.hitNumber) AS firstHit
FROm [64269470.ga_sessions_20170720]
WHERE
hits.eventInfo.eventAction in ('landing_page')
AND totals.visits = 1
GROUP BY
fullvisitorID
) s0
# End Subquery #1 aka s0
left join (
# Begin Subquery #2 aka s1
SELECT
fullvisitorID,
MIN(hits.hitNumber) AS firstHit
FROM [64269470.ga_sessions_20170720]
WHERE
hits.eventInfo.eventAction in ('model_selection_page')
AND totals.visits = 1
GROUP BY
fullvisitorID,
) s1
ON
s0.fullvisitorID = s1.fullvisitorID
)
The query works fine and the value for landing page is the same as I can get on GA, but Model_Selection is about 10% higher. This difference also increases along the funnel (I only posted 2 steps for clarity).
Any idea what am I missing here?
This query does what you need but in Standard SQL Version:
#standardSQL
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM `64269470.ga_sessions_20170720`
Just that. 4 lines, way faster and cheaper.
You can also play with simulated data, something like:
#standardSQL
WITH data AS(
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('landing_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo)] AS hits UNION ALL
SELECT '1' AS fullvisitorid, ARRAY<STRUCT<eventInfo STRUCT<eventAction STRING > >> [STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo), STRUCT(STRUCT('model_selection_page' AS eventAction) AS eventInfo)] AS hits
)
SELECT
SUM((SELECT COUNTIF(eventInfo.eventAction = 'landing_page') FROM UNNEST(hits))) Landing_Page,
SUM((SELECT COUNTIF(eventInfo.eventAction = 'model_selection_page') FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page'))) Model_Selection
FROM data
Notice that building this type of report in GA might be a bit more difficult as you need to select visitors who had at least fired once the event 'landing_page' and then had the event 'model_selection_page' fired. Make sure you got this report built correctly as well in your GA (one way might be to first build a customized report with only customers who had 'landing_page' fired and then apply the second filter looking for 'model_selection_page').
[EDIT]:
You asked in your comment about bringing this counting on the session and user level. For counting each session, you can limit the results to 1 for each sub-query evaluation, like so:
SELECT
SUM((SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page,
SUM((SELECT 1 FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page') AND eventInfo.eventAction = 'model_selection_page' LIMIT 1)) Model_Selection
FROM data
For counting distinct users, the idea is the same but you'd have to apply a COUNT(DISTINCT) operation, like so:
SELECT
COUNT(DISTINCT(SELECT fullvisitorid FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page,
COUNT(DISTINCT(SELECT fullvisitorid FROM UNNEST(hits) WHERE EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page') AND eventInfo.eventAction = 'model_selection_page' LIMIT 1)) Model_Selection
FROM data
For each fullvisitorId, i'm trying to get all visitId between date_1 and date_2. which is of-course differ for each user.
Can anyone give any pointers how I can go about doing this?
for example:
user_1: i'd like all visitId between 1st & 20th June
user_2: i'd like all visitId between 12th & 27th June
... and so son
date_1 and date_2 correspond to important actions (Event hits) they took on the site. Download Trial & Purchase
Thanks in advance for any leads in this.
One possible way of solving this is using analytical functions. As an example:
#standardSQL
WITH data AS(
select '1' as user, '1' as visitid, '20170520' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('event1' as eventCategory) as eventInfo)] hits UNION ALL
select '1' as user, '2' as visitid, '20170521' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('' as eventCategory) as eventInfo)] hits UNION ALL
select '1' as user, '3' as visitid, '20170522' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('event2' as eventCategory) as eventInfo)] hits UNION ALL
select '1' as user, '4' as visitid, '20170523' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('' as eventCategory) as eventInfo)] hits UNION ALL
select '2' as user, '1' as visitid, '20170520' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('event1' as eventCategory) as eventInfo)] hits UNION ALL
select '2' as user, '2' as visitid, '20170521' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('event2' as eventCategory) as eventInfo)] hits UNION ALL
select '2' as user, '3' as visitid, '20170522' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('' as eventCategory) as eventInfo)] hits union all
select '3' as user, '1' as visitid, '20170520' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('event1' as eventCategory) as eventInfo)] hits UNION ALL
select '3' as user, '2' as visitid, '20170521' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('' as eventCategory) as eventInfo)] hits UNION ALL
select '3' as user, '3' as visitid, '20170522' as date, ARRAY<STRUCT<hitNumber INT64, eventInfo STRUCT<eventCategory STRING> >> [STRUCT(1 as hitNumber, STRUCT('' as eventCategory) as eventInfo)] hits
)
SELECT
user,
visitid,
date
FROM(
SELECT
user,
visitid,
date,
MIN(CASE WHEN hits.eventInfo.eventCategory = 'event1' THEN date END) OVER(PARTITION BY user) min_date,
MAX(CASE WHEN hits.eventInfo.eventCategory = 'event2' THEN date END) OVER(PARTITION BY user) max_date
FROM data,
UNNEST(hits) hits
)
WHERE date BETWEEN min_date AND max_date
Where data is a simulation of your ga_sessions data (I named 'fullvisitorid' as 'user').
This makes the assumption that a given user can have distinct events for date 1 and date 2 (so it's taking the MIN and MAX respectively) and it assumes that you are saving the event in the eventCategory field (given that your event of "Download" and "Purchase" are defined in the session level, I recommend you use the customDimensions field instead of the hits.eventInfo.eventCategory one).
Other than analytical functions, you can also work with ARRAYs and STRUCTs of the Standard SQL version:
SELECT
user,
ARRAY(SELECT AS STRUCT visitid, date FROM UNNEST(user_data) WHERE date BETWEEN min_date AND max_date) user_data
FROM(
SELECT
user,
ARRAY_AGG((SELECT AS STRUCT visitid, date)) user_data,
MIN(CASE WHEN EXISTS(SELECT 1 FROM UNNEST(hits) hits WHERE hits.eventInfo.eventCategory = 'event1') then date END) min_date,
MAX(CASE WHEN EXISTS(SELECT 1 FROM UNNEST(hits) hits WHERE hits.eventInfo.eventCategory = 'event2') THEN date END) max_date
FROM data
GROUP BY user
)
WHERE ARRAY_LENGTH(ARRAY(SELECT AS STRUCT visitid, date FROM UNNEST(user_data) WHERE date BETWEEN min_date AND max_date)) > 0
If the assumptions I made are not aligned with your data you can adapt these techniques to query for what you want. You can also use the simulated data for testing purposes (as well as adapting it to better suit your dataset).